Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

tsightler wrote:I think the important thing to remember is that it's a Microsoft test "hotfix". I'm pretty amazed that Veeam support is the one sharing it to begin with as they can't really answer any detailed questions on how it works or what it does, only pass along the same information they already have (the same that has been posted in this forum).
Perhaps this is because of complexities associated with opening a support case with Microsoft? There were a number of comments about this a few pages back. We don't mind doing this as long as this accelerates the resolution for our customers, even when the issue is outside of the Veeam code.

In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.

I will keep everyone posted on all material updates.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler » 1 person likes this post

Gostev wrote:In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
Yeah, I'm not sure I was really clear in my post. My point was that Veeam sharing such a Microsoft hotfix only highlights how closely Veeam and Microsoft are working together on the issue. It may still be difficult to get, though, because it's certainly not a final hotfix ready for widespread adoption; it's truly a test hotfix where Veeam and Microsoft are gathering data on the results.
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

Another reason that it is only going out to a few people is that the driver is currently being distributed with the full debugging headers so that they can continue to troubleshoot the issue. That is especially helpful for cases like mine, where I am still running into the problem. So the test driver is not ready for production (and wide-scale distribution), since it isn't stable on all systems.
thomas.raabo
Service Provider
Posts: 28
Liked: 11 times
Joined: Oct 31, 2016 6:27 pm
Full Name: Thomas Raabo
Location: infrastructure guy
Contact:

Re: REFS 4k horror story

Post by thomas.raabo » 1 person likes this post

Gostev wrote:
In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.

I will keep everyone posted on all material updates.
100% agree.

We have seen our system go from daily problems to only 1 small issue after the update. This is 100% sure progress.
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

I got the ReFS fix from Veeam support set up on my backup copy host. I started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started going at the same time it immediately locked out via CPU again. Now I'm back to crashing via CPU shortly after boot.

I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Will https://support.microsoft.com/en-us/help/4022715 help with REFS?
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started going at the same time it immediately locked out via CPU again. Now I'm back to crashing via CPU shortly after boot.

I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and to what values?
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

mkretzer wrote:Will https://support.microsoft.com/en-us/help/4022715 help with REFS?
"Refs.sys","10.0.14393.1198" which is what we have now.
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

kubimike wrote:
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and to what values?
Quick warning to anyone else reading: I wouldn't try to apply any of these without Veeam's approval/instruction. I'm not including the registry locations just in case.

Code:

RefsDisableCachedPins (DWORD) = 1
RefsProcessedDeleteQueueEntryCountThreshold (DWORD) = 2048 (decimal)
TimeOutValue (DWORD) = 120 (decimal)
They mentioned maybe changing RefsProcessed... to 1024 or 512 if needed. I don't know whether that makes it more aggressive or less aggressive. I also have the earlier registry keys from that MS KB applied, the ones we thought helped.
lepphce1
Enthusiast
Posts: 31
Liked: 2 times
Joined: Jun 28, 2016 4:40 pm
Contact:

Re: REFS 4k horror story

Post by lepphce1 »

I also checked versions and realized that I have an older version of ReFS that was included in the April 2017 Roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as released in May 2017 as mentioned earlier. I'm about to update my servers to the June 2017 release but I am wondering if anybody noticed any improvements after the May 2017 update.

As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful / responsive. They seemed to immediately know what was going on after I submitted my crash dump but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues. But the server goes into these "reboot fits", it seems, when there is some kind of block cloning operation (as others have said, perhaps when it is doing a massive amount of deletes).
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@Cicadymn Ah, I was told to just try the keys you listed above first. If that didn't work, reduce 2048 to 1024 etc. as you mentioned, and then try the original keys Microsoft released. Run that past your tech support guy.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Well, Microsoft finally wrote me back. Here is what they said.

Code:

Hello Michael,

I have seen this issue and went on the call with the customer. This is the reference case where we are having issues with “IO Deadlock when using Block Clone operation with Veeam Backup Software” where we have explained the customer that MS is aware about the issue where DPM is used instead of the Veeam. However we are not sure if that fix will help is resolving issues with the Veeam applications. 

Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started going at the same time it immediately locked out via CPU again. Now I'm back to crashing via CPU shortly after boot.

I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
One strange thing to note on this: my backup copy host only crashes (via CPU) when the larger backup copy jobs try to do synthetic fulls. I've got all jobs but my file server jobs running, and many of them ran synthetic fulls without issue. However, the file server jobs (ranging from 3-6TB+ VM sizes) caused the server to lock up.

It may be that the fix has trouble with larger file sizes.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Anyone see this from the DPM post?

Code:

So i disabled Device Guard by turning off Intel-VT Features in Bios. Now the Server is running fine for 4 Days.
https://social.technet.microsoft.com/Fo ... ionmanager
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

kubimike wrote:@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
The job hangs at whatever % it had reached when the server locks up at 99% CPU. Interestingly enough, after a hard reboot, if I leave it alone, the job may advance a few % until it locks up again. The jobs have never completed, however; they eventually fail, and I then disable them to try to get the server stable again. It always locks up during the "creating synthetic full" stage.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

How much RAM do you have? I had 16GB but went to 192GB just to be safe.
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

32GB on this one. Which should be overkill for what it's processing.
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

How much do you process? We see RAM usage peaks of 100+ GB when there is a merge of a 3 TB backup file. Our system only stopped crashing after we increased to 384 GB.
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

WOW! That's a lot of RAM! Our primary backup host had the RAM issue, but for some reason our backup copy host maxes out the CPU instead, with RAM sticking at around 50% usage during the ordeal.

When it locks out it's usually processing two jobs, merging backup files that are both 4.5TB.
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Here you can see how the RAM is used when there is a bigger merge going:

http://imgur.com/GtfjmH8

The problem is that once the RAM drops to this point, all of the following merges, backups and so on get extremely slow...
Hauke
Influencer
Posts: 23
Liked: 4 times
Joined: Apr 16, 2015 11:25 am
Full Name: Hauke Ihnen
Contact:

Re: REFS 4k horror story

Post by Hauke »

My issues are now solved, so I'm giving ReFS a second chance.
I temporarily moved all data to another NAS, deleted the ReFS volume, recreated it (with 64k of course), moved the data back to ReFS, and did an active full to activate fast clone again. Now the issues are gone.
Speed is back to normal: full 2 Gbit (thanks to Veeam for being able to fully load LACP EtherChannels), with no memory or CPU issues.
Let's see if it keeps working over a longer time.
The first ReFS volume was created on a stock 2016 server; this time the server was fully patched before the volume was created. Maybe that makes a difference.

The ReFS volume is hosted on a SuperMicro board with an 8-core Intel Xeon and only 24GB RAM, an Areca 1880 RAID controller, and a 16-bay box with 10x WD Red 6TB SATA in RAID6.
(Yep, I know that you will cry because of that low memory, but it went very well for the first 2 months without any issues. And come on, having more than 100GB RAM to keep a FS stable is crazy.)
thomas.raabo
Service Provider
Posts: 28
Liked: 11 times
Joined: Oct 31, 2016 6:27 pm
Full Name: Thomas Raabo
Location: infrastructure guy
Contact:

Re: REFS 4k horror story

Post by thomas.raabo » 1 person likes this post

lepphce1 wrote:I also checked versions and realized that I have an older version of ReFS that was included in the April 2017 Roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as released in May 2017 as mentioned earlier. I'm about to update my servers to the June 2017 release but I am wondering if anybody noticed any improvements after the May 2017 update.

As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful / responsive. They seemed to immediately know what was going on after I submitted my crash dump but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues. But the server goes into these "reboot fits", it seems, when there is some kind of block cloning operation (as others have said, perhaps when it is doing a massive amount of deletes).
The FIX does work in our system.

We had issues every day before applying the exp patch - and now our system is running 100% as expected.

The new ReFS file is the right way to go.

ReFS.sys - build 14939.1100
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

@thomas.raabo
Did it also fix the merge performance issues or did you not have any issues like that?
thomas.raabo
Service Provider
Posts: 28
Liked: 11 times
Joined: Oct 31, 2016 6:27 pm
Full Name: Thomas Raabo
Location: infrastructure guy
Contact:

Re: REFS 4k horror story

Post by thomas.raabo »

mkretzer wrote:@thomas.raabo
Did it also fix the merge performance issues or did you not have any issues like that?
That is also fixed.

Yes, we had that one also.
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

thomas.raabo wrote: The FIX does work in our system.

We had issues every day before applying the exp patch - and now our system is running 100% as expected.

The new ReFS file is the right way to go.

ReFS.sys - build 14939.1100
Just thought I'd mention that I just applied the June cumulative update and Refs.sys on my systems is at build 14939.1198 dated 4/27/2017. I wonder if that means the "experimental" fix has been included in the June updates? I've been holding off on moving my backup repositories to ReFS but maybe it's finally ready :)
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@nmdange I have the 4/27/2017 file on another fully patched Windows 2016 box. The version is 14393.1198. Is that a typo?

The experimental file from Microsoft has version 14393.1100.
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

1198 is higher than 1100 so that's why I'm asking. Usually higher build numbers include fixes from lower build numbers.
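
For anyone who wants to check file versions across several hosts, the comparison above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, the "10.0." prefix on the experimental file's version is an assumption (the thread only quotes it as 14393.1100), and a higher build number is a hint, not proof, that a private test fix was merged.

```python
def parse_version(text):
    """Split a dotted version such as '10.0.14393.1198' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_newer(candidate, baseline):
    """Return True when candidate is a strictly higher version than baseline."""
    return parse_version(candidate) > parse_version(baseline)


# Version strings as quoted in this thread.
june_update = "10.0.14393.1198"   # Refs.sys after the June 2017 cumulative update
test_fix = "10.0.14393.1100"      # experimental Refs.sys ("10.0." prefix is assumed)
april_rollup = "10.0.14393.953"   # Refs.sys from the April 2017 roll-up

print(is_newer(june_update, test_fix))    # True
print(is_newer(april_rollup, test_fix))   # False: 953 is lower than 1100
print(april_rollup > test_fix)            # True: plain string comparison gets it wrong
```

The versions are compared as tuples of integers because comparing the raw strings goes character by character, which would rank .953 above .1100.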
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

You mentioned you applied the June update and your file version is now 14939, which I find unusual.
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

Sorry yes that was a typo, I was just looking at the last 4 digits :)