tsightler wrote:I think the important thing to remember is that it's a Microsoft test "hotfix". I'm pretty amazed that Veeam support is the one sharing it to begin with as they can't really answer any detailed questions on how it works or what it does, only pass along the same information they already have (the same that has been posted in this forum).
Perhaps this is because of complexities associated with opening a support case with Microsoft? There were a number of comments about this a few pages back. We don't mind doing this as long as this accelerates the resolution for our customers, even when the issue is outside of the Veeam code.
In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
I will keep everyone posted on all material updates.
Gostev wrote:In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
Yeah, I'm not sure if I was really clear in my post, but my point was that Veeam sharing such a Microsoft hotfix only highlights how closely Veeam and Microsoft are working together on the issue. It may still be difficult to get, though, because it's certainly not a final hotfix ready for widespread adoption; it's truly a test hotfix where Veeam and Microsoft are gathering data on the results.
Another reason it is only going out to a few people is that the driver is currently being distributed with full debugging headers, so that they can continue to troubleshoot the issue. That's especially helpful for cases like mine, where I am still running into the problem. So the test driver is not ready for production (and wide-scale distribution), since it isn't stable on all systems.
Gostev wrote:
In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
I will keep everyone posted on all material updates.
100% agree.
We have seen our system go from daily problems to only one small issue after the update. That is definite progress.
I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and with what values?
kubimike wrote:
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and with what values?
Quick warning to anyone else reading, I wouldn't try to apply any of these without Veeam's approval/instruction. I'm not including the registry locations just in case.
They mentioned maybe changing RefsProcessed... to 1024 or 512 if needed; I don't know if that means it's more aggressive or less aggressive. I also still have the previous registry keys applied, the ones from that MS KB that we thought helped.
I also checked versions and realized that I have an older version of ReFS, the one included in the April 2017 roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as the one released in May 2017, as mentioned earlier. I'm about to update my servers to the June 2017 release, but I am wondering if anybody noticed any improvements after the May 2017 update.
As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful/responsive. They seemed to immediately know what was going on after I submitted my crash dump, but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues, but the server goes into these "reboot fits", it seems, when there is some kind of block clone operation (as others have said, perhaps when it is doing a massive amount of deletes).
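If you're scripting a check across several repository servers, comparing driver build strings like "10.0.14393.953" vs. "10.0.14393.1198" lexically will give wrong answers ("953" sorts after "1198" as text). Here's a small, purely illustrative Python helper for comparing dotted build strings numerically; the function names are my own, not part of any Microsoft or Veeam tooling:

```python
# Hypothetical helper for comparing dotted Windows driver build strings,
# e.g. "10.0.14393.953" (April 2017) vs. "10.0.14393.1198" (May 2017).

def build_tuple(version: str) -> tuple:
    """Split a dotted build string into a tuple of ints so that
    comparison is numeric per component, not lexicographic."""
    return tuple(int(part) for part in version.split("."))

def is_older(installed: str, reference: str) -> bool:
    """True if the installed build is strictly older than the reference."""
    return build_tuple(installed) < build_tuple(reference)

print(is_older("10.0.14393.953", "10.0.14393.1198"))  # True
```

Note that a plain string comparison would report ".953" as newer than ".1198", which is exactly the mistake the tuple conversion avoids.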
@Cicadymn Ah, I was told to just try the keys you listed above first; then, if that didn't work, to reduce 2048 to 1024 (etc.) as you mentioned, and then try the original keys Microsoft released. Run that past your tech support guy.
Hello Michael,
I have seen this issue and went on the call with the customer. This is the reference case where we are having issues with "IO Deadlock when using Block Clone operation with Veeam Backup Software". We have explained to the customer that MS is aware of the issue when DPM is used instead of Veeam; however, we are not sure if that fix will help in resolving issues with the Veeam application.
Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
One strange thing to note on this: my backup copy host only crashes (via CPU) when the larger backup copy jobs try to do synthetic fulls. I've got all jobs but my file server jobs running, and many of them ran synthetic fulls without issue. However, the file server jobs (with VM sizes ranging from 3-6TB+) caused the server to lock up.
It may be that the fix has trouble with larger file sizes.
@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
kubimike wrote:@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
The job hangs at whatever % it was at when the server locks up at 99% CPU. Interestingly enough, after a hard reboot, if I leave it alone, the job may progress a few % until it locks up again. They've never completed, however, eventually failing, followed by me disabling them to try to get the server stable again. It always locks up during the "creating synthetic full" stage.
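For readers wondering why the synthetic full stage in particular hurts: conceptually, a synthetic full built with block cloning doesn't copy data at all, it assembles the new full as a table of extent references into the previous full and the increments. The data movement is nearly free, but the filesystem still has to record one clone reference per cluster, so the metadata work scales with file size. Here is a rough, purely illustrative Python model of that idea (this is not Veeam's or Microsoft's actual code, and the names are invented for the sketch):

```python
# Illustrative model of a synthetic full built via block cloning.
# A "file" is modeled as a dict mapping block index -> (source_file, source_block);
# a real filesystem (ReFS fast clone) records similar extent references
# instead of copying data.

def synthesize_full(full_blocks, increments):
    """Build a new 'full' as references into the old full plus increments.

    full_blocks: number of blocks in the previous full backup.
    increments:  list of dicts {block_index: data_tag}, oldest first.
    Returns (extent_table, metadata_ops): the reference map and the number
    of clone records the filesystem had to write.
    """
    # Start with every block referencing the old full...
    extent_table = {i: ("old_full", i) for i in range(full_blocks)}
    # ...then redirect changed blocks to the increment that last wrote them.
    for gen, inc in enumerate(increments):
        for block in inc:
            extent_table[block] = (f"increment_{gen}", block)
    # Every block in the new full is one cloned reference, not a data copy:
    metadata_ops = len(extent_table)
    return extent_table, metadata_ops

# A 4.5 TB backup file at 64 KB clusters is ~75 million clusters; even though
# no data moves, the clone bookkeeping scales with that count:
print(int((4.5 * 1024**4) // (64 * 1024)))  # 75497472 cluster references
```

The point of the sketch is the last line: two concurrent merges of 4.5 TB files mean on the order of 150 million metadata updates happening at once, which is consistent with the CPU/RAM pressure people are reporting here.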
How much do you process? We see RAM usage peaks of 100+ GB when there is a merge of a 3 TB backup file. Our system only stopped crashing after we increased to 384 GB.
WOW! That's a lot of ram! Our primary backup host had the RAM issue, but for some reason our backup copy host is having it max out our CPU with RAM sticking at around 50% usage during the ordeal.
When it locks out it's usually processing two jobs, merging backup files that are both 4.5TB.
My issues are now solved, so I'm giving ReFS a second chance.
I temporarily moved all the data to another NAS, deleted the ReFS volume, recreated it (with a 64K cluster size, of course), moved the data back to ReFS, and did an active full to activate fast clone again - and now the issues are gone.
Speed is back to normal, a full 2 Gbit (thanks to Veeam, which is able to fully load LACP EtherChannels), with no memory or CPU issues.
Let's see if it keeps working for a longer time now.
The first ReFS volume was created on a stock 2016 server; this one was created after the server was fully patched - maybe that's the difference.
The ReFS box is powered by a SuperMicro board with an 8-core Intel Xeon and only 24GB RAM, an Areca 1880 RAID controller, and a 16-bay chassis with 10x WD Red 6TB SATA in RAID 6.
(Yep, I know you will cry because of that low memory, but it went very well for the first 2 months without any issues - and come on, needing more than 100GB of RAM to keep a filesystem stable is crazy.)
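The 64K cluster size mentioned above matters directly for the metadata load discussed in this thread: block cloning tracks extents in clusters, so larger clusters mean fewer records for the same data. A back-of-envelope calculation in Python (the capacity figure is just illustrative, roughly 8 data disks' worth of the 10x 6TB RAID 6 above, not an exact claim about this box):

```python
# Back-of-envelope: why a 64K cluster size helps block cloning.
# Fewer, larger clusters mean fewer extent records for the same volume.

TB = 1024**4

def clusters(volume_bytes, cluster_bytes):
    """Number of clusters needed to address the whole volume."""
    return volume_bytes // cluster_bytes

usable = 48 * TB                       # ~8 data disks x 6TB in RAID 6 (illustrative)
at_4k  = clusters(usable, 4 * 1024)    # default-style small clusters
at_64k = clusters(usable, 64 * 1024)   # the recommended size for Veeam repos

print(at_4k // at_64k)  # 16 - 16x more metadata entries at 4K than at 64K
```

A 16x reduction in cluster count is plausibly the difference between a merge that fits in 24GB of RAM and one that doesn't, which may be part of why the recreated volume behaves so much better.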
lepphce1 wrote:I also checked versions and realized that I have an older version of ReFS, the one included in the April 2017 roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as the one released in May 2017, as mentioned earlier. I'm about to update my servers to the June 2017 release, but I am wondering if anybody noticed any improvements after the May 2017 update.
As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful/responsive. They seemed to immediately know what was going on after I submitted my crash dump, but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues, but the server goes into these "reboot fits", it seems, when there is some kind of block clone operation (as others have said, perhaps when it is doing a massive amount of deletes).
The FIX does work in our system.
We had issues every day before applying the exp patch - and now our system is running 100% as expected.
thomas.raabo wrote:
The FIX does work in our system.
We had issues every day before applying the exp patch - and now our system is running 100% as expected.
The new ReFS file is the right way to go.
ReFS.sys - build 14393.1100
Just thought I'd mention that I just applied the June cumulative update, and ReFS.sys on my systems is at build 14393.1198, dated 4/27/2017. I wonder if that means the "experimental" fix has been included in the June updates? I've been holding off on moving my backup repositories to ReFS, but maybe it's finally ready.