tsightler wrote:I think the important thing to remember is that it's a Microsoft test "hotfix". I'm pretty amazed that Veeam support is the one sharing it to begin with as they can't really answer any detailed questions on how it works or what it does, only pass along the same information they already have (the same that has been posted in this forum).
Perhaps this is because of complexities associated with opening a support case with Microsoft? There were a number of comments about this a few pages back. We don't mind doing this as long as this accelerates the resolution for our customers, even when the issue is outside of the Veeam code.
In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
I will keep everyone posted on all material updates.
Gostev wrote:In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
Yeah, I'm not sure if I was really clear in my post, but my point was that Veeam sharing such a Microsoft hotfix only highlights how closely Veeam and Microsoft are working together on the issue. It may still be difficult to get, though, because it's certainly not a final hotfix ready for widespread adoption; it's truly a test hotfix where Veeam and Microsoft are gathering data on the results.
Another reason it is only going out to a few people is that the driver is currently being distributed with full debugging headers, so that they can continue to troubleshoot the issue. That's especially helpful for cases like mine, where I am still running into the problem. So the test driver is not ready for production (and wide-scale distribution), since it isn't stable on all systems.
Gostev wrote:
In any case, we have good collaboration with the ReFS team and they are currently very engaged. The current fix seems to be solving a big part of the issue, so there's definitely some progress. More importantly, in most cases it makes the system stable enough to remove the need to migrate off of ReFS, so those customers can continue doing backups while the ReFS team is looking at the remaining issue.
I will keep everyone posted on all material updates.
100% agree.
We have seen our system go from daily problems to only one small issue after the update. That is definite progress.
I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and with what values?
kubimike wrote:
I have the file too, and I'm about to get it installed on my server. Which keys did you end up setting, and with what values?
Quick warning to anyone else reading, I wouldn't try to apply any of these without Veeam's approval/instruction. I'm not including the registry locations just in case.
They mentioned maybe changing RefsProcessed... to 1024 or 512 if needed; I don't know if that means it's more aggressive or less aggressive. I also still have the previous registry keys applied, the ones from that MS KB that we thought helped.
I also checked versions and realized that I have an older version of ReFS, the one included in the April 2017 roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as the one released in May 2017, as mentioned earlier. I'm about to update my servers to the June 2017 release, but I am wondering if anybody noticed any improvements after the May 2017 update.
As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful/responsive. They seemed to immediately know what was going on after I submitted my crash dump, but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues, but the server goes into these "reboot fits", it seems, when there is some kind of block clone operation (as others have said, perhaps when it is doing a massive amount of deletes).
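If you're scripting a check across several repository servers, comparing driver build strings like "10.0.14393.953" vs. "10.0.14393.1198" lexically will give wrong answers ("953" sorts after "1198" as text). Here's a small, purely illustrative Python helper for comparing dotted build strings numerically; the function names are my own, not part of any Microsoft or Veeam tooling:

```python
# Hypothetical helper for comparing dotted Windows driver build strings,
# e.g. "10.0.14393.953" (April 2017) vs. "10.0.14393.1198" (May 2017).

def build_tuple(version: str) -> tuple:
    """Split a dotted build string into a tuple of ints so that
    comparison is numeric per component, not lexicographic."""
    return tuple(int(part) for part in version.split("."))

def is_older(installed: str, reference: str) -> bool:
    """True if the installed build is strictly older than the reference."""
    return build_tuple(installed) < build_tuple(reference)

print(is_older("10.0.14393.953", "10.0.14393.1198"))  # True
```

Note that a plain string comparison would report ".953" as newer than ".1198", which is exactly the mistake the tuple conversion avoids.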
@Cicadymn Ah, I was told to just try the keys you listed above first; then, if that didn't work, to reduce 2048 to 1024 (etc.) as you mentioned, and then try the original keys Microsoft released. Run that past your tech support guy.
Hello Michael,
I have seen this issue and went on the call with the customer. This is the reference case where we are having issues with "IO Deadlock when using Block Clone operation with Veeam Backup Software". We have explained to the customer that MS is aware of the issue when DPM is used instead of Veeam; however, we are not sure if that fix will help in resolving issues with the Veeam application.
Cicadymn wrote:I got the ReFS fix from Veeam support set up on my backup copy host. I had started having problems with the CPU maxing out after a minute or two and contacted support. After the fix everything was looking better, but once two synthetic fulls started running at the same time, it immediately locked up via CPU again. Now I'm back to crashing via CPU shortly after boot.
I'm happy to pull any logs, Windows or Veeam to help give some more data to Microsoft. I've also updated my support case (02186811) with this information.
One strange thing to note on this: my backup copy host only crashes (via CPU) when the larger backup copy jobs try to do synthetic fulls. I've got all jobs but my file server jobs running, and many of them ran synthetic fulls without issue. However, the file server jobs (with VM sizes ranging from 3-6TB+) caused the server to lock up.
It may be that the fix has trouble with larger file sizes.
@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
kubimike wrote:@cicadymn are you able to tell if the job is finishing but simply hanging when deleting the older restore points? My issue is deleting large restore points, 4TB+
The job hangs at whatever % it was at when the server locks up at 99% CPU. Interestingly enough, after a hard reboot, if I leave it alone, the job may progress a few % until it locks up again. They've never completed, however, eventually failing, followed by me disabling them to try to get the server stable again. It always locks up during the "creating synthetic full" stage.
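For readers wondering why the synthetic full stage in particular hurts: conceptually, a synthetic full built with block cloning doesn't copy data at all, it assembles the new full as a table of extent references into the previous full and the increments. The data movement is nearly free, but the filesystem still has to record one clone reference per cluster, so the metadata work scales with file size. Here is a rough, purely illustrative Python model of that idea (this is not Veeam's or Microsoft's actual code, and the names are invented for the sketch):

```python
# Illustrative model of a synthetic full built via block cloning.
# A "file" is modeled as a dict mapping block index -> (source_file, source_block);
# a real filesystem (ReFS fast clone) records similar extent references
# instead of copying data.

def synthesize_full(full_blocks, increments):
    """Build a new 'full' as references into the old full plus increments.

    full_blocks: number of blocks in the previous full backup.
    increments:  list of dicts {block_index: data_tag}, oldest first.
    Returns (extent_table, metadata_ops): the reference map and the number
    of clone records the filesystem had to write.
    """
    # Start with every block referencing the old full...
    extent_table = {i: ("old_full", i) for i in range(full_blocks)}
    # ...then redirect changed blocks to the increment that last wrote them.
    for gen, inc in enumerate(increments):
        for block in inc:
            extent_table[block] = (f"increment_{gen}", block)
    # Every block in the new full is one cloned reference, not a data copy:
    metadata_ops = len(extent_table)
    return extent_table, metadata_ops

# A 4.5 TB backup file at 64 KB clusters is ~75 million clusters; even though
# no data moves, the clone bookkeeping scales with that count:
print(int((4.5 * 1024**4) // (64 * 1024)))  # 75497472 cluster references
```

The point of the sketch is the last line: two concurrent merges of 4.5 TB files mean on the order of 150 million metadata updates happening at once, which is consistent with the CPU/RAM pressure people are reporting here.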
How much do you process? We see RAM usage peaks of 100+ GB when there is a merge of a 3 TB backup file. Our system only stopped crashing after we increased to 384 GB.
WOW! That's a lot of ram! Our primary backup host had the RAM issue, but for some reason our backup copy host is having it max out our CPU with RAM sticking at around 50% usage during the ordeal.
When it locks out it's usually processing two jobs, merging backup files that are both 4.5TB.
My issues are now solved, so I'm giving ReFS a second chance.
I temporarily moved all the data to another NAS, deleted the ReFS volume, recreated it (with a 64K cluster size, of course), moved the data back to ReFS, and did an active full to activate fast clone again - and now the issues are gone.
Speed is back to normal, a full 2 Gbit (thanks to Veeam, which is able to fully load LACP EtherChannels), with no memory or CPU issues.
Let's see if it keeps working for a longer time now.
The first ReFS volume was created on a stock 2016 server; this one was created after the server was fully patched - maybe that's the difference.
The ReFS box is powered by a SuperMicro board with an 8-core Intel Xeon and only 24GB RAM, an Areca 1880 RAID controller, and a 16-bay chassis with 10x WD Red 6TB SATA in RAID 6.
(Yep, I know you will cry because of that low memory, but it went very well for the first 2 months without any issues - and come on, needing more than 100GB of RAM to keep a filesystem stable is crazy.)
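The 64K cluster size mentioned above matters directly for the metadata load discussed in this thread: block cloning tracks extents in clusters, so larger clusters mean fewer records for the same data. A back-of-envelope calculation in Python (the capacity figure is just illustrative, roughly 8 data disks' worth of the 10x 6TB RAID 6 above, not an exact claim about this box):

```python
# Back-of-envelope: why a 64K cluster size helps block cloning.
# Fewer, larger clusters mean fewer extent records for the same volume.

TB = 1024**4

def clusters(volume_bytes, cluster_bytes):
    """Number of clusters needed to address the whole volume."""
    return volume_bytes // cluster_bytes

usable = 48 * TB                       # ~8 data disks x 6TB in RAID 6 (illustrative)
at_4k  = clusters(usable, 4 * 1024)    # default-style small clusters
at_64k = clusters(usable, 64 * 1024)   # the recommended size for Veeam repos

print(at_4k // at_64k)  # 16 - 16x more metadata entries at 4K than at 64K
```

A 16x reduction in cluster count is plausibly the difference between a merge that fits in 24GB of RAM and one that doesn't, which may be part of why the recreated volume behaves so much better.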
lepphce1 wrote:I also checked versions and realized that I have an older version of ReFS, the one included in the April 2017 roll-up (10.0.14393.953). The ReFS version released yesterday (June 2017) is the same as the one released in May 2017, as mentioned earlier. I'm about to update my servers to the June 2017 release, but I am wondering if anybody noticed any improvements after the May 2017 update.
As an aside, I've had a ticket open with Microsoft for some time now, and frankly they've been less than helpful/responsive. They seemed to immediately know what was going on after I submitted my crash dump, but have been unwilling to provide any information up to this point. After having my main Veeam server stuck in a boot loop all weekend, I'm close to bypassing MS for now and reaching out to Veeam support for the experimental fix. FWIW, I've not had CPU or memory issues, but the server goes into these "reboot fits", it seems, when there is some kind of block clone operation (as others have said, perhaps when it is doing a massive amount of deletes).
The FIX does work in our system.
We had issues every day before applying the exp patch - and now our system is running 100% as expected.
thomas.raabo wrote:
The FIX does work in our system.
We had issues every day before applying the exp patch - and now our system is running 100% as expected.
The new ReFS file is the right way to go.
ReFS.sys - build 14393.1100
Just thought I'd mention that I just applied the June cumulative update, and ReFS.sys on my systems is at build 14393.1198, dated 4/27/2017. I wonder if that means the "experimental" fix has been included in the June updates? I've been holding off on moving my backup repositories to ReFS, but maybe it's finally ready.