kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

If this fix works, will it be part of a normal Windows update (a "Patch Tuesday" release)?
suprnova
Enthusiast
Posts: 38
Liked: never
Joined: Apr 08, 2016 5:15 pm
Contact:

Re: REFS 4k horror story

Post by suprnova »

suprnova wrote: I did test out the test Microsoft fix, but this did not help last night. I do not have CPU or memory problems, but my WMI monitoring has large gaps in my repository data. It's tough to say what causes it; when the instability started, there was only one merge running. Overall, I think at this point I need to turn off block cloning, move back to NTFS, or start using the block clone synthetic fulls.
Just an update on my tests. Unfortunately support has not been helpful (month-old ticket and they are still asking basic questions), so I am sharing my personal progress here in case it helps someone.

115 backed up VMs
80TB repository size across 3 repos
64k journal size
RefsEnableLargeWorkingSetTrim set to 1
I have the experimental Microsoft refs.sys driver (Veeam support is unable to report on what exactly this does)

Bad with ReFS:
Fast clone incremental merges for large .vib files: these create repository instability and 24-hour-plus merges. I haven't been able to fix this, so I had no choice but to enable synthetic fulls. Small .vib file merges are fine (although not any quicker than an NTFS merge).
Forever forward incremental: about a month into the chain, incremental performance is awful. The job sits at "Hard Disk (0.0 KB) 0.0 KB read at 0 KB/s [CBT]" for hours before transferring at normal speeds. I have noticed that while the job appears to be doing nothing, it is actually reading vib and vbk files.

I have switched to running weekly synthetic fulls with block cloning, hoping this will work around my issue without having to reseed all the backups. So far it is helping my jobs start quicker, but it is too early to tell.
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

@suprnova How much RAM do you have? Are you absolutely sure you have no RAM issue? I have been monitoring our system for quite some time now and the RAM spikes are sometimes EXTREMELY fast (100+ GB used in a minute and then gone again). In fact, before we increased RAM we also did not think we had a RAM issue, and the system just hung. Now the system continues running and I am able to see these extreme spikes...
suprnova
Enthusiast
Posts: 38
Liked: never
Joined: Apr 08, 2016 5:15 pm
Contact:

Re: REFS 4k horror story

Post by suprnova »

mkretzer wrote:@suprnova How much RAM do you have? Are you absolutely sure you have no RAM issue? I have been monitoring our system for quite some time now and the RAM spikes are sometimes EXTREMELY fast (100+ GB used in a minute and then gone again). In fact, before we increased RAM we also did not think we had a RAM issue, and the system just hung. Now the system continues running and I am able to see these extreme spikes...
16GB on each repo. We heavily monitor these repos and have never seen over 90% memory usage, even under extreme load in all scenarios (multiple merge or synthetic fulls).
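
A side note on the monitoring question above: a poller that samples memory every few minutes can miss a spike that lasts only a minute, which is one way both observations could be true. Here is a minimal sketch in Python with entirely made-up numbers (the trace, the polling intervals, and the `sampled_peak` helper are illustrative assumptions, not data from anyone's repo):

```python
def sampled_peak(usage, interval, offset=0):
    """Highest value a poller sees when it reads every `interval` seconds."""
    return max(usage[offset::interval])


# Hypothetical trace: one reading per second for an hour, in percent of RAM.
usage = [60] * 3600

# Assumed 60-second spike to 100% starting at t = 130 s.
for t in range(130, 190):
    usage[t] = 100

true_peak = max(usage)                 # what actually happened
peak_5min = sampled_peak(usage, 300)   # poller reading every 5 minutes
peak_10s = sampled_peak(usage, 10)     # poller reading every 10 seconds

print(f"true peak: {true_peak}%")
print(f"seen by 5-minute poller: {peak_5min}%")
print(f"seen by 10-second poller: {peak_10s}%")
```

If the polling interval is longer than the spike, the reported maximum says little about the true peak, so a shorter interval or a high-water-mark counter would be needed to rule a RAM issue in or out.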
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

Okay, this will be (I think) my final update on my own issue. My Microsoft case is now closed.

I sent Microsoft numerous manually-initiated-with-keyboard-hotkey-trigger memory dumps (because this issue causes resource exhaustion and "locks" rather than BSODing which would generate a dump). Microsoft took a very long time on the issue. This wasn't a test server, and I couldn't just let it linger for months in an at-risk-of-data-loss situation (I nearly lost the entire pool at several points), so when I saw that this was going to be a protracted situation I started working to get it reloaded with known-stable technology (ZFS).

Microsoft finally got the memory dumps analyzed. They told me that they have seen ReFS-related issues with Microsoft's own backup software (DPM), and that they have an experimental patch to the ReFS filesystem driver that resolved whatever issues it had. They wanted to try that experimental patch with my server, but since I reloaded it, and my only other ReFS server is my primary Veeam destination and is (currently) stable, I'm not in a position to alpha-test any experimental patches. I'm holding on to a copy of this file, just in case my primary ReFS server begins to die as well, but I'm not in a position to use it proactively as an experiment on a currently-working server.

The patched filesystem driver reads the following new registry values:
  • ReFSDisableCachedPins
  • ProcessedDeleteQueueEntryCountThreshold
  • TimeOutValue
The "DeleteQueue" one in particular sounds suspiciously on-target. I've seen this issue exhibit itself immediately following manually deleting large block-cloned VBK files. They also mentioned that they had had to have the DPM team reduce the block cloning granularity from 2GB to 100MB, but that that is something that Veeam would have to change in its code rather than something which is controlled by the ReFS driver. Which of these, if any, are applicable to this issue? That's undetermined still, but these look like some promising things to try if anyone is in a position to do so.

Okay, all of that said - I'm not going to post the driver file of course, since that would be reckless (it's not even a hotfix - just a literal patched refs.sys file, and it hasn't been confirmed as being a fix for anything but a DPM-related problem). If anyone else is in a position to test this, though, it might help you to make some headway with Microsoft on the issue quicker by referencing my case ID: 117040315547198 ... it should provide a template for MS frontline support that the issue needs to be addressed by establishing settings to manually trigger memory dumps, and a pointer to the suspected fix.
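
To get a feel for what the granularity change mentioned above could mean, here is a back-of-the-envelope sketch. It assumes (and this is only an assumption, not something Microsoft or Veeam has confirmed in this thread) that the granularity is the maximum amount of data covered by a single block clone request, so a smaller granularity means more, smaller requests. The 1 TiB file size and the `clone_requests` helper are made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def clone_requests(file_bytes, granularity_bytes):
    """Number of requests needed if each one covers at most `granularity_bytes`."""
    # Integer ceiling division, avoids floating-point rounding.
    return -(-file_bytes // granularity_bytes)


file_size = 1024 * GIB  # hypothetical 1 TiB backup file

requests_2gb = clone_requests(file_size, 2 * GIB)
requests_100mb = clone_requests(file_size, 100 * MIB)

print(f"2 GB granularity:   {requests_2gb} requests")
print(f"100 MB granularity: {requests_100mb} requests")
```

Under that assumption the same file is cloned in roughly twenty times as many, much smaller, pieces. Whether that is what actually relieves the driver is not established here.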
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@graham8
good stuff, looks like those keys you mention are new. Never heard of them before.

For those that are running the newer refs.sys, are you also using the registry keys released in the earlier fixes?
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

*plot thickens* :?
suprnova
Enthusiast
Posts: 38
Liked: never
Joined: Apr 08, 2016 5:15 pm
Contact:

Re: REFS 4k horror story

Post by suprnova »

kubimike wrote:@graham8
good stuff, looks like those keys you mention are new. Never heard of them before.

For those that are running the newer refs.sys, are you also using the registry keys released in the earlier fixes?
These keys are, in theory, only used with the experimental driver, but Veeam support has been slow to respond on what exactly these do. I am using RefsEnableLargeWorkingSetTrim set to 1.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Starting to feel like this is Contra on the original NES. I wonder what the final key combination will be for all these registry entries lol. UP-UP-DOWN-DOWN! I can't possibly see Microsoft leaving these tunable keys in. Getting a bit out of hand.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

kubimike wrote:Starting to feel like this is Contra on the original NES. I wonder what the final key combination will be for all these registry entries lol. UP-UP-DOWN-DOWN! I can't possibly see Microsoft leaving these tunable keys in. Getting a bit out of hand.
Mike, that's a grossly inaccurate thing to say.

...Contra was fun.

Really though, I imagine MS will default these tuneables to better values in a future patch if and when it's established that one of them ends up fixing the issues.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@graham8 Yup, still play it :) Did you end up reinstalling Veeam + ReFS, or just use something else? Sounded like you were on the fence. Would be cool to hear your feedback on that one-off driver you have.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

kubimike wrote:@graham8 Yup, still play it :) Did you end up reinstalling Veeam + ReFS, or just use something else? Sounded like you were on the fence. Would be cool to hear your feedback on that one-off driver you have.
It was my offsite repo that had the most problems. I ended up reloading it with *nix, using ZFS, and setting up snapshotting (combined with scheduled file syncs of important data to it). It's not ideal in some respects, but it's a reliable solution I can count on as a fallback, and one that gives me longterm point-in-time retention with almost no space inflation beyond just the size of added data.

I'm still running Veeam against the primary backup server which is still running ReFS. That server has been more stable, though, which is why I'm not putting this patched driver in place on it.

If the whole ReFS thing ends up fixed and stable down the line in a year or so, I'll probably switch the offsite back around to Windows+ReFS+Veeam. Otherwise, I'll probably segment data a bit more and switch back to Shadowprotect, since we can't afford the DriveSpace*Infinity that Veeam on NTFS (without dedupe) seems to require for maintaining long term point-in-time retention without ReFS/block clone.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

So the server that's still running Veeam, is it running the latest round of Microsoft patches + reg mods?
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

kubimike wrote:So the server that's still running Veeam, is it running the latest round of Microsoft patches + reg mods?
It's running the latest updates, but I didn't enable any of the registry mods in the primary server (I did in the copy repo - didn't help).

I had disabled parallel processing in Veeam on it way back when, though. I did that on the copy repo as well. That helped on both, but the problem still occurred sometimes on the copy repo. On the primary repo it hasn't reoccurred since I made that change, which is why I'm just not breathing too hard around it until Microsoft fixes something.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Well, looks like the bug finally got me. In an attempt to save on disk space I reduced the restore points on one of my jobs. It had 150, and I wanted to cap it at 140. The job ran and completed, and when pruning the old restore points the server bombed. I pinged Gostev to see if I can get my hands on the test update. :cry:

Also, before rescanning the repository, the job under backup jobs did show the reduced restore points. However, after rescanning the repository that number jumped back to 150. I browsed the drive looking for the files it claimed it deleted and they are still there. I'm going to set the job back to a higher value, perhaps 155, so that fewer restore points are deleted at once. Perhaps that will help in the meantime.
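
The "delete fewer restore points at once" idea can be sketched as a step-down schedule: instead of dropping retention from 150 to 140 in one go, lower it a little after each run. This is only an illustration in Python; `retention_steps` is a hypothetical helper, the value still has to be changed by hand in the job settings, and nothing in this thread confirms that smaller deletions avoid the hang.

```python
def retention_steps(current, target, max_drop_per_run):
    """Retention values to set run by run, lowering by at most `max_drop_per_run`."""
    if max_drop_per_run < 1:
        raise ValueError("max_drop_per_run must be at least 1")
    steps = []
    value = current
    while value > target:
        # Never go below the target on the last step.
        value = max(target, value - max_drop_per_run)
        steps.append(value)
    return steps


# Numbers from the post: 150 restore points now, 140 wanted.
# The limit of 3 deletions per run is an arbitrary example.
schedule = retention_steps(150, 140, 3)
print(schedule)
```

Each entry is the retention to configure before the next run, so no single run has to prune more than three restore points.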
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Just opened a ticket #02183479
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

kubimike wrote:Starting to feel like this is Contra on the original NES. I wonder what the final key combination will be for all these registry entries lol. UP-UP-DOWN-DOWN! I can't possibly see Microsoft leaving these tunable keys in. Getting a bit out of hand.
https://en.wikipedia.org/wiki/Konami_Code

anyone tried this ?
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

While we haven't heard back from thomas.raabo in a few days, I am also on the test refs.sys and registry combo graham8 mentioned. Unfortunately for me, the issue is still reproducing; in fact I have even run into it with a job that wasn't triggering it with the stock refs.sys. Although I was expecting it to start on that job with the stock driver as well.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

kb1ibt wrote:I also am on the test refs.sys and registry combo graham8 mentioned
Just to clarify, since there's the public patch... you mean the private one that Microsoft has which is just the refs.sys file itself, which they were testing as a fix internally to ReFS-related problems they noticed with DPM?
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@kb1ibt, does the crash occur when it's block cloning or when it's removing older restore points?
Can you point out how exactly the keys in KB4016173 are used, and in what combination, with examples? Their article is a bit vague. For example, if I create 'RefsEnableLargeWorkingSetTrim' and set it to '1', can I pair that with 'RefsNumberOfChunksToTrim'? How is the value for 'RefsNumberOfChunksToTrim' calculated?
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

@graham8, yes the private one

@kubimike, my issue results from removing the older points: even though Veeam gets a response that the files are deleted, ReFS hasn't finished deleting them. So while Veeam has started the block copy, the repo and ReFS are still processing the cleanup of the old restore points in the background, and it then breaks and locks the system up.

With this morning's crash I was watching it to report back to support, since they asked if Explorer was showing the correct free space and if dir was showing the proper file tree. This morning the deletes finished at 3:32am. Even before the 100% CPU starts, the free space stops being shown and the “dir” command does not produce a result. It took from 3:33am until 3:45am to show the results of that dir and refresh. However, after that finished loading I refreshed both again and it was delayed from 3:46 until 3:53, and again from 3:53 until the lockup at 4:01am. As you can see in the first 4:01 screenshot, the System process first goes to ~15%, then a few moments later the rest of the spike kicks in and at that point I lose the RDP session.
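
For anyone lining this timeline up against their own logs, the stall lengths can be worked out from the times quoted above. A small Python sketch; the labels and the `minutes_between` helper are just for illustration, and the times are written in 24-hour form:

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times as reported in the post above.
timeline = [
    ("first dir/refresh", "03:33", "03:45"),
    ("second refresh", "03:46", "03:53"),
    ("third refresh until lockup", "03:53", "04:01"),
]

stalls = {label: minutes_between(start, end) for label, start, end in timeline}
deletes_to_lockup = minutes_between("03:32", "04:01")

for label, minutes in stalls.items():
    print(f"{label}: {minutes} min")
print(f"deletes reported finished -> lockup: {deletes_to_lockup} min")
```

So each refresh stalled for several minutes, and the box locked up roughly half an hour after Veeam saw the deletes as finished.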
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@kb1ibt YES, this is exactly what I saw yesterday and posted about earlier! Veeam says OK, the files are deleted; however, the OS is hosed and stuck. Upon next reboot the files are still there. So did you not use the registry keys in KB4016173 along with the private refs.sys driver from MSFT?
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

@kubimike, I still have the RefsEnableLargeWorkingSetTrim and RefsEnableInlineTrim keys in place, but do not have the RefsNumberOfChunksToTrim set. However the private refs.sys also has 2 more keys
  • ReFSDisableCachedPins
  • ReFSProcessedDeleteQueueEntryCountThreshold
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@kb1ibt, ok thanks. Were you advised not to set 'RefsNumberOfChunksToTrim' ? Graham8 mentioned there is another key called 'Delete Queue' veeam-backup-replication-f2/refs-4k-hor ... ml#p243688. That sounds like something that might cure our issue. Has Microsoft mentioned that key to you ?

So when Veeam is deleting older restore points how large are the jobs? I noticed this only happens with large ones. Has no issues removing smaller restore points.
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

@kubimike, The Tuesday BC GFS job that just started triggering it this week is 11 VMs totaling 1.32TB, the Thursday BC GFS job is 48 VMs @ 2.55TB, and the Friday BC GFS job is 3 VMs @ 6.71TB
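
To compare the shape of these three jobs, here is the average size per VM worked out from the figures above (Python, simple arithmetic only). Note these are job totals, not the sizes of the restore points being deleted, so the per-VM average is only a rough proxy for "large vs. small".

```python
# (number of VMs, total size in TB) as reported in the post above.
jobs = {
    "Tuesday": (11, 1.32),
    "Thursday": (48, 2.55),
    "Friday": (3, 6.71),
}

avg_tb_per_vm = {day: round(tb / vms, 3) for day, (vms, tb) in jobs.items()}

for day, avg in avg_tb_per_vm.items():
    print(f"{day}: {avg} TB per VM on average")
```

The Friday job has by far the largest VMs on average, while the Tuesday job that only started triggering the issue this week is much smaller per VM.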
bgbGsy
Influencer
Posts: 18
Liked: 3 times
Joined: Sep 10, 2009 7:40 pm
Full Name: Brendan Bougourd
Contact:

Re: REFS 4k horror story

Post by bgbGsy »

Case #02178125. I am experiencing the same symptoms as kb1ibt. Everything went very well until, after about two months, I reached the stage where older increments are scheduled for deletion. The backup finishes fine, and indicates that the old restore points have been deleted. However, at that point my 2016 ReFS repo goes to 100% CPU. I have no choice but to power off the VM at this stage. I cannot do anything inside the VM via RDP or from the console.

I am eager to hear from Veeam on this before considering going back to NTFS for the present. I was told when logging the call that the experimental driver is not appropriate for these symptoms. Any suggestions gratefully received. My VM has all updates applied and has 4 cores and 16 GB memory. It only processes one job at a time.

Last night both jobs (which follow one another) did exactly the same thing. Also, as the 'deleted' files still seem to be present in the repo folder, is there a process to clean these up?

Thanks in advance.
thomas.raabo
Service Provider
Posts: 28
Liked: 11 times
Joined: Oct 31, 2016 6:27 pm
Full Name: Thomas Raabo
Location: infrastructure guy
Contact:

Re: REFS 4k horror story

Post by thomas.raabo »

graham8 wrote:Okay, this will be (I think) my final update on my own issue. My Microsoft case is now closed.

I sent Microsoft numerous manually-initiated-with-keyboard-hotkey-trigger memory dumps (because this issue causes resource exhaustion and "locks" rather than BSODing which would generate a dump). Microsoft took a very long time on the issue. This wasn't a test server, and I couldn't just let it linger for months in an at-risk-of-data-loss situation (I nearly lost the entire pool at several points), so when I saw that this was going to be a protracted situation I started working to get it reloaded with known-stable technology (ZFS).

Microsoft finally got the memory dumps analyzed. They told me that they have seen ReFS-related issues with Microsoft's own backup software (DPM), and that they have an experimental patch to the ReFS filesystem driver that resolved whatever issues it had. They wanted to try that experimental patch with my server, but since I reloaded it, and my only other ReFS server is my primary Veeam destination and is (currently) stable, I'm not in a position to alpha-test any experimental patches. I'm holding on to a copy of this file, just in case my primary ReFS server begins to die as well, but I'm not in a position to use it proactively as an experiment on a currently-working server.

The patched filesystem driver reads the following new registry values:
  • ReFSDisableCachedPins
  • ProcessedDeleteQueueEntryCountThreshold
  • TimeOutValue
The "DeleteQueue" one in particular sounds suspiciously on-target. I've seen this issue exhibit itself immediately following manually deleting large block-cloned VBK files. They also mentioned that they had had to have the DPM team reduce the block cloning granularity from 2GB to 100MB, but that that is something that Veeam would have to change in its code rather than something which is controlled by the ReFS driver. Which of these, if any, are applicable to this issue? That's undetermined still, but these look like some promising things to try if anyone is in a position to do so.

Okay, all of that said - I'm not going to post the driver file of course, since that would be reckless (it's not even a hotfix - just a literal patched refs.sys file, and it hasn't been confirmed as being a fix for anything but a DPM-related problem). If anyone else is in a position to test this, though, it might help you to make some headway with Microsoft on the issue quicker by referencing my case ID: 117040315547198 ... it should provide a template for MS frontline support that the issue needs to be addressed by establishing settings to manually trigger memory dumps, and a pointer to the suspected fix.
This is the fix I'm running with, and our system is much more stable.

If anybody else wants to try the fix, I think you should reach out to Veeam support.
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Veeam support seems to give out the update only if we have the specific crash issue. They also want new tests with diskspd... Case 02179620.

I do not know why I need new tests... We have been having various ReFS issues for 5 months now and we know ReFS is the issue here.

Is the new driver not "production-ready"? If so, why is it so difficult to get it?
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

bgbGsy wrote:Case #02178125. I am experiencing the same symptoms as kb1ibt. Everything went very well until, after about two months, I reached the stage where older increments are scheduled for deletion. The backup finishes fine, and indicates that the old restore points have been deleted. However, at that point my 2016 ReFS repo goes to 100% CPU. I have no choice but to power off the VM at this stage. I cannot do anything inside the VM via RDP or from the console.

I am eager to hear from Veeam on this before considering going back to NTFS for the present. I was told when logging the call that the experimental driver is not appropriate for these symptoms. Any suggestions gratefully received. My VM has all updates applied and has 4 cores and 16 GB memory. It only processes one job at a time.

Last night both jobs (which follow one another) did exactly the same thing. Also, as the 'deleted' files still seem to be present in the repo folder, is there a process to clean these up?

Thanks in advance.
Same here; actually I'm about to hit the threshold for restore points to get deleted. The only way I've been able to keep it from freezing is to bump up the restore points (156 to 158, for example). I'm running out of disk space and need a way to prune these old restore points!! :shock:
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

I think the important thing to remember is that it's a Microsoft test "hotfix". I'm pretty amazed that Veeam support is the one sharing it to begin with as they can't really answer any detailed questions on how it works or what it does, only pass along the same information they already have (the same that has been posted in this forum).