WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 17, 2017 12:27 pm

@Gostev, no problem, the KB actually contains most of the information we require.
Not sure if I missed it the first time I read the KB or if it has been revised in the meantime, but there is now some guidance on the registry keys:
Recommendation

If a large active working set causes poor performance, first try to set RefsEnableLargeWorkingSetTrim = 1.
If this setting doesn’t produce a satisfactory result, try different values for RefsNumberOfChunksToTrim, such as 8, 16, 32, and so on.
If this still doesn’t provide the desired effect, set RefsEnableInlineTrim = 1.
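
For convenience, the recommendation above can be captured in a .reg file. This is only a sketch based on my reading of the KB - verify the exact key path (HKLM\SYSTEM\CurrentControlSet\Control\FileSystem here is my assumption) and value types against the Microsoft KB before importing, and note a reboot is needed for the values to take effect:

```
Windows Registry Editor Version 5.00

; Assumed key path - double-check against the Microsoft KB before importing.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]

; Step 1: enable trimming of the large active working set.
"RefsEnableLargeWorkingSetTrim"=dword:00000001

; Step 2 (only if step 1 isn't enough): try 8, 16, 32, ... (8 shown).
;"RefsNumberOfChunksToTrim"=dword:00000008

; Step 3 (last resort): enable inline trim.
;"RefsEnableInlineTrim"=dword:00000001
```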
For the moment I have two 80TB ReFS proxies that are running for one week now.
They have been running flawlessly, but I'm patching them as we speak and going to proactively implement option 1: RefsEnableLargeWorkingSetTrim.
Will report back should I encounter any issues.

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 17, 2017 12:37 pm 3 people like this post

I'm also implementing "option 1: RefsEnableLargeWorkingSetTrim" on a copy destination. We'll see how it goes.

Also, it's worth mentioning that people might want to check out the Sysinternals RAMMap utility to keep an eye on ReFS metadata memory usage before and after implementing this (and its various options), since driver memory usage like the ReFS metadata memory mapping doesn't show up in conventional tools like Task Manager:
https://technet.microsoft.com/en-us/sys ... ammap.aspx

The pink "metafile" segment is what you'll want to keep an eye on - specifically its "active" portion. When memory exhaustion has occurred, it's been because the "active" portion of the metadata mapping grew to the point of taking up nearly 100% of RAM, at which point the system becomes unresponsive.

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 17, 2017 12:41 pm

Good info Graham!
I'm going to implement the regkey on only one of my two repositories so I can compare the metadata usage with/without it.
Will report back...

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 17, 2017 12:44 pm

@Gostev you mentioned this fix is also for users of 64k cluster size. I never experienced the high memory usage others have described in this post. If that's the case, is it Microsoft's recommendation to just install the update without the registry tweaks?

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 17, 2017 2:14 pm

@graham8, NICE find on the tool. Running it now, going to monitor my memory situation. I know I mentioned above that I didn't have memory issues, but perhaps, like you said, you just can't see it from Task Manager. Thanks again!

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 17, 2017 3:43 pm 1 person likes this post

I just patched one proxy and set the RefsEnableLargeWorkingSetTrim registry key.
Tried to run a testjob to simulate some metadata updates on both proxies.
Can confirm the update and key seem to work as expected.
In rammap the patched proxy is releasing memory from active to standby while the unpatched proxy releases almost nothing.
When my findings are confirmed during a full backup cycle this evening I will implement the key on both proxies.

Quick sidenote: during my patching of the host, everything installed fine.
The first reboot after install was also okay, but after I set the registry key and rebooted, the host kept hanging on "Preparing Windows updates".
After 40 minutes it finally rebooted and everything seemed fine. I could not reproduce it with a third reboot.

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 17, 2017 4:08 pm

@WimVD your repositories are configured with 64K ?

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 17, 2017 4:12 pm

Yes, they are

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 17, 2017 4:16 pm

Okay, I installed the update, set the "RefsEnableLargeWorkingSetTrim" registry key, rebooted, and after some time got an avalanche of errors that seemingly started with:

"Bus reset occurred on storport adapter (Port Number: 2)" (from Source: StorPort)

Following that, a bunch of disks dropped out, Storage Spaces unmounted itself, and then a cascade of further errors kicked up. I rebooted the server, and everything looks normal again. This may have been a wild coincidence, but....doubtful. I'll wait and see if it reoccurs, but...yeah. Be careful, folks. For sure, don't set this on your primary servers until there's more feedback.

Incidentally, I'm using a custom view in event viewer that might be helpful. It's an easy view of everything refs/storagespaces/disk/etc related, only a small fraction of which shows up by default in the normal syslogs:

Event Viewer -> Custom Views -> Right Click -> Create Custom View...
Event Level: Critical,Error,Warning
Event logs:
Microsoft-Windows-DataIntegrityScan/Admin
Microsoft-Windows-DataIntegrityScan/CrashRecovery
Microsoft-Windows-Storage-Disk/Admin
Microsoft-Windows-Storage-Disk/Operational
Microsoft-Windows-Ntfs/Operational
Microsoft-Windows-Ntfs/WHC
ReFS/Operational
Microsoft-Windows-StorageManagement/Operational
Microsoft-Windows-StorageSpaces-Driver/Diagnostic
Microsoft-Windows-StorageSpaces-Driver/Operational
Microsoft-Windows-StorageSpaces-ManagementAgent/WHC
Microsoft-Windows-StorageSpaces-SpaceManager/Diagnostic
Microsoft-Windows-Storage-ClassPnP/Admin
Microsoft-Windows-Storage-ClassPnP/Operational
Microsoft-Windows-Storage-Storport/Admin
Microsoft-Windows-Storage-Storport/Operational

Gostev
SVP, Product Management
Posts: 24813
Liked: 3573 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » Mar 19, 2017 7:31 pm

Hi Graham, these new registry parameters should not be related to this type of error. However, ReFS team kindly offered to look at your logs, just to be sure. Please use StorDiag to collect and package them, and PM me the download link. Thanks!

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 20, 2017 2:54 am

Took a quick peek at RAMMap after a full synthetic ran: "Mapped File" was consuming just about all the memory on the machine - about 10 gigs active and 40 gigs in standby. Anyone else notice that?

j.forsythe
Influencer
Posts: 15
Liked: 4 times
Joined: Jan 06, 2016 10:26 am
Full Name: John P. Forsythe
Contact:

Re: REFS 4k horror story

Post by j.forsythe » Mar 20, 2017 7:41 am

kubimike wrote:@j.forsynthe are you also running verifier ? Sad to say my box froze today, even with verifier on it failed to create a dump file. It was just frozen at the login screen (Press CTRL + ALT + DEL to login).
@kubimike No I am not running verifier. Would it help if I would run it?

So far my two "ReFS" jobs have been running fine for almost two weeks.
I will install the patch, make the registry change, and report back later on how it behaves.

Cheers...

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 20, 2017 9:57 am

kubimike wrote:took a quick peek at RamMap after a full synthetic ran, "Mapped File" was consuming just about all the memory on the machine. About 10 gigs active and 40 gigs in standby. Anyone else notice that??
Yes, noticed exactly the same but active memory is a lot lower in my case: steady around 400MB so it never worried me.

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 20, 2017 10:16 am 2 people like this post

Some further feedback comparing my patched and unpatched repository:

The metafile in rammap has been growing steadily on my unpatched host and is now around 3.2GB active.
As expected it doesn't seem to release much memory to standby.
The patched host fluctuates between 0.5GB and 1GB active and always quickly releases memory if it is rising.
So my metadata usage is low for the moment but I can see how we would run into issues in say 30 days or so on the unpatched host.

Performance does not seem to be impacted by the registry key.
Everything is stable and the fast clone is amazing: 200GB incremental merges complete in under 2 minutes :)

Our backup window has been reduced from nearly 24 hours to 7 hours.
New hardware was a big factor in this but the ReFS integration definitely helped to switch from reverse incremental to forever incremental without introducing long merges.
And together with integrity streams the ReFS integration is just too good to ignore.
After further validation in the field ReFS will definitely be my default choice for Veeam repositories :)

Pikok
Novice
Posts: 3
Liked: never
Joined: Mar 22, 2016 7:24 am
Full Name: Peter Lemmens
Contact:

Re: REFS 4k horror story

Post by Pikok » Mar 20, 2017 1:08 pm

We have recently replaced our backup server with a Server 2016. As Veeam 9.5 had just been released we decided to try out the ReFS partition and, at first, experienced much higher backup speeds due to the various new features.
We experienced speeds of 150+ Mb/s and were very satisfied. However, after a while we noticed that the backup times increased and after investigation discovered that the ReFS partition wasn't performing as well as it used to. On that same machine an NTFS partition still performs at similar speeds.

After reading how Microsoft resolved various ReFS issues with the latest update, I applied it and made the registry changes, but I haven't noticed any change in speed.
I first set the RefsEnableLargeWorkingSetTrim key, then set RefsNumberOfChunksToTrim to 8, and finally set RefsEnableInlineTrim - none of them made any difference.
As I'm unsure what value RefsNumberOfChunksToTrim should be set to, I believe I may still be able to resolve my issue by tuning that key, but I don't know how to choose the value.

The partition is formatted with a 4k allocation size and the size of the partition is 15 TB.

Gostev
SVP, Product Management
Posts: 24813
Liked: 3573 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » Mar 20, 2017 1:25 pm

@Peter the discussed fix and registry keys are not supposed to change the performance in any way. All they do is prevent ReFS from consuming all available memory with its metadata cache - the problem that eventually resulted in the server hosting ReFS volume "locking up", becoming unresponsive for extended time periods.
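
For anyone curious what "working set trimming" means in practice, here is a toy model (purely illustrative - the real ReFS driver logic is internal to Windows, and these names are only analogues of the registry values): metadata pages move from an "active" set to a reclaimable "standby" set, a chunk at a time, once the active set grows past a limit.

```python
from collections import OrderedDict

# Toy model of working-set trimming. Illustrative only; not the actual
# ReFS implementation. Pages are moved from "active" (pinned, counted
# against the working set) to "standby" (reclaimable by the OS) in
# chunks, keeping the active set bounded.
class MetadataCache:
    def __init__(self, active_limit, chunk_size):
        self.active = OrderedDict()   # page -> data, in LRU order
        self.standby = {}
        self.active_limit = active_limit
        self.chunk_size = chunk_size  # analogue of RefsNumberOfChunksToTrim

    def touch(self, page):
        # Accessing a page promotes it back to the active set.
        self.active[page] = self.standby.pop(page, object())
        self.active.move_to_end(page)
        self._trim()

    def _trim(self):
        # Analogue of RefsEnableLargeWorkingSetTrim = 1: whenever the
        # active set exceeds the limit, demote the oldest pages.
        while len(self.active) > self.active_limit:
            for _ in range(min(self.chunk_size, len(self.active))):
                page, data = self.active.popitem(last=False)  # oldest first
                self.standby[page] = data

cache = MetadataCache(active_limit=4, chunk_size=2)
for p in range(10):
    cache.touch(p)
print(len(cache.active), len(cache.standby))  # prints "4 6"
```

Without the trim (i.e. never demoting), all ten pages would stay active - which is essentially the runaway "metafile active" growth people are seeing in RAMMap on unpatched hosts.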

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 20, 2017 1:31 pm

Update on the post-patch issue I had - it's only been a few days of course, but so far, I haven't had the disastrous issue again which I posted about previously. I got logs over to Gostev to forward to the ReFS team. I'll keep everyone updated.

@Pikok re performance:

I've noticed the same issue. Backup performance has gotten terribly slow... in initial tests it was easily doing 200-300 MB/s; now it's ranging from 0.25 to 1 MB/s (yes, seriously). I haven't been worrying about it as much since I've been more concerned about general stability, but it's something else I'll have to deal with at some point.

I see a lot of figures like "Data read: 100GB ... Transferred: 3GB", with Dedupe always at 1.0x and Compression ranging from ~1.5 to ~3.5. Not sure what's up with the huge Read vs Transferred discrepancy, but even if the backup amount were 100GB, the transfer rate would still be horribly slow over the 1-2 hours it generally runs.

Anyway, tangent, sorry - there are probably other threads for performance issues. But no, you're not alone in having issues there.

EDIT: Actually, I don't see any threads jumping out at me relating to performance degradation with ReFS on the forums here... unless I'm missing something, maybe I should call support and have them review these logs to make sure I'm not misunderstanding something, and then start a new thread...

SyNtAxx
Expert
Posts: 148
Liked: 15 times
Joined: Jan 02, 2015 7:12 pm
Contact:

Re: REFS 4k horror story

Post by SyNtAxx » Mar 20, 2017 1:41 pm

I've been following the thread and there now seems to be a solution. I have a few questions.

1) Is the recommendation still to use 64k cluster sizes (my proxy server has 384gb ram)?
2) Is a ReFS repo fast enough to use as a primary tier of storage for backups, supporting Instant Restore, etc?
3) Any other pointers or best practices (pardon pun)?

Thanks,

Nick

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 20, 2017 3:02 pm

Just my 2 cents:

1) 64K seems like a logical choice to me considering a Veeam repository works with big files. There is a 10% space tradeoff to consider however.
2) Sure, haven't seen any comparison against NTFS but in my albeit brief personal experience with ReFS performance is really good.
3) Check out https://www.veeam.com/veeamlive/best-pr ... -refs.html It has some good info on the inner workings of ReFS

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 20, 2017 3:05 pm

@Graham8: What are your jobs indicating as the bottleneck?

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 20, 2017 5:23 pm 1 person likes this post

WimVD wrote:@Graham8: What are your jobs indicating as the bottleneck?
Source 99% ... though, a Crystal Disk Mark against a network share on the same remote volume shows transfer speeds of ~300ish MB/s and ~3-4 MB/s 4k Q32T1 (which has gone down in speed by roughly half since it was first deployed).

I spoke with someone in support just now. Turns out I had disabled parallel processing a while ago as a way of reducing the frequency of the server lockups caused by the ReFS metadata memory exhaustion (and it did seem to help). With that set, of course, everything runs single-threaded, so the performance I'm seeing probably makes sense - especially considering this is for incrementals, where disk access is highly random. Also, the 10:1 difference between "read" and "transferred" is likely due to the Veeam block size being tracked versus the 4k filesystem block size.
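
To put rough numbers on that last point - a back-of-the-envelope sketch, assuming Veeam's common ~1 MiB block granularity (the exact size depends on the job's storage optimization setting, so treat the figures as illustrative):

```python
# Sketch of the "read" vs "transferred" gap: Veeam reads changed data at
# its own block granularity (assumed ~1 MiB here), while the file system
# may have dirtied only a single 4 KiB cluster inside each block.
VEEAM_BLOCK = 1024 * 1024   # bytes read per changed block (assumption)
FS_CLUSTER = 4 * 1024       # bytes actually modified on disk

# Worst case: one dirty 4 KiB cluster forces a full block read.
worst_case_amplification = VEEAM_BLOCK // FS_CLUSTER
print(worst_case_amplification)  # prints 256

# A 10:1 "Data read" vs "Transferred" ratio is therefore unsurprising:
# most of each block is unchanged, and compression shrinks what remains.
```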

In short, my bad - forgot a checkbox. I'll wait a week or two on the other server where I have the new refs patch installed and enabled, and if nothing else crops up, I'll reenable parallel processing, which I think will improve things.

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 20, 2017 5:56 pm

@Pikok are you doing forever forward incrementals? That might be the cause if you're not chopping up your backup chains with synthetic fulls.

Limey_005
Service Provider
Posts: 10
Liked: 1 time
Joined: Oct 17, 2016 1:03 am
Contact:

Re: REFS 4k horror story

Post by Limey_005 » Mar 20, 2017 8:23 pm

Setup: Windows 2016 16 GB RAM, 4 Cores / Veeam 9.5 U1 - 36 TB ReFS [Running in a VM ESXi 6.5] - New installation / 4 Remote Proxies - All 10 Gb Network connectivity

I applied the same Windows ReFS patch today with the RefsEnableLargeWorkingSetTrim regkey. I was seeing crazy transfer rates of 266 MB/s - 953 MB/s (50 mins for 2.7 TB processed, 1.2 TB read, 1.4 TB transferred (0.8x)), at which point the hardware reported an issue and dropped the RAID card. This server had been working fine until I applied this patch. I rebooted and recovered the RAID array, but it seems highly coincidental that I had applied the patch only an hour or so earlier and this was the first backup afterwards. I have since removed the regkey, so we'll see what happens next...

It seems similar to graham8, maybe the flood gates opened and overwhelmed the hardware....?

Gostev
SVP, Product Management
Posts: 24813
Liked: 3573 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » Mar 20, 2017 11:11 pm

Well, this patch has 2 months worth of changes, so of course there may be unrelated bugs... however, I don't see how enabling working set trimming can have any impact on data transfer performance.

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 20, 2017 11:44 pm 1 person likes this post

Limey_005 wrote:Setup: Windows 2016 16 GB RAM, 4 Cores / Veeam 9.5 U1 - 36 TB ReFS [Running in a VM ESXi 6.5]
Can't see how a Windows patch in a virtual machine would bring down a RAID controller in the ESXi host...
But granted, it would be very coincidental, seeing that kubimike reported similar issues.

david.buchanan
Service Provider
Posts: 42
Liked: 8 times
Joined: Jun 02, 2015 12:44 am
Full Name: David
Contact:

Re: REFS 4k horror story

Post by david.buchanan » Mar 20, 2017 11:58 pm

My repo's been working fine, but after seeing this post about the issues being fixed I figured why not patch it to head off any potential issues. However, after installing the patch and applying the "RefsEnableLargeWorkingSetTrim" registry setting, I can no longer access my ReFS volume and the server crashes every 30-45 minutes!

Is there some sort of initial scan that ReFS does with this patch or registry change? I can still see reads and writes happening on my volume even though I can't access it via Explorer.

I'll be logging a ticket with MS shortly but figured I'd add info here in case I'm not the only one.

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 21, 2017 12:39 am 1 person likes this post

My Veeam box is hard down with a failed RAID controller now. I can't catch a break. 35 grand in hardware and HP doesn't have the part handy. So much for the 4-hour SLA.

david.buchanan
Service Provider
Posts: 42
Liked: 8 times
Joined: Jun 02, 2015 12:44 am
Full Name: David
Contact:

Re: REFS 4k horror story

Post by david.buchanan » Mar 21, 2017 1:05 am 1 person likes this post

Didn't get far enough to log a ticket with MS before finding that my issue appears to be caused by our AV (Webroot) interacting with something in this new patch. I'll talk to Webroot support about it, but this may help others if they run into the same thing.

Limey_005
Service Provider
Posts: 10
Liked: 1 time
Joined: Oct 17, 2016 1:03 am
Contact:

Re: REFS 4k horror story

Post by Limey_005 » Mar 21, 2017 2:42 am 1 person likes this post

I agree about not seeing how a Windows patch could bring down a RAID controller in an ESXi host, but the only change made today was the Windows patch and the regkey to enable it. I was seeing transfer rates of 266 MB/s, which I thought was great, but I can't explain the super high transfer rates - my jobs are reporting a 91-99% proxy bottleneck, and I limited them to 2 vCPU/4GB RAM... Since removing the regkey the RAID hasn't failed again. Still seeing great transfer rates - 767 MB/s read & 440 MB/s transfer...

HJAdams123
Enthusiast
Posts: 60
Liked: 15 times
Joined: Jul 16, 2012 1:54 pm
Full Name: Harold Adams
Contact:

Re: REFS 4k horror story

Post by HJAdams123 » Mar 21, 2017 5:12 am

So has anyone actually had success with this latest patch and the registry settings?
