REFS issues (server lockups, high CPU, high RAM)

Mar 21, 2017 5:39 am

Yes, no big problems up until now. It is a 105 TB repo with 64k clusters, but only 3 jobs with about 5 TB have been moved to it.

The only issue is that tape backup is very slow. Will perhaps contact support for that...

Post by **Mike Resseler** » Mar 21, 2017 5:46 am

@mkretzer

Yes, please do. Maybe create a new forum thread for it as well (with the case ID and follow-up, as always).

Mar 21, 2017 10:31 am

HJAdams123 wrote:So has anyone actually had success with this latest patch and the registry settings?

No issues here, patched both my proxies and using the RefsEnableLargeWorkingSetTrim registry key.
Everything is stable and backups are lightning fast.
Then again my ReFS repository is only running for a week. (40TB in use of 170TB sobr)
Fingers crossed it stays that way.

VladV · Post by **VladV** » Mar 21, 2017 11:24 am

Stupid question, but I'm having some problems with our WSUS server so I want to check something with you guys. After applying the patch, do the registry keys need to be created manually or are they available after the update?

Thanks

Mar 21, 2017 11:26 am

create manually

kubimike · Post by **kubimike** » Mar 21, 2017 4:34 pm

New HBA installed, ReFS volume found. Letting it sit idle to see if any issues crop up. BTW this issue occurred w/o the latest KBs installed. HP Hardware error 0x13 "Previous Lockup Code"

[UPDATE] well, same error again. crap. back on the phone with HP

[UPDATE 2] Solution > Attn HP guys, big time bug in Smart Array firmware 4.52

j.forsythe · Mar 24, 2017 2:48 pm

HJAdams123 wrote:So has anyone actually had success with this latest patch and the registry settings?

Hi guys.

My system has been running fine since the installation of the patch and the RefsEnableLargeWorkingSetTrim registry key.
Today I changed my setup, and all of my jobs will now write their backups to the two (local SAS and iSCSI) ReFS repositories.
One thing I noticed is the change in the RAM used by Metafile, as shown in the RamMap tool.
Before the patch it was using about 3.5 GB, and after installing the patch it went down to 890 MB.

I just hope that the jobs keep running smoothly.

Cheers,
John

jimmycartrette · Mar 27, 2017 12:40 pm

I have a 4k 2016 ReFS repo and am experiencing the issues.
I essentially turned all jobs off except the small production one until the hotfix. Applied the hotfix, set the RefsEnableLargeWorkingSetTrim to 1, rebooted, jobs were running fine, had them all on all week.
Saturday (possibly caused by an 11 TB, 3-VM job with the synthetic full), the repo was locked. Reset the repo, dead shortly after.

I came in this morning and set the RefsNumberOfChunksToTrim to 8. Still locks up. I've got the repo running and Veeam shut down for right now. Found some interesting events in the log...
An IO took more than 30000 ms to complete:

Process Id: 5152
Process name: VeeamAgent.exe
File name: 000000000000070F 00000000000003FD
File offset: 0
IO Type: Write: Paging, NonCached, Sync
IO Size: 4096 bytes
0 cluster(s) starting at cluster 0
Latency: 31884 ms

Volume Id: {b1f2e230-ca74-478a-ad8a-bca2eb274fbd}
Volume name: R:

Where are we as far as official guidance on this ReFS problem? I'm not in a position to reformat as 64k at this moment, but as we bought Ent Pro to move all of our production backup to Veeam this is starting to get very concerning.

I should mention my repo is around 16TB, I've got 32GB of RAM assigned to it...
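For anyone collecting these events to compare latencies over time, the text can be pulled apart with a few lines of Python. This is only a rough sketch: the function name `parse_slow_io_event` is made up for illustration, and the event text is pasted from the post above rather than read from the Windows event log.

```python
# Sketch: split the "An IO took more than 30000 ms" event text into fields.
# The event text is copied from the post; nothing here queries the event log.
EVENT_TEXT = """An IO took more than 30000 ms to complete:

Process Id: 5152
Process name: VeeamAgent.exe
File name: 000000000000070F 00000000000003FD
File offset: 0
IO Type: Write: Paging, NonCached, Sync
IO Size: 4096 bytes
0 cluster(s) starting at cluster 0
Latency: 31884 ms

Volume Id: {b1f2e230-ca74-478a-ad8a-bca2eb274fbd}
Volume name: R:
"""


def parse_slow_io_event(text):
    """Return a dict of 'Key: value' pairs found in the event text."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as "R:" or
        # "Write: Paging, NonCached, Sync" stay intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


fields = parse_slow_io_event(EVENT_TEXT)
latency_ms = int(fields["Latency"].split()[0])
print(fields["Process name"], fields["Volume name"], latency_ms)
```

Lines without a value after the colon (such as the headline) and lines without any colon are simply skipped.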

Mar 27, 2017 12:45 pm

Hi JC,

I suggest you open a new support case so that our engineers are aware of this and can have a look. Post the case ID and follow-up after the case here also.
Thanks
Mike

EricJ · Post by **EricJ** » Mar 27, 2017 1:00 pm

We have 3 ReFS repos - 32 TB, 14 TB, 13 TB, all formatted 64k.

Had issues frequently until the reg key fix from MS. Applied the RefsEnableLargeWorkingSetTrim key. Backups ran fine all week with no issue, including some fast cloning of daily backup copy consolidation. However, for the 2nd week in a row, had the server freeze during the weekly synthetic backup jobs on Saturday. I guess our next step is to try Option 2. I'm also going to try to monitor this week with rammap.

graham8 · Post by **graham8** » Mar 27, 2017 1:20 pm

jimmycartrette wrote:I have a 4k 2016 ReFS repo and am experiencing the issues.... Applied the hotfix, set the RefsEnableLargeWorkingSetTrim to 1 ... Saturday ... the repo was locked. Reset the repo, dead shortly after. .. An IO took more than 30000 ms to complete

I'm glad you posted this. Almost exactly the same story here - I applied that update to our 2016 backup copy repo, applied RefsEnableLargeWorkingSetTrim = 1, rebooted, and almost immediately had a raid card reset be issued (never happened before) and the array drop out. After rebooting, everything was back online, but this weekend, I again got the "IO took more than 30000 ms to complete". As I mentioned in a previous post, I've been getting these on all these 2016/ReFS/Veeam servers. Those events don't always correlate to the lockups.

This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it, and it checked back in briefly and is now inaccessible again...going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down

The KB article about this (https://support.microsoft.com/en-us/hel ... windows-10) does mention trying the other two options. Maybe I should try Option3 ("RefsEnableInlineTrim")...

The fact that the KB lists different things to "try" makes me suspect that Microsoft really has no idea what's going on here. If they did, there wouldn't be any need for end-user tuning...if the system was about to go into a memory-starvation-spiral-of-death, it would detect as much and back off. The article mentions "memory pressure that can cause poor performance", but that's a gross misrepresentation since this essentially kills a server (i.e., waiting 12 hours doesn't matter) when it occurs. Even a blue-screen would be better than this...easier to debug, certainly.

I'd be happy to send more logs to the ReFS team, but we don't want to take the risk that Microsoft will declare that they "have a 'fix'" (KB4013429) and charge for the incident...since I haven't tried all the options it listed yet.

EricJ · Post by **EricJ** » Mar 27, 2017 2:10 pm

graham8 wrote: This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it, and it checked back in briefly and is now inaccessible again...going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down

Same here. Lost it over the weekend during two large synthetic full jobs halfway into their fast clone process. I have now applied Option 2 (set to "32" - total guess since MS doesn't provide much guidance here). I am manually setting the jobs to run synthetic full on Monday (today) so I can run a backup and monitor the metafile usage with Rammap during the job.

So far I am noticing that the Metafile active memory climbs during a fast clone, but does get released after the job completes. However, my jobs encompassing 1-1.5 TB of VMs have caused the active usage to climb beyond 2.5 GB. Soon I will simulate what caused the failure this weekend - two large file servers (6.3 TB and 3.3 TB) fast cloning at the same time. I expect the metafile active usage will climb much higher - but we will see how the server handles it.

graham8 · Post by **graham8** » Mar 27, 2017 2:24 pm

Question - is everyone else getting the following event on a semi-regular basis? This exact event occurs on our 1.) Veeam Backup Copy Repo 2.) Veeam Primary Repo 3.) 2016+ReFS Hyper-V host which is being backed up .... since it's the exact same text on all these servers, and it seems to always succeed, I haven't been in a panic about it, but it would make me feel better to know other people are getting it.

Log: Microsoft-Windows-DataIntegrityScan/Admin
Source: DataIntegrityScan
EventID: 56
"Volume metadata inconsistency was detected and was repaired successfully.
Volume name: D:
Metadata reference: 0x204
Range offset: 0x0
Range length (in bytes): 0x0
Bytes repaired: 0x3000
Status: STATUS_SUCCESS"

Mar 27, 2017 2:27 pm

Here in our system it still works well after one week (64k, 110 TB Repo) but we did two things:

- Patch & Reg setting Option 1
- Increase RAM of the Repo server from 128 GB to 384 GB

One thing is very strange - the system can read from production faster than it can write to the backend. Now, after the changes, we see it reading at an extremely high speed, then dropping to 0 after 2-3 minutes, and then continuing again. I checked the backend storage of the repo and saw a steady stream of data - it looks as if all the data is going to RAM first and is then written out... which is kind of strange.

alesovodvojce · Post by **alesovodvojce** » Mar 27, 2017 2:47 pm

Our servers are still having deadlocks after the patch with Option 1+2.
The frequency of deadlocks is lower, but it is still something that has no place on any production site.
The RamMap utility mentioned here shows steady growth of the "Metafile" part, both Active and Total - gigabyte after gigabyte, until it uses all available memory. Windows kernel drivers shouldn't do this.

So far we have tried:

Code:

RefsEnableLargeWorkingSetTrim = 1 (Option 1)

That did not help, so next we had:

Code:

RefsEnableLargeWorkingSetTrim = 1 (Option 1)
RefsNumberOfChunksToTrim=8 (Option 2)

Today, after yet another deadlock, we will set:

Code:

RefsEnableLargeWorkingSetTrim = 1 (Option 1)
RefsNumberOfChunksToTrim=32 (Option 2)
RefsEnableInlineTrim=1 (Option 3)

So we will have all three options in place, the most aggressive combination. Will keep you updated. Usually it takes less than a week until it hits the rocks.
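For anyone who wants to apply the same combination on several servers, the three settings above could be rendered as .reg file text with something like the following. Treat it as a sketch under assumptions: the registry key path is not given anywhere in this thread and is taken from Microsoft's KB article, so check it against the KB before importing anything; `build_reg_text` is a made-up helper name.

```python
# Sketch only: renders the three settings listed above as .reg file text.
# ASSUMPTION: the key path below comes from Microsoft's KB article, not from
# this thread; verify it against the KB before importing anything.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem"


def build_reg_text(values):
    """Return .reg file text that sets each name to a REG_DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY_PATH}]"]
    for name, value in values.items():
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name}: {value} does not fit in a DWORD")
        # .reg files write DWORD values as 8 hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


all_three_options = {
    "RefsEnableLargeWorkingSetTrim": 1,  # Option 1
    "RefsNumberOfChunksToTrim": 32,      # Option 2 (decimal 32)
    "RefsEnableInlineTrim": 1,           # Option 3
}

reg_text = build_reg_text(all_three_options)
print(reg_text)
```

The script only prints the text; saving it to a file, importing it, and rebooting are left to the admin.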

Mar 27, 2017 2:51 pm

@alesovodvojce do you use 4KB or 64KB clusters, and how much RAM does your repository server have? Please also use RamMap to monitor the memory usage during your experiments. In case memory usage stays low but you are still having deadlocks with all options enabled, I will check with the ReFS team on the next steps.

Post by **mkretzer** » Mar 27, 2017 3:15 pm

I just checked our RAM statistics from the last week. As I said, I upgraded the RAM of the server to 384 GB. Normally we have about 330 GB free.

I was quite shocked that at full backup time free memory goes down as low as 150 GB! I checked, and the Veeam agents only take up to 5 GB. This means the ReFS driver takes up to ~180 GB of RAM! This perhaps explains why it runs well on our system but not so well with < 200 GB RAM...

Markus

graham8 · Mar 27, 2017 3:33 pm

Well, after it being repowered, hours later now, it's of course still deadlocked (was responding for a few minutes following it being repowered, then dead again due to the "crash recovery" operation). Going to go on site now, cut the power, pull all the freaking disks out, disable all backup jobs pointing to it on the network, try setting the most aggressive registry option, and then reinsert all the disks........... *sigh*

Microsoft Beta Testing Crew Member, signing off.

alesovodvojce · Post by **alesovodvojce** » Mar 27, 2017 5:40 pm

So even the most aggressive setting of the registry values does not work for us. Hence ReFS, even after the MS fix, is not reliable for us. That's the sad part.
The better part for everyone is that we now have a working crash scenario, and we can test any settings on it. If we run that VM with any of the current MSFT suggestions, ReFS gets into trouble and the machine cannot be shut down cleanly (at least I haven't waited more than 10 minutes). We can detach the ReFS filesystem, tweak the registry etc., restart the machine with ReFS attached to test whether that helped, and watch the ReFS driver in RamMap as it tries to handle the situation.
We can keep this "crashtest VM" only for a limited time, as it is a production one and we are losing backups while it is not working.

@mkretzer You said you have 384GB RAM now. How big is the ReFS storage?

@Gostev ReFS 4k. VM RAM is 8-18GB. Veeam is more than happy with about 8GB and its usage does not rise. I will give the VM 28GB, and buy more RAM if needed.
Rammap below. This is from an already dead machine. As a side effect, it also puts the clock out of sync, generating chaos in Veeam's SQL DB and elsewhere.

alesovodvojce · Post by **alesovodvojce** » Mar 27, 2017 5:49 pm

For this setting

Code:

Value Name: RefsNumberOfChunksToTrim
Value Type: REG_DWORD
DEFAULT (if not set or 0): 4
Set it to any value: 8, 16, 32, and so on

Are the values in decimal or hex? Because regedit cares... we have used decimal 1024 (400 in hex)
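To see why the base matters: regedit's DWORD edit dialog has a Base selector (Hexadecimal or Decimal), and the same typed digits end up as different stored numbers depending on which one is selected. A small Python sketch of that difference (the helper names are made up for illustration; this does not touch the registry):

```python
# Sketch: what number is stored when the same digits are typed into
# regedit's DWORD dialog with the Base set to Hexadecimal vs. Decimal.
def typed_with_hex_base(text):
    """Interpret the typed digits as hexadecimal."""
    return int(text, 16)


def typed_with_decimal_base(text):
    """Interpret the typed digits as decimal."""
    return int(text, 10)


for typed in ("8", "16", "32", "400", "1024"):
    print(
        f"typed {typed!r}: "
        f"decimal base stores {typed_with_decimal_base(typed)}, "
        f"hex base stores {typed_with_hex_base(typed)}"
    )
```

So typing 32 with the Hexadecimal base selected stores 50, while decimal 1024 and hex 400 are the same stored value.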

Post by **mkretzer** » Mar 27, 2017 6:32 pm

alesovodvojce wrote: @mkretzer You said you have 384GB RAM now. How big is the ReFS storage?

~ 105 TB. But only ~ 25 TB written up until now.

What is interesting is how fast the free RAM fluctuates:

http://imgur.com/a/Li8ws

Post by **Gostev** » Mar 27, 2017 7:07 pm

alesovodvojce wrote:@Gostev ReFS 4k. VM RAM is 8-18GB. Veeam more than happy about 8GB, not raising up. I will put there 28GB. And buy more RAM if needed.

Well, RAM is most certainly the issue in your case (and also the fact that you are using 4KB clusters does not help - 16 times more metadata for ReFS to deal with).

As per our system requirements, 8GB is a minimum for 64-bit backup repository (this amount only allows for a single concurrent job for the backup repository). So of course those additional GBs of RAM that ReFS needs are causing the issue for you (especially if you have more than one job pointing to this repository).
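The "16 times" figure follows directly from the cluster counts: the same volume holds 16 times as many 4KB clusters as 64KB clusters. A quick sketch of that arithmetic, assuming (as a simplification) that the amount of per-cluster metadata ReFS tracks scales with the number of clusters; the 16 TiB volume size is just an example value:

```python
# Sketch: compare how many clusters a volume has at 4 KB vs. 64 KB.
# ASSUMPTION: metadata volume scales roughly with the cluster count.
KIB = 1024
TIB = 1024 ** 4


def cluster_count(volume_bytes, cluster_bytes):
    """Number of whole clusters that fit on the volume."""
    return volume_bytes // cluster_bytes


volume = 16 * TIB  # example size only
clusters_4k = cluster_count(volume, 4 * KIB)
clusters_64k = cluster_count(volume, 64 * KIB)

print(f"4 KB clusters:  {clusters_4k:,}")
print(f"64 KB clusters: {clusters_64k:,}")
print(f"ratio: {clusters_4k // clusters_64k}x")
```

The ratio stays at 16 regardless of the volume size chosen.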

EricJ · Mar 27, 2017 7:43 pm

Over the weekend, our server crashed (froze) while processing two synthetic full fast clones for our two large file servers (6.4 TB & 3.2 TB). This was with Option 1 enabled.

This morning I made two changes (not very scientific, I know) - increased RAM from 16GB to 20GB, and added the Option 2 registry key with a setting of "32".

I edited the jobs to create Synthetic Fulls on Mondays (today), and then I re-ran those two large servers in an attempt to reproduce the error from this past weekend.

In the end, the jobs completed successfully. However, I was nervous as I watched as the Metafile usage climbed during the fast clone process. The first server completed after 20 minutes, and the second finished at 40 minutes. These times are similar to previous synthetic fulls in the past.

Here is a screencap animation of RamMap during the process, captured approximately every 45-60 seconds:

Surprisingly, the active Metafile usage seemed to level off very quickly. The standby (and total) kept climbing for the most part. If the jobs were any bigger and the rate of climb remained the same, I may have run out of memory before completion. Unfortunately I can't state how much Option 2 helped (if any), versus how much moving from 16GB to 20GB RAM helped.

If it would be helpful, and/or if Veeam/MS can't reproduce this in a lab easily, I could try reducing the RAM and running another synthetic full to try and force a failure. Let me know!

alesovodvojce · Post by **alesovodvojce** » Mar 27, 2017 8:03 pm

Gostev wrote: Well, RAM is most certainly the issue in your case (and also the fact that you are using 4KB clusters does not help - 16 times more metadata for ReFS to deal with).

As per our system requirements, 8GB is a minimum for 64-bit backup repository (this amount only allows for a single concurrent job for the backup repository). So of course those additional GBs of RAM that ReFS needs are causing the issue for you (especially if you have more than one job pointing to this repository).

We will buy more RAM very soon. But is it really a RAM issue, in the true sense of the word? Veeam does not need hundreds of GB of RAM - Server 2016 itself does, and Microsoft says it needs just 2GB of RAM at minimum.

mkretzer wrote: ~ 105 TB. But only ~ 25 TB written up until now.

What is interesting how fast the RAM free fluctuates:

http://imgur.com/a/Li8ws

This is pretty scary... so it can use about 140GB of RAM for 25 TB of stored data. We have 22 TB of stored data, so we need at least another ~128GB of RAM just to begin with. The numbers are pretty uncertain, of course.
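As a back-of-the-envelope check of that estimate, here is the same extrapolation in Python. It assumes RAM use scales linearly with the amount of stored data, which nothing in this thread confirms; it is based on a single observation from one server, and `estimate_ram_gb` is a made-up helper name.

```python
# Sketch: linear extrapolation from one observed data point.
# ASSUMPTION: ReFS RAM use grows linearly with stored data. This is a guess
# based on a single observation and should not be used for real sizing.
def estimate_ram_gb(observed_ram_gb, observed_data_tb, target_data_tb):
    """Scale the observed RAM use to a different amount of stored data."""
    return observed_ram_gb / observed_data_tb * target_data_tb


# About 140 GB of RAM observed for 25 TB written; 22 TB stored on our side.
estimate = estimate_ram_gb(140, 25, 22)
print(f"rough estimate: {estimate:.1f} GB")
```

That lands in the same ballpark as the ~128GB mentioned above.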

Mar 27, 2017 8:24 pm

One more thing: since we increased the RAM, fast clone is REALLY fast for the first time.

Our production repo has 96 disks, and our biggest backup needed 13 1/2 hours to merge. Now, with our temporary ReFS "migration" repo, we have only 24 disks but the merge completed in under two hours...

RAM really seems to help a lot.

graham8 · Post by **graham8** » Mar 27, 2017 8:57 pm

In the course of trying to recover our (copy repo) server today, I noticed that after it was repowered, Veeam immediately kicked off more activity (as indicated by the "IO took more than 30 seconds" messages) on the server, which caused it to go down again. I think I had already disabled the copy jobs beforehand. At any rate, I repowered it and stopped and disabled the veeam services on the copy repo. On the next boot cycle, it ran for a while and then died again. I went back on site and found the Veeam services were running again, and Event Viewer was complaining that IO was starved to death with, among others, Veeam services being mentioned. This time I stopped the Veeam services and disabled them, and set an explicit DENY permission in the ACLs for Veeam in Program Files. I think the explicit DENY did the trick of preventing it from hammering the services back into the system. I also stopped+disabled all the Veeam services on the main Veeam server, though.

So anyway, long story short - if you're finding that the "crash recovery" data integrity scan is repeatedly nuking your server into IO deadlock after you repower it (after a synthetic blockclone forced you to), make sure all the Veeam services everywhere are stopped, copy jobs are disabled, set deny permissions to prevent it from being resurrected via RPC calls, etc.... in our case the scan has been happily running now for some hours, now that Veeam isn't trying to hammer it again with merge operations while it's trying to do its crash recovery (which runs for a long long time).

I mean, it's totally Microsoft's fault - the system should never just stop responding entirely to the point that even the clock in the system tray stops updating... but until the blockclone / refs metadata memory allocation / etc is resolved to act as a "fair citizen" and use proper IO prioritization and scheduling and metadata caching allocation like SQL Server, Windows File Caching, etc etc etc... it's apparently important that we don't ask more than one thing of it at any given time.

(and sorry for all these long posts, but hopefully if someone else finds themselves bashing their heads in desperately trying to raise a server from the "dead" after this occurs, this will help)

EricJ · Post by **EricJ** » Mar 27, 2017 9:03 pm

graham8 wrote:Question - is everyone else getting the following event on a semi-regular basis? This exact event occurs on our 1.) Veeam Backup Copy Repo 2.) Veeam Primary Repo 3.) 2016+ReFS Hyper-V host which is being backed up .... since it's the exact same text on all these servers, and it seems to always succeed, I haven't been in a panic about it, but it would make me feel better to know other people are getting it.

Log: Microsoft-Windows-DataIntegrityScan/Admin
Source: DataIntegrityScan
EventID: 56
"Volume metadata inconsistency was detected and was repaired successfully.
Volume name: D:
Metadata reference: 0x204
Range offset: 0x0
Range length (in bytes): 0x0
Bytes repaired: 0x3000
Status: STATUS_SUCCESS"

I just checked, and I am not seeing any of these errors. I know that doesn't help much, but I didn't want you to think nobody saw your question.

graham8 · Post by **graham8** » Mar 27, 2017 9:06 pm

EricJ wrote:I just checked, and I am not seeing any of these errors. I know that doesn't help much, but I didn't want you to think nobody saw your question.

Thanks. I'm thinking that maybe we're only getting these on reboots, actually... once I stop pulling my hair out over the backups and deadlock madness I'll try to do more testing.

kubimike · Post by **kubimike** » Mar 27, 2017 10:00 pm

Willing to bet that anyone using the integrated storage option and saving snapshots there, and therefore doing fewer Veeam backups (synthetic fulls etc.), doesn't have this issue. It seems to come up only for folks who store a higher number of backups and create synthetic fulls every day (higher retention points?). That being said, I guess the memory requirements need to be vastly changed.

Post by **Gostev** » Mar 27, 2017 11:23 pm

@EricJ thanks for taking time to create the RAMMap animation! To me, it shows that the fix Microsoft provided is working perfectly. Based on your and other recent posts, I am more and more inclined to think that simply bumping RAM by a few GB is the way to go for everyone still having the issue even with the fix applied.

R&D Forums

Re: REFS 4k horror story
