Availability for the Always-On Enterprise
mkretzer
Expert
Posts: 400
Liked: 79 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » Mar 21, 2017 5:39 am 1 person likes this post

Yes, up until now no big problems. 105 TB 64k repo, but only 3 jobs with about 5 TB moved to it so far.

The only issue is that tape backup is very slow. I will perhaps contact support about that...

Mike Resseler
Veeam Software
Posts: 4668
Liked: 498 times
Joined: Feb 08, 2013 3:08 pm
Full Name: Mike Resseler
Location: Belgium
Contact:

Re: REFS 4k horror story

Post by Mike Resseler » Mar 21, 2017 5:46 am

@mkretzer

Yes, please do. Maybe create a new forum thread for it also (with Case ID and follow-up as always ;-))

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 21, 2017 10:31 am 1 person likes this post

HJAdams123 wrote:So has anyone actually had success with this latest patch and the registry settings?
No issues here; patched both my proxies and am using the RefsEnableLargeWorkingSetTrim registry key.
Everything is stable and backups are lightning fast.
Then again, my ReFS repository has only been running for a week (40 TB in use of a 170 TB SOBR).
Fingers crossed it stays that way.

VladV
Expert
Posts: 216
Liked: 24 times
Joined: Apr 30, 2013 7:38 am
Full Name: Vlad Valeriu Velciu
Contact:

Re: REFS 4k horror story

Post by VladV » Mar 21, 2017 11:24 am

Stupid question, but I'm having some problems with our WSUS server so I want to check something with you guys. After applying the patch, do the registry keys need to be created manually or are they available after the update?

Thanks

WimVD
Service Provider
Posts: 53
Liked: 18 times
Joined: Dec 23, 2014 4:04 pm
Contact:

Re: REFS 4k horror story

Post by WimVD » Mar 21, 2017 11:26 am 1 person likes this post

create manually
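
For reference, the value from the Microsoft KB is a DWORD under the FileSystem key (path and value name as given in the KB, so verify against the article first). One way to create the Option 1 value from an elevated PowerShell prompt:

```powershell
# Create the Option 1 value from the MS KB (DWORD under the
# FileSystem key). Reboot afterwards, as the posters above did.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'RefsEnableLargeWorkingSetTrim' -Value 1 `
    -PropertyType DWord -Force
```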

kubimike
Expert
Posts: 312
Liked: 37 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike » Mar 21, 2017 4:34 pm

New HBA installed, ReFS volume found. Letting it sit idle to see if any issues crop up. BTW, this issue occurred w/o the latest KBs installed. HP hardware error 0x13 "Previous Lockup Code".

[UPDATE] Well, same error again. Crap. Back on the phone with HP.

[UPDATE 2] Solution > Attn HP guys: big-time bug in Smart Array firmware 4.52

j.forsythe
Influencer
Posts: 14
Liked: 4 times
Joined: Jan 06, 2016 10:26 am
Full Name: John P. Forsythe
Contact:

Re: REFS 4k horror story

Post by j.forsythe » Mar 24, 2017 2:48 pm 1 person likes this post

HJAdams123 wrote:So has anyone actually had success with this latest patch and the registry settings?
Hi guys.

My system has been running fine since the installation of the patch and the RefsEnableLargeWorkingSetTrim registry key.
Today I changed my setup so that all of my jobs write their backups to the two (local SAS and iSCSI) ReFS repositories.
One thing I noticed is the change in the Metafile RAM usage shown in the RamMap tool.
Before, it was using about 3.5 GB; after installing the patch it went down to 890 MB.

I just hope that the jobs keep running smoothly. :!:
Cheers,
John

jimmycartrette
Influencer
Posts: 14
Liked: 2 times
Joined: Feb 02, 2017 2:13 pm
Full Name: JC
Contact:

Re: REFS 4k horror story

Post by jimmycartrette » Mar 27, 2017 12:40 pm

I have a 4k 2016 ReFS repo and am experiencing the issues.
I essentially turned all jobs off except the small production one until the hotfix. Applied the hotfix, set RefsEnableLargeWorkingSetTrim to 1, rebooted; jobs ran fine, so I had them all on all week.
Saturday (possibly triggered by an 11 TB, 3-VM job with a synthetic full), the repo was locked. Reset the repo; it was dead shortly after.

I came in this morning and set RefsNumberOfChunksToTrim to 8. Still locks up. I've got the repo running and Veeam shut down for right now. Found some interesting events in the log...
An IO took more than 30000 ms to complete:

Process Id: 5152
Process name: VeeamAgent.exe
File name: 000000000000070F 00000000000003FD
File offset: 0
IO Type: Write: Paging, NonCached, Sync
IO Size: 4096 bytes
0 cluster(s) starting at cluster 0
Latency: 31884 ms

Volume Id: {b1f2e230-ca74-478a-ad8a-bca2eb274fbd}
Volume name: R:


Where are we as far as official guidance on this ReFS problem? I'm not in a position to reformat as 64k at the moment, but as we bought Ent Pro to move all of our production backups to Veeam, this is starting to get very concerning.

I should mention my repo is around 16 TB, and I've got 32 GB of RAM assigned to it...

Mike Resseler
Veeam Software
Posts: 4668
Liked: 498 times
Joined: Feb 08, 2013 3:08 pm
Full Name: Mike Resseler
Location: Belgium
Contact:

Re: REFS 4k horror story

Post by Mike Resseler » Mar 27, 2017 12:45 pm

Hi JC,

I suggest you open a new support case so that our engineers are aware of this and can have a look. Post the case ID and follow-up after the case here also.
Thanks
Mike

EricJ
Influencer
Posts: 18
Liked: 4 times
Joined: Jan 12, 2017 7:06 pm
Contact:

Re: REFS 4k horror story

Post by EricJ » Mar 27, 2017 1:00 pm

We have 3 ReFS repos - 32 TB, 14 TB, 13 TB, all formatted 64k.

We had frequent issues until the reg key fix from MS. Applied the RefsEnableLargeWorkingSetTrim key. Backups ran fine all week with no issues, including some fast cloning of daily backup copy consolidation. However, for the 2nd week in a row, the server froze during the weekly synthetic backup jobs on Saturday. I guess our next step is to try Option 2. I'm also going to try to monitor this week with RamMap.

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 27, 2017 1:20 pm

jimmycartrette wrote:I have a 4k 2016 ReFS repo and am experiencing the issues.... Applied the hotfix, set the RefsEnableLargeWorkingSetTrim to 1 ... Saturday ... the repo was locked. Reset the repo, dead shortly after. .. An IO took more than 30000 ms to complete
I'm glad you posted this. Almost exactly the same story here: I applied that update to our 2016 backup copy repo, set RefsEnableLargeWorkingSetTrim = 1, rebooted, and almost immediately had a RAID card reset issued (never happened before) and the array drop out. After rebooting, everything was back online, but this weekend I again got the "IO took more than 30000 ms to complete" event. As I mentioned in a previous post, I've been getting these on all of these 2016/ReFS/Veeam servers. Those events don't always correlate with the lockups.

This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it; it checked back in briefly and is now inaccessible again. Going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down :|

The KB article about this (https://support.microsoft.com/en-us/hel ... windows-10) does mention trying the other two options. Maybe I should try Option 3 ("RefsEnableInlineTrim")...

The fact that the KB lists different things to "try" makes me suspect that Microsoft really has no idea what's going on here. If they did, there wouldn't be any need for end-user tuning: if the system were about to go into a memory-starvation spiral of death, it would detect as much and back off. The article mentions "memory pressure that can cause poor performance", but that's a gross misrepresentation, since this essentially kills a server (i.e., waiting 12 hours doesn't matter) when it occurs. Even a blue screen would be better than this - easier to debug, certainly.

I'd be happy to send more logs to the ReFS team, but we don't want to take the risk that Microsoft will declare that they "have a 'fix'" (KB4013429) and charge for the incident, since I haven't tried all the options it listed yet.

EricJ
Influencer
Posts: 18
Liked: 4 times
Joined: Jan 12, 2017 7:06 pm
Contact:

Re: REFS 4k horror story

Post by EricJ » Mar 27, 2017 2:10 pm

graham8 wrote: This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it, and it checked back in briefly and is now inaccessible again...going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down :|
Same here. Lost it over the weekend during two large synthetic full jobs halfway into their fast-clone process. I have now applied Option 2 (set to "32" - a total guess, since MS doesn't provide much guidance here). I am manually setting the jobs to run a synthetic full on Monday (today) so I can run a backup and monitor the Metafile usage with RamMap during the job.

So far I am noticing that the Metafile active memory climbs during a fast clone but does get released after the job completes. However, my jobs encompassing 1-1.5 TB of VMs have caused the active usage to climb beyond 2.5 GB. Soon I will simulate what caused the failure this weekend: two large file servers (6.3 TB and 3.3 TB) fast-cloning at the same time. I expect the Metafile active usage will climb much higher, but we will see how the server handles it.
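
If you want a record of this without sitting in front of RamMap, one rough option is logging the system cache working set over the job window. This is an assumption on my part: metafile pages live in the system cache, so the "\Memory\System Cache Resident Bytes" performance counter should roughly track RamMap's Metafile figure, but it is not an exact match. A sketch:

```powershell
# Sample the system cache working set every 60 s for ~2 hours
# and append it to a log file. Approximates (does not equal)
# RamMap's "Metafile" number.
Get-Counter -Counter '\Memory\System Cache Resident Bytes' `
    -SampleInterval 60 -MaxSamples 120 |
  ForEach-Object {
    $mb = [math]::Round($_.CounterSamples[0].CookedValue / 1MB)
    "$($_.Timestamp)  $mb MB" | Add-Content 'C:\Temp\cache-usage.log'
  }
```

Comparing the log timestamps against the job start/end times in Veeam should show whether the memory really is released after the fast clone finishes.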

graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » Mar 27, 2017 2:24 pm

Question: is everyone else getting the following event on a semi-regular basis? This exact event occurs on our 1) Veeam backup copy repo, 2) Veeam primary repo, and 3) 2016+ReFS Hyper-V host that is being backed up. Since it's the exact same text on all these servers, and it seems to always succeed, I haven't been in a panic about it, but it would make me feel better to know other people are getting it too.

Log: Microsoft-Windows-DataIntegrityScan/Admin
Source: DataIntegrityScan
EventID: 56
"Volume metadata inconsistency was detected and was repaired successfully.
Volume name: D:
Metadata reference: 0x204
Range offset: 0x0
Range length (in bytes): 0x0
Bytes repaired: 0x3000
Status: STATUS_SUCCESS"
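
For anyone who wants to check their own repo servers, a quick way to count these events (log name and event ID taken from the event text above) is:

```powershell
# Count DataIntegrityScan "repaired successfully" events (ID 56)
# from the last 30 days on this server.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Windows-DataIntegrityScan/Admin'
    Id        = 56
    StartTime = (Get-Date).AddDays(-30)
} -ErrorAction SilentlyContinue |
  Measure-Object |
  Select-Object -ExpandProperty Count
```

A nonzero count on an otherwise healthy server would at least tell us how widespread this is.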

mkretzer
Expert
Posts: 400
Liked: 79 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » Mar 27, 2017 2:27 pm 1 person likes this post

Here our system still works well after one week (64k, 110 TB repo), but we did two things:

- Patch & Reg setting Option 1
- Increase RAM of the Repo server from 128 GB to 384 GB

One thing is very strange: the system can read from production faster than it can write to the backend. Since the changes, we see it reading at an extremely high speed, then dropping to 0 after 2-3 minutes, then continuing again. I checked the backend storage of the repo and saw a steady stream of data - it looks as if all the data goes to RAM first and is then written out... which is kind of strange...

alesovodvojce
Enthusiast
Posts: 29
Liked: 2 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce » Mar 27, 2017 2:47 pm

Our servers are still having deadlocks after the patch with Options 1+2.
The frequency of deadlocks is lower, but this is still something that has no place in any production site.
The RamMap utility mentioned here shows steady growth of the "Metafile" portion, both Active and Total - gigabyte after gigabyte, until it uses all available memory. Windows kernel drivers shouldn't do this.

So far we have tried:

    RefsEnableLargeWorkingSetTrim = 1 (Option 1)

That did not help, so we added:

    RefsEnableLargeWorkingSetTrim = 1 (Option 1)
    RefsNumberOfChunksToTrim = 8 (Option 2)

After another deadlock today, we will set:

    RefsEnableLargeWorkingSetTrim = 1 (Option 1)
    RefsNumberOfChunksToTrim = 32 (Option 2)
    RefsEnableInlineTrim = 1 (Option 3)

So we will have all three options in place - the most aggressive combination. Will keep you updated. Usually it's less than a week until it hits the rocks.
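
In case it helps anyone applying the same combination, this is what all three values would look like as a .reg file. The key path and value names are DWORDs under the FileSystem key per the MS KB - verify against the article before importing, and note that 0x20 is 32 in decimal:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"RefsEnableLargeWorkingSetTrim"=dword:00000001
"RefsNumberOfChunksToTrim"=dword:00000020
"RefsEnableInlineTrim"=dword:00000001
```

A reboot appears to be needed after importing; the posters above all rebooted after setting the keys.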

