REFS 4k horror story


Re: REFS 4k horror story

by mkretzer » Tue Mar 21, 2017 5:39 am 1 person likes this post

Yes, no big problems up until now. 105 TB 64K repo, but so far only 3 jobs with about 5 TB have been moved to it.

The only issue is that tape backup is very slow. Will perhaps contact support for that...
mkretzer
Expert
 
Posts: 251
Liked: 61 times
Joined: Thu Dec 17, 2015 7:17 am

Re: REFS 4k horror story

by Mike Resseler » Tue Mar 21, 2017 5:46 am

@mkretzer

Yes, please do. Maybe create a new forum thread for it also (with Case ID and follow-up as always ;-))
Mike Resseler
Veeam Software
 
Posts: 2795
Liked: 343 times
Joined: Fri Feb 08, 2013 3:08 pm
Location: Belgium, the land of the fries, the beer, the chocolate and the diamonds...
Full Name: Mike Resseler

Re: REFS 4k horror story

by WimVD » Tue Mar 21, 2017 10:31 am 1 person likes this post

HJAdams123 wrote: So has anyone actually had success with this latest patch and the registry settings?

No issues here, patched both my proxies and using the RefsEnableLargeWorkingSetTrim registry key.
Everything is stable and backups are lightning fast.
Then again, my ReFS repository has only been running for a week (40 TB in use of a 170 TB SOBR).
Fingers crossed it stays that way.
WimVD
Service Provider
 
Posts: 48
Liked: 10 times
Joined: Tue Dec 23, 2014 4:04 pm

Re: REFS 4k horror story

by VladV » Tue Mar 21, 2017 11:24 am

Stupid question, but I'm having some problems with our WSUS server so I want to check something with you guys. After applying the patch, do the registry keys need to be created manually or are they available after the update?

Thanks
VladV
Expert
 
Posts: 214
Liked: 24 times
Joined: Tue Apr 30, 2013 7:38 am
Full Name: Vlad Valeriu Velciu

Re: REFS 4k horror story

by WimVD » Tue Mar 21, 2017 11:26 am 1 person likes this post

You need to create them manually.
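
For anyone who wants to script it, here is a minimal PowerShell sketch for creating the Option 1 value. It assumes the key path given in the Microsoft KB (HKLM\SYSTEM\CurrentControlSet\Control\FileSystem) and that you reboot the repo server afterwards - double-check the KB article before running it.
Code: Select all
# Create the Option 1 ReFS tuning value (DWORD) manually; key path per the MS KB.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'
New-ItemProperty -Path $key -Name 'RefsEnableLargeWorkingSetTrim' -Value 1 -PropertyType DWord -Force
# Verify it was written before rebooting the repository server.
Get-ItemProperty -Path $key -Name 'RefsEnableLargeWorkingSetTrim'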
WimVD
Service Provider
 
Posts: 48
Liked: 10 times
Joined: Tue Dec 23, 2014 4:04 pm

Re: REFS 4k horror story

by kubimike » Tue Mar 21, 2017 4:34 pm

New HBA installed, ReFS volume found. Letting it sit idle to see if any issues crop up. BTW, this issue occurred w/o the latest KBs installed. HP hardware error 0x13 "Previous Lockup Code".

[UPDATE] Well, same error again. Crap. Back on the phone with HP.

[UPDATE 2] Solution > Attn HP guys, big-time bug in Smart Array firmware 4.52.
kubimike
Expert
 
Posts: 141
Liked: 20 times
Joined: Fri Feb 03, 2017 2:34 pm
Full Name: MikeO

Re: REFS 4k horror story

by j.forsythe » Fri Mar 24, 2017 2:48 pm 1 person likes this post

HJAdams123 wrote: So has anyone actually had success with this latest patch and the registry settings?

Hi guys.

My system has been running fine since installing the patch and the RefsEnableLargeWorkingSetTrim registry key.
Today I changed my setup so that all of my jobs write their backups to the two ReFS repositories (local SAS and iSCSI).
One thing I noticed is the change in Metafile RAM usage in the RamMap tool.
Before the patch it was using about 3.5 GB; after installing the patch this went down to 890 MB.

I just hope that the jobs keep running smoothly. :!:
Cheers,
John
j.forsythe
Influencer
 
Posts: 10
Liked: 3 times
Joined: Wed Jan 06, 2016 10:26 am
Full Name: John P. Forsythe

Re: REFS 4k horror story

by jimmycartrette » Mon Mar 27, 2017 12:40 pm

I have a 4k 2016 ReFS repo and am experiencing the issues.
I essentially turned all jobs off except the small production one until the hotfix. I applied the hotfix, set RefsEnableLargeWorkingSetTrim to 1, rebooted, and the jobs ran fine; I had them all on all week.
On Saturday (possibly caused by an 11 TB 3-VM job with a synthetic full), the repo locked up. I reset the repo; it was dead again shortly after.

I came in this morning and set RefsNumberOfChunksToTrim to 8. It still locks up. I've got the repo running and Veeam shut down for right now. Found some interesting events in the log...
An IO took more than 30000 ms to complete:

Process Id: 5152
Process name: VeeamAgent.exe
File name: 000000000000070F 00000000000003FD
File offset: 0
IO Type: Write: Paging, NonCached, Sync
IO Size: 4096 bytes
0 cluster(s) starting at cluster 0
Latency: 31884 ms

Volume Id: {b1f2e230-ca74-478a-ad8a-bca2eb274fbd}
Volume name: R:


Where are we as far as official guidance on this ReFS problem? I'm not in a position to reformat as 64K at this moment, but since we bought Ent Pro to move all of our production backups to Veeam, this is starting to get very concerning.

I should mention my repo is around 16 TB and I've got 32 GB of RAM assigned to it...
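
In case it helps anyone correlate these with the hangs, here is a rough PowerShell sketch for pulling the slow-IO events. I'm not sure which channel that event is written to on every build, so it simply enumerates the ReFS-related logs and filters on the message text.
Code: Select all
# Rough helper: find "An IO took more than ... ms to complete" events without
# assuming a specific log name - enumerate ReFS-related logs and match on the message.
Get-WinEvent -ListLog '*ReFS*' -ErrorAction SilentlyContinue | ForEach-Object {
    Get-WinEvent -LogName $_.LogName -ErrorAction SilentlyContinue |
        Where-Object { $_.Message -match 'took more than' } |
        Select-Object TimeCreated, Id, LogName, Message
}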
jimmycartrette
Novice
 
Posts: 9
Liked: 2 times
Joined: Thu Feb 02, 2017 2:13 pm
Full Name: JC

Re: REFS 4k horror story

by Mike Resseler » Mon Mar 27, 2017 12:45 pm

Hi JC,

I suggest you open a new support case so that our engineers are aware of this and can have a look. Post the case ID and follow-up after the case here also.
Thanks
Mike
Mike Resseler
Veeam Software
 
Posts: 2795
Liked: 343 times
Joined: Fri Feb 08, 2013 3:08 pm
Location: Belgium, the land of the fries, the beer, the chocolate and the diamonds...
Full Name: Mike Resseler

Re: REFS 4k horror story

by EricJ » Mon Mar 27, 2017 1:00 pm

We have 3 ReFS repos - 32 TB, 14 TB, 13 TB, all formatted 64k.

We had issues frequently until the reg key fix from MS. Applied the RefsEnableLargeWorkingSetTrim key. Backups ran fine all week with no issues, including some fast cloning during daily backup copy consolidation. However, for the second week in a row, the server froze during the weekly synthetic backup jobs on Saturday. I guess our next step is to try Option 2. I'm also going to try to monitor this week with RamMap.
EricJ
Influencer
 
Posts: 11
Liked: 1 time
Joined: Thu Jan 12, 2017 7:06 pm

Re: REFS 4k horror story

by graham8 » Mon Mar 27, 2017 1:20 pm

jimmycartrette wrote: I have a 4k 2016 ReFS repo and am experiencing the issues.... Applied the hotfix, set the RefsEnableLargeWorkingSetTrim to 1 ... Saturday ... the repo was locked. Reset the repo, dead shortly after. .. An IO took more than 30000 ms to complete


I'm glad you posted this. Almost exactly the same story here - I applied that update to our 2016 backup copy repo, applied RefsEnableLargeWorkingSetTrim = 1, rebooted, and almost immediately had a RAID card reset issued (never happened before) and the array drop out. After rebooting, everything was back online, but this weekend I again got the "IO took more than 30000 ms to complete" events. As I mentioned in a previous post, I've been getting these on all of these 2016/ReFS/Veeam servers. Those events don't always correlate with the lockups.

This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it, and it checked back in briefly and is now inaccessible again...going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down :|

The KB article about this (https://support.microsoft.com/en-us/help/4016173/fix-heavy-memory-usage-in-refs-on-windows-server-2016-and-windows-10) does mention trying the other two options. Maybe I should try Option 3 ("RefsEnableInlineTrim")...

The fact that the KB lists different things to "try" makes me suspect that Microsoft really has no idea what's going on here. If they did, there wouldn't be any need for end-user tuning... if the system were about to go into a memory-starvation spiral of death, it would detect as much and back off. The article mentions "memory pressure that can cause poor performance", but that's a gross misrepresentation, since this essentially kills a server when it occurs (i.e., waiting 12 hours doesn't matter). Even a blue screen would be better than this... easier to debug, certainly.

I'd be happy to send more logs to the ReFS team, but we don't want to take the risk that Microsoft will declare that they "have a 'fix'" (KB4013429) and charge for the incident...since I haven't tried all the options it listed yet.
graham8
Enthusiast
 
Posts: 54
Liked: 20 times
Joined: Wed Dec 14, 2016 1:56 pm

Re: REFS 4k horror story

by EricJ » Mon Mar 27, 2017 2:10 pm

graham8 wrote: This weekend, the copy repo in question with the patch went into an IO/memory(presumably)-starvation state again and is inaccessible. Had someone on site repower it, and it checked back in briefly and is now inaccessible again...going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups of the servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down :|


Same here. Lost it over the weekend during two large synthetic full jobs halfway into their fast clone process. I have now applied Option 2 (set to "32" - total guess since MS doesn't provide much guidance here). I am manually setting the jobs to run synthetic full on Monday (today) so I can run a backup and monitor the metafile usage with Rammap during the job.

So far I am noticing that the Metafile active memory climbs during a fast clone, but does get released after the job completes. However, my jobs encompassing 1-1.5 TB of VMs have caused the active usage to climb beyond 2.5 GB. Soon I will simulate what caused the failure this weekend - two large file servers (6.3 TB and 3.3 TB) fast cloning at the same time. I expect the metafile active usage will climb much higher - but we will see how the server handles it.
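
If anyone wants to log this instead of watching RamMap by hand, here is a small PowerShell sketch. It samples the system cache working set, which (as far as I understand it) is where the Metafile figure from RamMap lives, so treat it as a trend indicator rather than an exact match. The log path is just an example.
Code: Select all
# Sample the system cache working set once a minute for 12 hours and append it to a log.
# This roughly tracks the "Metafile" number RamMap shows; adjust the path as needed.
Get-Counter '\Memory\System Cache Resident Bytes' -SampleInterval 60 -MaxSamples 720 |
    ForEach-Object {
        $gb = [math]::Round($_.CounterSamples[0].CookedValue / 1GB, 2)
        "{0}  {1} GB" -f $_.Timestamp, $gb | Add-Content -Path C:\Temp\metafile-trend.log
    }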
EricJ
Influencer
 
Posts: 11
Liked: 1 time
Joined: Thu Jan 12, 2017 7:06 pm

Re: REFS 4k horror story

by graham8 » Mon Mar 27, 2017 2:24 pm

Question - is everyone else getting the following event on a semi-regular basis? This exact event occurs on our 1) Veeam backup copy repo, 2) Veeam primary repo, and 3) 2016+ReFS Hyper-V host which is being backed up. Since it's the exact same text on all these servers, and it seems to always succeed, I haven't been in a panic about it, but it would make me feel better to know other people are getting it.

Log: Microsoft-Windows-DataIntegrityScan/Admin
Source: DataIntegrityScan
EventID: 56
"Volume metadata inconsistency was detected and was repaired successfully.
Volume name: D:
Metadata reference: 0x204
Range offset: 0x0
Range length (in bytes): 0x0
Bytes repaired: 0x3000
Status: STATUS_SUCCESS"
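
If you want to compare, here is a quick PowerShell sketch that counts how often that event has fired per day, using the log name and event ID from the entry above:
Code: Select all
# Count the metadata-repair events (ID 56) per day from the DataIntegrityScan admin log.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-DataIntegrityScan/Admin'
    Id      = 56
} -ErrorAction SilentlyContinue |
    Group-Object { $_.TimeCreated.Date } |
    Sort-Object Name |
    Select-Object Name, Count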
graham8
Enthusiast
 
Posts: 54
Liked: 20 times
Joined: Wed Dec 14, 2016 1:56 pm

Re: REFS 4k horror story

by mkretzer » Mon Mar 27, 2017 2:27 pm 1 person likes this post

Here our system still works well after one week (64K, 110 TB repo), but we did two things:

- Patch & Reg setting Option 1
- Increase RAM of the Repo server from 128 GB to 384 GB

One thing is very strange - the system can read from production faster than it can write to the backend. Now, after the changes, we see it reading at an extremely high speed, then dropping to 0 after 2-3 minutes, and then continuing again. I checked the backend storage of the repo and saw a steady stream of data - it looks as if all the data goes to RAM first and is then written out... which is kind of strange...
mkretzer
Expert
 
Posts: 251
Liked: 61 times
Joined: Thu Dec 17, 2015 7:17 am

Re: REFS 4k horror story

by alesovodvojce » Mon Mar 27, 2017 2:47 pm

Our servers are still having deadlocks after the patch with Options 1+2.
The frequency of deadlocks is lower, but it is still something that has no place in any production site.
The RamMap utility mentioned here shows steady growth of the "Metafile" part, both Active and Total - gigabyte after gigabyte, until it uses all available memory. Windows kernel drivers shouldn't do this.

So far we have tried:
Code: Select all
RefsEnableLargeWorkingSetTrim = 1 (Option 1)


That did not help. So we then had:
Code: Select all
RefsEnableLargeWorkingSetTrim = 1 (Option 1)
RefsNumberOfChunksToTrim=8 (Option 2)


Again today, after another deadlock, we will go to:
Code: Select all
RefsEnableLargeWorkingSetTrim = 1 (Option 1)
RefsNumberOfChunksToTrim=32 (Option 2)
RefsEnableInlineTrim=1 (Option 3)


So we will have all three options in place - the most aggressive combination. Will keep you updated. Usually it's less than a week until it hits the rock.
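
For reference, a small PowerShell sketch that writes all three values at once (same key path as in the earlier snippet, taken from the Microsoft KB - verify against the article before using it), followed by a reboot:
Code: Select all
# Apply Options 1-3 in one go; values mirror the combination listed above.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'
@{
    RefsEnableLargeWorkingSetTrim = 1   # Option 1
    RefsNumberOfChunksToTrim      = 32  # Option 2
    RefsEnableInlineTrim          = 1   # Option 3
}.GetEnumerator() | ForEach-Object {
    New-ItemProperty -Path $key -Name $_.Key -Value $_.Value -PropertyType DWord -Force
}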
alesovodvojce
Influencer
 
Posts: 23
Liked: 1 time
Joined: Tue Nov 29, 2016 10:09 pm
