REFS 4k horror story

Re: REFS 4k horror story

by mkretzer » Mon Apr 10, 2017 9:46 pm

I am usually not the first to be optimistic about MS software stability, but it seems like RAM can really "solve" the issue.

Our system now hosts 114 TB of backups and the REFS volume is still very stable. We have not had a single high-latency message in the event log, and even browsing the volume directly is fast most of the time. It feels like a totally different system.

In our case the additional 256 GB of RAM is a much cheaper solution than more storage: the 114 TB of backups take up only 68 TB on disk after just two weeks of synthetics...
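To put a number on that saving, here is a small Python sketch. The function name is made up for illustration; only the 114 TB and 68 TB figures come from the post.

```python
def space_savings(logical_tb: float, physical_tb: float) -> tuple[float, float]:
    """Return (ratio, percent_saved) for logical data stored in physical space."""
    if logical_tb <= 0 or physical_tb <= 0:
        raise ValueError("sizes must be positive")
    ratio = logical_tb / physical_tb
    percent_saved = (1 - physical_tb / logical_tb) * 100
    return ratio, percent_saved


# Figures from the post: 114 TB of backups occupying 68 TB on disk.
ratio, saved = space_savings(114, 68)
print(f"{ratio:.2f}:1, {saved:.1f}% saved")
```

That works out to roughly 1.7:1, or about 40% less disk than the same backups would need without block cloning.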

Markus
mkretzer
Expert
 
Posts: 304
Liked: 67 times
Joined: Thu Dec 17, 2015 7:17 am

Re: REFS 4k horror story

by richardkraal » Wed Apr 12, 2017 9:03 am

graham8 wrote:Latest update. Confirmed no backups/copies/etc were taking place, and deleted two 6.5TB VBK files from the server. As usual with ReFS, available disk space only slowly began to make itself available. With all three of the MS workaround options in place, memory usage for this operation climbed over 100%. Then the usual occurred...numlock stopped responding, mouse stopped moving, disk activity lights stopped. I did multiple rounds of initiating manual memory dumps. This time, unlike all the other times this has occurred, the problem isn't working itself out by disabling all disk-activity-related services/tasks/etc (Veeam, server shares, scheduled data integrity scans, etc). Within 2-3 minutes, the server becomes unresponsive now with each boot cycle.

Updated Microsoft, but unless they come back to us with some way to set the volume read-only so that it temporarily stops whatever bug is occurring (even if it means it doesn't free the disk space) so we can recover the data from the volume, then it looks like we have permanently lost backup history and will need to nuke this and put in some completely different solution. And again, the volume itself is fine - the data is all accessible...just only for 2-3 minutes until the ReFS driver nukes the server.

I'll submit the memory dumps to Microsoft, so hopefully that at least helps them towards a long-term resolution to the underlying bug.


I've got similar problems with our new Veeam test setup. I opened a call with MS, but they won't help me with troubleshooting.
This is what they told me: we (MS) don't support Veeam, use Windows Backup instead and test with one concurrent backup task at a time. As you can expect, this doesn't cause any problems... because that load is peanuts.
They don't want to help. Veeam support tells us that we have to go to Microsoft, as they see it as a performance issue at the host....

Going nuts here.
richardkraal
Novice
 
Posts: 8
Liked: 1 time
Joined: Wed Apr 05, 2017 10:48 am
Full Name: Richard Kraal

Re: REFS 4k horror story

by graham8 » Wed Apr 12, 2017 12:14 pm

richardkraal wrote: I've got similar problems with our new Veeam test setup. I opened a call with MS, but they won't help me with troubleshooting.
This is what they told me: we (MS) don't support Veeam, use Windows Backup instead and test with one concurrent backup task at a time. As you can expect, this doesn't cause any problems... because that load is peanuts.
They don't want to help. Veeam support tells us that we have to go to Microsoft, as they see it as a performance issue at the host....
Going nuts here.


Yikes. Did you open a business support case with Microsoft? If that was their only response, you should ask to be escalated to a supervisor, reference this thread, and ask that the issue be coordinated with others... maybe PM Gostev here with your case ID # so he can communicate it to the ReFS team (he has my case ID as well, and probably others).

I haven't gotten any resolutions or answers, but on my case they at least seem to be in communication with other people and let me know how they want me to gather memory dumps and all. I'm currently waiting on them to let me know where to submit them.

....though, I'm also building out Solaris boxes to switch back out to ZFS again.
graham8
Enthusiast
 
Posts: 59
Liked: 20 times
Joined: Wed Dec 14, 2016 1:56 pm

Re: REFS 4k horror story

by kubimike » Wed Apr 12, 2017 3:36 pm

@graham8 you're lucky that you got dump files. I couldn't get it to trigger and make one with Verifier turned on. With it turned off, it would crash and create dump files easily, but they didn't give a complete picture. That said, Microsoft couldn't help unless I could get it to crash with Verifier ON.
kubimike
Expert
 
Posts: 230
Liked: 22 times
Joined: Fri Feb 03, 2017 2:34 pm
Full Name: MikeO

Re: REFS 4k horror story

by graham8 » Wed Apr 12, 2017 3:47 pm

kubimike wrote: @graham8 you're lucky that you got dump files. I couldn't get it to trigger and make one with Verifier turned on. With it turned off, it would crash and create dump files easily, but they didn't give a complete picture. That said, Microsoft couldn't help unless I could get it to crash with Verifier ON.


Not sure what you mean by verifier. Anyway, we set it up so a memory dump can be triggered manually with keyboard hotkeys (holding down the right Ctrl key and pressing Scroll Lock twice). It worked for me, even when the Num Lock key on the keyboard wouldn't toggle. I'll paste the instructions I got from Microsoft on how they wanted that set up below:

Step 1: Create a paging file

1. Click Start, right-click Computer, and then click Properties.
2. Click Advanced system settings on the System page, and then click the Advanced tab.
3. Click Settings under the Performance area.
4. Click the Advanced tab, and then click Change under the Virtual memory area.
5. Select the system partition where the operating system is installed.

Note: To enable the system partition, you have to clear the Automatically manage paging file size for all drives check box.

6. Select Custom size, and set both Initial size and Maximum size to the amount of installed physical RAM plus 1 megabyte (MB).
7. Click Set, and then click OK three times.
8. Restart Windows for your changes to take effect.
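The page file dialog takes its sizes in MB, so the "RAM plus 1 MB" rule from step 6 can be worked out as in the sketch below. The function name is made up for illustration, and the RAM sizes are only examples taken from servers mentioned in this thread.

```python
def pagefile_size_mb(ram_gb: int) -> int:
    """Initial/maximum page file size in MB: installed RAM plus 1 MB."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return ram_gb * 1024 + 1


# Example RAM sizes (GB) of servers mentioned in this thread.
for ram_gb in (32, 64, 256, 384):
    print(f"{ram_gb} GB RAM -> page file {pagefile_size_mb(ram_gb)} MB")
```

Keep in mind that the system partition needs at least that much free space for the page file alone.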

Step 2: Settings in the Registry

1. Go to the registry key: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl
2. To specify that you want a complete memory dump, set the CrashDumpEnabled DWORD value to 1.
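If you would rather import the setting than click through Registry Editor, the same value can be written as a .reg file. The Python sketch below only generates the text; `render_reg` is a made-up helper name, and the key and value are the ones from Step 2. Review the output before importing it.

```python
def render_reg(entries: dict[str, dict[str, int]]) -> str:
    """Render REG_DWORD values as the text of a .reg file."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, values in entries.items():
        lines.append(f"[{key}]")
        for name, dword in values.items():
            if not 0 <= dword <= 0xFFFFFFFF:
                raise ValueError(f"{name} does not fit in a DWORD")
            # .reg files store DWORDs as eight hex digits
            lines.append(f'"{name}"=dword:{dword:08x}')
        lines.append("")  # blank line after each key
    return "\n".join(lines)


# Key and value from Step 2 (1 = complete memory dump).
CRASH_CONTROL = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl"
reg_text = render_reg({CRASH_CONTROL: {"CrashDumpEnabled": 1}})
print(reg_text)
```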

Step 3: Create a complete memory dump file

1. Click Start, right-click Computer, and then click Properties.
2. Click Advanced system settings on the System page, and then click the Advanced tab.
3. Click Settings under the Startup and Recovery area, and then make sure Complete memory dump is selected under Write debugging information.

Step 4: To enable the feature on a computer that uses a USB keyboard, follow these steps:

1. Start Registry Editor.
2. Locate and then click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters
3. On the Edit menu, click Add Value, and then add the following registry entry:
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
4. Exit Registry Editor.
5. Restart the computer. (On a computer that uses a USB keyboard, you do not have to restart the computer. Unplugging the keyboard and plugging it back in is sufficient. After that, the memory dump file can be generated.)

Step 5: To enable the feature on a computer that uses a PS2 keyboard, follow these steps:

1. Start Registry Editor.
2. Locate and then click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
3. On the Edit menu, click Add Value, and then add the following registry entry:
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
4. Exit Registry Editor.
5. Restart the computer for the change to take effect.
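Steps 4 and 5 differ only in the driver service name, so they can also be applied from an elevated command prompt with the built-in `reg add` command. The Python sketch below just builds and prints the command lines. The dictionary and function names are made up for illustration, and nothing is executed.

```python
# Keyboard type -> driver service that reads CrashOnCtrlScroll (from Steps 4 and 5).
KEYBOARD_SERVICES = {"usb": "kbdhid", "ps2": "i8042prt"}


def crash_on_ctrl_scroll_command(keyboard: str) -> str:
    """Build the `reg add` command line that enables the crash hotkey."""
    service = KEYBOARD_SERVICES[keyboard]
    key = "\\".join(
        ["HKLM", "SYSTEM", "CurrentControlSet", "Services", service, "Parameters"]
    )
    # /v value name, /t type, /d data, /f overwrite without prompting
    return f'reg add "{key}" /v CrashOnCtrlScroll /t REG_DWORD /d 1 /f'


for keyboard in KEYBOARD_SERVICES:
    print(crash_on_ctrl_scroll_command(keyboard))
```

Setting both values does no harm if you are not sure which keyboard type will be plugged in when the server hangs.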

Re: REFS 4k horror story

by kubimike » Wed Apr 12, 2017 3:53 pm

"Driver Verifier monitors Windows kernel-mode drivers and graphics drivers to detect illegal function calls or actions". So it creates dump files in kernel mode instead of user mode. So the steps you've outlined basically make a dump on demand? That's what I needed as well. I wish Microsoft had told me about that. I'd sit at a frozen server waiting for it to make a dump file on its own, which would never happen.

Re: REFS 4k horror story

by graham8 » Wed Apr 12, 2017 3:55 pm

Yep, what I outlined dumps the entire contents of system memory to disk at C:\Windows\memory.dmp, which can then be analyzed with WinDbg (by Microsoft developers... likely not by any frontline support person). And yep, they should have mentioned it.... The system only writes memory dumps automatically when you encounter a BSOD, so for situations like these where that isn't happening, manually triggering a dump is the only option I'm aware of...
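Because a complete dump holds all of physical memory, the file ends up about as large as the installed RAM, which matters on the 256 to 384 GB servers discussed here. A minimal Python sketch for checking free space first follows. The function name and the 1 GiB safety margin are assumptions for illustration, not Microsoft guidance.

```python
import os
import shutil


def room_for_complete_dump(ram_bytes: int, volume: str,
                           margin_bytes: int = 1024**3) -> bool:
    """True if `volume` has free space for a dump of `ram_bytes` plus a margin."""
    free = shutil.disk_usage(volume).free
    return free >= ram_bytes + margin_bytes


# Root of the current drive; point this at the system volume if it differs.
volume = os.path.abspath(os.sep)
ram_bytes = 384 * 1024**3  # example: a 384 GB server
print(f"Room for a complete dump on {volume}: {room_for_complete_dump(ram_bytes, volume)}")
```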

Re: REFS 4k horror story

by kubimike » Wed Apr 12, 2017 4:03 pm

So when your machine is hose beasted and the clock stops working, you're still able to create a dump file? If so, pretty amazing.

Re: REFS 4k horror story

by mkretzer » Sat Apr 15, 2017 6:45 am

Ok. Bad news from our installation.

Even with 384 GB RAM we started to get the latency messages, the filesystem started "hanging" and WMI monitoring stopped working. It resolved itself after a while, but my optimism is slowly going away. Strangely, there were no synthetic operations going on at the time, only normal backup writes...

The thing is, it took nearly one month and > 100 TB of backed-up data. But in the end REFS seems to be somewhat unstable no matter what you do... It also does not seem to be RAM related: there is > 200 GB available.

Re: REFS 4k horror story

by richardkraal » Mon Apr 17, 2017 6:32 pm

mkretzer wrote: Ok. Bad news from our installation.

Even with 384 GB RAM we started to get the latency messages, the filesystem started "hanging" and WMI monitoring stopped working. It resolved itself after a while, but my optimism is slowly going away. Strangely, there were no synthetic operations going on at the time, only normal backup writes...

The thing is, it took nearly one month and > 100 TB of backed-up data. But in the end REFS seems to be somewhat unstable no matter what you do... It also does not seem to be RAM related: there is > 200 GB available.


OK, that's bad news.
I was in the mood to buy a lot of RAM, hoping that would resolve the issue.

Tomorrow I've got a conference call with MS about the latency issues. What I saw was the following:
in Task Manager everything looked fine, no high latencies... but at the same time perfmon showed latencies of > 900 ms.
Of course, at the same time the data drive / filesystem was not responding.
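For anyone who wants to compare notes, perfmon counters can be logged to CSV and scanned for samples above a threshold. The Python sketch below assumes a typeperf-style CSV (timestamp in the first column, one counter per further column, values in seconds, as with the Avg. Disk sec/Write counter). The sample data and the hostname BACKUP01 are made up, and only the 900 ms threshold comes from the observation above.

```python
import csv
import io


def slow_samples(csv_text: str, threshold_s: float = 0.9):
    """Return (timestamp, counter, seconds) for every sample above the threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    hits = []
    for row in reader:
        if not row:
            continue
        timestamp, *values = row
        for counter, raw in zip(header[1:], values):
            try:
                seconds = float(raw)
            except ValueError:
                continue  # blank or missing sample
            if seconds > threshold_s:
                hits.append((timestamp, counter, seconds))
    return hits


# Made-up sample in the assumed CSV layout.
SAMPLE = r'''"(PDH-CSV 4.0)","\\BACKUP01\LogicalDisk(D:)\Avg. Disk sec/Write"
"04/17/2017 18:30:00.000","0.004"
"04/17/2017 18:30:15.000","0.912"
"04/17/2017 18:30:30.000"," "
'''

hits = slow_samples(SAMPLE)
for timestamp, counter, seconds in hits:
    print(f"{timestamp}  {counter}  {seconds * 1000:.0f} ms")
```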

I would like to know if other people are experiencing the same.

Re: REFS 4k horror story

by lepphce1 » Mon Apr 17, 2017 7:00 pm

@richardkraal - I am experiencing what you describe with regard to slow performance, especially when restarting after a crash. Sometimes I don't even have access to the volume. I've noticed high "Disk queue lengths" in Resource Monitor and the System process doing *something* with the disk. I'm working with ~20TB of data, 32GB memory. The server never seems to come close to using all of that 32GB.

EDIT: 36TB of data, 17.9TB of disk used. ReFS+Veeam is pretty awesome if I could get a stable server!
lepphce1
Enthusiast
 
Posts: 29
Liked: 2 times
Joined: Tue Jun 28, 2016 4:40 pm

Re: REFS 4k horror story

by mkretzer » Mon Apr 17, 2017 8:24 pm

richardkraal wrote: OK, that's bad news.
I was in the mood to buy a lot of RAM, hoping that would resolve the issue.


The thing is... the RAM solved the crash issue completely for us. This Sunday we ran all merges at once on a single day (as a test), and the system is still running.

But still there was extreme latency (and messages) at some times last week when there was no merge running.
Perhaps we really have two different issues here.

Re: REFS 4k horror story

by DaveWatkins » Tue Apr 18, 2017 2:21 am

I see the event log warnings about I/O latency too and have done from day 1 with ReFS.

More RAM and 64k clusters fixed all my stability issues, but the I/O latency warnings still appear, although I don't see any corruption or other issues because of them.
DaveWatkins
Expert
 
Posts: 248
Liked: 61 times
Joined: Sun Dec 13, 2015 11:33 pm

Re: REFS 4k horror story

by mkretzer » Tue Apr 18, 2017 6:32 am

@DaveWatkins
What kind of storage do you use behind your Veeam? I wonder if the behaviour is just because REFS works differently. For example, one job seems to "hang" right now, but the storage is working at its performance limit (many writes), so I think REFS is doing its unmapping right now... This might be normal for not-so-fast storage like the one we use right now (24 disks, 6 TB per disk, RAID 6).

Re: REFS 4k horror story

by adruet » Tue Apr 18, 2017 7:34 am

I also had plenty of I/O latency logs (75TB repo with 64GB of RAM).
ReFS worked well in the beginning here too, but became more and more unusable as data grew, especially when several merges (6 to 10) were running at the same time.
We never experienced crashes with our 64GB RAM setup during backup operations. But doing a shift+del on about 7TB of backup files crashed the server completely (even with the KB and registry keys applied).

Anyway, we have migrated back to NTFS: no more latency issues, no more failing backups, and the backup window is back to 4 hours, whereas with ReFS it was a never-ending process.
adruet
Influencer
 
Posts: 22
Liked: 6 times
Joined: Wed Oct 31, 2012 2:28 pm
Full Name: Alex
