mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » 1 person likes this post

I am not the first one to be optimistic about MS software stability but it seems like RAM can really "solve" the issue.

Our system now hosts 114 TB of backups and the REFS is still very stable. We did not have one high latency message in Eventlog and even directly browsing is fast most of the time. It feels like a totally different system.

In our case the additional 256 GB of RAM is a much cheaper solution than more storage - the 114 TB only take 68 TB after only two weeks of synthetics...

Markus
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

graham8 wrote:Latest update. Confirmed no backups/copies/etc were taking place, and deleted two 6.5TB VBK files from the server. As usual with ReFS, available disk space only slowly began to make itself available. With all three of the MS workaround options in place, memory usage for this operation climbed over 100%. Then the usual occurred...numlock stopped responding, mouse stopped moving, disk activity lights stopped. I did multiple rounds of initiating manual memory dumps. This time, unlike all the other times this has occurred, the problem isn't working itself out by disabling all disk-activity-related services/tasks/etc (Veeam, server shares, scheduled data integrity scans, etc). Within 2-3 minutes, the server becomes unresponsive now with each boot cycle.

Updated Microsoft, but unless they come back to us with some way to set the volume read-only so that it temporarily stops whatever bug is occurring (even if it means it doesn't free the disk space) so we can recover the data from the volume, then it looks like we have permanently lost backup history and will need to nuke this and put in some completely different solution. And again, the volume itself is fine - the data is all accessible...just only for 2-3 minutes until the ReFS driver nukes the server.

I'll submit the memory dumps to Microsoft, so hopefully that at least helps them towards a long-term resolution to the underlying bug.
I've got similar problems with our new Veeam test setup. I opened a call with MS, but they won't help me with troubleshooting.
This is what they told me: "we (MS) don't support Veeam, use Windows Backup instead to test with one concurrent backup task at a time". As you can expect, this does not cause any problems... because this load is peanuts.
They don't want to help. Veeam support tells us that we have to go to Microsoft, as they see it as a performance issue on the host....

going nuts here
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

richardkraal wrote:I've got similar problems with our new Veeam test setup. I opened a call with MS, but they won't help me with troubleshooting.
This is what they told me: "we (MS) don't support Veeam, use Windows Backup instead to test with one concurrent backup task at a time". As you can expect, this does not cause any problems... because this load is peanuts.
They don't want to help. Veeam support tells us that we have to go to Microsoft, as they see it as a performance issue on the host....
going nuts here
Yikes. Did you open a business support case with Microsoft? If that was their only response, you should ask to be escalated to a supervisor, reference this thread, and ask that the issue be coordinated with others... maybe PM Gostev here with your case ID # so he can communicate it to the ReFS team (he has my case ID as well, and probably others).

I haven't gotten any resolutions or answers, but on my case they at least seem to be in communication with other people and let me know how they want me to gather memory dumps and all. I'm currently waiting on them to let me know where to submit them.

....though, I'm also building out Solaris boxes to switch back out to ZFS again.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@graham8 you're lucky that you got dump files. I couldn't get it to trigger and make one with verifier turned on. With it turned off it wasn't giving a complete picture and would create files easily and crash. With that said, Microsoft couldn't help unless I could get it to crash with verifier ON.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

kubimike wrote:@graham8 you're lucky that you got dump files. I couldn't get it to trigger and make one with verifier turned on. With it turned off it wasn't giving a complete picture and would create files easily and crash. With that said, Microsoft couldn't help unless I could get it to crash with verifier ON.
Not sure what you mean by verifier. Anyway, we set it up for manual memory dump initialization with keyboard hotkeys (holding down right control key and pressing scroll lock twice). It worked for me, even when the numlock key on the keyboard wouldn't toggle. I'll paste the instructions I got from Microsoft on how they wanted that set up below:

Step 1: Create a paging file

1.Click Start, right-click Computer, and then click Properties.
2.Click Advanced system settings on the System page, and then click the Advanced tab.
3.Click Settings under the Performance area.
4.Click the Advanced tab, and then click Change under the Virtual memory area.
5.Select the system partition where the operating system is installed.

Note :- To enable the system partition, you have to click to clear the Automatically manage paging file size for all drives check box.

6.Set the value of Initial size and Maximum size to the amount of physical RAM that is installed plus 1 megabyte (MB) under the Custom Size button.
7.Click Set, and then click OK three times.
8.Restart Windows in order for your changes to take effect.
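
As a rough illustration of item 6 (not part of Microsoft's instructions), the page file size works out as follows. The function name is made up for this sketch, and it assumes RAM is given in whole GB with 1 GB = 1024 MB.

```python
def pagefile_size_mb(ram_gb: int) -> int:
    """Return the Initial/Maximum page file size in MB: installed RAM plus 1 MB.

    Hypothetical helper for illustration; assumes 1 GB = 1024 MB.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return ram_gb * 1024 + 1


if __name__ == "__main__":
    # Example: a repository server with 256 GB of RAM, as mentioned in this thread.
    print(pagefile_size_mb(256))
```

Keep in mind that the system partition needs at least that much free space for both the page file and the resulting complete dump.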

Step 2: Settings in the Registry

1.Go to the registry key: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl
2.To specify that you want to use a complete memory dump, set the CrashDumpEnabled DWORD value to 1.

Step 3: Create a complete memory dump file

1.Click Start, right-click Computer, and then click Properties.
2.Click Advanced system settings on the System page, and then click the Advanced tab.
3.Click Settings under the Writing debugging information area, and then make sure Complete memory dump is selected.

Step 4: To enable the feature on a computer that uses a USB keyboard, follow these steps:

1.Start Registry Editor.
2.Locate and then click the following registry sub key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters
3.On the Edit menu, click Add Value, and then add the following registry entry:
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
4.Exit Registry Editor.
5.Restart the computer. (On a computer that uses a USB keyboard, you do not have to restart the computer. Unplugging the keyboard and plugging it back again is sufficient. After that, the Memory dump file can be generated.)

Step 5: To enable the feature on a computer that uses a PS2 keyboard, follow these steps:

1.Start Registry Editor.
2.Locate and then click the following registry sub key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
3.On the Edit menu, click Add Value, and then add the following registry entry:
Name : CrashOnCtrlScroll
Data Type : REG_DWORD
Value : 1
4.Exit Registry Editor.
5.Restart the computer. (On a computer that uses a USB keyboard, you do not have to restart the computer. Unplugging the keyboard and plugging it back again is sufficient. After that, the Memory dump file can be generated.)
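
For anyone who would rather not click through Registry Editor, the registry values from Steps 2, 4 and 5 can be collected into a .reg file. The sketch below only builds the file text; the function name is made up for illustration, and it assumes the standard "Windows Registry Editor Version 5.00" text format. Review the output before importing it on a real server.

```python
def build_reg_file(usb_keyboard: bool = True) -> str:
    """Return .reg file text that sets the values described in Steps 2, 4 and 5.

    Hypothetical helper: it only generates text and does not touch the registry.
    """
    # kbdhid is the service named for USB keyboards, i8042prt for PS/2 keyboards.
    service = "kbdhid" if usb_keyboard else "i8042prt"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        # Step 2: request a complete memory dump.
        r"[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl]",
        '"CrashDumpEnabled"=dword:00000001',
        "",
        # Steps 4/5: allow right Ctrl + Scroll Lock (twice) to trigger the dump.
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\%s\Parameters]" % service,
        '"CrashOnCtrlScroll"=dword:00000001',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(usb_keyboard=True))
```

The page file from Step 1 still has to be configured separately, and the restart (or keyboard replug) described above still applies.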
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

"Driver Verifier monitors Windows kernel-mode drivers and graphics drivers to detect illegal function calls or actions". So it creates dump files in kernel mode instead of user mode. So the steps you've outlined basically make a dump on demand? That's what I needed as well. I wish Microsoft had told me about that. I'd sit at a frozen server waiting for it to make a dump file on its own, which would never happen.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

Yep, what I outlined dumps the entire contents of the system memory to disk under C:\Windows\memory.dmp .. that can then be used with windbg to analyze what's going on (by Microsoft developers...likely not by any frontline support person). And yep, they should have mentioned it.... The system only writes automatic memory dumps when you encounter a BSOD, so for situations like these where that isn't happening, manually triggering a dump is the only option I'm aware of...
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

So when your machine is hose beasted and the clock stops working, you're still able to create a dump file? If so, pretty amazing.
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Ok. Bad news from our installation.

Even with 384 GB RAM we started to get the latency messages, the filesystem started "hanging" and WMI monitoring stopped working. It resolved itself after a while but my optimism is slowly going away. At the time there were no synthetic operations going on, which is strange, only normal backup writes...

The thing is, it took nearly one month and > 100 TB of backed up data. But in the end REFS seems to be somewhat unstable no matter what you do... It also does not seem to be RAM related; there is > 200 GB available.
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

mkretzer wrote:Ok. Bad news from our installation.

Even with 384 GB RAM we started to get the latency messages, the filesystem started "hanging" and WMI monitoring stopped working. It resolved itself after a while but my optimism is slowly going away. At the time there were no synthetic operations going on, which is strange, only normal backup writes...

The thing is, it took nearly one month and > 100 TB of backed up data. But in the end REFS seems to be somewhat unstable no matter what you do... It also does not seem to be RAM related; there is > 200 GB available.
ok, that's bad news.
I was in the mood for buying a lot of ram, hoping that would resolve the issue.

Tomorrow I've got a conference call with MS about the latency issues. What I saw was the following:
in Task Manager everything looked fine, no high latencies... but at the same time perfmon showed latencies of > 900 ms.
Of course, at the same time the data drive / filesystem was not responding.

I would like to know if there are other people experiencing the same
lepphce1
Enthusiast
Posts: 31
Liked: 2 times
Joined: Jun 28, 2016 4:40 pm
Contact:

Re: REFS 4k horror story

Post by lepphce1 »

@richardkraal - I am experiencing what you describe in regard to slow performance, especially when restarting after a crash. Sometimes I don't even have access to the volume. I've noticed high "Disk queue lengths" in Resource Monitor and the System process doing *something* with the disk. I'm working with ~20TB of data, 32GB memory. Server doesn't ever seem to be close to using all of that 32GB.

EDIT: 36TB of data, 17.9TB of disk used. ReFS+Veeam is pretty awesome if I could get a stable server!
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

richardkraal wrote: ok, that's bad news.
I was in the mood for buying a lot of ram, hoping that would resolve the issue.
The thing is... The RAM solved the crash issue completely for us. This Sunday we did all merges on one day at once (as a test). And the system is still running.

But still there was extreme latency (and messages) at some times last week when there was no merge running.
Perhaps we really have two different issues here.
DaveWatkins
Veteran
Posts: 370
Liked: 97 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: REFS 4k horror story

Post by DaveWatkins »

I see the event log warnings about I/O latency too and have done from day 1 with ReFS.

More RAM and 64k fixed all my stability issues, but the I/O latency warnings still appear, although I don't see any corruption or other issues because of them
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

@DaveWatkins
What kind of storage do you use behind your Veeam? I wonder if the behaviour is just because REFS works differently - for example, one job seems to "hang" right now, but the storage is working at its performance limit (many writes), so I think REFS is doing its unmapping right now... This might be normal for not-so-fast storage like the one we use right now (24 disks, 6 TB per disk, RAID 6).
adruet
Influencer
Posts: 23
Liked: 7 times
Joined: Oct 31, 2012 2:28 pm
Full Name: Alex
Contact:

Re: REFS 4k horror story

Post by adruet »

I also had plenty of I/O latency logs too (75TB repo with 64GB of RAM).
ReFS did work well in the beginning as well, but became more and more unusable as data grew, especially when several merges (6 to 10) were running at the same time.
We never experienced crashes with our 64GB RAM setup during backup operation. But doing a shift+del on about 7TB of backup files crashed the server completely (even with the KB and registry keys applied).

Anyway, we have migrated back to NTFS, and no more latency issues, no more backups failing, and the backup window is back to its 4 hours, when with ReFS it was a never ending process.
mkretzer
Veeam Legend
Posts: 1140
Liked: 387 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

@adruet
When you say "7 TB of backup files" - how many files are you talking about? Did you use per-VM?
As said before, my "feeling" is that the number of files also plays a role; that's why we disabled per-VM.
adruet
Influencer
Posts: 23
Liked: 7 times
Joined: Oct 31, 2012 2:28 pm
Full Name: Alex
Contact:

Re: REFS 4k horror story

Post by adruet »

We have one job per VM.
That was about 20 jobs with 30 points of retention max, so I would say around 620 files (including the vbm files).
That is not an extraordinary demand for a production file system, is it?
Rmachado
Service Provider
Posts: 23
Liked: 4 times
Joined: Dec 15, 2016 11:39 pm
Contact:

Re: REFS 4k horror story

Post by Rmachado »

A lot has changed since the beginning of the REFS story, including some patches (I read everything).

Is there any official recommendation from Veeam about the use of REFS and 64k? We're beginning a project with 60-80 VMs, about 90 TB of storage, and 32 GB of RAM.

Should I change the repository to NTFS to be safe? Or can I use REFS?

thank you.
kubimike
Veteran
Posts: 373
Liked: 41 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Everyone who's having this issue: what is your block size at the controller? My drive latency problems went away when I made it smaller. Unsure if it's related, but I figured I would throw that out there.
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

Rmachado wrote:Is there any official recommendation from Veeam about the use of REFS and 64k? We're beginning a project with 60-80 VMs, about 90 TB of storage, and 32 GB of RAM
64K is the official recommendation from Veeam at this time and I would recommend no less than 4GB of RAM per task, ideally more if you can.
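
To put the 4 GB per task figure into numbers, here is a small back-of-the-envelope sketch. It is not an official Veeam sizing tool; the function names and the optional OS reserve parameter are assumptions made for illustration.

```python
GB_PER_TASK = 4  # minimum per concurrent task quoted above; "ideally more"


def max_concurrent_tasks(ram_gb: int, os_reserve_gb: int = 0,
                         gb_per_task: int = GB_PER_TASK) -> int:
    """Return how many concurrent tasks fit in the given RAM.

    os_reserve_gb is a hypothetical allowance for the OS and other services.
    """
    usable_gb = max(ram_gb - os_reserve_gb, 0)
    return usable_gb // gb_per_task


def ram_needed_gb(tasks: int, os_reserve_gb: int = 0,
                  gb_per_task: int = GB_PER_TASK) -> int:
    """Return the RAM in GB needed to run the given number of concurrent tasks."""
    return tasks * gb_per_task + os_reserve_gb


if __name__ == "__main__":
    # The 32 GB server from the question above.
    print(max_concurrent_tasks(32))
    print(ram_needed_gb(16))
```

By this arithmetic, the 32 GB server in the question covers at most 8 concurrent tasks at the minimum figure, before leaving anything for the operating system.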
lepphce1
Enthusiast
Posts: 31
Liked: 2 times
Joined: Jun 28, 2016 4:40 pm
Contact:

Re: REFS 4k horror story

Post by lepphce1 »

This is a pretty long thread so I apologize if this has already been asked...

May I *gently* ask if the Veeam folks following this thread have been able to replicate this in your lab? The reason I ask, is the top Google search for "Server 2016 ReFS crash" is a Veeam thread, and not much else. For example, is there a correlation on how Veeam is using block alignment that exacerbates some kind of bug in ReFS? In other words, it seems like it's Veeam users who are getting the brunt of whatever is happening here, and I am wondering if Veeam is taking part in this investigation with Microsoft in any way? I understand that there are a lot of little differences in what we are all seeing, but the common thread here is server instability with this combination of products.

Thank you for your consideration...
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

Yes, we've been in touch with the ReFS development team on this issue for a while now. Right now, they are working with one of our customers who has the issue reproducing most consistently. Internally, we do not have a lab that replicates the issue (and based on our support statistics, it does not seem to be very common in general).
rendest
Influencer
Posts: 20
Liked: 6 times
Joined: Feb 01, 2017 8:36 pm
Full Name: Stef
Contact:

Re: REFS 4k horror story

Post by rendest »

Gostev wrote:Yes, we've been in touch with the ReFS development team on this issue for a while now. Right now, they are working with one of our customers who has the issue reproducing most consistently. Internally, we do not have a lab that replicates the issue (and based on our support statistics, it does not seem to be very common in general).
Are there any case IDs we can refer to?

Last week's update made matters worse, resulting in the VBR servers crashing overnight again. So besides the terrible performance, we now face rebooting our VBR servers every 24 hours.
j.forsythe
Influencer
Posts: 15
Liked: 4 times
Joined: Jan 06, 2016 10:26 am
Full Name: John P. Forsythe
Contact:

Re: REFS 4k horror story

Post by j.forsythe » 1 person likes this post

adruet wrote:
Based on my HP Hardware, 4 servers like this:
- HP DL380 Gen9 with dual CPU Intel E5-2660 v4 2Ghz, 64 GB of RAM, raid 1 SSD for the OS, and 2 NVMe 800GB disks
- Dual 10 Gbit network cards (HP 560FLR) supporting the offloading of SMB v3 (RDMA capable)
- 2 x DAS HP D3700 with 25 x 1.8TB 12G 10K SAS disks configured as Raid 6 with HP p441 controller

I have done some storage spaces (and Storage Spaces Direct) testing, and the results were not very promising in terms of performance.
When it was announced that the licensing would be Windows Server Datacenter only, we dropped the idea of ever using Storage Spaces Direct.
So we tried to use Storage Spaces locally, using the NVMe disks as journal disks to improve performance.
But comparing the results using a veeam backup profile with diskspd between our p441 controller in HBA mode with storage spaces and the NVMe as journal disks (write cache for the volume) and parity for the rest of the D3700 disks, and standard Raid 6 with the p441 controller, we decided to stick with the p441 and raid 6 as it was faster and less CPU consuming.
Regarding RAM usage, this is probably due to ReFS, and you can check that with Sysinternals RAMMap.
Hi and thank you for the information.

Yeah I think I will disable Storage Spaces and go back to recreating the RAID with the HP controller and NTFS.
Even if the Veeam officials keep praising the ReFS solution and keep telling us that only a few of us are having this problem, people are losing backup data, and that should not be something they just accept.
Seeing the weekly email from Gostev praising Windows 10 with ReFS 3.1 as a cheap, good solution for ROBO sites made me very disappointed.
And I sure hope, that future users using ReFS won't have to face the problem....

John
MatBac
Lurker
Posts: 1
Liked: never
Joined: Apr 24, 2017 12:56 pm
Full Name: Mattias Backrud
Contact:

Re: REFS 4k horror story

Post by MatBac »

Gostev wrote:All, here is the official KB article from Microsoft > FIX: Heavy memory usage in ReFS on Windows Server 2016 and Windows 10
Please don't forget to install KB4013429 before applying the registry values, and remember to reboot the server after doing so.

Finally, please do remember to share what option has worked for you!
Hi!

KB4013429 can't be installed on my system since it is intended for Server 2016 (OS Build 14393.953) and I have OS Build 14393.1066.
The package details for KB4013429 in the Microsoft Update Catalog show that it has been replaced by KB4015438 (OS Build 14393.969), KB4016635 (OS Build 14393.970) and KB4015217 (OS Build 14393.1066 and 14393.1083).

I can see that update KB4015217 was installed via Windows Update a couple of days ago. Since these are cumulative updates does this 4015217 include all the ReFS fixes from KB4013429 that I need?

Regarding the ReFS registry keys in question: are these supposed to be added automatically by this KB fix, or should I create them manually? None of them are present in my registry right now even though I have KB4015217 installed…
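
The build-number comparison in this post can be sketched as below. It assumes, as is normally the case for Windows 10 / Server 2016 cumulative updates, that a higher revision of the same build contains the fixes of the lower ones; that assumption is not confirmed anywhere in this thread, and the function names are made up for illustration. The sketch says nothing about whether the registry values have to be created by hand.

```python
from typing import Tuple


def parse_build(build: str) -> Tuple[int, int]:
    """Split an OS build string such as '14393.1066' into (build, revision)."""
    major, _, revision = build.partition(".")
    return int(major), int(revision or 0)


def is_at_or_past(installed: str, fix_build: str) -> bool:
    """Return True if the installed OS build is the same build line as
    fix_build and its revision is equal or higher.

    Compares numerically: a plain string comparison would wrongly rank
    '14393.1066' below '14393.953'.
    """
    inst_major, inst_rev = parse_build(installed)
    fix_major, fix_rev = parse_build(fix_build)
    return inst_major == fix_major and inst_rev >= fix_rev


if __name__ == "__main__":
    # KB4013429 corresponds to build 14393.953; the installed build is 14393.1066.
    print(is_at_or_past("14393.1066", "14393.953"))
```

Under that assumption, build 14393.1066 is already past the build that KB4013429 would have installed, which would explain why the installer refuses to run.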
dmartenstyn
Lurker
Posts: 2
Liked: never
Joined: Apr 24, 2017 1:35 pm
Contact:

Re: REFS 4k horror story

Post by dmartenstyn »

So I've been watching this issue closely with a vested interest since I have just deployed a solution utilising Veeam, a local repository on ReFS and Windows 2016.

My physical Veeam B&R server is fairly hefty (2 x E5-2667 v4's, 256GB RAM with a LSI MegaRAID 9361-4i attached to 16 x HGST Deskstar 4TB HDD's in a RAID10). ReFS was initially formatted at 4k but having stumbled on this thread (luckily at the beginning of deployment) I blew the config away and went with 64k instead. I've not experienced any issues thus far, fingers crossed, however my backup jobs are fairly small (1 job containing 14 VM's -> 400GB for a full to local repository and another that goes to a weekly rotated external hard disk). I've had RamMap running in the background since the start and the Metafile has crept up in usage (currently at 6.1GB). Free memory seems fine at 223GB. I've not made any updates or applied any patches to Windows (Veeam is 9.5.0.823).

This environment goes live in approximately 3-4 weeks so I'm stuck in the middle a tad. Whilst I am fully aware we have personally experienced no issues as of yet I am a bit hesitant since I am a contractor here and once I leave my client will effectively be on their own with the architecture. I am at the stage that I can effectively blow the config away again and go with NTFS if required. The lack of information from Veeam / Microsoft is a little concerning it must be said.
lepphce1
Enthusiast
Posts: 31
Liked: 2 times
Joined: Jun 28, 2016 4:40 pm
Contact:

Re: REFS 4k horror story

Post by lepphce1 »

@Gostev,
Thanks for the reply. I've not opened a ticket with Veeam up to this point because I didn't think anything worthwhile could be immediately remedied by support. Would you like those of us who are having ReFS troubles to open a Veeam ticket on this issue, if we have not already done so?
evander
Enthusiast
Posts: 86
Liked: 5 times
Joined: Nov 17, 2011 7:55 am
Contact:

Re: REFS 4k horror story

Post by evander » 1 person likes this post

Just a thought for those that are at the point where they are building a new repository and have to make a decision which way to go, ReFS or NTFS. If you have plenty of disk space and/or your available disk space will take a while to fill up why not create two volumes on the same server and format one ReFS and one NTFS. If you can, run your backup window twice per night to each one, (NTFS first I suggest) and then if Microsoft finally fix ReFS you can simply blow away the NTFS and extend the volume, or just split your backup jobs between two ReFS volumes.
The benefits of ReFS are really great (if it works) so build your ReFS and cover your bet with NTFS.
I understand your concern may be that if the server locks up nothing will back up, but that again can be less stressful: if you are forced to blow away (or simply dismount/pull out) your ReFS partition, the server will still be up and running on the NTFS partition, ready to resume backups. This is also only if you are one of the unlucky ones that has this problem with ReFS, as it's very sporadic at best and not everyone seems to be affected, myself included.

This is especially easy if your repository is running as a VM, but not that much more work if it's running on a physical server, and worth the extra admin if you ask me.

2 cents.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

lepphce1 wrote:@Gostev,
Thanks for the reply. I've not opened a ticket with Veeam up to this point because I didn't think anything worthwhile could be immediately remedied by support. Would you like those of us who are having ReFS troubles to open a Veeam ticket on this issue, if we have not already done so?
No, we really want everyone experiencing the issue to open a ticket with Microsoft instead, to help raise the priority of this issue on their side.
alesovodvojce
Enthusiast
Posts: 61
Liked: 9 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce »

After a week of tests, and after countless ReFS horror days personally lived through here, we copied our ReFS 4k repo to different filesystems to see the size differences. Here are the results.

In actual numbers
ReFS 4k: 21 TB (source repo)
ReFS 64k: 31 TB at least - we had to stop the file copy as the underlying disks ran out of free space
NTFS: 31 TB at least - stopped for the same reason. Finally we shrank the source repo to 13 TB by deleting files from it. After that, the copy to the target NTFS partition came to 24 TB (so 13 TB of ReFS 4k became 24 TB of NTFS).

Generalized
ReFS 4k - best space saver, but a lot of trouble (as in this thread)
ReFS 64k - not a win in space saving, while still keeping a lot of the ReFS benefits. But the troubles will theoretically start as well; they are just postponed until later (when the repo size grows beyond some unstated limit)
NTFS - not a win in space saving, no special benefits. Main benefit is a stable filesystem = backups secured

We have migrated the first repo to NTFS now and are enjoying stable backups. The second repo remains on ReFS 4k for now for experiments.
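
The size figures above can be turned into ratios with a few lines of arithmetic. This is only a sketch using numbers quoted in this thread; the function names are made up for illustration.

```python
def expansion_factor(source_tb: float, copied_tb: float) -> float:
    """Return how many times larger the data became after being copied."""
    if source_tb <= 0:
        raise ValueError("source_tb must be positive")
    return copied_tb / source_tb


def space_saving(full_size_tb: float, on_disk_tb: float) -> float:
    """Return the fraction of space saved relative to the full size."""
    if full_size_tb <= 0:
        raise ValueError("full_size_tb must be positive")
    return 1.0 - on_disk_tb / full_size_tb


if __name__ == "__main__":
    # 13 TB on ReFS 4k became 24 TB once copied to NTFS.
    print(f"expansion: {expansion_factor(13, 24):.2f}x")
    print(f"saving on ReFS 4k: {space_saving(24, 13):.1%}")
    # Earlier in the thread: 114 TB of backups occupying 68 TB on disk.
    print(f"saving reported earlier: {space_saving(114, 68):.1%}")
```

The 31 TB figures are left out on purpose, since those copies were stopped before they finished and only give a lower bound.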