rfn
Expert
Posts: 141
Liked: 5 times
Joined: Jan 27, 2010 9:43 am
Full Name: René Frej Nielsen
Contact:

Re: REFS 4k horror story

Post by rfn »

I enabled the RefsEnableLargeWorkingSetTrim registry key but none of the others...

I have no experience with RAMMap... how do I set it up to record a sequence like you have created? The problem is that once the server is unresponsive, then I can't save anything before resetting the server.
Hauke
Influencer
Posts: 23
Liked: 4 times
Joined: Apr 16, 2015 11:25 am
Full Name: Hauke Ihnen
Contact:

Re: REFS 4k horror story

Post by Hauke »

More details on my issues with the freezing storage:

The registry keys from MS are not helping. They lower the RAM used by ReFS, but Windows still freezes. With all 3 options set, RAM usage stays at ~3GB max, not more.
Freezes always occur under heavy load on the array, for example multiple running jobs, compacting jobs, or heavy reading and writing at the same time (backups running + offsite backup to tape job).
Also, ReFS lost all its performance benefits after a few weeks; it has gotten very slow. It was only fast just after creation. It feels like a very heavily fragmented drive. It's not even using the full 1GBit LAN now.

For me it is very clear: I will return to NTFS. No time to play beta tester for MS; backup must be reliable without spending hours upon hours on storage issues.

After yet another freeze today, let's hope that the drive comes up again so I can move the data to other storage... at the moment it freezes 5 minutes after each reboot.
And let's hope that moving the huge files away will not cause the storage to freeze again...

Edit: I was connected to the storage by RDP during the freeze, so I could see Task Manager just before the box died: memory usage 10GB of 24GB. So it's not a memory issue for me... CPU load 100%, uptime 19 minutes. Yay!
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

rfn wrote:Yes, I'm using NIC teaming... I have an HP 10G NIC where I have teamed the two connectors and connected them to two HPE 5900 series switches that are stacked for redundancy.

I literally only have Windows Server, Veeam Backup & Replication and the HPE drivers and tools on the server. Nothing else... I also got the search result that you're linking to.
Are you agentless? Maybe that's HP's stuff doing that.

Post by kubimike »

Hauke wrote:Edit: I was connected to the storage by RDP during the freeze, so I could see Task Manager just before the box died: memory usage 10GB of 24GB. So it's not a memory issue for me... CPU load 100%, uptime 19 minutes. Yay!
You won't see ReFS memory usage in Task Manager. Also get in touch with Gostev; IIRC he has a fix from Microsoft to try out.

Post by rfn »

I'm not 100% sure but "HPE ProLiant Agentless Management Service" is installed... I have a second VBR server as a repository, on another site, and it has the same software installed, but doesn't get this error. That server is a DL380 Gen8.

Post by kubimike »

@RFN, damn dude you got me stumped. I've given you my bag of tricks to get a DL380 to work with Veeam lol. The only thing left is that we are running different Windows patches.

For s****-n-giggles you could remove all Win updates except for the one that is there by default and install only the following:

kb3211320 + kb3213986

Oh and so you don't get crypt0rwared turn off SMB v1 in features :D

This is how I run my Veeam box.

Post by rfn »

kubimike wrote:@RFN, damn dude you got me stumped. I've given you my bag of tricks to get a DL380 to work with Veeam lol. The only thing left is that we are running different Windows patches.
:D
My servers are fully patched... I have also updated to VBR update 2 today. I will see if it hangs again, and if it does, I will implement the RefsNumberOfChunksToTrim registry key like you did and see if that helps...
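
For anyone following along with the registry keys, here is a minimal sketch in Python that renders a .reg fragment for the value names mentioned in this thread. The key path (HKLM\SYSTEM\CurrentControlSet\Control\FileSystem) is an assumption recalled from Microsoft's guidance and is not stated anywhere in this thread, so verify it before importing anything. The helper name render_reg is made up for illustration. No value for RefsNumberOfChunksToTrim is given here, so the example only sets RefsEnableLargeWorkingSetTrim to 1, the setting reported elsewhere in the thread.

```python
# Sketch only: renders .reg text for ReFS tuning values named in this thread.
# ASSUMPTION: the key path below is recalled from Microsoft's guidance and is
# not stated in the thread; check it against the official KB before use.
FILESYSTEM_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem"


def render_reg(values: dict[str, int]) -> str:
    """Return .reg file text that sets each name to a DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{FILESYSTEM_KEY}]"]
    for name, value in values.items():
        # A DWORD is an unsigned 32-bit integer.
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError(f"{name}: {value} does not fit in a DWORD")
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(render_reg({"RefsEnableLargeWorkingSetTrim": 1}))
```

The output is plain text, so it can be reviewed before anything is imported on the repository server.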

Post by rfn »

kubimike wrote:@RFN, damn dude you got me stumped. I've given you my bag of tricks to get a DL380 to work with Veeam lol. The only thing left is that we are running different Windows patches.

For s****-n-giggles you could remove all Win updates except for the one that is there by default and install only the following:

kb3211320 + kb3213986

Oh and so you don't get crypt0rwared turn off SMB v1 in features :D

This is how I run my Veeam box.
Interesting suggestion... but I really like to have my servers patched, and I'm pretty sure that our auditors would spank me if I did what you suggest :shock:

I have really locked down the Windows Firewall on these boxes so any ransomware would have to be very good to get into them! Unfortunately the firewall rules are "reset" by the VBR update 2 installation :x

Post by kubimike »

@RFN
Can't you try it to see if it works, though? Just for a few days? All my prior recommendations + the Microsoft KBs have made my box stable for a few months now.

Post by rfn »

It's an interesting suggestion... I will first see if it magically works now, and if not, then try the registry fix. If that doesn't work either, then I can try it, but I almost hope that it doesn't fix it, because we really should be able to patch our servers without it breaking stuff.

I'm also considering rebuilding the server to see if that error in the Application log goes away!

Post by kubimike »

@RFN sounds like a plan. I'm excited to hear the result.

Post by Hauke »

Hauke wrote: Also, ReFS lost all its performance benefits after a few weeks; it has gotten very slow. It was only fast just after creation.
Just to add numbers... I have two identical NAS devices. One with Windows Server 2016 and ReFS, the second one with 2012 R2 and NTFS. Same RAID config. Same hard disks.
I'm now copying the files from the ReFS box to the NTFS box.
Speed: 50MB/s, not more. Reading only 1 file, no other load on the box!
Load:
Source (ReFS): constant 100%
Target (NTFS): 2-4% (and it's not a new drive, it's old and fragmented too, used before for Veeam for over 1 year)

...

Again, ReFS worked fine without issues for 2 months, but every day it got slower and slower. Maybe ReFS isn't a good choice for hard disks because of its heavy fragmentation, and it will work better on SSDs.
I don't think a simple patch from MS will solve that; it's by design.
suprnova
Enthusiast
Posts: 38
Liked: never
Joined: Apr 08, 2016 5:15 pm
Contact:

Re: REFS 4k horror story

Post by suprnova »

suprnova wrote:I was hoping to avoid the issue by not using synthetic fulls, but this issue is also happening for incremental merges with block cloning. My CPU and RAM are fine, but during the merge I am unable to browse the Veeam repo drive in Windows.

I am fully patched and I have RefsEnableLargeWorkingSetTrim set to 1.
I did try out the test fix from Microsoft, but it did not help last night. I do not have CPU or memory problems, but my WMI monitoring has large gaps in my repository data. It's tough to say what causes it; when the instability started, there was only one merge running. Overall, I think at this point I need to turn off block cloning, move back to NTFS, or start using the block clone synthetic fulls.
thomas.raabo
Service Provider
Posts: 28
Liked: 11 times
Joined: Oct 31, 2016 6:27 pm
Full Name: Thomas Raabo
Location: infrastructure guy
Contact:

Re: REFS 4k horror story

Post by thomas.raabo »

News update.

Working with a new ReFS.sys driver from MS and everything seems much more stable.

Still too early to say anything... but it does seem to have a big effect on our setup.
Skyview
Service Provider
Posts: 56
Liked: 14 times
Joined: Jan 10, 2012 8:53 pm
Contact:

Re: REFS 4k horror story

Post by Skyview »

Hauke wrote:Again, ReFS worked fine without issues for 2 months, but every day it got slower and slower. Maybe ReFS isn't a good choice for hard disks because of its heavy fragmentation, and it will work better on SSDs.
I don't think a simple patch from MS will solve that; it's by design.
But this wouldn't necessarily matter for customers using block storage, correct? I'm on a Compellent SC8000 & SCv2080.

Post by Skyview »

One BIG reason I'm trying ReFS: corruption detection. But wouldn't a regular backup files health check on NTFS accomplish the same thing?
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

Hauke wrote:Maybe ReFS isn't a good choice for hard disks because of its heavy fragmentation, and it will work better on SSDs.
There's no "heavy fragmentation" with ReFS, as Veeam blocks (which is what is being cloned) are quite large: 512KB on average with the default settings. With such a block size, even a single 7200rpm spindle should be able to do 30-50MB/s throughput on a 100% "fragmented" volume, while any backup storage will usually have multiple spindles and so much more I/O capacity. So the reason here is not fragmentation, but something else. Likely it is the impact of the core issue discussed here, because it looks like that issue simply keeps the entire volume overloaded and constantly busy.
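
As a rough sanity check of that estimate, here is a back-of-the-envelope sketch in Python. It assumes every 512KB block costs one random read and that a single 7200rpm spindle sustains roughly 60 to 100 such reads per second. That IOPS range is a rule-of-thumb assumption, not a figure from this thread, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the single-spindle throughput estimate.
# ASSUMPTION: 60-100 random reads per second for one 7200rpm spindle is a
# rule-of-thumb range, not a number taken from this thread.
BLOCK_KB = 512  # average Veeam block size with default settings, per the post


def random_read_mb_per_s(iops: float, block_kb: int = BLOCK_KB) -> float:
    """Throughput when every block needs its own seek (fully "fragmented")."""
    return iops * block_kb / 1024  # KB/s -> MB/s


for iops in (60, 80, 100):
    print(f"{iops} IOPS -> {random_read_mb_per_s(iops):.0f} MB/s")
# 60 IOPS -> 30 MB/s
# 80 IOPS -> 40 MB/s
# 100 IOPS -> 50 MB/s
```

Under those assumptions the result lands in the 30-50MB/s range for one spindle, so a multi-spindle array topping out at that level points at something other than fragmentation.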
Skyview wrote:One BIG reason I'm trying ReFS: corruption detection. But wouldn't a regular backup files health check on NTFS accomplish the same thing?
No, not the same - health check only checks and fixes (if needed) the latest restore point, while ReFS monitors the entire volume (including GFS backups etc.). Plus, with Storage Spaces, it is able to recover from corruption such as bit rot too - making it an awesome choice for long-term backup repositories.

Post by Gostev »

thomas.raabo wrote:News update.

Working with a new ReFS.sys driver from MS and everything seems much more stable.

Still too early to say anything... but it does seem to have a big effect on our setup.
Awesome news! Let's observe it for a week or two now.

Post by kubimike »

new refs.sys, music to my monitor! :shock: :P

Post by Skyview »

This thread is a bit deep, any more information on this?

Post by kubimike » 2 people like this post

@Skyview, head to the restroom and read up .. all good stuff :)

Post by Gostev »

We've been testing a private fix from Microsoft with 6 affected customers; Thomas is one of those who volunteered when I asked for help a few pages ago. I think it makes sense to wait for feedback from all of them before expanding this effort.

Post by Skyview »

Thanks for the update Gostev.
alesovodvojce
Enthusiast
Posts: 63
Liked: 9 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce »

@Gostev, interested in testing. Case #02173809. Our backup server froze just a few hours ago.

Post by thomas.raabo » 1 person likes this post

Gostev wrote:We've been testing a private fix from Microsoft with 6 affected customers; Thomas is one of those who volunteered when I asked for help a few pages ago. I think it makes sense to wait for feedback from all of them before expanding this effort.
Hi all.
This is the third day of testing the new ReFS.sys file, and our backup window has gone down by about 10 hours.
Right now we are not able to make the disk go "offline" in Explorer, and the disk counters do not stop working. It seems that this has had a big effect on our 4 repos with the patch, and Veeam jobs now process as expected.

We have a total of 600TB running this patch.
RAM is steady at 60GB, and performance does not seem to be affected by this patch.

And no lockups... before, we needed to reboot our repo almost every day to keep ReFS online.

Will keep you updated.
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Is it known when this hotfix will be distributed by MS?
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS 4k horror story

Post by JimmyO »

I'm also testing the fix from MS, but see only minor improvements. Testing goes on...

@thomas.raabo: are you running the latest MS update for 2016? What about "RefsEnableLargeWorkingSetTrim", are you using it?

Post by thomas.raabo »

No, this is a special hotfix that is not public.

Day 4 still no problems!

Post by Gostev »

JimmyO, you have the same special fix.

Also, I don't know if Thomas has RefsEnableLargeWorkingSetTrim enabled, but it would not matter much for him because I know he has infinite RAM on backup repository server ;) well, 256GB that is.

Post by thomas.raabo »

Correct.

My main issue is metadata changes resulting in the disk going offline. This does seem to happen on all metadata changes.

RefsEnableLargeWorkingSetTrim did help keep my system from crashing when deleting synthetic fulls, with max RAM usage around 80GB.