-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Hello Oliver,
attached FC Array with HW RAID.
Scaling to > 100 3.5" disks with integrated storage is kind of difficult...
Markus
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
Same!
We have one legacy setup with iSCSI, which craps out even faster on ReFS (the iSCSI timeouts seem to make it worse).
-
- Lurker
- Posts: 2
- Liked: never
- Joined: Sep 19, 2013 12:53 pm
- Full Name: Tobias Schwendemann
- Contact:
Re: REFS 4k horror story
Hello Everyone,
We had the same issue on Saturday. I installed a new backup server at a customer last week using Windows Server 2016 with ReFS. There is one Veeam repository on a ReFS volume formatted with 4k clusters. We have about 12TB of .vbk files on a 100TB volume. Even 128GB of RAM was not enough to prevent the server from crashing. This is what I see in the Windows event log:
---
Log Name: System
Source: Microsoft-Windows-WER-SystemErrorReporting
Date: 2/11/2017 6:01:34 PM
Event ID: 1001
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer:
Description:
The computer has rebooted from a bugcheck. The bugcheck was: 0x00000133 (0x0000000000000001, 0x0000000000001e00, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 7246bf68-4886-4772-a8ea-168290eb66e7.
---
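For anyone who wants to collect these events from several servers, here is a minimal sketch of pulling the bugcheck code and its four parameters out of the description text. The function name parse_bugcheck and the regular expression are my own illustration, not part of any Microsoft or Veeam tooling, and the pattern assumes the description is worded exactly like the event above.

```python
import re

# Description text copied from the Event ID 1001 entry above.
EVENT_TEXT = (
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x00000133 "
    "(0x0000000000000001, 0x0000000000001e00, 0x0000000000000000, "
    "0x0000000000000000). A dump was saved in: C:\\Windows\\MEMORY.DMP."
)

# Hypothetical pattern: assumes the wording shown in the event above.
BUGCHECK_RE = re.compile(r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)")


def parse_bugcheck(text):
    """Return (code, [param1, ..., param4]) as integers, or None if absent."""
    match = BUGCHECK_RE.search(text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck(EVENT_TEXT)
print(hex(code), [hex(p) for p in params])
```

Code 0x133 is the DPC_WATCHDOG_VIOLATION bugcheck. The parameters are only extracted here, not interpreted; the dump itself still has to be opened in a debugger to see which driver was involved.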
Regards
Tobias
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
All, it is extremely important that everyone opens support cases with Microsoft on these 4K cluster issues, so that they are aware and prioritize fixing this issue based on the number of bug reports. I am sure they have a lot of bugs to work through given how young Windows Server 2016 is, and at least in Veeam, the number of support cases is the primary metric when prioritizing hot fixes.
-
- Enthusiast
- Posts: 63
- Liked: 9 times
- Joined: Nov 29, 2016 10:09 pm
- Contact:
Re: REFS 4k horror story
For the last few weeks we have been having the same issues, affecting only our Veeam 2016 ReFS (4k cluster) servers. Until now we have tried to solve them with Veeam support.
Will open MS support ticket now.
-
- Service Provider
- Posts: 9
- Liked: 2 times
- Joined: Apr 23, 2015 4:10 pm
- Full Name: Rodd Ahrenstorff
- Contact:
Re: REFS 4k horror story
I just wanted to add that we implemented a number of 12TB repository appliances with 2016 ReFS using the 4k cluster setting for SMB customers and have experienced no issues. This 4K problem seems to be limited to larger repositories.
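To put rough numbers on "larger", here is a small back-of-the-envelope sketch of how many clusters a volume has at each cluster size. It treats TB as TiB (2^40 bytes) and ignores any space ReFS reserves for itself; both are simplifying assumptions. Whether the cluster count is what actually drives the memory use is also an assumption on my part, nothing in this thread confirms it.

```python
TIB = 1024 ** 4  # assumption: "TB" in this thread is treated as TiB
KIB = 1024


def cluster_count(volume_bytes, cluster_bytes):
    """Number of whole clusters that fit on a volume of the given size."""
    return volume_bytes // cluster_bytes


# Volume sizes mentioned in this thread: 12TB appliances and a 100TB volume.
for size_tib in (12, 100):
    for cluster_kib in (4, 64):
        count = cluster_count(size_tib * TIB, cluster_kib * KIB)
        print(f"{size_tib:>3} TiB volume, {cluster_kib:>2}K clusters: {count:,} clusters")
```

A 100 TiB volume has about 26.8 billion clusters at 4K and about 1.7 billion at 64K, so 64K always means 16 times fewer clusters to track. A 12 TiB volume at 4K has about 3.2 billion.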
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
I concur, this has been our experience as well.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@tschwendemann Your problem sounds like the problem I had: "DPC WATCHDOG 0x00000133". Pop that sucker open in WinDbg, I'm willing to bet it's networking related!
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: REFS 4k horror story
Smells like Broadcom...
So why do you guys keep trying 4k repositories instead of 64k on bigger arrays? Seems like the most unstable thing to run in production for a long time currently...
-
- Service Provider
- Posts: 248
- Liked: 28 times
- Joined: Dec 14, 2015 8:20 pm
- Full Name: Mehmet Istanbullu
- Location: Türkiye
- Contact:
Re: REFS 4k horror story
I experienced this issue with a 4k 27TB ReFS repository 1 month ago. Since then I'm formatting all ReFS datastores with 64k, small or big.
VMCA v12
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: REFS 4k horror story
Hey Mehmet,
Are the ones you already changed to 64k experiencing issues? (I assume not, but thought I'd ask anyway.)
-
- Service Provider
- Posts: 248
- Liked: 28 times
- Joined: Dec 14, 2015 8:20 pm
- Full Name: Mehmet Istanbullu
- Location: Türkiye
- Contact:
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Well, we just had another pseudo-crash (veeam backup server becomes unresponsive and remains that way in perpetuity until hard repowered). 32TB, 32GB ram, still on 4k.
Has anyone gotten anything new from MS on this issue that they could share with the rest of us? alesovodvojce, you said you were going to open a MS ticket. Have you learned anything from them yet?
rendest said that switching to 64kb did not fix their issue. Others have said it does fix their issue. Does 64kb actually fix things? At this point in the thread it seems somewhat inconclusive. I'm inclined to format, but I don't want to mess around with production servers unless I'm certain it will fix things.
@ Veeam Employees - what settings are available to limit the amount of load Veeam puts on the underlying storage? rendest mentioned throttling - is that only available in certain license types? Is there any way to perhaps disable multithreaded IO? What is the default thread count for Veeam? Is there no way a hotfix could be released for Veeam that monitors disk IO latency and backs off before it cripples the server? I'd appreciate any guidance anyone could offer us for how we could limit, as much as possible, the load Veeam puts on the storage subsystem to try to avoid crashing our servers until development (Microsoft or Veeam) can come up with some fix for the issue.
Also, under EventViewer->Microsoft->Windows->ReFS we regularly get these warnings:
An IO took more than 30000 ms to complete:
Process Id: 9932
Process name: VeeamAgent.exe
File name: 0000000000000C57 0000000000000405
File offset: 2181038080
IO Type: Write: Paging, NonCached, Sync
IO Size: 1048576 bytes
0 cluster(s) starting at cluster 0
Latency: 72954 ms
Volume Id: {7085173e-b757-4884-b34a-d23aa46d4941}
Volume name: D:
Is everyone else getting this? These only show up in the extended ("crimson") event channel for ReFS btw - browse to the path I mentioned to check on your repos. Also, we have identically-configured servers hosting VHDXs which are *not* getting those messages. We're getting those messages only on the primary Veeam repo and the offsite one.
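If you want to compare these warnings across repos, here is a minimal sketch that splits the message text into fields and converts the latency and IO size. The function name parse_refs_warning is hypothetical, and the parsing simply assumes the "Key: value" layout shown above; it works on message text you have already exported and does not read the event channel itself.

```python
# Message text copied from the ReFS warning above.
WARNING_TEXT = """\
An IO took more than 30000 ms to complete:
Process Id: 9932
Process name: VeeamAgent.exe
File name: 0000000000000C57 0000000000000405
File offset: 2181038080
IO Type: Write: Paging, NonCached, Sync
IO Size: 1048576 bytes
0 cluster(s) starting at cluster 0
Latency: 72954 ms
Volume Id: {7085173e-b757-4884-b34a-d23aa46d4941}
Volume name: D:
"""


def parse_refs_warning(text):
    """Split 'Key: value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


fields = parse_refs_warning(WARNING_TEXT)
latency_ms = int(fields["Latency"].split()[0])
io_size_bytes = int(fields["IO Size"].split()[0])

print(fields["Process name"], fields["Volume name"])
print(f"{io_size_bytes / 1024 ** 2:.0f} MiB write took {latency_ms / 1000:.1f} s")
```

For the warning above that is a single 1 MiB write taking about 73 seconds, more than twice the 30000 ms threshold that triggers the warning.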
-
- Influencer
- Posts: 15
- Liked: 4 times
- Joined: Jan 06, 2016 10:26 am
- Full Name: John P. Forsythe
- Contact:
Re: REFS 4k horror story
Hi,
I had to open up a case as well #02083290.
I have two repositories, one attached via iSCSI, the other one local SAS.
Both are less than 20TB and 64k. Since this weekend the backup server has crashed each time the backup starts; before that it was running great for about 2 weeks.
-
- Service Provider
- Posts: 9
- Liked: 2 times
- Joined: Apr 23, 2015 4:10 pm
- Full Name: Rodd Ahrenstorff
- Contact:
Re: REFS 4k horror story
Delo123 wrote: So why do you guys keep trying 4k repositories instead of 64k on bigger arrays? Seems like the most unstable thing to run in production for a long time currently...
Just to confirm: the original appliances were configured with 4K and have experienced no issues. However, we are implementing 64K in our build process going forward.
-
- Enthusiast
- Posts: 63
- Liked: 9 times
- Joined: Nov 29, 2016 10:09 pm
- Contact:
Re: REFS 4k horror story
@graham8 news from MS, short answer: not tried. Longer answer: we thought support was covered by our SA, but it is not. Opening a ticket will cost us $500, and given that MS already postponed February's patches due to this ReFS issue, we are very likely to get the answer "we are working on it, wait", which is not enough for the expense.
We are eager to read some more answers to your questions here, as our VBR repos (ReFS 4k) are failing regularly.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Microsoft closed our ticket, in which we found the memory issue with the KB, because they do not think an update will arrive until the March patch day...
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: REFS 4k horror story
@john,
Keep us informed about your support case since it is an issue with 64K. You didn't install KB32 something right?
@all: Let's hope that MSFT indeed has a fix in March because this is really not good. And @Alesovodvojce: I'm pretty surprised there are no support tickets when you have an SA agreement
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@j.forsythe
Interesting that you're having crashes with 64k. My server too is crashing daily after doing backups. Some days it works fine, other days it might crash twice a day. What sort of bugcheck are you seeing, if any? My repository is 48TB 64k ReFS connected via a SAS HP P841 controller.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Thanks for the updates everyone.
@alesovodvojce - it's been a while since I've felt desperate enough to call Microsoft for something, but I seem to recall that they don't charge if the issue is a legitimate Microsoft bug. At least, they've waived it for me in the past. I understand not wanting to take that risk though, of course.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
The way I see it, $500 is a drop in the bucket compared to the time I'd have to spend fixing stuff because it keeps crashing. Yes, they will refund you if it's a bug.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
I am confused about the kind of crashes you are seeing... In our case we never saw a bluescreen, the system only "hangs". But from what I understand now, some of you see bluescreens?
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Right, likewise - only "hangs" here (due to extreme memory exhaustion). I've never had an actual crash. It makes me wonder if it's the same problem if someone is getting a bluescreen.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
Sometimes it bluescreens, sometimes it just freezes and I have to reboot the system. When it does bluescreen I get a stop error, bugcheck 0x133. When it freezes, the iLO remote window is just BLACK.
-
- Enthusiast
- Posts: 63
- Liked: 9 times
- Joined: Nov 29, 2016 10:09 pm
- Contact:
Re: REFS 4k horror story
Symptoms of crash
- CPU allowance of the guest VM is at 100% (seen from the host)
- high memory demand (always higher than what can be satisfied), no disk activity or queue
- sometimes it is possible to move the mouse or switch windows, but not to launch a new app
These are common to dozens of "crashes" we have observed at our facility during the last two months. Note that our Veeam server (together with its storage) is on a virtual, not physical, server.
What next
If you see these symptoms and attribute them to ReFS, I would suggest force-stopping the machine immediately.
We tried waiting, even a few days, to see if the situation would recover, but it won't. Instead, the prolonged guest VM trouble made its time service behave in a crazy way, sometimes shifting the clock by hours and even by a few months into the future. Neither the guest nor the host was able to repair this. The wrong dates caused a lot of other strange things, e.g. bad traces saved to the SQL database, so our Veeam backup scheduler has not behaved correctly since (our case #02063994).
After a forced restart the machine sometimes gets into trouble again immediately. We have ended the trouble by starting the machine and
a) gracefully shutting down the machine just after Windows booted, if it allowed us to do so (if not, then a hard shutdown and try again)
b) waiting a few hours after start for it to recover (worked only sometimes for us)
Hope this info saves you some time.
A little OT: thanks for the advice on the MSFT ticket. We are an NGO, so maybe our SA lacks the support benefit because of that.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
alesovodvojce wrote: Symptoms of crash
- sometimes it is possible to move the mouse or switch the window, but not to launch new app
Exactly what happens to me. iLO won't respond (black window), but if I happen to have an RDP session going I can move boxes around and switch windows; the start button doesn't work. I can launch WINKEY + R and type 'shutdown /r /t 0' and nothing happens. I have to hard reset the machine.
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: REFS 4k horror story
For all the guys out there running 64k ReFS repositories and seeing these freeze issues: have you created a support call with us? I am sure we want to investigate those! Even if we find something which is not related to us but to MSFT, we want to know so we can notify them and give them our analysis.
Please open the support call, post the IDs here and keep us informed on the forums.
Thanks
Mike
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
No ticket open, but I'm getting freezes with 64k ReFS. I can open a ticket. I do have a Microsoft ticket. My machine is currently running with verifier w/special pools enabled.
-
- Influencer
- Posts: 20
- Liked: 4 times
- Joined: Jan 12, 2017 7:06 pm
- Contact:
Re: REFS 4k horror story
We are seeing the same issues described, with memory exhaustion leading to server lock-up. 32TB repository with 4k ReFS. Not a total freeze or blue-screen - RDP sometimes cuts out, sometimes stays connected but nearly unusable.
A forced reset causes a vicious cycle, as ReFS detects an unclean reboot and kicks off background integrity checks. Even with Veeam services disabled, memory usage climbs and will become exhausted again within 10-15 minutes.
Poolmon shows the culprit is refs.sys and refsv1.sys. Last weekend I reformatted the primary repository with 64k, and have not had the issue since then... but it sounds like 64k isn't a total remedy either?
Hopefully Microsoft will have a fix on patch day (March 14).
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: REFS 4k horror story
I don't believe that 64K is a remedy, more a mitigation. In load testing with a 100TB repository I was able to crash the 4K ReFS system pretty much nightly. With 64K it ran a month without issues, but did eventually have a hang around day 40 or so.