REFS issues (server lockups, high CPU, high RAM)

Re: REFS issues (server lockups, high CPU, high RAM)

by mweissen13 » Tue Jul 10, 2018 6:06 pm 3 people like this post

I just installed the July Cumulative Update (KB4338814) and I am keeping my fingers crossed.

ReFS.sys driver is now version 10.0.14393.2363
mweissen13
Service Provider
 
Posts: 15
Liked: 6 times
Joined: Thu Dec 28, 2017 3:22 pm
Full Name: Michael Weissenbacher

Re: REFS issues (server lockups, high CPU, high RAM)

by oscaru » Wed Jul 11, 2018 5:49 am

Hi, mweissen13!

Please give us some feedback on whether this new driver brings the fix @Gostev mentioned in early June.

Thanks!
oscaru
Service Provider
 
Posts: 15
Liked: 4 times
Joined: Tue Jul 26, 2016 6:49 pm
Full Name: Oscar Suarez

Re: REFS issues (server lockups, high CPU, high RAM)

by wingphil » Wed Jul 11, 2018 8:24 am

I have also installed 2363 and anecdotally it seems faster than 2312, but I will have to wait a few days till my ~10 hour compact happens to be sure.
wingphil
Novice
 
Posts: 4
Liked: never
Joined: Mon Jun 11, 2018 8:51 am

Re: REFS issues (server lockups, high CPU, high RAM)

by Humphro » Wed Jul 11, 2018 9:31 am

I too have this issue. I have applied KB4338814 to one of our remote repositories, which is the target for a backup copy job that, since the previous update, has been taking up to 4 days to run. I noticed that the update did change the version of refs.sys from 10.0.14393.2312 to 10.0.14393.2363.

I'm now running the backup copy job again and will post results as soon as available.
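
For anyone wanting to script the driver check described above, here is a minimal sketch. It assumes you have already read the refs.sys file version yourself (for example from the file's properties) and only compares it against the 10.0.14393.2363 build reported in this thread. The names `parse_version`, `PATCHED_REFS_VERSION` and `has_patched_refs` are made up for illustration, and the comparison is only meaningful on the Server 2016 (14393) build line.

```python
def parse_version(text):
    """Turn a dotted version string such as '10.0.14393.2363' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Build reported in this thread as the refs.sys version after KB4338814.
PATCHED_REFS_VERSION = parse_version("10.0.14393.2363")


def has_patched_refs(installed_version):
    """Return True if the given refs.sys version is at or above the patched build."""
    return parse_version(installed_version) >= PATCHED_REFS_VERSION


print(has_patched_refs("10.0.14393.2312"))  # driver before the update
print(has_patched_refs("10.0.14393.2363"))  # driver after the update
```

Comparing tuples of integers rather than raw strings avoids the usual trap where "999" sorts after "2363" as text.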
Humphro
Novice
 
Posts: 4
Liked: 1 time
Joined: Thu Mar 09, 2017 1:35 pm
Full Name: Matthew Humphreys

Re: REFS issues (server lockups, high CPU, high RAM)

by billcouper » Wed Jul 11, 2018 9:41 am

Looking forward to hearing the results of KB4338814.

Thankfully, either through our environment being tiny, or backing up from flash to flash, or whatever is different, this hasn't impacted us dramatically.
Still... I've decided I'm not going to install any more Windows updates without checking this thread first :)
billcouper
Enthusiast
 
Posts: 38
Liked: 9 times
Joined: Mon Dec 18, 2017 8:58 am
Full Name: Bill Couper

Re: REFS issues (server lockups, high CPU, high RAM)

by Mgamerz » Wed Jul 11, 2018 6:47 pm

Be aware of the listed known issue for the July 2018 cumulative update:

After installing this update on a DHCP Failover Server, Enterprise clients may receive an invalid configuration when requesting a new IP address. This may result in loss of connectivity as systems fail to renew their leases.

This would have been a nice surprise to deal with as our backup server is also a DHCP server in our HA setup. I guess I will put up with this for another few weeks...
Mgamerz
Enthusiast
 
Posts: 27
Liked: 5 times
Joined: Fri Sep 29, 2017 8:07 pm

Re: REFS issues (server lockups, high CPU, high RAM)

by nobrell » Wed Jul 11, 2018 7:32 pm 3 people like this post

I have updated our Server 2016 repository-server to OS Build 14393.2363 which contains an update to ReFS.sys 14393.2363.
Yesterday the compact job took 6h 41m and today, after the update, 1h 17m, so it finally seems to be fixed!
Very happy about this :)
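
To put the two compact times above in proportion, here is a rough calculation. It is based on this single before/after pair only, so treat it as an illustration rather than a benchmark; the variable names are just for the example.

```python
from datetime import timedelta

before = timedelta(hours=6, minutes=41)  # compact job the day before the update
after = timedelta(hours=1, minutes=17)   # compact job after the update

speedup = before / after        # dividing two timedeltas yields a plain float
reduction = 1 - after / before  # fraction of the original runtime saved

print(f"speedup: {speedup:.1f}x, runtime reduction: {reduction:.0%}")
# prints "speedup: 5.2x, runtime reduction: 81%"
```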
nobrell
Lurker
 
Posts: 2
Liked: 5 times
Joined: Thu Sep 17, 2015 6:22 pm
Full Name: Rikard Nobrell

Re: REFS issues (server lockups, high CPU, high RAM)

by HenrikS. » Thu Jul 12, 2018 7:02 am 3 people like this post

Thumbs up for the ReFS.sys 14393.2363! No ReFS issues here atm.
HenrikS.
Novice
 
Posts: 5
Liked: 4 times
Joined: Tue Jul 04, 2017 12:59 pm
Full Name: Henrik Schewe

Re: REFS issues (server lockups, high CPU, high RAM)

by doktornotor » Thu Jul 12, 2018 8:04 am

Mgamerz wrote: Be aware of the listed known issue for the July 2018 cumulative update:

Update already pulled from WSUS at least due to the DHCP SNAFU. Well done, MS. Excellent QA as always. And I must not forget to mention how much we all love these cumulative updates, don't we? Why not flip a coin between screwed filesystem driver and broken DHCP... Great patching scheme. And finally, the very fine detailed KB articles that say nothing about what's being touched, pure masterpiece. Where's DHCP in the KB4338814 list of "fixes"? Or, where's the ReFS mentioned in there?

:x :x :evil: :evil: :twisted: :twisted:
doktornotor
Novice
 
Posts: 9
Liked: 3 times
Joined: Wed Mar 07, 2018 12:57 pm

Re: REFS issues (server lockups, high CPU, high RAM)

by Gostev » Thu Jul 12, 2018 10:54 am

Well, to be fair, I don't know many customers who run DHCP Failover Server on their backup server :D
Other than that I agree, things do not seem to be improving as far as the documentation or quality of updates is concerned :(
Gostev
Veeam Software
 
Posts: 22400
Liked: 2675 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: REFS issues (server lockups, high CPU, high RAM)

by mkaec » Thu Jul 12, 2018 5:35 pm 1 person likes this post

doktornotor wrote: ...And I must not forget to mention how much we all love these cumulative updates, don't we?...

I do have love for the cumulative updates...because I remember the system it replaced.

The original patching scheme in Windows NT was to release hotfixes for individual customers that encountered problems and then, after more extensive testing, release them to all customers as part of a service pack. With each successive version of Windows, fewer service packs were released: 6 for Windows NT, 4 for Windows 2000, 3 for Windows XP, 2 for Windows Vista, 1 for Windows 7. While Windows 7 was current, it was decided that there would be no more service packs. Microsoft kept releasing the hotfixes, but provided no mechanism to get them into the systems of the general population. So, our Windows 7 and Windows 2008 R2 systems were running with hundreds of bugs for which fixes existed but had to be manually installed. Microsoft staff responded to this by creating blog posts listing recommended hotfixes for certain usage scenarios. That left administrators to hunt down and monitor those blog posts, download the hotfixes, and manually install them on their servers. That was an awful situation. Windows 8 / Server 2012 introduced the cumulative update model, which allowed bug fixes to reach all users. That methodology eventually got applied back to Windows 7 / Server 2008 R2.

I agree that the poor QA is frustrating. But, I think we are in a better situation than before. If the QA can be fixed, then I think it will be a very good system.
mkaec
Expert
 
Posts: 244
Liked: 54 times
Joined: Thu Jul 16, 2015 1:31 pm
Full Name: Marc K

Re: REFS issues (server lockups, high CPU, high RAM)

by DonZoomik » Thu Jul 12, 2018 6:43 pm 1 person likes this post

Ugh… don't remind me of those times, literally hunting for fixes because there was no list of fixes for more obscure problems and the official ones were always out of date. And then deploying them (standalone MSUs have no WSUS/SCCM/… integration) was even worse! The Win8 model was quite good, as each month's optional patches were separate, so you could skip one (there were a few breaking ones) if you needed to. And there were release notes, and they were quite good! As in old-times good (detailed symptoms, cause, solution)! Unfortunately they dropped this in late 2014 without any announcements (that I noticed…) and here we are now.

I'm not saying that the current model is good (Win8 was better), but if you didn't hunt down hotfixes, your environment was not big enough or you just didn't know better. For example, I believe I had maybe 300 limited release hotfixes (I believe in proactive patching) in my image building setup at one point (before the 2016 convenience rollup) for Windows 7 alone. Some community efforts made things better but man, those were dark times...
DonZoomik
Enthusiast
 
Posts: 43
Liked: 13 times
Joined: Fri Nov 25, 2016 1:56 pm

Re: REFS issues (server lockups, high CPU, high RAM)

by Raleigh » Thu Jul 12, 2018 7:02 pm

tsightler wrote: Of course, I can't guarantee that the problem will be resolved, but I can say that I have worked with literally dozens, perhaps 100's, of customers using ReFS, most at scales of 100's of TBs per server, and RAM was always an important factor in resolving lockups. I would never recommend less than 64GB of RAM for your setup and use case, so I'm very hopeful that will improve your situation. Please keep us posted and thanks for participating in the Veeam community!

Reporting back on the results of the RAM upgrade. I bumped the RAM in our repo server from 16GB up to 96GB. I then performed the same test that would consistently render the server unresponsive: I deleted (via Windows Explorer) a ~5TB .vbk file. Result: no server lockup. I then performed the “real world” test by changing the retention policy of our large file server backup job so that Veeam would attempt to delete old backup sets. Result: no server lockup.

Tom, it appears that increasing server memory (per your suggestion) resolved the problem that has been a thorn in my side for three months now. Thank you!

However, I am left wondering why it took so long to get here. I have an open ticket with Microsoft Pro Support. I have an open ticket with Veeam Support (I have worked this issue with Tier 1 and Tier 2 support people at Veeam). At no point did anyone at MS support or Veeam support suggest that increasing RAM would resolve my server lockup issue. They have had me collect and upload diagnostic files and event logs, tweak registry entries, adjust page file settings, and verify driver versions. Never a suggestion to increase memory.

I have a few suggestions, if I may:

1. Share this information with Veeam Support personnel. I would have tried increasing server memory months ago had someone in Veeam Support suggested it.

2. Create and maintain a sticky post for this forum thread. There are 70+ pages here, far too much for a busy network admin responsible for dozens of systems to parse through. If there are things that are known about these issues, and things that can help customers resolve problems, it would be of great benefit to Veeam customers to summarize them in a sticky post on the first page of this thread. Had I seen a suggestion in a sticky post that increasing repo server memory can resolve lockup issues, I would have tried that. It would have saved me time as well as Veeam Support personnel time.

3. Veeam should be working directly with Microsoft on these ReFS issues. IMO, Veeam should have the lab environment facilities to replicate these issues and subsequently share data with Microsoft toward understanding and resolving these problems. Asking customers to open a $500 support case with Microsoft Pro Support is bad form, IMO. I realize that Microsoft ultimately needs to provide the fix, but throwing the $500 at Microsoft has not got us a thing. They have been analyzing our server memory.dmp file for almost two months now. No suggested fixes came out of that. No mention of server memory having a relationship to system dependability.

I’m VERY happy that our problem is now resolved, I just wish it hadn’t taken three months to get here (yes, I think I first opened the ticket with Veeam Support on this issue around April 11).

Thanks,
Raleigh
Raleigh
Novice
 
Posts: 7
Liked: never
Joined: Tue Jun 26, 2018 11:33 pm
Full Name: Raleigh

Re: REFS issues (server lockups, high CPU, high RAM)

by Gostev » Fri Jul 13, 2018 12:38 am

Raleigh, the only reason ReFS is finally becoming usable is because Veeam has been working directly with Microsoft on these ReFS issues for the past 2 years - and in fact, directly with actual ReFS developers even. I realize that having joined this community just 2 weeks ago you can't know this, but you are making a lot of wrong assumptions for some reason, including Veeam not having the lab environment and such.

And regarding $500 support cases, your view is somewhat naive, because support cases are actually the only way to raise the priority of a particular issue over thousands of other open issues. No one at Microsoft would still be working on this issue if it only had one support case open from some vendor. Moreover, for the first half a year since this all started, they were busy estimating "revenue impact" from this bug based on affected customer sizes and estimated cost of downtime due to locked up backup repositories (I was asked to assist in calculations). It seems it was not until a certain threshold of total revenue impact was reached among all customers who opened support cases that the first developer was finally assigned to research this issue, many months later.

Last but not least, these $500 should have been refunded in every single case, because the issue in question is now a confirmed bug. So this should not have cost anyone a penny regardless.

I do agree with your other two points. The only reason this has not been done yet is that this is actually quite new information/confirmation from the field. If you read my recent posts above, this was merely a theory until now, based on the fact that Microsoft was seemingly working on improving ReFS memory management.
Gostev
Veeam Software
 
Posts: 22400
Liked: 2675 times
Joined: Sun Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: REFS issues (server lockups, high CPU, high RAM)

by bfrizie » Fri Jul 13, 2018 7:40 pm 1 person likes this post

I can confirm that the latest update resolved our issue as well.
bfrizie
Lurker
 
Posts: 1
Liked: 1 time
Joined: Thu Jul 05, 2018 1:07 pm
Full Name: Brandon Frizie
