REFS issues (server lockups, high CPU, high RAM)

tsightler · Jul 02, 2018 4:27 pm

Raleigh wrote:FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume

That's a pretty small amount of RAM for a Veeam repo. The best practice recommendation is 4GB/core, so even if you only have a 4-core processor, 16GB would basically be the minimum, and those figures are based largely on NTFS, which has a much lighter in-kernel memory load. For ReFS, especially in these smaller configurations, the standing recommendation is at least 1GB of RAM per TB of space. Once you get to the hundreds of TBs, you can usually begin to pare this back a little (for example, it's not uncommon to have a 400TB repo with 256GB of RAM), but on the smaller end of the scale the 1GB/1TB ratio has proven to be quite stable.
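For anyone who wants to plug their own numbers into the two rules of thumb above, here is a minimal sketch. The function name is made up for illustration, and taking the larger of the two figures is an assumption about how the rules combine, not something the best practice guide states. It also does not model the tapering mentioned for repos in the hundreds of TBs.

```python
def recommended_ram_gb(cores: int, refs_capacity_tb: float = 0.0) -> float:
    """Return a rough RAM target in GB for a repository server.

    Assumption: the two rules of thumb are combined by taking the larger one.
    """
    per_core_gb = 4 * cores             # 4GB per core (general repository guidance)
    per_tb_gb = 1.0 * refs_capacity_tb  # 1GB per TB of ReFS space
    return max(per_core_gb, per_tb_gb)


if __name__ == "__main__":
    # A 4-core server with a 29TB ReFS volume: the 1GB/TB rule is the larger one.
    print(recommended_ram_gb(4, 29))
```

Either way, 16GB comes out below the target for a 29TB ReFS volume.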

Mgamerz · Jul 02, 2018 4:44 pm

Looks like you have to restore refsv1.sys as well, because my synthetic full performance is still abysmal after rolling back only refs.sys (not refsv1.sys).

Edit: Or Windows just decided to reinstall the refs.sys file... I did not install any Windows update recently.

Gostev · Jul 02, 2018 5:04 pm

refs.sys is the only file we had to replace in our own testing.

AlexL · Jul 02, 2018 5:43 pm

Most, if not all, of the discussion seems to be about Backup Jobs. Does no one do Backup Copy jobs, or just not with ReFS?
Anyway, is there a GB/TB recommendation for repos with only Backup Copy jobs?
We've had a server with 20 cores, 64GB of RAM and a 36TB ReFS repo for a year; last month we added a 400TB repo to this same server.
Free memory dropped from an average of 60% to 40%. I usually have 20 to 30GB free of the 64GB, and CPU is seldom above 5 to 10%. Still, I get frequent freezes when the 400TB store is hit with writes, even though the ReFS drivers have already been rolled back.
I do not seem to have a memory issue, or do I? And if so, where do I look to observe that?

Mgamerz · Jul 02, 2018 5:43 pm

Yeah, I replaced that, but apparently Windows replaced it somehow. Maybe a Windows update snuck in and I didn't see it. I've been trying to make a synthetic full for about 3 weeks now, and we're going to run out of storage space soon...

Our backup copy jobs go to ReFS with no problems there, but the server we send them to locked up last Friday, just like we have seen on the main server.

thaapavuori · Jul 02, 2018 6:16 pm

Hi,

I think the KB number you gave above is wrong. I believe the correct one is KB4077525.

nhwanderer · Jul 04, 2018 12:15 am

I was experiencing very long compaction times. Following gm2783 instructions from pp 68 I extracted refs.sys and refsv1.sys from windows10.0-kb4093120-x64_72c7d6ce20eb42c0df760cd13a917bbc1e57c0b7.msu . On Windows 2016 I had to shift-restart to get to a command prompt to replace them, as I was unable to replace them when the system was live, as the machine complained about permissions. Starting another backup run now, we'll see what happens!

nhwanderer · Jul 04, 2018 1:56 am

nhwanderer wrote:I was experiencing very long compaction times. Following gm2783 instructions from pp 68 I extracted refs.sys and refsv1.sys from windows10.0-kb4093120-x64_72c7d6ce20eb42c0df760cd13a917bbc1e57c0b7.msu . On Windows 2016 I had to shift-restart to get to a command prompt to replace them, as I was unable to replace them when the system was live, as the machine complained about permissions. Starting another backup run now, we'll see what happens!

June update driver: Killed the compaction at 72% after 36 hours
Rolled back driver: 40 minutes to compact

yay
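To put a rough number on that difference, here is a small worked calculation. It assumes compaction progress is linear, which it may well not be, so treat the projected total as a ballpark only. The function names are invented for this sketch.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Extrapolate total run time, assuming progress is linear (an assumption)."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done


def speedup(slow_hours: float, fast_hours: float) -> float:
    """Ratio of the slow run time to the fast run time."""
    return slow_hours / fast_hours


if __name__ == "__main__":
    june_total = projected_total_hours(36, 72)  # killed at 72% after 36 hours
    rolled_back = 40 / 60                       # 40 minutes, expressed in hours
    print(f"Projected June-driver run: {june_total:.0f} h")
    print(f"Approximate speedup after rollback: {speedup(june_total, rolled_back):.0f}x")
```

Under that assumption the June driver would have needed about 50 hours, so the rolled-back driver is on the order of 75 times faster for this one compaction.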

Raleigh · Jul 05, 2018 6:11 pm

tsightler wrote:
Raleigh wrote:
FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume

That's a pretty small amount of RAM for a Veeam repo. The best practice recommendation is 4GB/core, so even if you only have a 4-core processor, 16GB would basically be the minimum, and those figures are based largely on NTFS, which has a much lighter in-kernel memory load. For ReFS, especially in these smaller configurations, the standing recommendation is at least 1GB of RAM per TB of space. Once you get to the hundreds of TBs, you can usually begin to pare this back a little (for example, it's not uncommon to have a 400TB repo with 256GB of RAM), but on the smaller end of the scale the 1GB/1TB ratio has proven to be quite stable.

Are you saying that increasing the amount of RAM in our ReFS repository server can improve its reliability (preventing the server crashes we're experiencing when large .vbk files are deleted)? I've had several open tickets with Veeam Support on our issue, and never did they bring up the amount of RAM in the server. Also, I worked with a Veeam sales team (sales guy and his technical sidekick), and they vetted the server configuration before I even placed the order with Dell. If it is a known fact that increasing the RAM could resolve this type of problem, I'm willing to give it a try. I wish I would have known about this sooner.

Also, I realized I left out the CPU info for our server. It has dual Xeon Silver 4108 CPUs, with 8 cores each, for a total of 16 cores. Based on your recommendation, 64GB of RAM is a best practice for our Veeam repository server with 16 CPU cores. Correct?

Can anyone confirm that adding RAM to their repository server resolved (or greatly minimized) this "ReFS-related server crash" issue? I'm willing to throw money at this problem, but I'd like to know it's not wasted money.

Thanks for the insight.

--Raleigh

Raleigh · Jul 05, 2018 6:22 pm

AlexL wrote:I have a feeling it is more the .vib size that is causing the trouble than the .vbk size, could that also be the case in your situation Raleigh?

Alex,
For us, it's definitely the .vbk files. I can reproduce the problem by trying to delete these files manually from Windows Explorer. I don't seem to be experiencing the crashing problem when deleting the (typically much smaller) .vib files from Windows Explorer. Also, it's not all .vbk files that cause the problem. I seem to be able to delete 2TB (and smaller) .vbk files without any problem. Veeam jobs have no problem deleting these either. It's only the backup jobs with larger .vbk files (4 TB and greater) that seem to give our repository server problems.

--Raleigh
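For others who want to see which of their backup files fall into the size range described above before retention deletes them, here is a minimal sketch that lists large .vbk files under a repository folder. The 4TB default threshold comes from the observation in this post, not from any official guidance. The function name and the example path are hypothetical.

```python
from pathlib import Path

TB = 1024 ** 4  # bytes per TB (binary); an assumption about how sizes are measured


def find_large_backups(root, threshold_bytes=4 * TB, suffix=".vbk"):
    """Return (path, size) pairs for files at or above the threshold, largest first."""
    hits = []
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "E:/Backups" is a hypothetical repository path; replace it with your own.
    for path, size in find_large_backups("E:/Backups"):
        print(f"{size / TB:6.2f} TB  {path}")
```

This only reads file sizes. It does not delete anything and does not look at how many blocks a file shares with others.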

tsightler · Jul 05, 2018 7:54 pm

Raleigh wrote:Are you saying that increasing the amount of RAM in our ReFS repository server can improve its reliability (preventing the server crashes we're experiencing when large .vbk files are deleted)?

Yes, that is exactly what I'm saying. ReFS definitely uses more memory than NTFS, especially kernel memory, and deletes of large files with lots of referenced blocks are one of the big hitters for spikes in memory usage. When deletes of large files with many reference blocks occur on ReFS, you may not see a ton of memory usage from an application perspective, but kernel memory will definitely increase, and, if it gets tight, can lead to deadlocks. I believe this is still a bug in the Windows 2016 memory management code, but having lots more free RAM helps to mitigate it (it does not completely eliminate it). Based on that, I definitely don't recommend running ReFS with anything less than the best practice memory configuration. If Microsoft ever corrects this issue (it's been hinted to me that the fix is in the RS4 builds), then perhaps this concern will go away. Exactly how much memory you will need is difficult to say, but certainly more than the absolute minimum, which is what I would consider you to have.

Raleigh wrote:I've had several open tickets with Veeam Support on our issue, and never did they bring up the amount of RAM in the server. Also, I worked with a Veeam sales team (sales guy and his technical sidekick), and they vetted the server configuration before I even placed the order with Dell. If it is a known fact that increasing the RAM could resolve this type of problem, I'm willing to give it a try. I wish I would have known about this sooner.

Also, I realized I left out the CPU info for our server. It has dual Xeon Silver 4108 CPUs, with 8 cores each, for a total of 16 cores. Based on your recommendation, 64GB of RAM is a best practice for our Veeam repository server with 16 CPU cores. Correct?

I'm quite disappointed that the SE didn't provide some additional guidelines based on our best practice, however, we do have a lot of SEs these days, so they could have been new themselves. You can read the sizing recommendations for repos for yourself here:
https://bp.veeam.expert/architecture-ov ... ing/sizing

Note that the best practice guide is maintained by the Solutions Architecture team here at Veeam and thus it reflects not the minimums, but the recommendations that we've collected based on significant field experience with customers small and large. I am part of that team (specifically the Principal Solutions Architect for NA). Our goal in maintaining the best practice guide is to document guidelines that will provide the best performance and reliability across a wide range of circumstances using proven practices from the field.

Raleigh wrote:Can anyone confirm that adding RAM to their repository server resolved (or greatly minimized) this "ReFS-related server crash" issue? I'm willing to throw money at this problem, but I'd like to know it's not wasted money.

There's a post on the last page that is a reply to your message that specifically says exactly this, perhaps you didn't see it?
https://forums.veeam.com/veeam-backup-r ... ml#p285642

Raleigh · Jul 05, 2018 8:34 pm

Tom, thanks for the detailed response. I just received a quote from my Dell rep to upgrade the server to 96GB of RAM (that was a logical configuration based on the memory it already has plus Dell's guidelines on memory upgrades). I'll install this memory upgrade and report back if it has improved the repository server reliability. I'm sure hoping this resolves the problem. 96GB more than meets the best practice recommendations...agreed?

Actually, I somehow had missed that post. Thanks. So yes, I'm going to give it a try. The cost of the memory upgrade is well worth it if it resolves the server lockup issues we've been having.

--Raleigh

tsightler · Jul 05, 2018 10:20 pm

Raleigh wrote:Tom, thanks for the detailed response. I just received a quote from my Dell rep to upgrade the server to 96GB of RAM (that was a logical configuration based on the memory it already has plus Dell's guidelines on memory upgrades). I'll install this memory upgrade and report back if it has improved the repository server reliability. I'm sure hoping this resolves the problem. 96GB more than meets the best practice recommendations...agreed?

Actually, I somehow had missed that post. Thanks. So yes, I'm going to give it a try. The cost of the memory upgrade is well worth it if it resolves the server lockup issues we've been having.

Of course, I can't guarantee that the problem will be resolved, but I can say that I have worked with literally dozens, perhaps 100's, of customers using ReFS, most at scales of 100's of TBs per server, and RAM was always an important factor in resolving lockups. I would never recommend less than 64GB of RAM for your setup and use case, so I'm very hopeful that will improve your situation. Please keep us posted and thanks for participating in the Veeam community!

vsssper · Jul 06, 2018 7:48 am

It is a nightmare:

Looks like it won't finish before the proper fix from MS is released.

antipolis · Jul 06, 2018 9:38 am

At this point you should really cancel the job and roll back the driver.

mweissen13 · Jul 10, 2018 6:06 pm

I just installed the July Cumulative Update (KB4338814) and I am keeping my fingers crossed.

ReFS.sys driver is now version 10.0.14393.2363

oscaru · Jul 11, 2018 5:49 am

Hi, mweissen13!

Please give us some feedback on whether this new driver brings the fix @Gostev mentioned in early June.

Thanks!

wingphil · Jul 11, 2018 8:24 am

I have also installed 2363 and anecdotally it seems faster than 2312, but I will have to wait a few days till my ~10 hour compact happens to be sure.

Humphro · Jul 11, 2018 9:31 am

I too have this issue. I have applied KB4338814 to one of our remote repositories which is the target for a backup copy job, which since the previous update started to run for up to 4 days. I noticed that the update did change the version of refs.sys from 10.0.14393.2312 to 10.0.14393.2363.

I'm now running the backup copy job again and will post results as soon as available.
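If you have several repositories to check, the version comparison above can be scripted. This is a minimal sketch that only compares version strings. How you read the version off refs.sys is left out, and the constant and function names are made up for illustration. It is only meaningful for the 14393 (Windows Server 2016) branch discussed in this thread.

```python
FIXED_REFS_BUILD = (10, 0, 14393, 2363)  # build reported in this thread after KB4338814


def parse_version(text: str) -> tuple:
    """Turn '10.0.14393.2363' into (10, 0, 14393, 2363)."""
    return tuple(int(part) for part in text.strip().split("."))


def has_july_driver(version_text: str) -> bool:
    """True if the version is at or above the build shipped with KB4338814."""
    return parse_version(version_text) >= FIXED_REFS_BUILD


if __name__ == "__main__":
    for version in ("10.0.14393.2312", "10.0.14393.2363"):
        print(version, has_july_driver(version))
```

Comparing tuples of integers avoids the trap of comparing the strings directly, where "10.0.14393.999" would sort after "10.0.14393.2363".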

billcouper · Jul 11, 2018 9:41 am

Looking forward to hearing the results of KB4338814.

Thankfully, either through our environment being tiny, or backing up from flash to flash, or whatever is different, this hasn't impacted us dramatically.
Still... I've decided I'm not going to install any more windows updates without checking this thread first

Mgamerz · Jul 11, 2018 6:47 pm

Be aware of the listed known issue for the July 2018 cumulative update:

After installing this update on a DHCP Failover Server, Enterprise clients may receive an invalid configuration when requesting a new IP address. This may result in loss of connectivity as systems fail to renew their leases.

This would have been a nice surprise to deal with as our backup server is also a DHCP server in our HA setup. I guess I will put up with this for another few weeks...

nobrell · Jul 11, 2018 7:32 pm

I have updated our Server 2016 repository-server to OS Build 14393.2363 which contains an update to ReFS.sys 14393.2363.
Yesterday the compact job took 6h 41m and today after the update 1h 17m, so it seems to be fixed now finally!
Very happy about this

HenrikS. · Jul 12, 2018 7:02 am

Thumbs up for the ReFS.sys 14393.2363! No ReFS issues here atm.

doktornotor · Jul 12, 2018 8:04 am

Mgamerz wrote:Be aware of the listed known issue for the July 2018 cumulative update:

Update already pulled from WSUS at least due to the DHCP SNAFU. Well done, MS. Excellent QA as always. And I must not forget to mention how much we all love these cumulative updates, don't we? Why not flip a coin between screwed filesystem driver and broken DHCP... Great patching scheme. And finally, the very fine detailed KB articles that say nothing about what's being touched, pure masterpiece. Where's DHCP in the KB4338814 list of "fixes"? Or, where's the ReFS mentioned in there?

Gostev · Jul 12, 2018 10:54 am

Well, to be fair, I don't know many customers who run DHCP Failover Server on their backup server

Other than that I agree, things do not seem to be improving as far as updates documentation or their quality

mkaec · Jul 12, 2018 5:35 pm

doktornotor wrote:...And I must not forget to mention how much we all love these cumulative updates, don't we?...

I do have love for the cumulative updates...because I remember the system it replaced.

The original patching scheme in Windows NT was to release hotfixes for individual customers that encountered problems. And then, after more extensive testing, release them as part of a service pack to all customers. With each successive version of Windows, fewer service packs were released. 6 for Windows NT, 4 for Windows 2000, 3 for Windows XP, 2 for Windows Vista, 1 for Windows 7. While Windows 7 was current, it was decided that there would be no more service packs. Microsoft kept releasing the hotfixes, but provided no mechanism to get them into the systems of the general population. So, our Windows 7 and Windows 2008 R2 systems were running with hundreds of bugs for which fixes existed but had to be installed manually. Microsoft staff responded to this by creating blog posts listing recommended hotfixes for certain usage scenarios. That left administrators to hunt down, and monitor, those blog posts, download the hotfixes, and manually install them on their servers. That was an awful situation. Windows 8 / Server 2012 introduced the cumulative update model, which allowed bug fixes to reach all users. That methodology eventually got applied back to Windows 7 / Server 2008 R2.

I agree that the poor QA is frustrating. But, I think we are in a better situation than before. If the QA can be fixed, then I think it will be a very good system.

Jul 12, 2018 6:43 pm

Ugh… don't remind me of these times, literally hunting for fixes because there was no list of fixes for more obscure problems and official ones were always out of date. And then deploying them (standalone MSUs have no WSUS/SCCM/… integration) - even worse! Win8 model was quite good as each month's optional patches were separate so you could skip one (there were a few breaking ones) if you needed to. And there were release notes and they were quite good! As in old-times good (detailed symptoms, cause, solution)! Unfortunately they dropped this in late 2014 without any announcements (that I noticed…) and here we are now.

I'm not saying that current model is good (Win8 was better) but if you didn't hunt down hotfixes, your environment was not big enough or you just didn't know better. For example, I believe I had maybe 300 limited release hotfixes (I believe in proactive patching) in my image building setup at one point (before 2016 convenience rollup) for Windows 7 alone. Some community efforts made things better but man, these were dark times...

Raleigh · Jul 12, 2018 7:02 pm

tsightler wrote:Of course, I can't guarantee that the problem will be resolved, but I can say that I have worked with literally dozens, perhaps 100's, of customers using ReFS, most at scales of 100's of TBs per server, and RAM was always an important factor in resolving lockups. I would never recommend less than 64GB of RAM for your setup and use case, so I'm very hopeful that will improve your situation. Please keep us posted and thanks for participating in the Veeam community!

Reporting back on the results of the RAM upgrade. I bumped the RAM in our repo server from 16GB up to 96GB. I then performed the same test that would consistently render the server unresponsive: I deleted (via Windows Explorer) a ~5TB .vbk file. Result: no server lockup. I then performed the “real world” test by changing the retention policy of our large file server backup job so that Veeam would attempt to delete old backup sets. Result: no server lockup.

Tom, it appears that increasing server memory (per your suggestion) resolved the problem that has been a thorn in my side for three months now. Thank you!

However, I am left wondering why it took so long to get here. I have an open ticket with Microsoft Pro Support. I have an open ticket with Veeam Support (I have worked this issue with Tier 1 and Tier 2 support people at Veeam). At no point did anyone at MS support or Veeam support suggest that increasing RAM would resolve my server lockup issue. They have had me collect and upload diagnostic files and event logs, tweak registry entries, adjust page file settings, and verify driver versions. Never a suggestion to increase memory.

I have a few suggestions, if I may:

1. Share this information with Veeam Support personnel. I would have tried increasing server memory months ago had someone in Veeam Support suggested it.

2. Create and maintain a sticky post for this forum thread. There are 70+ pages here, far too much for a busy network admin responsible for dozens of systems to parse through. If there are things that are known about these issues, and things that can help customers resolve problems, it would be of great benefit to Veeam customers to summarize them in a sticky post on the first page of this thread. Had I seen a suggestion in a sticky post that increasing repo server memory can resolve lockup issues, I would have tried that. It would have saved me time as well as Veeam Support personnel time.

3. Veeam should be working directly with Microsoft on these ReFS issues. IMO, Veeam should have the lab environment facilities to replicate these issues and subsequently share data with Microsoft toward understanding and resolving these problems. Asking customers to open a $500 support case with Microsoft Pro Support is bad form, IMO. I realize that Microsoft ultimately needs to provide the fix, but throwing the $500 at Microsoft has not got us a thing. They have been analyzing our server memory.dmp file for almost two months now. No suggested fixes came out of that. No mention of server memory having a relationship to system dependability.

I’m VERY happy that our problem is now resolved, I just wish it hadn’t taken three months to get here (yes, I think I first opened the ticket with Veeam Support on this issue around April 11).

Thanks,
Raleigh

Gostev · Jul 13, 2018 12:38 am

Raleigh, the only reason ReFS is finally becoming usable is because Veeam has been working directly with Microsoft on these ReFS issues for the past 2 years - and in fact, directly with actual ReFS developers even. I realize that having joined this community just 2 weeks ago you can't know this, but you are making a lot of wrong assumptions for some reason, including Veeam not having the lab environment and such.

And regarding $500 support cases, your view is somewhat naive, because support cases are actually the only way to raise the priority of a particular issue over thousands of other open issues. No one at Microsoft would still be working on this issue if it only had one support case open from some vendor. Moreover, for the first half a year since this all started, they were busy estimating "revenue impact" from this bug based on affected customer sizes and estimated cost of downtime due to locked up backup repositories (I was asked to assist in calculations). It seems it was not until the total revenue impact among all customers who opened support cases crossed a certain threshold that the first developer was finally assigned to research this issue, many months later.

Last but not least, these $500 should have been refunded in every single case, because the issue in question is now a confirmed bug. So this should not have cost anyone a penny regardless.

I do agree with your other two points. The only reason this has not been done yet is that this is actually quite new information/confirmation from the field. If you read my recent posts above, this was merely a theory until now, based on the fact that Microsoft was seemingly working on improving ReFS memory management.

bfrizie · Jul 13, 2018 7:40 pm

I can confirm that the latest update resolved our issue as well.
