tsightler
Veeam Software
Posts: 5199
Liked: 2082 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by tsightler » Jul 02, 2018 4:27 pm 1 person likes this post

Raleigh wrote:FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume
That's a pretty small amount of RAM for a Veeam repo. The best practice recommendation is 4GB per core, so even with only a 4-core processor, 16GB would basically be the minimum, and that figure is based largely on NTFS, which has a much lighter in-kernel memory load. For ReFS, especially in these smaller configurations, the standing recommendation is at least 1GB of RAM per TB of space. Once you get into the hundreds of TBs, you can usually begin to pare this back a little (for example, it's not uncommon to have a 400TB repo with 256GB of RAM), but on the smaller end of the scale the 1GB/1TB ratio has proven to be quite stable.
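
The sizing rule above can be sketched as a small calculation. This is a minimal illustration, not an official Veeam tool: the function name `recommended_ram_gb` and its parameters are hypothetical, and it simply takes the larger of the two figures mentioned in the post (4GB per core, and 1GB per TB for ReFS). It does not model the relaxation described for repositories in the hundreds of TBs.

```python
def recommended_ram_gb(cores: int, repo_tb: float, refs: bool = True) -> float:
    """Rough RAM estimate from the two rules of thumb quoted in the post."""
    per_core = 4.0 * cores                    # 4GB per CPU core
    per_tb = 1.0 * repo_tb if refs else 0.0   # 1GB per TB, applied to ReFS only
    return max(per_core, per_tb)


if __name__ == "__main__":
    # Hypothetical 4-core server with the 29TB ReFS volume from the quoted spec.
    print(recommended_ram_gb(cores=4, repo_tb=29))
```

Under these assumptions a 4-core server with a 29TB ReFS volume comes out at 29GB, so the 16GB in the quoted spec sits below the ReFS figure.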

Mgamerz
Enthusiast
Posts: 62
Liked: 8 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Mgamerz » Jul 02, 2018 4:44 pm

Looks like you have to restore refsv1.sys as well, because my synthetic full performance is still abysmal after rolling back only refs.sys (not refsv1.sys).

Edit: Or Windows just decided to reinstall the refs.sys file... I did not install any Windows updates recently.

Gostev
Veeam Software
Posts: 23116
Liked: 2917 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Jul 02, 2018 5:04 pm

refs.sys is the only file we had to replace in our own testing.

AlexL
Enthusiast
Posts: 58
Liked: never
Joined: Aug 24, 2010 8:55 am
Full Name: Alex
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by AlexL » Jul 02, 2018 5:43 pm

Most, if not all, of the discussion seems to be about Backup Jobs. Does no one run Backup Copy jobs, or just not with ReFS?
Anyway, is there a GB/TB recommendation for repos that hold only Backup Copy jobs?
We've had a server with 20 cores, 64GB of RAM and a 36TB ReFS repo for a year; last month we added a 400TB repo to the same server.
Free memory dropped from an average of 60% to 40%, so I mostly have 20 to 30GB free of the 64GB, and CPU is seldom above 5 to 10%. Still, I get frequent freezes when the 400TB store is hit with writes, even though the ReFS drivers have already been rolled back.
I do not seem to have a memory issue, or do I? And if so, where do I look to observe that?
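
As a rough arithmetic check only, the figures in this post can be compared with the ratios quoted earlier in the thread (1GB of RAM per TB on smaller ReFS repos, and the 400TB/256GB example for large ones). Those ratios were stated for repositories in general, not specifically for Backup Copy targets, so applying them here is an assumption; the helper name `gb_per_tb` is illustrative.

```python
def gb_per_tb(ram_gb: float, repo_tb: float) -> float:
    """RAM-to-capacity ratio in GB of RAM per TB of repository space."""
    return ram_gb / repo_tb


# Server described in this post: 64GB RAM, 36TB + 400TB ReFS repos.
this_server = gb_per_tb(64, 36 + 400)

# Large-repo example quoted earlier in the thread: 256GB RAM for 400TB.
large_repo_example = gb_per_tb(256, 400)

if __name__ == "__main__":
    print(f"this server:        {this_server:.2f} GB/TB")
    print(f"large-repo example: {large_repo_example:.2f} GB/TB")
```

By this measure the server has roughly 0.15GB of RAM per TB, against 0.64GB per TB in the large-repo example, which suggests memory may be worth a closer look even when the free-memory counters seem comfortable.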


Re: REFS issues (server lockups, high CPU, high RAM)

Post by Mgamerz » Jul 02, 2018 5:43 pm

Yeah, I replaced that, but apparently Windows replaced it somehow. Maybe a Windows update snuck in and I didn't see it. I've been trying to create a synthetic full for about 3 weeks now, and we're going to run out of storage space soon...

Our backup copy jobs go to ReFS with no problems there, but the server we send them to locked up last Friday, just as we have seen on the main server.

thaapavuori
Lurker
Posts: 1
Liked: never
Joined: Jul 02, 2018 5:48 pm
Full Name: thaapavuori
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by thaapavuori » Jul 02, 2018 6:16 pm

Hi,

I think the KB number above is wrong. I believe the correct KB is KB4077525.

nhwanderer
Novice
Posts: 7
Liked: 1 time
Joined: Oct 13, 2017 7:37 pm
Full Name: Jordan Desroches
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by nhwanderer » Jul 04, 2018 12:15 am

I was experiencing very long compaction times. Following gm2783's instructions from pp 68, I extracted refs.sys and refsv1.sys from windows10.0-kb4093120-x64_72c7d6ce20eb42c0df760cd13a917bbc1e57c0b7.msu. On Windows 2016 I had to shift-restart to get to a command prompt to replace them, because the machine complained about permissions when I tried to replace them on the live system. Starting another backup run now, we'll see what happens!


Re: REFS issues (server lockups, high CPU, high RAM)

Post by nhwanderer » Jul 04, 2018 1:56 am 1 person likes this post

nhwanderer wrote:I was experiencing very long compaction times. Following gm2783's instructions from pp 68, I extracted refs.sys and refsv1.sys from windows10.0-kb4093120-x64_72c7d6ce20eb42c0df760cd13a917bbc1e57c0b7.msu. On Windows 2016 I had to shift-restart to get to a command prompt to replace them, because the machine complained about permissions when I tried to replace them on the live system. Starting another backup run now, we'll see what happens!
June update driver: Killed the compaction at 72% after 36 hours
Rolled back driver: 40 minutes to compact

yay :-)
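
To put those two numbers side by side: if the killed compaction had continued at the same average rate (an assumption, since progress is rarely perfectly linear), the projected total and the slowdown relative to the rolled-back driver work out as follows. The function name is illustrative.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate total run time, assuming a constant rate of progress."""
    return elapsed_hours / fraction_done


june_driver_hours = projected_total_hours(36, 0.72)   # killed at 72% after 36 h
rolled_back_hours = 40 / 60                           # 40 minutes, in hours
slowdown = june_driver_hours / rolled_back_hours

if __name__ == "__main__":
    print(f"projected: {june_driver_hours:.0f} h, slowdown: {slowdown:.0f}x")
```

On that linear assumption the June driver would have needed about 50 hours, roughly 75 times longer than the 40 minutes seen after the rollback.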

Raleigh
Novice
Posts: 7
Liked: never
Joined: Jun 26, 2018 11:33 pm
Full Name: Raleigh
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Raleigh » Jul 05, 2018 6:11 pm

tsightler wrote:
Raleigh wrote:
FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume

That's a pretty small amount of RAM for a Veeam repo. The best practice recommendation is 4GB per core, so even with only a 4-core processor, 16GB would basically be the minimum, and that figure is based largely on NTFS, which has a much lighter in-kernel memory load. For ReFS, especially in these smaller configurations, the standing recommendation is at least 1GB of RAM per TB of space. Once you get into the hundreds of TBs, you can usually begin to pare this back a little (for example, it's not uncommon to have a 400TB repo with 256GB of RAM), but on the smaller end of the scale the 1GB/1TB ratio has proven to be quite stable.
Are you saying that increasing the amount of RAM in our ReFS repository server can improve its reliability (preventing the server crashes we're experiencing when large .vbk files are deleted)? I've had several open tickets with Veeam Support on our issue, and never did they bring up the amount of RAM in the server. Also, I worked with a Veeam sales team (sales guy and his technical sidekick), and they vetted the server configuration before I even placed the order with Dell. If it is a known fact that increasing the RAM could resolve this type of problem, I'm willing to give it a try. I wish I would have known about this sooner.

Also, I realized I left out the CPU info for our server. It has dual Xeon Silver 4108 CPUs, with 8 cores each, for a total of 16 cores. Based on your recommendation, 64GB of RAM is a best practice for our Veeam repository server with 16 CPU cores. Correct?

Can anyone confirm that adding RAM to their repository server resolved (or greatly minimized) this "ReFS-related server crash" issue? I'm willing to throw money at this problem, but I'd like to know it's not wasted money.

Thanks for the insight.

--Raleigh


Re: REFS issues (server lockups, high CPU, high RAM)

Post by Raleigh » Jul 05, 2018 6:22 pm

AlexL wrote:I have a feeling it is more the .vib size that is causing the trouble than the .vbk size, could that also be the case in your situation Raleigh?
Alex,
For us, it's definitely the .vbk files. I can reproduce the problem by trying to delete these files manually from Windows Explorer. I don't seem to be experiencing the crashing problem when deleting the (typically much smaller) .vib files from Windows Explorer. Also, it's not all .vbk files that cause the problem. I seem to be able to delete 2TB (and smaller) .vbk files without any problem. Veeam jobs have no problem deleting these either. It's only the backup jobs with larger .vbk files (4 TB and greater) that seem to give our repository server problems.

--Raleigh


Re: REFS issues (server lockups, high CPU, high RAM)

Post by tsightler » Jul 05, 2018 7:54 pm 2 people like this post

Raleigh wrote:Are you saying that increasing the amount of RAM in our ReFS repository server can improve its reliability (preventing the server crashes we're experiencing when large .vbk files are deleted)?
Yes, that is exactly what I'm saying. ReFS definitely uses more memory than NTFS, especially kernel memory, and deletes of large files with lots of referenced blocks are one of the big hitters for spikes in memory usage. When deletes of large files with many reference blocks occur on ReFS, you may not see a ton of memory usage from an application perspective, but kernel memory will definitely increase, and, if it gets tight, can lead to deadlocks. I believe this is still a bug in the Windows 2016 memory management code, but having lots more free RAM helps to mitigate it (it does not completely eliminate it). Based on that, I definitely don't recommend running ReFS with anything less than the best practice memory configuration. If Microsoft ever corrects this issue (it's been hinted to me that the fix is in the RS4 builds), then perhaps this concern will go away. Exactly how much memory you will need is difficult to say, but certainly more than the absolute minimum, which is what I would consider you to have.
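
One way to watch kernel memory rather than application memory is the Win32 `GetPerformanceInfo` call, which reports the kernel paged and nonpaged pool sizes in pages. The sketch below is an illustration added here, not something taken from the thread or from Veeam guidance. It runs only on Windows, the helper names are hypothetical, and it cannot tell you which values indicate trouble on a given server.

```python
import ctypes
import sys


class PerformanceInformation(ctypes.Structure):
    """Mirror of the Win32 PERFORMANCE_INFORMATION structure."""
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("CommitTotal", ctypes.c_size_t),
        ("CommitLimit", ctypes.c_size_t),
        ("CommitPeak", ctypes.c_size_t),
        ("PhysicalTotal", ctypes.c_size_t),
        ("PhysicalAvailable", ctypes.c_size_t),
        ("SystemCache", ctypes.c_size_t),
        ("KernelTotal", ctypes.c_size_t),
        ("KernelPaged", ctypes.c_size_t),
        ("KernelNonpaged", ctypes.c_size_t),
        ("PageSize", ctypes.c_size_t),
        ("HandleCount", ctypes.c_uint32),
        ("ProcessCount", ctypes.c_uint32),
        ("ThreadCount", ctypes.c_uint32),
    ]


def pages_to_mib(pages: int, page_size: int) -> float:
    """Convert a page count to MiB for the given page size in bytes."""
    return pages * page_size / (1024 * 1024)


def kernel_memory_mib() -> dict:
    """Return kernel pool sizes and available physical memory in MiB (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetPerformanceInfo is only available on Windows")
    psapi = ctypes.WinDLL("psapi", use_last_error=True)
    psapi.GetPerformanceInfo.argtypes = [
        ctypes.POINTER(PerformanceInformation), ctypes.c_uint32]
    psapi.GetPerformanceInfo.restype = ctypes.c_int
    info = PerformanceInformation()
    info.cb = ctypes.sizeof(info)
    if not psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
        raise ctypes.WinError(ctypes.get_last_error())
    return {
        "kernel_total": pages_to_mib(info.KernelTotal, info.PageSize),
        "kernel_paged": pages_to_mib(info.KernelPaged, info.PageSize),
        "kernel_nonpaged": pages_to_mib(info.KernelNonpaged, info.PageSize),
        "physical_available": pages_to_mib(info.PhysicalAvailable, info.PageSize),
    }


if __name__ == "__main__":
    if sys.platform == "win32":
        for name, mib in kernel_memory_mib().items():
            print(f"{name}: {mib:.0f} MiB")
    else:
        print("Run this on the Windows repository server.")
```

Sampling these values before and during the deletion of a large .vbk file would show whether kernel pool usage climbs while application memory stays flat, which is the pattern described above.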
Raleigh wrote:I've had several open tickets with Veeam Support on our issue, and never did they bring up the amount of RAM in the server. Also, I worked with a Veeam sales team (sales guy and his technical sidekick), and they vetted the server configuration before I even placed the order with Dell. If it is a known fact that increasing the RAM could resolve this type of problem, I'm willing to give it a try. I wish I would have known about this sooner.

Also, I realized I left out the CPU info for our server. It has dual Xeon Silver 4108 CPUs, with 8 cores each, for a total of 16 cores. Based on your recommendation, 64GB of RAM is a best practice for our Veeam repository server with 16 CPU cores. Correct?
I'm quite disappointed that the SE didn't provide some additional guidelines based on our best practices. However, we do have a lot of SEs these days, so they could have been new themselves. You can read the sizing recommendations for repos for yourself here:
https://bp.veeam.expert/architecture-ov ... ing/sizing

Note that the best practice guide is maintained by the Solutions Architecture team here at Veeam and thus it reflects not the minimums, but the recommendations that we've collected based on significant field experience with customers small and large. I am part of that team (specifically the Principal Solutions Architect for NA). Our goal in maintaining the best practice guide is to document guidelines that will provide the best performance and reliability across a wide range of circumstances using proven practices from the field.
Raleigh wrote:Can anyone confirm that adding RAM to their repository server resolved (or greatly minimized) this "ReFS-related server crash" issue? I'm willing to throw money at this problem, but I'd like to know it's not wasted money.
There's a post on the last page that is a reply to your message that specifically says exactly this, perhaps you didn't see it?
https://forums.veeam.com/veeam-backup-r ... ml#p285642


Re: REFS issues (server lockups, high CPU, high RAM)

Post by Raleigh » Jul 05, 2018 8:34 pm

Tom, thanks for the detailed response. I just received a quote from my Dell rep to upgrade the server to 96GB of RAM (that was a logical configuration based on the memory it already has plus Dell's guidelines on memory upgrades). I'll install this memory upgrade and report back if it has improved the repository server reliability. I'm sure hoping this resolves the problem. 96GB more than meets the best practice recommendations...agreed?

Actually, I somehow had missed that post. Thanks. So yes, I'm going to give it a try. The cost of the memory upgrade is well worth it if it resolves the server lockup issues we've been having.

--Raleigh


Re: REFS issues (server lockups, high CPU, high RAM)

Post by tsightler » Jul 05, 2018 10:20 pm

Raleigh wrote:Tom, thanks for the detailed response. I just received a quote from my Dell rep to upgrade the server to 96GB of RAM (that was a logical configuration based on the memory it already has plus Dell's guidelines on memory upgrades). I'll install this memory upgrade and report back if it has improved the repository server reliability. I'm sure hoping this resolves the problem. 96GB more than meets the best practice recommendations...agreed?

Actually, I somehow had missed that post. Thanks. So yes, I'm going to give it a try. The cost of the memory upgrade is well worth it if it resolves the server lockup issues we've been having.
Of course, I can't guarantee that the problem will be resolved, but I can say that I have worked with literally dozens, perhaps hundreds, of customers using ReFS, most at scales of hundreds of TBs per server, and RAM was always an important factor in resolving lockups. I would never recommend less than 64GB of RAM for your setup and use case, so I'm very hopeful that will improve your situation. Please keep us posted and thanks for participating in the Veeam community!

vsssper
Lurker
Posts: 1
Liked: never
Joined: Jul 06, 2018 7:43 am
Full Name: Stas
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by vsssper » Jul 06, 2018 7:48 am

It is a nightmare:
[image not preserved]

Looks like it won't finish before a proper fix from MS is released :(

antipolis
Enthusiast
Posts: 68
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by antipolis » Jul 06, 2018 9:38 am

At this point you should really cancel the job and roll back the driver.
