Ctek
Service Provider
Posts: 84
Liked: 13 times
Joined: Nov 11, 2015 3:50 pm
Location: Canada
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Ctek »

Ctek wrote:ReFS.sys 2395 (2018-07 CU) has been abysmal in our environments; we have had severe server locking issues and downtime during fast cloning since that cumulative update.

Today fresh off the Microsoft presses, there is KB4343887 which does not change the ReFS.sys driver version.

Fun times.
Actually, both ReFS 2363 and 2395 are causing major issues in our environment when fast cloning. I admit I am a little undersized compared to the suggested 1GB/1TB rule for RAM/storage; however, all my ReFS servers had been chugging along fine since at least April and only started locking up with those two ReFS versions.

I am reverting one server to 2312 for testing purposes.
VMCE
Mgamerz
Expert
Posts: 160
Liked: 28 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Mgamerz »

Haven't had a server lock up yet with the July updates, though the hardware backing the server started failing. I migrated to a new server and it has been stable for at least 10 days now on 25TB and 35TB repositories. Having 160GB of RAM might help, though (I may or may not have put all of our spare RAM into this system, I plead the 5th).
Ctek
Service Provider
Posts: 84
Liked: 13 times
Joined: Nov 11, 2015 3:50 pm
Location: Canada
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Ctek »

Maybe the issue is mitigated by a hugely favorable RAM/Storage ratio.
VMCE
l0stb@ackup
Influencer
Posts: 14
Liked: 4 times
Joined: Jul 19, 2018 2:10 am
Contact:

Re: Feedback on newer updates

Post by l0stb@ackup »

Thank you all. If you could please also include the ReFS.sys version and the currently installed CU in your experiences, that would be grand.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner »

wingphil wrote:I had a server lockup again last week and again this morning. Refs driver version 2363, and we have about 14TB of repository space and four vCPUs (running under VMware).

We had 20GB RAM assigned. I've upped it to 24GB. Should I still be expecting to see problems? I know this is a lot less than the rest of you have allocated, but it's a lot more than the 1GB/1TB recommended.
There are other size guidelines based on CPU cores. Have you taken that into account also? The ReFS advice is based on the size of the volume. There is separate guidance for a ratio based on CPU cores. The calculations don't produce the same results.
jim3cantos
Enthusiast
Posts: 64
Liked: 12 times
Joined: Jan 08, 2013 6:14 pm
Full Name: José Ignacio Martín Jiménez
Location: Madrid, Spain
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by jim3cantos » 1 person likes this post

jim3cantos wrote:KB4338822 (Last 2018-07 Cumulative Update) installed and checked version of ReFS.sys file at .2395. Will post again if problems are detected.
After 2 days no problems detected. 14 TB repository (64KB cluster size) with 16 GB of RAM here.

Who goes first with next update: "2018-08 Cumulative Update for Windows Server 2016 for x64-based Systems (KB4343887)"? As indicated above it doesn't seem to change ReFS.sys file version.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner »

Is it the case that ReFS.sys isn't mentioned as part of the improvements in the release notes or has someone installed the update and checked to see if the file has been affected?

Because we've seen in the past that the file gets updated despite it not being mentioned in the release notes. So which kind of update is this one? :lol:
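
One way to settle this is to note the ReFS.sys file version before and after installing an update and compare the two readings. A minimal Python sketch, assuming the version strings have already been read from the file properties by hand; the function names are hypothetical and the sample versions are simply the ones quoted in this thread:

```python
def parse_version(text):
    """Split a dotted version such as '10.0.14393.2395' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_changed(before, after):
    """Return True if the file version differs between the two readings."""
    return parse_version(before) != parse_version(after)


def newest(versions):
    """Return the highest version string from an iterable of dotted versions."""
    return max(versions, key=parse_version)


if __name__ == "__main__":
    # Versions mentioned in this thread for Windows Server 2016 (build 14393).
    seen = [
        "10.0.14393.2097",
        "10.0.14393.2312",
        "10.0.14393.2363",
        "10.0.14393.2395",
    ]
    print("Newest version seen:", newest(seen))
    # Same version before and after an update means the driver was not touched.
    print("Changed:", driver_changed("10.0.14393.2395", "10.0.14393.2395"))
```

Comparing tuples of integers rather than raw strings avoids the trap where a three-digit revision would sort above a four-digit one.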
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

Hi

Where are we at with a stable driver?

Compacting full backup file (92% done) [fast clone] after 34 hours is not ideal!

Running version 10.0.14393.2312 and debating whether I should update to .2363.
Many thanks

Iain Green
wingphil
Novice
Posts: 7
Liked: never
Joined: Jun 11, 2018 8:51 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by wingphil »

ejenner wrote: There are other size guidelines based on CPU cores. Have you taken that into account also? The ReFS advice is based on the size of the volume. There is separate guidance for a ratio based on CPU cores. The calculations don't produce the same results.
Thanks, that's a good point. It's an AIO VM running the backup server and repositories and acting as a proxy for some of the VMs also. So it could be using a lot of memory separately from the ReFS requirements.

We're a small shop and I don't have the resources to throw more RAM at the problem unless I have to, but at least things make sense now. If 24GB doesn't cut it I will try 32GB, which should be more than enough for the worst case memory consumption based on Veeam's official system requirements plus the 1GB/1TB ReFS guidance from this thread.

I do wish Veeam had some options to tune, e.g. memory consumption vs performance. It's great that my incrementals are ten times faster than our previous solution (BackupAssist), but at our scale we don't need it to be that fast and it would be better for it to use less memory.

Thanks,

Phil
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

@Iain the latest are stable, just make sure you have sufficient RAM on the backup repository server.

@wingphil we cannot control that, unfortunately :( We can only hope Microsoft addresses the ReFS memory consumption issues (specifically, kernel memory consumption). Throwing a lot of RAM at the server is more of a workaround - you won't even see it used as much (server lockups happen when the OS runs out of kernel memory, not physical memory).
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

Gostev wrote:@Iain the latest are stable, just make sure you have sufficient RAM on the backup repository server.
So a 174TB repo would require 174GB of RAM?
Currently, both repos have 64GB and in the last 30 days usage has spiked to 24% at most.

If we are not seeing high memory usage, would this indicate we are having different issues and that REFS is not the problem?
Many thanks

Iain Green
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

As I noted in my previous post, you won't see high physical memory usage. In your case, I would say 128GB RAM is the absolute minimum for a 174TB repo.
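
For reference, the 1GB/1TB rule of thumb discussed in this thread works out as follows. This is only a sketch of the arithmetic: the function names are made up for illustration, and it says nothing about kernel memory, which is what actually runs out during a lockup.

```python
def rule_of_thumb_ram_gb(repo_tb, gb_per_tb=1.0):
    """RAM suggested by the thread's 1 GB per 1 TB rule of thumb."""
    return repo_tb * gb_per_tb


def ram_per_tb(installed_gb, repo_tb):
    """Installed RAM expressed as GB per TB of repository."""
    return installed_gb / repo_tb


if __name__ == "__main__":
    repo_tb = 174  # repository size quoted above
    for installed_gb in (64, 128):
        ratio = ram_per_tb(installed_gb, repo_tb)
        print(f"{installed_gb} GB on {repo_tb} TB = {ratio:.2f} GB/TB")
    print(f"Rule of thumb: {rule_of_thumb_ram_gb(repo_tb):.0f} GB")
```

On these numbers, 64GB on a 174TB repo is roughly 0.37GB per TB and 128GB is roughly 0.74GB per TB, so both sit below the rule of thumb.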
verocab
Novice
Posts: 3
Liked: never
Joined: Oct 15, 2012 6:26 pm
Full Name: Véronique Cabana
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by verocab »

@Gostev, do you know if there's a patch for Server 1803? The latest update (KB4343909) still has ReFS.sys at 10.0.17134.137 (2018-06-15).
l0stb@ackup
Influencer
Posts: 14
Liked: 4 times
Joined: Jul 19, 2018 2:10 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by l0stb@ackup »

Iain_Green wrote:Where are we at with a stable driver?
Gostev wrote:@Iain the latest are stable, just make sure you have sufficient RAM on the backup repository server.
Thanks Gostev - does that version improve fast clone and write performance?
We're on 10.0.14393.2097, 128GB RAM and a 152TB repo
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

l0stb@ackup wrote:Thanks Gostev - does that version improve fast clone and write performance?
We're on 10.0.14393.2097, 128GB RAM and a 152TB repo
Support is advising I roll back to .2097 which is not the latest...
Many thanks

Iain Green
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

@l0stb@ackup as far as I know, it does not (comparing to 2097).

@Iain as I have already noted in my earlier response to you, that is incorrect advice. Your engineer was using a very old internal KB article. I have notified support management and this has been fixed.
JaySt
Service Provider
Posts: 453
Liked: 86 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by JaySt »

Gostev wrote:This setting makes Veeam operate with 8MB blocks (as opposed to 1MB blocks by default), which basically reduces the amount of cloned blocks 8x. I suppose it can be an alternative to increasing RAM size on the backup repository, although this will also increase incremental backup file sizes quite significantly.
This is interesting. Capacity usage could be an issue after changing to larger block sizes, but could also be acceptable in some cases if stability and performance are better.
Has there been any (more) feedback on stability after switching to larger block sizes with this setting? Any (more) feedback on cases where it was a good alternative to increasing RAM?
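
To put the quoted 8x figure in numbers, here is a rough Python sketch of the block-count arithmetic. It assumes a backup file is simply divided into fixed-size blocks and ignores compression and deduplication; the 1 TiB file size and the function name are illustrative only.

```python
MIB = 1024 * 1024


def block_count(file_bytes, block_bytes):
    """Number of fixed-size blocks needed to cover a file (last block may be partial)."""
    # Integer ceiling division, so no floating-point rounding is involved.
    return -(-file_bytes // block_bytes)


if __name__ == "__main__":
    file_bytes = 1024 ** 4  # hypothetical 1 TiB full backup file
    default_blocks = block_count(file_bytes, 1 * MIB)  # 1MB blocks (default)
    large_blocks = block_count(file_bytes, 8 * MIB)    # 8MB blocks
    print("1MB blocks:", default_blocks)
    print("8MB blocks:", large_blocks)
    print("Reduction factor:", default_blocks // large_blocks)
```

Fewer blocks means fewer references to clone, at the cost of larger incrementals, since a changed byte now dirties an 8MB block instead of a 1MB one.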
Veeam Certified Engineer
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

@Gostev regarding support case 03141694

This is the action plan provided by support:

1) The current plan is rolling back the REFS driver to 10.0.14393.2097
2) Then monitor performance
3) We don’t advise changing the RAM

I am unsure how to proceed. Part of me wants to run the latest updates and go to the driver level you suggested; however, the RAM per TB is an issue, as each repo is 174TB with 64GB of RAM.
Another side of me wants to follow support's suggestion, but then I'm stuck not being able to update, as I will need to ensure the driver level is not updated!

Any guidance now would be very much appreciated!
Many thanks

Iain Green
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

Iain, you are right to question this action plan, as it makes no sense. I am very sorry for this, and I am really frustrated myself that you keep getting invalid recommendations after we've discussed this internally with support twice already. I've asked support management to have someone more senior take over your case to prevent further miscommunication.

Bottom line, you should do these two things:
1. Bump RAM on the backup repository to 128GB.
2. Install the latest Windows updates to get your ReFS driver to the current version.

Let me know if you keep hearing anything other than that from support :D

By the way, I heard they were also working on the official KB article for ReFS best practices.
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

Gostev,

Great news, appreciate your assistance in resolving this issue.
While we await your KB are you able to provide an explanation for the increase in RAM (appreciate you may have explained this already, but I am struggling to locate it in this long trail)?

As it stands, I would need to submit a request for 64GB of RAM for each server, and to do so I would need to provide evidence that the server needs it. However, with performance graphs showing it is barely using the RAM it already has, I don't expect to get sign-off on the required purchase.

Informing them that Veeam suggests 1GB per 1TB for REFS without any reasoning will be shot down.

For now, I will arrange for the servers to be fully patched and confirm the latest driver is installed. We will then monitor the performance.

Thanks
Many thanks

Iain Green
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

Iain, actually the explanation was already provided a few times on the last few pages of this thread, both by myself and our solution architect Tom Sightler - so if you need these details, please simply review those earlier posts if you don't mind. Thanks!
LBegnaud
Service Provider
Posts: 19
Liked: 7 times
Joined: Jan 24, 2018 12:08 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by LBegnaud »

Gostev, I have read through a lot of this thread, and I'm looking for a bit of clarification on ReFS + RAM Requirements/Usage.

We see a few bits of behavior with our extents, and I am trying to sort out a solid troubleshooting method for when things go haywire. Currently, when we see issues, it'll be on a single extent where things are thrashing. The behavior is such that job tasks that are actively transferring to other extents are uninterrupted, but any tasks that are writing to the problematic extent stop transferring data (or transfer very little). More problematic is that any metadata tasks for jobs that point to that SoBR hang as well. The fastest way we have to get things moving again is, unfortunately, to hard reset the physical machine hosting that thrashing extent.

I'm hoping to come up with something we can watch that might give us some advance warning that issues are cropping up. I notice that "Modified" RAM usage will spike when large transfers happen to our servers. I'm assuming this is due to read or write cache on our storage pool virtual disks? Also, I notice that as the uptime of the server increases, we see larger usage of "Metafile" when viewed in RamMap. I've never seen those two combined come close to using all the RAM on the server, even during the thrashing. Most of our extents follow the 1GB/1TB rule, but we've definitely seen issues on ones that far exceed it (a 15TB volume with 48GB of RAM is one we've been having issues with the last couple of days).

If you want to just link to relevant posts in this thread to get me some reading material that'd be great. ReFS thrashing is one of the most frustrating things to troubleshoot, as I can't seem to find ANYTHING to indicate that it's happening, except that a server will just stop responding to WMI, login starts hanging, disk access is delayed, etc.
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

Here's a good post which also explains that while more RAM helps with the issue, it doesn't completely eliminate it. Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner »

I think a justification for increasing the RAM is that it has helped resolve problems for other users.

The way I see it, MS spent a long time developing NTFS and it is now a mature technology. With ReFS they're still working out all the wrinkles.

As it's a performance enhancement, if you're unable to get authorisation to purchase the supporting hardware then you'll have to downgrade your repository to NTFS. It's not an entitlement, it's just something you can get if you spend the money.

A potential test for the 174TB repository is to move the backups off (or delete the jobs) and reformat the volume as a 64TB repository (or smaller) and then put the jobs back on.
Gostev
Chief Product Officer
Posts: 31766
Liked: 7265 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

Considering disk space savings from using ReFS, one would argue it would be way more costly to use NTFS vs. buying some extra RAM for ReFS repository. Although of course this depends a lot on the retention policy too.
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

ejenner wrote:I think a justification for increasing the RAM is that it has helped resolve problems for other users.

The way I see it, MS spent a long time developing NTFS and it is now a mature technology. With ReFS they're still working out all the wrinkles.

As it's a performance enhancement, if you're unable to get authorisation to purchase the supporting hardware then you'll have to downgrade your repository to NTFS. It's not an entitlement, it's just something you can get if you spend the money.

A potential test for the 174TB repository is to move the backups off (or delete the jobs) and reformat the volume as a 64TB repository (or smaller) and then put the jobs back on.
Unfortunately that is not an option.
Many thanks

Iain Green
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner »

Gostev wrote:Considering disk space savings from using ReFS, one would argue it would be way more costly to use NTFS vs. buying some extra RAM for ReFS repository. Although of course this depends a lot on the retention policy too.
And you would've already bought your storage. If you can't get the authorization for the memory, which you may not have realized was required when you specified the storage, then you'll have to use NTFS. In our case we were way underspecified for RAM on all the servers we bought for our Veeam project. But I didn't spec them, so whether or not we would've properly calculated our requirements if it had been my choice is a bit of an unknown. We've been able to upgrade our servers, so we will keep an eye out for crashes and see how things go. 28 days is the longest our repository has gone without a STOP error, so we have to wait quite a long time to prove whether or not we're fixed.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner »

Iain_Green wrote:Unfortunately that is not an option.
Usually I would say "in that case you're stuck then, you'll just have to accept it."

You'll find a way through one way or another as something will give somewhere. Another possible option would be to spin up another server to use as an additional repository where the RAM / TB ratio is favorable. Then back-to-back compare by running the same jobs to both repositories. In our case, all our old backup servers that we used to use for DPM are still hanging around so if we were in this situation this would be a way of proving the case for us.
Iain_Green
Service Provider
Posts: 158
Liked: 9 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green »

The Veeam best practice guide needs to be updated with information about REFS: there are 3 mentions of REFS in it and none relate to repositories. Yet when adding a repository, it is stated that Veeam suggests formatting repositories with REFS!

Appreciate this is MS's screw-up though, and that Veeam is working with them to correct it.

For all future customer deployments, I will for now be deploying NTFS until we are confident REFS will not be broken.
Many thanks

Iain Green
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » 1 person likes this post

That's a bit like how people were with FAT vs. NTFS if you go back far enough. There are a lot of parallels with that situation.