LBegnaud
Service Provider
Posts: 19
Liked: 7 times
Joined: Jan 24, 2018 12:08 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by LBegnaud » Aug 24, 2018 2:01 pm

Gostev, thanks for the reply. That's pretty consistent with what we see... I'm more asking how to identify the issue with real stats. I don't know too much about the inner workings of Windows, but I'm pretty sure kernel memory usage is shown in Task Manager as the "Paged" and "Non-paged" pools.

Our current situation is that we have 8 backup servers running 12 extents in one SoBR. As I said, most of the servers are well over 1GB / 1TB. We have 2 that run 48GB of RAM with 60TB of space, but those rarely give us issues. When we see that the metadata operations are hanging, it would be useful to be able to look somewhere to know for certain which server is causing the holdup, force a reboot, and let things pick up the pieces.
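One way to get the "real stats" asked about above is to read the kernel pool sizes programmatically instead of watching Task Manager. The following is a minimal sketch, assuming Python with ctypes on the repository server and the Win32 `GetPerformanceInfo` call from psapi.dll; the helper names (`pages_to_gib`, `read_kernel_pools`) are made up for illustration. Whether the ReFS memory growth discussed in this thread actually shows up in these pool counters is not established here, so treat the numbers as one data point to compare across servers, not as a diagnosis.

```python
import ctypes
import sys


class PerformanceInformation(ctypes.Structure):
    """Mirror of the Win32 PERFORMANCE_INFORMATION structure (psapi.h)."""
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("CommitTotal", ctypes.c_size_t),
        ("CommitLimit", ctypes.c_size_t),
        ("CommitPeak", ctypes.c_size_t),
        ("PhysicalTotal", ctypes.c_size_t),
        ("PhysicalAvailable", ctypes.c_size_t),
        ("SystemCache", ctypes.c_size_t),
        ("KernelTotal", ctypes.c_size_t),
        ("KernelPaged", ctypes.c_size_t),
        ("KernelNonpaged", ctypes.c_size_t),
        ("PageSize", ctypes.c_size_t),
        ("HandleCount", ctypes.c_uint32),
        ("ProcessCount", ctypes.c_uint32),
        ("ThreadCount", ctypes.c_uint32),
    ]


def pages_to_gib(pages, page_size):
    """Convert a page count (the unit the API reports) to GiB."""
    return pages * page_size / 2**30


def read_kernel_pools():
    """Return paged and non-paged kernel pool sizes in GiB (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetPerformanceInfo is only available on Windows")
    info = PerformanceInformation()
    info.cb = ctypes.sizeof(info)
    psapi = ctypes.WinDLL("psapi", use_last_error=True)
    psapi.GetPerformanceInfo.argtypes = [
        ctypes.POINTER(PerformanceInformation),
        ctypes.c_uint32,
    ]
    psapi.GetPerformanceInfo.restype = ctypes.c_int
    if not psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
        raise ctypes.WinError(ctypes.get_last_error())
    return {
        "paged_gib": pages_to_gib(info.KernelPaged, info.PageSize),
        "nonpaged_gib": pages_to_gib(info.KernelNonpaged, info.PageSize),
    }


if __name__ == "__main__" and sys.platform == "win32":
    # Run this on each extent's server and compare the values.
    print(read_kernel_pools())
```

Logging these values on a schedule on every server in the SoBR would at least show which machine's kernel memory is climbing when metadata operations start to hang.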

Ctek
Service Provider
Posts: 69
Liked: 9 times
Joined: Nov 11, 2015 3:50 pm
Location: Canada
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Ctek » Aug 27, 2018 1:34 pm 1 person likes this post

Gostev's good news update from the email digest about ReFS:

Microsoft ReFS users: some good news came out of nowhere! As you know, while overall ReFS stability has much improved, there's still one major issue with the ReFS driver memory management, which causes kernel memory usage to spike on large backup file deletions, sometimes causing server lockups. The workaround for this issue has been to throw lots of RAM at the backup repository server (1GB RAM per 1TB storage), which is obviously not ideal. Per Microsoft, this issue was resolved in the RS5 build (aka Windows Server 2019) quite a while ago, but initially they did not plan to port it back to the RS1 build (aka Windows Server 2016) due to some significant complexities of this process.

However, it looks like the support case pressure made them change their mind - because during my regular status check with the ReFS team last week, something totally unexpected came up. Apparently, they've been working on backporting this and a few other ReFS performance fixes to RS1 - and the corresponding package is just around the corner! I certainly did not see that coming! So be on the lookout for KB434884 in the next few days - this should be the update's name. However, given that the update brings significant code changes, I would obviously advise against jumping on it immediately, unless your ReFS repositories are misbehaving anyway, and would instead wait until we've had a chance to put it through its paces in our test labs first.


Some good news indeed.
VMCE 9 Certified - Systems Administrator

ejenner
Expert
Posts: 397
Liked: 57 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Aug 28, 2018 10:30 am

Just come back from the server room after installing the extra RAM. :roll: :lol:

Our Veeam system is still being deployed and not yet production so happy to update the driver to the latest version and watch what happens.

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Aug 30, 2018 9:23 pm

I've been lurking; I'm still using the second beta with zero issues. Do we still not have a working driver?
This boot time is from my Veeam server. You can see how stable it is!
System Boot Time: 4/3/2018, 12:31:40 AM

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Aug 30, 2018 10:48 pm

This stability may be at the cost of file system reliability though. Troubleshooting stage builds sometimes have a number of potentially problematic function calls commented out in an effort to try and pinpoint the issue. I personally would definitely not risk running beta drivers in production, especially when it comes to my backup storage...

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Aug 31, 2018 1:26 pm

@gostev I don't know whether to laugh or cry after reading what you said. So our options are
1. Run public release drivers that cause the system to freeze, in my case (0x133) so many times that it causes corruption at the file level and hardware level, to the point that the RAID crashes. Get no sleep and babysit all jobs all the time.
2. Run a beta driver that has worked for several months now and that I've done restores from flawlessly.
3. don't use REFS at all.

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green » Aug 31, 2018 1:29 pm

kubimike wrote:3. don't use REFS at all.
Pretty much where we are at the minute. We have two new 600 TB repos now being set up with NTFS, as we have no faith in REFS.
Many thanks

Iain Green

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Aug 31, 2018 5:39 pm

@kubimike just curious, what makes you think that the current ReFS drivers are as bad as the one you used last year? Most of our customers are running the up-to-date ReFS driver with great success. The only important thing is to have sufficient RAM (1GB per 1TB is the recommendation) to avoid system freezes on large file deletions.
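The 1GB-per-1TB rule of thumb is simple arithmetic, and a small helper makes it easy to check a fleet of repository servers against it. This is only a sketch of the guideline as stated in this thread; the function names are made up for illustration, and whether "1TB" should mean raw capacity or data actually stored is not spelled out here, so the capacity figure you pass in is an assumption on your part.

```python
def recommended_ram_gb(capacity_tb, gb_per_tb=1.0):
    """RAM suggested by the rule of thumb (default: 1 GB per 1 TB)."""
    return capacity_tb * gb_per_tb


def ram_ratio(ram_gb, capacity_tb):
    """GB of RAM available per TB of repository storage."""
    return ram_gb / capacity_tb


def meets_guideline(ram_gb, capacity_tb, gb_per_tb=1.0):
    """True if the server has at least the rule-of-thumb amount of RAM."""
    return ram_gb >= recommended_ram_gb(capacity_tb, gb_per_tb)


# Example using figures mentioned earlier in the thread:
# a server with 48 GB of RAM and 60 TB of space.
print(f"{ram_ratio(48, 60):.2f} GB per TB")   # 0.80 GB per TB
print(meets_guideline(48, 60))                # False
print(recommended_ram_gb(600))                # 600.0 (a 600 TB repo)
```

By this measure the 48GB/60TB servers mentioned earlier sit slightly under the guideline, even though they were reported as rarely causing issues.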

mkretzer
Expert
Posts: 553
Liked: 124 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Aug 31, 2018 6:08 pm

@gostev After so much time I return to the biggest thread I ever started, and I must say I share Iain_Green's POV.
Today we gave REFS a fourth chance on a freshly set up system with the latest patches (512 GB of RAM) and a freshly formatted volume. Many of the old problems came up right away. We started backing up ~200 VMs at the same time, and after 20 minutes the whole system that hosts the REFS volume started hanging from time to time. No total crash, thankfully.

Memory usage went from 9 GB to 120 GB. As soon as memory went > 70 GB the volume was no longer "browsable" via explorer.

Then I set the three common registry settings, and now the RAM usage goes up to 140 GB - but the volume still starts to become unusable under normal write load.

Backup start and finish of the individual VMs backups take forever... The graph shows high write rate and then 0 for 5 minutes.

Markus

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Aug 31, 2018 8:13 pm

@gostev just from reading this forum. Looks like KB434884 is the answer we've been waiting for. I'll actually try that patch out. I wonder if that fix is in my beta 2 driver I've had all along.

csydas
Expert
Posts: 193
Liked: 46 times
Joined: Jan 16, 2018 5:14 pm
Full Name: Harvey Carel
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by csydas » Sep 01, 2018 7:55 pm

@gostev

Not to doubt, but is there a Microsoft document acknowledging the 1GB/1TB recommendation Veeam offers? Or a Veeam document? In our tests, we're seeing that this seems to be true, but it's hard to justify costs without some hard evidence to point to from either Veeam or Microsoft that says this is best practice. Sorry, budget boards don't accept forum posts :(

Even Oracle publishes a calculation for ZFS, and while it's RAM hungry, at least I can point and say "hey, this is what Oracle says on the matter".

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 02, 2018 2:08 pm

Microsoft definitely does not have one. They only acknowledge that there are in fact known issues with ReFS memory management on large file deletions.

Ctek
Service Provider
Posts: 69
Liked: 9 times
Joined: Nov 11, 2015 3:50 pm
Location: Canada
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Ctek » Sep 02, 2018 3:47 pm

Ctek wrote:Gostev's good news update from the email digest about ReFS:

Microsoft ReFS users: some good news came out of nowhere! As you know, while overall ReFS stability has much improved, there's still one major issue with the ReFS driver memory management, which causes kernel memory usage to spike on large backup file deletions, sometimes causing server lockups. The workaround for this issue has been to throw lots of RAM at the backup repository server (1GB RAM per 1TB storage), which is obviously not ideal. Per Microsoft, this issue was resolved in the RS5 build (aka Windows Server 2019) quite a while ago, but initially they did not plan to port it back to the RS1 build (aka Windows Server 2016) due to some significant complexities of this process.

However, it looks like the support case pressure made them change their mind - because during my regular status check with the ReFS team last week, something totally unexpected came up. Apparently, they've been working on backporting this and a few other ReFS performance fixes to RS1 - and the corresponding package is just around the corner! I certainly did not see that coming! So be on the lookout for KB434884 in the next few days - this should be the update's name. However, given that the update brings significant code changes, I would obviously advise against jumping on it immediately, unless your ReFS repositories are misbehaving anyway, and would instead wait until we've had a chance to put it through its paces in our test labs first.


Some good news indeed.
I just re-read my post: is there a typo in the KB number? I believe KB update numbers from MS have normally had 7 digits for a while now.
VMCE 9 Certified - Systems Administrator

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 02, 2018 9:15 pm

Good catch, it is missing a digit. The correct name is KB4343884 and the Windows Server 2016 package is here. Thanks!
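The digit-count check that caught this typo can be sketched in a few lines. It assumes, as suggested above, that current KB identifiers have 7 digits; that is a heuristic taken from this thread rather than a documented rule (older KB articles have fewer digits), and the function name is made up for illustration.

```python
import re

# Matches identifiers such as "KB4343884" and captures the digits.
KB_PATTERN = re.compile(r"\bKB(\d+)\b", re.IGNORECASE)


def suspicious_kb_ids(text, expected_digits=7):
    """Return KB identifiers in text whose digit count differs from expected."""
    return [
        match.group(0)
        for match in KB_PATTERN.finditer(text)
        if len(match.group(1)) != expected_digits
    ]


print(suspicious_kb_ids("So be on the lookout for KB434884"))  # ['KB434884']
print(suspicious_kb_ids("The correct name is KB4343884"))      # []
```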

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Sep 03, 2018 1:22 am

I visited the url for the update. Is it normal that fireworks display when downloading ? On a serious note I don’t see any mention of refs fixes ?

soehl
Enthusiast
Posts: 52
Liked: 8 times
Joined: May 09, 2011 12:43 pm
Full Name: Sebastian
Location: Germany
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by soehl » Sep 03, 2018 8:00 am 1 person likes this post

A new version of refs.sys is included:
[Image not preserved]

:)

ejenner
Expert
Posts: 397
Liked: 57 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 03, 2018 9:13 am 1 person likes this post

csydas wrote:Not to doubt, but is there a Microsoft document acknowledging the 1GB/1TB recommendation Veeam offers? Or a Veeam document? In our tests, we're seeing that this seems to be true, but it's hard to justify costs without some hard evidence to point to from either Veeam or Microsoft that says this is best practice. Sorry, budget boards don't accept forum posts :(
It has been pointed out already, that the savings in terms of disk space (and disks you have to buy) make up for the slightly higher memory requirement.

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 03, 2018 11:37 am

kubimike wrote:On a serious note I don’t see any mention of refs fixes ?
It's a good tradition already :D "the file system whose name should not be spoken"

mkretzer
Expert
Posts: 553
Liked: 124 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Sep 03, 2018 11:54 am

For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green » Sep 03, 2018 12:15 pm

mkretzer wrote:For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Which driver are you writing about? Are you talking about the new one that has just been released?
Many thanks

Iain Green

ejenner
Expert
Posts: 397
Liked: 57 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 03, 2018 1:15 pm

mkretzer wrote:For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Also, 'nearly crashing' just because it has been heavily loaded with jobs is not the same as a STOP error. The problem there has slightly different symptoms to what seems to be the typical REFS issue.

The repository we were having issues with has stopped crashing after adding more memory. We have the recommended amount now and it hasn't crashed since the beginning of August. Although that could be due to having taken it down for the memory upgrade. The longest it went without crashing before was 28 days... so I'd have to wait at least that long before being completely sure.

ejenner
Expert
Posts: 397
Liked: 57 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 03, 2018 1:30 pm

mkretzer wrote:For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Was just wondering though, the way you're describing the memory usage made me think your configuration might be the cause. Is that possible? I've just had a look at my repository and found most of my VBK files are a reasonable size. But unexpectedly, one of my files is 7TB because I accidentally configured an 'entire computer' backup of a cluster node which was holding the cluster fileserver role. It's not a live file any longer, as I realized at the time that I'd messed up. But is it possible your repository is full of such files and the way you've configured your backups is really pushing the limits as to what can be handled causing your setup to struggle? Just a thought, it's a long thread, you may already have looked into that sort of thing. All my other files are under a terabyte.

mkretzer
Expert
Posts: 553
Liked: 124 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Sep 03, 2018 7:10 pm

Iain_Green wrote: Which driver are you writing about? Are you talking about the new one that has just been released?
Yes, we installed this morning and tested.

mkretzer
Expert
Posts: 553
Liked: 124 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Sep 03, 2018 7:13 pm

ejenner wrote: But is it possible your repository is full of such files and the way you've configured your backups is really pushing the limits as to what can be handled causing your setup to struggle? Just a thought, it's a long thread, you may already have looked into that sort of thing. All my other files are under a terabyte.
No! We just started using this repo&proxy and only backed up VMs < 100 GB (~200 VMs now) with per-VM backup files.

We just created an MS Premier case - let's see what they say...

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 03, 2018 8:49 pm

Please send me the Microsoft case ID over PM. Hopefully it's something else in your case, as I have not yet seen ReFS consume 100 GB and misbehave with a mere 2 TB worth of backups written to it - even in its early days. We could never reproduce the issues internally in small labs of this kind.

mkretzer
Expert
Posts: 553
Liked: 124 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Sep 04, 2018 7:59 am

Done. If you want you can look at it yourself...

jarzi
Service Provider
Posts: 8
Liked: 1 time
Joined: Aug 29, 2016 1:57 pm
Full Name: Jarno Arajärvi
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by jarzi » Sep 04, 2018 10:38 am

For me, the problem with ReFS is that it's very slow.

Case: 03146435

I have created copy pools with 64KB REFS repos, and the copy jobs run very slowly and erratically. If I use NTFS repos on the same server and the same storage, they work fine. Merges are not exactly quick either.

The newest KB didn't help.

soehl
Enthusiast
Posts: 52
Liked: 8 times
Joined: May 09, 2011 12:43 pm
Full Name: Sebastian
Location: Germany
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by soehl » Sep 04, 2018 1:41 pm

Unfortunately, the KB4343884 patch is definitely not the holy grail; I didn't see any performance enhancement.
The main problem, that the OS becomes very sluggish on Fast Clone operations, is still there. :?

Gostev
SVP, Product Management
Posts: 24787
Liked: 3519 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 04, 2018 3:08 pm

Just to correct the expectations for this latest KB: the issues that we've been working on with Microsoft for the past two years were not related to the performance of Active Fulls or fast clone operations, but rather to OS stability during retention processing, specifically when deleting large amounts of backup files with ReFS block cloning in use. This was causing the system to completely freeze (clock not updating in the task bar), and often to BSOD. These are the issues that were being discussed in this topic.

I don't expect the patch to fix any other issues, except perhaps implicitly, and actually I was not aware that the two specific ones mentioned above existed. For example, we have definitely not seen full backup performance problems in our labs, and the only time we saw fast clone performance issues was when the corresponding regression was temporarily introduced in the May 2018 Windows update.

So I would say, these issues will have to be investigated with Microsoft separately.

kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Sep 04, 2018 3:16 pm

mkretzer wrote:Yes, we installed this morning and tested.
Does it work ?
