-
- Service Provider
- Posts: 19
- Liked: 7 times
- Joined: Jan 24, 2018 12:08 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Gostev, thanks for the reply. That's pretty consistent with what we see... I'm more asking how to identify the issue with real stats. I don't know too much about the inner workings of Windows, but I'm pretty sure kernel memory usage is shown in Task Manager as the "Paged pool" and "Non-paged pool" values.
Our current situation is that we have 8 backup servers running 12 extents in one SoBR. As I said most of the servers are well over 1GB / 1TB. We have 2 that run 48GB of RAM with 60TB of space, but those rarely give us issues. When we see that the metadata operations are hanging, it would be useful to be able to look somewhere to know for certain which one is causing the hold up, force a reboot, and let things pick up the pieces.
-
- Service Provider
- Posts: 84
- Liked: 13 times
- Joined: Nov 11, 2015 3:50 pm
- Location: Canada
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Gostev's good news update from the email digest about ReFS:
Microsoft ReFS users: some good news came out of nowhere! As you know, while overall ReFS stability has much improved, there's still one major issue with the ReFS driver memory management, which causes kernel memory usage to spike on large backup file deletions, sometimes causing server lockups. The workaround for this issue has been to throw lots of RAM at the backup repository server (1GB RAM per 1TB storage), which is obviously not ideal. Per Microsoft, this issue was resolved in the RS5 build (aka Windows Server 2019) quite a while ago, but initially they did not plan to port it back to the RS1 build (aka Windows Server 2016) due to some significant complexities of this process.
However, it looks like the support case pressure made them change their mind - because during my regular status check with the ReFS team last week, something totally unexpected came up. Apparently, they've been working on backporting this and few other ReFS performance fixes to RS1 – and the corresponding package is just around the corner! I certainly did not see that coming! So be on a lookout for KB434884 in the next few days - this should be the update's name. However, given that the update brings significant code changes, I would obviously advise against jumping it immediately unless your ReFS repositories are misbehaving anyway – and waiting until we've had a chance to put it through its paces in our test labs first.
Some good news indeed.
VMCE
-
- Veteran
- Posts: 636
- Liked: 100 times
- Joined: Mar 23, 2018 4:43 pm
- Full Name: EJ
- Location: London
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Just come back from the server room after installing the extra RAM.
Our Veeam system is still being deployed and not yet production so happy to update the driver to the latest version and watch what happens.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
I've been lurking. I'm still using the second beta with zero issues. Do we still not have a working driver?
This boot time is from my Veeam server. You can see how stable it is!
System Boot Time: 4/3/2018, 12:31:40 AM
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
This stability may be at the cost of file system reliability though. Troubleshooting stage builds sometimes have a number of potentially problematic function calls commented out in an effort to try and pinpoint the issue. I personally would definitely not risk running beta drivers in production, especially when it comes to my backup storage...
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@gostev I don't know whether to laugh or cry after reading what you said. So our options are:
1. Run public release drivers that cause the system to freeze, in my case (0x133) so many times that it causes corruption at the file level and hardware level, to the point the RAID crashes. Get no sleep and babysit all jobs all the time.
2. Run a beta driver that has worked for several months now and that I've done restores from flawlessly.
3. Don't use REFS at all.
-
- Service Provider
- Posts: 158
- Liked: 9 times
- Joined: Dec 05, 2014 2:13 pm
- Full Name: Iain Green
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
kubimike wrote: 3. don't use REFS at all.
Pretty much where we are at the minute. We have two new 600 TB repos now being set up with NTFS, as we have no faith in REFS.
Many thanks
Iain Green
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@kubimike just curious, what makes you think that the current ReFS driver is as bad as the one you used last year? Most of our customers are running the up-to-date ReFS driver with great success. It's only important to have sufficient RAM (1GB per 1TB is the recommendation) to avoid system freezes on large file deletions.
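To make the sizing rule concrete, here is a minimal sketch of the arithmetic. The function names are made up for this illustration, and the 1 GB of RAM per 1 TB of storage ratio is only the rule of thumb quoted in this thread, not a documented vendor formula.

```python
def recommended_ram_gb(storage_tb, gb_per_tb=1.0):
    """RAM (in GB) suggested by the rule of thumb: gb_per_tb GB per TB of repository storage."""
    return storage_tb * gb_per_tb


def ram_shortfall_gb(installed_ram_gb, storage_tb, gb_per_tb=1.0):
    """How many GB a server falls short of the rule of thumb (0.0 if it meets or exceeds it)."""
    return max(0.0, recommended_ram_gb(storage_tb, gb_per_tb) - installed_ram_gb)


# Figures mentioned earlier in this thread: 48 GB of RAM in front of 60 TB of space.
print(ram_shortfall_gb(48, 60))  # 12.0
```

By this measure a 48 GB server with 60 TB of space sits 12 GB under the guideline, which by itself does not say whether that server will actually misbehave.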
-
- Veeam Legend
- Posts: 1202
- Liked: 416 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@gostev After so much time I return to the biggest thread I ever started, and I must say I share Iain_Green's POV.
Today we gave REFS a fourth chance on a freshly set up system with the latest patches (512 TB of RAM) and a freshly formatted volume. Many of the old problems came up right away. We started to back up ~200 VMs at the same time, and after 20 minutes the whole system that hosts the REFS volume started hanging from time to time. No total crash, thankfully.
Memory usage went from 9 GB to 120 GB. As soon as memory went above 70 GB, the volume was no longer "browsable" via Explorer.
Then I set the three common registry settings, and now the RAM usage goes up to 140 GB, but the volume still becomes unusable under normal write load.
The start and finish of the individual VM backups take forever... The graph shows a high write rate and then 0 for 5 minutes.
Markus
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@gostev just from reading this forum. Looks like KB434884 is the answer we've been waiting for. I'll actually try that patch out. I wonder if that fix is in my beta 2 driver I've had all along.
-
- Expert
- Posts: 193
- Liked: 47 times
- Joined: Jan 16, 2018 5:14 pm
- Full Name: Harvey Carel
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@gostev
Not to doubt, but is there a Microsoft document acknowledging the 1GB/1TB recommendation Veeam offers? Or a Veeam document? In our tests, we're seeing that this seems to be true, but it's hard to justify costs without some hard evidence to point to from either Veeam or Microsoft that says this is best practice. Sorry, budget boards don't accept forum posts
Even Oracle publishes a calculation for ZFS, and while it's RAM hungry, at least I can point and say "hey, this is what Oracle says on the matter".
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Microsoft definitely does not have one. They only acknowledge that there are in fact known issues with ReFS memory management on large file deletions.
-
- Service Provider
- Posts: 84
- Liked: 13 times
- Joined: Nov 11, 2015 3:50 pm
- Location: Canada
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
I just re-read my post. Is there a typo in the KB number? I believe KB updates from MS have normally had 7 digits for a while now.
Ctek wrote: Gostev's good news update from the email digest about ReFS:
Microsoft ReFS users: some good news came out of nowhere! As you know, while overall ReFS stability has much improved, there's still one major issue with the ReFS driver memory management, which causes kernel memory usage to spike on large backup file deletions, sometimes causing server lockups. The workaround for this issue has been to throw lots of RAM at the backup repository server (1GB RAM per 1TB storage), which is obviously not ideal. Per Microsoft, this issue was resolved in the RS5 build (aka Windows Server 2019) quite a while ago, but initially they did not plan to port it back to the RS1 build (aka Windows Server 2016) due to some significant complexities of this process.
However, it looks like the support case pressure made them change their mind - because during my regular status check with the ReFS team last week, something totally unexpected came up. Apparently, they've been working on backporting this and few other ReFS performance fixes to RS1 – and the corresponding package is just around the corner! I certainly did not see that coming! So be on a lookout for KB434884 in the next few days - this should be the update's name. However, given that the update brings significant code changes, I would obviously advise against jumping it immediately unless your ReFS repositories are misbehaving anyway – and waiting until we've had a chance to put it through its paces in our test labs first.
Some good news indeed.
VMCE
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
I visited the URL for the update. Is it normal that fireworks display when downloading? On a serious note, I don't see any mention of ReFS fixes?
-
- Enthusiast
- Posts: 57
- Liked: 8 times
- Joined: May 09, 2011 12:43 pm
- Full Name: Sebastian
- Location: Germany
- Contact:
-
- Veteran
- Posts: 636
- Liked: 100 times
- Joined: Mar 23, 2018 4:43 pm
- Full Name: EJ
- Location: London
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
csydas wrote: Not to doubt, but is there a Microsoft document acknowledging the 1GB/1TB recommendation Veeam offers? Or a Veeam document? In our tests, we're seeing that this seems to be true, but it's hard to justify costs without some hard evidence to point to from either Veeam or Microsoft that says this is best practice. Sorry, budget boards don't accept forum posts
It has already been pointed out that the savings in terms of disk space (and disks you have to buy) make up for the slightly higher memory requirement.
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
kubimike wrote: On a serious note I don’t see any mention of refs fixes ?
It's a good tradition already: "the file system whose name should not be spoken"
-
- Veeam Legend
- Posts: 1202
- Liked: 416 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
-
- Service Provider
- Posts: 158
- Liked: 9 times
- Joined: Dec 05, 2014 2:13 pm
- Full Name: Iain Green
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
mkretzer wrote: For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Which driver are you writing about? Are you talking about the new one that has just been released?
Many thanks
Iain Green
-
- Veteran
- Posts: 636
- Liked: 100 times
- Joined: Mar 23, 2018 4:43 pm
- Full Name: EJ
- Location: London
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
mkretzer wrote: For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Also, 'nearly crashing' just because it has been heavily loaded with jobs is not the same as a STOP error. The problem there has slightly different symptoms to what seems to be the typical REFS issue.
The repository we were having issues with has stopped crashing after adding more memory. We have the recommended amount now and it hasn't crashed since the beginning of August. Although that could be due to having taken it down for the memory upgrade. The longest it went without crashing before was 28 days... so I'd have to wait at least that long before being completely sure.
-
- Veteran
- Posts: 636
- Liked: 100 times
- Joined: Mar 23, 2018 4:43 pm
- Full Name: EJ
- Location: London
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
mkretzer wrote: For us the new driver changes nothing. Memory still goes way over 100 GB (the total data written was ~2 TB), and above 70 GB of memory usage the filesystem is no longer browsable for some time...
Was just wondering though, the way you're describing the memory usage made me think your configuration might be the cause. Is that possible? I've just had a look at my repository and found most of my VBK files are a reasonable size. But unexpectedly, one of my files is 7TB because I accidentally configured an 'entire computer' backup of a cluster node which was holding the cluster fileserver role. It's not a live file any longer, as I realized at the time that I'd messed up. But is it possible your repository is full of such files and the way you've configured your backups is really pushing the limits as to what can be handled causing your setup to struggle? Just a thought, it's a long thread, you may already have looked into that sort of thing. All my other files are under a terabyte.
-
- Veeam Legend
- Posts: 1202
- Liked: 416 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Iain_Green wrote: Which driver are you writing about? Are you talking about the new one that has just been released?
Yes, we installed this morning and tested.
-
- Veeam Legend
- Posts: 1202
- Liked: 416 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
ejenner wrote: But is it possible your repository is full of such files and the way you've configured your backups is really pushing the limits as to what can be handled causing your setup to struggle? Just a thought, it's a long thread, you may already have looked into that sort of thing. All my other files are under a terabyte.
No! We just started using this repo and proxy, and have only backed up VMs < 100 GB (~200 VMs now) with per-VM backup files.
We just created an MS Premier case. Let's see what they say...
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Please send me the Microsoft case ID over PM. Hopefully it's something else in your case, as I have not yet seen ReFS consume 100 GB and misbehave with a mere 2TB worth of backups written to it, even in its early days. We could never reproduce the issues internally in small labs of this kind.
-
- Veeam Legend
- Posts: 1202
- Liked: 416 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Done. If you want you can look at it yourself...
-
- Service Provider
- Posts: 22
- Liked: 3 times
- Joined: Aug 29, 2016 1:57 pm
- Full Name: Jarno Arajärvi
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
For me, the problem with ReFS is that it's very slow.
Case: 03146435
I have created copy pools with 64KB REFS repos and the copy jobs run very slow and erratically. If I use NTFS repos on the same server on same storage, they work fine. Merges are not exactly quick either.
Newest KB didn't help.
-
- Enthusiast
- Posts: 57
- Liked: 8 times
- Joined: May 09, 2011 12:43 pm
- Full Name: Sebastian
- Location: Germany
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Unfortunately, the KB434884 patch is definitely not the holy grail; I didn't see a performance enhancement.
The main problem, that the OS becomes very sluggish on Fast Clone operations, is still there.
-
- Chief Product Officer
- Posts: 31783
- Liked: 7283 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Just to correct the expectations for this latest KB: the issues that we've been working on with Microsoft for the past two years were not related to the performance of Active Fulls or fast clone operations, but rather to OS stability problems caused by retention processing, specifically deleting a large number of backup files with ReFS block cloning in use. This was causing the system to completely freeze (clock not updating in the task bar), and often BSOD. These are the issues that were being discussed in this topic.
I don't expect the patch to be fixing any other issues except perhaps implicitly, and actually I've not been aware of the two specific ones mentioned above to exist. For example, we have definitely not seen full backup performance problems in our labs, and the only times we saw fast clone performance issues was when the corresponding regression was temporarily introduced in the May 2018 Windows update.
So I would say, these issues will have to be investigated with Microsoft separately.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
mkretzer wrote: Yes, we installed this morning and tested.
Does it work?