-
- Enthusiast
- Posts: 58
- Liked: never
- Joined: Aug 24, 2010 8:55 am
- Full Name: Alex
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
We've been using a 36TB ReFS repo, for Backup Copy jobs only, with much success for almost a year now. It holds about 500 VMs spread over 50+ jobs (5-day retention, since the copy is for disaster recovery purposes only). We're using a 4k cluster size, by the way. The February update fixed our slowdowns; we noticed no CPU and/or memory issues even before the February update, just slowdowns of the fast clone part. We never experienced any freeze whatsoever, as far as I can recall.
Earlier this month we added a new volume (4U60G2), this one 400TB using 64k blocks, and started moving large jobs to it: about 10 jobs of 2.5TB with 100GB incrementals and another 5 jobs of 7TB with 200GB incrementals. Almost from the start we experienced slowdowns and freezes. We tried a lot (miscellaneous ReFS registry settings, lowering concurrency, reverting the refs.sys driver) but still got freezes. Only when I limit the bandwidth in the repo setup do the freezes (mostly) seem to stop.
There is a lot posted, both here and around the net, but I am a little confused about the current state of affairs:
a) Are any ReFS registry settings recommended and/or needed?
b) Could it be that with 4k blocks I would have fewer freezes than with 64k blocks (disregarding any possible CPU/memory issues, which should have been resolved with the February patch anyway)?
c) Should I still expect any freezes, assuming I have no throttling in place, any registry settings in place if needed, and the 'correct' ReFS driver (February)?
d) Would using per-VM files make a difference?
e) Is the ingestion rate the culprit, or the large files/deltas/deletions?
Any help would be greatly appreciated. We are in the process of buying a Cisco S3260 for our primary backup jobs and would hate to see the same problems on that box, since it will also be used for our largest jobs.
Regards,
Alex
-
- Enthusiast
- Posts: 73
- Liked: 8 times
- Joined: Sep 29, 2017 8:07 pm
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
The May and June 2018 cumulative updates for Server 2016 really hurt performance, from what I can tell from this thread (both updated the ReFS driver). I don't think any registry tweaks were recommended after the February fix.
-
- SVP, Product Management
- Posts: 23602
- Liked: 3113 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@Mgamerz that is correct; moreover, it was recommended to remove any prior registry tweaks.
@Alex how much RAM do you have on your 400TB repository server? Probably not 10x more than on the server with the 36TB repo, right? This could be the culprit: complaints have largely stopped since the February ReFS update, but I know Microsoft was still working to optimize ReFS memory consumption, and they told someone who was still having issues that those optimizations should help in his case.
-
- Service Provider
- Posts: 13
- Liked: 1 time
- Joined: May 14, 2013 1:46 am
- Full Name: KFM
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
billcouper wrote:@KFM
I have found the only reliable way to delete files is through the operating system. I just log in to a repository server and delete the files/folders/whatever I need using Explorer, then run a rescan on the associated SOBR in Veeam. When I delete files the server runs at high CPU/RAM for a while, and if I keep refreshing Disk Management I can see the amount of free space going up slowly. This always works. I have never had a repo server freeze doing it through Explorer.

Hi Bill,
You're luckier than I! I see the same behaviour as you when deleting files through Explorer, except that on most occasions where I'm deleting a large number of large files (3TB+) the system will eventually hang and a reset is the only way to recover.
A lot of the focus of this thread is on high memory or slow clone/transforms with not a lot on the server lockups, which leads me to ask if this is even the right forum or should I be opening a case with Microsoft?
-
- Service Provider
- Posts: 62
- Liked: 15 times
- Joined: Dec 18, 2017 8:58 am
- Full Name: Bill Couper
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@KFM
Things that helped with server freezes during backup in our environment:
* Lower the limit of tasks per extent.
* Lower the limit of tasks per backup proxy.
* If you have 100% CPU usage (on the repo server) for an extended period during backup, add more vCPUs.
* If you have high memory pressure (on the repo server) for an extended period during backup, add more GB of RAM.
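One way to make "for an extended period" concrete is to look for a run of consecutive high samples rather than a single spike. This is only a minimal sketch: it assumes you already collect utilisation samples somehow (for example from Performance Monitor), and the function names, the 95% threshold and the sample counts are illustrative assumptions, not Veeam or Microsoft guidance.

```python
from typing import Iterable


def longest_run_above(samples: Iterable[float], threshold: float) -> int:
    """Return the length of the longest run of consecutive samples >= threshold."""
    longest = current = 0
    for value in samples:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken by a sample below the threshold
    return longest


def is_sustained(samples: Iterable[float], threshold: float, min_samples: int) -> bool:
    """True if usage stayed at or above threshold for at least min_samples in a row."""
    return longest_run_above(samples, threshold) >= min_samples


if __name__ == "__main__":
    # Hypothetical CPU readings (percent), one per minute, during a backup window.
    cpu = [40, 98, 100, 100, 99, 100, 35, 100, 20]
    print(is_sustained(cpu, threshold=95, min_samples=5))
```

The same check works for memory pressure if you feed it "percent of RAM in use" samples instead of CPU readings.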
-
- Enthusiast
- Posts: 58
- Liked: never
- Joined: Aug 24, 2010 8:55 am
- Full Name: Alex
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
@Gostev:
It is the same server that got the extra volume, so obviously the physical memory stayed the same: 64GB. Free RAM went from 60-70% down to 40-50% over the last month. As stated, I experience freezes without CPU issues (2 sockets, 12 cores each, CPU usage hardly ever above 10%) and without memory pressure.
Last night I removed all registry settings except RefsEnableLargeWorkingSetTrim. Previously I had only (manually) replaced refs.sys; this time I also replaced the refsv1.sys driver and rebooted. Now, 12 hours later, it seems better.
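To put rough numbers on the RAM question, here is a quick calculation of RAM per TB of ReFS volume capacity before and after the second volume was added, using only the figures from this thread (64GB RAM, 36TB and 400TB volumes). It is plain arithmetic on volume capacity, not on data actually stored, and not a sizing rule; the helper name is made up for illustration.

```python
from typing import Sequence


def ram_gb_per_tb(ram_gb: float, volume_sizes_tb: Sequence[float]) -> float:
    """RAM (GB) available per TB of total ReFS volume capacity on one server."""
    return ram_gb / sum(volume_sizes_tb)


if __name__ == "__main__":
    before = ram_gb_per_tb(64, [36])       # original 36TB repo only
    after = ram_gb_per_tb(64, [36, 400])   # 36TB repo plus the new 400TB volume
    print(f"before: {before:.2f} GB per TB, after: {after:.2f} GB per TB")
    print(f"capacity grew {(36 + 400) / 36:.1f}x with the same RAM")
```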
-
- Enthusiast
- Posts: 73
- Liked: 8 times
- Joined: Sep 29, 2017 8:07 pm
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Is the refsv1 driver supposed to be replaced? Some of the earlier instructions didn't mention it, so I'm not sure whether I was supposed to replace that one as well (I only replaced refs.sys).
-
- Novice
- Posts: 7
- Liked: never
- Joined: Jun 26, 2018 11:33 pm
- Full Name: Raleigh
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
KFM wrote:I certainly hope so! We're on 10.0.14393.2097 and I still have problems with high CPU locking the server up. I can isolate this to outside of Veeam by simply deleting a large number of large (VBK) files in Windows File Explorer. The repository is passing down the UNMAPs to the underlying storage array (DisableDeleteNotify=0). An hour (or so) after the deletes the CPU on the repository servers goes to 100% and hangs the host. Reset is the only way to recover from it.
I'm assuming this is also what people are seeing? Just want to make sure we're on the same page with this ReFS problem, else I might have to open a support case directly with Microsoft.

I'm very new to Veeam (since late March). Yes, what you describe above is more or less what we're experiencing. During certain backup jobs, the repository server CPU will jump up to 30-60% (it bounces around), memory usage climbs to almost 50%, and the server is essentially unresponsive. It still responds to ping over the network, and if I happen to have a Remote Desktop session open to it, that screen will update and I can move the mouse around. However, I can't do much of anything else. I can't log into the server console. I can't gracefully restart the server. When the server enters this state it is essentially "crashed" for all practical purposes, and I have to hard reset it. I have had a case open with both Veeam Support and Microsoft Support for almost three months now, but there has been no resolution.
--Raleigh
-
- Influencer
- Posts: 21
- Liked: 3 times
- Joined: Oct 06, 2013 8:48 am
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Yes, it should be, from what I read.
-
- SVP, Product Management
- Posts: 23602
- Liked: 3113 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
KFM wrote:The repository is passing down the UNMAPs to the underlying storage array

Please note that ReFS does not support thin provisioning, TRIM/UNMAP, or Offloaded Data Transfer (ODX) features enabled on the underlying storage array serving as the backup target.
-
- Enthusiast
- Posts: 73
- Liked: 8 times
- Joined: Sep 29, 2017 8:07 pm
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Aye, our offsite server just locked up, I assume due to this issue. I had not yet downgraded the refs driver. On the bright side now I get to learn how to use IP KVM.
-
- Lurker
- Posts: 2
- Liked: 1 time
- Joined: Jun 19, 2015 5:23 pm
- Full Name: Robert Downs
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
I can confirm that back-revving refs.sys and refsv1.sys from 2312 to 2097 and 2214 respectively, on a fully patched Server 2016 (1607, build 14393.2339), returns the server to its former glory in my tests of the fast clone process. Next up: a production run.
Memory usage was much higher than with the 2312 version of the drivers, so I will be massaging this a little.
Want to also mention that none of the keys related to ReFS have been modified from their original settings on this server.
-
- Novice
- Posts: 7
- Liked: never
- Joined: Jun 26, 2018 11:33 pm
- Full Name: Raleigh
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
OK, my repository server just locked up again this morning. And, I just received an email update from Microsoft Support on my open ticket: "the engineer has been documenting the analysis at the moment, however the analysis and action plan are not completed yet." They've been analyzing this issue since I opened the ticket with them in mid-April. This is really getting old. I've had this ticket open with MS since mid-April, and they don't yet seem to have a clue as to what is causing the problem. Or, they know, and they're just not sharing with me...
After weeks and weeks of troubleshooting this issue on my own, I narrowed it down to a particular backup job, and then to a particular file server being backed up. With Veeam Support help, we identified the operation that caused the problem: deleting a large (~5TB) vbk file from the repository. This causes a problem only on nights when retention policy calls for a deletion of the oldest vbk chain. It's definitely not a Veeam software issue causing the problem: I can crash our repository server just by trying to delete the 5TB .vbk file manually, using Windows Explorer. It doesn't do this on every job; only on the job that involves the large .vbk file. Thus, there exists some threshold file size that causes this problem. My jobs that have 1.3 and 2.4 TB vbk files seem to run just fine. It's only the job with a 4.6 TB vbk file that causes the server to become unresponsive when retention policy calls for the deletion of that file.
Is this what others are experiencing? Backup jobs involving smaller (<4TB) .vbk files don't seem to cause the repository server to become unresponsive, while jobs with large .vbk files do.
FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume
--Raleigh
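For anyone who wants to check whether their own repository holds files in the size range described above, here is a minimal sketch that walks a folder and lists backup files at or above a given size. The 4TB cut-off is only the observation from this one server, not a documented ReFS limit, and the path and function name are hypothetical.

```python
import os
from typing import Iterator, Tuple

TB = 1024 ** 4  # bytes per tebibyte


def find_large_files(root: str, min_bytes: int, suffix: str = ".vbk") -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for files under root that end in suffix and are >= min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            if size >= min_bytes:
                yield path, size


if __name__ == "__main__":
    # Hypothetical repository path; replace with your own.
    for path, size in find_large_files(r"E:\Backups", 4 * TB):
        print(f"{size / TB:6.2f} TB  {path}")
```

Passing suffix=".vib" lists large incrementals instead, in case those turn out to matter more than the fulls.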
-
- Enthusiast
- Posts: 58
- Liked: never
- Joined: Aug 24, 2010 8:55 am
- Full Name: Alex
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
I have a feeling it is more the .vib size that is causing the trouble than the .vbk size. Could that also be the case in your situation, Raleigh?
-
- Novice
- Posts: 3
- Liked: 4 times
- Joined: Jun 20, 2016 8:30 am
- Full Name: Jesper Sorensen
- Contact:
Re: REFS issues (server lockups, high CPU, high RAM)
Raleigh wrote:OK, my repository server just locked up again this morning. And, I just received an email update from Microsoft Support on my open ticket: "the engineer has been documenting the analysis at the moment, however the analysis and action plan are not completed yet." They've been analyzing this issue since I opened the ticket with them in mid-April. This is really getting old. I've had this ticket open with MS since mid-April, and they don't yet seem to have a clue as to what is causing the problem. Or, they know, and they're just not sharing with me...
After weeks and weeks of troubleshooting this issue on my own, I narrowed it down to a particular backup job, and then to a particular file server being backed up. With Veeam Support help, we identified the operation that caused the problem: deleting a large (~5TB) vbk file from the repository. This causes a problem only on nights when retention policy calls for a deletion of the oldest vbk chain. It's definitely not a Veeam software issue causing the problem: I can crash our repository server just by trying to delete the 5TB .vbk file manually, using Windows Explorer. It doesn't do this on every job; only on the job that involves the large .vbk file. Thus, there exists some threshold file size that causes this problem. My jobs that have 1.3 and 2.4 TB vbk files seem to run just fine. It's only the job with a 4.6 TB vbk file that causes the server to become unresponsive when retention policy calls for the deletion of that file.
Is this what others are experiencing? Backup jobs involving smaller (<4TB) .vbk files don't seem to cause the repository server to become unresponsive, while jobs with large .vbk files do.
FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume
--Raleigh

FWIW, we had similar issues with large .vbk files (some of ours are in excess of 60TB) and we pretty much resolved them with refs.sys 2097 AND adding more RAM to the Veeam server.
Basically, the 2097 driver eliminated the performance issues and the added RAM helped with the server crashes.