AlexL
Enthusiast
Posts: 58
Liked: never
Joined: Aug 24, 2010 8:55 am
Full Name: Alex

Re: REFS issues (server lockups, high CPU, high RAM)

Post by AlexL » Jun 27, 2018 8:35 pm

We've been using a 36TB ReFS repo, for Backup Copy jobs only, with much success for almost a year now. It holds about 500 VMs spread over 50+ jobs (5-day retention, since the copy is for disaster recovery purposes only). We use a 4k cluster size, by the way. The February update fixed our slowdowns; we noticed no CPU or memory issues even before that update, just slowdowns of the fast clone part. We never experienced any freeze whatsoever, as far as I can recall.
Earlier this month we added a new volume (4U60G2), this one 400TB using 64k blocks, and started moving large jobs to it: about 10 jobs of 2.5TB with 100GB incrementals and another 5 jobs of 7TB with 200GB incrementals. Almost from the start we experienced slowdowns and freezes. We tried a lot (miscellaneous ReFS registry settings, lowering concurrency, reverting the refs.sys driver) but still got freezes. Only when I limit the bandwidth in the repo setup do the freezes (mostly) seem to stop.

There is a lot posted, both here and around the net, but I am a little confused about the current state of affairs:

a) Are any ReFS registry settings recommended and/or needed?
b) Could it be that with 4k blocks I would have fewer freezes than with 64k blocks (disregarding any possible CPU/memory issues, which should have been resolved by the February patch anyway)?
c) Should I still expect any freezes, given that I have no throttling in place, any registry settings in place if needed, and the 'correct' ReFS driver (February)?
d) Would using per-VM files make a difference?
e) Is the ingestion rate the culprit, or the large files/deltas/deletions?

Any help would be greatly appreciated. We are in the process of buying a Cisco S3260 for our primary backup jobs and would hate to see the same problems on that box, since it will also be used for our largest jobs.

Regards,
Alex

Mgamerz
Enthusiast
Posts: 54
Liked: 7 times
Joined: Sep 29, 2017 8:07 pm

Post by Mgamerz » Jun 27, 2018 10:00 pm

The May and June 2018 cumulative updates for Server 2016 really hurt performance, from what I can tell on this thread (both updated the ReFS driver). I don't think any registry tweaks were recommended after the February fix.

Gostev
Veeam Software
Posts: 22808
Liked: 2801 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jun 27, 2018 10:24 pm

@Mgamerz that is correct; moreover, the recommendation was to remove any prior registry tweaks.

@Alex how much RAM do you have on your 400TB repository server? Probably not 10x more than on the server with the 36TB repo, right? This could be the culprit. Complaints have largely stopped since the February ReFS update; however, I know Microsoft was still working to optimize ReFS memory consumption, and they told someone who was still having issues that those optimizations should help his case.

KFM
Service Provider
Posts: 13
Liked: 1 time
Joined: May 14, 2013 1:46 am
Full Name: KFM

Post by KFM » Jun 28, 2018 1:20 am

billcouper wrote:
@KFM
I have found the only reliable way to delete files is through the operating system. I just log in to a repository server and delete the files/folders/whatever I need using Explorer, then run a rescan on the associated SOBR in Veeam. When I delete files, the server runs at high CPU/RAM for a while, and if I keep refreshing Disk Management I can see the amount of free space going up slowly. This always works. I have never had a repo server freeze doing it through Explorer.

Hi Bill,

You're luckier than I! I see the same behaviour as you when deleting files through Explorer, except on most occasions where I'm deleting a large number of large files (3TB+) the system will eventually hang and a reset is the only way to recover.

A lot of the focus of this thread is on high memory or slow clone/transforms with not a lot on the server lockups, which leads me to ask if this is even the right forum or should I be opening a case with Microsoft?

billcouper
Service Provider
Posts: 55
Liked: 13 times
Joined: Dec 18, 2017 8:58 am
Full Name: Bill Couper

Post by billcouper » Jun 28, 2018 2:54 am

@KFM
Things that helped with server freezes during backup in our environment:
* Lower the limit of tasks per extent.
* Lower the limit of tasks per backup proxy.
* If you have 100% CPU usage (on the repo server) for an extended period during backup, add more vCPUs.
* If you have high memory pressure (on the repo server) for an extended period during backup, add more RAM.
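
The last two rules of thumb above could be sketched roughly as follows in Python. This is only an illustration: the thresholds and the length of an "extended period" are hypothetical (the post gives no numbers), and the utilisation samples are assumed to come from whatever monitoring is already in place on the repo server.

```python
# Hypothetical thresholds; the post only says "100% CPU" and "high memory
# pressure" for "an extended period", without giving numbers.
CPU_SATURATED_PCT = 95.0
MEM_PRESSURE_PCT = 90.0
SUSTAINED_SAMPLES = 10  # e.g. ten consecutive one-minute samples


def sustained(samples, threshold, window=SUSTAINED_SAMPLES):
    """Return True if `window` consecutive samples all reach the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= window:
            return True
    return False


def suggest(cpu_pct, mem_pct):
    """Map utilisation samples from a backup window to the hardware advice."""
    advice = []
    if sustained(cpu_pct, CPU_SATURATED_PCT):
        advice.append("add vCPUs to the repository server")
    if sustained(mem_pct, MEM_PRESSURE_PCT):
        advice.append("add RAM to the repository server")
    return advice
```

An empty result only means neither hardware rule fired; lowering the task limits per extent and per proxy is independent of these samples.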

AlexL
Enthusiast
Posts: 58
Liked: never
Joined: Aug 24, 2010 8:55 am
Full Name: Alex

Post by AlexL » Jun 28, 2018 8:19 am

@Gostev:
It is the same server that got the extra volume, so obviously the physical memory stayed the same: 64GB. Free RAM went from 60-70% down to 40-50% over the last month. As stated, I experience freezes without CPU issues (2 sockets, 12 cores each, CPU usage hardly ever above 10%) and without memory pressure.
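
To put the RAM question in numbers, here is a small Python sketch of RAM per TB of repository before and after the new volume. It assumes both volumes (36TB and 400TB) stay in use on the same 64GB server; it is only a ratio for comparison, not a sizing rule from Veeam or Microsoft.

```python
def ram_per_tb(ram_gb, repo_tb):
    """GB of physical RAM available per TB of repository capacity."""
    return ram_gb / repo_tb


RAM_GB = 64                               # physical memory, unchanged
before = ram_per_tb(RAM_GB, 36)           # original 36TB volume only
after = ram_per_tb(RAM_GB, 36 + 400)      # assumption: both volumes in use

print(f"{before:.2f} GB/TB -> {after:.2f} GB/TB ({before / after:.1f}x less)")
# prints: 1.78 GB/TB -> 0.15 GB/TB (12.1x less)
```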

Last night I removed all registry settings except RefsEnableLargeWorkingSetTrim. Until then I had only (manually) replaced refs.sys, so I also replaced the refsv1.sys driver and rebooted. Now, 12 hours later, it seems better.

Mgamerz
Enthusiast
Posts: 54
Liked: 7 times
Joined: Sep 29, 2017 8:07 pm

Post by Mgamerz » Jun 28, 2018 6:19 pm

Is the refsv1 driver supposed to be replaced? Some of the earlier instructions didn't mention it, so I'm not sure whether I was supposed to replace that one as well (I only replaced refs.sys).

Raleigh
Novice
Posts: 7
Liked: never
Joined: Jun 26, 2018 11:33 pm
Full Name: Raleigh

Post by Raleigh » Jun 28, 2018 11:30 pm

KFM wrote:
I certainly hope so! We're on 10.0.14393.2097 and I still have problems with high CPU causing the server to lock up. I can isolate this to outside of Veeam by simply deleting a large number of large (VBK) files in Windows File Explorer. The repository is passing down the UNMAPs to the underlying storage array (DisableDeleteNotify=0). An hour (or so) after the deletes, the CPU on the repository server goes to 100% and hangs the host. Reset is the only way to recover from it.

I'm assuming this is also what people are seeing? Just want to make sure we're on the same page with this ReFS problem, or else I might have to open a support case directly with Microsoft.

I'm very new to Veeam (since late March). Yes, what you describe above is more or less what we're experiencing. During certain backup jobs, the repository server CPU will jump up to 30-60% (it bounces around), memory usage climbs to almost 50%, and the server is essentially unresponsive. It still responds to ping over the network, and if I happen to have a Remote Desktop session open to it, that screen will update and I can move the mouse around. However, I can't do much of anything else. I can't log into the server console. I can't gracefully restart the server. When the server enters this state it is essentially "crashed" for all practical purposes, and I have to hard reset it. I have had a case open with both Veeam Support and Microsoft Support for almost three months now, but there has been no resolution.

--Raleigh

opg70
Influencer
Posts: 20
Liked: 3 times
Joined: Oct 06, 2013 8:48 am

Post by opg70 » Jun 29, 2018 7:39 am

Yes, it should be, from what I read.

Gostev
Veeam Software
Posts: 22808
Liked: 2801 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jun 29, 2018 3:07 pm

KFM wrote:
The repository is passing down the UNMAPs to the underlying storage array

Please note that ReFS does not support thin provisioning, TRIM/UNMAP, or Offloaded Data Transfer (ODX) features enabled on the underlying storage array serving as the backup target.
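
Whether delete notifications (TRIM/UNMAP) are being passed down can be checked with `fsutil behavior query DisableDeleteNotify`; a value of 0 means they are enabled. Below is a minimal Python sketch that parses that output. The exact output layout differs between Windows versions, so the sample text and the regular expression are assumptions, and `query_delete_notify` is just a hypothetical wrapper name.

```python
import re
import subprocess

# Matches lines such as "ReFS DisableDeleteNotify = 0" as well as the older
# single-line form "DisableDeleteNotify = 0" (layout assumed, see above).
_LINE = re.compile(r"^(?:(\w+)[ \t]+)?DisableDeleteNotify\s*=\s*(\d)", re.MULTILINE)


def parse_delete_notify(text):
    """Return {filesystem: True if delete notifications are enabled}."""
    result = {}
    for fs, value in _LINE.findall(text):
        # DisableDeleteNotify = 0 means notifications are NOT disabled.
        result[fs or "all"] = (value == "0")
    return result


def query_delete_notify():
    """Run fsutil on the local Windows host and parse what it prints."""
    out = subprocess.run(
        ["fsutil", "behavior", "query", "DisableDeleteNotify"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_delete_notify(out)


# Assumed sample output, used here instead of a live query.
SAMPLE = "NTFS DisableDeleteNotify = 0\nReFS DisableDeleteNotify = 1\n"
```

With the sample above, `parse_delete_notify(SAMPLE)` reports notifications enabled for NTFS and disabled for ReFS.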

Mgamerz
Enthusiast
Posts: 54
Liked: 7 times
Joined: Sep 29, 2017 8:07 pm

Post by Mgamerz » Jun 29, 2018 5:45 pm

Aye, our offsite server just locked up, I assume due to this issue. I had not yet downgraded the refs driver. On the bright side now I get to learn how to use IP KVM.

DesertBlizzard
Lurker
Posts: 2
Liked: 1 time
Joined: Jun 19, 2015 5:23 pm
Full Name: Robert Downs

Post by DesertBlizzard » Jun 29, 2018 8:00 pm

I can confirm that back-revving refs.sys and refsv1.sys from 2312 to 2097 and 2214 respectively, on a fully patched Server 2016 (1607, build 14393.2339), returns the server to its former glory in my tests of the fast clone process. Next up, a production run.

Memory usage was much higher than with the 2312 version of the drivers, so I will be massaging this a little.

I also want to mention that none of the registry keys related to ReFS have been modified from their original settings on this server.
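
For keeping track of which driver revisions have been mentioned so far, a small Python lookup like the one below may help. The table only restates what posters in this thread reported (it is not an official compatibility list), and the helper names are hypothetical. The file version string itself has to be read from the driver's file properties.

```python
# Revisions as reported by posters in this thread; not an official list.
REPORTS = {
    ("refs.sys", 2097): "fast clone performance reported good",
    ("refsv1.sys", 2214): "fast clone performance reported good",
    ("refs.sys", 2312): "slow fast clone reported",
    ("refsv1.sys", 2312): "slow fast clone reported",
}


def revision(version):
    """Last component of a file version such as '10.0.14393.2097'."""
    return int(version.strip().split(".")[-1])


def thread_report(driver, version):
    """Look up what this thread says about a given driver revision."""
    key = (driver.lower(), revision(version))
    return REPORTS.get(key, "not mentioned in this thread")
```

For example, `thread_report("refs.sys", "10.0.14393.2312")` returns the "slow fast clone reported" entry.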

Raleigh
Novice
Posts: 7
Liked: never
Joined: Jun 26, 2018 11:33 pm
Full Name: Raleigh

Post by Raleigh » Jun 29, 2018 11:00 pm

OK, my repository server just locked up again this morning. And, I just received an email update from Microsoft Support on my open ticket: "the engineer has been documenting the analysis at the moment, however the analysis and action plan are not completed yet." They've been analyzing this issue since I opened the ticket with them in mid-April. This is really getting old. I've had this ticket open with MS since mid-April, and they don't yet seem to have a clue as to what is causing the problem. Or, they know, and they're just not sharing with me...

After weeks and weeks of troubleshooting this issue on my own, I narrowed it down to a particular backup job, and then to a particular file server being backed up. With Veeam Support help, we identified the operation that caused the problem: deleting a large (~5TB) vbk file from the repository. This causes a problem only on nights when retention policy calls for a deletion of the oldest vbk chain. It's definitely not a Veeam software issue causing the problem: I can crash our repository server just by trying to delete the 5TB .vbk file manually, using Windows Explorer. It doesn't do this on every job; only on the job that involves the large .vbk file. Thus, there exists some threshold file size that causes this problem. My jobs that have 1.3 and 2.4 TB vbk files seem to run just fine. It's only the job with a 4.6 TB vbk file that causes the server to become unresponsive when retention policy calls for the deletion of that file.

Is this what others are experiencing? Backup jobs involving smaller (<4TB) .vbk files don't seem to cause the repository server to become unresponsive, while jobs with large .vbk files do.

FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume

--Raleigh
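
To see which backup files sit above a size like the one described, the repository could be scanned with a short Python sketch such as the following. The 4TB default mirrors the rough "<4TB seems fine" observation in this post and is not a confirmed limit; the function name and the choice of decimal terabytes are assumptions.

```python
from pathlib import Path

TB = 1000 ** 4  # decimal terabytes; an assumption, use 1024 ** 4 for TiB


def large_backups(root, min_bytes=4 * TB, suffix=".vbk"):
    """List (path, size) for files under `root` at or above `min_bytes`.

    Sizes are the logical file sizes reported by the filesystem, largest first.
    """
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() == suffix:
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `large_backups(r"E:\Backups")` (a hypothetical repository path) would list the full backup files whose deletion by retention is most likely to match the pattern described here; passing `suffix=".vib"` checks incrementals instead.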

AlexL
Enthusiast
Posts: 58
Liked: never
Joined: Aug 24, 2010 8:55 am
Full Name: Alex

Post by AlexL » Jun 30, 2018 7:49 pm

I have a feeling it is the .vib size, more than the .vbk size, that is causing the trouble. Could that also be the case in your situation, Raleigh?

jslic
Novice
Posts: 3
Liked: 4 times
Joined: Jun 20, 2016 8:30 am
Full Name: Jesper Sorensen

Post by jslic » Jul 02, 2018 6:25 am

Raleigh wrote:OK, my repository server just locked up again this morning. And, I just received an email update from Microsoft Support on my open ticket: "the engineer has been documenting the analysis at the moment, however the analysis and action plan are not completed yet." They've been analyzing this issue since I opened the ticket with them in mid-April. This is really getting old. I've had this ticket open with MS since mid-April, and they don't yet seem to have a clue as to what is causing the problem. Or, they know, and they're just not sharing with me...

After weeks and weeks of troubleshooting this issue on my own, I narrowed it down to a particular backup job, and then to a particular file server being backed up. With Veeam Support help, we identified the operation that caused the problem: deleting a large (~5TB) vbk file from the repository. This causes a problem only on nights when retention policy calls for a deletion of the oldest vbk chain. It's definitely not a Veeam software issue causing the problem: I can crash our repository server just by trying to delete the 5TB .vbk file manually, using Windows Explorer. It doesn't do this on every job; only on the job that involves the large .vbk file. Thus, there exists some threshold file size that causes this problem. My jobs that have 1.3 and 2.4 TB vbk files seem to run just fine. It's only the job with a 4.6 TB vbk file that causes the server to become unresponsive when retention policy calls for the deletion of that file.

Is this what others are experiencing? Backup jobs involving smaller (<4TB) .vbk files don't seem to cause the repository server to become unresponsive, while jobs with large .vbk files do.

FWIW, our backup repository server's basic specs:
Dell PowerEdge R740XD
16GB RAM
29TB (ReFS 64K) storage volume

--Raleigh

FWIW, we had similar issues with large .vbk files (some of ours are in excess of 60TB), and we pretty much resolved them with refs.sys 2097 AND adding more RAM to the Veeam server.
Basically, the 2097 driver eliminated the performance issues and the added RAM helped with the server crashes.
