kubimike
Expert
Posts: 334
Liked: 40 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kubimike » Sep 04, 2018 3:20 pm

Gostev wrote: Just to correct the expectations: the issues that we kept working on with Microsoft for the past two years were not related to the performance of Active Fulls or fast clone operations, but rather to OS stability during retention processing, specifically when deleting a large number of backup files with ReFS block cloning in use. This was causing the system to completely freeze (clock not updating in the task bar), and often BSOD. These are the issues that were being discussed in this topic.

I don't expect the patch to fix any other issues, and actually I was not aware that the two specific ones mentioned above exist. For example, we have definitely not seen full backup performance issues in our labs, and the only time we saw fast clone performance issues was when the corresponding regression was temporarily introduced in the May 2018 Windows Updates.
Crap. Can you find out exactly what they've done in the beta 2 version and compare it to what's been done today? I s*** you not, it's perfect, well, at least for me. The symptoms you describe above were occurring on my machine ("This was causing the system to completely freeze (clock not updating in the task bar), and often BSOD."). I have 50 TB of storage and 192 GB of RAM and the freezing still occurred. Granted, I haven't tried the latest driver, but from what I'm seeing it's still a no-go for me.

soehl
Enthusiast
Posts: 52
Liked: 8 times
Joined: May 09, 2011 12:43 pm
Full Name: Sebastian
Location: Germany
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by soehl » Sep 04, 2018 3:31 pm

Gostev wrote: Just to correct the expectations: the issues that we kept working on with Microsoft for the past two years were not related to the performance of Active Fulls or fast clone operations, but rather to OS stability during retention processing, specifically when deleting a large number of backup files with ReFS block cloning in use. This was causing the system to completely freeze (clock not updating in the task bar), and often BSOD. These are the issues that were being discussed in this topic.

I don't expect the patch to fix any other issues, and actually I was not aware that the two specific ones mentioned above exist. For example, we have definitely not seen full backup performance issues in our labs, and the only time we saw fast clone performance issues was when the corresponding regression was temporarily introduced in the May 2018 Windows Updates.
To clarify my post: I don't expect higher performance for full backups. Today, with the last two or three (?) ReFS-relevant MS patches, we fill the 10 Gbit/s network link on full backup operations.

So far so good.

The only struggle is that the OS becomes unresponsive (SNMP, WMI) during Fast Clone and/or delete operations, so our monitoring stays in an alert state for the VEEAM repositories during the main backup window.
It seems that a high count of Fast Clone operations is responsible for this behaviour. An option to limit simultaneous Fast Clone operations per repository would be nice. :?:

Thanks.

l0stb@ackup
Influencer
Posts: 14
Liked: 4 times
Joined: Jul 19, 2018 2:10 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by l0stb@ackup » Sep 05, 2018 1:10 am

@Gostev, thanks for the clarification, I can't hide my huge disappointment though :(

We are experiencing extremely slow Fast Clone/Synthetic Full operations.

We have two identical servers (backup and replication), each with the following config:
128GB RAM
152TB repository
Cisco S3260 M4
24 SATA drives in RAID6

This system has been benchmarked at a sustained 500-600 MBps write rate to the repository's ReFS partition, with respectable peaks of 1.1 GBps. However, backup job write speeds average 50-100 MBps, and Fast Clone/Synthetic Full times can range between 10 and 50 hrs. Half of the job runs show the source as the bottleneck (~90%), but the other half show the target (~70%). What, in your opinion, could be causing the slow performance we're getting on the target?
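
To put the gap between the benchmark and the job speeds in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 1,000,000 MB) and a perfectly steady write rate; the function name `hours_to_write` is made up for illustration. It only covers plain sequential writes and says nothing about how long a Fast Clone or Synthetic Full should take.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units (1 TB = 1,000,000 MB)


def hours_to_write(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to write size_tb terabytes at a steady rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_tb * MB_PER_TB / rate_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Rates quoted in the post: 50-100 MBps in jobs, 500-600 MBps in the benchmark.
    for rate in (50, 100, 500, 600):
        print(f"{rate:>4} MB/s: {hours_to_write(1, rate):.1f} h per TB")
```

Under these assumptions, every terabyte written at the job speeds takes roughly five to ten times longer than at the benchmarked rate.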

tsightler
VP, Product Management
Posts: 5453
Liked: 2270 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by tsightler » Sep 05, 2018 1:46 am 1 person likes this post

soehl wrote: The only struggle is that the OS becomes unresponsive (SNMP, WMI) during Fast Clone and/or delete operations, so our monitoring stays in an alert state for the VEEAM repositories during the main backup window.
It seems that a high count of Fast Clone operations is responsible for this behaviour. An option to limit simultaneous Fast Clone operations per repository would be nice. :?:
Do you have your repository tasks limited or do you have the task limit unchecked (the default)? What concurrency do you have set and what is your memory/CPU configuration?

soehl
Enthusiast
Posts: 52
Liked: 8 times
Joined: May 09, 2011 12:43 pm
Full Name: Sebastian
Location: Germany
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by soehl » Sep 05, 2018 9:09 am

We have several HP(E) boxes, mainly Apollo 4510 Gen9/10.
One example:
HPE Apollo 4510 Gen9
2x E5-2690V4 = 28 Cores
196GB RAM
60x 4TB NL SAS disks on one HPE Smart Array P840 (4GB cache) with SSD SmartCache enabled (800GB)

The RAID configuration is 2x RAID 60 with 30 disks each = 2x approximately 100TB net filesystem, formatted with ReFS at 64KB block size.
Each filesystem is one VEEAM repository, with the "Use per-VM backup files" option activated and a task limit of 20.
I played around with the concurrent task limit option, from 5 to unchecked, but didn't find a value that brings a real improvement. Besides that, backup duration is longer with a lower concurrent task limit.

Thanks!

mkretzer
Expert
Posts: 566
Liked: 127 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mkretzer » Sep 05, 2018 10:31 am

l0stb@ackup wrote: This system has been benchmarked at a sustained 500-600 MBps write rate to the repository's ReFS partition, with respectable peaks of 1.1 GBps. However, backup job write speeds average 50-100 MBps, and Fast Clone/Synthetic Full times can range between 10 and 50 hrs. Half of the job runs show the source as the bottleneck (~90%), but the other half show the target (~70%). What, in your opinion, could be causing the slow performance we're getting on the target?
We have a similar configuration with an external FTS storage with 24 disks, and write speed is also between 60 and 140 MB/s, which is quite slow... I know the focus is on ReFS synthetic performance, but active/incremental performance is bad as well...

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 05, 2018 1:26 pm

I've had 4 lockups today - 2 using the old driver (where the clock got stuck) and 2 using the new driver (where the clock keeps on running).

I'm evacuating 3 ReFS FC storages to an internal ReFS storage and this is sort of a nightmare... The data on the FC storages isn't block-cloned ("Blockcopy"), since it is what was evacuated from the internal RAID yesterday, so there is no block cloning on these volumes...

2 x E5-2620v4
128GB Ram
3 x FC each ~12TB
1 x internal ~50TB

Gostev
SVP, Product Management
Posts: 24972
Liked: 3628 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 05, 2018 1:50 pm

Try reducing concurrent tasks on the repo. If the clock keeps on running but the server is slow to respond, this is usually due to heavy I/O or CPU load.

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 05, 2018 2:49 pm

Even if I'm evacuating only 1 of the FCs onto the local repo (that's 4 tasks), the drive becomes unresponsive (showing no information about used/free space in Explorer and so on) and backups targeting the repo (15-minute SQL transaction log backups) are not running... The problem occurred out of nowhere after evacuating the drive yesterday...

Edit:
And you can't browse the drive while it hangs - the only way out is to reboot (which sometimes fails, so a reset is required).

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 05, 2018 3:21 pm

What happens if you try a robocopy of a similar sort of data going through the same sort of route? Does the system still freeze up? i.e. eliminate Veeam from the equation to see whether or not it is directly connected with Veeam activities or is there something more fundamental wrong with your environment.

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 06, 2018 6:11 am

It didn't happen before. The lockup/hang/slowdown happens every time the evacuation of a .vbk (or a large .vib, as it seems) reaches 100% - I think B&R then tries to verify - and that read seems to be the problem in my case. Since one of the backups is ~3TB, that is one looooong lockup.

When the verify is over (and the clock kept on running) the server is all good again. I'm talking about 1 task at a time on the repo (not 3 x 4 tasks (4 tasks for each evacuated FC storage) or something).

I've copied ~4TB worth of data with Explorer, which worked without a problem. I didn't experience problems with normal B&R actions; only the evacuation gave me headaches.

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 06, 2018 8:26 am

Using explorer to copy files isn't going to put as much of a strain on things as it does not do multi-threading. Robocopy would.

When you look at the job actions in the Home / Jobs screen what does it show you for the Load reading and Primary bottleneck?

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 06, 2018 9:10 am

As the evacuation is a system task it doesn't show this - it's not a normal job.

Agreed that an Explorer copy isn't multi-threaded. I should have clarified that there were about 8-9 simultaneous copies running (so this is sort of multi-threaded). Robocopy was a lot more overhead for these copies, so I stuck to normal Explorer copies.

Gostev
SVP, Product Management
Posts: 24972
Liked: 3628 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 09, 2018 8:29 pm

Guys, it's not about multi-threading (although, just for the record, Robocopy does have the /MT switch that makes it multi-threaded). Neither Explorer nor Robocopy is a good tool for comparison for a different reason: unlike Veeam, they don't enable ReFS data integrity streams on the target file. And having that enabled completely changes the ReFS I/O pattern.

daveuu
Novice
Posts: 3
Liked: 1 time
Joined: Oct 31, 2014 1:23 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by daveuu » Sep 10, 2018 2:51 am 1 person likes this post

Just a report from the field. I was suffering 10x performance degradation during backup file merge operations on ReFS after the July updates. No issues before this, no memory issues or server lockups. (24TB repo with 64GB memory)

Rolling the driver back resolved this for me temporarily; installing KB4343884 also seems to have worked as a permanent fix.

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 10, 2018 4:37 am

Gostev wrote: Guys, it's not about multi-threading (although, just for the record, Robocopy does have the /MT switch that makes it multi-threaded). Neither Explorer nor Robocopy is a good tool for comparison for a different reason: unlike Veeam, they don't enable ReFS data integrity streams on the target file. And having that enabled completely changes the ReFS I/O pattern.
Okay, but is there any explanation for why the lockup only happens for me while the verify of an evacuation is in progress? Or is this expected behaviour? I'm having no problems so far using the repos with normal synthetic fulls and so on.

Gostev
SVP, Product Management
Posts: 24972
Liked: 3628 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 10, 2018 12:07 pm

I suppose this is when the deletion of the evacuated backup file occurs. ReFS has some known issues when deleting large files that use block cloning. These issues were supposedly fixed in the most recent update (KB4343884); we're testing it right now to confirm.

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 10, 2018 12:33 pm

But it's not the source volume that is locking up (no information about free/used space showing up in Explorer) - it's the target. So I'm not quite sure the deletion is the problem. Maybe the evacuation is doing a verify read or something?

kevin.boddy
Service Provider
Posts: 5
Liked: never
Joined: Jan 30, 2018 3:24 pm
Full Name: Kevin Boddy
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by kevin.boddy » Sep 10, 2018 1:12 pm

Hi,

We're deploying a new Windows 2016 backup repository. 2x 8c cpu, 256GB memory, 600TB raw storage, hardware raid. Should we look at ReFS again or stick to NTFS?

Thanks

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green » Sep 10, 2018 1:28 pm

I'd wait for the results from Gostev's testing; however, you appear to have sufficient memory for ReFS.
Many thanks

Iain Green

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 10, 2018 2:53 pm

Gostev wrote: Guys, it's not about multi-threading (although, just for the record, Robocopy does have the /MT switch that makes it multi-threaded). Neither Explorer nor Robocopy is a good tool for comparison for a different reason: unlike Veeam, they don't enable ReFS data integrity streams on the target file. And having that enabled completely changes the ReFS I/O pattern.
It was a test on his setup without using Veeam. He said he was trying to use an Explorer file copy to stress-test his environment; I was only saying that a file copy using Explorer isn't going to max out the server.

Mgamerz
Expert
Posts: 129
Liked: 21 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Mgamerz » Sep 11, 2018 5:34 pm 1 person likes this post

I'm using the August update and I haven't had any issues since I installed it. 160GB RAM / ~55TB repo (25TB used)

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Iain_Green » Sep 12, 2018 5:47 am

I would be interested to hear from a user who is not following the 1GB/1TB rule and is using the new update.
Many thanks

Iain Green
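
For anyone who wants to check their own box against that rule, here is a minimal Python sketch. It assumes the rule means 1 GB of RAM per 1 TB of repository capacity; the function names `ram_per_tb` and `meets_rule` are made up for illustration, and the sample figures are simply the RAM and repository sizes that posters quoted earlier in this topic.

```python
def ram_per_tb(ram_gb: float, repo_tb: float) -> float:
    """GB of RAM available per TB of repository capacity."""
    if repo_tb <= 0:
        raise ValueError("repository size must be positive")
    return ram_gb / repo_tb


def meets_rule(ram_gb: float, repo_tb: float, gb_per_tb: float = 1.0) -> bool:
    """True if the box has at least gb_per_tb GB of RAM per TB of repository."""
    return ram_per_tb(ram_gb, repo_tb) >= gb_per_tb


# (RAM in GB, repository size in TB) as quoted by posters in this topic.
configs = {
    "kubimike": (192, 50),
    "l0stb@ackup": (128, 152),
    "daveuu": (64, 24),
    "Mgamerz": (160, 55),
}

if __name__ == "__main__":
    for user, (ram_gb, repo_tb) in configs.items():
        ratio = ram_per_tb(ram_gb, repo_tb)
        verdict = "meets" if meets_rule(ram_gb, repo_tb) else "below"
        print(f"{user:<12} {ratio:5.2f} GB/TB  ({verdict} 1 GB per TB)")
```

Whether the repository size should be counted as raw, formatted, or used capacity is not settled in this topic, so treat the ratio as a rough indicator only.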

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 12, 2018 10:32 am

I just found out, while focusing all my attention on the repository we're currently loading up, that one of our other repositories (with only 6 small jobs on it) was quietly having STOP errors. Two at the beginning of this month: on the 1st and the 8th, between 6 and 8pm.

That repository has the recommended amount of RAM and is only being used for 6 very small jobs, only 800GB of a 54TB volume.

The ReFS.sys version is 10.0.14393.2395, which is different from the other repository. It seems to be later, but I can't find any guide which explains which version was released when and in which update. I know they juggled them around a bit, so the higher number does not necessarily mean it is the latest distribution.

I'm updating all our Veeam servers to the latest version in KB4343884.

Gostev
SVP, Product Management
Posts: 24972
Liked: 3628 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 12, 2018 1:19 pm

KB4343884 has ReFS driver version 10.0.14393.2457
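
A small Python sketch for comparing a refs.sys version string against that number. It compares the dotted fields numerically, since a plain string comparison would get cases like "999" versus "2457" wrong. The function names `parse_version` and `is_at_least` are made up for illustration, and a numerically higher version is only assumed to be the newer driver, which, as noted earlier in this topic, may not always match the order in which updates were distributed.

```python
from typing import Tuple

# ReFS driver version shipped in KB4343884, as stated in the post above.
KB4343884_REFS = "10.0.14393.2457"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '10.0.14393.2457' (optionally prefixed with 'v') into a tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def is_at_least(installed: str, required: str = KB4343884_REFS) -> bool:
    """True if the installed version is numerically >= the required one."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    for version in ("10.0.14393.2395", "v10.0.14393.2457"):
        status = "at or above" if is_at_least(version) else "below"
        print(f"{version}: {status} {KB4343884_REFS}")
```

The installed version string itself would come from the file properties of refs.sys; how you read it is left out here.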

Nick-SAC
Enthusiast
Posts: 41
Liked: 6 times
Joined: Oct 27, 2017 5:42 pm
Full Name: Nick
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Nick-SAC » Sep 12, 2018 3:57 pm

We’re not experiencing the problems in this thread’s topic (we’re doing very small backups right now) but just as an observation; tracking (and getting) these Update KB Releases and ReFS Versions is a bit of a head scratcher...

2 Systems, Both:
Win Server 2016 Version 1607
Hyper-V Hosts / VB&R All-In-One Backup Servers
Getting updates directly from Windows Update (not WSUS controlled)

System-1
KB 4343887 (Installed Sept 1)
KB 4343884 (Not Installed and not showing as Available)
O/S Build 14393.2430
REFS.SYS v10.0.14393.2395

System-2
KB 4343887 (Installed Aug 25)
KB 4343884 (Installed Sept 10)
O/S Build 14393.2457
REFS.SYS v10.0.14393.2457

So apparently the numerically Lower KB installed a numerically Higher ReFS version (which happens to match the OS Build Suffix).

And the other system isn’t even being offered the KB with the Higher ReFS version?!

Nick

ASG
Enthusiast
Posts: 41
Liked: never
Joined: Aug 08, 2018 10:19 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ASG » Sep 13, 2018 4:42 am

@Gostev
Are the lockups we have encountered while evacuating part of the evaluations? Maybe you could try an evacuation from one ReFS volume to another ReFS volume inside the same SOBR and see if the lockup occurs as soon as any evacuation hits 100%?

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 13, 2018 10:49 am

Nick-SAC wrote:We’re not experiencing the problems in this thread’s topic (we’re doing very small backups right now) but just as an observation; tracking (and getting) these Update KB Releases and ReFS Versions is a bit of a head scratcher...

2 Systems, Both:
Win Server 2016 Version 1607
Hyper-V Hosts / VB&R All-In-One Backup Servers
Getting updates directly from Windows Update (not WSUS controlled)

System-1
KB 4343887 (Installed Sept 1)
KB 4343884 (Not Installed and not showing as Available)
O/S Build 14393.2430
REFS.SYS v10.0.14393.2395

System-2
KB 4343887 (Installed Aug 25)
KB 4343884 (Installed Sept 10)
O/S Build 14393.2457
REFS.SYS v10.0.14393.2457

So apparently the numerically Lower KB installed a numerically Higher ReFS version (which happens to match the OS Build Suffix).

And the other system isn’t even being offered the KB with the Higher ReFS version?!

Nick

I'm afraid this is unlikely to help demystify the situation... I saw similar confusion when updating three of mine yesterday. I think the 'Date modified' refers to the time the patch was applied rather than the time Microsoft made changes to it; it took me a few moments to see that properly.

Image

ejenner
Expert
Posts: 425
Liked: 66 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Sep 13, 2018 10:52 am

The version at the top-right of my screenshot was having STOP errors.

JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by JimmyO » Sep 19, 2018 8:18 am

I've been postponing updates on my ReFS repos for quite some time now. I've been running KB4093119 (refs.sys 2097) for 5 very stable months, with no issues whatsoever.

Reading the latest in this thread, I decided to go for the latest update (KB4457131, refs.sys 2457), and after 2 days I can definitely say that it's slower. It's difficult to say by how much, but I estimate 50%. This is not good, but it may be good enough, since I struggled with ReFS for more than a year, with performance degradation of thousands of percent or more, until 2097 was released. Memory and CPU have never been an issue (24 cores, 384GB RAM).

Still, it makes me wonder: if you have a perfectly working version (2097), what went wrong in later releases...
