Raleigh
Novice
Posts: 7
Liked: never
Joined: Jun 26, 2018 11:33 pm
Full Name: Raleigh

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Raleigh » Jul 13, 2018 10:29 pm

Thank you for the reply, Gostev.

Point taken about opening support cases with Microsoft regarding ReFS issues. If having many Veeam customers open cases with Microsoft will better motivate them to resolve the issues, then I am happy to participate in that. Microsoft Support has not yet admitted to me that my issue is the result of any known bug (they are *still* in the process of analyzing my memory.dmp file), so I may have to push them on that.

You are also correct: I have no idea what facilities and resources Veeam engineers have engaged on this issue. The Veeam support technicians I worked with never mentioned that Veeam was working directly with MS to resolve the ReFS issues. Actually, the first Veeam Support tech that I worked with when I opened the ticket (this was back in early April) told me that it was her understanding that the ReFS issues were resolved by the February Windows Updates. So apparently, she was not aware of any ongoing initiative with Microsoft either, or at least didn’t feel it was relevant to my issue.

Yes, it’s true that I created my login to the Veeam Community Forum only several weeks ago, but I have been reading this topic thread since my problem began. The first Veeam Support tech told me about this forum topic. I did not need to create an account until I wanted to submit a post. I only wish I had done that much sooner. I will not make that mistake again, since this forum is where the solution to my repo server issue came from.

Finally, I want to be clear that I offered my suggestions for constructive purposes. I do not mean to come off like I’m simply “bagging” on Veeam. I would truly like to help make it better. FYI, prior to becoming a Veeam customer at the end of March, we were (for many, many years) a Symantec Backup Exec shop. I just got tired of that product. I felt like I was constantly babysitting the system, dealing with agent updates on servers, dealing with backups that failed for this reason or that reason, dealing with (IMHO) very poor support, and having to work with a product that simply was not designed from the beginning to work with VMware VMs. So yes, I hope you can appreciate that I was a bit frustrated when I found myself babysitting my shiny new Veeam backup system only two weeks into using it, and I am sure that frustration came through in my post. But I intend to be a Veeam customer for the foreseeable future, so if I do comment, it is meant constructively. And do feel free to correct me when I’m wrong or misinformed. I can take it!

Thanks,
Raleigh

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jul 13, 2018 10:54 pm

Hi, Raleigh - no worries, I understand. And thank you for understanding!

JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm

Post by JimmyO » Jul 16, 2018 7:02 am

So - we have some confirmations that the latest refs.sys does the trick. Do we have any figures that indicate we're back to the same performance as before?

reaperhammer
Service Provider
Posts: 27
Liked: 7 times
Joined: Aug 18, 2016 7:59 pm
Full Name: Will S

Post by reaperhammer » Jul 16, 2018 9:23 am

When will Veeam list RAM requirements for ReFS block clone on the official system requirements page?

Humphro
Novice
Posts: 4
Liked: 1 time
Joined: Mar 09, 2017 1:35 pm
Full Name: Matthew Humphreys

Post by Humphro » Jul 16, 2018 10:41 am

I can confirm that, in our environment, applying KB4338814 changed the ReFS driver from 10.0.14393.2312 to 10.0.14393.2363. After this update was applied to both the source (Veeam server) and the remote repository, the time taken for the full backup merge to complete dropped from over 60 hours to, after a few iterations, less than 3 hours, which is close to what the job was taking before.

LBegnaud
Service Provider
Posts: 19
Liked: 7 times
Joined: Jan 24, 2018 12:08 am

Post by LBegnaud » Jul 16, 2018 6:59 pm

Just throwing out our experience here. Probably not worth much without some additional info, but after fighting for 4 days I feel like sharing regardless.

We have an SoBR with 200TB+ of usable storage spread across 7 physical servers and 12 extents (we try not to have our ReFS volumes be larger than 20TB, because of issues in the past). Of these 7 servers, 6 have had their performance improved after the update. We updated because we were having issues with one of the server's performance. This server having issues is actually identical hardware-wise with one of the others in the SoBR, but it TANKED after the update. Would become unstable after ~2 hours of running small backups. Seems like a warning sign for these ReFS issues is an ever-growing value for "Modified" RAM (not sure if that was mentioned in this thread already).

[Image: graph of "Modified" RAM on the repository server]

Anyway, Modified RAM would go higher and higher, then RPC / WMI would start failing on the repo (same old story throughout this thread). You'll notice the graph actually started going down; this is because around 2am the majority of our jobs had outright failed and were past their 3 retries, so operations were mostly stopped on the offender, rs-bkptar-1.

Just replaced the newest refs.sys with refs.sys version 2097 on that single server and the server is now rock solid. Catching up from the failed backups last night at record pace. Before, we were seeing disk response times measured in seconds; now everything is sitting pretty at <50ms. Running more concurrent jobs than we were when it would slow to a crawl.

I don't quite understand how refs.sys can be interchangeable like this, but I really hope it doesn't cause some silent corruption that pops up 3 weeks from now...

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jul 16, 2018 9:59 pm

Did you check to see if this misbehaving server has some software installed that other servers don't? And my other guess, just by looking at server naming (likely your oldest ReFS repository), perhaps it still has some former ReFS tweaks left in the registry? I would try to reinstall Windows on that server first and foremost, as indeed something is very wrong looking at how healthy all other backup repositories are.

Mgamerz
Expert
Posts: 125
Liked: 21 times
Joined: Sep 29, 2017 8:07 pm

Post by Mgamerz » Jul 17, 2018 3:33 pm

Newest server 2016 update (July 16th) now contains DHCP fix. Installing update now. I need to learn to not do this before support calls with companies working on my server though, it never works out for me doing this :)

LBegnaud
Service Provider
Posts: 19
Liked: 7 times
Joined: Jan 24, 2018 12:08 am

Post by LBegnaud » Jul 19, 2018 12:09 am 1 person likes this post

Yea gostev all great suggestions. I actually brought up a VM on the host, did disk passthrough, renamed things so veeam saw the vm as the original hardware, and got similar behavior.

Ended up digging through performance metrics and saw a couple of disks in the storage pool with very high max read and write latency... so we replaced those disks and the host seems to be performing like the others. Tonight will be the first night with the VM out of the picture, but testing was good. Looks like this is a common issue with Windows Storage Spaces: disks that are failing but not fully failed.

Seems like all in all the new update does resolve refs issues once again.

ejenner
Expert
Posts: 376
Liked: 55 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London

Post by ejenner » Jul 20, 2018 2:11 pm

Seem to have this problem here.

The DL380 G9 has had a STOP error three times: twice in June and once last night.

We see 0x00000133 logged in the iLO Integrated Management Log.

Analysis of the dump files sometimes points to ntoskrnl.exe, but also to refs.sys.

Our refs.sys version is 10.0.14393.2273, dated 28/04/18, which has been mentioned as problematic.

We're going to try 10.0.14393.2363 and hopefully that'll fix it.

jonesg
Novice
Posts: 3
Liked: never
Joined: Jul 31, 2018 10:44 am
Full Name: Jonas Groth

Post by jonesg » Jul 31, 2018 10:52 am

We are still seeing issues with the .2363 driver: slow merges, and repository servers becoming temporarily unresponsive on volumes with ReFS (SNMP disk checks fail due to timeout and we are unable to browse the volumes). It seems to be a bigger problem when merging larger files. We see a much higher rate of failure when merging increments into fulls when the increment is about 250GB in size. This may just be because the merge runs for a longer period of time, but it definitely seems more prone to error when large merges are running.

We tried reverting to .2097 at the start of July to see if stability would return, but unfortunately it did not. This prompted us to open a Veeam case (#03085972) and later a Microsoft case on the subject.

Microsoft returned with the answer that this was a known problem and to install the .2097 version of the driver - which was already on the host. After a short discussion it was decided to install KB4338814 and the .2363 driver. This did not change anything however.

Having returned from holiday and picking up this issue again I can see Microsoft have released a new KB (KB4338822) that, despite it not being mentioned in the release notes, contains an even newer ReFS driver DLL with version ending in .2395. Has anyone tried this newer update?

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jul 31, 2018 11:59 am

How much RAM do you have on your backup repository server, and what is the ReFS volume size?

jonesg
Novice
Posts: 3
Liked: never
Joined: Jul 31, 2018 10:44 am
Full Name: Jonas Groth

Post by jonesg » Jul 31, 2018 12:27 pm

It is a bit of a mix across 6 physical boxes. Most have 128GB, except for one that has 256GB and another that has 512GB.

5 boxes have 5x80TB ReFS volumes and the last has 2x90TB ReFS volumes and 1x90TB NTFS.

Only one box seems to be heavily hit by this and I am in the process of getting a service window for that particular box to add an additional 128GB RAM to see if that will lessen the problems.

Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?

ejenner
Expert
Posts: 376
Liked: 55 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London

Post by ejenner » Jul 31, 2018 12:56 pm

Since posting last week the repository server in question crashed again on Saturday evening. Glad to hear it isn't just our one doing it though... reassuring that we've not stuffed anything up and that it seems quite normal... :lol:

Edit: ours is 16GB with 55TB ReFS volume.

I'm noticing a common theme that it isn't always the same cause highlighted in the dump files. ccmexec.exe was in the latest dump file. So it's crashed on a different process more or less every time.

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Jul 31, 2018 1:21 pm 1 person likes this post

jonesg wrote: Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?
Consider it official from Veeam. I am still trying to get a word on this from Microsoft, just pinged ReFS PM again.

DaveWatkins
Expert
Posts: 348
Liked: 92 times
Joined: Dec 13, 2015 11:33 pm

Post by DaveWatkins » Jul 31, 2018 9:19 pm 3 people like this post

We've got ~200TB running perfectly fine on 96GB of RAM just as a guideline. It crashed with only 32GB but it appears that 1GB per TB is a good indicator until you get to about 60TB and from then on it seems to drop off in required resources

zx81
Service Provider
Posts: 11
Liked: 1 time
Joined: Nov 24, 2016 6:57 am
Location: Perth, Australia

Post by zx81 » Aug 01, 2018 12:05 pm

Gostev wrote: Consider it official from Veeam. I am still trying to get a word on this from Microsoft, just pinged ReFS PM again.
Hi,

How is the 1GB/1TB calculated? I have 8 ReFS 10TB repositories on my backup repository VM, each of which is ~70% used. I have 16 backup copy jobs (~60 VMs) that write to these volumes. I'm currently limiting concurrent tasks to 3 per repository in an effort to reduce the number of hard hangs on the VM. What sort of RAM figure should I be allocating to the repository VM?

dellock6
Veeam Software
Posts: 5714
Liked: 1610 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy

Post by dellock6 » Aug 01, 2018 4:54 pm 1 person likes this post

I can speak as part of the Solutions Architect team: this number has been tested in the field across the many ReFS deployments we have worked on and keep working on. Over time it has proved to be a good way to stay on the safe side, and it was also a "savior" before Microsoft released all their 2018 patches for ReFS. As we deal with very large companies, we "stress tested" this value to the point that we now recommend it as a rule of thumb.
Yes, your mileage may vary, and in fact people have deployments with a smaller ratio, but as we are engaged with critical customers, we prefer to be safe, so we use this value since we know it will not fail on anyone.

In your case, you have 80TB, so you may need 80GB following our rule. It's not about how many jobs you run, but about how much metadata is stored in the volume and has to be processed each time block cloning kicks in.
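
A minimal sketch of that arithmetic, for illustration only. The function name `recommended_ram_gb` and the `gb_per_tb` parameter are hypothetical, and the sketch assumes the rule is applied to the total ReFS capacity on the server, as in the 80TB example above.

```python
def recommended_ram_gb(volume_sizes_tb, gb_per_tb=1.0):
    """Return the RAM (in GB) suggested by the 1 GB per 1 TB rule of thumb.

    volume_sizes_tb: sizes of all ReFS volumes on one repository server, in TB.
    gb_per_tb: ratio to apply (assumption: 1.0, the value quoted in this thread).
    """
    if any(size < 0 for size in volume_sizes_tb):
        raise ValueError("volume sizes must be non-negative")
    return sum(volume_sizes_tb) * gb_per_tb


# Eight 10 TB ReFS volumes on a single repository VM
print(recommended_ram_gb([10] * 8))  # 80.0
```

Passing a smaller `gb_per_tb` is a way to model the smaller ratios that some people in this thread run with, at their own risk.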
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2019
Veeam VMCE #1

EzE
Influencer
Posts: 19
Liked: never
Joined: Feb 06, 2015 3:48 pm
Full Name: Eric H

Post by EzE » Aug 01, 2018 5:02 pm

jonesg wrote: Having returned from holiday and picking up this issue again I can see Microsoft have released a new KB (KB4338822) that, despite it not being mentioned in the release notes, contains an even newer ReFS driver DLL with version ending in .2395. Has anyone tried this newer update?
OMG :x I'm having slow merge issues again and found my ReFS version at .2395! only at 53% after 6 hours for 1.1TB Full backup file. I'm not 100% sure .2395 is the problem yet but beware! Please post your observed performance if you have applied KB4338822. I really would like this merge to finish so it will be a while until I can uninstall the update and test.

jonesg
Novice
Posts: 3
Liked: never
Joined: Jul 31, 2018 10:44 am
Full Name: Jonas Groth

Post by jonesg » Aug 02, 2018 7:03 am

From the Microsoft case I got confirmation yesterday that the .2395 version should contain performance updates, but that they are not validated enough to warrant being listed in the change logs. I am in the process of rolling the update onto our servers now and will return with results later, probably not before sometime next week. We still haven't managed to get a memory upgrade for the servers to bump from 128GB for 400TB to 256GB (the memory modules I had seem to be incompatible for some reason...).

ejenner
Expert
Posts: 376
Liked: 55 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London

Post by ejenner » Aug 06, 2018 3:58 pm

We upped our new repository from 16GB to 32GB by borrowing some RAM from our other new repository (which is off, not installed yet). Had another crash last night. Just reading the posts above it seems like we really ought to double what we have before trying to find any other possible causes!

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green

Post by Iain_Green » Aug 10, 2018 10:10 am

Appears we have hit this issue again.

We are running version 10.0.14393.2312 and are being told by Veeam support to downgrade to 10.0.14393.2097, but we don't want to uninstall updates.
Is there a way of getting hold of the driver and installing it on its own?
Many thanks

Iain Green

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Aug 10, 2018 11:54 am

That's strange advice from our support; I will investigate why they are telling people this. Version 2312 is from a couple of months ago, and as per this thread, this driver version is known to be problematic because it contains the performance regression introduced back in May. You should simply install the latest Windows updates, or at least the July Cumulative Update (KB4338814), which has the newer ReFS driver version that many people have confirmed to work well. Thanks!
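
For anyone trying to keep the driver versions straight, here is a small sketch that maps a refs.sys version string to what posters in this thread have reported about it. The table only summarizes anecdotal forum reports, not official Microsoft or Veeam guidance, and the names `THREAD_REPORTS`, `revision` and `thread_report` are hypothetical.

```python
# Anecdotal reports from this thread, keyed by the last component of the
# refs.sys version (10.0.14393.<revision>). Not official guidance.
THREAD_REPORTS = {
    2097: "older driver; several posters rolled back to it",
    2273: "reported as problematic",
    2312: "reported as problematic (performance regression)",
    2363: "shipped in KB4338814; fixed merges for several posters, not all",
    2395: "shipped in KB4338822; mixed reports",
}


def revision(version):
    """Extract the revision from a version string such as '10.0.14393.2363'."""
    parts = version.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected version format: {version!r}")
    return int(parts[3])


def thread_report(version):
    """Return the thread's verdict for a driver version, if it was discussed."""
    return THREAD_REPORTS.get(revision(version), "not discussed in this thread")


print(thread_report("10.0.14393.2312"))
# reported as problematic (performance regression)
```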

Iain_Green
Service Provider
Posts: 148
Liked: 8 times
Joined: Dec 05, 2014 2:13 pm
Full Name: Iain Green

Post by Iain_Green » Aug 10, 2018 1:36 pm

@gostev

thanks, case number is 03141694.

He even quoted your old post from this thread :)

I will arrange the install of this update.
Many thanks

Iain Green

MERBAG
Novice
Posts: 4
Liked: never
Joined: Aug 13, 2018 7:12 am
Full Name: Rolf

Post by MERBAG » Aug 13, 2018 7:50 am

Hi all,

We also had issues using ReFS, especially with the duration of the fast clone process, which took up to 12 hours. We worked through this forum and followed all the steps provided, such as updating the refs.sys driver version.

At the end of a long case, support asked us to change the advanced settings for each job to storage optimization "Local target (16TB+ backup files)", even though our backup files are not larger than 16TB. The time for the fast clone process has now dropped from 8-12 hours to 10 minutes.

Hope this is helpful for someone.

Gostev
SVP, Product Management
Posts: 24638
Liked: 3467 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev » Aug 13, 2018 10:03 am

This setting makes Veeam operate with 8MB blocks (as opposed to 1MB blocks by default), which basically reduces the number of cloned blocks 8x. I suppose it can be an alternative to increasing RAM size on the backup repository, although this will also increase incremental backup file sizes quite significantly.
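
To put rough numbers on that 8x reduction, here is a minimal sketch. It assumes a backup file is split into fixed-size blocks of exactly 1MB or 8MB and ignores compression and deduplication, which is a simplification; the function name `block_count` is hypothetical.

```python
def block_count(backup_size_gb, block_size_mb):
    """Number of fixed-size blocks needed to hold a backup file.

    Assumption: uncompressed, fixed-size blocks; sizes in binary units.
    """
    size_mb = backup_size_gb * 1024
    # Ceiling division so a partial trailing block is still counted
    return (size_mb + block_size_mb - 1) // block_size_mb


default_blocks = block_count(1024, 1)  # 1 TB file, 1 MB blocks
large_blocks = block_count(1024, 8)    # 1 TB file, 8 MB blocks

print(default_blocks)                  # 1048576
print(large_blocks)                    # 131072
print(default_blocks // large_blocks)  # 8
```

Fewer blocks means fewer block clone operations per merge, but any change inside an 8MB block causes the whole block to be written to the increment, which is where the larger incremental files come from.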

l0stb@ackup
Influencer
Posts: 14
Liked: 4 times
Joined: Jul 19, 2018 2:10 am

Feedback on newer updates

Post by l0stb@ackup » Aug 14, 2018 12:15 am

Can anyone share more feedback on the newer ReFS driver versions and CU updates? Have looked at KB4338814 (ReFS driver 2363) but this CU contains some nasty bugs I don't want my customer exposed to. There have been 3 newer CUs since. I cannot believe how difficult Microsoft is making this - including an improved driver in a bad update, sheesh....

jim3cantos
Enthusiast
Posts: 51
Liked: 10 times
Joined: Jan 08, 2013 6:14 pm
Full Name: José Ignacio Martín Jiménez
Location: Madrid, Spain

Post by jim3cantos » Aug 14, 2018 9:12 am 1 person likes this post

KB4338822 (Last 2018-07 Cumulative Update) installed and checked version of ReFS.sys file at .2395. Will post again if problems are detected.

wingphil
Novice
Posts: 7
Liked: never
Joined: Jun 11, 2018 8:51 am

Post by wingphil » Aug 14, 2018 9:36 am

I had a server lockup again last week and again this morning. ReFS driver version 2363, and we have about 14TB of repository space and four vCPUs (running under VMware).

We had 20GB of RAM assigned. I've upped it to 24GB. Should I still be expecting to see problems? I know this is a lot less than the rest of you have allocated, but it's a lot more than the 1GB/1TB recommended.

Thanks,

Phil

Ctek
Service Provider
Posts: 69
Liked: 9 times
Joined: Nov 11, 2015 3:50 pm
Location: Canada

Post by Ctek » Aug 14, 2018 7:28 pm 1 person likes this post

ReFS.sys 2395 (2018-07 CU) has been abysmal in our environments: we have had severe server locking issues and downtime during fast cloning since that cumulative update.

Today fresh off the Microsoft presses, there is KB4343887 which does not change the ReFS.sys driver version.

Fun times.
VMCE 9 Certified - Systems Administrator
