REFS issues (server lockups, high CPU, high RAM)

Post by **Raleigh** » Jul 13, 2018 10:29 pm

Thank you for the reply, Gostev.

Point taken regarding opening support cases with Microsoft about ReFS issues. If having many Veeam customers open cases with Microsoft will better motivate them to resolve the issues, then I am happy to participate in that. Microsoft Support has not yet admitted to me that my issue is the result of any known bug (they are *still* in the process of analyzing my memory.dmp file), so I may have to push them on that.

You are also correct: I have no idea what facilities and resources Veeam engineers have engaged on this issue. The Veeam support technicians I worked with never mentioned that Veeam was working directly with MS to resolve the ReFS issues. Actually, the first Veeam Support tech that I worked with when I opened the ticket (this was back in early April) told me that it was her understanding that the ReFS issues were resolved by the February Windows Updates. So apparently, she was not aware of any ongoing initiative with Microsoft either, or at least didn’t feel it was relevant to my issue.

Yes, it’s true that I created my login to the Veeam Community Forum only several weeks ago, but I have been reading this topic thread since my problem began. The first Veeam Support tech told me about this forum topic. I did not need to create an account until I wanted to submit a post. I only wish I had done that much sooner. I will not make that mistake again, since this forum is where the solution to my repo server issue came from.

Finally, I want to be clear that I offered my suggestions for constructive purposes. I do not mean to come off like I’m simply “bagging” on Veeam. I would truly like to help make it better. FYI, prior to becoming a Veeam customer at the end of March, we (for many, many years) were a Symantec Backup Exec shop. I just got tired of that product. I felt like I was constantly babysitting the system, dealing with agent updates on servers, dealing with backups that failed for this reason or that reason, dealing with (IMHO) very poor support, and having to work with a product that simply was not designed from the beginning to work with VMware VMs. So yes, I hope you can appreciate that I was a bit frustrated when I found myself babysitting my shiny new Veeam backup system only two weeks into using it, and I am sure that frustration came through in my post. But I intend to be a Veeam customer for the foreseeable future, so if I do comment, it is meant constructively. And do feel free to correct me when I’m wrong or misinformed. I can take it!

Thanks,
Raleigh

Post by **Gostev** » Jul 13, 2018 10:54 pm

Hi, Raleigh - no worries, I understand. And thank you for understanding!

Post by **JimmyO** » Jul 16, 2018 7:02 am

So - we have some confirmations that the latest refs.sys does the trick. Have we got any figures that indicate we're back to the same performance as before?

Post by **reaperhammer** » Jul 16, 2018 9:23 am

When will Veeam feature RAM requirements for ReFS block clone on the official system requirements page?

Post by **Humphro** » Jul 16, 2018 10:41 am

I can confirm, for our environment, that after applying KB4338814 the refs driver changed from 10.0.14393.2312 to 10.0.14393.2363. After this update was applied to both the source (Veeam server) and the remote repository, the time taken for the full backup merge to complete dropped from over 60 hours to, after a few iterations, less than 3 hours, which is near enough to what the job was taking before.

Post by **LBegnaud** » Jul 16, 2018 6:59 pm

Just throwing out our experience here. Probably not worth much without some additional info, but after fighting for 4 days I feel like sharing regardless.

We have an SoBR with 200TB+ of usable storage spread across 7 physical servers and 12 extents (we try not to have our ReFS volumes be larger than 20TB, because of issues in the past). Of these 7 servers, 6 have had their performance improved after the update. We updated because we were having issues with one of the server's performance. This server having issues is actually identical hardware-wise with one of the others in the SoBR, but it TANKED after the update. Would become unstable after ~2 hours of running small backups. Seems like a warning sign for these ReFS issues is an ever-growing value for "Modified" RAM (not sure if that was mentioned in this thread already).

Anyway, Modified RAM would go higher and higher, then RPC / WMI would start failing on the repo (same old story throughout this thread). You'll notice the graph actually started going down, this is because around 2am the majority of our jobs were outright failed and past the 3 retries, so operations were mostly stopped on the offender, rs-bkptar-1.

Just replaced the newest refs.sys with refs.sys version 2097 on that single server and the server is now rock solid. It is catching up from the failed backups last night at record pace. Before, we were seeing disk response times measured in seconds; now everything is sitting pretty at <50ms, and we are running more concurrent jobs than we were when it would slow to a crawl.

I don't quite understand how refs.sys can be interchangeable like this, but I really hope it doesn't cause some silent corruption that pops up 3 weeks from now...

Post by **Gostev** » Jul 16, 2018 9:59 pm

Did you check to see if this misbehaving server has some software installed that other servers don't? And my other guess, just by looking at server naming (likely your oldest ReFS repository), perhaps it still has some former ReFS tweaks left in the registry? I would try to reinstall Windows on that server first and foremost, as indeed something is very wrong looking at how healthy all other backup repositories are.

Post by **Mgamerz** » Jul 17, 2018 3:33 pm

Newest Server 2016 update (July 16th) now contains the DHCP fix. Installing the update now. I need to learn not to do this before support calls with companies working on my server, though; it never works out for me.

Jul 19, 2018 12:09 am

Yeah Gostev, all great suggestions. I actually brought up a VM on the host, did disk passthrough, renamed things so Veeam saw the VM as the original hardware, and got similar behavior.

Ended up digging through performance metrics and saw a couple of disks in the storage pool with very high max read and write latency... so we replaced those disks and the host seems to be performing like the others. Tonight will be the first night with the VM out of the picture, but testing was good. Looks like this is a common issue with Windows Storage Spaces: disks that are failing but not fully failed.
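For anyone wanting to run a similar check, here is a minimal Python sketch of the idea: compare each disk's maximum latency against the median for the pool and flag the outliers. The disk names, latency figures, and the 10x threshold are all hypothetical and only illustrate the comparison; they are not values from this thread, and how you collect the per-disk numbers is left open.

```python
from statistics import median

def flag_slow_disks(max_latency_ms, factor=10.0):
    """Return the names of disks whose max latency exceeds factor x the pool median."""
    baseline = median(max_latency_ms.values())
    return sorted(
        name for name, ms in max_latency_ms.items() if ms > factor * baseline
    )

# Hypothetical per-disk maximum latencies in milliseconds.
samples = {
    "disk0": 42.0,
    "disk1": 38.0,
    "disk2": 2900.0,
    "disk3": 45.0,
    "disk4": 3400.0,
    "disk5": 40.0,
}

print(flag_slow_disks(samples))  # ['disk2', 'disk4']
```

Using the median rather than the mean keeps a couple of badly behaving disks from dragging the baseline up and hiding themselves.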

Seems like all in all the new update does resolve refs issues once again.

Post by **ejenner** » Jul 20, 2018 2:11 pm

Seem to have this problem here.

The DL380 G9 has had a STOP error three times: twice in June and once last night.

We see 0x00000133 logged on iLo integrated management log

Analysis of the dump file shows ntoskrnl.exe sometimes but also refs.sys

Our refs.sys version is the 10.0.14393.2273 with the 28/04/18 date on it which has been mentioned as problematic.

We're going to try 10.0.14393.2363 and hopefully that'll fix it.

Post by **jonesg** » Jul 31, 2018 10:52 am

We are still seeing issues with the .2363 driver, with slow merges and repository servers becoming temporarily unresponsive on volumes with ReFS (SNMP disk checks fail due to timeout and we are unable to browse the volumes). It seems to be a larger problem when merging larger files. We see a much higher rate of failure when merging increments into fulls when the increment is about 250GB in size. This may just be because the merge runs for a longer period of time, but it definitely seems more prone to error when large merges are running.

We tried reverting back to the .2097 in the start of July to see if stability would return but unfortunately not. This prompted us to open a Veeam case (#03085972) and later a Microsoft case on the subject.

Microsoft returned with the answer that this was a known problem and to install the .2097 version of the driver - which was already on the host. After a short discussion it was decided to install KB4338814 and the .2363 driver. This did not change anything however.

Having returned from holiday and picking up this issue again I can see Microsoft have released a new KB (KB4338822) that, despite it not being mentioned in the release notes, contains an even newer ReFS driver DLL with version ending in .2395. Has anyone tried this newer update?

Post by **Gostev** » Jul 31, 2018 11:59 am

How much RAM do you have on your backup repository server, and what is the ReFS volume size?

Post by **jonesg** » Jul 31, 2018 12:27 pm

It is a bit of mix across 6 physical boxes. Most have 128GB with an exception with a single one that has 256 and another that has 512.

5 boxes have 5x80TB ReFS volumes and the last has 2x90TB ReFS volumes and 1x90TB NTFS.

Only one box seems to be heavily hit by this and I am in the process of getting a service window for that particular box to add an additional 128GB RAM to see if that will lessen the problems.

Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?

Post by **ejenner** » Jul 31, 2018 12:56 pm

Since posting last week the repository server in question crashed again on Saturday evening. Glad to hear it isn't just our one doing it though... reassuring that we've not stuffed anything up and that it seems quite normal...

Edit: ours is 16GB with 55TB ReFS volume.

I'm noticing a common theme that it isn't always the same cause highlighted in the dump files. ccmexec.exe was in the latest dump file. So it's crashed on a different process more or less every time.

Jul 31, 2018 1:21 pm

jonesg wrote: Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?

Consider it official from Veeam. I am still trying to get a word on this from Microsoft, just pinged ReFS PM again.

Post by **DaveWatkins** » Jul 31, 2018 9:19 pm

We've got ~200TB running perfectly fine on 96GB of RAM just as a guideline. It crashed with only 32GB but it appears that 1GB per TB is a good indicator until you get to about 60TB and from then on it seems to drop off in required resources

Post by **zx81** » Aug 01, 2018 12:05 pm

Gostev wrote: Consider it official from Veeam. I am still trying to get a word on this from Microsoft, just pinged ReFS PM again.

Hi,

How is the 1GB/1TB calculated? I have 8 ReFS 10TB repositories on my backup repository VM, each of which is ~70% used. I have 16 backup copy jobs (~60 VMs) that write to these volumes. I'm currently limiting concurrent tasks to 3 per repository in an effort to reduce the number of hard hangs on the VM. What sort of RAM figure should I be allocating to the repository VM?

Aug 01, 2018 4:54 pm

I can speak as part of the Solutions Architect team: this number has been tested in the field across the many ReFS deployments we have worked on and continue to work on. Over time it has proved to be on the safe side, and it was also a "savior" before Microsoft released all their 2018 patches for ReFS. As we deal with very large companies, we "stress tested" this value to the point that we now suggest it as a rule of thumb.
Yes, your mileage may vary, and in fact people have deployments with a smaller ratio, but as we are engaged with critical customers, we prefer to be safe, so we use this value since we know it will not fail on anyone.

In your case, you have 80TB, so you may need 80GB following our rule. It's not about how many jobs there are, but about how much metadata is stored in the volume and needs to be calculated each time block cloning kicks in.
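As a rough illustration of the arithmetic, here is a minimal Python sketch of the 1GB per 1TB rule of thumb applied to total ReFS volume capacity. The function name and the `gb_per_tb` parameter are made up for this example, and the result is only the rule-of-thumb figure discussed in this thread, not a guaranteed requirement.

```python
def recommended_ram_gb(refs_volume_sizes_tb, gb_per_tb=1.0):
    """Return the RAM (in GB) suggested by the 1 GB per 1 TB rule of thumb.

    refs_volume_sizes_tb: sizes of all ReFS volumes on the repository, in TB.
    gb_per_tb: ratio to apply; 1.0 is the rule of thumb from this thread.
    """
    return sum(refs_volume_sizes_tb) * gb_per_tb

# Eight 10 TB ReFS volumes, as in the question above.
print(recommended_ram_gb([10] * 8))  # 80.0
```

Note that the sum uses volume size rather than used space, matching the 80TB figure in the reply above.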

Post by **EzE** » Aug 01, 2018 5:02 pm

jonesg wrote: Having returned from holiday and picking up this issue again I can see Microsoft have released a new KB (KB4338822) that, despite it not being mentioned in the release notes, contains an even newer ReFS driver DLL with version ending in .2395. Has anyone tried this newer update?

OMG

I'm having slow merge issues again and found my ReFS version at .2395! The merge is only at 53% after 6 hours for a 1.1TB full backup file. I'm not 100% sure .2395 is the problem yet, but beware! Please post your observed performance if you have applied KB4338822. I really would like this merge to finish, so it will be a while until I can uninstall the update and test.

Post by **jonesg** » Aug 02, 2018 7:03 am

From the Microsoft case I got confirmation yesterday that the .2395 version should contain performance updates, but they are not validated enough to warrant being written up in the change logs. I am in the process of rolling the update onto our servers now and will return with results at a later time, probably sometime next week at the earliest. We still haven't managed to get a memory upgrade for the servers to bump from 128GB for 400TB to 256GB (the memory modules I had seem to be incompatible for some reason...).

Post by **ejenner** » Aug 06, 2018 3:58 pm

We upped our new repository from 16GB to 32GB by borrowing some RAM from our other new repository (which is off, not installed yet). Had another crash last night. Just reading the posts above it seems like we really ought to double what we have before trying to find any other possible causes!

Post by **Iain_Green** » Aug 10, 2018 10:10 am

Appears we have hit this issue again.

Running version 10.0.14393.2312, being told to downgrade by Veeam support to 10.0.14393.2097, don't want to uninstall updates.
Is there a way of getting hold of the driver and installing?

Post by **Gostev** » Aug 10, 2018 11:54 am

That's strange advice from our support; I will investigate why they are telling people this. Version 2312 is from a couple of months ago, and as per this thread this driver version is known to be problematic because it contains the performance regression introduced back in May. You should simply install the latest Windows updates, or at least the July Cumulative Update (KB4338814), which has the newer ReFS driver version that many people have confirmed to work well. Thanks!
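A small Python sketch for anyone checking this across several servers: it compares an installed refs.sys version string against the .2363 build that this thread reports shipping with KB4338814. The function names are made up for the example, and it only tells you whether a build predates that update; it says nothing about whether any particular build is stable in your environment.

```python
def parse_version(text):
    """Turn a dotted version such as '10.0.14393.2363' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# refs.sys build reported in this thread for the July CU (KB4338814).
JULY_CU_REFS = parse_version("10.0.14393.2363")

def predates_july_cu(installed):
    """Return True if the installed refs.sys build is older than the July CU build."""
    return parse_version(installed) < JULY_CU_REFS

print(predates_july_cu("10.0.14393.2312"))  # True
print(predates_july_cu("10.0.14393.2363"))  # False
```

Comparing tuples of integers avoids the trap of comparing version strings as text, where "10.0.14393.999" would sort after "10.0.14393.2363".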

Post by **Iain_Green** » Aug 10, 2018 1:36 pm

@gostev

thanks, case number is 03141694.

He even quoted your old post from this thread

I will arrange the install of this update.

Post by **MERBAG** » Aug 13, 2018 7:50 am

Hi all,

We also had issues using ReFS, especially with the duration of the fast clone process, which took up to 12 hours. We worked through this forum and did all the steps provided, such as updating the refs.sys driver version.

At the end of a long case, support asked us to change the advanced settings for each job to storage optimization "Local target (16TB+ backup files)", even though our backup files are not larger than 16TB. Now the time for the fast clone process has dropped from 8-12 hours to 10 minutes.

Hope this is helpful for someone.

Post by **Gostev** » Aug 13, 2018 10:03 am

This setting makes Veeam operate with 8MB blocks (as opposed to 1MB blocks by default), which basically reduces the number of cloned blocks 8x. I suppose it can be an alternative to increasing RAM size on the backup repository, although this will also increase incremental backup file sizes quite significantly.
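To put numbers on the 8x figure, here is a minimal Python sketch that counts how many fixed-size blocks are needed to cover a file at each block size. It assumes a 1TB file and ignores compression, deduplication, and how blocks are actually laid out on disk, so treat it as simple arithmetic rather than a model of the product.

```python
MB = 1024 ** 2
TB = 1024 ** 4

def block_count(file_size_bytes, block_size_bytes):
    """Number of fixed-size blocks needed to cover the file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

full_backup = 1 * TB  # assumed file size, for illustration only

default_blocks = block_count(full_backup, 1 * MB)
large_blocks = block_count(full_backup, 8 * MB)

print(default_blocks)                  # 1048576
print(large_blocks)                    # 131072
print(default_blocks // large_blocks)  # 8
```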

Post by **l0stb@ackup** » Aug 14, 2018 12:15 am

Can anyone share more feedback on the newer ReFS driver versions and CU updates? Have looked at KB4338814 (ReFS driver 2363) but this CU contains some nasty bugs I don't want my customer exposed to. There have been 3 newer CUs since. I cannot believe how difficult Microsoft is making this - including an improved driver in a bad update, sheesh....

Post by **jim3cantos** » Aug 14, 2018 9:12 am

KB4338822 (Last 2018-07 Cumulative Update) installed and checked version of ReFS.sys file at .2395. Will post again if problems are detected.

Post by **wingphil** » Aug 14, 2018 9:36 am

I had a server lockup again last week and again this morning. Refs driver version 2363, and we have about 14TB of repository space and four vCPUs (running under VMware).

We had 20GB ram assigned. I've upped it to 24GB. Should I still be expecting to see problems? I know this is a lot less than the rest of you have allocated, but it's a lot more than the 1GB/1TB recommended.

Thanks,

Phil

Aug 14, 2018 7:28 pm

ReFS.sys 2395 (2018-07 CU) has been abysmal in our environments. We have had severe server locking issues and downtime during fast cloning since that cumulative update.

Today fresh off the Microsoft presses, there is KB4343887 which does not change the ReFS.sys driver version.

Fun times.
