Raleigh
Novice
Posts: 7
Liked: never
Joined: Jun 26, 2018 11:33 pm
Full Name: Raleigh
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Raleigh » Jul 13, 2018 10:29 pm

Thank you for the reply, Gostev.

Point taken about opening support cases with Microsoft regarding ReFS issues. If having many Veeam customers open cases with Microsoft will better motivate them to resolve the issues, then I am happy to participate in that. Microsoft Support has not yet admitted to me that my issue is the result of any known bug (they are *still* in the process of analyzing my memory.dmp file). So I may have to push them on that.

You are also correct: I have no idea what facilities and resources Veeam engineers have engaged on this issue. The Veeam support technicians I worked with never mentioned that Veeam was working directly with MS to resolve the ReFS issues. Actually, the first Veeam Support tech that I worked with when I opened the ticket (this was back in early April) told me that it was her understanding that the ReFS issues were resolved by the February Windows Updates. So apparently, she was not aware of any ongoing initiative with Microsoft either, or at least didn’t feel it was relevant to my issue.

Yes, it’s true that I created my login to the Veeam Community Forum only several weeks ago, but I have been reading this topic thread since my problem began. The first Veeam Support tech told me about this forum topic. I did not need to create an account until I wanted to submit a post. I only wish I had done that much sooner. I will not make that mistake again, since this forum is where the solution to my repo server issue came from.

Finally, I want to be clear that I offered my suggestions for constructive purposes. I do not mean to come off like I’m simply “bagging” on Veeam. I would truly like to help make it better. FYI, prior to becoming a Veeam customer at the end of March, we were (for many, many years) a Symantec Backup Exec shop. I just got tired of that product. I felt like I was constantly babysitting the system, dealing with agent updates on servers, dealing with backups that failed for this reason or that reason, dealing with (IMHO) very poor support, and having to work with a product that simply was not designed from the beginning to work with VMware VMs. So yes, I hope you can appreciate that I was a bit frustrated when I found myself babysitting my shiny new Veeam backup system only two weeks into using it, and I am sure that frustration came through in my post. But I intend to be a Veeam customer for the foreseeable future, so if I do comment, it is meant constructively. And do feel free to correct me when I’m wrong or misinformed. I can take it!

Thanks,
Raleigh

Gostev
Veeam Software
Posts: 23116
Liked: 2917 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Jul 13, 2018 10:54 pm

Hi, Raleigh - no worries, I understand. And thank you for understanding!

JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by JimmyO » Jul 16, 2018 7:02 am

So - we have some confirmations that the latest refs.sys does the trick. Do we have any figures that indicate we're back to the same performance as before?

reaperhammer
Enthusiast
Posts: 26
Liked: 7 times
Joined: Aug 18, 2016 7:59 pm
Full Name: Will S
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by reaperhammer » Jul 16, 2018 9:23 am

When will Veeam feature RAM requirements for ReFS block clone on the official system requirements page?

Humphro
Novice
Posts: 4
Liked: 1 time
Joined: Mar 09, 2017 1:35 pm
Full Name: Matthew Humphreys
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Humphro » Jul 16, 2018 10:41 am

I can confirm that, in our environment, applying KB4338814 changed the refs driver from 10.0.14393.2312 to 10.0.14393.2363. After this update was applied to both the source (Veeam server) and the remote repository, the time taken for a full backup merge to complete dropped from over 60 hours to, after a few iterations, less than 3 hours, which is close to what the job took before.
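
For anyone comparing driver builds across several repositories, here is a minimal Python sketch of the version check described above. It only compares dotted version strings; how you read the refs.sys version off each server is left out, and the function names are made up for illustration. It also says nothing about whether a given build is actually stable, since other posts in this thread report good results on older builds.

```python
def parse_version(text):
    """Turn a dotted version such as '10.0.14393.2363' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Build installed by KB4338814, as reported in the post above.
PATCHED = parse_version("10.0.14393.2363")


def is_at_least_patched(installed):
    """True if the installed refs.sys build is the KB4338814 build or newer."""
    return parse_version(installed) >= PATCHED


if __name__ == "__main__":
    # The two builds mentioned in the post above (before and after the update).
    for version in ("10.0.14393.2312", "10.0.14393.2363"):
        print(version, is_at_least_patched(version))
```

Comparing tuples of integers avoids the trap of comparing the strings directly, which would sort "10.0.14393.999" after "10.0.14393.2363".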

LBegnaud
Service Provider
Posts: 17
Liked: 6 times
Joined: Jan 24, 2018 12:08 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by LBegnaud » Jul 16, 2018 6:59 pm

Just throwing out our experience here. Probably not worth much without some additional info, but after fighting for 4 days I feel like sharing regardless.

We have an SoBR with 200TB+ of usable storage spread across 7 physical servers and 12 extents (we try not to have our ReFS volumes be larger than 20TB, because of issues in the past). Of these 7 servers, 6 have had their performance improved after the update. We updated because we were having issues with one of the server's performance. This server having issues is actually identical hardware-wise with one of the others in the SoBR, but it TANKED after the update. Would become unstable after ~2 hours of running small backups. Seems like a warning sign for these ReFS issues is an ever-growing value for "Modified" RAM (not sure if that was mentioned in this thread already).

[Image: graph of "Modified" RAM over time]

Anyway, Modified RAM would go higher and higher, then RPC / WMI would start failing on the repo (same old story throughout this thread). You'll notice the graph actually started going down, this is because around 2am the majority of our jobs were outright failed and past the 3 retries, so operations were mostly stopped on the offender, rs-bkptar-1.

Just replaced the newest refs.sys with refs.sys version 2097 on that single server and the server is now rock solid. Catching up from the failed backups last night at record pace. Before, we were seeing disk response times measured in seconds; now everything is sitting pretty at <50ms. Running more concurrent jobs than we were when it would slow to a crawl.

I don't quite understand how refs.sys can be interchangeable like this, but I really hope it doesn't cause some silent corruption that pops up 3 weeks from now...

Gostev
Veeam Software
Posts: 23116
Liked: 2917 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Jul 16, 2018 9:59 pm

Did you check to see if this misbehaving server has some software installed that other servers don't? And my other guess, just by looking at server naming (likely your oldest ReFS repository), perhaps it still has some former ReFS tweaks left in the registry? I would try to reinstall Windows on that server first and foremost, as indeed something is very wrong looking at how healthy all other backup repositories are.

Mgamerz
Enthusiast
Posts: 62
Liked: 8 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Mgamerz » Jul 17, 2018 3:33 pm

Newest server 2016 update (July 16th) now contains DHCP fix. Installing update now. I need to learn to not do this before support calls with companies working on my server though, it never works out for me doing this :)

LBegnaud
Service Provider
Posts: 17
Liked: 6 times
Joined: Jan 24, 2018 12:08 am
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by LBegnaud » Jul 19, 2018 12:09 am

Yeah, Gostev, all great suggestions. I actually brought up a VM on the host, did disk passthrough, renamed things so Veeam saw the VM as the original hardware, and got similar behavior.

Ended up digging through performance metrics and saw a couple of disks in the storage pool with very high max read and write latency... so we replaced those disks and the host seems to be performing like the others. Tonight will be the first night with the VM out of the picture, but testing was good. Looks like this is a common issue with Windows Storage Spaces: disks that are failing but not fully failed.

Seems like, all in all, the new update does resolve the ReFS issues once again.

ejenner
Expert
Posts: 135
Liked: 15 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Jul 20, 2018 2:11 pm

Seem to have this problem here.

The DL380 G9 has had a STOP error three times: twice in June and once last night.

We see 0x00000133 logged in the iLO Integrated Management Log.

Analysis of the dump file sometimes shows ntoskrnl.exe, but also refs.sys.

Our refs.sys version is 10.0.14393.2273, dated 28/04/18, which has been mentioned as problematic.

We're going to try 10.0.14393.2363 and hopefully that'll fix it.

jonesg
Novice
Posts: 3
Liked: never
Joined: Jul 31, 2018 10:44 am
Full Name: Jonas Groth
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by jonesg » Jul 31, 2018 10:52 am

We are still seeing issues with the .2363 driver, with slow merges and repository servers becoming temporarily unresponsive on volumes with ReFS (SNMP disk checks fail due to timeout and we are unable to browse the volumes). It seems to be a larger problem when merging larger files. We see a much higher rate of failure when merging increments into fulls when the increment is about 250GB in size. This may just be because it runs for a longer period of time, but it definitely seems more prone to error when large merges are running.

We tried reverting to .2097 at the start of July to see if stability would return, but unfortunately it did not. This prompted us to open a Veeam case (#03085972) and later a Microsoft case on the subject.

Microsoft returned with the answer that this was a known problem and to install the .2097 version of the driver - which was already on the host. After a short discussion it was decided to install KB4338814 and the .2363 driver. This did not change anything however.

Having returned from holiday and picking up this issue again I can see Microsoft have released a new KB (KB4338822) that, despite it not being mentioned in the release notes, contains an even newer ReFS driver DLL with version ending in .2395. Has anyone tried this newer update?

Gostev
Veeam Software
Posts: 23116
Liked: 2917 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Jul 31, 2018 11:59 am

How much RAM do you have on your backup repository server, and what is the ReFS volume size?

jonesg
Novice
Posts: 3
Liked: never
Joined: Jul 31, 2018 10:44 am
Full Name: Jonas Groth
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by jonesg » Jul 31, 2018 12:27 pm

It is a bit of a mix across 6 physical boxes. Most have 128GB, except for one that has 256GB and another that has 512GB.

5 boxes have 5x80TB ReFS volumes and the last has 2x90TB ReFS volumes and 1x90TB NTFS.

Only one box seems to be heavily hit by this and I am in the process of getting a service window for that particular box to add an additional 128GB RAM to see if that will lessen the problems.

Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?

ejenner
Expert
Posts: 135
Liked: 15 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by ejenner » Jul 31, 2018 12:56 pm

Since posting last week the repository server in question crashed again on Saturday evening. Glad to hear it isn't just our one doing it though... reassuring that we've not stuffed anything up and that it seems quite normal... :lol:

Edit: ours is 16GB with 55TB ReFS volume.

I'm noticing a common theme that it isn't always the same cause highlighted in the dump files. ccmexec.exe was in the latest dump file. So it's crashed on a different process more or less every time.

Gostev
Veeam Software
Posts: 23116
Liked: 2917 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Jul 31, 2018 1:21 pm

jonesg wrote: Is there an official guideline out that states the 1GB memory per 1TB ReFS or is it still just an unofficial recommendation?

Consider it official from Veeam. I am still trying to get a word on this from Microsoft, just pinged ReFS PM again.
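
As a rough illustration of the 1GB per 1TB guideline, here is a minimal Python sketch that applies it to two configurations quoted earlier in this thread. It assumes the rule is applied to the total ReFS capacity on a server and ignores whatever the OS and other services need, neither of which is spelled out in the thread. The function names are hypothetical.

```python
# Rule of thumb from this thread: 1 GB of RAM per 1 TB of ReFS volume.
GB_PER_TB = 1.0


def recommended_ram_gb(refs_volume_sizes_tb):
    """RAM (GB) the rule of thumb suggests for the given ReFS volume sizes (TB).

    Assumption: the rule applies to the sum of all ReFS volumes on the server.
    """
    return GB_PER_TB * sum(refs_volume_sizes_tb)


def ram_shortfall_gb(installed_ram_gb, refs_volume_sizes_tb):
    """How many GB short of the guideline the server is (0.0 if it meets it)."""
    return max(0.0, recommended_ram_gb(refs_volume_sizes_tb) - installed_ram_gb)


if __name__ == "__main__":
    # Configurations quoted earlier in the thread.
    print(ram_shortfall_gb(128, [80] * 5))  # 128GB box with 5x80TB ReFS -> 272.0
    print(ram_shortfall_gb(16, [55]))       # 16GB box with 55TB ReFS    -> 39.0
```

Under that reading, both servers sit well below the guideline. NTFS volumes are left out of the sum since the guideline is stated for ReFS only.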
