REFS issues (server lockups, high CPU, high RAM)

Post by **Gostev** » May 24, 2017 10:39 pm

mkretzer wrote:@gostev: Is the fix also for the "Filesystem getting slow over time" issue?

No, it's for the core issue discussed in this thread (server becoming totally unresponsive and eventually crashing).

Post by **thomas.raabo** » May 25, 2017 7:07 am

Hi All.

I don't know why Microsoft and Veeam have not fixed this. It's so easy to reproduce.

We have had issues from day one on Veeam 9.5 with ReFS 64K:

Server crashing
High memory usage
Disk going "virtual offline" and stalling all jobs.

etc....
I can't imagine Veeam did not hit this in testing. Out of 10 Veeam installations in our hosting setup, ALL have this issue.

Microsoft ticket = Yes!
Veeam ticket = Yes!

So here is a short-term workaround until Veeam and Microsoft get their act together: disable block clone.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication]
"RefsVirtualSyntheticDisabled"=dword:00000001

Post by **rfn** » May 25, 2017 8:41 am

OK what a coincidence... the server got unresponsive this night! It has been running for much longer this time, but apparently the issue is not fixed. Maybe we're still being hit by ReFS?

I can see that the processor went 100% at 2.00 AM and at around 2.55-3.00 the server didn't respond to WMI anymore (we're using PRTG Network Monitor). Memory usage didn't seem to have changed during that time. Pagefile usage did increase, but from 0% to 0,3%, so that doesn't look very dramatic, but it was connected to this as the increase happened when the CPU usage climbed.

I will try to add some more sensors to monitor disk activity.

EDIT: Now PRTG can't get any WMI data from the server, even though it has been restarted... Weird!

Post by **kubimike** » May 25, 2017 6:10 pm

@Gostev thanks for letting us know microsoft might have found the problem. Your post is a bit vague, could you fill us in on what they think the issue is ? I still have my microsoft ticket open still no answers.

Post by **kubimike** » May 25, 2017 6:12 pm

@rfn, for all your jobs which has the largest retention period ? And what's it set to. I have a feeling this issue comes up from pruning old backups. Also have you tried disabling the disk integrity scans ? I'm not running the latest microsoft patches, I've disabled disk integrity scans and I've limited my retention periods. the problem has gone away.

Post by **Gostev** » May 25, 2017 7:56 pm

thomas.raabo wrote: I can't imagine Veeam did not hit this in testing. Out of 10 Veeam installations in our hosting setup, ALL have this issue.

Unfortunately that is true, but we're trying our best to reproduce the issue still. By now it is clear that the impacted deployments are somehow special, but we're yet to figure out what difference is causing the issue.

thomas.raabo wrote: I don't know why Microsoft and Veeam have not fixed this.

Because it would appear no one impacted is actually interested in pursuing the resolution? Look, I've posted the invitation on the previous page, and I've got zero volunteers over the past 2 days...

kubimike wrote:Your post is a bit vague, could you fill us in on what they think the issue is ? I still have my microsoft ticket open still no answers.

Sorry, it's vague "by design" as the detailed information was provided to me under NDA.

rfn wrote:I can see that the processor went 100% at 2.00 AM and at around 2.55-3.00 the server didn't respond to WMI anymore

Can you please check the job logs to see if some jobs have possibly been processing retention before that time (deleting a lot of files belonging to the oldest backup chains, especially with per-VM jobs)? This is one of my ideas.

Post by **kubimike** » May 25, 2017 9:17 pm

I'd love to help out and test but that would require me to undo my bandaids and deal with a busted box again. There were so many people here with the issue. I guess everyone here went back to NTFS or another backup solution ?

Post by **rkovhaev** » May 26, 2017 3:02 am

Just for visibility, I would like to share code 0x139 bugcheck pattern (user-space pattern) that I have observed in support case with my customer:

Our user-space process VeeamAgent.exe executes WriteFile() WinAPI and execution control goes to kernel-space. Windows Kernel then gets stuck on execution of that single(!) WriteFile() WinAPI and Windows kernel doesn't return the control back to our user-space process within 1 hour.
In our software (in our transport service) we have detection for ReadFile() WriteFile() WinAPI call hangs - because we have observed similar Windows Kernel behavior before (for example with Windows Dedup) - and after 1 hour our transport service terminates user-space process VeeamAgent.exe.
As soon as VeeamAgent.exe is terminated - shortly after Windows kernel bugchecks with code 0x139.

Apart from bugcheck 0x139, in my case, I also observe bugcheck 0x133 and issue with server being unresponsive.
Microsoft premier support case no: 117052015772881

example from logs:
Svc.VeeamTransport.3.log:[23.04.2017 01:09:54] < 4528> tpl| Agent '{d0696b4c-ed9e-47de-9422-1b2c1a4f9d20}' will be terminated due to reason: some I/O operation has hanged.
Svc.VeeamTransport.3.log:[21.04.2017 07:55:13] < 8404> tpl| Agent '{f45b5725-7a96-48bb-adf0-753d1433281b}' will be terminated due to reason: some I/O operation has hanged.
Svc.VeeamTransport.3.log:[19.04.2017 23:05:20] < 8136> tpl| Agent '{da5f340a-8cdc-4f01-87f9-f23dcfb57b96}' will be terminated due to reason: some I/O operation has hanged.
corresponding bugchecks:
Error,4/23/2017 1:14:31 AM,EventLog,6008,None,The previous system shutdown at 1:09:51 AM on 4/23/2017 was unexpected.
Error,4/21/2017 8:02:17 AM,EventLog,6008,None,The previous system shutdown at 7:55:30 AM on 4/21/2017 was unexpected.
Error,4/19/2017 11:13:25 PM,EventLog,6008,None,The previous system shutdown at 11:07:23 PM on 4/19/2017 was unexpected.

That 1 hour timeout in the transport service could be extended, but in my case increasing that timeout to 6 hours did not help. The Windows kernel was stuck on a single WriteFile() API call going to ReFS for 6 hours (!). And when the VeeamAgent.exe process was terminated, the Windows kernel bugchecked with code 0x139. At the moment of the 0x139 bugcheck there were no Veeam kernel modules loaded, so our software could not have corrupted the linked list structure in kernel-space. However, by terminating the user-space process we catalyze the 0x139 bugcheck.
So the bug that causes 0x139 bug-check lies purely in kernel-space, it is either Windows kernel bug or bug in one of the kernel modules (standard Windows drivers, or 3rd party drivers).
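
The pairing of transport-service terminations with unexpected shutdowns above can be checked mechanically. What follows is only a minimal Python sketch, not a Veeam or Microsoft tool: it assumes the two line formats look exactly like the samples quoted in this post, and the function names and the 5-minute matching window are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the post above.
TRANSPORT_LOG = """\
Svc.VeeamTransport.3.log:[23.04.2017 01:09:54] < 4528> tpl| Agent '{d0696b4c-ed9e-47de-9422-1b2c1a4f9d20}' will be terminated due to reason: some I/O operation has hanged.
Svc.VeeamTransport.3.log:[21.04.2017 07:55:13] < 8404> tpl| Agent '{f45b5725-7a96-48bb-adf0-753d1433281b}' will be terminated due to reason: some I/O operation has hanged.
Svc.VeeamTransport.3.log:[19.04.2017 23:05:20] < 8136> tpl| Agent '{da5f340a-8cdc-4f01-87f9-f23dcfb57b96}' will be terminated due to reason: some I/O operation has hanged.
"""
EVENT_LOG = """\
Error,4/23/2017 1:14:31 AM,EventLog,6008,None,The previous system shutdown at 1:09:51 AM on 4/23/2017 was unexpected.
Error,4/21/2017 8:02:17 AM,EventLog,6008,None,The previous system shutdown at 7:55:30 AM on 4/21/2017 was unexpected.
Error,4/19/2017 11:13:25 PM,EventLog,6008,None,The previous system shutdown at 11:07:23 PM on 4/19/2017 was unexpected.
"""

HANG_RE = re.compile(
    r"\[(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\].*"
    r"Agent '(\{[0-9a-f-]+\})' will be terminated due to reason: "
    r"some I/O operation has hanged"
)
SHUTDOWN_RE = re.compile(
    r"The previous system shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"on (\d{1,2}/\d{1,2}/\d{4}) was unexpected"
)


def parse_hangs(text):
    """Return (timestamp, agent_id) for every 'I/O operation has hanged' line."""
    return [
        (datetime.strptime(m.group(1), "%d.%m.%Y %H:%M:%S"), m.group(2))
        for m in HANG_RE.finditer(text)
    ]


def parse_shutdowns(text):
    """Return the shutdown time reported by each Event ID 6008 line."""
    # Exported event text may carry invisible left-to-right marks; drop them.
    text = text.replace("\u200e", "")
    return [
        datetime.strptime(f"{day} {clock}", "%m/%d/%Y %I:%M:%S %p")
        for clock, day in SHUTDOWN_RE.findall(text)
    ]


def correlate(hangs, shutdowns, window=timedelta(minutes=5)):
    """Pair each agent termination with shutdowns within `window` of it."""
    return [
        (agent, when, down)
        for when, agent in hangs
        for down in shutdowns
        if abs(down - when) <= window
    ]


pairs = correlate(parse_hangs(TRANSPORT_LOG), parse_shutdowns(EVENT_LOG))
for agent, when, down in pairs:
    print(f"{agent}: terminated {when}, unexpected shutdown {down}")
```

The window is compared as an absolute difference because the recorded shutdown time can land a few seconds before the termination line is written, as in the 23.04.2017 sample.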

May 26, 2017 9:36 am

Hi Gostev.

I'm willing to give you guys full access to our backup setup.
We have disabled synthetic fulls.

From our point of view it looks like it happens every time the metadata is touched! But block clone and delete operations are for sure where everything blows up.

https://snag.gy/Ge401z.jpg

The disk goes offline and Veeam just slows down. Sometimes the only way to get the disk back is a reboot.
Performance counters stop working on all disks.

So, here is our setup:

Cisco C3260 running with either a RAID controller or an HBA (with Storage Spaces)
256 GB RAM per node
200 TB per node

Large trim set enabled
Per VM backup enabled

Btw, we did a demo setup on VMs last week (Veeam running on all VMs) and it was 100% the same.

Post by **Gostev** » May 26, 2017 2:01 pm

@kb1ibt @thomas.raabo thanks for your help, I have forwarded your details to our support org, they will be reaching out to you soon with the private fix.

@All I just realized that one other potential suspect here could be the antivirus presence (this could be that dramatic difference between our QC labs and your environments that I've been looking for). Anti-viruses trigger volume flushes, which can add a lot of latency to ongoing ReFS IOs - and during times when the repository is already heavily loaded, it may be "the last straw" (and an elephant-sized one).

Post by **kubimike** » May 26, 2017 2:25 pm

@Gostev, my system is without 3rd party antivirus engines, just Microsoft's malware engine/scanner. I wanted to get the system stable before introducing more variables.

Post by **kb1ibt** » May 26, 2017 2:33 pm

@gostev, like @kubimike, I do not have 3rd party AV installed on either of the 2 repository servers impacted by this issue.

Post by **Gostev** » May 26, 2017 4:39 pm

OK thanks for confirming.

Post by **rfn** » May 26, 2017 11:01 pm

kubimike wrote:@rfn, for all your jobs which has the largest retention period ? And what's it set to. I have a feeling this issue comes up from pruning old backups. Also have you tried disabling the disk integrity scans ? I'm not running the latest microsoft patches, I've disabled disk integrity scans and I've limited my retention periods. the problem has gone away.

I have two jobs... one with 7 days retention and one with 30 days retention. For some reason I haven't figured out yet, the two jobs actually have a lot more restore points than that.

The server died last night as well... It had been stable for two weeks and now it's hanging all the time! So frustrating! Our backups start at 22.00 and at 23.22 our PRTG monitoring tool stopped getting WMI data from the server, and I confirmed that it was not responding to RDP. I reset the server and it came up as usual. Every time this happens I have to rescan the repository to make the jobs work again. It then updates one repository and everything works again. Usually it then works for some time, but this time I retried the failed job and after it completed the server became unresponsive again. The RDP session is still connected, but the server is really slow to react and the ReFS volume is impossible to access.

I have the jobs set to run synthetic fulls on Saturdays, but I guess that shouldn't do anything until the job runs on Saturday night? I haven't disabled the disk integrity scans.

When I look at the data from PRTG, I can see that both CPU usage and disk IO climbed straight up just before the WMI data stopped. So something CPU and IO demanding started and then the server became unresponsive...

Post by **kubimike** » May 27, 2017 2:29 pm

@rfn go into task scheduler turn off the jobs for data integrity checks. There is two. Let me know how it goes. Job purges only happen after a successful backup. If it's constantly crashing it won't prune

Post by **rfn** » May 28, 2017 11:36 am

kubimike wrote:@rfn go into task scheduler turn off the jobs for data integrity checks. There is two. Let me know how it goes. Job purges only happen after a successful backup. If it's constantly crashing it won't prune

I disabled them but the server died yesterday anyway... It seems like it died before Saturday's backup, because the last run was from the night between Friday and Saturday and was a success, so it hadn't started the backup on Saturday night.

We run synthetic fulls on Saturday.

Post by **JimmyO** » May 29, 2017 7:15 am

What if I disable blockclone ("RefsVirtualSyntheticDisabled"=dword:00000001) ? Will I be able to enable it again without having to create new fulls etc.?

Post by **Gostev** » May 29, 2017 11:35 am

In theory, you won't need to create new fulls.

Post by **Delo123** » May 29, 2017 12:06 pm

Hmm, that's interesting. We do not use retention (we keep all backups until the JBODs are full and then replace it completely), maybe that is one of the reasons we are seeing no issues at all (until now), just over 100TB's full (100TB used/190TB data) on the current JBOD with synthetic fulls.

Post by **Gostev** » May 29, 2017 3:53 pm

@JimmyO and @Nilsn I've got your case IDs.

Post by **kubimike** » May 29, 2017 3:57 pm

@rfn what type of server / storage ?

Post by **rfn** » May 30, 2017 8:06 am

kubimike wrote:@rfn what type of server / storage ?

It's a HPE DL380 Gen9 LFF with a SmartArray P840ar. It has 12 6 TB NL-SAS drives setup as one RAID 6 drive, which ends up as 55 TB in Windows, which is obviously Windows Server 2016 (Standard).

The server has a Xeon E5-2620v4 CPU and 192 GB RAM. The RAM has been temporarily increased to 192 GB RAM but hasn't really helped.
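
The 55 TB figure follows from the drive count and the unit Windows uses. A quick worked check in Python, assuming the drives are 6 TB in decimal (vendor) units, that RAID 6 gives up two drives' worth of capacity to parity, and ignoring any controller or file system overhead (the function name is just for illustration):

```python
TB = 10**12   # decimal terabyte, as printed on the drive label
TIB = 2**40   # binary tebibyte, which Windows displays as "TB"


def raid6_usable_bytes(drive_count, drive_bytes):
    """Usable capacity of a RAID 6 set: two drives' worth goes to parity."""
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drive_count - 2) * drive_bytes


usable = raid6_usable_bytes(12, 6 * TB)
print(usable // TB, "TB (decimal)")
print(round(usable / TIB, 2), "TiB (what Windows reports as TB)")
```

So 10 data drives give 60 TB decimal, which is roughly 54.6 TiB and shows up as about 55 TB in Windows.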

Post by **kubimike** » May 30, 2017 10:23 am

@rfn OK this might sound like a silly question but whats the strip size set to on your array ? Also have you patched with HPs latest April update PSP ? We are running the same gear and I feel like we've both now faced these issues. I finally got mine stable.

Post by **rfn** » May 30, 2017 11:40 am

There's no such thing as a silly question. If we can get this problem solved it would be fantastic...

The stripe size is 256 KB, and I have patched the server with the latest April PSP.

Post by **kubimike** » May 30, 2017 1:26 pm

@rfn OK, so you're running firmware 5.05 for sure? I had to lose my array for HP to fix the bug I discovered with RAID 6. Also, per their recommendation I was set to a stripe size of 512k. This was causing a buffer overrun because the memory was always full. If you navigate to your iLO and take a look at the IML, do you see any alerts at all from the P840? Was 256KB the default when you were setting up striping? I believe mine defaulted to 128k.

Post by **suprnova** » May 30, 2017 1:33 pm

I was hoping to avoid the issue by not using synthetic fulls, but this issue is also happening for incremental merges with block cloning. My CPU and RAM are fine, but during the merge I am unable to browse the Veeam repo drive in Windows.

I am fully patched and I have RefsEnableLargeWorkingSetTrim set to 1.

Post by **rfn** » May 30, 2017 1:40 pm

The firmware is 5.04... I don't see a download for 5.05 anywhere?

The 256KB was the default because I didn't set anything manually.

There's no alerts regarding the P840 in the IML. The hardware seems to be working fine. BUT... I do have one recurring error in the Windows Application log that is annoying me A LOT. Every few minutes I get an error from "PerfNet" with the message "Unable to read Server Queue performance data from the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information."

I have googled for an answer but with no luck. Nothing that seems to relate to Windows Server 2016 or this hardware. I can't help but think that something is wrong somewhere, but I just don't know where. I didn't install Windows on the server, as the consultant from the supplier had already done that. The rest of the configuration has been done by me.

Post by **kubimike** » May 30, 2017 4:14 pm

@rfn are you using Win 2016 NIC teaming? I didn't have any luck with that stuff. It caused my server to bomb. I got a 10GB HP PCI instead. 5.04 is the latest; I just had a typo, sorry.

As for the error you must have some 32 bit app installed causing that error ?? https://support.microsoft.com/en-us/hel ... processors

Post by **rfn** » May 30, 2017 4:50 pm

Yes, I'm using NIC teaming... I have a HP 10G NIC where I have teamed the two connectors and connected them to two HPE 5900 series switches that are stacked for redundancy.

I literally only have Windows Server, Veeam Backup & Replication and the HPE drivers and tools on the server. Nothing else... I also got the search result that you're linking to.

Post by **EricJ** » May 30, 2017 6:18 pm

rfn -

Sorry if you mentioned this already, but are you using any of the registry keys from the Microsoft KB?

We still had frequent lockups after applying RefsEnableLargeWorkingSetTrim. I bumped the RAM from 16GB to 20GB, and also set the key RefsNumberOfChunksToTrim to "32" (decimal). Since just those two changes, we have been stable for over two months now.

Here is an animation of RAMMap after the RAM bump and second registry key during two big synthetic full fast clones. You can see the Metafile active usage levels off after 5GB. It would be interesting to see what RAMMap shows for you during a fast clone operation that causes a lockup.
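
One detail worth spelling out for the "32" (decimal) value: in a .reg file a `dword:` value is written as 8 hexadecimal digits, so decimal 32 becomes `dword:00000020`. Below is a minimal Python sketch that renders such a fragment for the two values named above. The key path is an assumption (the post does not state it, so take the actual path from the Microsoft KB), and `reg_dword` and `render_reg` are hypothetical helper names.

```python
# ASSUMPTION: this key path is not given in the post; verify it against the
# Microsoft KB before importing anything into a real registry.
ASSUMED_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem"


def reg_dword(value):
    """Format a decimal integer the way .reg files expect: 8 hex digits."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD values must fit in 32 bits")
    return f"dword:{value:08x}"


def render_reg(key_path, values):
    """Render a .reg fragment for one key; `values` maps name -> decimal int."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    lines += [f'"{name}"={reg_dword(number)}' for name, number in values.items()]
    return "\r\n".join(lines) + "\r\n"


fragment = render_reg(
    ASSUMED_KEY,
    {
        "RefsEnableLargeWorkingSetTrim": 1,
        "RefsNumberOfChunksToTrim": 32,  # decimal 32, i.e. hex 20
    },
)
print(fragment)
```

The sketch only builds text and never touches the registry, so the result can be reviewed before anything is imported.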

Re: REFS 4k horror story