Windows 2016 REFS failure on high load during active full.

Post by **ashleyw** » Dec 14, 2016 3:53 am

Hi,

We have a problem that we have been battling since we moved to Server 2016 (for all parts of our backup infrastructure).

We have a VMware host with storage consumed from an iSCSI SAN. We present the block storage to a Windows 2016 VM and create a 60TB ReFS volume on it.

What we are seeing at random times (especially during active full backups) is that the ReFS layer flatlines on CPU (100% across 4 cores). Before it flatlines, we see consistent speeds of around 150-250MB/s, with nothing at the SAN layer or the Veeam layer to indicate anything wrong or unusual.

When the box goes to 100% CPU, it's always the SYSTEM task consuming all the CPU.
We initially thought the issue was related to Windows Defender, but we see it both with Windows Defender enabled (with the Veeam exceptions in) and with it disabled.

Has anyone seen this behavior, or does anyone have strategies for debugging it? It's impacting our backup cycles.

Post by **Mike Resseler** » Dec 14, 2016 7:03 am

Ashley,

Based on this information, it is difficult to tell you where the problem is, I'm afraid. I do have a question about the setup. You are using ReFS 3.1, I assume (2016). Why aren't you using synthetic fulls to get the advantages of block cloning? That should make a difference at that point in time.

Also, it might be interesting to log a support call (and post the ID and resolution here afterwards).

Thanks

Mike

Post by **tsightler** » Dec 14, 2016 7:27 pm

Any chance you have formatted this ReFS volume with 4K clusters? I don't know if your issue is the same, but I have seen a similar issue in a lab test with ReFS on a 120TB volume formatted with 4K clusters. It happened multiple nights in a row, but during synthetic full operations, not active fulls. In my case the system remained unresponsive (you could no longer log in remotely), although it was still somewhat "there". I was unable to reproduce the problem after reformatting ReFS with 64K clusters. I plan to reformat back to 4K clusters and see if I can reproduce it again.

Post by **ashleyw** » Dec 15, 2016 3:49 am

Hi, originally we did have the LUN formatted with 4K clusters. We created a new LUN, formatted it with 64K clusters, and then started an active full on the 64K-cluster LUN (we have now decommissioned the 4K LUN).

Since then we did see the scenario once on the 64K LUN, but then we shifted resources around and increased the RAM on our ZFS iSCSI target to 32GB (it was 16GB previously). This seems to have dramatically improved reliability, and the processing rate has increased to consistently over 300MB/s with ample headroom. We may push it to 48GB RAM when the job chain finishes.

We are also switching primary SANs over the next week, moving from an old IBM DS SAN to a new Nimble AF3000 (all flash), but our backup target will remain the same, so we'll monitor how things go and update via this forum post. I will attempt to log a ticket when there are fewer changes going on; otherwise it will be very hard for Veeam to identify the issue, I guess.

The reason we are running an active full is that we need to make sure the files are properly aligned on the 64K clusters.
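To put the 4K versus 64K cluster choice in perspective, here is a rough sketch of how many clusters each size implies on a 60TB volume. It assumes "TB" and "K" mean binary units (TiB and KiB), and the helper name `cluster_count` is made up for illustration. It only counts clusters; it does not explain why ReFS behaves differently under load with either size.

```python
# Assumption: "TB" and "K" are treated as binary units (TiB, KiB).
TIB = 1024 ** 4
KIB = 1024


def cluster_count(volume_bytes: int, cluster_bytes: int) -> int:
    """Number of clusters needed to cover the volume (rounded up)."""
    return -(-volume_bytes // cluster_bytes)


volume = 60 * TIB
clusters_4k = cluster_count(volume, 4 * KIB)
clusters_64k = cluster_count(volume, 64 * KIB)

print(f"4K clusters:  {clusters_4k:,}")   # 16,106,127,360
print(f"64K clusters: {clusters_64k:,}")  # 1,006,632,960
print(f"ratio: {clusters_4k // clusters_64k}x")  # 16x
```

Under these assumptions the 4K layout has 16 times as many clusters for the file system to track as the 64K layout on the same volume.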

Post by **Mike Resseler** » Dec 15, 2016 6:33 am

Ashley,

Yes, the first one indeed needs to be an active full, which you can do manually, so that everything is aligned. After that, all incrementals will be written correctly and instead of doing a weekly active full again (or something like that) you can use a synthetic method to get the advantages of fast clone. It is described here: https://helpcenter.veeam.com/docs/backu ... tml?ver=95

PS: Good to read that you have found the issue. Please let us know what the results are when you increase the RAM again. I'm interested

Thanks
Mike

Post by **ashleyw** » Dec 15, 2016 7:02 am

Thanks Tom/Mike,

Well, our full backup crashed in a different way now.

This time the ReFS layer stayed up without flatlining the CPU, and the first job in the job chain got through to the end. The second job in the chain started and then eventually failed with:

ERROR: The RPC server is unavailable RPC function call failed. Function name: [GetSvcVersion]. Target Machine: [backuphost]

At that stage, the disks had been hot-added to the proxies.
The job errors out, still leaving the disks mounted on the proxies.
When I try to manually remove the disks from the proxies I get:
"Hot removal is already in progress.
Failed to remove scsi0:1."
Subsequent jobs in the chain fail with the same message.

The only way we've found to get around this is to reboot the ESXi host hosting the proxies; then we are able to remove the hot-added disks.

When I look at the REFS layer - I see that this time the CPU isn't flat lining but the VM is giving a virtual memory usage warning.
(the REFS layer is configured with 4vCPU and 12GB ram).

What might be causing this?

cheers
Ashley

Post by **Mike Resseler** » Dec 15, 2016 7:04 am

Ashley,

I think it is time to create a support call. We might come up with some other ideas, but that would be guessing (at least from my side; Tom might have seen this already, as he has tested ReFS a lot, if I am not mistaken).

But with a support call we will probably get to the problem sooner.

Sorry I can't help you further

Mike

Post by **ashleyw** » Dec 15, 2016 7:25 am

OK, thanks, fully understood. I'll log a formal support call in about 12 hours' time when I get into our office.
But there is definitely some sort of pattern here... either we are early-adopter fools (quite probably!) or there is a generic issue here waiting to burn people!

Post by **ashleyw** » Dec 15, 2016 8:51 am

Considering the layer running the ReFS repository ran out of RAM, it sounds suspiciously like: veeam-backup-replication-f2/9-5-refs-se ... 39625.html

Ours is only running 4 cores and we have the repository limit set to 4, but we were running 12GB RAM. If the recommended minimum is 4GB per core at the ReFS layer, we were short, so I've now upped the RAM on that box to 16GB and I'll try again.
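As a quick worked example of that sizing rule, the sketch below applies the "4GB per core" figure quoted above to a 4-core repository. The function names are hypothetical, and the rule itself is taken from this thread as a rule of thumb rather than as a confirmed vendor requirement.

```python
def min_repo_ram_gb(cores: int, gb_per_core: int = 4) -> int:
    """Minimum RAM under the per-core rule of thumb quoted in the thread."""
    return cores * gb_per_core


def ram_shortfall_gb(cores: int, configured_gb: int, gb_per_core: int = 4) -> int:
    """How many GB short of the rule of thumb the VM is (0 if it meets it)."""
    return max(0, min_repo_ram_gb(cores, gb_per_core) - configured_gb)


print(min_repo_ram_gb(4))        # 16
print(ram_shortfall_gb(4, 12))   # 4  (the original 12GB configuration)
print(ram_shortfall_gb(4, 16))   # 0  (after raising the VM to 16GB)
```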

(I do see now that yet another set of cumulative updates has just appeared on top of the cumulative update KB3201845 (11th December): KB3206632 (2 days ago, but I think this is mainly to fix the DHCP bug). There is nothing in there to indicate any memory-related issues that I can see.)

Post by **lando_uk** » Dec 15, 2016 11:00 am

ashleyw wrote:When I look at the REFS layer - I see that this time the CPU isn't flat lining but the VM is giving a virtual memory usage warning.
(the REFS layer is configured with 4vCPU and 12GB ram).

Just a couple of thoughts...

What is your disk queue length on the ReFS volume when you're experiencing the issues? Have a look at Resource Monitor to see what the system is actually writing to and reading from when it's pushing 200MB/s.

To get rid of virtual memory warnings, make sure your pagefile matches or is greater than your system RAM. If you're like us, our VM template has a fixed pagefile that often gets ignored when the VM is scaled up.
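A minimal sketch of that check, assuming you already know the VM's RAM and the pagefile's maximum size (both in GB). The function name and the example values are made up for illustration; nothing here queries Windows.

```python
def pagefile_ok(ram_gb: float, pagefile_max_gb: float) -> bool:
    """True when the pagefile is at least as large as system RAM."""
    return pagefile_max_gb >= ram_gb


# Hypothetical case: a template built with a fixed 4GB pagefile,
# later scaled up to 16GB of RAM without touching the pagefile.
print(pagefile_ok(ram_gb=16, pagefile_max_gb=4))    # False
print(pagefile_ok(ram_gb=16, pagefile_max_gb=16))   # True
```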

Mark

Post by **tsightler** » Dec 15, 2016 2:52 pm

As I noted in the other thread, during testing of Windows 2016 I did see that it was more willing to use memory for cache than 2012R2. In my case I never saw it crash, but I saw an identical setup of 2016 perform far worse than 2012R2 once it began to swap. Perhaps it would have crashed if I had continued. I tried the 2016 system with both ReFS and NTFS, and the problem did not appear to be ReFS specific (although it seemed slightly more pronounced there).

More memory seemed to effectively make the problem go away. The problem seemed significantly more pronounced when data was being received faster than it could be written to disk. If you watched memory in Resource Monitor, you could see the "Standby" memory slowly being chewed up. In my case it never went away completely; however, if I had pushed it more and run out of swap, perhaps it could have.

Post by **ashleyw** » Dec 16, 2016 2:34 am

Thanks guys,

All very good points.

Touch wood, we've made a few adjustments and so far things are working OK, but they always do until they break.

@lando_uk, yes, you were spot on: we had deployed our ReFS server from a template with the virtual memory settings set to automatic. I've manually upped the ReFS server's RAM to 16GB (the ReFS server itself is virtual, like everything else in our environment). I have manually defined the virtual memory to have an initial size of 3GB and a max size of 48GB. I haven't seen the swap being used for any more than the initial allocation so far. I wonder if manually defining the size has any impact on the way memory is used at the OS level. Disk queues are typically in the 10-20 zone.

@tsightler, it would be interesting if you could run a test setup in a hyper-converged way, so that you can adjust the RAM allocations of the various layers and see how to control them to prevent failures.
In our situation, we have a Supermicro with 24 data spindles. The 24 spindles are passed directly through to an OmniOS ZFS layer, which presents them via iSCSI to a Windows 2016 VM.
Current RAM utilisation (which seems to be stable, at least until the next crash):

MGMTDB: 2vCPU, 6GB ram: SQL server 2016 on top of Windows 2016 (DBs in SQL 2008 compat mode).
OmniOS: 4vCPU, 32GB ram: OmniOS ZFS iscsi SAN presenting 60TB block device.
Server-refs: 4vCPU, 16GB ram: Windows 2016, 60TB disk (drive F) attached via iscsi (multi-pathed) on a vswitch with no uplinks, drive F formatted as REFS.
vCentre: 4vCPU, 16GB ram: Windows 2016
Veeam-console: 4vCPU, 6GB ram: Windows 2016
VeeamProxy01-proxy04: 4vCPU, 8GB ram, Windows 2016

Post by **tsightler** » Dec 16, 2016 3:20 am

I actually have a conceptually similar hyperconverged setup in my lab, although my NAS is Linux/XFS/LVM/dm_cache based instead of OmniOS/ZFS. I don't think this problem can possibly be anywhere other than the Windows 2016 system, because my setup has been running solidly with Windows 2012R2 for quite some time.

I was previously able to run my 2012R2 repo with no problems even with only 8GB of RAM (which was under the best practice recommendation, but hey, it's a lab). I never actually saw a crash in my lab with 2016, but I saw performance fall significantly and the extra memory pressure was obvious, so I had to bump the RAM to make the 2016 setup perform at the same level as 2012R2. None of the other components changed or experienced any issues, and I made no changes there.

Please keep us updated on how the system behaves now that you are running with 16GB of RAM. Also, watching standby memory on the Memory tab of Resource Monitor was the "tell" in my case: once it was completely exhausted, performance would fall quickly.
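If you log standby memory readings during a job, the "tell" described above can be spotted with a few lines of code. This is only a sketch under stated assumptions: the readings are made-up numbers in MB, collected by whatever means you prefer (for example, noted from Resource Monitor), and the function name and the 256MB floor are arbitrary choices for illustration, not values from this thread.

```python
from typing import Optional, Sequence


def first_exhausted_index(standby_mb: Sequence[float],
                          floor_mb: float = 256.0) -> Optional[int]:
    """Index of the first sample at or below floor_mb, or None if never reached."""
    for i, value in enumerate(standby_mb):
        if value <= floor_mb:
            return i
    return None


# Made-up readings (MB) taken at regular intervals during a backup job.
samples = [9000, 7200, 5100, 2600, 900, 120, 40]
idx = first_exhausted_index(samples)
if idx is None:
    print("standby memory never dropped to the floor")
else:
    print(f"standby memory hit the floor at sample {idx}")
```

Lining that sample up with the job's throughput graph would show whether the performance drop follows the exhaustion of standby memory, as it did in the lab test.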

Post by **lando_uk** » Dec 16, 2016 10:37 am

ashleyw wrote: @lando_uk, yes you were spot on, we had deployed our REFS server from a template with the virtual memory settings set to be automatic. I've manually upped the REFS ram to be 16GB (the REFS server itself is virtual like everything else in our environment). I have manually defined the virtual ram to have an initial size of 3GB and a max size of 48GB. I haven't seen the swap being used for any more than the initial allocation so far. I wonder if manually defining the size has any impact on the way memory is being used at the OS level. disk queues are typically in the 10-20 zone?

Something isn't great if you're getting a disk queue of 10-20; a system waiting for disk availability will keep your CPUs busy.

I just had a look at one of my repo/proxies, and it's currently merging 5 large offsite copy jobs, pretty much maxing out the crappy RAID6 I/O: a total of 80-150MB/s, and the disk queue is maybe 2-5. RAM is currently at 20GB used out of 40GB.

Post by **ashleyw** » Dec 19, 2016 1:09 am

@tsightler, Tom, our system was 100% stable on active full backups after pushing the ReFS layer to 16GB RAM, so that appears to be the sweet spot for jobs of our size (each job covers between about 4 and 9TB of backed-up machines). (Previously we were going straight to ZFS-native CIFS, which is why we never hit this issue.) Our proxies also seem stable after dropping them back to 8GB RAM, which gives us more headroom on our existing whitebox (we'll be replacing it shortly anyway, so the next one will have more of everything).
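For a sense of scale, here is a back-of-the-envelope estimate of how long an active full of that size takes at the throughput figures mentioned earlier in the thread (150MB/s and 300MB/s). It assumes decimal units (1TB = 1,000,000MB), a constant rate, and that the whole job size is transferred with no compression or deduplication; the function name is made up for illustration.

```python
def full_backup_hours(job_tb: float, rate_mb_s: float) -> float:
    """Hours to move job_tb terabytes at a constant rate_mb_s (decimal units)."""
    seconds = job_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


for job_tb in (4, 9):
    for rate in (150, 300):
        hours = full_backup_hours(job_tb, rate)
        print(f"{job_tb}TB at {rate}MB/s: about {hours:.1f} hours")
```

At 300MB/s a 9TB job comes out at a little over 8 hours under these assumptions, and at 150MB/s it takes twice as long.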

@lando_uk, I will check out the whitebox SAN again after Xmas. I know we had an L2ARC cache device failure a while ago (the SSD waved the white flag), which is bottlenecking our reads, and I saw a read/write mix of about 40/60 during full loads. I'll replace the SSD cache device in the new year and then report back.

Currently I'm migrating the primary workload off our old IBM DS SAN onto the Nimble AF3000, so I'm going to let that complete before I do any more investigation at our backup layer (as the all-flash array will no doubt shift the existing bottlenecks in our backup process again).

I must say that we are loving the block copy REFS feature - it makes a massive difference!

thanks guys for all your help!
