9.5/ReFS/Server 2016 Memory Consumption


by DaveWatkins » Wed Dec 14, 2016 10:28 pm

Hi All

We're seeing VeeamAgent.exe crash on our repo/SAN-attached proxy using 9.5 and ReFS repositories. This particular server runs the proxy, repository and tape drive roles, and each of its 2 repository drives was set to 16 threads.

Under 9U1 this ran fine with no memory issues, but after upgrading to 9.5 and Server 2016 with ReFS repositories we're seeing crashes overnight when things get busy, and these crashes are due to the processes running out of memory. The host has 16GB of memory. I'm wondering whether an increased memory load is expected with 9.5 and ReFS due to block cloning or other data movement speed improvements, or whether I've uncovered a memory leak in VeeamAgent.

We're going to try and add more memory to this box anyway but I thought I'd see if higher memory load is expected or not.

I dropped my threads/repo down to 12 for last night's backup but still had process crashes. I've now dropped it again to 8, but given we have 60 x 4TB disks, I'm impacting performance with only 8 threads per repo.

Thanks
DaveWatkins
Expert
 
Posts: 230
Liked: 59 times
Joined: Sun Dec 13, 2015 11:33 pm

Re: 9.5/ReFS/Server 2016 Memory Consumption

by tsightler » Wed Dec 14, 2016 10:55 pm 1 person likes this post

Normally the minimum recommendation would be at least 4GB per core, but that's based on an assumption that there will be 1 core per task, so a box that's running 16 tasks would ideally have more than 16GB of memory in any case. Are you running per-VM backups?
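The rule of thumb above can be turned into a quick back-of-the-envelope check. This is only a sketch of the arithmetic as stated in this post (4GB per core, one core per task), not an official sizing formula. The function name and its defaults are hypothetical and chosen purely for illustration.

```python
def recommended_memory_gb(concurrent_tasks, gb_per_core=4, cores_per_task=1):
    """Estimate RAM (in GB) from the rule of thumb quoted in the post.

    Assumptions (taken from the post, not from official documentation):
    one core per concurrent task and at least 4GB of memory per core.
    """
    return concurrent_tasks * cores_per_task * gb_per_core


# Task counts mentioned in the thread: 16, then 12, then 8 per repository.
for tasks in (8, 12, 16):
    print(f"{tasks} tasks -> at least {recommended_memory_gb(tasks)}GB")
```

Taken at face value, even the reduced setting of 8 tasks would call for more than the 16GB installed in the host, which fits the advice that this box is undersized regardless of any Windows 2016 behaviour.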

I've also seen higher memory pressure with Windows 2016 in my lab testing; however, I have not seen it crash. I tested with both NTFS and ReFS and it did not, at the time, appear to be related to ReFS specifically, but rather to Windows 2016 memory management, as it seemed much more willing to steal memory from processes to use as cache when compared to 2012R2, so perhaps this is adding to the problem. Have you applied all cumulative updates for Windows 2016?

BTW, when you say crash, do you mean the entire server crashes, or is it killing off processes?
tsightler
Veeam Software
 
Posts: 4660
Liked: 1680 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: 9.5/ReFS/Server 2016 Memory Consumption

by DaveWatkins » Wed Dec 14, 2016 11:02 pm

Hi Tom

It's just killing off the processes, so we'll have a VM or 2 fail while the rest of a job succeeds. If it's normal and you've seen as much, I'll chase down putting more memory in the box. We should have a decommissioned machine we can grab compatible memory from, so I don't expect it'll even cost us anything to do.

I'm pretty sure the box is up to date but I'll double check that too.

Thanks

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Tue Jan 17, 2017 4:22 pm

We're seeing major issues in this regard also. We were having out-of-memory issues where the system (multiple 2016 servers) would go into a non-responsive state and have to be hard-powered off. In particular, the memory would balloon up to 100% and kill the system when the "Microsoft->Windows->Data Integrity->Data Integrity Scan for Crash Recovery" task in task scheduler would auto-trigger following an unclean shutdown.

The latest round of MS updates seems to have updated the ReFS driver version and we haven't had it crash again when going through a scrub, but we *are* seeing very bizarre memory issues. For example... I just deleted ~10TB of old Veeam backup data. Memory utilization (out of 32GB) climbed from practically nothing being used to 98% of memory being used. Free space (as Explorer reports) slowly, slowly climbed over an hour or two. The disks were staying busy the entire time as well. After that finished, and the free space all showed up and the disks went idle, memory continued to remain at 76% utilized, and when I check Sysinternals' RAMMap, it reports 21GB of "active" Metafile allocation, i.e. ReFS. I also confirmed this with poolmon and tracking down the driver tags.

It's possible that this memory will be yielded when something else needs it, but in that case it should be reported as "Standby" in RAMMap or "Cached" in Task Manager.

We also had unrepairable integrity failures in our Veeam backup chain after extending Storage Spaces and the underlying filesystem following addition of new disks. It's a mirrored arrangement. None of the disks are reporting errors. The odds that there were unreported disk errors in the same location on two separate disks are infinitesimal.

On a *separate* 2016 system from the one that reported integrity errors following its expansion, we ran into a bug in which the ReFS filesystem was expanded, but Explorer couldn't see the additional space. Adding more disks and expanding further didn't fix the problem. And ReFS, unlike NTFS, doesn't support shrinking. We had to create a separate vdisk, sync non-Veeam data (because the Veeam data is block-aligned on ReFS), destroy the "bad" vdisk, create a new one, confirm it expands successfully, sync data back, and re-copy Veeam data from the primary backup server.

In short... ReFS + Storage Spaces is a hot mess. It has some advantages, and I'm not saying people shouldn't use it - but be very cautious. None of this is Veeam's fault, but be aware that even today, years after it was released for the first time, this is still very much alpha/beta-quality stuff from Microsoft. I remember when Storage Spaces was first launched - it would literally *crash* when a filesystem reached 100% disk utilization and would be unmountable to even resolve the problem.

I'm not sure how much any of these issues relate to the block clone API, which is quite new, or are just general ReFS+Storage Spaces issues, but... everyone needs to be careful here. When you are adding disks and extending Storage Spaces filesystems, do it *one server at a time* and then run a full Data Integrity scrub (see the Task Scheduler path above) and wait the day(s) it takes for it to finish before moving on to the next server. That way, if you need to do something crazy like nuke a vdisk to fix an MS glitch, you won't find (like I did) that you suddenly have bizarre integrity failures in the Veeam backups on your *main* backup repository. In the end, I got everything fixed and new backup chains repopulated, but it was nerve-wracking there for a few hours.

Also - if anyone else has the unclean-shutdown-scrub-of-death thing going on like I did ... I didn't know about the task at the time. If stopping it works, great. Do that quickly before it kills the system, and then run updates to get the latest version of the ReFS driver that seems to fix that. If you can't stop the process, you can do what I did - kill the system (it won't shut down properly), and then pull every disk from the server except the OS disk. Then, run updates, shut down, and reinsert the disks.
graham8
Enthusiast
 
Posts: 47
Liked: 16 times
Joined: Wed Dec 14, 2016 1:56 pm

Re: 9.5/ReFS/Server 2016 Memory Consumption

by suprnova » Wed Jan 25, 2017 4:46 pm

I am also having a similar issue. I ensured the data integrity scan for crash recovery task was not running and I disabled it for good measure. However, I boot the server and within minutes the CPU and memory usage goes to 100% and the server completely freezes. If I boot it without the 13TB repository, the server is completely fine. I have also tried both a 2016 Core and a Desktop Experience VM in case it was OS related. This repository was built on 12/14/16 and the issue started 1/25/17.
suprnova
Service Provider
 
Posts: 8
Liked: never
Joined: Fri Apr 08, 2016 5:15 pm

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Wed Jan 25, 2017 4:49 pm 1 person likes this post

The solution for us was shutting down, pulling all the storage array drives, applying all the latest Server 2016 updates, and then shutting down and reinserting the disks. Then the crash recovery task was able to finish (over a long, long time) without freezing the system. It still uses tons of memory, but it seemed to give that memory up when there was demand for it after the latest updates were applied.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by suprnova » Wed Jan 25, 2017 6:31 pm

Interesting...unfortunately I am up to date.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Wed Jan 25, 2017 7:36 pm

I was furiously trying to resolve the issue when this was happening, and among many things, I disabled all Veeam-related services. I believe that it was crashing even after disabling all services, but I might have left them disabled after updates were applied and while the crash integrity scan ran...

I did confirm that the refsv1.sys driver version was incremented following those updates, and I had confirmed via poolmon that the thing consuming all the memory was indeed that driver, so I thought it was fixed entirely by the update. Maybe the services being disabled was also important, though...who knows? With ReFS Block Clone being quite new, it's hard to say. You might try disabling all Veeam/etc services.

Also, what's the version of your refsv1.sys driver file? I can compare it to the one on our 2016 servers.

Oh, and also, how much RAM does your server have?

Re: 9.5/ReFS/Server 2016 Memory Consumption

by suprnova » Wed Jan 25, 2017 7:45 pm

Currently has 24GB of RAM, but it started out with 12GB.

I actually have two ReFS repos, and one works fine (2TB allocated but 100GB in use). The larger one seems to cause this issue. I did try disabling Veeam and VMware Tools services, but it had zero effect.

I'm going to let the server sit for a few days. I am able to click on something once every 3 minutes or so, so maybe it is actually doing something. I had to move my daily jobs to another repo.

I am currently seeing 100% CPU usage and 41% memory usage (it seems to get to 9.9GB and then the freeze happens). I'll check the refsv1.sys version once I lose patience and give it a reset again.

Thanks!

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Wed Jan 25, 2017 7:56 pm 1 person likes this post

Ours have 32GB RAM, and 64TB raw space, 32TB available with ~10TB in use. Sequential disk IO for it was around 500MB/s.

Before the updates, memory would climb to 100% and "freeze" (system becomes entirely unresponsive). After the updates, it again would rapidly balloon up, but the memory would stop at around 98% and it would dip up and down slightly...presumably yielding to other memory demands in the system. It still took a very long time to finish (I forget exactly...12-24 hours as I recall). In our case, it was the primary Veeam repo and it had also happened to the offsite Veeam repo.

We also have a similarly-configured Hyper-V server storing some huge VM VHDXs. Not sure if that one ever crashed so as to trigger the crash recovery, but it never happened on that server.

...this whole experience has given me quite an aversion to ReFS/StorageSpaces. I know both had their fair share of bugs years ago, but I thought most of that was behind us.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by tsightler » Wed Jan 25, 2017 8:58 pm

Is everyone in this thread using 4K or 64K cluster size? We've had a significant number of problems reported from customers using 4K cluster size, but most seem to be stable when using 64K clusters so that would be a very interesting data point.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Wed Jan 25, 2017 9:07 pm

I went with defaults when creating the VDisk, aside from customizing the number of data columns/etc. How can I check the cluster size? I did $(Get-VirtualDisk | fl) but there's no mention of cluster. There's allocation unit size, but that's set to 1073741824 (~1GB?), and then a logical sector size of 512 with a physical sector size of 4096.

I'm seeing a lot of talk online that 64KB is the only cluster size option for ReFS.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by tsightler » Wed Jan 25, 2017 9:54 pm 1 person likes this post

graham8 wrote:I'm seeing a lot of talk online that 64KB is the only cluster size option for ReFS.

64KB was the only cluster size with ReFS v1 that was part of Windows 2012/2012R2, but with Windows 2016 there is now ReFS v3 (really v3.1). In many ways calling it ReFS v3 is a misnomer, in my opinion, as it's quite different in many ways, which is why it's not possible to simply upgrade from ReFS v1 to v3.

With Windows 2016 and ReFS 3.1 the default format options use 4K clusters in every case I've tested. You can check cluster size of an existing volume with fsutil like so:

Code:
C:\>fsutil fsinfo refsInfo E:
REFS Volume Serial Number :       0x7cba64e8ba649ffe
REFS Version   :                  3.1
Number Sectors :                  0x00000000577a0000
Total Clusters :                  0x000000000aef4000
Free Clusters  :                  0x00000000059a04c8
Total Reserved :                  0x00000000000bb7b4
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Checksum Type:                    CHECKSUM_TYPE_NONE

Specifically, you are looking at the Bytes Per Cluster value which, in the output above, is 4096 (4K). To this point every case I've seen with problems has been with 4K clusters, while 64K has seemed to be stable. More details are available from our very own Luca Dell'Oca on his blog here:

http://www.virtualtothecore.com/en/refs ... kb-or-4kb/
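For anyone checking several volumes, the fsutil output shown above is easy to parse. What follows is a minimal sketch, assuming the output has the same "Key : value" layout as the sample in this post. The sample text is copied from that output, and the helper name parse_refs_info is hypothetical. It pulls out the cluster size and derives the volume size from Total Clusters x Bytes Per Cluster.

```python
import re

# Sample copied from the fsutil output earlier in this post.
SAMPLE = """\
REFS Volume Serial Number :       0x7cba64e8ba649ffe
REFS Version   :                  3.1
Number Sectors :                  0x00000000577a0000
Total Clusters :                  0x000000000aef4000
Free Clusters  :                  0x00000000059a04c8
Total Reserved :                  0x00000000000bb7b4
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Checksum Type:                    CHECKSUM_TYPE_NONE
"""


def parse_refs_info(text):
    """Turn 'Key : value' lines into a dict; hex and decimal values become ints."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a key/value separator
        key = " ".join(key.split())  # normalise the padding inside the key
        value = value.strip()
        if re.fullmatch(r"0x[0-9a-fA-F]+", value):
            info[key] = int(value, 16)
        elif value.isdigit():
            info[key] = int(value)
        else:
            info[key] = value  # e.g. the version string or checksum type
    return info


info = parse_refs_info(SAMPLE)
cluster_kib = info["Bytes Per Cluster"] // 1024
volume_gib = info["Total Clusters"] * info["Bytes Per Cluster"] / 2**30

print(f"Cluster size: {cluster_kib}K")
print(f"Volume size:  {volume_gib:.1f} GiB")
if cluster_kib == 4:
    print("This volume uses 4K clusters, the size associated with the problems in this thread.")
```

On a live system the same text could be captured by running the fsutil command from the post and passing its output to the parser; that step is left out here so the sketch runs anywhere.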

Re: 9.5/ReFS/Server 2016 Memory Consumption

by DaveWatkins » Wed Jan 25, 2017 10:11 pm 1 person likes this post

Ironically, as the one who started the thread, we only had problems when I had my 4k ReFS test drive. All my drives are now 64k and my issues have disappeared.

Re: 9.5/ReFS/Server 2016 Memory Consumption

by graham8 » Thu Jan 26, 2017 5:54 pm 1 person likes this post

Thanks Tom. That was an interesting article. All our 2016 ReFS volumes are, unsurprisingly, 4k.

Unfortunately, we're pretty much SOL on this count since wiping the volumes and recreating them is incredibly impractical at this point. I hate to use the words "hope" and "Microsoft" in the same sentence, but I guess all we can do is hope that they resolve these issues, which seem to be related to poor scaling with 4k clusters.
