9.5/ReFS/Server 2016 Memory Consumption

DaveWatkins · Dec 14, 2016 10:28 pm

Hi All

We're seeing VeeamAgent.exe crash on our repo/SAN-attached proxy using 9.5 and ReFS repositories. This particular server runs the proxy, the repositories and the tape drive, and each of its 2 repository drives was set to 16 threads.

Under 9U1 this was running fine and there were no memory issues, but after upgrading to 9.5 and Server 2016 with ReFS repositories we're seeing crashes overnight when things get busy, and these crashes are due to the processes running out of memory. The host has 16GB of memory, but I'm wondering whether an increased memory load is expected with 9.5 and ReFS due to the block cloning or other data movement speed improvements, or whether there is some memory leak in VeeamAgent that I've uncovered.

We're going to try and add more memory to this box anyway but I thought I'd see if higher memory load is expected or not.

I dropped my threads/repo down to 12 for last night's backup but still had process crashes. I've now dropped it again to 8, but given we have 60 x 4TB disks, I'm impacting performance with only 8 threads per repo.

Thanks

Dec 14, 2016 10:55 pm

Normally the minimum recommendation would be at least 4GB per core, but that's based on an assumption that there will be 1 core per task, so a box that is running 16 tasks would ideally have more than 16GB of memory in any case. Are you running per-VM backups?
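As a rough illustration of that rule of thumb, here is a minimal Python sketch. The 4GB-per-core and 1-core-per-task figures are the ones quoted in this post; the function name `recommended_ram_gb` is hypothetical and this is not an official sizing tool.

```python
# Rule of thumb quoted in the post: 4 GB of RAM per core, one core per task.
GB_PER_CORE = 4      # figure taken from the post, not an official sizing table
CORES_PER_TASK = 1   # assumption stated in the post


def recommended_ram_gb(concurrent_tasks: int) -> int:
    """Return the RAM in GB that the rule of thumb suggests for a task count."""
    return concurrent_tasks * CORES_PER_TASK * GB_PER_CORE


# Task counts mentioned in the thread: 8, 12 and 16 threads per repository.
for tasks in (8, 12, 16):
    print(f"{tasks} tasks -> {recommended_ram_gb(tasks)} GB")
```

By this estimate, 16 concurrent tasks already call for well above the 16GB installed in the original poster's host.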

I've also seen higher memory pressure with Windows 2016 in my lab testing; however, I have not seen it crash. I tested with both NTFS and ReFS and it did not, at the time, appear to be related to ReFS specifically, but rather to Windows 2016 memory management, as it seemed much more willing to steal memory from processes to use as cache when compared to 2012R2, so perhaps this is adding to the problem. Have you applied all cumulative updates for Windows 2016?

BTW, when you say crash, do you mean the entire server crashes, or is it killing off processes?

DaveWatkins · Dec 14, 2016 11:02 pm

Hi Tom

It's just killing off the processes, so we'll have a VM or 2 fail while the rest of a job succeeds. If it's normal and you've seen as much, I'll chase down dropping more memory in the box. We should have a decommissioned machine we can grab memory from that will work, so I don't expect it'll even cost us anything to do.

I'm pretty sure the box is up to date but I'll double check that too

Thanks

graham8 · Jan 17, 2017 4:22 pm

We're seeing major issues in this regard also. We were having out-of-memory issues where the system (multiple 2016 servers) would go into a non-responsive state and have to be hard-powered off. In particular, the memory would balloon up to 100% and kill the system when the "Microsoft->Windows->Data Integrity->Data Integrity Scan for Crash Recovery" task in task scheduler would auto-trigger following an unclean shutdown.

The latest round of MS updates seems to have updated the ReFS driver version and we haven't had it crash again when going through a scrub, but we *are* seeing very bizarre memory issues. For example... I just deleted ~10TB of old Veeam backup data. Memory utilization (out of 32GB) climbed from practically nothing being used to 98% of memory being used. Free space (as explorer reports) slowly, slowly climbed over an hour or two. The disks were staying busy the entire time as well. After that finished, and the free space all showed up and the disks went idle, memory continued to remain at 76% utilized, and when I check sysinternal's RAMMAP, it reports 21GB of "active" Metafile allocation. IE: ReFS. I also confirmed this with poolmon and tracking down the driver tags.

It's possible that this memory will be yielded when something else needs it, but in that case it should be reported as "Standby" in RAMMAP or "Cached" in Task Manager.

We also had unrepairable integrity failures in our Veeam backup chain after extending Storage Spaces and the underlying filesystem following addition of new disks. It's a mirrored arrangement. None of the disks are reporting errors. The odds that there were unreported disk errors in the same location on two separate disks is infinitesimal.

On a *separate* 2016 system than the one that reported integrity errors following its expansion, we ran into a bug in which the ReFS filesystem was expanded, but explorer couldn't see the additional space. Adding more disks and expanding further didn't fix the problem. And ReFS, unlike NTFS, doesn't support shrinking. We had to create a separate vdisk, sync non-veeam data (because the veeam data is block-aligned on refs), destroy the "bad" vdisk, create a new one, confirm it expands successfully, sync data back, and re-copy veeam data from the primary backup server.

In short... ReFS + Storage Spaces is a hot mess. It has some advantages, and I'm not saying people shouldn't use it - but be very cautious. None of this is Veeam's fault, but be aware that even today, years after it was released for the first time, this is still very much alpha/beta-quality stuff from Microsoft. I remember when Storage Spaces was first launched - it would literally *crash* when a filesystem reached 100% disk utilization and would be unmountable to even resolve the problem.

I'm not sure how much any of these issues relate to the block clone API, which is quite new, or are just general ReFS+Storage Spaces issues, but... everyone needs to be careful here. When you are adding disks and extending Storage Spaces filesystems, do it *one server at a time* and then run a full Data Integrity scrub (see the Task Scheduler path above) and wait the day(s) it takes for it to finish before moving on to the next server. That way, if you need to do something crazy like nuke a vdisk to fix a MS glitch, you won't find (like I did) that you suddenly have bizarre integrity failures in the Veeam backups on your *main* backup repository. In the end, I got everything fixed and new backup chains repopulated, but it was nerve-wracking there for a few hours.

Also - if anyone else has the unclean-shutdown-scrub-of-death thing going on like I did ... I didn't know about the task at the time. If stopping it works, great. Do that quickly before it kills the system, and then run updates to get the latest version of the ReFS driver that seems to fix that. If you can't stop the process, you can do what I did - kill the system (it won't shut down properly), and then pull every disk from the server except the OS disk. Then, run updates, shut down, and reinsert the disks.

suprnova · Jan 25, 2017 4:46 pm

I am also having a similar issue. I ensured the data integrity scan for crash recovery task was not running and I disabled it for good measure. However, I boot the server and within minutes the CPU and memory usage goes to 100% and the server completely freezes. If I boot it without the 13TB repository, the server is completely fine. I have also tried both a 2016 Core and a desktop experience VM in case it was OS related. This repository was built on 12/14/16 and the issue started 1/25/17.

graham8 · Jan 25, 2017 4:49 pm

The solution for us was shutting down, pulling all the storage array drives, applying all the latest Server 2016 updates, and then shutting down and reinserting the disks. Then the crash recovery task was able to finish (over a long, long time) without freezing the system. It still uses tons of memory, but it seemed to give that memory up when there was demand for it after the latest updates were applied.

suprnova · Jan 25, 2017 6:31 pm

Interesting...unfortunately I am up to date.

graham8 · Jan 25, 2017 7:36 pm

I was furiously trying to resolve the issue when this was happening, and among many things, I disabled all Veeam-related services. I believe that it was crashing even after disabling all services, but I might have left them disabled after updates were applied and while the crash integrity scan ran...

I did confirm that the refsv1.sys driver version was incremented following those updates, and I had confirmed via poolmon that the thing consuming all the memory was indeed that driver, so I thought it was fixed entirely by the update. Maybe the services being disabled was also important, though...who knows? With ReFS Block Clone being quite new, it's hard to say. You might try disabling all Veeam/etc services.

Also, what's the version of your refsv1.sys driver file? I can compare it to the one on our 2016 servers.

Oh, and also, how much ram does your server have?

suprnova · Jan 25, 2017 7:45 pm

Currently has 24GB of RAM, but it started out with 12GB.

I actually have two ReFS repos, and one works fine (2TB allocated but 100GB in use). The larger one seems to cause this issue. I did try disabling Veeam and VMware Tools services, but it had zero effect.

I'm going to let the server sit for a few days, I am able to click on something once every 3 minutes or so, maybe it is actually doing something. I had to move my daily jobs to another repo.

I am currently seeing 100% CPU usage and 41% memory usage (seems to get to 9.9GB and then the freeze happens). I'll check the refsv1.sys once I lose patience and give it a reset again.

Thanks!

graham8 · Jan 25, 2017 7:56 pm

Ours have 32GB ram, and 64TB raw space, 32TB available with ~10TB in use. Sequential disk IO for it was around ~500MB/s.

Before the updates, memory would climb to 100% and "freeze" (system becomes entirely unresponsive). After the updates, it again would rapidly balloon up, but the memory would stop at around 98% and it would dip up and down slightly...presumably yielding to other memory demands in the system. It still took a very long time to finish (forget exactly...12-24 hours as I recall). In our case, it was the primary Veeam repo and it had also happened to the offsite Veeam repo.

We also have a similarly-configured hyper-v server storing some huge VM VHDXs. Not sure if that one ever crashed so as to trigger the crash recovery, but it never happened on that server.

...this whole experience has given me quite an aversion to ReFS/StorageSpaces. I know both had their fair share of bugs years ago, but I thought most of that was behind us.

tsightler · Jan 25, 2017 8:58 pm

Is everyone in this thread using 4K or 64K cluster size? We've had a significant number of problems reported from customers using 4K cluster size, but most seem to be stable when using 64K clusters so that would be a very interesting data point.

graham8 · Jan 25, 2017 9:07 pm

I went with defaults when creating the VDisk, aside from customizing the number of data columns/etc. How can I check the cluster size? I did $(Get-VirtualDisk | fl) but there's no mention of cluster. There's allocation unit size, but that's set to 1073741824 (~1GB?), and then a logical sector size of 512 with a physical sector size of 4096.

I'm seeing a lot of talk online that 64KB is the only cluster size option for ReFS.

Jan 25, 2017 9:54 pm

graham8 wrote:I'm seeing a lot of talk online that 64KB is the only cluster size option for ReFS.

64KB was the only cluster size with ReFS v1 that was part of Windows 2012/2012R2, but with Windows 2016 there is now ReFS v3 (really v3.1). In many ways calling it ReFS v3 is a misnomer, in my opinion, as it's quite different in many ways, which is why it's not possible to simply upgrade from ReFS v1 to v3.

With Windows 2016 and ReFS 3.1 the default format options use 4K clusters in every case I've tested. You can check cluster size of an existing volume with fsutil like so:

Code:

C:\>fsutil fsinfo refsInfo E:
REFS Volume Serial Number :       0x7cba64e8ba649ffe
REFS Version   :                  3.1
Number Sectors :                  0x00000000577a0000
Total Clusters :                  0x000000000aef4000
Free Clusters  :                  0x00000000059a04c8
Total Reserved :                  0x00000000000bb7b4
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Checksum Type:                    CHECKSUM_TYPE_NONE

Specifically you are looking at the Bytes Per Cluster value which, in the output above, is 4096 (4K). To this point every case I've seen with problems has been with 4K clusters, while 64K has seemed to be stable. More details are available from our very own Luca Dell'Oca on his blog here:

http://www.virtualtothecore.com/en/refs ... kb-or-4kb/
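For anyone who wants to check several volumes, a minimal Python sketch of reading the Bytes Per Cluster value out of that fsutil output follows. The sample text is trimmed from the output shown in this post; the helper names `bytes_per_cluster` and `describe` are hypothetical, and the parsing assumes the label appears exactly as printed above.

```python
import re

# Trimmed copy of the fsutil output shown in this post.
SAMPLE = """\
REFS Version   :                  3.1
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
"""


def bytes_per_cluster(fsutil_output: str) -> int:
    """Extract the 'Bytes Per Cluster' value from fsutil refsInfo text."""
    match = re.search(
        r"^Bytes Per Cluster\s*:\s*(\d+)\s*$", fsutil_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Bytes Per Cluster' line found")
    return int(match.group(1))


def describe(cluster_bytes: int) -> str:
    """Render a cluster size in the 4K / 64K shorthand used in this thread."""
    return f"{cluster_bytes // 1024}K"


size = bytes_per_cluster(SAMPLE)
print(f"Cluster size: {describe(size)}")
if size != 65536:
    print("Volume is not formatted with 64K clusters")
```

On a live server the input text would come from capturing the output of `fsutil fsinfo refsInfo E:` for the volume in question, for example with `subprocess.run`, instead of the hard-coded sample.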

DaveWatkins · Jan 25, 2017 10:11 pm

Ironically, as the one that started the thread, we only had problems when I had my 4k ReFS test drive. All my drives are now 64k and my issues have disappeared

graham8 · Jan 26, 2017 5:54 pm

Thanks Tom. That was an interesting article. All our 2016 ReFS volumes are, unsurprisingly, 4k.

Unfortunately, we're pretty much SOL on this count since wiping the volumes and recreating is incredibly impractical at this point. I hate to use the word "hope" and "Microsoft" in the same sentence, but I guess all we can do is hope that they resolve these issues that seem to be relating to poor scaling with 4k clusters.

dellock6 · Jan 31, 2017 6:18 am

I'm sad to hear about these issues, but at least I'm happy that you are all confirming the issues are coming from volumes formatted with 4K clusters, and no issue has been reported with 64K clusters. The main issue at this point is that the default block size is 4KB, so many would just read about our new integration and go straight to formatting a volume with ReFS, and then risk ending up with these issues. Hopefully the news about our suggestion to go 64KB will spread more and more.

christiankelly · Feb 01, 2017 8:34 am

If it's this important couldn't you do a check when adding a ReFS repo and warn to format with 64k?

dellock6 · Feb 01, 2017 9:15 am

Christian, I was almost about to post the same idea...

DaveWatkins · Feb 01, 2017 7:02 pm

The issues will (presumably) get fixed by Microsoft, although in saying that 2016 has been out for some time now and they aren't fixed yet, so it may still be some time

Gostev · Feb 01, 2017 9:14 pm

2016 is still really, really fresh - those few months it's been out is nothing in Windows terms. It's a huge and complex piece of software, so Microsoft needs time to prioritize and address all major issues gradually (by the way, opening support cases is one thing that does really help to raise priority of addressing the particular issue - at least in Veeam).

But generally speaking, there's nothing unexpected here - there's a reason why most companies practice the good old "no upgrade until SP1" rule with new major releases of any software at all, Veeam included. Early adopters should always be prepared to run into those harder to find bugs that have slipped by QC.

christiankelly wrote:If it's this important couldn't you do a check when adding a ReFS repo and warn to format with 64k?

We have this penciled in for Update 2. This is a simple change, so we won't add it immediately, but will rather keep monitoring the feedback for the next couple of months and make a more educated decision on whether we should make this change closer to the actual update release.

The whole memory issue does look pretty simple at first sight (there's an apparent lack of system resource consumption management for some ReFS maintenance process), and hopefully should be easy for Microsoft to fix.

mkretzer · Feb 01, 2017 10:14 pm

@Gostev: But is it really an issue of not enough RAM? On our system RAM was never over 60 - 70 % and still the system crashed badly. Is there a "hard limit" which cannot be overcome even if there is more RAM available?

DaveWatkins · Feb 01, 2017 10:51 pm

Gostev wrote: But generally speaking, there's nothing unexpected here - there's a reason why most companies practice the good old "no upgrade until SP1" rule with new major releases of any software at all, Veeam included. Early adopters should always be prepared to run into those harder to find bugs that have slipped by QC.

The slight wrinkle there is of course there will never be an SP1 for 2016, nor any service pack. Cumulative updates help when updating a new install, so that is at least no longer a problem, but picking when to start the migration isn't quite as easy as SP1 anymore.

Gostev · Feb 01, 2017 11:24 pm

@mkretzer not necessarily, can be some other system resource like handles or something (or just a deadlock on some shared resource). What I am saying is that these sort of massive issues are usually very easy to reproduce, troubleshoot and fix (unlike intermittent issues, which are real evil).

@Dave correct, but nevertheless there will still be "feature updates" at roughly the same cadence as service packs were previously.

rgarvelink · Feb 02, 2017 2:31 pm

I don't want to throw this thread into madness, but shouldn't we be careful before we immediately state that 64k is the only recommendation for ReFS?

The recommendation from Microsoft is still 4k for the majority of workloads: https://blogs.technet.microsoft.com/fil ... -and-ntfs/
Granted, they do state that, "64k clusters are applicable when working with large, sequential IO, but otherwise, 4K should be the default cluster size." Likely Veeam falls within that 64k recommendation, but why then was the initial recommendation from Veeam to utilize 4k unless the volume was large, 100TB I believe. IIRC, Gostev pointed that out in his video here: https://www.youtube.com/watch?v=V3vrsonuLE8&t=1841s

Due to the large IO size from Veeam we're sacrificing 5 - 10% of space for a problem that potentially could be resolved by sizing the repository server properly. As stated in this thread, 4GB of memory per core is the recommendation, so wouldn't OP need 64GB just for the Veeam operations, assuming he's at 16 threads and is hitting the recommendation of 1 core for every thread? We know that ReFS prioritizes data availability over everything else and it appears to do so via memory consumption. We might just need to take that into consideration when sizing repositories.

https://technet.microsoft.com/en-us/lib ... s.11).aspx

Availability. ReFS prioritizes the availability of data. Historically, file systems were often susceptible to data corruption that would require the system to be taken offline for repair. With ReFS, if corruption occurs, the repair process is both localized to the area of corruption and performed online, requiring no volume downtime. Although rare, if a volume does become corrupted or you choose not to use it with a mirror space or a parity space, ReFS implements salvage, a feature that removes the corrupt data from the namespace on a live volume and ensures that good data is not adversely affected by nonrepairable corrupt data. Because ReFS performs all repair operations online, it does not have an offline chkdsk command.

Proactive Error Correction. The integrity capabilities of ReFS are leveraged by a data integrity scanner, which is also known as a scrubber. The integrity scanner periodically scans the volume, identifying latent corruptions and proactively triggering a repair of that corrupt data.

graham8 · Feb 02, 2017 3:27 pm

rgarvelink wrote: As stated in this thread, 4GB of memory per core is the recommendation, so wouldn't OP need 64GB just for the Veeam operations, assuming he's at 16 threads and is hitting the recommendation of 1 core for every thread? We know that ReFS prioritizes data availability over everything else and it appears to do so via memory consumption. We might just need to take that into consideration when sizing repositories.

Nothing should ever result in crashes due to memory availability. Performance should suffer, but a system crash should never occur. If a system crashes, it's bad programming that was designed with poor assumptions that didn't take into account scaling considerations, failure to properly clean up memory allocation (leaks), etc. In our case, we had double 4GB per core, with Veeam completely disabled even, and refs integrity scans alone literally crashed the system repeatedly due to memory overconsumption. This appears to have been fixed with recent updates for us, but because these are production servers, I can't exactly test it repeatedly. I've been carefully monitoring ReFS driver memory consumption (via sysinternal's rammap) and I'm seeing it eat up huge amounts of memory during anything that hits a large segment of integrity-enabled data. That's fine and perfectly understandable and desirable if the memory it's filling is unused, but my confidence level is now low that it's always going to do a good job yielding to other memory demands gracefully, since it didn't once already (and I've had other refs problems as well...expansion/etc).

I do agree that 4k shouldn't be a *stability* problem if the refs code is good, and it appears that MS did fix some memory allocation issues in the refs driver code with post-2016 RTM updates. Here's hoping. To be fair, we're talking about a small sample size of people, and the 4k cluster size is new. I agree that it's too early to put a nail in the coffin on its recommendation - but I think the early feedback is pertinent for anyone trying to make a decision on maximizing stability for a production system at this time.

mkretzer · Feb 02, 2017 4:09 pm

rgarvelink wrote:I don't want to throw this thread into madness, but shouldn't we be careful before we immediately state that 64k is the only recommendation for ReFS?

Please read our REFS 4 k horror story: veeam-backup-replication-f2/refs-4k-hor ... 40629.html

We have 128 GB of RAM and 16 processor cores. We never had more than 80 GB consumption. The server still crashed 3 times yesterday night. Now with NTFS we never had any crash.

This is not fixable with resources!

lando_uk · Feb 03, 2017 5:41 pm

For those experiencing issues, are they all ReFS/Storage Spaces systems? What about ReFS/RAID10 using a RAID controller and skipping shoddy Storage Spaces?

suprnova · Feb 03, 2017 7:01 pm

I do not use Storage Spaces and had the issue.

tsightler · Feb 03, 2017 7:44 pm

Definitely happens without storage spaces.

mk2311 · Feb 06, 2017 9:10 am

Well.....

We don't use REFS, but having upgraded to Veeam 9.5 we have had lots of out of memory issues on the Veeam B&R server. These are the result of the VeeamAgent.exe module using memory for every proxy task we had.

So, we had a lot of proxies, and the proxy maximum concurrent tasks value was set to 16 and in some cases 24. In Veeam 9.0, this was never a problem. In 9.5, we started having memory issues

Ticket opened, lots of logs sent for analysis over a 3 week period

Our Veeam B&R server had 16gb ram and 8 CPU's

We had to double to 32gb ram and 10 CPU's. Still had issues. So had to increase to 40gb ram and 16 CPU's

So, at a particular time of the evening, we start around 20 jobs, each with many VMs. Many proxies will be used and we can see from the Veeam Resource log that many of the proxies used the full 16 tasks. A veeamagent.exe process runs for each task, so we had about 18 proxies running, each using up to 16 threads, and we could see around 120-150 veeamagent processes running at any one time. Each one uses circa 250MB, but initially spikes at over 500MB for a few seconds - not all at the same time. Later in the evening we run 26 backup jobs and when complete, we run the last backups, around 22 of them.

So with other system related memory usage, it was killing the Veeam server
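To put numbers on that, here is a minimal Python sketch of the arithmetic, using only the figures from this post (roughly 250MB per agent process, a brief spike to about 500MB, 120-150 processes). The function name `agent_memory_gb` and the choice of 1GB = 1024MB are assumptions for illustration.

```python
STEADY_MB = 250   # approximate steady-state use per veeamagent process (from the post)
SPIKE_MB = 500    # approximate short start-up spike per process (from the post)


def agent_memory_gb(processes: int, spiking: int = 0) -> float:
    """Estimate total agent memory in GB, with `spiking` processes at peak."""
    steady = (processes - spiking) * STEADY_MB
    spike = spiking * SPIKE_MB
    return (steady + spike) / 1024


# Process counts observed in the post.
for count in (120, 150):
    print(f"{count} agents: about {agent_memory_gb(count):.1f} GB steady state")
```

Even before counting the spikes or other system usage, this estimate lands around 29 to 37 GB, which is consistent with 16GB and 32GB both proving too small.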

We were advised by Veeam to reduce the number of tasks to 8, which we have done, and we have not had the memory issues since. We can see now that it appears to peak at about 34gb memory.

Since upgrading to Update 1, still no problems.

May, or may not, be of help?
