Comprehensive data protection for all workloads
xudaiqing
Influencer
Posts: 21
Liked: 4 times
Joined: Apr 14, 2017 5:25 pm
Full Name: xudaiqing
Contact:

Re: REFS 4k horror story

Post by xudaiqing »

alesovodvojce wrote:After a week of tests, and after countless ReFS horror days personally lived through here, we copied our ReFS 4k repo to different filesystems to see the size differences. Here it is.

In actual numbers
ReFS 4k: 21 TB (source repo)
ReFS 64k: 31 TB at least - we had to stop the file copy as the underlying disks ran out of free space
NTFS: 31 TB at least - same reason to stop. We then shrank the source repo to 13 TB by deleting files; copying that to the target NTFS partition produced 24 TB (so 13 TB on ReFS 4k became 24 TB on NTFS).

Generalized
ReFS 4k - best space saver, but lots of trouble (as in this thread)
ReFS 64k - no win in space saving, while still keeping many ReFS benefits; the troubles will theoretically start here as well, just postponed until later (when the repo size grows past some unstated limit)
NTFS - no win in space saving, no special benefits. The main benefit is a stable filesystem = backups secured

We migrated the first repo to NTFS now and are enjoying stable backups. The second repo remains on ReFS 4k for now, for experiments.
I don't think you can copy a backup repo to another volume and still keep block cloning.
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

Gostev wrote: No, we really want everyone experiencing the issue to open a ticket with Microsoft instead, to help raise the priority of this issue on their side.
Already did that; MS did not want to support Veeam.
So we tested with multiple Windows Backup jobs and had no problems - the destination FS was reachable all the time.
At this point MS support stopped and is pointing at a Veeam software issue.

:(

MS Case [REG:117040515557419]
xudaiqing
Influencer
Posts: 21
Liked: 4 times
Joined: Apr 14, 2017 5:25 pm
Full Name: xudaiqing
Contact:

Re: REFS 4k horror story

Post by xudaiqing »

The way Veeam uses ReFS seems to create a large amount of metadata, and Windows tries to load that into memory, which causes the problem.
In my tests the biggest spike appears when I delete something from the volume; my thought is that Windows loads the entire metadata set into memory in order to figure out which blocks are no longer in use.
64K will definitely help here, as it reduces the metadata needed.
Based on that, I think synthetic fulls will make things worse, as they need their own sets of metadata.
I currently have a reverse incremental repo with 2 TB used, and the metadata spike on delete is still below 4 GB.
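If you want to watch that spike yourself, the metafile growth shows up in the standard Windows memory counters; here is a minimal monitoring sketch using Get-Counter (stock counter names; the 5-second interval is an arbitrary choice):

Code: Select all

# Sample system-wide memory counters continuously while deleting files,
# to catch the metadata spike described above. Metafile usage is part of
# the system cache, so watch System Cache Resident Bytes in particular.
$counters = '\Memory\Pool Paged Bytes',
            '\Memory\Pool Nonpaged Bytes',
            '\Memory\System Cache Resident Bytes'
Get-Counter -Counter $counters -SampleInterval 5 -Continuous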
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

Gostev wrote:Yes, we've been in touch with the ReFS development team on this issue for a while now. Right now, they are working with one of our customers who has the issue reproducing most consistently. Internally, we do not have a lab that replicates the issue (and based on our support statistics, it does not seem to be very common in general).
I've been having this issue since we got going back in January, and I'm able to encounter the problem pretty reliably. I have one particular disk that just doesn't want to cooperate.

Let me know if I can be of any help!
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

If you open a support case with Microsoft, please PM me the case ID.
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

xudaiqing wrote:I don't think you can copy a backup repo to another volume and still keep block cloning.
Correct, this test by @alesovodvojce was totally invalid, as simply copying files inflates them. ReFS space savings can be enormous, in the same league with deduplicating storage appliances. Also, the real difference in used space consumption between 4KB and 64KB ReFS volumes is only 10%.
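(For anyone unsure which cluster size an existing repo volume uses, fsutil reports it directly; a quick check, assuming R: is the repo volume:)

Code: Select all

# Report ReFS volume details, including Bytes Per Cluster (4096 vs. 65536).
# Run from an elevated prompt; R: is assumed to be the repo volume.
fsutil fsinfo refsinfo R: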
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

richardkraal wrote:So we tested with multiple Windows Backup jobs and had no problems - the destination FS was reachable all the time.
At this point MS support stopped and is pointing at a Veeam software issue.
Nice, so it's like comparing Windows Calculator to Microsoft Excel. Windows Backup does not integrate with ReFS at all (it uses neither block cloning nor file integrity streams), so what's the point of using it to troubleshoot an issue clearly associated with those technologies? This makes zero sense.

I suggest you ask for the case to be escalated to support management; otherwise this issue will never reach R&D in sufficient volume. Needless to say, we will be more than happy to help them reproduce the issue without Veeam in the picture if so desired - all we do is make basic public API calls against regular files on the ReFS volume. That is, IF they are actually interested in finding and fixing the issue with their next-gen file system, so that all of their customers can start leveraging it (tell this to whoever reviews your escalation).
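As an illustration of how public those APIs are: the integrity-streams side is exposed through stock PowerShell cmdlets on Server 2016 (block cloning itself is a Win32 ioctl, FSCTL_DUPLICATE_EXTENTS_TO_FILE, with no built-in cmdlet). A minimal sketch - the file path is a made-up example:

Code: Select all

# Check whether integrity streams are enabled on a file on a ReFS volume.
# Get-FileIntegrity / Set-FileIntegrity ship with Server 2016's Storage module.
Get-FileIntegrity -FileName 'R:\Backups\Job1\Job1.vbk'

# Enable integrity streams (per-file checksums) on the same file.
Set-FileIntegrity -FileName 'R:\Backups\Job1\Job1.vbk' -Enable $true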
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

Gostev wrote:If you open a support case with Microsoft, please PM me the case ID.
I'll talk to my management and see if they're willing to pay for a support ticket with MS. More likely I'm just going to go through the process of rebuilding/migrating everything.
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

Oh, I did not realize you'd have to pay for one. I will keep you in mind then. For example, the Windows dedupe team wanted the affected users to work with them, so they provided a way to open a support case at no charge for everyone. They only needed the case open so that all their processes would work correctly; most importantly, they were actually interested in understanding and solving the issue - and needed customers with a reliable repro.
alesovodvojce
Enthusiast
Posts: 63
Liked: 9 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce »

Gostev wrote: Correct, this test by @alesovodvojce was totally invalid, as simply copying files inflates them. ReFS space savings can be enormous, in the same league with deduplicating storage appliances. Also, the real difference in used space consumption between 4KB and 64KB ReFS volumes is only 10%.
Agreed - the test design was bad. Thanks for pointing that out.
ReFS is a promising filesystem. I have a green light from my boss to pay for an MS ticket and raise the count. Will do that tomorrow.
RGijsen
Expert
Posts: 127
Liked: 29 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: REFS 4k horror story

Post by RGijsen »

Note that when raising a ticket with Pro Support, you'll have to pay in advance - since March 2016 if I remember correctly, could be 2015 as well. If it's identified as a bug, you'll get a refund at the end of the ticket. So have a company credit card at hand when raising it. It used to be the other way around: you'd get an invoice when a ticket was closed that was not identified as a bug. Guess who gets the interest on all that cash now (although that might not be the primary reason :))
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

Gostev wrote:Oh, I did not realize you'd have to pay for one. I will keep you in mind then. For example, the Windows dedupe team wanted the affected users to work with them, so they provided a way to open a support case at no charge for everyone. They only needed the case open so that all their processes would work correctly; most importantly, they were actually interested in understanding and solving the issue - and needed customers with a reliable repro.
Sounds good. After poking through the thread I tried turning off synthetic fulls for all my jobs and, what do you know, I got through a night without it locking up!

I'll run this for a while and see if it locks up again. If not, I'll probably monitor this thread and see if there's ever a permanent fix. We'd definitely like to be able to take synthetic fulls.
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » 3 people like this post

Just checked and the ReFS dev team is on it.
andersgustaf

Re: REFS 4k horror story

Post by andersgustaf »

Gostev wrote:Just checked and the ReFS dev team is on it.
Any news?
Any case number at MS I can refer to in my ticket?


//Anders
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

I got three memory dumps submitted to MS the other day. Gostev has my case ID, but if there's some reference to give MS, I could submit that to the tech dealing with my case as well, to be sure there's no internal communication breakdown.
jslic
Novice
Posts: 3
Liked: 4 times
Joined: Jun 20, 2016 8:30 am
Full Name: Jesper Sorensen
Contact:

Re: REFS 4k horror story

Post by jslic » 2 people like this post

Had to kill two different ReFS 64k repositories this week as both servers were completely locked up due to the ReFS issue. I will be very hesitant to switch back to ReFS again, even if MS comes out with another "fix".
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

One theory I have that would explain why some users have issues and others don't is the difference in the number of concurrent tasks. So one troubleshooting step for those experiencing lockups would be to reduce the number of concurrent tasks on the repository by half and see if that makes any difference to stability. Perhaps even change it to 1 task if you have a lab environment where the issue reproduces. Thanks!
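For anyone who prefers scripting that change over clicking through the repository settings, something along these lines should work with the Veeam PowerShell snap-in (the -LimitConcurrentJobs/-MaxConcurrentJobs parameters are assumed from the 9.x cmdlet help and the repository name is a placeholder - verify with Get-Help Set-VBRBackupRepository on your build):

Code: Select all

# Halve the concurrent task limit on a repository via the Veeam snap-in.
# Parameter names assumed from 9.x documentation - check Get-Help first.
Add-PSSnapin VeeamPSSnapin
$repo = Get-VBRBackupRepository -Name 'ReFS-Repo-01'   # placeholder name
Set-VBRBackupRepository -Repository $repo -LimitConcurrentJobs -MaxConcurrentJobs 2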
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

@gostev: We use no concurrent task limit at all right now and all runs well - but our ReFS storage now has 96 disks instead of the 24 we had before. Even with a limit of 4 tasks, ReFS locked up on the 24-disk storage...

So in our case the much faster storage seems to have fixed the issue for now...
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » 1 person likes this post

Gostev wrote:One theory I have that would explain why some users have issues and others don't is the difference in the number of concurrent tasks. So one troubleshooting step for those experiencing lockups would be to reduce the number of concurrent tasks on the repository by half and see if that makes any difference to stability. Perhaps even change it to 1 task if you have a lab environment where the issue reproduces. Thanks!
I thought so at first too. One of the last series of lockups I got, though, came from just trying to physically delete some Veeam backup images on disk, with Veeam completely disabled. The background cleanup process that slowly frees up disk space after deletion spiked memory steadily to 100% and nuked the server... you would get a few minutes to scramble around on each boot before the ReFS kernel driver wrecked the machine. So while concurrent tasks might be a factor in some cases, even just deleting some files (albeit Veeam block-cloned files) from disk, without Veeam initiating any tasks at all, can cause the issue. This is the series of memory dumps that Microsoft now has its hands on with my case.

I can only imagine that Veeam itself requesting that a repo remove some old block-cloned files, just like I did manually, could result in the same thing happening.

I'm not sure whether deleting similarly-sized files on 2016 ReFS which were *not* block cloned would result in the same sort of disaster. I'm just trying not to breathe too hard around any of our ReFS servers.

Meanwhile, as insurance, I've put another server in place offsite with ZFS, to which I'm running file-based copies and taking daily ZFS snapshots of important servers, and covering other SQL/etc. servers with another backup product also targeting the ZFS storage. I'd encourage everyone using ReFS on 2016 to do something similar (get a copy onto some technology other than ReFS), because there's serious danger of losing an entire repo. We barely got our copy repo to recover, after endless attempts, when manually deleting those files caused it to blow up. For a while I thought the entire volume was a loss. And if my primary repo, also running ReFS, had died at the same time, or during a re-seed? I don't want to think about that.
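For anyone setting up a similar escape hatch, the file-based copy can be as simple as a scheduled robocopy mirror to the non-ReFS target (the paths here are made-up examples, and /MIR deletes destination files that are gone from the source, so point it at a dedicated folder):

Code: Select all

# Mirror the repo folder to a non-ReFS target and keep a log for review.
# Paths are placeholders; /R and /W keep retries short on a flaky share.
robocopy "R:\Backups" "\\zfs-box\veeam-copy" /MIR /R:1 /W:5 /NP /LOG:"C:\Logs\repo-copy.log"

Keep in mind that, as noted earlier in the thread, the copy will be fully inflated - block-clone space savings don't survive a plain file copy.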
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler » 3 people like this post

Deleting files definitely seems to be one of the big triggers. In my testing that was always the point where Windows seemed to go crazy - either when Veeam was deleting lots of files, or even when I just started deleting lots of block-cloned files manually. I've almost wondered if it would be worthwhile to throttle file deletions on ReFS until Microsoft gets to the root of this problem; a rough sketch of what that could look like follows below.
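Just to illustrate the idea (this is not anything Veeam does today): delete one file at a time with a pause, so the background cleanup never gets hit all at once - the path and interval are placeholders:

Code: Select all

# Throttled deletion: remove old backup files one at a time, pausing
# between deletes to give ReFS background cleanup time to catch up.
$files = Get-ChildItem "R:\Backups\OldChain" -File    # placeholder path
foreach ($f in $files) {
    Remove-Item -LiteralPath $f.FullName
    Start-Sleep -Seconds 60    # arbitrary breathing room between deletions
}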

It's been very difficult to find any correlation between users that are having problems and those that are not, but one pattern I think I'm starting to notice is that the customers having the most problems are using periodic synthetic fulls, versus forever forward or reverse incremental. My theory is that this is because periodic synthetic fulls lead to much bigger file deletion events: when a new synthetic full is created, the prior full/incremental chain is deleted as a group. It may also be related to people having much bigger jobs, so many files are merged/deleted around the same time, since all of the synthetic processing/deletions happen at the end of the job - but it's really hard to tell because of how disparate the hardware and configurations are.

Note that I'm not saying people running forever forward or reverse couldn't have the problem; it just seems to trigger more for those with periodic synthetic fulls, though admittedly the sample size is small. The majority of people I talk to that are using ReFS aren't having any issues.
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 »

tsightler wrote:the customers having the most problems are using periodic synthetic fulls, versus forever forward or reverse incremental
Wouldn't that be everyone using 2016+ReFS+Veeam?

I'd love to just use forever forward/reverse, but my understanding is that the only way to have space-efficient backups with long-term grandfather-father-son points in time is 2016 ReFS + synthetic fulls (using the "Keep the following restore points as full backups for archival purposes" option)... I'm new to Veeam, so correct me if I'm wrong.

EDIT: Oh, nm, I see you're probably talking about the primary backup job options on the primary repo instead of copy jobs.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

Just an update from me. Got a private message from one of the board members here. I've been stable for almost 2 months, with no daily reboots, after doing the following: not installing the latest KB4013429 that addresses the ReFS issues; reformatting my array with a smaller block size on the controller (P841) - not the block size in Windows, which is still set to 64k; and, at the same time, moving the RAID controller into a different HP server. These changes have contributed to my 100% uptime so far. No more failed backups or BSODs. One last thing: I also disabled the Data Integrity Scans and the Data Integrity Scan for Crash Recovery (see the sketch after the hotfix listing below).

Code: Select all

Windows PowerShell
Copyright (C) 2016 Microsoft Corporation. All rights reserved.

PS C:\Users\administrator> get-hotfix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
KDNAP-UTIL2   Update           KB3192137     NT AUTHORITY\SYSTEM  9/12/2016 12:00:00 AM
KDNAP-UTIL2   Update           KB3211320     \adm... 3/23/2017 12:00:00 AM
KDNAP-UTIL2   Security Update  KB3213986     NT AUTHORITY\SYSTEM  3/23/2017 12:00:00 AM
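For anyone wanting to disable those same scans, they are ordinary scheduled tasks on Server 2016, so something like this should do it (task names taken from a default install - verify with Get-ScheduledTask first):

Code: Select all

# List, then disable, the ReFS data-integrity scan tasks (default task path).
Get-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\'
Disable-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\' -TaskName 'Data Integrity Scan'
Disable-ScheduledTask -TaskPath '\Microsoft\Windows\Data Integrity Scan\' -TaskName 'Data Integrity Scan for Crash Recovery'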
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

graham8 wrote:EDIT: Oh, nm, I see you're probably talking about the primary backup job options on the primary repo instead of copy jobs.
Yes, I was specifically referring to using forever forward/reverse on the primary, or simple retention on the backup copy job target. Believe it or not, the vast majority of customers I work with use simple retention of either 14 or 30 days on their Veeam repositories. Most have a dedupe appliance or some other means (tape) if they need longer retention, because those boxes provide global dedupe, which is really useful for longer-term retention - especially since the majority of customers I work with have >1000 VMs.
kubimike wrote:Reformatting my array with a smaller block size on the controller (P841) - not the block size in Windows, which is still set to 64k.
What block size did you use? The general Veeam recommendation has been larger block sizes for RAID, typically 64-128K, but I can see how something smaller might be useful with ReFS + block clone if the underlying device is RAID5/6, as large block sizes could lead to lots of stripe re-writes.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@tsightler I'm on RAID 6+0; per HP's crappy recommendation I was at 512k. I've since blown away the array and left the stripe size at the default (128k), and the issues are gone. I was getting a warning in my controller's log about memory being maxed out all the time at 512k. My controller has 4GB on board - I found it amazing that it would be full. Those errors are gone too.
bjackson@agi.com
Novice
Posts: 9
Liked: 1 time
Joined: Nov 23, 2016 7:41 pm
Full Name: Brad Jackson
Contact:

Re: REFS 4k horror story

Post by bjackson@agi.com » 1 person likes this post

This is not a Veeam-specific issue. I've had it for 6 months now. I'm running a Server 2016 cluster with S2D and Hyper-V. The S2D virtual disks are formatted with ReFS 4k. I've had different hosts in the cluster hang (black screen, no response to keyboard or mouse, requires a hard power reset). Initially, I thought it was a hardware problem and worked with the hardware vendor for a few months; they found no problems. The issue "seemed" to happen when one of the nodes in the cluster was over 50% RAM utilization. I only found this thread yesterday (the MS forums are not very helpful) because I didn't think it was a Veeam problem (we run 9.5 to back up all of our VMs, but not to ReFS). The Data Integrity Scan does not correlate with any of the times the system hung. For sure the problem is with ReFS; I'm not sure if it hits only Storage Spaces users or anyone running ReFS. I have not figured out how to reproduce my problem consistently (anywhere from 2-90 days between crashes).

Gostev, is there anything that I should reference when opening my Microsoft ticket to help the ReFS team? All of the issues I'm seeing are identical to what other users in this forum are seeing.

For those of you who haven't opened a Microsoft ticket because of the cost: does anyone in your organization have an MSDN account? I learned yesterday that you can open 2 tickets a year for free with EACH MSDN account. I work in a .NET software dev shop, so I will be opening plenty of tickets with Microsoft now.

System Specs for each of the six nodes in our cluster:
-2x Intel Xeon E5-2640
-512GB RAM
-128GB SSD for OS
-2x 2TB Samsung SM863 for Journal in S2D
-10x 8TB Hitachi Ultrastar He8 for triple mirror in S2D
-virtual disks in S2D
--1x 5TB ReFS
--3x 10TB ReFS
Gostev
Chief Product Officer
Posts: 31812
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

Brad, thanks for sharing that you have the same issue with Hyper-V on S2D. This should certainly raise the priority of the issue in Microsoft's eyes, if that is the case (Hyper-V on S2D is the reference architecture for Hyper-V 2016 anyway). Also, this rules out one of my theories (that our usage of file integrity streams is potentially causing this).

As far as the support case goes - I offered the ReFS team more joint customers to work with if needed, but I am guessing they probably have enough to collect info from already. I suggest you mention in your support case that you are likely having the same issue the ReFS team is currently troubleshooting with Veeam customers; this should help them connect the dots faster once your ticket gets to the dev team through escalations.

Thanks!
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal » 1 person likes this post

I've upgraded my dedicated backend (the SMB share server hosting the ReFS volume) from 64GB to 384GB of RAM, and backups are running fine now.
The gateway server is running on a different dedicated machine.
The perfmon logs also seem fine now - no gaps in the logs - and the system does not lock up anymore. RamMap shows metafile usage of 50GB (!).
Let's see what happens over the next weeks.

fingers crossed

used hardware
DL380 Gen9, dual cpu, 384gb ram, 12x 8TB, P841/4GB (64k stripe, Raid6). Win2016 ReFS 64k
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@richardkraal Are you running 4.58 on that P841?
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

Sorry, it is the P840 (the non-external version).

We use firmware 4.52.
I just saw a new firmware for the controller -> 5.04 (21 Apr 2017).

I'll postpone that for a few weeks to see if the memory upgrade works.
kubimike
Veteran
Posts: 391
Liked: 56 times
Joined: Feb 03, 2017 2:34 pm
Full Name: MikeO
Contact:

Re: REFS 4k horror story

Post by kubimike »

@richardkraal Oh really? I just googled quickly and didn't see it. Have the URL handy?

** edit
Found it.

http://h20564.www2.hpe.com/hpsc/swd/pub ... nvOid=4064

It even includes fixes for Storage Spaces Direct.