JVA@Alsic
Novice
Posts: 5
Liked: never
Joined: Dec 29, 2014 10:00 am
Full Name: Jeroen Van Acker
Contact:

Re: REFS 4k horror story

Post by JVA@Alsic »

Indeed, not all disks support this.
Since we've implemented this, all our problems have been resolved.
Maybe there is some other write-caching configuration available on your system to speed up the process?
Tuning of the RAID controller, perhaps?
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

For us the option is also greyed out (and disabled). I also cannot understand how Veeam is still not warning users!
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

maybe it has been mentioned/debated before (and I'm sorry if that's the case), but for the most impacted users here, I'm curious what RAID stripe size you are using? I put mine at 256k

https://www.virtualtothecore.com/en/vee ... ripe-size/
JVA@Alsic
Novice
Posts: 5
Liked: never
Joined: Dec 29, 2014 10:00 am
Full Name: Jeroen Van Acker
Contact:

Re: REFS 4k horror story

Post by JVA@Alsic »

64K, to align with block size of ReFS.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

Gostev wrote:@Richard, honestly it does not seem like your issue has anything to do with the issue discussed in this thread. It could be a simple I/O error due to heavy concurrent load on the target volume. By the way, you may consider increasing your RAID stripe size to 128KB or 256KB to better match the Veeam workload (avg. block size is 512KB); this will cut IOPS from the backup job significantly (and your backup storage IOPS capacity is not fantastic, so it could really use this).
^ this

I'm no expert here and by saying this I might be totally out of line but I'm not sure aligning the raid stripe size to the refs block size makes a lot of sense
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

https://bp.veeam.expert/job_configurati ... ssion.html
Local – this is the default setting and is recommended when using a disk-based repository. When this setting is selected, Veeam reads data and calculates hashes in 1 MB chunks.
LAN – this value is recommended when using a file-based repository such as SMB shares. When this setting is selected, Veeam reads data and calculates hashes in 512 KB chunks.
WAN – this value is recommended when backing up directly over a slow link or for replication as it creates the smallest backups files at the cost of memory and backup performance. When this setting is selected, Veeam reads data and calculates hashes in 256 KB chunks.
Local (>16 TB) – this setting is recommended for large backup jobs with more than 16 TB of source data in the job. When this setting is selected, Veeam reads data and calculates hashes in 4 MB chunks.
my two big jobs: 8 TB (configured as LAN target, so 512k) and 5 TB (configured as WAN target, so 256k)
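For what it's worth, a quick back-of-the-envelope sketch (my own, purely illustrative; block counts are for source data before compression, and the setting names/sizes are as listed above) of what those chunk sizes mean for jobs like mine:

```python
# Rough, illustrative estimate of how many blocks Veeam tracks per job
# for each storage optimization setting (source data, pre-compression).
SETTING_BLOCK_KB = {
    "Local": 1024,           # 1 MB
    "LAN": 512,              # 512 KB
    "WAN": 256,              # 256 KB
    "Local (>16 TB)": 4096,  # 4 MB
}

def blocks_in_job(job_tb: float, setting: str) -> int:
    """Source size in TB divided by the setting's block size."""
    job_kb = job_tb * 1024 ** 3  # TB -> KB in binary units
    return int(job_kb / SETTING_BLOCK_KB[setting])

print(blocks_in_job(8, "LAN"))  # 8 TB job at 512 KB -> 16777216 blocks
print(blocks_in_job(5, "WAN"))  # 5 TB job at 256 KB -> 20971520 blocks
```

Halving the chunk size doubles the number of blocks (and hence hashes and metadata) per job, which is the trade-off behind the WAN/LAN/Local settings.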
Gve
Service Provider
Posts: 33
Liked: 2 times
Joined: Apr 28, 2015 3:28 pm
Full Name: Guillaume
Location: France
Contact:

Re: REFS 4k horror story

Post by Gve »

antipolis wrote:maybe it has been mentioned/debated before (and I'm sorry if that's the case), but for the most impacted users here, I'm curious what RAID stripe size you are using? I put mine at 256k

https://www.virtualtothecore.com/en/vee ... ripe-size/
128k for me.

but if the problem were on the physical storage layer, wouldn't I see disk latency or a disk queue?
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

Correct. You should see latency / queue.

We even tried both - we have one storage system which does only 64k and another which we set to 256k. No difference.
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

antipolis wrote:I'm no expert here and by saying this I might be totally out of line but I'm not sure aligning the raid stripe size to the refs block size makes a lot of sense
The way you put it, it does not make sense indeed... however, increasing the RAID stripe size to more closely match Veeam's typical I/O size makes a lot of sense with IOPS-constrained backup storage (and that is regardless of the underlying file system used by the backup repository).
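To put toy numbers on this point (my own sketch; it assumes the 512 KB average Veeam write quoted earlier in this thread and ideal stripe-aligned writes):

```python
import math

AVG_WRITE_KB = 512  # typical Veeam block size, per this thread

def ios_per_write(stripe_kb: int, write_kb: int = AVG_WRITE_KB) -> int:
    """How many stripe-sized I/Os one aligned sequential write touches.
    Smaller stripes mean more back-end I/Os for the same payload."""
    return math.ceil(write_kb / stripe_kb)

for stripe_kb in (64, 128, 256):
    print(f"{stripe_kb:>3} KB stripe -> {ios_per_write(stripe_kb)} I/Os per 512 KB write")
```

Only the ratio matters here: a 256 KB stripe services the same 512 KB write in a quarter of the back-end I/Os a 64 KB stripe needs, which is exactly what helps an IOPS-constrained repository.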
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » 1 person likes this post

By the way, I checked with the ReFS dev team for updates and apparently they are making good progress working with the affected customers, with the newest performance fixes looking even more promising. So, thanks to everyone who opened support cases with Microsoft - and continues working with them in testing these latest private fixes! I do encourage everyone to open a support case with Microsoft on this issue if you have not already - because the issue does not reproduce easily in every lab, and takes a lot of work to simulate for troubleshooting purposes.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

Gostev wrote:increasing the RAID stripe size to more closely match Veeam's typical I/O size makes a lot of sense with IOPS-constrained backup storage (and that is regardless of the underlying file system used by the backup repository).
this is what I meant: reducing the stripe size in hopes of "aligning" to ReFS will not bring better performance, and can actually do the opposite
mkretzer wrote:Correct. You should see latency / queue.

We even tried both - we have one storage system which does only 64k and another which we set to 256k. No difference.
good to know :|

what really made the issue manageable for me was spreading my jobs' synthetics & rebuilds across different days of the week...

I'm really curious how this all ends... just reading about "beta drivers" and registry settings for a filesystem that is supposed to be production-ready makes me anxious...
dellock6
VeeaMVP
Posts: 6166
Liked: 1971 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: REFS 4k horror story

Post by dellock6 »

I still find it a bit hard to understand why you all want to run synthetic fulls over ReFS. On regular storage, I totally understand that synthetics are a good way to split a chain, avoiding it becoming too long and risking one corrupted incremental corrupting all the following ones, but on ReFS you are just re-pointing the same blocks multiple times in the new full. Fast clone doesn't do any read or write check the way Veeam does for synthetic fulls on regular storage, so in my opinion there's really no point in running synthetic fulls over ReFS. And since it seems that large amounts of metadata deletions are one of the causes of the lack of performance in ReFS, synthetic fulls that expire and need to be removed will not help for sure, while forever incremental would distribute the much smaller deletions over multiple days.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

Running synthetic fulls helps speed up restore times when you have longer retention periods, but I would agree that this does not make sense unless you have a fairly long retention period. Think about it: if you have 30-day retention with forever forward and want to restore your most recent backup, you have to restore a full and process 30 different incrementals, whereas if you have a weekly synthetic full, you only have at most 7 incrementals.

My pre-ReFS strategy has been to use forever forward with 14-day retention, and weekly synthetic full + transform previous chains into rollbacks for 60-day retention. I believe using the "transform previous chains into rollbacks" option would help with deletions in the same way that forever forward does.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

nmdange wrote:Think about it: if you have 30-day retention with forever forward and want to restore your most recent backup, you have to restore a full and process 30 different incrementals, whereas if you have a weekly synthetic full, you only have at most 7 incrementals.
Veeam does not work this way. It does not restore the full and then the next 29 increments; it opens all 30 files at once, reads in the metadata, and then reads whatever blocks are required from each file. The metadata lets it know which file contains the most recent version of each block, so only the most recent version of a block is ever read, no matter how many points there are in the chain. The Veeam engine was built exactly for this; just about the only performance difference between a chain with 5 points and a chain with 30 points is the time spent reading the metadata from each file, which is typically about 1/10th of a second per file.

Now sure, it's possible that really large backups (which have lots of metadata) with really large chains can start seeing significant delays, and also in cases of very slow repositories (like dedupe appliances), where reading metadata can be much slower than normal. But for typical ReFS deployments, which are usually on quite fast disks, there's unlikely to be more than a few seconds' difference in restore times between a chain of 5 points and a chain of 30 points. Typically, we don't start seeing measurable performance differences until chains reach near the 100 restore point range, and even there, it's usually measured in tens of seconds of difference.
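As a toy model of what Tom describes (my own sketch, nothing like Veeam's actual metadata format): a restore is a newest-wins lookup across every file's block map, so chain length adds metadata reads, not extra data reads:

```python
# Minimal sketch: resolve each block to its most recent version across a
# backup chain. Every file's metadata is read once; each data block is
# then read exactly once, from whichever file holds its newest version.
def restore_block_map(chain):
    """chain: list of {block_id: data} dicts, oldest (the full) first.
    Returns block_id -> (data, index of the file it comes from)."""
    resolved = {}
    for idx, backup_file in enumerate(chain):
        for block_id, data in backup_file.items():
            resolved[block_id] = (data, idx)  # newer files override older
    return resolved

full = {0: "A0", 1: "B0", 2: "C0"}
inc1 = {1: "B1"}  # block 1 changed on day 1
inc2 = {2: "C2"}  # block 2 changed on day 2
restore = restore_block_map([full, inc1, inc2])
# block 0 comes from the full; blocks 1 and 2 from the increments
```

Adding more incrementals to `chain` only adds more (small) dictionaries to scan; each block is still fetched once, from the newest file that holds it.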
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

ok, I'm really confused here... I've read a lot about Veeam with ReFS, watched the videos from Gostev, and I might have missed it, but I don't recall seeing anywhere that the new best practice is to just stop doing synthetic backups on ReFS...

anyway, synthetics or not doesn't explain the performance issues many are seeing; disabling synthetics on regular backups won't solve the issue for merge operations or backup copy job GFS
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler » 3 people like this post

antipolis wrote:ok, I'm really confused here... I've read a lot about Veeam with ReFS, watched the videos from Gostev, and I might have missed it, but I don't recall seeing anywhere that the new best practice is to just stop doing synthetic backups on ReFS...

anyway, synthetics or not doesn't explain the performance issues many are seeing; disabling synthetics on regular backups won't solve the issue for merge operations or backup copy job GFS
I'll try to explain it from my perspective. Gostev is part of product development (specifically, I believe his title is currently Senior VP of Product Management), while Luca and I work in the field with customers and their deployments (I'm Principal Solutions Architect in NA, and I believe Luca's title is Cloud and Hosting Architect in EMEA). We all work together closely, but I suppose the best way to say it might be that Gostev, and likely all of us in the early days, provide practices that are best in theory, because there is not enough field practice to provide anything else; over time, the field proves which practices are actually best, and sometimes this differs because of various technological issues that were unforeseen.

The Veeam best practices document is created by Veeam architects who work in the field, like Luca and myself (and many others), because we observe what works for customers and what causes problems. Thus, the recommendations/suggestions you see here may actually be based on changes we are planning to make in the best practices going forward, or may be based on a specific situation that exists now (like the current issues with ReFS) that we hope will go away in the future, but we want customers to have the best experience now. This is not to say that Gostev and product development do not also see field issues; they of course do, through support cases and other avenues (including speaking with field resources like myself and Luca and others), but from that perspective their focus is mostly on fixing the issues so that hopefully theory and practice get closer together, while we, as field resources, work to provide recommendations that will be successful even with any current technical limitations.

And that's exactly where this specific recommendation is coming from. Sure, in theory, using synthetic fulls on ReFS via block clone should not cause any problems. However, in practice, we've seen that having lots of cloned files, with many files referencing the same blocks, has many more negative impacts than expected, one of the more common being when those files have to be deleted. Microsoft continues to work on these issues, so hopefully, at some point in the future, theory will get closer to practice. But for now, what those of us in the field are saying is that, based on field experience, customers that run synthetics have significantly more performance and stability issues than those that do not, so if you are running synthetics in cases where they offer minimal benefit, it's probably best not to use them.

The reason for this is actually pretty obvious if you think about it. Most of the performance and stability issues around ReFS seem to be with how ReFS accounts for cloned blocks. If I have 45 days of retention with synthetic fulls, that means I have 6 VBK files whose blocks are referenced anywhere from 1-6 times across those files. If my full backups are 20TB, that's 20TB worth of blocks that have to have their metadata updated. On the other hand, if I just have 45 days of forever forward backups, the only time there are ever any duplicate blocks is during the merge, and, as soon as the merge is complete, the file with the duplicate blocks (the oldest incremental) is deleted. Since this is not a full backup, it's much smaller, so far fewer blocks need metadata updated during the delete, and the update operation is overall simpler since the maximum number of times a block will be referenced is 2, and then it's immediately back down to 1 again.

In other words, when running synthetics on ReFS, there's a massive increase in the amount of accounting work that ReFS has to do compared to running forever incremental and, in practice, it does impact both the performance and stability of the solution, even if theory says it should not.
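To put rough numbers on that accounting difference, here's a toy model of my own (assuming one reference-count entry per 64 KB ReFS cluster, the 20 TB full from the example above, and a hypothetical ~200 GB daily incremental):

```python
# Toy model: how many per-cluster reference-count updates a delete can
# trigger under each retention scheme (one refcount per 64 KB cluster).
CLUSTER_KB = 64

def clusters(size_tb: float) -> int:
    """Number of 64 KB clusters in `size_tb` terabytes of data."""
    return int(size_tb * 1024 ** 3 / CLUSTER_KB)

# Synthetic fulls: expiring a 20 TB VBK touches the metadata of every
# cluster it shares with the other fulls still in the chain.
synthetic_delete = clusters(20.0)

# Forever forward: only the merged oldest incremental (~200 GB assumed
# here) is deleted, and its clusters were referenced at most twice.
forever_delete = clusters(0.2)

print(synthetic_delete // forever_delete)  # -> 100 (roughly 100x more work)
```

The exact sizes are invented; the point is only that the delete-time bookkeeping scales with the size of the file being removed and how widely its blocks are shared.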

Of course you are correct to point out that there are cases, like Backup Copy with GFS, where synthetics are the only option; that's certainly true. We're hopeful that Microsoft eventually gets to the bottom of those specific issues, in which case it probably won't matter anymore. But for now, we're simply saying that, based on the results of hundreds of customer deployments and the current status of ReFS, running synthetic fulls in cases where they are not needed and offer little to no benefit is likely to cause more problems than using a forever mode.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

tsightler wrote: I'll try to explain it from my perspective.
Thank you for this very informative post!

Just one thing that disturbs me
dellock6 wrote: there's in my opinion really no point in running synthetic fulls over ReFS
tsightler wrote:In other words, when running synthetics on ReFS, there's a massive increase in the amount of accounting work that ReFS has to do compared to running forever incremental and, in practice, it does impact both the performance and stability of the solution, even if theory says it should not.
if synthetics do not make any sense with ReFS, and even potentially bring instability, why would we need ReFS in the first place? just to speed up merges? I don't see that point justifying ReFS block cloning by itself (but I guess it depends on deployments and environments)

for backup copy I guess the same logic applies (when not using GFS): rebuilds = useless with ReFS?

and for GFS, would you suggest it's better to have a huge retention on the job (6 mo-1 yr) instead? (granted, of course, that one can afford to store that much in daily incrementals...)

it's just that I deployed ReFS having in mind that the BP was definitely to go synthetics/rebuilds/GFS... in light of your arguments, the benefits of ReFS suddenly look a lot less appealing to me...

another way to see it (from my current understanding) :

- forever incremental backup chain on ntfs : can get slow over time, possibility to optimize with synthetics
- forever incremental backup chain on refs : can get slow over time, and will stay that way no matter what

are these assumptions correct ?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

antipolis wrote:if synthetics do not make any sense with ReFS, and even potentially bring instability, why would we need ReFS in the first place? just to speed up merges? I don't see that point justifying ReFS block cloning by itself (but I guess it depends on deployments and environments)
Well yes, to me speeding up merges was the number one benefit ReFS provided, as this was one of the most difficult design issues for Veeam at scale. I don't know how many VMs you have, but I work with customers that have 1000s of VMs and sometimes PBs of data, and merges are a big deal at that scale. All you have to do is search for "slow merge" on the forum to see that it was a number one pain point for Veeam for many years.

However, we're not saying synthetics never make sense. We're saying that if you are using synthetics in cases where they offer little to no benefit (I'll say any retention < 100 points, although it's hard to set an exact line in the sand), then our suggestion, at least until the performance/stability issues are worked out, is to not use synthetics in those cases. The point is that we were seeing lots of people with 30- or 45-day retention using synthetics, and they just don't get much advantage in that scenario, so don't use them and you'll likely have a better experience with ReFS. It's really that simple.
antipolis wrote:for backup copy I guess the same logic applies (when not using GFS): rebuilds = useless with ReFS?
I'm not sure what you mean by rebuilds, but I strongly suggest that you enable health checks and defrag/compacts even on ReFS, for any forever backup chain, whether primary backup jobs or backup copy.
antipolis wrote:and for GFS, would you suggest it's better to have a huge retention on the job (6 mo-1 yr) instead? (granted, of course, that one can afford to store that much in daily incrementals...)
You are likely to find it to be more stable, but I understand that longer-term GFS retention is one of the highly desired use cases for ReFS. If you want GFS retention with ReFS, then we have to hope Microsoft eventually gets to the bottom of this issue, but you're probably OK for now.
antipolis wrote:another way to see it (from my current understanding) :

- forever incremental backup chain on ntfs : can get slow over time, possibility to optimize with synthetics
- forever incremental backup chain on refs : can get slow over time, and will stay that way no matter what

are these assumptions correct ?
I wouldn't state it this way. Once again, there's no recommendation here that, if you have a year of retention, you shouldn't use GFS and synthetics on ReFS. The recommendation to not use synthetics is only for cases where the synthetic offers little to no benefit when used on ReFS. If I'm only keeping 45 days of restore points, you'd be hard pressed to come up with a measurable benefit of synthetic fulls on ReFS. But if I have a requirement to keep, say, 6 weeklies and 12 monthlies, then sure, GFS and synthetics make sense, though you may be more likely to hit some of the ReFS performance issues. Hopefully Microsoft will eventually get to the bottom of those and then it won't matter so much.
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

Good to know the number of restore points is not as much of an issue as you'd think! I'm hoping being on the new 1709/Semi-Annual Channel build will prove to be better with ReFS than 2016 has been so far.
suprnova
Enthusiast
Posts: 38
Liked: never
Joined: Apr 08, 2016 5:15 pm
Contact:

Re: REFS 4k horror story

Post by suprnova »

The reason we are using ReFS for synthetic fulls is that incremental merges on NTFS were extremely slow (24-70 hours). We also have instability even with fast clone merges on ReFS.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler »

suprnova wrote:The reason we are using ReFS for synthetic fulls is that incremental merges on NTFS were extremely slow (24-70 hours). We also have instability even with fast clone merges on ReFS.
It's certainly possible that you can still have problems even without synthetic fulls; the recommendations are based on the fact that we see far fewer problems from people who do not use synthetic fulls vs. those that do, but less <> none.

When I first started testing ReFS late last year, it was pretty easy to crash my home lab environment every 3 weeks with just a very basic setup, and I was able to crash the S3260 in the lab nightly. With the current patches from July, and the registry key tweaks, neither of these environments has experienced any problems, performance- or stability-wise, in the last 2 months. Every customer that I've been working with has been able to be made stable with the beta drivers and/or the July updates, assuming the registry keys were set and other minimum requirements were met (proper memory, proper task limits, etc.). Most customers that are doing GFS points also seem to become stable, or at least usable, but, just like in this thread, there are some with performance or stability issues even after all the tweaks, although many have some other questionable factors.

Anyone on this thread who would like to become part of the ongoing work: I would love to look at your environment, make sure we have full details of your setup, and continue to track your and our progress. Feel free to PM me your email address and I'll reach out to you. I think all of us hope that Microsoft eventually just gets this thing right but, for right now, I think it's useful to collect information, compare environments that work vs. those that don't, and share that information when it seems useful for others.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

tsightler wrote:I'm not sure what you mean by rebuilds, but I strongly suggest that you enable health checks and defrag/compacts even on ReFS, for any forever backup chain, whether primary backup jobs or backup copy.
I was referring to full backup file maintenance

The description says "Use these settings to cleanup, defragment and compact full backup file periodically when the job schedule does not include periodic fulls". Having synthetics enabled on backup jobs and GFS on backup copy jobs, I disabled this... so enabling full backup file maintenance will not bring the same downsides as synthetics on ReFS? I mean, this still uses block cloning, right?

health checks I left enabled of course
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS 4k horror story

Post by tsightler » 1 person likes this post

antipolis wrote:I was referring to full backup file maintenance

The description says "Use these settings to cleanup, defragment and compact full backup file periodically when the job schedule does not include periodic fulls". Having synthetics enabled on backup jobs and GFS on backup copy jobs, I disabled this... so enabling full backup file maintenance will not bring the same downsides as synthetics on ReFS? I mean, this still uses block cloning, right?
Got it. So, of course, the compact/defragment option does use block clone during the process, but the old file is deleted immediately after the compact/defragment is finished, so all the reference counts go back to one. Deletes are definitely one case that can trigger the behavior, but deleting a file referencing a block twice doesn't seem to be as bad as purging many different VBK files in one fast action, each of which references blocks shared many times. So far, I haven't seen file compact/defrag cause an issue, and I've had customers leave it on because it is needed to free up unused blocks over time.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

tsightler wrote: Got it. So, of course, the compact/defragment option does use block clone during the process, but the old file is deleted immediately after the compact/defragment is finished, so all the reference counts go back to one. Deletes are definitely one case that can trigger the behavior, but deleting a file referencing a block twice doesn't seem to be as bad as purging many different VBK files in one fast action, each of which references blocks shared many times. So far, I haven't seen file compact/defrag cause an issue, and I've had customers leave it on because it is needed to free up unused blocks over time.
Thank you for your clarifications, this is much appreciated
Mgamerz
Expert
Posts: 160
Liked: 28 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: REFS 4k horror story

Post by Mgamerz »

Can someone post the beta drivers or info on them? While this thread has lots of valuable info, I see many references to beta drivers, but the OP doesn't have any useful info and I don't really want to search through hundreds of posts to try to find them...
DaveWatkins
Veteran
Posts: 370
Liked: 97 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: REFS 4k horror story

Post by DaveWatkins »

As far as I'm aware, the drivers referred to here as beta drivers were released in the August Cumulative Update. All the registry entries posted should work on an up-to-date machine.
Cicadymn
Enthusiast
Posts: 26
Liked: 12 times
Joined: Jan 30, 2017 7:42 pm
Full Name: Sam
Contact:

Re: REFS 4k horror story

Post by Cicadymn »

I spoke with Microsoft's ReFS team the other day. The last update was that they were backporting the fix that they believe will solve the issues. Originally they expected it to be out sometime in August or September; however, that wasn't the case. I got the following response from them:

No, this fix wasn’t ported yet. As this isn’t trivial change we want to be sure there is no regression before we release it.

So it sounds like they're really trying to cross their T's and dot their I's. Fingers crossed this will get us to some semblance of normality on ReFS!
nmdange
Veteran
Posts: 528
Liked: 144 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: REFS 4k horror story

Post by nmdange »

The phrase "backported" says to me the fix is part of the next build (1709), which will be out in the next few weeks. I definitely plan on testing backups on the new build, but I haven't really had the same issues other people have had. Anyone who's been having problems willing to try out the new release?
DaveWatkins
Veteran
Posts: 370
Liked: 97 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: REFS 4k horror story

Post by DaveWatkins »

On a possibly related note, can anyone confirm this?
ReFS is not supported on SAN-attached storage.
From here https://docs.microsoft.com/en-us/window ... s-overview

Is anyone else running on iSCSI or FC attached disks?
mkretzer
Veeam Legend
Posts: 1203
Liked: 417 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

We had it on FC. Does anyone have the same performance issues on non-iSCSI and non-FC disks?