REFS issues (server lockups, high CPU, high RAM)

Re: REFS 4k horror story

by tsightler » Tue Sep 26, 2017 3:44 pm 3 people like this post

antipolis wrote:ok I'm really confused here... I've read a lot of things on veeam /w refs, watched the videos from Gostev, and I might have missed it but I don't recall seeing anywhere that the new best practice is just stop doing synthetic backups on refs...

anyway, synthetics or not doesn't explain the performance issues many are seeing; disabling synthetics on regular backups won't solve the issue for merge operations or backup copy job GFS

I'll try to explain it from my perspective. Gostev is part of product development (specifically I believe his title is currently Senior VP of Product Management), while Luca and I work in the field with customers and their deployments (I'm Principal Solutions Architect in NA, and I believe Luca's title is Cloud and Hosting Architect in EMEA). We all work together closely, but I suppose the best way to say it is that Gostev, and likely all of us in the early days, provided practices that are best in theory, because there was not enough field practice to provide anything else. Over time, the field proves which practices are actually best, and sometimes this differs from theory because of technological issues that were unforeseen.

The Veeam best practices document is created by Veeam architects who work in the field, like Luca and myself (and many others), because we observe what works for customers and what causes problems. The recommendations/suggestions you see here may therefore be based on changes we are planning to make to the best practices going forward, or on a specific situation that exists now (like the current issues with ReFS) that we hope will go away in the future, but we want customers to have the best experience now. This is not to say that Gostev and product development do not also see field issues; they of course do, through support cases and other avenues (including speaking with field resources like myself, Luca and others). But from that perspective their focus is mostly on fixing the issues so that theory and practice hopefully get closer together, while we, as field resources, work to provide recommendations that will be successful even with any current technical limitations.

And that's exactly where this specific recommendation is coming from. Sure, in theory, using synthetic fulls on ReFS via block clone should not cause any problems. However, in practice, we've seen that having lots of cloned files, with many files referencing the same blocks, has many more negative impacts than expected, one of the more common being when those files have to be deleted. Microsoft continues to work on these issues, so hopefully, at some point in the future, theory will get closer to practice. For now, what those of us in the field are saying is that, based on field experience, customers that run synthetics have significantly more performance and stability issues than those that do not, so if you are running synthetics in cases where they offer minimal benefit, it's probably best to not use them.

The reason for this is actually pretty obvious if you think about it. Most of the performance and stability issues around ReFS seem to be with how ReFS accounts for cloned blocks. If I have 45 days of retention with synthetic fulls, that means I have 6 VBK files, and the blocks in every one of those files have anywhere from 1-6 references. If my full backups are 20TB, that's 20TB worth of blocks that have to have their metadata updated. On the other hand, if I just have 45 days of forever forward backups, the only time there are ever any duplicate blocks is during the merge, and, as soon as the merge is complete, the file with the duplicate blocks (the oldest incremental) is deleted. Since this is not a full backup, it's much smaller, so far fewer blocks need metadata updated during the delete, and the update operation is overall simpler since the maximum number of times a block will be referenced is 2, and then immediately back down to 1 again.

In other words, when running synthetics on ReFS, there's a massive increase in the amount of accounting work that ReFS has to do compared to running forever incremental and, in practice, it does impact both the performance and stability of the solution, even if theory says it should not.
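As a rough illustration of that accounting argument, here is a toy Python model. It is only a sketch: the `ToyVolume` class, the file names, and the block counts (`FULL_BLOCKS`, `CHANGE_PER_WEEK`, `DAILY_BLOCKS`) are all made up for the example, and it says nothing about how ReFS actually tracks cloned blocks internally. It just counts how many per-block reference updates a delete would need in each scheme.

```python
from collections import Counter


class ToyVolume:
    """Toy block-clone bookkeeping; NOT a model of real ReFS internals."""

    def __init__(self):
        self.refcount = Counter()  # block id -> number of file references
        self.files = {}            # file name -> list of block ids

    def _unref(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]

    def write_file(self, name, blocks):
        self.files[name] = list(blocks)
        for block in self.files[name]:
            self.refcount[block] += 1

    def clone_into(self, name, offset, blocks):
        """Point part of an existing file at already-stored blocks."""
        target = self.files[name]
        for i, block in enumerate(blocks):
            self._unref(target[offset + i])
            target[offset + i] = block
            self.refcount[block] += 1

    def delete_file(self, name):
        """Delete a file; return how many reference counts had to change."""
        blocks = self.files.pop(name)
        for block in blocks:
            self._unref(block)
        return len(blocks)

    def max_refs(self):
        return max(self.refcount.values(), default=0)


# Illustrative numbers only (assumptions, not measurements).
FULL_BLOCKS = 20_000     # stand-in for a 20TB full
CHANGE_PER_WEEK = 1_000  # blocks that differ between consecutive fulls
DAILY_BLOCKS = 200       # size of one daily incremental


def synthetic_fulls(weeks=6):
    """Six weekly fulls sharing unchanged blocks, then delete the oldest."""
    vol = ToyVolume()
    current = list(range(FULL_BLOCKS))
    next_id = FULL_BLOCKS
    for week in range(weeks):
        if week > 0:
            start = (week - 1) * CHANGE_PER_WEEK
            for i in range(start, start + CHANGE_PER_WEEK):
                current[i] = next_id  # a changed block gets new storage
                next_id += 1
        vol.write_file(f"full-{week}.vbk", current)
    peak = vol.max_refs()
    updates = vol.delete_file("full-0.vbk")
    return peak, updates


def forever_forward():
    """One full plus the oldest incremental, merged and then deleted."""
    vol = ToyVolume()
    vol.write_file("full.vbk", range(FULL_BLOCKS))
    inc = list(range(FULL_BLOCKS, FULL_BLOCKS + DAILY_BLOCKS))
    vol.write_file("oldest.vib", inc)
    vol.clone_into("full.vbk", 0, inc)  # merge via block clone
    peak = vol.max_refs()
    updates = vol.delete_file("oldest.vib")
    return peak, updates


print("synthetic fulls (peak refs, updates on delete):", synthetic_fulls())
print("forever forward (peak refs, updates on delete):", forever_forward())
```

With these made-up numbers, deleting the oldest synthetic full touches every block of a full whose blocks are shared up to 6 ways, while deleting the merged incremental touches only the incremental's blocks, which were shared at most 2 ways.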

Of course you are correct to point out that there are cases, like Backup Copy with GFS, where synthetics are the only option; that's certainly true. We're hopeful that Microsoft eventually gets to the bottom of those specific issues, in which case it probably won't matter anymore. But for now, we're simply saying that, based on the results of hundreds of customer deployments and the current status of ReFS, running synthetic fulls in cases where they are not needed and offer little to no benefit is likely to cause more problems than using forever mode.
tsightler
Veeam Software
 
Posts: 4897
Liked: 1836 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: REFS 4k horror story

by antipolis » Tue Sep 26, 2017 4:08 pm

tsightler wrote:I'll try to explain it from my perspective.


Thank you for this very informative post !

Just one thing that disturbs me

dellock6 wrote: there's in my opinion really no point in running synthetic fulls over ReFS


tsightler wrote:In other words, when running synthetics on ReFS, there's a massive increase in the amount of accounting work that ReFS has to do compared to running forever incremental and, in practice, it does impact both the performance and stability of the solution, even if theory says it should not.


if synthetics do not make any sense with ReFS, and even potentially bring instability, why would we need ReFS in the first place ? just to speed up merges ? I don't see that point justifying refs block cloning by itself (but I guess it depends on deployments and environments)

for backup copy I guess the same logic applies (when not using GFS) : rebuilds = useless with reFS ?

and for GFS would you suggest it's better to have a huge retention on the job (6 mo-1yr) instead ? (granted of course that one can afford to store that time worth of daily incrementals...)

it's just that I deployed refs having in mind that BP was definitely to go synthetics/rebuilds/GFS... in light of your arguments the benefits from refs suddenly looks a lot less appealing to me...

another way to see it (from my current understanding) :

- forever incremental backup chain on ntfs : can get slow over time, possibility to optimize with synthetics
- forever incremental backup chain on refs : can get slow over time, and will stay that way no matter what

are these assumptions correct ?
antipolis
Enthusiast
 
Posts: 63
Liked: 8 times
Joined: Wed Oct 26, 2016 9:17 am

Re: REFS 4k horror story

by tsightler » Tue Sep 26, 2017 4:42 pm

antipolis wrote: if synthetics do not make any sense with ReFS, and even potentially bring instability, why would we need ReFS in the first place ? just to speed up merges ? I don't see that point justifying refs block cloning by itself (but I guess it depends on deployments and environments)

Well yes, to me speeding up merges was the number one benefit ReFS provided, as this was one of the most difficult design issues for Veeam at scale. I don't know how many VMs you have, but I work with customers that have 1000's of VMs and sometimes PBs of data, and merges are a big deal at that scale. All you have to do is search for "slow merge" on the forum to see that it was a number one pain point for Veeam for many years.

However, we're not saying synthetics make no sense ever. We're saying that if you are using synthetics in cases where they offer little to no benefit (I'll say, any retention < 100 points, although it's hard to set an exact line in the sand), then our suggestion, at least until the performance/stability issues are worked out, is to not use synthetics in those cases. The point was that we were seeing lots of people that have 30 day or 45 day retention using synthetics, and they just don't have much advantage in that scenario, so don't use them and you'll likely have a better experience with ReFS. It's really that simple.

antipolis wrote:for backup copy I guess the same logic applies (when not using GFS) : rebuilds = useless with reFS ?

I'm not sure what you mean by rebuilds, but I strongly suggest that you enable health checks and defrag/compacts even on ReFS, for any forever backup chain, whether primary backup jobs or backup copy.

antipolis wrote:and for GFS would you suggest it's better to have a huge retention on the job (6 mo-1yr) instead ? (granted of course that one can afford to store that time worth of daily incrementals...)

You are likely to find it to be more stable, but I understand that one of the highly desired use cases for ReFS is longer-term GFS retention. If you want GFS retention with ReFS then we have to hope Microsoft eventually gets to the bottom of this issue, but you're probably OK for now.

antipolis wrote:another way to see it (from my current understanding) :

- forever incremental backup chain on ntfs : can get slow over time, possibility to optimize with synthetics
- forever incremental backup chain on refs : can get slow over time, and will stay that way no matter what

are these assumptions correct ?

I wouldn't state it this way. Once again, there's no recommendation here that, if you have a year of retention, you shouldn't use GFS and synthetics on ReFS. The recommendation to not use synthetics is only for cases where the synthetic offers little to no benefit when used on ReFS. If I'm only keeping 45 days of restore points, you'd be hard pressed to come up with a measurable benefit of synthetic fulls on ReFS. But if I have a requirement to keep, say, 6 weeklies and 12 monthlies, then sure, GFS and synthetics make sense, but you may be more likely to hit some of the ReFS performance issues. Hopefully Microsoft will eventually get to the bottom of those and then it won't matter so much.
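Put as a rule of thumb, the suggestion could be sketched like this in Python. This is only an illustration of the reasoning in this thread: the function name `suggest_mode`, its return strings, and the hard cutoff are hypothetical, and the 100-point figure is the rough "line in the sand" mentioned above, not an official limit.

```python
# Rough threshold from the discussion above; explicitly not an exact line.
ROUGH_POINT_THRESHOLD = 100


def suggest_mode(restore_points: int, needs_gfs: bool = False) -> str:
    """Suggest a backup mode on ReFS, following the thread's rule of thumb."""
    if needs_gfs:
        # GFS retention (e.g. Backup Copy with GFS) requires synthetic fulls.
        return "synthetic"
    if restore_points < ROUGH_POINT_THRESHOLD:
        # Short retention: synthetics offer little to no benefit on ReFS.
        return "forever-forward"
    # Long retention without GFS: no recommendation against either mode.
    return "either"


print(suggest_mode(45))                  # short retention
print(suggest_mode(45, needs_gfs=True))  # GFS required
print(suggest_mode(365))                 # a year of points
```

A 45-point job without GFS comes out as forever forward, while anything that needs GFS still ends up on synthetics, with the caveat that it may be more likely to hit the ReFS issues discussed here.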
tsightler
Veeam Software
 
Posts: 4897
Liked: 1836 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: REFS 4k horror story

by nmdange » Tue Sep 26, 2017 7:33 pm

Good to know the number of restore points is not as much of an issue as you'd think! I'm hoping being on the new 1709/Semi-Annual Channel build will prove to be better with ReFS than 2016 has been so far.
nmdange
Expert
 
Posts: 235
Liked: 60 times
Joined: Thu Aug 20, 2015 9:30 pm

Re: REFS 4k horror story

by suprnova » Tue Sep 26, 2017 9:01 pm

The reason we are using ReFS for synthetic fulls is because incremental merges on NTFS were extremely slow (24-70 hours). Also we even have instability with fast cloning merges with ReFS.
suprnova
Service Provider
 
Posts: 21
Liked: never
Joined: Fri Apr 08, 2016 5:15 pm

Re: REFS 4k horror story

by tsightler » Tue Sep 26, 2017 9:27 pm

suprnova wrote:The reason we are using ReFS for synthetic fulls is because incremental merges on NTFS were extremely slow (24-70 hours). Also we even have instability with fast cloning merges with ReFS.


It's certainly possible that you can still have problems even without synthetic fulls. The recommendations are based on the fact that we see far fewer problems from people who do not use synthetic fulls vs. those that do, but fewer <> none.

When I first started testing ReFS late last year, it was pretty easy to crash my home lab environment every 3 weeks with just a very basic setup, and I was able to crash the S3260 in the lab nightly. With current patches from July, and registry key tweaks, neither of these environments has experienced any problems, performance or stability wise, in the last 2 months. Every customer that I've been working with has been able to be made stable with the beta drivers and/or the July updates, assuming registry keys were set and other minimum requirements were met (proper memory, proper task limits, etc). Most customers that are doing GFS points also seem to become stable, or at least usable, but, just like in this thread, there are some with performance or stability issues even after all the tweaks, although many have some other questionable factors.

Anyone on this thread who would like to become part of the ongoing work: I would love to look at your environment, make sure we have full details of your setup, and continue to track your progress and ours. Feel free to PM me your email address and I'll reach out to you. I think all of us hope that Microsoft eventually just gets this thing right but, for right now, I think it's useful to collect information, compare environments that work with those that don't, and share that information when it seems useful for others.
tsightler
Veeam Software
 
Posts: 4897
Liked: 1836 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: REFS 4k horror story

by antipolis » Wed Sep 27, 2017 8:18 am

tsightler wrote:I'm not sure what you mean by rebuilds, but I strongly suggest that you enable health checks and defrag/compacts even on ReFS, for any forever backup chain, whether primary backup jobs or backup copy.


I was referring to full backup file maintenance

The description says "Use these settings to cleanup, defragment and compact full backup file periodically when the job schedule does not include periodic fulls". Having synthetics enabled on backup jobs, and GFS on backup copy jobs, I disabled this... so enabling full backup file maintenance will not bring the same downsides as synthetics on refs ? I mean this still uses block cloning right ?

health checks I left enabled of course
antipolis
Enthusiast
 
Posts: 63
Liked: 8 times
Joined: Wed Oct 26, 2016 9:17 am

Re: REFS 4k horror story

by tsightler » Wed Sep 27, 2017 5:49 pm 1 person likes this post

antipolis wrote:I was referring to full backup file maintenance

The description says "Use these settings to cleanup, defragment and compact full backup file periodically when the job schedule does not include periodic fulls". Having synthetics enabled on backup jobs, and GFS on backup copy jobs, I disabled this... so enabling full backup file maintenance will not bring the same downsides as synthetics on refs ? I mean this still uses block cloning right ?

Got it. So of course, the compact/defragment option does use block clone during the process, but the old file is deleted immediately after the compact/defragment is finished, so all the reference counts go back to one. Deletes are definitely one case that can trigger the behavior, but deleting a file whose blocks are referenced twice doesn't seem to be as bad as the purge of many different VBK files in one fast action, each of which references blocks shared many times. So far, I haven't seen file compact/defrag cause an issue, and I've had customers leave it on because it is needed to free up unused blocks over time.
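To put rough numbers on that difference, here is a small Python sketch. It is simplified on purpose: the helper `purge_cost` is hypothetical, it assumes every block of every deleted file is shared the same number of times, and the block counts are invented for illustration rather than taken from a real repository.

```python
def purge_cost(files_deleted: int, blocks_per_file: int, refs_before: int):
    """Return (reference updates issued, references left per block).

    Assumes each deleted file references every block once and that all
    blocks start with the same reference count (a simplification).
    """
    if files_deleted > refs_before:
        raise ValueError("cannot delete more references than exist")
    updates = files_deleted * blocks_per_file
    refs_after = refs_before - files_deleted
    return updates, refs_after


# Compact/defrag: blocks are shared by the old and the new file (2 refs),
# then the single old file is deleted.
compact = purge_cost(files_deleted=1, blocks_per_file=20_000, refs_before=2)

# Retention purge: several VBK files sharing blocks 6 ways go at once.
purge = purge_cost(files_deleted=4, blocks_per_file=20_000, refs_before=6)

print("compact/defrag:", compact)
print("multi-VBK purge:", purge)
```

In this simplified picture the compact/defrag delete is one pass that takes each block from 2 references back to 1, while the multi-file purge issues several times as many updates against blocks that remain shared afterwards.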
tsightler
Veeam Software
 
Posts: 4897
Liked: 1836 times
Joined: Fri Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: REFS 4k horror story

by antipolis » Thu Sep 28, 2017 12:21 pm

tsightler wrote: Got it. So of course, the compact/defragment option does use block clone during the process, but the old file is deleted immediately after the compact/defragment is finished, so all the reference counts go back to one. Deletes are definitely one case that can trigger the behavior, but deleting a file whose blocks are referenced twice doesn't seem to be as bad as the purge of many different VBK files in one fast action, each of which references blocks shared many times. So far, I haven't seen file compact/defrag cause an issue, and I've had customers leave it on because it is needed to free up unused blocks over time.

Thank you for your clarifications, this is much appreciated
antipolis
Enthusiast
 
Posts: 63
Liked: 8 times
Joined: Wed Oct 26, 2016 9:17 am

Re: REFS 4k horror story

by Mgamerz » Fri Sep 29, 2017 8:12 pm

Can someone post the beta drivers or info on them? While this thread has lots of valuable info, I see many references to beta drivers but the OP doesn't have any useful info in it and I don't really want to search through hundreds of posts to try and find them...
Mgamerz
Lurker
 
Posts: 1
Liked: never
Joined: Fri Sep 29, 2017 8:07 pm

Re: REFS 4k horror story

by DaveWatkins » Sun Oct 01, 2017 6:59 pm

As far as I'm aware, the drivers referred to here as beta drivers were released in the August Cumulative Update. All the registry entries posted should work on an up-to-date machine.
DaveWatkins
Expert
 
Posts: 277
Liked: 70 times
Joined: Sun Dec 13, 2015 11:33 pm

Re: REFS 4k horror story

by Cicadymn » Thu Oct 05, 2017 2:54 pm

I spoke with Microsoft's ReFS team the other day. Last update was that they were back porting the fix that they believe will solve the issues. Originally they expected it to be out sometime in August or September, however, that wasn't the case. I got the following response from them:

No, this fix wasn’t ported yet. As this isn’t trivial change we want to be sure there is no regression before we release it.

So it sounds like they're really trying to cross their T's and dot their I's. Fingers crossed this will get us to some semblance of normality on ReFS!
Cicadymn
Influencer
 
Posts: 21
Liked: 5 times
Joined: Mon Jan 30, 2017 7:42 pm
Full Name: Sam

Re: REFS 4k horror story

by nmdange » Fri Oct 06, 2017 12:03 pm

The phrase "backported" says to me the fix is part of the next build (1709) which will be out in the next few weeks. I definitely plan on testing backups on the new build, but I haven't really had the same issues other ppl have had. Anyone who's been having problems willing to try out the new release?
nmdange
Expert
 
Posts: 235
Liked: 60 times
Joined: Thu Aug 20, 2015 9:30 pm

Re: REFS 4k horror story

by DaveWatkins » Sat Oct 07, 2017 7:01 pm

On a possibly related note, can anyone confirm this?

ReFS is not supported on SAN-attached storage.


From here https://docs.microsoft.com/en-us/window ... s-overview

Is anyone else running on iSCSI or FC attached disks?
DaveWatkins
Expert
 
Posts: 277
Liked: 70 times
Joined: Sun Dec 13, 2015 11:33 pm

Re: REFS 4k horror story

by mkretzer » Sat Oct 07, 2017 8:02 pm

We had it on FC. Does anyone have the same performance issues on non-iSCSI and non-FC disks?
mkretzer
Expert
 
Posts: 367
Liked: 76 times
Joined: Thu Dec 17, 2015 7:17 am
