JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

https://www.veeambp.com/repository_serv ... ing_sizing
It still says so there...
But OK, good to know it's no longer relevant.
Veeam Certified Engineer

tsightler
VP, Product Management
Posts: 5731
Liked: 2575 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Windows 2019, large REFS and deletes

Post by tsightler » 1 person likes this post

Yep, the 1GB of RAM for every 1TB of data on ReFS is a best practice that came directly from our early work with ReFS deployments in the field. It was clearly observed that customers with less than this amount experienced significantly more performance and stability issues vs those that had large servers with lots of RAM. From a field perspective, we want our customers to have the very best experience possible, so if we observe configuration X has far higher success rate than configuration Y, then configuration X will become best practice.

It was indeed a workaround, based on the issues that ReFS experienced with kernel memory. It's very similar to the recommendation to use forever forward vs synthetic fulls on ReFS where possible, because synthetic fulls put dramatically higher load on the ReFS filesystem and significantly increase the odds of experiencing problems.

As Gostev noted, the memory issue was largely mitigated with patches to Windows 2016 release last year, and, at least internally, we've stopped making 1GB per 1TB a hard recommendation. However, one of the interesting challenges with field best practices is that, once you make one, everybody does it that way, so it's hard to undo it, even if factors change that may make the old practice unnecessary. Most customers, given the choice of "we know X works best in the past, and continues to work fine, but Y should work fine now too" will still choose X.
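The sizing rule of thumb discussed above can be sketched in a few lines. This is purely illustrative arithmetic; the function name and the round-up policy are my own assumptions, not anything published by Veeam or Microsoft:

```python
import math

def recommended_ram_gb(refs_data_tb: float, gb_per_tb: float = 1.0) -> int:
    """Suggested repository RAM in GB under the old 1 GB per 1 TB
    field best practice, rounded up to whole gigabytes."""
    return math.ceil(refs_data_tb * gb_per_tb)

print(recommended_ram_gb(54))    # 54 TB volume -> 54 GB RAM
print(recommended_ram_gb(75.5))  # fractional sizes round up -> 76 GB
```

As the post explains, this is no longer a hard recommendation since the Windows Server 2016 kernel memory patches, but it remains a common deployment baseline.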

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

Thanks! Great info and i understand the reasoning.
Veeam Certified Engineer

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

Can I ask, from users' experience with 2019 ReFS, whether they notice a performance difference between 2016 and 2019 for normal I/O operations (not just deletions on large volumes)?

I rebuilt one of our Veeam repositories with 2019/ReFS, leaving all the others on 2016/ReFS. Both are fully patched, on the same hardware (54 TB RAID 6).
I noticed file operations were taking a lot longer on the 2019 server, so I ran diskspd on two of the servers.

The Total IO for 2016 was 373 MB/s
The Total IO for 2019 was 10 MB/s

There was a running Backup Copy Job on the 2019 server, which would account for some of the performance difference - but certainly not that amount.
I had a similar performance degradation earlier in the year (case ref #03336103), but that turned out to be RAID settings within Cisco UCS.

Do the performance fixes mentioned in this thread apply to normal day-to-day operations - or just deletions on large volumes?
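For anyone wanting to run a comparable test, a diskspd invocation along these lines works for comparing repository servers. The path, file size, and workload mix below are examples only, not the exact parameters used in the tests above:

```shell
# Example diskspd run against a repository volume (adjust to your environment):
#   -d60    run for 60 seconds
#   -b512K  512 KB blocks, roughly backup-sized sequential I/O
#   -t4     4 worker threads
#   -o8     8 outstanding I/Os per thread
#   -w50    50% writes
#   -Sh     disable software and hardware caching
#   -c10G   create a 10 GB test file
diskspd.exe -d60 -b512K -t4 -o8 -w50 -Sh -c10G E:\diskspd-test.dat
```

Running the same command on both the 2016 and 2019 servers, against otherwise idle volumes, gives a like-for-like comparison.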

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt » 1 person likes this post

No, I don't see a difference in performance for normal I/O operations. I also test my repositories with diskspd before deploying, and I've always been happy with the results, even on 2019 (hundreds of MB/s, depending on disk config). The numbers you get on 2019 seem to indicate there's definitely something wrong with the drivers, RAID controller config, caching, etc.
Veeam Certified Engineer

Markus M.
Novice
Posts: 3
Liked: never
Joined: Dec 09, 2019 5:41 pm
Contact:

Re: Windows 2019, large REFS and deletes

Post by Markus M. »

We are running a relatively small Veeam environment (single server, approx. 75 TB storage) that was initially installed with WS2016.

Because I needed SureBackup and the VMs are now ConfigVersion 9, I decided to upgrade the server to WS2019 (1809 LTSC).
The SOBR has 2 extents provided by Storage Spaces, and after the upgrade everything seemed fine; backup performance was more or less equal.

But when running a restore from tape (LTO7) back into one of the extents, I discovered a dramatic decrease in throughput: with WS2016 it was approx. 250 MB/s, now it's just 80 MB/s.
I tried everything I know for this issue: baseline updates, firmware on the HBA, disks, library, and LTO drive, with no success.
Then I checked Windows updates, but all involved systems are current. I compared the performance of a tape-to-repo restore vs. an SMB copy from the tape server to the repo.
After that, I opened a case with Veeam (03898998) and checked a couple of settings inside Veeam with the engineers there; no luck.
When copying a 92 GB VBK via SMB (PS: Copy-Item) and observing network throughput, I found that it "ripples" between 5 Gb/s and almost zero the whole time (like a portrait of the Swiss Alps :-)
Finally I found a warning in the SMB server event log whenever the copy operation was "stalled": "Event 1020 - File system operation has taken longer than expected. The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB."
This finally led me to this post here. I set the TRIM option as described earlier: "fsutil behavior set DisableDeleteNotify ReFS 1", but so far I am still experiencing the reported performance issues, too. For now, I can use another SMB repo on WS2012 R2 with ReFS for tape restores with no performance issues.
The refs.sys version on the B&R server is 10.0.17763.831, which is supposed to be the latest for the LTSC version.
So for now I'll stay tuned to this post, hoping somebody will report a final fix for this issue!
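For anyone trying the same workaround, the setting can be queried before and after changing it. Note the warning from Veeam support later in this thread about the side effects of disabling delete notifications on storage that doesn't perform its own garbage collection:

```shell
# Query the current TRIM/unmap (delete notification) setting for ReFS
fsutil behavior query DisableDeleteNotify ReFS

# Disable delete notifications for ReFS, as described in this thread
# (1 = disabled; set back to 0 to re-enable)
fsutil behavior set DisableDeleteNotify ReFS 1
```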

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Indeed, as noted above it is best to avoid Windows Server 2019 LTSC at the moment... either remain on Server 2016, or use 1903/1909 SAC builds of Server 2019. Thanks!

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

It's just odd that it seems to have got so dramatically worse, so suddenly.
I'm currently on day three of a 'fast clone' incremental merge on 2019.

I'll look through the logs, for the event mentioned above.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

ferrus wrote: Dec 09, 2019 9:20 pm It's just odd that it seems to have got so dramatically worse, so suddenly.
Sounds like a typical major release? ;)

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

Sounds like a typical major release? ;)
Shhhhh .... Major Veeam release coming soon. Don't jinx it! :lol:

I presume downgrading to 2016 while keeping the same 2019-formatted ReFS volume isn't supported?
Or is the file system the same, with just the ReFS driver changing?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Good question. If the ReFS format had changed between 2016 and 2019, then I would expect the upgrade process to require a re-format of existing volumes (or some process similar to a VMFS upgrade). But this is definitely not the case, meaning it is safe to assume that even if the ReFS version was incremented in 2019, it applies only to volumes newly provisioned under Server 2019.

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben » 1 person likes this post

I haven't been able to successfully bring original 2016 volumes online again on a 2016 server after they have been online on 1903. Might be a glitch, but we ended up staying on 1903.

1903 does run brilliantly though, even with trim/unmap enabled. We haven't tested 1909 yet since it is not supported by Backup & Replication.

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

Brilliantly indeed! 1903 saved our whole REFS project!

I just hope 1909 gets supported fast. Have there been any tests on Veeam's side already?

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben »

Are you aware of any further ReFS improvements in 1909? We did some testing with the v10 beta. It works, but I didn’t see any noticeable difference in performance or memory consumption...

Torrey.bley
Veeam Software
Posts: 8
Liked: 2 times
Joined: Dec 17, 2014 11:12 pm
Full Name: Torrey Bley
Contact:

Re: Windows 2019, large REFS and deletes

Post by Torrey.bley »

Andrew@MSFT wrote: Nov 22, 2019 10:49 pm DISCLAIMER: I work for Microsoft as a Program Manager on the Storage and File Systems Team – specifically the Resilient File System (ReFS).

First, wanted to give my sincerest THANK YOU! for choosing ReFS with Veeam as your preferred platform. Microsoft has worked directly with Veeam since their integration with ReFS Block Cloning technology to ensure your data integrity is of top priority. Our goal is to make the most performant, space efficient, reliable solution for our customers.

Can you explain the issue?

Veeam uses ReFS block cloning functionality to make backups reliable, fast and efficient. ReFS Block Cloning involves maintaining a reference count of each allocated block. Sometimes, performance can be affected when a system has a large number of cloned files and is doing large numbers of deletes, overwrites, etc. The more frequently your data is changing, and the more data you have, the larger the reference table. This tracking ensures your data remains consistent, available, and correct.

What is Microsoft doing about it?

Microsoft recognizes the issue and has invested in new optimizations for block cloning. These changes make cloning faster and more efficient. We are considering multiple options to get these optimizations to our customers. I will post again in January 2020 when I have more details.

What can I do now if I am experiencing this issue?

Ensure Trim is disabled "fsutil behavior set DisableDeleteNotify ReFS 1"
Create smaller volumes. This can help with the amount of data churn.
Engage with Microsoft product support. By opening a support case, you get a dedicated resource to help with your specific needs.

I am with Veeam support, and I am working with a customer on this issue. They have had experience in the past with the step "fsutil behavior set DisableDeleteNotify ReFS 1" mentioned above and asked that I pass along a warning. Setting that might be a bad idea: it might improve performance in the short term, but in some circumstances (like local HDD storage, or any other storage that doesn't perform its own garbage collection/unmap) it will keep deleted blocks from being returned to free space. Eventually all space will be filled even though it isn't really used, and disk space will be exhausted. The only fix is to format the volume, as no amount of deleting will fix it, and reverting the setting won't reclaim the space from already-deleted files. That space becomes permanently unavailable.

They worked with MS for months trying to find a solution, but in the end, formatting the repositories and starting over was the only choice. They didn’t notice the problem until free space was almost gone and then it was too late.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

poulpreben wrote: Dec 10, 2019 5:19 pm Are you aware of any further ReFS improvements in 1909? We did some testing with the v10 beta. It works, but I didn’t see any noticeable difference in performance or memory consumption...
Yes, I am aware, and they are very significant... some serious NDA stuff under the hood! However, if you want to observe the resulting performance improvements over 1903, you would have to use a very large volume and create a lot of churn (so that there are a lot of cloning operations).

There's also one trick you can use: compare 1903 to 1909 on ReFS volumes with a 4K cluster size, which creates an order of magnitude more cloning operations and metadata for the ReFS driver to deal with. Just to be clear: 64K clusters remain the recommendation for production ReFS repository deployments! The suggestion to use 4K clusters is specifically to make life much harder for ReFS by increasing the number of blocks in action 16x without changing anything else. Such a test should make the 1909 benefits over 1903 much more visible.
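The 16x figure follows directly from the cluster arithmetic. A quick sketch, where the 92 GB file size is borrowed from an earlier post in this thread and the helper function is purely illustrative:

```python
def cluster_count(file_size_bytes: int, cluster_size_bytes: int) -> int:
    """Number of clusters a file occupies (ceiling division:
    a partially used cluster still counts as a whole one)."""
    return -(-file_size_bytes // cluster_size_bytes)

vbk = 92 * 1024**3                       # a 92 GB backup file
at_64k = cluster_count(vbk, 64 * 1024)   # recommended 64K clusters
at_4k = cluster_count(vbk, 4 * 1024)     # stress-test 4K clusters
print(at_4k // at_64k)                   # -> 16
```

Every block-clone operation has 16x as many clusters to reference, so the ReFS metadata tables grow accordingly.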

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

In other words, larger volumes (like our initial 600 TB repo) should be less problematic with 1909, from what I understand :-)

rhys.hammond
Veeam Vanguard
Posts: 64
Liked: 15 times
Joined: Apr 07, 2013 10:36 pm
Full Name: Rhys Hammond
Location: Brisbane , Australia
Contact:

Re: Windows 2019, large REFS and deletes

Post by rhys.hammond »

Update on our 1809 ReFS woes: we managed to piece together some temporary storage to add to the SOBR in order to evacuate some backups off the 1809 ReFS extent.
Unfortunately, we didn't manage to piece together enough temporary storage, which meant we couldn't evacuate all backup data.

The ReFS performance, or lack thereof, continued to cause headaches during the remediation work: whenever a backup file was evacuated or a VBK was offloaded, performance would again fall off a cliff.
At that point we could either let it run severely degraded (50 MB/s) or restart the repo server, losing progress on any incomplete offloads/evacuations but bringing performance back up to 2-3 GB/s.

Once the 2016 ReFS repo is up and running, I'll provide an update after a few weeks.

Note: I have destroyed the 1809 ReFS volume and will be recreating it from scratch for 2016.

Cheers
Veeam Certified Architect | Author of http://rhyshammond.com | Veeam Vanguard | vExpert

fsr
Enthusiast
Posts: 28
Liked: 1 time
Joined: Mar 27, 2019 5:28 pm
Full Name: Fernando Rapetti
Contact:

Re: Windows 2019, large REFS and deletes

Post by fsr »

Gostev wrote: Dec 11, 2019 12:02 am Yes, I am aware - and they are very significant... some serious NDA stuff there under the hood! However, if you want to observe resulting performance improvements over 1903, you would have to use a very large volume, and create a lot of churn (so that there are a lot of cloning operations).

There's also one trick you can use: compare 1903 to 1909 on ReFS volumes with 4K cluster size, which creates by an order of magnitude more cloning operations and metadata for ReFS driver to deal with. Just to be clear: 64K clusters remain the recommendation for production ReFS repository deployments! The suggestion to use 4K clusters is specifically to make life much harder for ReFS by increasing the number of blocks in action 16x without changing anything else. Such test should make 1909 benefits over 1903 much more visible.
It makes you wonder whether Microsoft couldn't just add the option to set the cluster size to 128 KB, or even larger, as an aid for the versions with problems. And maybe not only for that. After all, it's not like that would waste any real disk space on a volume dedicated to very big files like backups and/or VMs, right?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev » 3 people like this post

Because it would be more of a temporary workaround than a real solution. Best compared to a painkiller injection to ease the life of a dying patient... a 2x or even 4x improvement over "bad" is still bad, especially considering that data footprint doubles every few years.

On the other hand, the architectural changes around ReFS metadata handling they've implemented in 1909 seem like the real deal, at least on paper. And if it works as advertised, it should give ReFS a nice scalability headroom for future growth.

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

@Gostev When will we be able to use 1909?
Do we really have to wait for V10 for that?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Yes. To start testing currently shipping versions against 1909 would require taking QC off of v10, thus delaying it.

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

Any news on the backport of the ReFS fixes to 1809 LTSC?
Veeam Certified Engineer

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

See the post from ReFS PM above, he provided timelines for the next update.

GACcc
Novice
Posts: 8
Liked: 1 time
Joined: Jun 06, 2018 1:06 pm
Contact:

Re: Windows 2019, large REFS and deletes

Post by GACcc » 1 person likes this post

So, just to give you guys some feedback:
We just formatted our whole NetApp storage and connected it to a newly installed Core Server 1909 (as a repo), and everything is working fine, just as it was before all this.

rhys.hammond
Veeam Vanguard
Posts: 64
Liked: 15 times
Joined: Apr 07, 2013 10:36 pm
Full Name: Rhys Hammond
Location: Brisbane , Australia
Contact:

Re: Windows 2019, large REFS and deletes

Post by rhys.hammond » 1 person likes this post

Quick update on the downgrade from 1809 back to 2016. Previously the job was taking multiple days to create synthetic fulls on 1809; after installing 2016, the very same job took just 50 minutes and 44 seconds... happy days.
Veeam Certified Architect | Author of http://rhyshammond.com | Veeam Vanguard | vExpert

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben »

We upgraded a 400 TB repository (4x 100 TB volumes) from 1809 to 1903. After a while it started behaving exactly like 1809. Merges took too long and dumping the VeeamAgent process revealed that it was indeed waiting for ReFS.

The recommendation from Veeam Support was to disable block cloning via a registry key, but this being a ~1,500 VM environment it would also impact the primary jobs which were running fine.

The only difference between the primary and the secondary SOBR, was that the primary SOBR contained 2x 200 TB extents across two servers, while the secondary was 4x 100 TB extents on a single server.

Instead of continuing to troubleshoot the issue with support (and because I had to spend time with my family during the holidays), we had to close the missed SLAs by splitting the 4x 100 TB across two servers instead. Since doing that, everything has been running fine. I’m still not excited about the merge times, but at least we’re not missing any SLAs.

hunterisageek
Lurker
Posts: 1
Liked: never
Joined: Jan 06, 2020 7:41 pm
Full Name: Hunter Kaemmerling
Contact:

Re: Windows 2019, large REFS and deletes

Post by hunterisageek »

I know it's early January, but we have been fighting this since about October on our 2019 ReFS repos.

Is there going to be a fix for Server 2019 (1809)? Or should I try to figure out a way to go back to 2016 ReFS?

We have 4x S3260s (2 at each site); each repo has 28x 8 TB drives in a RAID 60. Each server has it carved into 3x 55 TB volumes (mostly because Windows ReFS dedup only supports volumes up to 64 TB).
Merges can take anywhere from 24 hours, and we had to kill everything a few weeks ago at 100+ hours on some of the BCJs.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Going back to Server 2016 will provide guaranteed results, so it seems like the safer bet to me. Otherwise, even when the fix is finally available (there's no specific timeline yet), you'll still be dependent on its first version working as it's supposed to right away... which, in my experience, is not always the case.

mdxyz
Service Provider
Posts: 18
Liked: 1 time
Joined: Jan 05, 2018 3:19 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mdxyz »

If there's a Server 2019-created ReFS volume, can it safely be attached to a Server 2016 system (i.e., do we need to format and start over)?
