JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

https://www.veeambp.com/repository_serv ... ing_sizing
It still says so there...
But OK, good to know it's no longer relevant.
Veeam Certified Engineer

tsightler
VP, Product Management
Posts: 5731
Liked: 2575 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Windows 2019, large REFS and deletes

Post by tsightler » 1 person likes this post

Yep, the 1GB of RAM for every 1TB of data on ReFS is a best practice that came directly from our early work with ReFS deployments in the field. It was clearly observed that customers with less than this amount experienced significantly more performance and stability issues vs those that had large servers with lots of RAM. From a field perspective, we want our customers to have the very best experience possible, so if we observe configuration X has far higher success rate than configuration Y, then configuration X will become best practice.

It was indeed a workaround, based on the issues that ReFS experienced with kernel memory. It's very similar to the recommendation to use forever forward vs synthetic fulls on ReFS where possible, because synthetic fulls put dramatically higher load on the ReFS filesystem and significantly increase the odds of experiencing problems.

As Gostev noted, the memory issue was largely mitigated with patches to Windows 2016 release last year, and, at least internally, we've stopped making 1GB per 1TB a hard recommendation. However, one of the interesting challenges with field best practices is that, once you make one, everybody does it that way, so it's hard to undo it, even if factors change that may make the old practice unnecessary. Most customers, given the choice of "we know X works best in the past, and continues to work fine, but Y should work fine now too" will still choose X.
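The sizing rule of thumb discussed above can be sketched in a few lines. This is purely illustrative arithmetic; the function name and the round-up policy are my own assumptions, not anything published by Veeam or Microsoft:

```python
import math

def recommended_ram_gb(refs_data_tb: float, gb_per_tb: float = 1.0) -> int:
    """Suggested repository RAM in GB under the old 1 GB per 1 TB
    field best practice, rounded up to whole gigabytes."""
    return math.ceil(refs_data_tb * gb_per_tb)

print(recommended_ram_gb(54))    # 54 TB volume -> 54 GB RAM
print(recommended_ram_gb(75.5))  # fractional sizes round up -> 76 GB
```

As the post explains, this is no longer a hard recommendation since the Windows Server 2016 kernel memory patches, but it remains a common deployment baseline.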

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

Thanks! Great info and i understand the reasoning.
Veeam Certified Engineer

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

Can I ask, from users' experience with 2019 ReFS, whether they notice a performance difference between 2016 and 2019 for normal I/O operations (not just deletions on large volumes)?

I rebuilt one of our Veeam repositories with 2019/ReFS, leaving all the others on 2016/ReFS. Both are fully patched, on the same hardware (54 TB RAID 6).
I noticed file operations were taking a lot longer on the 2019 server, so I ran diskspd on two of the servers.

The Total IO for 2016 was 373 MB/s
The Total IO for 2019 was 10 MB/s

There was a running Backup Copy Job on the 2019 server, which would account for some of the performance difference - but certainly not that amount.
I had a similar performance degradation earlier in the year (case ref #03336103), but that turned out to be RAID settings within Cisco UCS.

Do the performance fixes mentioned in this thread apply to normal day-to-day operations - or just deletions on large volumes?
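For anyone wanting to run a comparable test, a diskspd invocation along these lines works for comparing repository servers. The path, file size, and workload mix below are examples only, not the exact parameters used in the tests above:

```shell
# Example diskspd run against a repository volume (adjust to your environment):
#   -d60    run for 60 seconds
#   -b512K  512 KB blocks, roughly backup-sized sequential I/O
#   -t4     4 worker threads
#   -o8     8 outstanding I/Os per thread
#   -w50    50% writes
#   -Sh     disable software and hardware caching
#   -c10G   create a 10 GB test file
diskspd.exe -d60 -b512K -t4 -o8 -w50 -Sh -c10G E:\diskspd-test.dat
```

Running the same command on both the 2016 and 2019 servers, against otherwise idle volumes, gives a like-for-like comparison.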

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt » 1 person likes this post

No, I don't see a difference in performance for normal I/O operations. I also test my repositories with diskspd before deploying, and I've always been happy with the results, even on 2019 (hundreds of MB/s, depending on disk config). The numbers you get on 2019 seem to indicate there's definitely something wrong with the drivers, RAID controller config, caching, etc.
Veeam Certified Engineer

Markus M.
Novice
Posts: 3
Liked: never
Joined: Dec 09, 2019 5:41 pm
Contact:

Re: Windows 2019, large REFS and deletes

Post by Markus M. »

We are running a relatively small Veeam environment (single server, approx. 75 TB storage) that was initially installed with WS2016.

Because I needed SureBackup and the VMs are now ConfigVersion 9, I decided to upgrade the server to WS2019 (1809 LTSC).
The SOBR has 2 extents provided by Storage Spaces, and after the upgrade everything seemed fine; backup performance was more or less equal.

But when running a restore from tape (LTO7) back into one of the extents, I discovered a dramatic decrease in throughput: with WS2016 it was approx. 250 MB/s, now it's just 80 MB/s.
I tried everything I know for this issue: baseline updates, firmware on the HBA, disks, library, and LTO drive, with no success.
Then I checked Windows updates, but all involved systems are current. I compared the performance of a tape-to-repo restore vs. an SMB copy from the tape server to the repo.
After that, I opened a case with Veeam (03898998) and checked a couple of settings inside Veeam with the engineers there; no luck.
When copying a 92 GB VBK via SMB (PS: Copy-Item) and observing network throughput, I found that it "ripples" between 5 Gb/s and almost zero the whole time (like a portrait of the Swiss Alps :-)
Finally I found a warning in the SMB server event log whenever the copy operation was "stalled": "Event 1020 - File system operation has taken longer than expected. The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB."
This finally led me to this post here. I set the TRIM option as described earlier: "fsutil behavior set DisableDeleteNotify ReFS 1", but so far I am still experiencing the reported performance issues, too. For now, I can use another SMB repo on WS2012 R2 with ReFS for tape restores with no performance issues.
The refs.sys version on the B&R server is 10.0.17763.831, which is supposed to be the latest for the LTSC version.
So for now I'll stay tuned to this post, hoping somebody will report a final fix for this issue!
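For anyone trying the same workaround, the setting can be queried before and after changing it. Note the warning from Veeam support later in this thread about the side effects of disabling delete notifications on storage that doesn't perform its own garbage collection:

```shell
# Query the current TRIM/unmap (delete notification) setting for ReFS
fsutil behavior query DisableDeleteNotify ReFS

# Disable delete notifications for ReFS, as described in this thread
# (1 = disabled; set back to 0 to re-enable)
fsutil behavior set DisableDeleteNotify ReFS 1
```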

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Indeed, as noted above it is best to avoid Windows Server 2019 LTSC at the moment... either remain on Server 2016, or use 1903/1909 SAC builds of Server 2019. Thanks!

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

It's just odd that it seems to have got so dramatically worse, so suddenly.
I'm currently on day three of a 'fast clone' incremental merge on 2019.

I'll look through the logs, for the event mentioned above.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

ferrus wrote: Dec 09, 2019 9:20 pm It's just odd that it seems to have got so dramatically worse, so suddenly.
Sounds like a typical major release? ;)

ferrus
Veeam ProPartner
Posts: 270
Liked: 39 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Windows 2019, large REFS and deletes

Post by ferrus »

Sounds like a typical major release? ;)
Shhhhh .... Major Veeam release coming soon. Don't jinx it! :lol:

I presume downgrading to 2016 while keeping the same 2019-formatted ReFS volume isn't supported?
Or is the file system the same, with just the ReFS driver changing?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Good question. If the ReFS format had changed between 2016 and 2019, then I would expect the upgrade process to require a re-format of existing volumes (or some process similar to a VMFS upgrade). But this is definitely not the case, meaning it is safe to assume that even if the ReFS version was incremented in 2019, it applies only to volumes newly provisioned under Server 2019.

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben » 1 person likes this post

I haven't been able to successfully bring original 2016 volumes online again on a 2016 server after they have been online on 1903. Might be a glitch, but we ended up staying on 1903.

1903 does run brilliantly though, even with trim/unmap enabled. We haven't tested 1909 yet since it is not supported by Backup & Replication.

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

Brilliantly indeed! 1903 saved our whole REFS project!

I just hope 1909 gets supported fast. Have there been any tests on Veeam's side already?

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben »

Are you aware of any further ReFS improvements in 1909? We did some testing with the v10 beta. It works, but I didn’t see any noticeable difference in performance or memory consumption...

Torrey.bley
Veeam Software
Posts: 8
Liked: 2 times
Joined: Dec 17, 2014 11:12 pm
Full Name: Torrey Bley
Contact:

Re: Windows 2019, large REFS and deletes

Post by Torrey.bley »

Andrew@MSFT wrote: Nov 22, 2019 10:49 pm DISCLAIMER: I work for Microsoft as a Program Manager on the Storage and File Systems Team – specifically the Resilient File System (ReFS).

First, wanted to give my sincerest THANK YOU! for choosing ReFS with Veeam as your preferred platform. Microsoft has worked directly with Veeam since their integration with ReFS Block Cloning technology to ensure your data integrity is of top priority. Our goal is to make the most performant, space efficient, reliable solution for our customers.

Can you explain the issue?

Veeam uses ReFS block cloning functionality to make backups reliable, fast and efficient. ReFS Block Cloning involves maintaining a reference count of each allocated block. Sometimes, performance can be affected when a system has a large number of cloned files and is doing large numbers of deletes, overwrites, etc. The more frequently your data is changing, and the more data you have, the larger the reference table. This tracking ensures your data remains consistent, available, and correct.

What is Microsoft doing about it?

Microsoft recognizes the issue and has invested in new optimizations for block cloning. These changes make cloning faster and more efficient. We are considering multiple options to get these optimizations to our customers. I will post again in January 2020 when I have more details.

What can I do now if I am experiencing this issue?

Ensure Trim is disabled "fsutil behavior set DisableDeleteNotify ReFS 1"
Create smaller volumes. This can help with the amount of data churn.
Engage with Microsoft product support. By opening a support case, you get a dedicated resource to help with your specific needs.

I am with Veeam support, and I am working with a customer on this issue. They have had experience in the past with the step "fsutil behavior set DisableDeleteNotify ReFS 1" mentioned above and asked that I pass along a warning. Setting that might be a bad idea: it might improve performance in the short term, but in some circumstances (like local HDD storage, or any other storage that doesn't perform its own garbage collection/unmap) it will keep deleted blocks from being returned to free space. Eventually all space will be filled even though it isn't really used, and disk space will be exhausted. The only fix is to format the volume, as no amount of deleting will fix it, and reverting the setting won't reclaim the space from already-deleted files. That space becomes permanently unavailable.

They worked with MS for months trying to find a solution, but in the end, formatting the repositories and starting over was the only choice. They didn’t notice the problem until free space was almost gone and then it was too late.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

poulpreben wrote: Dec 10, 2019 5:19 pm Are you aware of any further ReFS improvements in 1909? We did some testing with the v10 beta. It works, but I didn’t see any noticeable difference in performance or memory consumption...
Yes, I am aware, and they are very significant... some serious NDA stuff under the hood! However, if you want to observe the resulting performance improvements over 1903, you would have to use a very large volume and create a lot of churn (so that there are a lot of cloning operations).

There's also one trick you can use: compare 1903 to 1909 on ReFS volumes with a 4K cluster size, which creates an order of magnitude more cloning operations and metadata for the ReFS driver to deal with. Just to be clear: 64K clusters remain the recommendation for production ReFS repository deployments! The suggestion to use 4K clusters is specifically to make life much harder for ReFS by increasing the number of blocks in action 16x without changing anything else. Such a test should make the 1909 benefits over 1903 much more visible.
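The 16x figure follows directly from the cluster arithmetic. A quick sketch, where the 92 GB file size is borrowed from an earlier post in this thread and the helper function is purely illustrative:

```python
def cluster_count(file_size_bytes: int, cluster_size_bytes: int) -> int:
    """Number of clusters a file occupies (ceiling division:
    a partially used cluster still counts as a whole one)."""
    return -(-file_size_bytes // cluster_size_bytes)

vbk = 92 * 1024**3                       # a 92 GB backup file
at_64k = cluster_count(vbk, 64 * 1024)   # recommended 64K clusters
at_4k = cluster_count(vbk, 4 * 1024)     # stress-test 4K clusters
print(at_4k // at_64k)                   # -> 16
```

Every block-clone operation has 16x as many clusters to reference, so the ReFS metadata tables grow accordingly.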

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

In other words, larger volumes (like our initial 600 TB repo) should be less problematic with 1909, from what I understand :-)

rhys.hammond
Veeam Vanguard
Posts: 64
Liked: 15 times
Joined: Apr 07, 2013 10:36 pm
Full Name: Rhys Hammond
Location: Brisbane , Australia
Contact:

Re: Windows 2019, large REFS and deletes

Post by rhys.hammond »

Update on our 1809 ReFS woes: we managed to piece together some temporary storage to add to the SOBR in order to evacuate some backups off the 1809 ReFS extent.
Unfortunately, we didn't manage to piece together enough temporary storage, which meant we couldn't evacuate all backup data.

The ReFS performance, or lack thereof, continued to cause headaches during the remediation work: whenever a backup file was evacuated or a VBK was offloaded, performance would again fall off a cliff.
At that point we could either let it run severely degraded (50 MB/s) or restart the repo server, losing progress on any incomplete offloads/evacuations but bringing performance back up to 2-3 GB/s.

Once the 2016 ReFS repo is up and running, I'll provide an update after a few weeks.

Note: I have destroyed the 1809 ReFS volume and will be recreating it from scratch for 2016.

Cheers
Veeam Certified Architect | Author of http://rhyshammond.com | Veeam Vanguard | vExpert

fsr
Enthusiast
Posts: 28
Liked: 1 time
Joined: Mar 27, 2019 5:28 pm
Full Name: Fernando Rapetti
Contact:

Re: Windows 2019, large REFS and deletes

Post by fsr »

Gostev wrote: Dec 11, 2019 12:02 am Yes, I am aware - and they are very significant... some serious NDA stuff there under the hood! However, if you want to observe resulting performance improvements over 1903, you would have to use a very large volume, and create a lot of churn (so that there are a lot of cloning operations).

There's also one trick you can use: compare 1903 to 1909 on ReFS volumes with 4K cluster size, which creates by an order of magnitude more cloning operations and metadata for ReFS driver to deal with. Just to be clear: 64K clusters remain the recommendation for production ReFS repository deployments! The suggestion to use 4K clusters is specifically to make life much harder for ReFS by increasing the number of blocks in action 16x without changing anything else. Such test should make 1909 benefits over 1903 much more visible.
It makes you wonder whether Microsoft couldn't just add the option to set the cluster size to 128 KB, or even larger, as an aid for the versions with problems. And maybe not only for that. After all, it's not like that would waste any real disk space on a volume dedicated to very big files like backups and/or VMs, right?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev » 3 people like this post

Because it would be more of a temporary workaround than a real solution. Best compared to a painkiller injection to ease the life of a dying patient... a 2x or even 4x improvement over "bad" is still bad, especially considering that data footprint doubles every few years.

On the other hand, the architectural changes around ReFS metadata handling they've implemented in 1909 seem like the real deal, at least on paper. And if it works as advertised, it should give ReFS a nice scalability headroom for future growth.

mkretzer
Expert
Posts: 692
Liked: 162 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mkretzer »

@Gostev When will we be able to use 1909?
Do we really have to wait for V10 for that?

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Yes. To start testing currently shipping versions against 1909 would require taking QC off of v10, thus delaying it.

JaySt
Service Provider
Posts: 213
Liked: 32 times
Joined: Jun 09, 2015 7:08 pm
Full Name: JaySt
Contact:

Re: Windows 2019, large REFS and deletes

Post by JaySt »

Any news on the backport of the ReFS fixes to 1809 LTSC?
Veeam Certified Engineer

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

See the post from ReFS PM above, he provided timelines for the next update.

GACcc
Novice
Posts: 8
Liked: 1 time
Joined: Jun 06, 2018 1:06 pm
Contact:

Re: Windows 2019, large REFS and deletes

Post by GACcc » 1 person likes this post

So, just to give you guys some feedback:
We just formatted our whole NetApp storage and connected it to a newly installed Core Server 1909 (as a repo), and everything is working fine, just as it was before all this.

rhys.hammond
Veeam Vanguard
Posts: 64
Liked: 15 times
Joined: Apr 07, 2013 10:36 pm
Full Name: Rhys Hammond
Location: Brisbane , Australia
Contact:

Re: Windows 2019, large REFS and deletes

Post by rhys.hammond » 1 person likes this post

Quick update on the downgrade from 1809 back to 2016. Previously the job was taking multiple days to create synthetic fulls on 1809; after installing 2016, the very same job took just 50 minutes and 44 seconds... happy days.
Veeam Certified Architect | Author of http://rhyshammond.com | Veeam Vanguard | vExpert

poulpreben
Veeam Vanguard
Posts: 1011
Liked: 438 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Windows 2019, large REFS and deletes

Post by poulpreben »

We upgraded a 400 TB repository (4x 100 TB volumes) from 1809 to 1903. After a while it started behaving exactly like 1809. Merges took too long and dumping the VeeamAgent process revealed that it was indeed waiting for ReFS.

The recommendation from Veeam Support was to disable block cloning via a registry key, but this being a ~1,500 VM environment it would also impact the primary jobs which were running fine.

The only difference between the primary and the secondary SOBR, was that the primary SOBR contained 2x 200 TB extents across two servers, while the secondary was 4x 100 TB extents on a single server.

Instead of continuing to troubleshoot the issue with support (and because I had to spend time with my family during the holidays), we had to close the missed SLAs by splitting the 4x 100 TB across two servers instead. Since doing that, everything has been running fine. I’m still not excited about the merge times, but at least we’re not missing any SLAs.

hunterisageek
Lurker
Posts: 1
Liked: never
Joined: Jan 06, 2020 7:41 pm
Full Name: Hunter Kaemmerling
Contact:

Re: Windows 2019, large REFS and deletes

Post by hunterisageek »

I know it's early January, but we have been fighting this since about October on our 2019 ReFS repos.

Is there going to be a fix for Server 2019 (1809)? Or should I try to figure out a way to go back to 2016 ReFS?

We have 4x S3260s (2 at each site); each repo has 28x 8 TB drives in a RAID 60. Each server has it carved into 3x 55 TB volumes (mostly because Windows ReFS dedup only supports volumes up to 64 TB).
Merges can take anywhere from 24 hours, and we had to kill everything a few weeks ago at 100+ hours on some of the BCJs.

Gostev
SVP, Product Management
Posts: 27454
Liked: 4558 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Windows 2019, large REFS and deletes

Post by Gostev »

Going back to Server 2016 will provide guaranteed results, so it seems like the safer bet to me. Otherwise, even when the fix is finally available (there's no specific timeline yet), you'll still be dependent on its first version working as it's supposed to right away... which, in my experience, is not always the case.

mdxyz
Service Provider
Posts: 18
Liked: 1 time
Joined: Jan 05, 2018 3:19 am
Contact:

Re: Windows 2019, large REFS and deletes

Post by mdxyz »

If there's a Server 2019-created ReFS volume, can it safely be attached to a Server 2016 system (i.e., do we need to format and start over)?
