Slow Tape Job Performance

Stephan23 · Post by **Stephan23** » Aug 06, 2019 8:50 am this post

For a long time now I suffer from a relatively poor tape job performance of about 150 MB/s. Because of the growths of the backup files the job now needs usually more than 40 hours to complete which gets to a point which is more and more disruptive.

After I wasn't able to find the clear cause of the issue I contacted the support, to help analyse it and/or point to the cause. Unfortunately this wasn't very helpful so far. To be honest, I'm pretty unsatisfied with the whole case (# 03666072). So I was hoping to get more information here, from people with similar experiences.

Tape Proxy:
Dell R620
2x E5-2630
192 GB RAM
8 Gb/s Fibre Channel

Tape Library:
Quantum Scalar i3
IBM Ultrium 8 HH
8 Gb/s Fibre Channel

Backup Storage:
NetApp E2860
20 x 4 TB (7200 rpm)
8 Gb/s Fibre Channel

Windows Server 2016
30 TB ReFS 64k repository

I'm using LTO 7 tapes formatted as type M. So I would expect a throughput of 300 MB/s, when the target is the bottleneck.
What I'm getting is an average throughput of 150 MB/s and source as bottleneck (~ 87%)

In contrary, there is one File to Tape job that backs up files from a NTFS partition that resides on the same storage system, which is always fast (300 MB/s, bottleneck: target).

During the case I was asked to perform a benchmark on the partition with the source jobs, which looks OK:

Code: Select all

>diskspd.exe -c1G -b512K -w0 -r4K -Sh -d600 H:\testfile.dat

Command Line: diskspd.exe -c1G -b512K -w0 -r4K -Sh -d600 H:\testfile.dat

Input parameters:

        timespan:   1
        -------------
        duration: 600s
        warm up time: 5s
        cool down time: 0s
        random seed: 0
        path: 'H:\testfile.dat'
                think time: 0ms
                burst size: 0
                software cache disabled
                hardware write cache disabled, writethrough on
                performing read test
                block size: 524288
                using random I/O (alignment: 4096)
                number of outstanding I/O operations: 2
                thread stride size: 0
                threads per file: 1
                using I/O Completion Ports
                IO priority: normal

System information:

        computer name: veeam-san
        start time: 2019/08/05 13:22:37 UTC

Results for timespan 1:
*******************************************************************************

actual test time:       600.00s
thread count:           1
proc count:             24

CPU |  Usage |  User  |  Kernel |  Idle
-------------------------------------------
   0|  14.64%|   0.51%|   14.14%|  85.36%
   1|   0.38%|   0.11%|    0.27%|  99.63%
   2|   0.48%|   0.07%|    0.41%|  99.52%
   3|   0.30%|   0.10%|    0.20%|  99.70%
   4|   0.41%|   0.05%|    0.35%|  99.59%
   5|   0.08%|   0.06%|    0.02%|  99.92%
   6|   0.63%|   0.13%|    0.51%|  99.37%
   7|   9.74%|   3.22%|    6.52%|  90.26%
   8|   0.85%|   0.18%|    0.67%|  99.15%
   9|   0.11%|   0.07%|    0.04%|  99.89%
  10|   0.30%|   0.10%|    0.20%|  99.70%
  11|   0.20%|   0.04%|    0.16%|  99.80%
  12|   1.06%|   0.12%|    0.93%|  98.94%
  13|   1.12%|   0.08%|    1.04%|  98.88%
  14|   0.33%|   0.10%|    0.23%|  99.67%
  15|   0.05%|   0.05%|    0.00%|  99.95%
  16|   2.70%|   1.82%|    0.88%|  97.30%
  17|   0.24%|   0.06%|    0.18%|  99.76%
  18|   0.25%|   0.09%|    0.16%|  99.75%
  19|   0.07%|   0.06%|    0.01%|  99.93%
  20|   0.07%|   0.05%|    0.02%|  99.93%
  21|   0.04%|   0.04%|    0.00%|  99.96%
  22|   0.05%|   0.03%|    0.02%|  99.95%
  23|   0.23%|   0.04%|    0.19%|  99.77%
-------------------------------------------
avg.|   1.43%|   0.30%|    1.13%|  98.57%

Total IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  file
------------------------------------------------------------------------------
     0 |    350717739008 |       668941 |     557.45 |    1114.90 | H:\testfile.dat (1024MiB)
------------------------------------------------------------------------------
total:      350717739008 |       668941 |     557.45 |    1114.90

Read IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  file
------------------------------------------------------------------------------
     0 |    350717739008 |       668941 |     557.45 |    1114.90 | H:\testfile.dat (1024MiB)
------------------------------------------------------------------------------
total:      350717739008 |       668941 |     557.45 |    1114.90

Write IO
thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  file
------------------------------------------------------------------------------
     0 |               0 |            0 |       0.00 |       0.00 | H:\testfile.dat (1024MiB)
------------------------------------------------------------------------------
total:                 0 |            0 |       0.00 |       0.00

After reading trough the forums and looking for performance issues I often found a statement about fragmentation of VBK files caused by fast clone, which fits my case very well. I also did some test jobs on newly created volume, new VBK and got good performance.

Is fragmentation the cause of poor performance in my case?

How am I able to confirm this?

What can be done to increase performance, while staying on ReFS? Active fulls are not an option for me.

Is the storage system not good enough, even with high fragmentation?

Would more (7k) disks on the back end increase performance?

What else could be the cause in my case?

Regards
Stephan

Post by **Dima P.** » Aug 06, 2019 6:30 pm this post

Hello Stephan,

Looks like the main issue is with backup files being dehydrated while sitting in ReFS. In order to write it to tape data block should be retrieved from the system and it takes time to actually get these data blocks from ReFS. Can you please clarify if disk jobs are configured to create periodic synthetic full backup or you synthesized full with tape job? Thank you!

Stephan23 · Post by **Stephan23** » Aug 07, 2019 8:35 am this post

Hello Dmitry,

not quite sure what dehydrated backup files means in that context.

Of the two relevant backup jobs one is creating a synthetic full every week, the other is configured as reverse incremental.
And the tape job does not archive any incremental backups and also does only process the latest backup chain.

Post by **Dima P.** » Aug 09, 2019 5:45 pm this post

Stephan,

Just to make sure we are on the same page can you please clarify if deduplication is enabled on your ReFS volume? Thank you in advance!

Stephan23 · Post by **Stephan23** » Aug 12, 2019 7:29 am this post

I was under the impression, that there is no dedup for ReFS on Server 2016? So no, it's not.

"inline data deduplication" in enabled for all backup jobs, if it's that what you mean.

notrootbeer · Post by **notrootbeer** » Oct 03, 2019 9:04 pm this post

Stephan23, were you able to resolve your issue? We're experiencing similar behavior and we currently have our case escalated with Veeam support but still haven't gotten very far yet. We're considering switching completely back to NTFS, though.

Post by **Dima P.** » Oct 04, 2019 11:23 am this post

Hello notrootbeer,

Can you please share your case ID? Thank you in advance!

notrootbeer · Post by **notrootbeer** » Oct 04, 2019 3:25 pm this post

Yes! It's 03721079

Post by **Dima P.** » Oct 09, 2019 11:03 am this post

Thanks! Case is being reviewed, so please keep working with our support team. Cheers!

Stephan23 · Post by **Stephan23** » Oct 09, 2019 2:59 pm this post

notrootbeer wrote: ↑Oct 03, 2019 9:04 pm Stephan23, were you able to resolve your issue? We're experiencing similar behavior and we currently have our case escalated with Veeam support but still haven't gotten very far yet. We're considering switching completely back to NTFS, though.

My case was also escalated, but unfortunately the issue was not resolved, but an explanation was given.
I hope it's OK to quote from the Technical Support:

The situation is unfortunately quite expected.
As you know, ReFS (when using Block Cloning) allows several files to share the same physical data. Instead of high-cost copy of real block, it just copies metadata and sets up references to physical regions. However to read the such file, it is required to read the address with the link and only then read the data from are where data is stored. More links are used, more effort is required to read the file.

So degradation of read performance seems to be an expected cost of fast merge operations. Backup to tape includes reading data from source, and, as your BTT logs say, source is always the most time-consuming operation. This is unfortunately well-known ReFS limitation which hardly could be overcome.

Switching to NTFS was an optioned mentioned to work around the issue.

I also expressed my displeasure that no information regarding performance degradation is mentioned in any documentation or KB article, especially if it is "quite expected" and a "well-known ReFS limitation".

However, the explanation confirmed what I already suspected and I was satisfied with it.
My plan is to experiment with Active Fulls as soon as our VPN connection to a mirror Veeam repository gets "upgraded" to mitigated the issue with fresh backup chains.
Depending on the outcome we might consider switching to NTFS as well.

Post by **Dima P.** » Oct 10, 2019 9:07 am this post

Hello Stephan,

Thank you for honest feedback. We will discuss the investigation results with our support team and technical writers to make the corresponding adjustments in the Help Center. Additionally, I'll noted the improvement request and raise this topic with RnD folks. Cheers!

FBartsch · Post by **FBartsch** » Oct 28, 2019 8:35 am this post

Hey guys,

is there something new for that topic?

Will this be fixed some day or will it be better for me to change from ReFS to NTFS?

Thank you. =)

Post by **veremin** » Oct 28, 2019 10:54 am this post

Might be a stupid question - but are you using Fast Clone on you ReFS volume or not? The degraded performance happens only, when backup files (selected for tape backups) reside on ReFS repository with Fast Clone enabled? Thanks!

FBartsch · Post by **FBartsch** » Oct 28, 2019 11:09 am this post

Hi veremin,

thanks for the link. Did not find the Options where I can see if it's activated or not. Aber yes. It's activated.
After backup log is writing "Synthetic full backup created sucessfully [fast clone].

But how can i deactivate it? That article is saying via registry but nothing specific.
Will that change damage the actual backups when he starts the following jobs?

Thx.

Post by **veremin** » Oct 28, 2019 12:15 pm this post

You can disable it via REFSVirtualSyntheticDisabled <DWORD> regkey, however, keep in mind increased storage consumption, increased backup time and additional load on storage system that will come as a payback. Thanks!

FBartsch · Post by **FBartsch** » Oct 28, 2019 12:26 pm this post

I will think about it and test it.

Where do I need to add that key with which value?

Thanks.

Oct 28, 2019 12:57 pm

In the standard registry key:

Code: Select all

Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication

Set 1 as a value.

Thanks!

Post by **aich365** » Jan 13, 2020 1:45 pm this post

Hi Vladimir
Please can you advise is this a reg entry on the VCC host or on each repository server?
We have 5 Rep servers and ideally would like to only target the ones running tenant tape jobs.
Thanks.

Post by **HannesK** » Jan 20, 2020 1:16 pm this post

@@aich365: it's a VBR server key that is applied to all repositories.

Post by **aich365** » Jan 20, 2020 1:20 pm this post

Hi Hannes

Can you clarify. This surely applies to the repository server rather than the repository?

So if there are 4 repositories on the server all will be affected.

Thanks

Post by **HannesK** » Jan 20, 2020 1:54 pm this post

all repositories will be affected. No matter whether they are on one server or on different servers. The key must be set on the backup server.

JPMS · Post by **JPMS** » Jan 26, 2020 12:27 pm this post

If REFS block-cloning is the issue, can somebody at Veeam explain why tape backup performance is so much worse than restore performance when using the same REFS block-cloning?

I have been looking to migrate to a Windows repo with REFS and I also use tape as our secondary backup. When I first saw this thread my first thought was that if the issue was rehydration then it would equally affect restores as they both have to go through the same process. I also wondered if the issue could be resolved by doing a periodic Active Full but as I was to discover, the tape performance takes a dive after doing a single (block-cloned) incremental backup.

This week I finally bit the bullet. I went for Windows server 2019 LSB, patched with this mid-months update that is supposed to address the worst of the issues discussed in this thread veeam-backup-replication-f2/windows-201 ... 6-180.html

I ran an active full, then the following day ran an incremental backup with a creation of a synthetic full and then backed that up to tape. As with others in this thread, the tape backup (to an LTO8 tape drive) often struggled to hit 100MB/s when previously we had a sustained 300MB/s for an entire tape backup. So, at this point I decided to see how this compared to VM restore times from the repo as this is effectively the same process, just to a hard drive rather than tape.

We have a small environment, 15 VMs. All apart from one don't exceed 150GB. The one exception holds all our data and is about 2TB. I left this out of the testing because of the time restores would take and because the vast majority of it wouldn't change so it wouldn't be the best test of the effect on block-cloning on restore times. So for the first test I restored 14VMs, a total of 710GB. I did three restores, The first from the original Active Full, the second from the next days backup (16GB of changed data) and the third from the following days backup (a further 33GB of changes). As expected, the restores got slower but would still be high enough to saturate the capacity of the tape drive - 563MB/s, 473MB/s and 438MB/s.

Our repo is set to 'Use-per-VM backup files' so can process multiple VMs in parallel when conducting the above test. As tape is a serial device I felt it may be a fairer test if I just did the same for a single VM as I don't know if your tape software can write out multiple VMs in one stream. Even then, restore times for a single VM were 488MB/s, 458MB/s and 384MB/s. Again, faster than the tape drives speed of 300MB/s.

So, this can't just be a REFS block-cloning issue. The repo can recover the block-cloned data faster than the tape drive can write it, so why is tape performance so poor (currently unusable without disabling block-cloning)?

JPMS · Post by **JPMS** » Feb 02, 2020 10:50 pm this post

Bump

Any comment from Veeam?

Post by **HannesK** » Feb 03, 2020 7:21 am this post

any comment without support case number would be guessing.

I talked to some partners and customers last week whether they have seen general performance issues with REFS and tape and they said "no". So I'm not sure whether it is a general issue.

ShanGan · Post by **ShanGan** » May 28, 2020 11:46 am this post

Hi,

Is there any update on this case. I have a very similar setup and the tape jobs are really slow. Processing rates show only upto 70MB/s. When I check the throughput (ALL TIME) the maximum speed goes upto 200MB/s. But 90% of the time its just close to 70MB/s.

I am still confuced if the issue is due to the Tape library/drive or the repository. Bottleneck always shows Target and but not too sure if thats the case.

I did log a case with veeam but its close to useless. I stopped wasting my time calling them now.

Any in put will be greatly appriciated.

Thank you,

Stephan23 · Post by **Stephan23** » May 28, 2020 11:55 am this post

I switched the repository to NTFS and set Reverse incremental for every job.
Tape performance is now (just) OK. Version 10 also seems to increased the performance a little bit, but not as much as was let to believe.

Post by **Dima P.** » May 28, 2020 2:41 pm this post

Shan,

I did log a case with veeam but its close to useless. I stopped wasting my time calling them now.

Can you please share the case ID, I'll ask support management to review the case details. Thank you in advance!

ShanGan · Post by **ShanGan** » May 29, 2020 5:58 am this post

sure. 04167122

Post by **Dima P.** » May 29, 2020 11:32 am this post

Forwarded to the support management. Thanks!

R&D Forums

Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Re: Slow Tape Job Performance

Who is online