Discussions related to exporting backups to tape and backing up directly to tape.
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Tape job from HPE Storeonce processing rate

Post by Hirosh »

Hi guys,

I have been monitoring my backup-to-tape jobs, which read data from a StoreOnce 5000 and write to an MSL6480 library with multiple LTO-9 drives. What bothers me is that the processing rate of the backup-to-tape job never goes over 1 GB/s, no matter how I schedule it, and the job summary window says the bottleneck is the source (StoreOnce 5000). The StoreOnce is idle at the time the tape jobs are running, and there are no limits set on the StoreOnce repositories. The infrastructure is SAN with 16 GB/s FC connectivity. In the media pool settings as well as the tape job settings I have allowed the use of multiple drives, but I still cannot get beyond 1 GB/s.

Regards,
Hirosh.
haslund
Veeam Software
Posts: 856
Liked: 154 times
Joined: Feb 16, 2012 7:35 am
Full Name: Rasmus Haslund
Location: Denmark
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by haslund »

Pulling 1 GB/s from an HPE StoreOnce already seems relatively good. Maybe you can share some additional information about your infrastructure?
It sounds like your Veeam Gateway Server is connected using 16 Gbit FC (please note 16 Gbit/s != 16 GB/s). Where is your MSL 6480 Tape Library connected? Is it connected to the same Veeam Gateway Server? How is it connected? (I assume also using FC, but need to ask to be sure). If the Veeam Tape Server is a different server than your Veeam Gateway Server, what is the network connectivity between the two servers? Perhaps 10 GbE?
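To illustrate the unit point with some quick arithmetic (the payload figure mentioned in the comments is the commonly quoted 16GFC number, used here as an assumption, not a measurement from this environment):

# Quick bits-vs-bytes sanity check for FC link speeds.
# A nominal "16 Gbit" FC link is not 16 GB/s; dividing by 8 gives the raw byte rate,
# and real-world payload is a bit lower (commonly quoted around 1.6 GB/s for 16GFC).

def gbit_to_gbyte(nominal_gbit_per_s: float) -> float:
    """Convert a nominal link rate from gigabits/s to gigabytes/s (raw, before overhead)."""
    return nominal_gbit_per_s / 8

for link in (8, 16, 32):
    print(f"{link:2d} Gbit/s FC -> {gbit_to_gbyte(link):.1f} GB/s raw byte rate")
# A tape job reporting ~1 GB/s is therefore already a large fraction of a single 16GFC link.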
Rasmus Haslund | Twitter: @haslund | Blog: https://rasmushaslund.com
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh » 1 person likes this post

Hi haslund,

My bad, that was a typo; I meant 16 Gbit/s. Veeam B&R runs on a physical server (all-in-one), so the gateway server and tape server roles are installed on the Veeam B&R server itself. Yes, the library is connected to the same gateway server, over FC. Well, I was expecting a better rate, considering I am using multiple drives per job and the StoreOnce is idle at the time!
haslund
Veeam Software
Posts: 856
Liked: 154 times
Joined: Feb 16, 2012 7:35 am
Full Name: Rasmus Haslund
Location: Denmark
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by haslund »

Which is the exact StoreOnce model you are using from the 5000 series?
Rasmus Haslund | Twitter: @haslund | Blog: https://rasmushaslund.com
bct44
Veeam Software
Posts: 143
Liked: 38 times
Joined: Jul 28, 2022 12:57 pm
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by bct44 »

Hello,

Tape performance at scale was a big topic for me at the end of last year and the beginning of this year, so I can probably give you some input from the field.
1. Open a Veeam support case.
2. Are you using true per-VM backup files? Are you on v12?
3. Check performance metrics (CPU, memory, network/FC usage, disk I/O) on the hardware involved (repository, tape server, VBR...), check monitoring alarms, and look for counters/errors on the SAN fabric.
4. Run a read performance test on your source repository (a minimal sketch follows further down):
https://www.veeam.com/kb2014
5. Tape performance depends on the data you are sending to the drives: the bigger your files are (.vbk, for example), the better the numbers you can achieve. LTO drives need to be fed a continuous stream of data.
6. Involve HPE, the hardware vendor, too.

As Rasmus said, 1 GB/s sustained is quite good for an HPE StoreOnce, especially since your data is deduplicated and has to be rehydrated before it can be sent to tape. If you have enough free space and a fast storage volume on your all-in-one VBR server, you can try sending the data to tape without dedup.
If you want to test the performance of a single drive without involving the repository, you can use the tool below. Unfortunately you cannot run it against multiple drives at once.
https://www.ibm.com/support/pages/ibm-t ... l-itdt-v95
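Regarding point 4: KB2014 describes Veeam's recommended way to measure raw read speed from a repository. For a plain file-based repository (not a Catalyst store, which needs the Veeam/HPE tooling), a minimal sequential-read timing sketch like the one below gives a first-order number; the path is a hypothetical placeholder.

import time
from pathlib import Path

# Minimal sequential read-throughput check for a file-based repository.
# NOTE: this does not apply to Catalyst stores (use Veeam KB2014 / HPE tools there).
# The path below is a placeholder, and OS caching can inflate results on small files.
BACKUP_FILE = Path(r"D:\Backups\sample.vbk")  # hypothetical path
CHUNK = 4 * 1024 * 1024  # 4 MiB reads

def read_throughput(path: Path) -> float:
    total = 0
    start = time.perf_counter()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6  # MB/s

if __name__ == "__main__":
    print(f"{BACKUP_FILE.name}: {read_throughput(BACKUP_FILE):.0f} MB/s sequential read")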

I hope my answer gives you some food for thought. If you don't mind, feel free to share your conclusions.
Bertrand / TAM EMEA
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

haslund wrote: Dec 05, 2023 10:38 am Which is the exact StoreOnce model you are using from the 5000 series?
It's a 5600 with multiple expansion shelves.
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

Hi bct44,

1. It's not an issue as such; it's more of a tuning exercise.
2. Yes and yes.
3. No errors on the SAN fabric.
5. They are mostly big .vbk files.
6. That is a good idea.
Thank you.

Perhaps if Veeam released some benchmarks I could get a clearer picture: what is an average throughput and what is a great one?
bct44
Veeam Software
Posts: 143
Liked: 38 times
Joined: Jul 28, 2022 12:57 pm
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by bct44 »

Hello,

1) Even if it's just optimization, I opened a ticket and this enabled me to get help from Veeam support and the PM team.

You could probably ask your vendor for benchmarks with Veeam, if they exist :). For another hardware vendor, the integration was only tested for support compatibility; they didn't have lab benchmarks to share.
Bertrand / TAM EMEA
Regnor
VeeaMVP
Posts: 1006
Liked: 314 times
Joined: Jan 31, 2011 11:17 am
Full Name: Max
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Regnor »

It's hard to compare benchmarks for deduplication appliances, as there are too many factors to take into account. HPE and all other vendors publish data on what's theoretically possible with their hardware, but this doesn't mean you'll achieve the same with your data. In January Federico gave you some performance numbers, which looked quite good in my opinion: post473747.html#p473747

In your case: how many LTO-9 drives are you using in parallel? And did you mean you're getting 1 gigabyte/s or 1 gigabit/s (~125 MB/s)?
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

@regnor
True, Federico gave a thorough test report which was very helpful and enlightening. However, my scenario is a little different from what Federico tested (a single stream at 470 MB/s).
I'm using multiple drives (10) for a single job, which means multiple concurrent streams, so 1 GB/s (gigabytes) does not seem very satisfying to me.
FedericoV
Technology Partner
Posts: 36
Liked: 38 times
Joined: Aug 21, 2017 3:27 pm
Full Name: Federico Venier
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by FedericoV » 1 person likes this post

Performance tuning is always a task for Hercule Poirot or Sherlock Holmes. To gather more details, I would also use the performance monitoring section in StoreOnce. During the backup to tape, check how many read streams you have, the total throughput, the CPU/disk workload, and any other workload indicator you think could be a bottleneck. I would also try reducing the number of concurrent devices to see if the overall performance changes. In general we want to avoid the tape shoe-shining effect, i.e. start-stop-buffering-rewind cycles. I would also monitor the library hardware to check how many tape drives are really active concurrently. If all 10 drives are concurrently active, they receive on average 100 MB/s each, and if HW drive compression is active (which is a good practice), they might be writing to media at only 50 MB/s. That throughput is not enough to keep an LTO-9 drive in streaming mode, and it might cause start-stop cycles.
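To make that per-drive arithmetic explicit, here is a minimal sketch; the LTO-9 minimum streaming rate used below is an assumed illustrative value, so check the drive's speed-matching range in its datasheet.

# Per-drive feed rate when one job's aggregate throughput is split across N drives.
# The minimum streaming speed is an illustrative assumption, not a datasheet figure.

AGGREGATE_MB_S = 1000          # ~1 GB/s reported by the tape job
MIN_STREAMING_MB_S = 112       # assumed lowest speed-matched rate for LTO-9 (illustrative)

def per_drive_rate(aggregate_mb_s: float, drives: int, compression_ratio: float = 1.0) -> float:
    """Average data rate each drive sees; HW compression divides the physical media rate."""
    return aggregate_mb_s / drives / compression_ratio

for drives in (1, 4, 10):
    rate = per_drive_rate(AGGREGATE_MB_S, drives)
    status = "streams" if rate >= MIN_STREAMING_MB_S else "risks shoe-shining"
    print(f"{drives:2d} drives -> ~{rate:.0f} MB/s per drive ({status})")

# The HW compression point: with 2:1 compression the physical media rate halves.
print(f"10 drives + 2:1 HW compression -> ~{per_drive_rate(AGGREGATE_MB_S, 10, 2.0):.0f} MB/s written to media")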
There are two StoreOnce 5000-series models it could be: the 5650 (~2018) and the 5660 (~2021). I think even a fully populated 5650 should be able to read faster than 1 GB/s with multiple read streams.
bct44
Veeam Software
Posts: 143
Liked: 38 times
Joined: Jul 28, 2022 12:57 pm
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by bct44 »

Is it really worth enabling compression on the drives when the backup files are already compressed? I ran many tests with another hardware vendor and didn't notice any difference.
Bertrand / TAM EMEA
FedericoV
Technology Partner
Posts: 36
Liked: 38 times
Joined: Aug 21, 2017 3:27 pm
Full Name: Federico Venier
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by FedericoV »

@Bct44
You are absolutely right, but here the source is StoreOnce, and per its best practices Veeam's own compression should be disabled. At the logical level, StoreOnce therefore receives uncompressed backup files.

Why is data sent uncompressed?
Because it actually reduces storage utilization on the StoreOnce.
The Catalyst software running on the Veeam proxy/gateway server performs source-side deduplication. The deduplication ratio is very high because it is based on small segments, and every segment is compared against everything already written to the Catalyst store. If the backup file were pre-compressed by VBR, this comparison would be much less effective.

Is data inside StoreOnce compressed?
Yes, it is. The new segments, i.e. the ones that were not deduplicated, are compressed by StoreOnce before being written to disk.
RobTurk
Veeam Software
Posts: 274
Liked: 68 times
Joined: Aug 07, 2019 10:05 am
Full Name: Rob Turk
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by RobTurk »

Also note that tape jobs are single stream operations. Many dedupe appliances list maximum performance using many streams.

Read performance in dedupe appliances depends on the type and number of physical disks, as well as the number of CPU cores that can effectively be used.
By nature, dedupe appliances end up with data fragments scattered across all disks. Reading data is essentially a random read operation, so disk IOPS performance is crucial.
Units with just a few large NL-SAS disks will be held back in performance by lack of IOPS, while all-Flash units may do very well in that respect.

Once read from disk, each CPU core is tasked with reassembling and decompressing the data for a particular stream. Most dedupe appliances have many cores to allow this to happen in parallel for multiple streams.
The cumulative performance might be quite impressive. Single stream performance much less so.

In this case, with a single tape data stream, all cores except one are twiddling their thumbs. This explains why the StoreOnce reports itself as mostly idle while VBR reports the StoreOnce as the bottleneck.
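As a crude illustration of why spindle count matters for this random-read pattern, here is a toy model; all figures are illustrative assumptions, not HPE specifications.

# Toy model: random-read throughput of an IOPS-bound dedupe restore.
# All figures below are illustrative assumptions, not HPE specifications.

def restore_throughput_mb_s(disks: int, iops_per_disk: float, avg_chunk_kb: float) -> float:
    """Aggregate read throughput if every chunk costs one random I/O."""
    return disks * iops_per_disk * avg_chunk_kb / 1024  # -> MB/s

# A small unit vs. ~80 NL-SAS spindles, assuming ~100 random IOPS per spindle
# and ~128 KiB of contiguous data retrieved per I/O after deduplication.
for disks in (12, 80):
    print(f"{disks} disks -> ~{restore_throughput_mb_s(disks, 100, 128):.0f} MB/s aggregate")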
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

Merry Christmas to everyone; I hope you all have a bright and prosperous New Year.

@Federico
As always, I find interesting points in your comments. We are using a 5660 with 8 expansion shelves, and we are not using HW compression. When the tape job is running, 10 drives are used, so each drive is writing at roughly 100 MB/s. I have checked performance monitoring but have not seen any anomalies or excessive resource usage. My assumption is that because the backup stored on the StoreOnce is deduplicated, it needs to be rehydrated before it is copied to tape, and that is what makes the StoreOnce the bottleneck. Personally I was expecting a higher throughput for the tape jobs. (I followed every point mentioned in https://www.hpe.com/psnow/doc/a00023056enw).

@Robturk
We are using around 80 NL-SAS disks across 8 enclosures, and we also have 8 x 6.4 TB SSDs in the base system. The StoreOnce appliance is equipped with two AMD EPYC 7502 32-core processors. I was under the impression that the SSD tier would be used for metadata storage and caching. It is also surprising that the reads are random; I was expecting them to be more sequential in nature, since where the data is stored and how it is fetched is handled by the StoreOnce. It might be worth mentioning that even when we write data to the StoreOnce, it is also reported as the bottleneck. In our scenario we are using 10 drives in a single job, so it is not a single tape data stream.

P.S.: I welcome any suggestions and recommendations.
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

Hi everyone,

Any more feedback on the matter?

BR
FedericoV
Technology Partner
Posts: 36
Liked: 38 times
Joined: Aug 21, 2017 3:27 pm
Full Name: Federico Venier
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by FedericoV »

A restore, or any read operation from a deduplication appliance like StoreOnce, generates intense random read access on the StoreOnce disks. This is the nature of deduplication: during a backup operation, when a written data chunk is deduplicated, it is because it has already been stored somewhere on the appliance disks by a previous backup. When, during the same backup operation, a chunk of data is NOT deduplicated, it has to be written to a new position on disk, and that position is generally not adjacent to the previously stored, deduplicated chunks. Fortunately multiple new chunks can be contiguous, but not that many; it depends on the workload in the production system and whether it changes blocks on its disks as small segments/files or large ones. This means the restore process isn't a brute-force 4K random read, but it is still not a simple sequential operation.
During a restore/read operation, the metadata needed to rebuild the file from the deduplicated area on the HDDs is stored on the SSD-based volumes. Retrieving metadata is also a random operation and generates a lot of I/O, but because it sits on SSDs it is fast. The metadata required to restore a Catalyst object (backup file) is loaded entirely into RAM; after this initial work, the restore/read process accesses blocks on the HDDs using the map already in RAM. Clearly this is a simplification of the process; we also add a lot of read-ahead whenever it looks advantageous.
This long description is to explain that on StoreOnce it is not the deduplication layer adding latency, because the metadata is loaded into RAM from the SSDs; the real bottleneck is the HDDs where the data is stored. For this reason, StoreOnce restore performance is strongly influenced by the number of disks in the unit: more disks, more IOPS, faster restores.

That said, by Veeam's design each copy-to-tape operation is a one-to-one association between a single backup file and a tape device. Multiple files are never read IN PARALLEL and then written to the SAME tape device; for those who remember the concept, there is no multiplexing. This means each tape drive cannot receive data faster than a single-stream restore process can deliver it. Additionally, the process is faster when we read from a single .vbk than when we rebuild a full backup by reading from a .vbk plus multiple .vib files.
Per best practices, VBR writes data to StoreOnce in its native format, i.e. uncompressed. This means that when data is read back by the backup-to-tape process, it is still in its native format, which is important to know when sizing the LAN bandwidth. Before the data is written to tape, it is advisable to compress it again to save some tape capacity; my preference is to let the tape drive handle the compression.
To tune a specific environment, I would start by running backup to tape on a single drive, then add a second, and so on. This way we can see whether the cumulative throughput grows linearly, or whether after a certain number of tape drives there is no further benefit in adding more; from there we can identify the reason and avoid the tape shoe-shining effect.
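For anyone who wants to record that drive-by-drive test, here is a minimal sketch of the bookkeeping; the measurements in it are hypothetical placeholders.

# Record cumulative job throughput as drives are added one at a time and flag
# where scaling flattens out. The measurements below are hypothetical placeholders.

measurements = {1: 400, 2: 750, 4: 950, 6: 1000, 10: 1000}  # drives -> MB/s (example data)

single = measurements[1]
for drives, mb_s in sorted(measurements.items()):
    efficiency = mb_s / (single * drives)
    flag = "" if efficiency > 0.8 else "  <- scaling has flattened, look for the bottleneck here"
    print(f"{drives:2d} drives: {mb_s:5d} MB/s  (scaling efficiency {efficiency:.0%}){flag}")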
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

@federico

I'm terribly sorry for the late response on this case; I have been moving from Dubai to Austria and starting work for a new company, so the last few weeks I was settling in. Thank you for the comments on the case. I had already taken the same approach, starting with a single drive and adding drives afterwards, but I still cannot get over 1 GB/s (with a single drive it is even less), so I think 1 GB/s is the hard limit for the 5660 series.

Best regards
maruv
Novice
Posts: 7
Liked: 2 times
Joined: Jan 13, 2024 2:30 pm
Full Name: Amdrew
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by maruv »

Hirosh wrote: Dec 04, 2023 5:41 am I have been monitoring my backup-to-tape jobs [...] the processing rate for the backup-to-tape job does not go over 1 GB/s [...] the bottleneck is the source (StoreOnce 5000).
I have a similar setup to yours, and I faced the same issue with backup-to-tape jobs from a StoreOnce 5000. I did some research and found that the bottleneck is not the StoreOnce itself but the Catalyst protocol used to communicate with Veeam Backup & Replication. According to this forum post, the Catalyst protocol has some overhead and latency that affects the performance of tape jobs. The solution is to use a NAS/CIFS repository instead of a Catalyst repository on the StoreOnce and then copy the backup files to tape from there. This way the data transfer is more consistent and continuous, and the tape drive does not have to stop and reposition the tape between files. I tried this method and was able to achieve more than a 2 GB/s processing rate with multiple LTO-9 drives. I hope this helps you.
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

@maruv

Thanks for the feedback. Could you share that forum post, and may I ask how many drives you used to reach 2 GB/s?
Using NAS/CIFS has some security considerations as well as overhead, so it might not be the best alternative for us, but I'm still interested in learning more about it.

regards,
Ledwan.
pcan
Technology Partner
Posts: 2
Liked: 1 time
Joined: Sep 13, 2021 9:50 pm
Full Name: Pete Camble
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by pcan »

@Hirosh Federico alerted me to this thread. Hopefully the following can help provide you with a route forward:

With just the copy-to-tape jobs running and no other workload, I would expect your 5660 with 8 shelves to be capable of significantly more than 1 GB/s with 10 parallel restore streams. I think the Catalyst protocol is very unlikely to be your bottleneck; it can quite happily transfer several GB/s of physical data on a single data session (assuming infinitely fast tape/Veeam, network links, and StoreOnce dedupe engine). As per https://www.hpe.com/psnow/doc/a00023056enw, the StoreOnce dedupe engine is able to deliver around 400 MB/s off disk to Catalyst for a single stream on your class of StoreOnce hardware. Obviously this is unlikely to scale linearly with 10 streams, but I'd think you should be at around 2 GB/s with 10 streams on your StoreOnce, assuming your data has typical compressibility and you are mainly restoring from fulls (.vbk), not backup chains (.vbk + .vib).

To clarify:
Is the Veeam-to-StoreOnce Catalyst connection over FC or over Ethernet?
What is the link bandwidth available dedicated to Catalyst traffic?
If Ethernet, are you using any encryption (IPsec) between the Veeam server and the StoreOnce?
If FC, have you checked the FC devices settings https://support.hpe.com/hpesc/public/do ... BF3A4.html

If you haven't already raised a case with HPE Support, I recommend you do. If you can provide me with the HPE case number, I will ask for it to be escalated to the R&D lab so we can work with you to get a support ticket/bundle from your StoreOnce after you have reproduced the issue, and get a clearer picture of what is going on performance-wise in Catalyst/StoreOnce space.
RobTurk
Veeam Software
Posts: 274
Liked: 68 times
Joined: Aug 07, 2019 10:05 am
Full Name: Rob Turk
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by RobTurk »

@Hirosh Just curious, you mention in your second post that you use an all-in-one Veeam server.
How is this server sized in terms of CPU cores and memory? Is it sized for enough task slots? See:
https://helpcenter.veeam.com/docs/backu ... ml?ver=120

Veeam B&R Server: https://helpcenter.veeam.com/docs/backu ... kup-server
Gateway function for 10 streams: https://helpcenter.veeam.com/docs/backu ... way-server
Tape server for 10 streams: https://helpcenter.veeam.com/docs/backu ... ape-server

Then also Windows itself, the SQL Express database server and possibly the VBR console. It all adds up.

As mentioned in the documentation: "One machine can perform several roles. For example, you can assign roles of the VMware backup proxy and backup repository to the same machine, or use a VMware backup proxy as a gateway server for a shared folder backup repository. In such situation, you must make sure that the backup infrastructure component is able to process the cumulative number of tasks specified for different roles."
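As a rough illustration of that cumulative-task point, here is a back-of-the-envelope sketch assuming the commonly cited rule of thumb of about one CPU core and 4 GB of RAM per concurrent task; verify the actual figures against the sizing pages linked above.

# Back-of-the-envelope sizing for an all-in-one VBR server, using an assumed rule of thumb
# of ~1 core and ~4 GB RAM per concurrent task. Treat these numbers as illustrative only.

roles = {
    "tape drives (1 task each)": 10,
    "gateway/repository tasks": 10,
    "backup server / OS / SQL overhead": 4,   # rough allowance, not an official figure
}

total_tasks = sum(roles.values())
print(f"Concurrent tasks to size for: {total_tasks}")
print(f"Suggested minimum: ~{total_tasks} cores and ~{total_tasks * 4} GB RAM")
for role, tasks in roles.items():
    print(f"  {role}: {tasks}")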
Hirosh
Enthusiast
Posts: 90
Liked: 3 times
Joined: Dec 24, 2022 5:19 am
Full Name: Hirosh Arya
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by Hirosh »

@Robturk

We are using a DL380 Gen10+ with 2 CPUs and 256 GB of RAM (16 DIMMs); SSDs are used for the local OS installation and vPower NFS.
We are using SQL Server Enterprise and have updated to Veeam B&R 12.1.
We have not created any extra hop in our backup plane, and the connection medium is FC. The number of concurrent tasks on the proxy servers is configured according to Veeam best practices.
CPU utilization on the server hardly reaches 20-25%, and RAM utilization is around 30% most of the time.

As I mentioned earlier, the Veeam backup report shows the StoreOnce as the bottleneck.
pcan
Technology Partner
Posts: 2
Liked: 1 time
Joined: Sep 13, 2021 9:50 pm
Full Name: Pete Camble
Contact:

Re: Tape job from HPE Storeonce processing rate

Post by pcan » 1 person likes this post

@Hirosh

Veeam can only report that it is waiting on data from the Catalyst client; it does not report where the bottleneck is beyond the Catalyst client.

Based on your description of your configuration, the bottleneck is likely to be the network between the Catalyst Client (inside Veeam Gateway) and the Catalyst Server (inside the StoreOnce). It is not unusual for us to find issues in customer FC SAN configurations resulting in sub-optimal performance.

The only way to progress this issue is to find the source of the bottleneck and see how it can be fixed, which means raising a support case with HPE support. They can analyse the Catalyst logs, and work with you to run diagnostic tools, to determine where the bottleneck is and work out how it can be fixed.