Very slow "Backup Copy Job" over local LAN (10 GBPS)?

Simon_LBC · Jun 06, 2022 12:16 pm

Hi,

I just set up a fresh Veeam B&R installation in my new production server environment. I created my initial (full) backup job on a QNAP enterprise NAS connected over 10 Gbps iSCSI directly to the Veeam server VM, and the initial full backup (24 TB) completed in a stunning 40 hours at an average processing speed of 250 MB/s. I then created a simple "Backup Copy Job" from this backup job to a second repository, which is physically located on a second (physical) server with an internal SAS RAID array (60 TB), also running the latest ESXi 7.3. The repository on this target server is an Ubuntu Linux VM with a single 35 TB repository on it. All VMs and servers in the environment are connected through 10 Gbps network switches/links.

When I launched my initial Backup Copy Job, I was shocked to see a stable but VERY POOR processing speed of 53 MB/s... so copying my 24 TB backup should take well over 100 HOURS. All of this over a 10 Gbps link on the local LAN... I really don't understand how it can be that slow. My understanding is that a Veeam "Backup Copy Job" is a very simple block-copy job that doesn't require much processing on the source or target, so it should only be limited by network speed, shouldn't it?

Any suggestions?
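As a rough sanity check of the figures in this post, here is a minimal Python sketch that converts a backup size and an average speed into hours. It assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; the function name `transfer_hours` is made up for illustration, and the durations Veeam reports will not necessarily match it exactly.

```python
def transfer_hours(size_tb: float, speed_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at a constant speed_mb_s (decimal units)."""
    size_mb = size_tb * 1_000_000  # assumption: 1 TB = 1,000,000 MB
    return size_mb / speed_mb_s / 3600


if __name__ == "__main__":
    # 24 TB at the two processing speeds mentioned in the post
    for speed in (53, 250):
        print(f"{speed} MB/s -> {transfer_hours(24, speed):.1f} h")
```

At 53 MB/s the estimate lands well above 100 hours, which is consistent with the concern raised above.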

Post by **PetrM** » Jun 06, 2022 5:33 pm

Hi Simon,

It's not affected by network speed alone: the source Data Mover must open the backup files on the primary storage in order to read data blocks. These blocks are then transferred over the network to the target Data Mover, which writes them to a backup on the secondary storage. That adds two more variables to the equation: source storage read speed and target storage write speed. I'd suggest opening a support case, uploading debug logs as per the instructions in this KB, and asking our engineers to determine the "bottleneck" from log analysis. You can find more details about performance bottlenecks on this page; the same approach applies to a backup copy job, since it is also a data pipeline. The only difference is that this information is not shown in the UI, as far as I remember.

Thanks!
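The pipeline idea described above can be sketched in a few lines of Python: the end-to-end rate is capped by the slowest stage. This is only an illustration, not how Veeam computes its bottleneck statistics; the stage names and all throughput numbers below are hypothetical.

```python
from __future__ import annotations


def slowest_stage(stages: dict[str, float]) -> tuple[str, float]:
    """Return the (name, MB/s) of the stage that limits the whole pipeline."""
    name = min(stages, key=stages.get)
    return name, stages[name]


if __name__ == "__main__":
    # Hypothetical throughputs in MB/s, for illustration only
    stages = {
        "source read": 400.0,
        "network": 1250.0,   # 10 Gbit/s line rate, ignoring protocol overhead
        "target write": 80.0,
    }
    name, rate = slowest_stage(stages)
    print(f"bottleneck: {name} at {rate} MB/s")
```

With these made-up numbers the network is far from being the limit, even though it is the most visible part of the setup.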

Post by **foggy** » Jun 07, 2022 4:54 pm

Bottleneck stats are available in the job session log. Please review those; I bet that the Source is designated as the major bottleneck there. Indeed, the backup copy job 'is a very simple block-copy job', and the 'block' is key here, or rather, its location. Since the job is block-based, the backup copy reads blocks from the source randomly, and NAS devices are typically quite bad at random read I/O.

Post by **Simon_LBC** » Jun 07, 2022 5:40 pm

I opened a ticket with customer support, and they only determined that 99% of the bottleneck is on the target... but I am still pretty surprised by that, because the target storage is a Linux hardened repository on an almost new server, backed by a local SAS RAID array inside the server that is pretty quick. When I ran a direct backup to this repo earlier (not a backup copy job), it was faster than that, averaging 150-175 MB/s processing speed. With the backup copy, the fastest I have seen is around 75-80 MB/s.

So the initial backup job (24 TB) took about 28 hours to complete at an average speed of 223 MB/s to the other NAS repo, but the backup copy of this same job will take 100+ hours to complete. What a huge difference.

Obviously this is a concern for the initial backup copy job only, since daily incrementals will be around 300-400 GB thereafter, so they will complete pretty quickly.

Post by **foggy** » Jun 07, 2022 10:06 pm

Some kind of parallel disk activity (if there is any) might be the reason for such behavior. I also wanted to warn you against storing backups on a virtual machine.

Post by **Simon_LBC** » Jun 08, 2022 12:08 am

I agree that creating a Linux Hardened Repository on a VM wasn't my first choice, but I don't think there is any other option for me here. I have a Dell PowerEdge T440 server running the latest VMware ESXi (7.3) that is located off-site (for the moment it's on-site, but only for the initial backup copy job since it's 23 TB), and this server has a single internal SAS RAID array of 56 TB capacity (8x8 TB). This server is primarily used as an off-site replication target (also with Veeam) to replicate all my production VMs (9 VMs) off-site and provide immediate failover in case of disaster. However, I also want an off-site repository on this server for data retention with a second Veeam backup job, so I can't see any option other than creating a repository on this ESXi server... Do you have any other suggestions?

Post by **Mildur** » Jun 08, 2022 5:19 am

From the backup server, this other "backup" ESXi host must be reachable on port 443 if you replicate to it.
Veeam stores the passwords for all ESXi hosts in its database, and there is a simple method for an administrator to decrypt them.
If an attacker has access to the backup server, it's as simple as this:

1) Attacker gets access to the backup server
2) Decrypts the ESXi password of the remote-site ESXi host
3) Logs on to the ESXi host
4) Deletes the Linux hardened repository with all its stored backups
5) Deletes all backups from the directly connected enterprise NAS
6) Deletes or encrypts all production VMs

I recommend giving your environment another thought. Access to the backup server gives an attacker the capability to destroy everything (production, replicas, and all of your backups). Your backups are not protected if you store them on the second ESXi host.
Every backup copy can be destroyed from the backup server.

Post by **Simon_LBC** » Jun 08, 2022 10:37 am

Both the production and backup servers are behind several firewalls with all ports closed and are only accessible through a site-to-site VPN, so it's not that big a deal for me. As I explained, this is my only available target server for both the off-site replication job and the backup job; I don't have any other infrastructure available for that. So what other setup could I achieve, knowing this is the only server I have?

Post by **foggy** » Jun 08, 2022 5:14 pm

Any chance to cut out a second LUN on the array to set up a Linux repository there?

Post by **Simon_LBC** » Jun 09, 2022 12:26 pm

Unfortunately, cutting a second LUN out of the main (and only) array isn't an option either. I checked, and shrinking the virtual volume isn't allowed, so splitting it into 2 parts would mean "trashing" the entire server setup and starting from scratch, which would be far too time-consuming and inefficient. Looks like I am pretty screwed with this setup now.

Post by **Simon_LBC** » Jun 09, 2022 12:54 pm

I also read the entire thread you provided (vmware-vsphere-f24/don-t-store-backups- ... 10666.html) about storing backups on a VM, and I must admit that the bottom line of that discussion is risk management. This virtual repository stored on VMFS is our SECOND backup destination, not the main one. The first backup destination is on well-performing iSCSI storage and is built in accordance with Veeam best practices. The virtual repo is the second destination, for off-site storage. So if it dies... it dies... and we will build a new one. The risk would be the main backup failing at the exact same time as the second backup, and at the exact same time as the main production environment. Realistically speaking, that is probably a very low risk. Moreover, this second server is primarily used as a full replication target. So even if the virtual backup repo fails because of VMDK corruption, the replicated VMs would still work, unless the entire server fails completely, which would be very surprising. This virtual repo is only storage for data retention, as a second destination repository.

I think our current backup & replication setup is perhaps even "too safe" or overkill... because it's like a backup of a backup to a second backup, and a replication to a second replication... in case something, somewhere, fails...

Post by **foggy** » Jun 09, 2022 12:58 pm

Just wanted to make sure you understand all the risks, and it seems you do.

Post by **RobTurk** » Jun 13, 2022 6:23 am

The capacities mentioned (56 TB out of a set of 8x8 TB) suggest RAID-5. That may be another risk: when one of the disks fails, the rebuild will take quite a long time (several days).
Performance will drop significantly during that time, which affects the replication and copy processes. You may have to stop them while a rebuild is in progress.

Performance is only as good as the weakest link. As you have at least two different tasks running on the disks (replication and copy job), the disks will see random access. Depending on the type of disk (5400/7200 rpm, on-disk cache), they may not be a good match for that type of workload. The RAID controller itself can also play a role. The T440 supports several, from low-end software RAID (S140) to high-end, cache-backed hardware (H740P). It all makes a difference.
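The RAID-5 inference from the capacity figures can be checked with a small Python sketch. It uses the textbook usable-capacity rules (one disk of parity for RAID-5, two for RAID-6, half the disks for RAID-10) and ignores controller and filesystem overhead; the function name `usable_tb` is invented for this example.

```python
def usable_tb(disks: int, disk_tb: float, level: str) -> float:
    """Nominal usable capacity in TB for a few common RAID levels."""
    if level == "raid10":
        return disks * disk_tb / 2  # mirrored pairs: half the raw capacity
    parity_disks = {"raid0": 0, "raid5": 1, "raid6": 2}
    return (disks - parity_disks[level]) * disk_tb


if __name__ == "__main__":
    # 8 disks of 8 TB each, as described in the thread
    for level in ("raid5", "raid6", "raid10"):
        print(f"{level}: {usable_tb(8, 8, level):.0f} TB")
```

Only the RAID-5 case yields the 56 TB quoted for the 8x8 TB array.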

Butha · Jun 13, 2022 7:08 am

Hi Simon,

I see you mention 10Gbit a few times, so I'll tackle this from the lower levels, which is often where the cause lies

(And out of scope for normal support)

How is your jumbo frame configuration here? Since you mention LAN and iSCSI, it's quite possibly network related, and to be honest the speeds you mention (around 200 MB/s) are not close to what's possible on 10Gbit. For a single job in testing, with a target of anything above 3x SAS disks, you should be pushing 600 MB/s or more. (For reference, we're running 100Gbit on the ESXi side, 10Gbit to physical Ubuntu IMM repos, and 40Gbit to the backend storage.)

When running 10Gbit links with the standard MTU size you will still be close to 1Gbit speeds. Why is out of scope here - just trust me

First - jumbo frame configuration seems scary, and you should be careful as you don't want to risk cutting yourself off from something, but with a little planning it's actually easy, as jumbo and non-jumbo devices talk to each other without problems (although there is fragmentation on your switches, which could be an issue with very cheap switches). If you are not comfortable with this, don't manage the switch infrastructure, or don't have a network background, ask somebody who does or reach out.

To test your network throughput as a first phase (don't worry about disk read/write speeds yet), use something like iperf. You run it on the source or target side in "server" mode, and then run it on the client, connecting to the "server". Run it with multiple streams, say 6, and see what you get. Example: on the server (10.1.1.2) run "iperf -s", and on the client (10.1.1.3) run "iperf -c 10.1.1.2 -P 6". You can also reverse the direction to see whether you have issues one way (perhaps duplex etc.). The speeds should be closely matched. This will show you what your switches are capable of.

Second - there are a few places where jumbo frames need to be configured. Importantly, the physical switch ports need to have it enabled. Vendors differ here, as do switch models: some require it as a global command (which often requires a reboot of the whole switch!! - for example Cisco 2900/385x series), while some higher-end ones like Nexus can be configured "per port". The ports uplinking to ESXi for your data (iSCSI) layer need to be configured, as do any other physical ports connecting to the NAS. In fact, if all your traffic is on 10Gbit I'd recommend jumbo everywhere, but that's a bit out of scope here

You then need to configure your vSwitches on ESXi in one or two places as well, and make sure the switch config -> data ports are mapped correctly. (Often a vSwitch has multiple physical uplinks, even 1Gbit and 10Gbit mixed, but engineers don't map them correctly.) You then need to enable jumbo frames on the vNICs of any VMs as well (in the OS). This is done differently on Windows VMs and Ubuntu VMs. For Windows VMs you must use vmxnet3 adapters, and a lot of custom PowerShell tweaking is also required to get proper speeds (things like offloading, flow control, etc.). Linux needs a bit less tweaking; the challenge is more that most flavours differ in WHERE to change the MTU, and in how to save it so a reboot doesn't reset it to the standard value. Lastly, you have to make sure jumbo frames are also enabled on the physical ports of the NAS.

You can then repeat the tests with iperf and see the difference.

A simple method to test jumbo frames end to end is with, say, 2x VMs running via the switch. Simply ping one from the other, but add "-l 8900" and "-f" ("dash Lima, dash Foxtrot"; these are the Windows ping options). Example, from 10.1.1.2: ping 10.1.1.3 -l 8900 -f. It means ping with a payload size of 8900 bytes (a jumbo MTU is typically 9000, but you leave some room for header overhead), and -f means "don't fragment packets", so it forces the full 8900 bytes in one packet; otherwise it might seem to work without really using jumbo! If everything checks out you are ready to go and should see great improvements.
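To see how much headroom the 8900-byte test leaves, here is a minimal Python sketch that works out the largest ping payload that fits in one unfragmented IPv4 packet for a given MTU. It assumes a 20-byte IPv4 header without options and the 8-byte ICMP header; the function name is made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in a single IPv4 packet of size mtu."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max payload {max_ping_payload(mtu)} bytes")
    # The 8900-byte payload from the post fits into a 9000-byte MTU with room to spare,
    # but it is far too large for a standard 1500-byte MTU.
    print(8900 <= max_ping_payload(9000), 8900 <= max_ping_payload(1500))
```

A test payload that passes with the don't-fragment flag set therefore only proves jumbo frames work if it is larger than what a standard MTU could carry.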

Just a last note - you mentioned "processing speed" as a metric. Please note this is not a reflection of actual write speeds over the network; it's merely a calculation based on the time it took to read data and write changed blocks, and many other factors play a role here. I have jobs with processing speeds of 2-3Gbits/s but actual writing speeds are around 1GB/s (the 10Gbit max).
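Since the thread mixes Gbit/s and MB/s, a short Python sketch of the unit conversion may help. It uses decimal units and ignores protocol overhead, so real-world maxima are somewhat lower; the function names are invented for this example.

```python
def gbit_to_mb_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to MB/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


def link_utilisation(observed_mb_s: float, link_gbit_s: float) -> float:
    """Fraction of the theoretical line rate that an observed MB/s figure represents."""
    return observed_mb_s / gbit_to_mb_per_s(link_gbit_s)


if __name__ == "__main__":
    print(f"10 Gbit/s = {gbit_to_mb_per_s(10):.0f} MB/s")
    # Speeds reported earlier in the thread, relative to a 10 Gbit/s link
    for speed in (53, 250):
        print(f"{speed} MB/s = {link_utilisation(speed, 10):.1%} of line rate")
```

This is why roughly 1 GB/s is treated as the ceiling for a 10Gbit link in the note above.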

Post by **gparker** » Jun 13, 2022 7:55 am

Hi, make sure you’ve provisioned Veeam Proxy server roles in both the source and target sites. If you don’t have a proxy in the target site then the data mover on the proxy in the source site has to do all the work when processing the backup copy job.
Regards, George.

Post by **Mildur** » Jun 13, 2022 8:53 am

Hi George

Proxy servers are not involved in a Backup Copy Job.
The backup copy job transfers the data directly between the data movers of the source repository server and the target repository server.
The only exception: if you use an NFS- or SMB-based repository, the gateway server will be involved in copying the backups.

Thanks
Fabian

Post by **gparker** » Jun 13, 2022 10:31 am

Hi Fabian, thanks for that. My bad, I thought the proxy server and data mover were one and the same. I was under the assumption that one of the jobs of a proxy server is to move data to/from a repository. I did not realise that a data mover runs on the repository server. I know that there should be proxy servers at both sites to support Veeam replication; I had also assumed they were needed at both sites to support or improve backup copy throughput.
George.

Post by **Alastair-R** » Jun 13, 2022 10:42 am

Mildur, I think your point 4 is incorrect. The only ways backups can be removed from a Linux HR are expiry of the retention policy period or the attacker having direct access to the Hardened Repository.

Jun 13, 2022 11:08 am

Mildur, I think your point 4 is incorrect. The only ways backups can be removed from a Linux HR are expiry of the retention policy period or the attacker having direct access to the Hardened Repository.

Not if the Linux Hardened Repo is a VM with VMDKs as the backup storage.
When I delete a VM, the ESXi host doesn't care about the "immutable" filesystem inside the VM. It just removes the VMDKs and, with them, the immutable backups.
That's the scenario in this topic.
