Data rates slow dramatically over the course of a backup job

Post by **StanO** » Mar 16, 2010 4:03 pm

I've encountered a troublesome issue with some of our backup jobs. We're running Veeam 4.1.1 though I saw the same behavior with Veeam 4.1.

The job in question contains two VMs totaling roughly 1TB in size (one VM is 818GB, the other 218GB). Using perfmon to monitor the read bytes/sec of the VMFS volume presented to the Veeam Windows host during the job, I observe that the read rate starts off at an acceptable rate on the order of 50MB/sec with occasional spikes to 120MB/sec and valleys to 30MB/sec, but for the most part maintaining a 50MB/sec rate. As the job progresses, however, the read rate (and associated write rate on the destination disk) decreases significantly to the order of less than 10MB/sec. The net result is that the first 400GB of the job takes about 4 hours, while the remaining 600GB takes about 14 hours. All the drives in these VMs are nearly full of data (the roughly 1TB of space contains roughly 850GB of data), resulting in little fluctuation of the actual data transfer rate due to deduplication of zeroed data.
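As a quick sanity check on those figures, the elapsed times can be turned into average throughput per phase. This is only a sketch: the helper name `avg_rate_mb_per_s` is made up for illustration, and it assumes 1 GB = 1024 MB. These are averages implied by the elapsed times, not the instantaneous read rates perfmon reports.

```python
def avg_rate_mb_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in MB/s, treating 1 GB as 1024 MB (an assumption)."""
    return gigabytes * 1024 / (hours * 3600)


# Figures taken from the post: first 400GB in about 4 hours,
# remaining 600GB in about 14 hours.
first_phase = avg_rate_mb_per_s(400, 4)
second_phase = avg_rate_mb_per_s(600, 14)

print(f"first 400GB:     {first_phase:.1f} MB/s")   # 28.4 MB/s
print(f"remaining 600GB: {second_phase:.1f} MB/s")  # 12.2 MB/s
```

On these numbers the average rate in the second part of the job is less than half that of the first part.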

We are using 3TB logical drives (14 spindles, Ultra320) at the destination, which are less than 50% full (minimizing the possibility that severe fragmentation is the culprit). To verify this I plan to repeat my testing using a pristine destination drive. I may also break the job up into two jobs to see how that impacts the time to completion. I suspect performing the backups as two separate jobs will reduce the time to completion significantly. The reasoning: if it takes 4 hours to do the first 400GB of a job, 8 hours to do the next 400GB, and 8 hours to do the final 200GB, then arranging the jobs such that the largest single job is 800GB should result in a completion time of about 12 hours.
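The split-job estimate can be written out as a small model. This is a sketch under a strong assumption: that the slowdown depends only on how far into a job the transfer has progressed, and that each new job starts again at the fast rate. The phase figures are the hypothetical ones from the paragraph above; `PHASES` and `hours_for_job` are illustrative names, not anything from Veeam.

```python
# (size of phase in GB, hours that phase takes), in the order a job passes through them.
PHASES = [(400, 4.0), (400, 8.0), (200, 8.0)]


def hours_for_job(size_gb: float) -> float:
    """Estimated hours for one job of size_gb under the piecewise model.

    Sizes beyond the last phase (over 1000GB here) are not modeled.
    """
    remaining = size_gb
    total = 0.0
    for phase_gb, phase_hours in PHASES:
        if remaining <= 0:
            break
        portion = min(remaining, phase_gb)
        # Charge time proportionally to how much of this phase the job uses.
        total += phase_hours * portion / phase_gb
        remaining -= portion
    return total


print(hours_for_job(1000))  # single 1000GB job: 20.0
print(hours_for_job(800))   # largest job after splitting: 12.0
print(hours_for_job(200))   # the other job: 2.0
```

If the two jobs ran concurrently without slowing each other down, the completion time would be that of the 800GB job, which matches the 12-hour estimate. Whether the model holds is exactly what the test would show.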

I can provide any additional information that may be relevant, and will continue testing (I'm trying to reproduce this in a lab environment but am struggling to come up with 0.8-1TB of space to fill with random data). Considering the testing times that are required for each trial, I expect it will be slow going to resolve this on my own and so decided to solicit feedback here as to possible causes or explanations.
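For anyone trying to reproduce this in a lab, one way to fill a test disk with random (and therefore poorly compressible) data is a short script like the one below. It is a minimal sketch, not a tuned tool: the function name, chunk size, and the example path are assumptions, and writing close to 1TB this way will take a long time.

```python
import os


def write_random_file(path: str, size_bytes: int,
                      chunk_bytes: int = 4 * 1024 * 1024) -> int:
    """Write size_bytes of random data to path in chunks; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            # The last chunk may be shorter than chunk_bytes.
            n = min(chunk_bytes, size_bytes - written)
            f.write(os.urandom(n))
            written += n
    return written


if __name__ == "__main__":
    # Hypothetical example: a 1 GiB filler file in the current directory.
    # Repeat with different names to fill the test volume.
    write_random_file("filler_0001.bin", 1024 ** 3)
```

Random data defeats both compression and deduplication, so it represents a worst case rather than a typical file server.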

Any help or insight would be appreciated.

Thanks,

Stan

Post by **tsightler** » Mar 16, 2010 5:49 pm

I'm not 100% clear: are you seeing this on the initial full backup, or is this for incremental passes?

We're not seeing anything like this in our environment. We have quite a number of jobs that are bigger than this, including single VMs that are 1.4TB in size of mostly compressed data, and we see consistent performance across the entire backup.

Post by **StanO** » Mar 16, 2010 5:53 pm

Thanks for the response, Tom. If nothing else, it's good to hear this behavior is unique to our environment (because we're lucky like that).

Sorry, I meant to put the data point in, but forgot. In all cases this occurs during a full backup, whether initial or forced.

Post by **tsightler** » Mar 16, 2010 6:29 pm

So I'm assuming your target is a locally attached disk formatted with NTFS, or is it SAN? You're not seeing any significant IOPS on the target at that point are you?

Are you monitoring memory usage for the job? CPU usage? What type of server hardware? Have you tried with different compression ratios and/or disabling the dedupe engine? What size is the final VBK file?

Sorry I don't have any answers, just lots more questions!

Post by **StanO** » Mar 16, 2010 7:27 pm

No apologies needed for trying to help. Talking through it sparks ideas.

Initially, the target disk was a direct attached SCSI array (5 of them actually, each with a 14 spindle RAID5 LUN; each job uses a separate SCSI array destination). Then we determined that the original backup server was a bit underpowered on CPU, as we couldn't run two concurrent jobs with optimal compression and dedupe enabled. For the sake of clarity, let's call the original server BEserver (guess what product Veeam is displacing). It is a dual single-core processor server (3GHz, 4GB RAM). To alleviate the CPU resource shortage, we moved the Veeam backup processing to the vCenter Server (previously nearly dedicated to running vCenter Server); call it VCserver. VCserver is a blade server and thus does not lend itself readily to external SCSI connection, so we've kept the external SCSI arrays connected to BEserver.

So, VCserver runs the Veeam jobs and moves the data over GigE via network share to SCSI disks attached to BEserver. We recognize that the GigE connection represents the most likely bottleneck when running two or more concurrent jobs, but it should be sufficient for processing of a single job at a time and adequate for running two jobs concurrently (which is all we intend to ask of the single quad-core VCserver (2.83GHz, 4GB RAM)). Source disks are all FC SAN attached (40 spindles) and use the SAN transport for reading.

IOPS on the target scale down with the decrease in total throughput. CPU and memory utilization, in general, both appear to remain constant throughout the job, though CPU may taper off a bit with the reduction in throughput (I'll make a more conscious note of it through the next run).

I've started another test run, this time using space from the SAN array as the destination. This space is using the same 40 spindles as the VM stores, but this test should eliminate or implicate the backend network/BEserver/SCSI storage components as the likely culprit(s).

I've not yet done trials with varying compression and deduplication settings since moving the processing to VCserver and introducing the network sharing, but I did do trials with both Optimal and Low compression when everything was running on BEserver and observed the same behavior in both cases. If my current run exhibits the same behavior I'll give the de-dupe and compression settings additional consideration.

Thanks,

Stan

Post by **tsightler** » Mar 16, 2010 8:10 pm

I'm not sure how Veeam behaves when writing to a share, but this would be my most likely guess as to the culprit. While I would generally agree that writing to a share shouldn't be that big of an issue, especially for the full backup, there are some known performance penalties when accessing very large files via a share.

We have a somewhat similar setup. Our Veeam server is a blade, but our target storage is iSCSI-based storage located at a remote facility about 7 miles away, connected via 1Gb fiber Ethernet. We installed a small Linux VM to front-end the iSCSI box and added the Linux system directly into Veeam rather than accessing it via a share. With this method Veeam deploys a small agent to the Linux server, and the agent reads data directly from the Veeam server over a socket, with no need to share the filesystem. This works very, very well.

Will be interested to see your results.

Post by **Gostev** » Mar 16, 2010 8:43 pm

Stan, you can also save time by skipping the experiments with disabling dedupe and varying compression levels. We found that disabling dedupe and using low or no compression will always reduce backup performance, because much more data will need to be sent to the target storage over the network. In fact, you will see performance drop from using low compression levels even when backing up to local storage (the famous "DriveSpace" effect).

Also, note that the dedupe engine has very minimal overhead. Most CPU is eaten by compression, but the "Optimal" compression setting is called that for a good reason.

Post by **tsightler** » Mar 16, 2010 9:43 pm

tsightler wrote: I'm not sure how Veeam behaves when writing to a share, but this would be my most likely guess as to the culprit.

I guess I have to take this back since, after re-reading your post, I realized that you said you were experiencing this issue both before and after moving the backup to a new host. Now I'm at a loss. Will wait to see how your results go with the SAN target.

Post by **StanO** » Mar 17, 2010 6:11 pm

I ran through two tests yesterday and this morning using the SAN attached disks as the destination and observed no significant slowdown as the job progressed. So, I'll turn my focus to the backend part of the equation. I'm going to confirm again that performing a backup on BEserver exhibits the slowdown (eliminating the network from the mix). If the problem persists, I believe it will be indicative of a resource shortage on BEserver or an issue with the SCSI attached storage.

Thanks for talking it through with me, I'll post any findings or progress.

Stan
