Comprehensive data protection for all workloads
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Slower with parallel processing

Post by masonit »

Hi

Setup:
Veeam backup server ver 7.0.0.839
2 * proxy servers, both with 20 concurrent tasks (san backup)

With the setup above we have been backing up 600-650 VMs within our backup window (22:00 - 08:00) without any problems. We have had compression set to High since the upgrade to 7; in 6.5 it was Optimal. We haven't been using parallel processing, but now that a specific bug has been fixed in 7.0.0.839 we decided to start using it.

Yesterday I activated parallel processing and hoped we would gain even more performance. But now it seems to have had the opposite effect. This morning at 08:00 there were still 28! jobs running. If I look at the load on our proxy servers and backup server, they are all almost idle. The bottleneck on the jobs says target, as it always does, but now it is at 70-80% whereas normally it is more like 99%.. :)

One thing I noticed is that almost all jobs are saying: Resource not ready: VMware proxy. When I count concurrent tasks on one of the proxy servers it is only 8, so in theory it should be able to handle 12 more concurrent tasks. 20 minutes later the jobs were still waiting for resources. It seems like Veeam doesn't use the proxy resources very efficiently in parallel processing mode..?

Shouldn't we get better performance with parallel processing?

\Masonit
Tobias_Elfstrom
Enthusiast
Posts: 84
Liked: 8 times
Joined: Jul 04, 2012 6:32 am
Full Name: Tobias Elfstrom
Contact:

Re: Slower with parallel processing

Post by Tobias_Elfstrom »

I have noticed the same thing. If you are using parallel processing and you are using all your proxy servers (depending on how many jobs, VMs, and disks in those VMs you have, of course), it seems that there is often some sort of resource allocation conflict, so that jobs hold each other up. For this reason your total backup running time might be longer when using parallel processing than when processing sequentially within the backup jobs. (And if there are too many jobs altogether in this state at the same time, Veeam never seems to recover from it.)

So in order to have a shorter total backup window using parallel processing, you might need to rethink your scheduling.

//Tobias
poulpreben
Certified Trainer
Posts: 1024
Liked: 448 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Slower with parallel processing

Post by poulpreben » 1 person likes this post

masonit wrote: One thing I noticed is that almost all jobs are saying: Resource not ready: VMware proxy. When I count concurrent tasks on one of the proxy servers it is only 8, so in theory it should be able to handle 12 more concurrent tasks. 20 minutes later the jobs were still waiting for resources. It seems like Veeam doesn't use the proxy resources very efficiently in parallel processing mode..?
Hi Masonit,

One thing to clarify is that processing one virtual disk (VMDK or VHD(X)) equals one proxy task and requires one CPU core. Since you are running Direct SAN, I assume you are using two physical servers? If you have dual CPUs with 10 cores each, then 20 tasks per proxy is OK, but please verify that the number of tasks per proxy does not exceed the number of CPU cores. Since you do not see high CPU utilization on the proxy servers, this may not be the limitation, but it is still a good thing to verify.
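The sizing rule above is trivial to sanity-check. A minimal sketch (the dual 10-core CPU counts are the assumed example from this post, not measured values from the actual environment):

```python
# Veeam's rule of thumb as stated above: one concurrent proxy task
# processes one virtual disk and wants roughly one physical CPU core.
def max_safe_tasks(sockets: int, cores_per_socket: int) -> int:
    """Upper bound on concurrent tasks for one proxy server."""
    return sockets * cores_per_socket

configured_tasks = 20
cores = max_safe_tasks(sockets=2, cores_per_socket=10)  # assumed dual 10-core CPUs
print(configured_tasks <= cores)  # True: 20 tasks on 20 cores is within the rule
```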

You mention that the job is "waiting for resources", but have you verified it is not actually processing 20 disks simultaneously (it can be a little difficult to see from the job overview)? It will not process 20 VMs, but 20 disks. When using parallel processing, I would also recommend that you switch to "Optimal" compression, since this will really increase your backup performance, with only ~10% additional storage required. You can do this on the fly, but it is a good idea to schedule a new active full at some point.

There is no doubt that using parallel processing will put additional stress on your repository, which may actually slow down the backup process. You can also try to throttle the ingest rate by limiting the number of tasks on your repository, and find your repository's "sweet spot" by increasing the number of tasks gradually. One repository task equals one active job.

/Preben
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

Yes, I also realize now, as you say, that parallel processing isn't only on the VM level but also on the disk level. When I said 8 concurrent jobs I only looked at 8 concurrent VMs in progress, but that could of course be 20 disks..

I will now change compression to optimal and also allow more concurrent jobs on my proxies.

I know that Veeam recommends one task per core, but from my experience it is not an issue to run a lot more concurrent tasks per core, as long as you keep track of the total load on the proxy servers.

\Masonit
Gostev
Chief Product Officer
Posts: 31456
Liked: 6647 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Slower with parallel processing

Post by Gostev »

That's true.
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Slower with parallel processing

Post by tsightler »

Can you tell me a little more about your environment? Specifically, what backup mode are you using (forward or reverse) and how is your repository set up?

I'm guessing that you have some pretty decent target storage for your repository, and that you are using reverse incremental. If so, you might need some changes to actually get the most advantage from parallel processing, but I don't want to go down a rat-hole if my guesses about your setup are wrong.
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

tsightler wrote:Can you tell me a little more about your environment? Specifically, what backup mode are you using (forward or reverse) and how is your repository set up?

I'm guessing that you have some pretty decent target storage for your repository, and that you are using reverse incremental. If so, you might need some changes to actually get the most advantage from parallel processing, but I don't want to go down a rat-hole if my guesses about your setup are wrong.
Sad to say, using Optimal compression and more concurrent tasks on our proxies hasn't helped. :(

If I count disks in progress I can see that Veeam seems to use all possible concurrent tasks, so that doesn't seem to be the issue. Load on our proxy servers is not very high: 50-60% CPU, enough RAM.
That leaves our storage, which seems to be the bottleneck. We use reverse incremental, with RAID 6 on our storage. I know that it comes with big write penalties, but it has worked so far.

Does parallel processing always use more i/o than "normal" mode?

Our environment works very well, but it was already at the limit of the backup window before. It seems like activating parallel processing has increased the I/O by only 10-15%, but that seems to have pushed it over the edge.

As it looks right now, we have to turn off parallel mode again. The only reason we want to use parallel mode is that you can't run a backup of a single VM inside a backup job; you have to run the entire job (we don't want to run zip jobs).
Today, when a technician wants to, for example, run Windows Update on a server, then instead of just backing up that VM they need to run the entire backup job. And if it is a big job and the VM is at the bottom of the list, it can take some time (normal technicians are not allowed to change the order in the job). This is a big issue for us, and with parallel processing it would not be an issue anymore. Yes, all VMs would still be processed, but the job would finish much quicker.

Maybe it is possible to run a backup of only a specific VM inside a job in Veeam 8? Today you can run a retry on a single VM, so from my perspective I can't understand why it is not possible..

\Masonit
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Slower with parallel processing

Post by foggy »

masonit wrote:Does parallel processing always use more i/o than "normal" mode?
It could result in less optimal usage of storage IOPS, depending on the repository configuration. You haven't answered Tom's request regarding your backup repository: what kind of storage do you have, how is it presented in Veeam B&R, do all the jobs point to a single backup repository, and what is the maximum allowed number of tasks that can be assigned to it?
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

In our environment, proxy and repository are on one server. We have a dedicated virtual environment for Veeam. Every proxy server is a Windows Server 2008 R2 / 2012 R2 machine with Windows iSCSI-connected storage (~40 TB). The iSCSI-connected storage is the repository, so we always have a 1-to-1 relation between proxy and repository (they are the same server). Jobs are always configured so that proxy and repository are the same server. On the storage, each repository is an array of ~16 × 3 TB SATA MDL drives in RAID 6. We don't set any limits on the repository in Veeam; we only limit concurrent tasks on the proxy server (20). If a proxy server is congested, we add more resources to the VM.

On average we backed up ~300 VMs per proxy/repository before, when we weren't using parallel processing.

\Masonit
Gostev
Chief Product Officer
Posts: 31456
Liked: 6647 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Slower with parallel processing

Post by Gostev »

masonit wrote:Maybe it is possible to run a backup of only a specific VM inside a job in Veeam 8? Today you can run a retry on a single VM, so from my perspective I can't understand why it is not possible.
The explanation is very easy: this functionality would mess up the retention policy of the entire job. Unlike the retry process, this would need to create a new restore point, and such restore points would need to be accounted for somehow differently, which means a lot of new logic in a very critical place (retention policy). Any bug could potentially result in backup data loss. Most things are not as easy as they may appear to a non-developer.
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Slower with parallel processing

Post by tsightler »

OK, you still didn't really answer my earlier questions in full, so I'll detail the questions and we'll go from there:

1. Forward or reverse incremental?
2. I understand the multiple repositories (40 TB), but what is the disk configuration (i.e. number of disks, RAID type, etc.)?

Why am I asking these questions? Because assuming you're using reverse incremental, and assuming your storage is 12 or more spindles with moderate latency (i.e. SATA disk via iSCSI), you likely won't get full performance from your repository storage without having multiple I/O streams (i.e. multiple jobs) running at one time. Based on your setup you may now have fewer jobs actually running concurrently, which means fewer I/O streams on the repository.
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

Gostev wrote: The explanation is very easy: this functionality would mess up the retention policy of the entire job. Unlike the retry process, this would need to create a new restore point, and such restore points would need to be accounted for somehow differently, which means a lot of new logic in a very critical place (retention policy). Any bug could potentially result in backup data loss. Most things are not as easy as they may appear to a non-developer.
Ok thank you for the explanation. :)

So you are saying there is no plan to introduce this feature?

\Masonit
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

tsightler wrote:OK, you still didn't really answer my earlier questions in full, so I'll detail the questions and we'll go from there:

1. Forward or reverse incremental?
2. I understand the multiple repositories (40 TB), but what is the disk configuration (i.e. number of disks, RAID type, etc.)?

Why am I asking these questions? Because assuming you're using reverse incremental, and assuming your storage is 12 or more spindles with moderate latency (i.e. SATA disk via iSCSI), you likely won't get full performance from your repository storage without having multiple I/O streams (i.e. multiple jobs) running at one time. Based on your setup you may now have fewer jobs actually running concurrently, which means fewer I/O streams on the repository.
Well, I have already answered all your questions in the thread, but I'll try again. :wink:

1: Reverse incremental
2: Disks are 17 × 3 TB SATA MDL drives, RAID 6, stripe size 512K.

As I said, we don't limit anything on the repository, but on the proxy we have set 20 concurrent tasks. Last night I tried with 40 concurrent tasks, but that didn't help. I could try with even more, but 40 concurrent tasks should be enough to get enough streams. If we allow too many concurrent tasks we get another problem, where there are a lot of active snapshots in VMware. Because of the heavy load on the storage, backups take forever, and eventually, when the backup is done, removing the snapshots uses a lot of I/O in VMware, and that's never good.

\Masonit
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Slower with parallel processing

Post by foggy »

masonit wrote:As I said, we don't limit anything on the repository, but on the proxy we have set 20 concurrent tasks. Last night I tried with 40 concurrent tasks, but that didn't help. I could try with even more, but 40 concurrent tasks should be enough to get enough streams.
Please review this post; I believe this is what Tom is trying to explain here.
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Slower with parallel processing

Post by tsightler » 5 people like this post

masonit wrote:Well I have already answered all your questions in the thread but I try again. :wink:
Oops, sorry about that. I guess that's what I get for answering a post first thing in the morning. :oops: Please accept my sincere apology.
masonit wrote:As I said, we don't limit anything on the repository, but on the proxy we have set 20 concurrent tasks. Last night I tried with 40 concurrent tasks, but that didn't help. I could try with even more, but 40 concurrent tasks should be enough to get enough streams. If we allow too many concurrent tasks we get another problem, where there are a lot of active snapshots in VMware. Because of the heavy load on the storage, backups take forever, and eventually, when the backup is done, removing the snapshots uses a lot of I/O in VMware, and that's never good.
So this is why I'm thinking it's taking longer. I'm assuming you have quite a number of jobs, i.e., enough that you were able to run 20 jobs concurrently at one time. This would have created 20 I/O streams with only 20 VM snapshots open, and would likely have kept your repository busy pretty much 100% of the time (pretty much confirmed by the bottleneck stats of 99%).

Now you have 20 tasks, which is completely different. If a job has 25 VMs, and some of the VMs have multiple disks, 20 tasks may very well keep only one job running. Even though it is actively backing up 20 VM disks, they are all going to one single backup file, and thus only one I/O stream on the repository.

To get maximum performance from your setup, I would anticipate that you would need 4-6 I/O streams to fully saturate the backend storage. Where does this number come from? You have 17 disks in your array, with a 512K stripe size. Typical I/O size for Veeam is also 512K (assuming Local target), which means that a typical Veeam I/O will be serviced by only a single drive in the array. Since you are using reverse incremental, each changed block creates three I/Os (two writes, one read), and parity bytes will need to be read and re-written, which also increases the number of disks used. Overall, however, there's no way that a single job can saturate the spindles.
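For what it's worth, the 4-6 stream estimate can be reproduced with back-of-envelope arithmetic. A sketch under the stated assumptions only (512K Veeam block size, 512K stripe, three I/Os per changed block; parity overhead and caching are ignored):

```python
import math

SPINDLES = 17            # disks in the RAID 6 array (from the post above)
STRIPE_KB = 512          # array stripe size
VEEAM_IO_KB = 512        # default "Local target" block size

# Each 512K Veeam I/O fits in one stripe element, so it lands on one drive.
drives_per_io = math.ceil(VEEAM_IO_KB / STRIPE_KB)

# Reverse incremental turns each changed block into three I/Os:
# read the old block, write the new block, write the old block to the rollback.
ios_per_changed_block = 3

# Rough count of spindles one job's I/O stream keeps busy at a time:
busy_spindles = drives_per_io * ios_per_changed_block

# Streams needed before every spindle has work:
streams_to_saturate = math.ceil(SPINDLES / busy_spindles)
print(busy_spindles, streams_to_saturate)  # 3 spindles per stream -> ~6 streams
```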

If I were you, I would reorganize my repositories to get optimal performance from the storage array, as well as optimal parallel processing. For example's sake, here's what I'm talking about:

Non-Parallel Config
20 Backup Jobs
1 Repository - No Limits
1 Proxy - 20 tasks
With this config you could potentially have all 20 jobs running at the same time, each processing a single VM disk. This provides good performance and likely completely saturates the repository, but it also likely means that VMs take longer than they should, since the repository can't really handle I/O for all 20 jobs at once, so it's the bottleneck.

Parallel Config - non-Ideal
20 Backup Jobs
1 Repository - No Limits
1 Proxy - 20 tasks
This is basically what happens if you just tick the "Enable parallel processing" option. Now, even if you start all 20 backup jobs at the same time, only a small subset (perhaps even only one) will actually run, because all 20 task slots may be assigned to a single job. This means there may be only 1 or 2 I/O streams going to the repository, thus not fully utilizing the repository's available capacity for random I/O.

Parallel Config - Ideal
20 Backup Jobs - 4 jobs per repository
5 Repositories (just sub directories on the same disk subsystem) - 4 tasks per repository
1 Proxy - 20 tasks
This setup requires a little more planning, but it allows all 20 jobs to start at the same time and makes sure that 5 I/O streams are going to the repository at all times (well, assuming a given repository has jobs pending). This should generate approximately enough I/O to keep the repository saturated without completely overrunning it, thus optimizing the time any given VM's snapshot is held open, and it allows for the additional efficiencies of parallel processing.

Note that the number of repositories is more of an example based on the information you've provided, but it fits other profiles I've worked with in the field. I have an IOmeter profile that you can use to test and see how many I/O threads are optimal for your repository. One of the most common mistakes I see is that people think of "tasks" as I/O threads, but that's not the case. "Jobs" define I/O threads, so a single job with 20 tasks is still one I/O thread. Indeed, if the repository needs multiple I/O threads to hit saturation, having 20 jobs each with one "task" (which is 20 I/O threads) can easily be faster than 1 job with 20 "tasks" (which is only 1 I/O thread).
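The slot-allocation effect described in the three configs above can be sketched as a toy scheduler. This is my own simplification, not Veeam's actual resource scheduler, and the job/disk counts are hypothetical:

```python
# With parallel processing, proxy slots are granted per *disk*, while the
# repository sees one I/O stream per *job* that holds at least one slot.
def streams_on_repo(job_disk_counts, proxy_slots):
    """Greedily hand proxy slots to jobs in queue order, disk by disk,
    and count how many distinct jobs end up running (= I/O streams)."""
    running_jobs = 0
    for disks in job_disk_counts:
        if proxy_slots == 0:
            break
        proxy_slots -= min(disks, proxy_slots)
        running_jobs += 1
    return running_jobs

# One big 20-disk job at the front of the queue soaks up all 20 slots:
print(streams_on_repo([20, 5, 5, 5], proxy_slots=20))       # 1 stream
# Capping each job at 4 slots (e.g. via per-repository task limits)
# spreads the same 20 slots across 5 jobs:
print(streams_on_repo([4, 4, 4, 4, 4], proxy_slots=20))     # 5 streams
```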
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Slower with parallel processing

Post by tsightler »

BTW, you had a really strong setup without parallel processing, with a huge percentage of the bottleneck being the repository, so no matter what, there may not be a huge decrease in your backup times even with an optimal setup. I would think that a perfect parallel processing setup might be able to reduce the window by 10-20% just from some of the other efficiencies that are introduced (being able to assign proxy resources to the next tasks immediately, etc.), but I wouldn't expect some of the more dramatic improvements that some customers were seeing, because your existing design was already effectively doing "parallel processing"; it was just doing so by running lots of jobs concurrently.
masonit
Service Provider
Posts: 325
Liked: 23 times
Joined: Oct 09, 2012 2:30 pm
Full Name: Maso
Contact:

Re: Slower with parallel processing

Post by masonit »

Hi tsightler

Sorry that I haven't replied until now. Great post thanks! :)

I hear you and understand what you are saying. On average we have ~50 jobs per repository. The biggest job is 19 VMs and the smallest 1 VM; the average is maybe 6-7 VMs per job. With the 40 concurrent tasks I tried, there should be enough streams to saturate the storage. I understand that your last example is optimal performance-wise, but not administration-wise.. ;)

Say I have 5 jobs with 1 VM in each job and concurrent tasks set to 5, and I run these jobs at the same time first in non-parallel and then in parallel mode. The changed data is exactly the same. Should these jobs require the same amount of I/O, or is there an overhead just from using parallel processing?

\Masonit
dellock6
Veeam Software
Posts: 6137
Liked: 1928 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Slower with parallel processing

Post by dellock6 »

Hi Magnus,
no, the difference with parallel processing is the complete filling of the available processing slots and the processing of single VMDKs instead of complete VMs. The final amount of I/O is determined by the VMs inside each job: since we do per-job dedup and compression, more VMs in the same job means more deduplication, and since we do source-side deduplication, this means less I/O against the repository.
The best configuration dedup-wise would be one huge job with all VMs in it, using parallel processing to process all the VMs inside it. Without parallel processing, this single job would be processed sequentially, one VM at a time, thus becoming incredibly slow.
Obviously, there are then additional considerations, like having a single backup file that is too big to be managed, and the fact that one job creates only 1 I/O stream against the repository, while modern repositories can handle multiple concurrent streams.

Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
Moebius
Veeam ProPartner
Posts: 206
Liked: 28 times
Joined: Jun 09, 2009 2:48 pm
Full Name: Lucio Mazzi
Location: Reggio Emilia, Italy
Contact:

Re: Slower with parallel processing

Post by Moebius »

This is very useful information. I, for one, hadn't realized the difference between tasks and I/O streams. I believe this is of capital importance and deserves to be highlighted better; a forum thread can easily be missed. How about a well-written Best Practices guide? I seem to remember it was announced some time ago. Or maybe it's just me who skipped it as well.

Anyway, I'd welcome that IOmeter profile that Tom is talking about. I just got new hardware and am trying to figure out how to best configure Veeam for it.
Thanks.
dellock6
Veeam Software
Posts: 6137
Liked: 1928 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Slower with parallel processing

Post by dellock6 »

You remember well, Lucio, the paper about repository performance is under review. Since the QC and R&D teams are under heavy load finishing all the activities for the release of v8, expect the paper around the GA release. I promise to update this thread once the paper is out (and you will also see the paper promoted on my blog ;)).

Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Slower with parallel processing

Post by tsightler »

In the meantime, here is a link to a couple of IOmeter profiles that might be of some use. You can add threads to determine the ideal number of I/O streams for your storage.

You can also get a reasonable estimate if you know the stripe size. For example, using defaults, Veeam's I/O size is about 512K on average. Assuming you have a RAID 6 system with a 256K stripe size, each I/O will touch at least two disks, and a write I/O will touch at least four disks. Using this, we can estimate that a single reverse incremental I/O stream should be able to keep about 10 disks fairly busy in RAID 6, so I would configure the system so that at least one job is running for every 10 disks. Based on caching and other factors you might need more than this to maximize performance, but it's unlikely to be less than this number.

My "one size fits all" rule of thumb is that at least 1 I/O stream for every 6-8 disks in a target storage system is likely required to get maximum performance from reverse incremental or synthetic operations. This is far less critical for active full/incremental backups, although even in those cases I/O streams can help.
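The two estimation rules above can be combined into a small helper. Again just a sketch: the per-I/O disk counts come from the 256K-stripe example in the previous paragraph, and the 30-disk array is a hypothetical, not a measured system:

```python
import math

def disks_touched_per_stream(reads_per_block: int, writes_per_block: int,
                             disks_per_read: int, disks_per_write: int) -> int:
    """Disks kept busy by one reverse incremental I/O stream."""
    return reads_per_block * disks_per_read + writes_per_block * disks_per_write

def streams_needed(total_disks: int, disks_per_stream: int) -> int:
    """Minimum concurrent jobs (I/O streams) to keep every disk busy."""
    return math.ceil(total_disks / disks_per_stream)

# 256K stripe, ~512K Veeam I/O: a read touches >= 2 disks, a write >= 4;
# reverse incremental does 1 read + 2 writes per changed block.
per_stream = disks_touched_per_stream(1, 2, disks_per_read=2, disks_per_write=4)
print(per_stream)                      # 10 disks busy per stream

# Hypothetical 30-disk repository, stripe-based estimate vs. the
# conservative "one stream per 6-8 disks" rule of thumb:
print(streams_needed(30, per_stream))                    # 3
print(streams_needed(30, 8), streams_needed(30, 6))      # 4 5
```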