Outgrown veeam backup server - now what?

Post by **svallarian** » Jun 24, 2011 4:20 pm

Been troubleshooting a case with support of jobs not firing off when scheduled. We've both about come to the conclusion that my physical veeam backup server just doesn't have enough horsepower to do what we're trying to make it do.

Here's what we're backing up:
70 VMs, every night. (and growing!)
Scheduled to fire off every 15 minutes or so. They're all staggered through the night.

a quick snapshot of a typical job run --
3 backups running eats 40% CPU (of 8 processor cores! - 2 quads), 150MB RAM per VM backup running, and 15% network utilization.
I'm using Direct SAN access.

I'm not sure what the best method is to fix the problem.

Do I buy another physical server?
Do I make a VM, and run the backups inside the VM?
Do I try and upgrade the physical server with more processors?

Thanks,
Steven Vallarian

Post by **tsightler** » Jun 24, 2011 4:25 pm

Just to make sure I understand, you're backing up all 70 VMs every 15 minutes? Or are you saying you're backing up 70 VMs with separate jobs 15 minutes apart?

Post by **svallarian** » Jun 24, 2011 4:31 pm

Separate jobs, 5-15 minutes apart.

Post by **tsightler** » Jun 24, 2011 5:11 pm

Well, my generic observation is that this problem can't possibly be due to "not having enough horsepower". We back up a similar number of VMs (~70) with an old server that's just a dual-CPU hyperthreaded system from 2005. Now admittedly we don't run 70 different jobs, just 6 jobs or so, but we can run them all simultaneously and the CPUs will be pegged at 100%, yet no job will simply fail to start. Our entire backup run takes around 4 hours or so.

Are you getting an error with the jobs, or are they simply failing to actually kick off? With your stagger of 5-15 minutes, what's the maximum number of jobs that are running at a single instant?
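If it helps to answer the "maximum number of jobs at a single instant" question, here is a minimal sketch that counts peak overlap from job start and end times. The schedule below (8 jobs, 15 minutes apart, 50 minutes each) is made up for illustration; you would plug in real times from your job history.

```python
from datetime import datetime, timedelta

def max_concurrent(jobs):
    """Return the peak number of jobs running at the same instant.

    `jobs` is a list of (start, end) datetime pairs.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))    # a job begins
        events.append((end, -1))     # a job ends
    # Sort by time; at equal times, ends (-1) sort before starts (+1),
    # so a job ending at 22:00 does not overlap one starting at 22:00.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

# Hypothetical schedule: jobs start 15 minutes apart and each runs 50 minutes.
first = datetime(2011, 6, 24, 19, 0)
jobs = [(first + timedelta(minutes=15 * i),
         first + timedelta(minutes=15 * i + 50)) for i in range(8)]
print(max_concurrent(jobs))
```

With those made-up numbers the peak is 4 jobs at once, even though no two jobs start at the same time.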

If it really is a "horsepower" issue, then it feels far more likely that you're running out of IOPS than CPU. What type of storage are you writing to?

On another note, why are you backing up 70 VMs with 70 different jobs? That's not what Veeam is designed to do, so you're not really using it in a way that is ideal, which is possibly why you're having issues. It would likely be best to use the product as designed rather than trying to force it to be something it's not.

Post by **svallarian** » Jun 24, 2011 6:19 pm

The jobs are just failing to kick off. I'm thinking that when some of the jobs run long, the machine can't find the time to kick off the next job. I try to run 4 jobs at a time. The machine (which also doubles as my vCenter server) gets completely unresponsive if I go over 4.

We're writing to locally attached near-line SATA.

We probably are using the product wrong, but let me tell you what we've tried. I set up 6 jobs, one for each ESX host. This works fine most of the time, but when the job fails, it fails the entire host.

For example, we've got one host with 20 VMs on it. It'll run through 11 and fail on the 12th. What was getting me was we would have to restart the entire job - and it would back up all of the previously backed up machines again.

The individual jobs gave me the granularity to just restart the two or three jobs that might have had trouble, so we could be ready to roll by morning.

But if there's a better way I'm all ears!

Post by **Gostev** » Jun 24, 2011 6:26 pm

svallarian wrote:For example, we've got one host with 20 VMs on it. It'll run through 11 and fail on the 12th. What was getting me was we would have to restart the entire job - and it would backup all of the previously backed up machines again.

That's not how it works... if your job fails to process certain VMs, then the following automatic retry cycle will only process those VMs that failed during the "main" run, skipping successful VMs. I assume here that you have automated retries enabled in your job scheduling settings.
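To make the retry behavior concrete, here is a rough sketch of the logic described above. This is only an illustration of "retry the failed VMs, skip the successful ones", not the product's actual code; `run_job`, `backup_vm` and `max_retries` are made-up names.

```python
def run_job(vms, backup_vm, max_retries=3):
    """Back up every VM once, then retry only the VMs that failed.

    Returns the list of VMs that still failed after all retries.
    """
    pending = list(vms)
    for _ in range(1 + max_retries):      # main run plus retry cycles
        failed = []
        for vm in pending:
            try:
                backup_vm(vm)
            except Exception:
                failed.append(vm)         # remember it, keep going
        if not failed:
            return []
        pending = failed                  # next cycle skips successful VMs
    return pending

# Hypothetical host with 20 VMs where vm12 fails on its first attempt only.
attempts = []

def flaky_backup(vm):
    attempts.append(vm)
    if vm == "vm12" and attempts.count("vm12") == 1:
        raise RuntimeError("simulated failure")

vms = [f"vm{i}" for i in range(1, 21)]
still_failed = run_job(vms, flaky_backup)
print(still_failed, len(attempts))
```

In this simulation there are 21 backup attempts in total: 20 in the main run and one retry for vm12. The other 19 VMs are not processed again.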

If you are not seeing this behavior, please open a support case and let us take a look at why. But generally speaking, retry functionality has been in the product for a few years now, and it should not have bugs like that.

Post by **svallarian** » Jun 24, 2011 6:58 pm

We had the automatic retry set, but it would get "Job has failed unexpectedly" and then not retry. We'll retry the 6-job method, and let support know.

Post by **tsightler** » Jun 24, 2011 8:41 pm

OK, I see, so you split out the jobs because of problems you were having with the jobs failing.

So really, we need to figure out why the jobs are failing, as that's the critical part of the equation. It was not clear in your original posts that the Veeam server and the vCenter server were both on the same hardware. I would generally not suggest this. vCenter is not particularly demanding, but it does require a SQL server, which has a tendency to gobble all of the memory. When another process starts, like Veeam, SQL will not easily give up its allocated memory and you'll end up with a lot of swapping and a poorly performing system. Not only that, but Veeam needs SQL too, so that's even more memory.

That being said, if you were running say 1-2 jobs, with "normal" compression, I'd still think you'd be OK. The logs should give some indication as to what's actually causing the job to fail. Is it a process issue (perhaps from memory pressure) or a communication issue with vCenter?

In the example you gave, you had an ESX host with 20 VMs and the job failed on VM 12. Does the job continue to back up the remaining 8 VMs, or does the job just stop? If it just stops, there's certainly more to it than just "running out of horsepower".

What type of system/OS/memory are we talking about here? 32-bit or 64-bit (I guess it's got to be 64-bit since vCenter requires that now, right?). Are you using the 64-bit version of Veeam? What does memory look like when Veeam is running? If your storage is fast, then running four Veeam jobs simultaneously can put tremendous load on the CPUs (given a high performance storage system and maximum dedupe/compression, even a single job can saturate 4 or more cores of a modern CPU). Why not create just 2 big jobs with 3 hosts each and see how that works for you? Having lots of jobs on a CPU-saturated machine won't make the backups go any faster and will lead to random failures.

I really wish Veeam had a way to limit the maximum number of simultaneous jobs as it would make the scheduling for issues like this much easier.
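Just to show the kind of limit I mean, here is a generic sketch of capping parallel jobs with a worker pool. It is not a Veeam feature or API; `run_limited` and `fake_job` are made-up names, and `fake_job` only sleeps in place of whatever would actually launch a job.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_limited(job_names, run_backup_job, max_parallel=2):
    """Run every job, but never more than `max_parallel` at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # list() waits for all jobs and re-raises any job exception.
        return list(pool.map(run_backup_job, job_names))

# Stand-in for a real job launcher: it records how many "jobs" overlap.
lock = threading.Lock()
stats = {"running": 0, "peak": 0}

def fake_job(name):
    with lock:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
    time.sleep(0.05)                      # pretend to do some work
    with lock:
        stats["running"] -= 1
    return name

results = run_limited(["job1", "job2", "job3", "job4"], fake_job)
print(results, "peak concurrency:", stats["peak"])
```

The jobs queue up behind the pool, so a long-running job delays the ones behind it instead of piling more load onto an already saturated machine.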

Post by **Gostev** » Jun 24, 2011 9:08 pm

svallarian wrote:Job has failed unexpectedly

Oh, this is kind of a "very bad" failure, and it may indeed kill the job completely. It definitely should not be happening at all, and must be investigated. Are you running the latest product version? As far as I remember, one of the previous maintenance releases (quite a while ago) fixed the issue that was causing such an error.

Post by **svallarian** » Jun 24, 2011 9:50 pm

gostev:
5.0.2.230 (64-bit) version of veeam.

tsightler:

The job just fails outright and doesn't continue.

Server 2008 r2, 64-bit, 6GB RAM, 2 Xeon 5520s.
I think we'll have to go with more than two jobs - if one of the VMs gets out of hand and runs long it might overrun the backup window.

Memory runs about 150MB per job. The CPU gets hit pretty good during the backups. I'm using Optimal/Local Target for the compression settings.

Post by **tsightler** » Jun 25, 2011 1:03 am

Curious, how tight is your backup window? We back up ~70 VMs that total around 10TB, and backups complete in about 4-6 hours. We do this with 4 jobs, configured to start 1 hour apart so that no more than 2 jobs are running simultaneously. There's actually a good bit of dead time in our backups. The fact that you were attempting to run 70 backup jobs at 15 minute intervals implied that you had a pretty reasonable backup window.

To answer your initial question, i.e. "Now what?", I guess if I didn't have more physical hardware I'd just create a few Veeam VMs and use virtual appliance mode.

Post by **svallarian** » Jul 05, 2011 2:30 pm

An update - we moved to a 4 job set, one for each ESX server. Over the last four days, it only failed catastrophically once - an error message saying that the machine didn't have enough resources to continue. We are now having the problem of the backup jobs going long - 12 hours+ for all of the jobs to complete. I'm starting them all at 7pm, and they aren't getting done till about 11am the next morning.

I've got all 4 jobs running concurrently so that I can try and meet the backup window.

What is happening is that one VM - running SQL server of course - is running 2-3 hours and holding up the rest of the backup jobs.

We're going to try and change the compression level from Optimal to Low to see if that helps any.

Barring that, is my only option another veeam backup device?

Post by **Gostev** » Jul 05, 2011 3:09 pm

Yes, unless you can identify and remove the current bottleneck. It could be either production storage or your backup storage speed. I assume here (from what you said above) that your backup server's CPU is not sitting at 100% when you run 4 jobs in parallel, meaning that the bottleneck is storage. But if the CPU is at 100%, reducing compression to Low will indeed let you reduce the load.

Of course, you can also improve incremental run performance about 3 times by changing reversed incremental backup mode to regular incremental (not sure which mode you are using).
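The usual explanation for the roughly 3x figure is that reversed incremental touches the backup target three times per changed block (read the old block from the full, write it to the rollback file, write the new block into the full), while regular incremental writes the new block once. A back-of-the-envelope sketch under that assumption; the per-block counts are a simplification and the block count is made up, not measured:

```python
# Assumed backup-target I/O operations per changed block (simplified).
IO_PER_CHANGED_BLOCK = {
    "incremental": 1,            # write new block to the increment file
    "reversed_incremental": 3,   # read old block, write rollback, write full
}

def target_io_ops(changed_blocks, mode):
    """Estimate backup-target I/O operations for one incremental run."""
    return changed_blocks * IO_PER_CHANGED_BLOCK[mode]

changed = 100_000  # hypothetical number of changed blocks in one night
forward_ops = target_io_ops(changed, "incremental")
reverse_ops = target_io_ops(changed, "reversed_incremental")
print(forward_ops, reverse_ops, reverse_ops / forward_ops)
```

On slow target storage those extra operations are mostly random I/O, which is why the difference shows up most on VMs with a lot of changed blocks.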

Anyway, adding new backup server is not a big deal (this is why we have added Enterprise Manager in v4 in the first place, to allow you easily scale your deployment). Since your environment is growing, you will have to add one sooner or later. BTW, we have customers who are using 50 and more backup servers.

Post by **tsightler** » Jul 05, 2011 3:10 pm

Are you running forward or reverse incremental?

Why is one SQL server holding up the other 3 jobs, or is it just that job running over?

What is the CPU usage when all of the backups are running at the same time?

The question you really have to answer before deciding whether another server would help is this: are you hitting actual limits of the existing server, or are you hitting some other limit (bandwidth, source or target IOPS, etc.) that's keeping you from fully utilizing it? Based on your previous posts indicating 40% CPU utilization (if this is accurate), I don't think you're using the full capacity of your server, as Veeam running 4 jobs should easily saturate the CPUs on your server (really, with optimal compression/dedupe and no other bottleneck, Veeam could come close to saturating your CPU with just one job).

If you have a few servers that take especially long to run their backups (for example SQL and Exchange servers with lots of random block changes) then it's probably a good idea to split those out into their own job. That's what we do in our environment. We actually have 4 jobs, but they're not broken down by ESX hosts, but rather by VM category based on our retention requirements and the amount of changes the VM's see per day. We choose the job mode that best fits the performance goal and retention requirements.
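As a rough illustration of grouping by category rather than by host, something like the sketch below could turn a VM inventory into a job plan. The field names, the 20GB/day threshold and the sample VMs are all made up for the example; they are not our real numbers or anything Veeam exposes.

```python
from collections import defaultdict

def plan_jobs(vms, high_change_gb=20):
    """Group VMs into jobs by change rate and retention requirement."""
    jobs = defaultdict(list)
    for vm in vms:
        if vm["daily_change_gb"] >= high_change_gb:
            change = "high-change"    # e.g. busy SQL or Exchange servers
        else:
            change = "low-change"
        jobs[(change, vm["retention_days"])].append(vm["name"])
    return dict(jobs)

# Hypothetical inventory.
inventory = [
    {"name": "sql01",  "daily_change_gb": 60, "retention_days": 14},
    {"name": "exch01", "daily_change_gb": 45, "retention_days": 14},
    {"name": "web01",  "daily_change_gb": 2,  "retention_days": 14},
    {"name": "file01", "daily_change_gb": 5,  "retention_days": 30},
]
plan = plan_jobs(inventory)
for key, names in plan.items():
    print(key, names)
```

Each resulting group can then get the backup mode that suits it, so the heavy VMs don't hold up everything else on their host.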

Post by **tsightler** » Jul 05, 2011 3:26 pm

I don't know if you're willing to share more details, but I'd love to try to understand why your jobs are taking so long to complete, as our environments are similar in size and our server hardware is far inferior, yet our backups are much faster. Our backups of 67 VMs, totaling about 9TB, finish in about 3 hours on average, and rarely take longer than 6 hours even on very heavy days. Feel free to PM me if you'd like to dig more and attempt to pinpoint the bottleneck.

Post by **svallarian** » Jul 05, 2011 3:35 pm

Reverse Incremental - so I guess we will give plain incrementals a go.

It's just that job (and some of the other jobs) running over due to randomly spaced out SQL servers.

CPU is at 80-100% while all 4 jobs are running. I wasn't looking at the performance counters correctly on the previous post.

..and yeah, we're growing at 2-4 SQL VM's a month.

Post by **Gostev** » Jul 05, 2011 3:40 pm

Yes, switching to incremental backup mode might be an easy "fix" for incremental backup performance (albeit at the cost of backup storage, of course). Decisions, decisions...

Post by **tsightler** » Jul 05, 2011 3:51 pm

So SQL is really a worst case scenario for Veeam reverse incrementals, at least if there is a relatively high number of changes. In many cases you can run full backups faster than reverse incrementals if your SQL servers are seeing relatively high rates of random change. Forward incrementals can be a huge benefit here, and you can decide your strategy with them. We always use forward incrementals for VMs with a high random change rate, and we then set the jobs to either run weekly fulls or transform the incrementals into rollbacks every night, based on retention requirements. The advantage of transforming incrementals into rollbacks every night, instead of just running reverse incrementals, is that the backup itself will be done more quickly and, while the transform process will take more time, it's not impacting your production environment.

But, for very high change rate VM's, we've found it's just better to run forward incrementals, with a weekly full. Yes it uses more space, but it's by far the fastest for this particular system type.
