
Reported VM sizes and FLR issues since upgrade to 4.1

Post by cby »

I've encountered a rather puzzling situation since upgrading Veeam Backup from 3.1.1 to 4.1. I have registered the matter with Veeam support (case #519016), but I was wondering if any other users have experienced similar issues...

In summary:
The backup job is made up of 2 Linux VMs. On completion of the backup under 4.1, the VM statistics report a dramatic increase in the size of one of the 2 VMs.

As an example:
3.1.1 reports VM1 as 68GB and VM2 as 358GB. These tally exactly with the designated VM sizes as originally set up.
4.1 reports VM1 as 69GB and VM2 as 1.36TB which is almost a fourfold increase!

Bizarrely, the above backup used to take approx. 3hrs 40mins under 3.1.1; under 4.1 it takes 1hr 55mins, so that's a huge increase in throughput. We have 8 concurrent jobs backing up Linux VMs and all have shown a big reduction in backup times. However, the Windows backup job has actually increased from 15 to 40 minutes for 2 VMs of 40GB each (though 4.1 reports one of those VMs as 140GB). Very odd.

Throughput stats for the above Linux job:
3.1.1 reports 91MB/s for VM1 and 31MB/s for VM2
4.1 reports 131MB/s for VM1 and 238MB/s for VM2

The throughput figures may well be skewed by the misreported VM2 size but the fact is, the backup completed in about half the time with 4.1.

Has anyone seen anything similar? It doesn't seem to impact the operation of Veeam Backup except when running File Level Restore...

With FLR I am seeing ridiculous amounts of time waiting for the file browser to come up once the FLR appliance is booted -- up to 7 minutes is not unusual. The wait seems to be directly related to the size of the VM (as reported in the stats above). Furthermore, the browser will often time out with only one of the 6 partitions in the backup file visible. The potential causes of this scenario have been considered, but none apply in this case. As mentioned earlier, the problems are currently with Veeam support, but any suggestions/observations from other users would be very helpful.

Thanks

Post by cby »

...just to add that the backup mode has remained VCB/SAN under both 3.1.1 and 4.1, though I will be looking at the vStorage API soon.


Post by Gostev »

Could be some glitch with the specific VM. Let's see what support has to say after investigating the log files.

Post by cby »

Well, I would go along with the idea that a specific VM has an issue, but the fact that 1 of the 2 VMs in 5 jobs out of 8 is misreported by a very large margin leads me to believe that it's more than a 'glitch'. The remaining 3 jobs are also reported incorrectly, but not by the same factor. Could be some legacy stuff from a .vmx file.

I agree, let's wait for the tech support response, but I was interested to know if anyone else has experienced similar behaviour.

Post by tsightler »

Well, I can only confirm that we haven't seen anything like this; our job sizes seem to be 100% accurate. That being said, we recreated all of our jobs from scratch when we moved from 3.1 to 4.x, not because of any Veeam issue, but because we wanted to reorganize our jobs into something more optimized and easier to manage with our new Veeam 4/ESX 4 combo. We previously ran many small jobs in parallel at 10-15 minute intervals; now we have four jobs, pretty much one for each datastore. Our job statistics have always appeared to be accurate.

I do sometimes think the FLR appliance is slow, but it's not horrible. I just tested an FLR of a 600+GB Linux system using the oldest rollback (30 days) and it took about 3 minutes to bring up the browser window. An FLR of a 1.2TB Windows system took about 4 minutes. These times are quite a bit faster if I use a local NTFS volume, but we back up to Linux targets across a 1Gb MetroEthernet link and this seems to add some overhead to the restore process, at least for file-level restores.

It will be interesting to see what support finds in your case. I'd suspect some glitch in upgrading the job stats from the old version, but I really have no idea. Have you considered recreating your jobs, or at least creating a new test job of the same systems to see if they also exhibit the same problem? Just a thought.

Post by cby »

tsightler

Thanks for the information and suggestion.

I did create a new test job in 4.1 and that produced some interesting results depending on the backup mode used.

Using vStorage API mode, the statistics reported the correct VM size. However, the mode dropped back to SAN/NBD and ran at a slow throughput rate of 60MB/s. I then ran a job in VCB/SAN mode (our current setup). This was looking good throughput-wise, then it reported a 358GB VM as 716GB (exactly double -- coincidence?) and the throughput was reported as a remarkable 2GB/s, though this wasn't the case in real terms. I didn't let either backup run to completion, but the VCB/SAN one was showing 99% complete when in fact it had only completed 3%! Like I said, all very interesting.

The 4.1 migration retained VCB/SAN mode. Is shared storage essential with ESX 3.5 to take advantage of vStorage API?

I wanted to get to the bottom of the current issue in the event that there is a simple 'fix'. If it is a case of re-creating jobs then I'll do so. But the test jobs seem to be telling a different story.

I have kept Veeam support up to date with latest developments.

Post by tsightler »

To be totally honest, I left VCB/SAN mode as quickly as I possibly could. Even though VCB generally worked fine for us, even when we used it with products prior to Veeam, I always thought it was a pretty poorly designed infrastructure. As far as I know, if VCB/SAN works for you then vStorage API SAN mode should work too, as the requirements are the same. You can't really use VCB/SAN mode if you don't have shared storage.

Post by Gostev »

Tom is correct: if you can use VCB/SAN today, there is no reason not to switch to vStorage/SAN. Although, to answer your question: no, the vStorage API does not require shared storage, as vStorage API with the "Network" option will work fine for ESX hosts with local storage.

Post by tsightler »

cby wrote: Using vStorage API mode, the statistics reported the correct VM size. However, the mode dropped back to SAN/NBD and ran at a slow throughput rate of 60MB/s. I then ran a job in VCB/SAN mode (our current setup).
BTW, are you sure the 60MB/s is slow, or is it simply accurate? It's difficult to know whether the numbers reported in VCB mode are accurate given its inaccurate size reporting. 60MB/s is pretty fast for the vStorage API in SAN/NBD mode, based on my experience. It also sounds about right given your reported backup time: 60MB/s would back up about 432GB in 2 hours. You reported your VMs as 68GB and 358GB, which totals 426GB, and the time as 1hr 55min, which works out to pretty much exactly 60MB/s. If that mode works for you, and reports correct statistics, that might be the easiest "fix".
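
Just to show the arithmetic -- a quick sanity check in Python, a minimal sketch using only the sizes and times quoted in this thread (that the full VM size is read on each run is an assumption):

    # Rough sanity check of the rate estimate above.
    size_gb = 68 + 358                       # the two VMs as reported under 3.1.1
    duration_s = (1 * 60 + 55) * 60          # 1hr 55min job time, in seconds
    rate_mb_s = size_gb * 1024 / duration_s  # assumes the full size is read
    print(f"{rate_mb_s:.0f} MB/s")           # ~63 MB/s, close to the 60MB/s estimate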

Post by cby »

Seems to me that the figures reported under VCB/SAN are skewed and largely irrelevant. Based purely on completion times (about twice as fast under 4.1 as under 3.1.1), the throughput is faster and the backup job is sound. The FLR issue *appears* to be tied to the incorrectly reported VM size, given the amount of time the file browser takes to start up.

I'll set all the concurrent jobs to run with vStorage API and monitor the outcome.

Btw, why is the backup mode reported as SAN/NBD in the stats window when vStorage API is selected?

Post by Gostev »

SAN/NBD is short for "SAN with failover to NBD": the job will attempt to connect directly to the storage, and if that connection fails, it will go over the network through the host (there will be an extra warning about this in the session results).

Post by cby »

Thanks. NBD has connotations of slow backups from version 3, hence my concern.

Post by cby »

The FLR timeout problem is a known bug due to be fixed in a forthcoming release -- no date yet. It's a real show-stopper for us!

Converting all the jobs from VCB/SAN to vStorage API backup mode did not produce the anticipated improvements in backup times. In fact, backup times have increased comparing 4.1 in vStorage mode against 3.1.1 with VCB/SAN: for 7 of our VMs (160GB - 334GB) there is an average increase of about 33% across the board, while the remaining, largest VM (426GB) has shown an improvement of about 15%. I shall continue to monitor backup times, but I wonder if anyone has any views/experience on the matter or, better still, suggestions for improving these times.

Example (worst case, nearly a 60% increase in backup time):
3.1.1 - 2xVMs = 334GB - backed up over VCB/SAN in 85 minutes, reported avg transfer rate of 44MB/s
4.1 - 2xVMs = 334GB - backed up over the vStorage API in 133 minutes, reported avg transfer rate of 43MB/s (?)
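
For what it's worth, a quick check of those figures (a minimal sketch, assuming the full 334GB is read on each run) shows the 4.1 reported rate at least matches its elapsed time, while the 3.1.1 reported rate does not -- consistent with the earlier observation that the VCB/SAN figures are skewed:

    # Compare reported transfer rates against rates implied by the elapsed times.
    def implied_rate_mb_s(size_gb, minutes):
        return size_gb * 1024 / (minutes * 60)

    print(f"3.1.1: {implied_rate_mb_s(334, 85):.0f} MB/s implied vs 44MB/s reported")   # ~67 MB/s
    print(f"4.1:   {implied_rate_mb_s(334, 133):.0f} MB/s implied vs 43MB/s reported")  # ~43 MB/s
    print(f"Backup time increase: {133 / 85 - 1:.0%}")                                  # ~56%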

Post by Gostev »

Speed improvements are only expected for ESX 4; for ESX 3.5, only a slight improvement is expected (about what you have indicated above). The issue with the backup speed of those 2 VMs needs to be investigated through our technical support; this is definitely not expected, and I think that something else may be slowing down backups of those VMs - I don't see how VCB can really be faster than the vStorage API. Thank you.

Post by cby »

In case others encounter the FLR file browser timeout problem, here's a workaround issued by Veeam support:

- On the Windows Veeam proxy, fire up regedit
- Under the HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and FastSCP key, add a new DWORD called MaxPerlSoapOperationTimeout
- Assign MaxPerlSoapOperationTimeout the decimal value 600000

This prevents (or delays?) the timeout so at least your 7 minute wait will produce a valid file browser listing!
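
For anyone who'd rather script the change than click through regedit, here's a minimal sketch using Python's built-in winreg module (run it on the Windows Veeam proxy with administrative rights; the key path and the 600000 value are exactly as in the workaround above, while the wrapper function is my own):

    # Apply the Veeam support workaround: raise the FLR SOAP operation timeout.
    import winreg

    KEY_PATH = r"SOFTWARE\Veeam\Veeam Backup and FastSCP"

    def set_flr_timeout(timeout=600000):
        # 600000 decimal per the workaround (presumably milliseconds, i.e. 10 min).
        with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                                winreg.KEY_SET_VALUE) as key:
            winreg.SetValueEx(key, "MaxPerlSoapOperationTimeout", 0,
                              winreg.REG_DWORD, timeout)

    set_flr_timeout()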

Post by Gostev »

Yes, this simply increases the default timeout. Thanks!