Host-based backup of VMware vSphere VMs.
tom11011
Expert
Posts: 192
Liked: 9 times
Joined: Dec 01, 2010 8:40 pm
Full Name: Tom
Contact:

Replication Performance

Post by tom11011 »

I would like to compare our replication performance with others. In my opinion our performance used to be better, but jobs now seem to take a really long time, and I would like some comparisons. I'll try to lay out the technical specs below.

Our source and destination servers are Dell PE720, dual six-core CPU, 256 GB RAM
VMware vSphere 6.0 Update 2 plus all later patches
Latest Veeam 9.5
Dell PE 3200i SAN at both locations, all drives are 600 GB 15K RPM SAS
1 Gb iSCSI (although I'm not sure that matters in this case)
All disk groups are 12-drive disk groups in RAID 10

We are replicating over a 30 Mbps VPN tunnel to a distant site. All 30 Mbps are available, but we rarely use the full bandwidth, which I find odd; it's more likely to see 15-20 Mbps in use. According to Veeam, the source always seems to be our bottleneck.
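
For a sense of scale, the link figures above can be converted from megabits to megabytes per second. This is only a rough sketch of the arithmetic: the function names are made up for illustration, and it ignores VPN and protocol overhead.

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def link_utilization(observed_mbps: float, link_mbps: float) -> float:
    """Fraction of the link in use, e.g. 0.5 means half the link is idle."""
    return observed_mbps / link_mbps


if __name__ == "__main__":
    link = 30.0  # Mbps, size of the VPN tunnel from the post
    for observed in (15.0, 20.0):  # Mbps, typical usage from the post
        print(
            f"{observed:.0f} Mbps = {mbps_to_mb_per_s(observed):.2f} MB/s, "
            f"{link_utilization(observed, link):.0%} of the link"
        )
```

Even a fully used 30 Mbps tunnel moves only 3.75 MB/s, so the amount of data that actually has to cross the link matters far more than raw disk speed.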

Image
Gostev
Chief Product Officer
Posts: 31815
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Replication Performance

Post by Gostev »

The issue here is the performance of fetching data from the source SAN... according to your screenshot, something is seriously wrong with the fabric or the SAN itself.

Re: Replication Performance

Post by tom11011 »

What do you base that on?

Re: Replication Performance

Post by Gostev »

By the fact that 20 MB/s is as fast as the backup proxy is able to retrieve the data from the SAN... that's too slow even for my 10-year-old laptop, let alone a SAN with 15K SAS drives. And the bottleneck stats confirm that by showing that every other component basically remains idle, just waiting for the source data mover to supply the data.

Re: Replication Performance

Post by tom11011 »

I think maybe the SAN is too busy, with lots of Veeam backups and replications running. I am going to disable all jobs and then let one replication job run by itself. I'll report back.

Re: Replication Performance

Post by tom11011 »

Here is a replication job running by itself; this looks better.

A few questions:

1.) How does the current performance of a single replication job rate for my described setup?
2.) If I change my backup jobs to use forward incremental instead of reverse incremental, what kind of performance impact should I expect?
3.) Also, will Veeam continue to clean up the old VRB files created by the reverse incremental jobs?
4.) Can you contrast the difference between processing rate and speed?

Image
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Replication Performance

Post by foggy »

1. Performance now looks more reasonable.
2. Considering the bottleneck is still the source, this will not affect anything.
3. If you switch the backup method to forward incremental, yes.
4. Processing rate is the ratio of the amount of data actually read to the time it took to transfer the data to the target. The read speed in the stats below is the actual average read rate for the disk.
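
As a minimal sketch of that definition, assuming the time is the elapsed time of the job in seconds and decimal units (1 GB = 1000 MB). The function name and the example figures are made up for illustration and are not taken from this thread.

```python
def processing_rate_mb_per_s(data_read_gb: float, elapsed_seconds: float) -> float:
    """Data actually read divided by elapsed time, in MB/s (1 GB = 1000 MB)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return data_read_gb * 1000.0 / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical job: 100 GB read over 2 hours.
    rate = processing_rate_mb_per_s(100.0, 2 * 3600)
    print(f"processing rate: {rate:.1f} MB/s")
```

Because the denominator is the whole elapsed time, the processing rate can sit well below the per-disk read speed whenever the job spends time on anything other than reading.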

Re: Replication Performance

Post by tom11011 »

>>>Considering the bottleneck is still source, this will not affect anything.
That question was hypothetical. The image above is a replication, not a backup; I'm not sure if you based your answer on that.

Thank you to everyone so far. I'm running into a problem with our required service level agreements (SLAs), which I would like to present in a very generic sense to get an idea of how it could be handled.

Let's say I have 2 MSSQL 2014 database virtual servers. Each of those servers is configured with 128 GB RAM and 8 cores over 2 CPUs.

Until the last few months, I was able to meet our SLA, which was a 2 hour RPO and a 2 hour RTO.

Each server would basically run a backup to local NAS storage and then a replication to a geographically distant data center. See the original post for an idea of our technology. Each job for each server finished in about 1 to 1.5 hours on average. But as our data continues to grow (currently about 750 GB per database server), I am no longer able to meet this SLA.
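
One way to frame the numbers above is to work out what a 2 hour window demands. The sketch below is a back-of-the-envelope calculation only: the function names are hypothetical, units are decimal (1 GB = 1000 MB), and overhead, compression and deduplication are ignored.

```python
def required_read_rate_mb_per_s(data_gb: float, window_hours: float) -> float:
    """Sustained rate needed to read data_gb within window_hours."""
    return data_gb * 1000.0 / (window_hours * 3600.0)


def link_capacity_gb(link_mbps: float, window_hours: float) -> float:
    """Upper bound on the data a link can carry in the window, in GB."""
    mb_per_s = link_mbps / 8.0  # 8 bits per byte
    return mb_per_s * window_hours * 3600.0 / 1000.0


if __name__ == "__main__":
    # 750 GB per database server, 2 hour window, 30 Mbps tunnel (from the thread).
    print(f"full read needs {required_read_rate_mb_per_s(750.0, 2.0):.0f} MB/s")
    print(f"link carries at most {link_capacity_gb(30.0, 2.0):.0f} GB")
```

On these assumptions, reading a whole 750 GB server inside the window would need roughly 104 MB/s sustained, and the tunnel can carry at most about 27 GB in the same time, so the jobs only fit if each run reads and sends a small fraction of the disks.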

I'm trying to figure out if I have the right solution or not and am looking for suggestions. What should I be measuring? I have a budget.

Thoughts off the top of my head include the following:

1.) Our gear, with the exception of our switches, is 10 Gb ready. Pull the trigger on 10 Gb switches if it will make a big difference (obviously this doesn't help replications).
2.) Add a second SAN (i.e. stop adding disk shelves to the current SAN).
3.) Use Veeam to either replicate or back up, but not both, i.e. use SQL mirroring.

Other?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Replication Performance

Post by tsightler »

To me it looks like CBT is not working, even though the job says it is using CBT. Look at the amount of data read versus the amount of data transferred: they are massively different. In the first screenshot we read 268 GB of data while transferring only 1.8 GB. I would suggest resetting CBT on these VMs and letting the job run again. The first run after the CBT reset will not be better, but I'm hoping subsequent runs will get a lot better.

I guess it could also be possible that something is writing to these blocks even though the actual data is not changing, but that seems less likely.
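
The gap between read and transferred data can be put into numbers. This is a rough sketch: the 268 GB and 1.8 GB figures are the ones quoted above, the 20 MB/s read rate is the one mentioned earlier in the thread, and the function names and decimal units (1 GB = 1000 MB) are assumptions made for illustration.

```python
def read_to_transfer_ratio(read_gb: float, transferred_gb: float) -> float:
    """How many GB were read from the source for each GB sent to the target."""
    if transferred_gb <= 0:
        raise ValueError("transferred_gb must be positive")
    return read_gb / transferred_gb


def read_time_hours(read_gb: float, read_rate_mb_per_s: float) -> float:
    """Hours spent reading, assuming a steady rate and 1 GB = 1000 MB."""
    return read_gb * 1000.0 / read_rate_mb_per_s / 3600.0


if __name__ == "__main__":
    print(f"read/transfer ratio: {read_to_transfer_ratio(268.0, 1.8):.0f}x")
    print(f"time to read 268 GB at 20 MB/s: {read_time_hours(268.0, 20.0):.1f} h")
```

With well over a hundred gigabytes read for every gigabyte sent, most of the job time goes into reading blocks that turn out not to have changed, which is the pattern one would expect if changed block tracking is not being used effectively.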

Re: Replication Performance

Post by tom11011 »

I would be willing to try that on a job with a single VM. Do I do that from vSphere PowerCLI, or can I reset it from within the GUI somehow?

Re: Replication Performance

Post by tom11011 »

I found Veeam KB1113, which leads to VMware KB 2139574. But that script resets CBT on all virtual machines. Can anyone point me in the direction of a script for a single VM?
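
For what it's worth, a per-VM version of the usual "disable CBT, then cycle a snapshot" sequence might look roughly like the sketch below. It is not the script from either KB article. It is written against the pyVmomi method names (ReconfigVM_Task, CreateSnapshot_Task, RemoveSnapshot_Task), but the function name, the snapshot name, and the two injected helpers are hypothetical, and it assumes the VM is powered on and has no existing snapshots. Whether CBT then has to be re-enabled by hand or is re-enabled by the next job run depends on the job settings, so check that before relying on it.

```python
from typing import Any, Callable


def reset_cbt_for_vm(
    vm: Any,
    make_spec: Callable[[bool], Any],
    wait: Callable[[Any], Any],
) -> None:
    """Disable CBT on a single VM, then cycle a snapshot so the change applies.

    vm        -- a pyVmomi vim.VirtualMachine (or any object with the same methods)
    make_spec -- builds a config spec, for example
                 lambda on: vim.vm.ConfigSpec(changeTrackingEnabled=on)
    wait      -- your own helper that blocks until a task finishes and
                 returns the task result (task.info.result in pyVmomi)
    """
    # Step 1: turn change tracking off in the VM configuration.
    wait(vm.ReconfigVM_Task(spec=make_spec(False)))

    # Step 2: create a temporary snapshot without memory or quiescing...
    snapshot = wait(
        vm.CreateSnapshot_Task(
            name="cbt-reset",  # hypothetical name
            description="temporary snapshot for CBT reset",
            memory=False,
            quiesce=False,
        )
    )

    # Step 3: ...and remove it again, which makes the disks pick up the setting.
    wait(snapshot.RemoveSnapshot_Task(removeChildren=False))
```

Passing the spec builder and the wait helper in as arguments keeps the sequence itself free of any connection details, so it can be dry-run against a stub object before pointing it at a real VM looked up by name.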

Re: Replication Performance

Post by tsightler »

It looks like those screenshots above are from a replica from backup job, is that right? If so, can you share the equivalent screenshots from the backup itself? That might give more insight. I can't figure out any scenario where you should be seeing such a high read vs transfer unless CBT is not working properly.

Re: Replication Performance

Post by tom11011 »

Hi, I'm not sure what you mean by "replica from a backup job". Those pictures are from replication jobs.

Here is a screenshot from a backup. Notice the very first line in the action log: "Virtual disk configuration change detected, resetting CBT". All my SQL backup jobs say that, but my other backup jobs do not. All jobs have CBT enabled, and all VMs are running version 7 or higher.

Image

Re: Replication Performance

Post by tsightler »

OK, so now we are getting somewhere. For some reason Veeam is detecting that there has been a change to the VM disk size and is resetting CBT. This might be expected after you resized a disk, but should only happen once after that. Are you getting this message on each and every run?

Re: Replication Performance

Post by tom11011 »

Yes, each and every run. But only for the DB server jobs, not for any other VM.

It is happening on both replication and backup jobs for these particular servers.

These servers have had their disk sizes increased, but the last increase was several weeks ago.

Re: Replication Performance

Post by tom11011 »

I've opened case 01995485 on it.

Re: Replication Performance

Post by tsightler »

You might want to try following KB 1940, specifically the section under Veeam Backup & Replication v8 and Later which shows how to use a registry key to disable this automatic CBT reset functionality (this was added as protection against the CBT corruption issues in early versions of vSphere 6). Perhaps somehow this is malfunctioning in your case as it should not reset every time.

Re: Replication Performance

Post by tom11011 »

I'll keep it in mind, thanks. I have a case open with Veeam now; they are recommending I just open a case with VMware.

Unfortunately, that script only resets ALL VMs. Can't have that happening!

My only workaround is to apply that script to a single ESX server instead of vCenter (i.e. first migrate all the VMs I don't want reset off that ESX server). I've applied this script to one server and am currently running a replication, which will take a while. I'll report back on the results.

Re: Replication Performance

Post by tsightler »

Well, based on the "Virtual disk configuration change detected, resetting CBT" message, I'm not sure that resetting CBT is really the answer, as it's obvious that we are already doing the CBT reset on every single job run. This should only happen on the first run after the disk is resized, not every run. This is what Veeam support should be looking into, to understand why we are doing that on every run. I'll reach out to the support engineer on your case and make sure he understands what we are seeing. I'm thinking it might be a bug due to the excluded disk, but that's just a total guess right now.

Re: Replication Performance

Post by tom11011 »

All the VMs that this is happening on have disks excluded, i.e. I am excluding the tempdb disk as it doesn't need to be replicated.

Re: Replication Performance

Post by tom11011 »

This is the message I sent to support.

"In both the Veeam backup and replication jobs, we had an excluded disk. Basically, we were excluding a drive that contained tempdb, which in my opinion is useless to replicate or back up, as it is simply recreated any time MSSQL is restarted.

A few months back we deleted the disk in VMware and then recreated it. It was too large for our needs, so we made it smaller.

I believe this to be the cause of the issue. Veeam didn't really complain because the disk was excluded. My guess is that Veeam only concerns itself with the disk number (i.e. SCSI 0:0 etc.) and doesn't really check whether it is the same disk. So when I deleted the disk and re-added it in VMware, the number remained the same, in my case SCSI 0:2. Veeam saw that it only had to worry about SCSI 0:0 and SCSI 0:1. But somehow SCSI 0:2 is relevant to Veeam even though it is skipping it.

To test this theory on the backup job, I removed the exclusion. The backup did not give me the message "Virtual disk configuration change detected, resetting CBT" for one job, but it did for another. In all cases, the job worked correctly at least after the second run. After the second run completed, I excluded the disk again, and now it is running normally.

I wanted to test one other thing: could I simply remove the exclusion and save the config, then re-add the exclusion and save the config again before starting the job, and get the same result? That did not work for me; the job had to run once with all disks before I could successfully re-enable disk exclusion.

Replication seems to be a different story.

When I removed the exclusion from the replication job, one job failed with "Processing configuration Error: Cannot replicate disk [XXXXXXX-LUN6-DG4-TEMPDB] xxxxxxxxxxx/xxxxxxxxxxx.vmdk because its capacity was reduced", i.e. the disk size change explained above. I tried again after manually removing snapshots, but got the same result. I had to go into the replica and delete the disk (whose size didn't match the VM; it was still the old size on the replica). After that the job seems OK; it is running now but will take a while to complete, so I won't know for sure until then. I did not receive the CBT message, but it did have to calculate digests.

On another replication job, it did recognize the disk size change and gave the warning "VM disk size changed since last sync, deleting all restore points". It proceeded to delete all replica restore points. Then it added the new disk to the replica but did not remove the old disk. I have to manually delete the old replica disk once the job finishes."