Replication Performance

Post by **tom11011** » Dec 03, 2016 5:05 pm

I would like to compare our replication performance with that of others. In my opinion, our performance used to be better, but replication now seems to take a really long time, and I would like to get some comparisons. I'll try to lay out the technical specs below.

Our source and destination servers are Dell PE720, dual six-core CPUs, 256 GB RAM
VMware vSphere 6.0 Update 2 plus all subsequent patches
Latest Veeam 9.5
Dell PE 3200i SAN at both locations; all drives are 600 GB 15K RPM SAS
1 Gb iSCSI (although I'm not sure that matters in this case)
All disk groups are 12-drive disk groups in RAID 10

We are replicating over a 30 Mbps VPN tunnel to a distant site. All 30 Mbps are available, yet we rarely use the full bandwidth, which I find odd; it's more likely to see 15-20 Mbps in use. According to Veeam, the source always seems to be our bottleneck.
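
As a rough sanity check on the link figures above, the sketch below converts a link speed in Mbps to MB/s and estimates how long a given amount of changed data would take to cross the tunnel. It assumes decimal units (1 MB = 1,000,000 bytes) and ignores VPN overhead, compression, and deduplication. The 10 GB figure is purely hypothetical, and the function names are illustrative only.

```python
def mbps_to_mbytes_per_s(link_mbps):
    """Convert megabits per second to megabytes per second (decimal units)."""
    return link_mbps / 8


def transfer_hours(data_gb, link_mbps):
    """Hours needed to push data_gb gigabytes over a link of link_mbps."""
    seconds = (data_gb * 1000) / mbps_to_mbytes_per_s(link_mbps)
    return seconds / 3600


if __name__ == "__main__":
    # Observed range (15-20 Mbps) versus the full 30 Mbps tunnel.
    for mbps in (15, 20, 30):
        print(f"{mbps} Mbps = {mbps_to_mbytes_per_s(mbps):.2f} MB/s, "
              f"10 GB takes {transfer_hours(10, mbps):.2f} h")
```

Note that even a fully used 30 Mbps tunnel carries only 3.75 MB/s, so the link itself caps how much changed data can leave the site per hour.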

Post by **Gostev** » Dec 03, 2016 7:16 pm

The issue here is the performance of fetching data from the source SAN... according to your screenshot, something is seriously wrong with the fabric or the SAN itself.

Post by **tom11011** » Dec 03, 2016 7:21 pm

What do you base that on?

Post by **Gostev** » Dec 03, 2016 8:05 pm

By the fact that 20 MB/s is as fast as the backup proxy is able to retrieve the data from the SAN... that's too slow even for my 10-year-old laptop, let alone a SAN with 15K SAS drives. And the bottleneck stats confirm that by showing that every other component basically remains idle, just waiting for the source data mover to supply the data.

Post by **tom11011** » Dec 03, 2016 8:49 pm

I think maybe the SAN is too busy, with lots of Veeam backups and replications running. I am going to disable all jobs and then let just one replication job run by itself. I'll report back.

Post by **tom11011** » Dec 03, 2016 9:19 pm

Here is a replication job running by itself; this looks better.

A few questions:

1.) How does the current performance of a single replication job rate for my described setup?
2.) If I change my backup jobs to use forward incrementals instead of reverse, what kind of performance impact would one expect to see?
3.) Also, will Veeam continue to clean up the old VRB files created by the reverse incremental jobs?
4.) Can you explain the difference between processing rate and speed?

Post by **foggy** » Dec 05, 2016 2:45 pm

1. The performance now looks more reasonable.
2. Considering the bottleneck is still the source, this will not affect anything.
3. Yes, if you switch the backup method to forward incremental.
4. Processing rate is the ratio of the total amount of data actually read to the time it took to transfer the data to the target. The read speed in the stats below is the actual average read rate for the disk.
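
Following the definitions in point 4, the distinction can be sketched as two ratios over the same amount of read data but with different denominators. All numbers and function names below are hypothetical; this only illustrates the definitions as stated in this reply and is not Veeam's actual calculation.

```python
def processing_rate_mb_s(read_mb, transfer_seconds):
    """Data read divided by the time it took to transfer data to the target."""
    return read_mb / transfer_seconds


def read_speed_mb_s(read_mb, disk_read_seconds):
    """Average rate at which the disk itself was read."""
    return read_mb / disk_read_seconds


if __name__ == "__main__":
    read_mb = 12_000           # hypothetical: 12 GB read from the source disk
    transfer_seconds = 1_200   # hypothetical: time to get data to the target
    disk_read_seconds = 600    # hypothetical: time spent reading the disk
    print(processing_rate_mb_s(read_mb, transfer_seconds))  # 10.0
    print(read_speed_mb_s(read_mb, disk_read_seconds))      # 20.0
```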

Post by **tom11011** » Dec 05, 2016 3:40 pm

>>>Considering the bottleneck is still source, this will not affect anything.
This was hypothetical. The image above shows a replication, not a backup; I'm not sure whether you based your answer on that.

Thank you to everyone so far. I'm running into a problem with our required service level agreement (SLA), which I would like to present in generic terms to get an idea of how it could be handled.

Let's say I have two MSSQL 2014 database virtual servers. Each is configured with 128 GB RAM and 8 cores across 2 CPUs.

Until the last few months, I was able to meet our SLA of a 2-hour RPO and a 2-hour RTO.

Each server would run a backup to local NAS storage and then a replication to a geographically distant data center. See the original post for an idea of our technology. Each job for each server was able to finish in about 1 to 1.5 hours on average. But as our data continues to grow (currently about 750 GB per database server), I am no longer able to meet this SLA.
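
To put the numbers above in perspective, the sketch below works out two worst-case bounds: the sustained read rate needed if a full 750 GB had to be read within a 2-hour window, and the most data the 30 Mbps link from the original post could carry in that window. It assumes decimal units, no compression or deduplication, and that the whole window is available to a single job. Changed block tracking normally limits reads to changed blocks, so these are upper bounds rather than typical figures, and the function names are illustrative only.

```python
def required_rate_mb_s(data_gb, window_hours):
    """Sustained MB/s needed to read data_gb within window_hours."""
    return data_gb * 1000 / (window_hours * 3600)


def max_transfer_gb(link_mbps, window_hours):
    """Most data (GB) a link of link_mbps can carry in window_hours."""
    return link_mbps / 8 * window_hours * 3600 / 1000


if __name__ == "__main__":
    # Full read of one 750 GB database server inside a 2-hour window.
    print(f"{required_rate_mb_s(750, 2):.1f} MB/s needed for a full read")
    # Ceiling on changed data that fits through the 30 Mbps tunnel.
    print(f"{max_transfer_gb(30, 2):.1f} GB fits through the link")
```

If a job ends up reading the whole disk on every run, the first figure is the one to compare against the measured source read rate.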

I'm trying to figure out if I have the right solution or not and am looking for suggestions. What should I be measuring? I have a budget.

Thoughts off the top of my head include the following:

1.) Our gear, with the exception of our switches, is 10 Gb ready. Pull the trigger on 10 Gb switches if it will make a big difference (obviously this doesn't help replications).
2.) Add a second SAN (i.e., stop adding disk shelves to the current SAN).
3.) Use Veeam to either replicate or back up, but not both (i.e., use SQL mirroring).

Other?

Post by **tsightler** » Dec 05, 2016 4:03 pm

To me it looks like CBT is not working, even though the job says it is using CBT. Look at the amount of data read versus the amount of data transferred: they are massively different. In the first screenshot we read 268 GB of data while transferring only 1.8 GB. I would suggest resetting CBT on these VMs and letting the job run again. The first run after the CBT reset will not be any better, but I'm hoping subsequent runs will be a lot better.

I guess it could also be possible that something is writing to these blocks even though the actual data is not changing, but that seems less likely.
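
The mismatch described above can be expressed as a simple ratio of data read to data transferred. The sketch below uses the 268 GB and 1.8 GB figures from the first screenshot; the threshold of 10 is an arbitrary, hypothetical cut-off for flagging a job and is not a value taken from Veeam. Function names are illustrative only.

```python
def read_to_transfer_ratio(read_gb, transferred_gb):
    """How many GB were read for every GB actually transferred."""
    if transferred_gb <= 0:
        raise ValueError("transferred_gb must be positive")
    return read_gb / transferred_gb


def looks_like_cbt_problem(read_gb, transferred_gb, threshold=10.0):
    """Flag a job whose read volume dwarfs its transferred volume.

    The threshold is a hypothetical rule of thumb, not a vendor value.
    """
    return read_to_transfer_ratio(read_gb, transferred_gb) > threshold


if __name__ == "__main__":
    ratio = read_to_transfer_ratio(268, 1.8)
    print(f"read/transfer ratio: {ratio:.0f}x")
    print("suspicious:", looks_like_cbt_problem(268, 1.8))
```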

Post by **tom11011** » Dec 05, 2016 4:09 pm

I would be willing to try that on a job with a single VM. Do I do that from vSphere PowerCLI, or can I reset it from within the GUI somehow?

Post by **tom11011** » Dec 05, 2016 4:57 pm

I found Veeam KB1113, which leads to VMware KB 2139574. But that script resets CBT on all virtual machines. Can anyone point me to a script for a single VM?

Post by **tsightler** » Dec 05, 2016 6:20 pm

It looks like those screenshots above are from a replica from backup job, is that right? If so, can you share the equivalent screenshots from the backup itself? That might give more insight. I can't think of any scenario where you would see such a high read versus transfer amount unless CBT is not working properly.

Post by **tom11011** » Dec 05, 2016 6:30 pm

Hi, I'm not sure what you mean by "replica from a backup job". Those pictures are of replication jobs.

Here is a screenshot from a backup. Notice the very first line in the action log: "Virtual disk configuration change detected, resetting CBT". All my SQL backup jobs say that, but my other backup jobs do not. All jobs have CBT enabled, and all VMs are running version 7 or higher.

Post by **tsightler** » Dec 05, 2016 7:01 pm

OK, so now we are getting somewhere. For some reason Veeam is detecting that there has been a change to the VM disk size and is resetting CBT. This might be expected after you resized a disk, but should only happen once after that. Are you getting this message on each and every run?

Post by **tom11011** » Dec 05, 2016 7:06 pm

Yes, each and every run. But only for the DB server jobs, not for any other VM.

It is happening on both replication and backup jobs for these particular servers.

These servers have had their disk sizes increased, but the last increase was several weeks ago.

Post by **tom11011** » Dec 05, 2016 8:10 pm

I've opened case 01995485 on it.

Post by **tsightler** » Dec 05, 2016 9:25 pm

You might want to try following KB 1940, specifically the section under Veeam Backup & Replication v8 and Later which shows how to use a registry key to disable this automatic CBT reset functionality (this was added as protection against the CBT corruption issues in early versions of vSphere 6). Perhaps somehow this is malfunctioning in your case as it should not reset every time.

Post by **tom11011** » Dec 05, 2016 9:35 pm

I'll keep it in mind, thanks. I have a case open with Veeam now; they are recommending that I just open a case with VMware.

Unfortunately, that script only resets ALL VMs. Can't have that happening!

My only workaround is to apply that script to a single ESX server instead of vCenter (i.e., migrate all the VMs I don't want reset off that ESX server first). I've applied the script to one server and am currently running a replication, which will take a while. I'll report back on the results.

Post by **tsightler** » Dec 06, 2016 12:56 am

Well, based on the "Virtual disk configuration change detected, resetting CBT" message, I'm not sure that resetting CBT is really the answer, as it's obvious that we are already doing the CBT reset on every single job run. This should only happen on the first run after the disk is resized, not on every run. This is what Veeam support should be looking into, to understand why we are doing that on every run. I'll reach out to the support engineer on your case and make sure he understands what we are seeing. I'm thinking it might be a bug due to the excluded disk, but that's just a total guess right now.

Post by **tom11011** » Dec 06, 2016 1:48 am

All the VMs that this is happening on have disks excluded, i.e., I am excluding the tempdb disk, as it doesn't need to be replicated.

Post by **tom11011** » Dec 07, 2016 3:36 am

This is the message I sent to support.

"In both the Veeam backup job and the replication job, we had an excluded disk. Basically, we were excluding a drive that contained tempdb, which in my opinion is useless to replicate or back up, as it is simply recreated any time MSSQL is restarted.

What we ended up doing a few months back was deleting the disk from VMware and then recreating it. It was too large for our needs, so we made it smaller.

I believe this to be the cause of the issue. Veeam didn't really complain because the disk was excluded. My guess is that Veeam only concerns itself with the disk number (i.e., SCSI 0:0, etc.) and doesn't really check whether it is the same disk. So when I deleted the disk and then re-added it in VMware, the number remained the same, in my case SCSI 0:2. Veeam saw that it only had to worry about SCSI 0:0 and SCSI 0:1. But somehow SCSI 0:2 is relevant to Veeam even though it is skipping it.

To test this theory on the backup job, I removed the exclusion. The backup did not give me the message "Virtual disk configuration change detected, resetting CBT" for one job, but it did for another. In all cases, the job worked correctly, at least after the second run. After the second run completed, I excluded the disk again, and now it is running normally.

I wanted to test one other thing: could I simply remove the exclusion and save the config, then re-add the exclusion and save the config again before starting the job, and get the same result? That did not work for me; the job had to run once with all disks before I could successfully re-enable disk exclusion.

Replication seems to be a different story.

When I removed the exclusion from the replication job, one job failed with "Processing configuration Error: Cannot replicate disk [XXXXXXX-LUN6-DG4-TEMPDB] xxxxxxxxxxx/xxxxxxxxxxx.vmdk because its capacity was reduced", i.e., the disk size change explained above. I tried again after manually removing snapshots, with the same result. I had to go into the replica and delete the disk (whose size didn't match the VM; it was still the old size on the replica). After that, the job seems OK; it is running now, but it will take a while to complete before I know for sure. I did not receive the CBT message, but it did have to calculate digests.

On another replication job, it did recognize the disk size change and gave the warning "VM disk size changed since last sync, deleting all restore points". It proceeded to delete all replica restore points. Then it added the new disk to the replica but did not remove the old disk. I have to manually delete the old replica disk once the job finishes."
