Replication and Performance - I'm not getting it

Post by **mrpopgun** » Nov 13, 2015 7:59 pm

As an introduction, I'm new to both Veeam and ESX. I was a Hyper-V guy from the beginning, but still learning my way around vSphere and Veeam.

I've been working with replication of a large (40TB) vm and am having problems determining either a) where my performance problems are coming from or b) why the performance is as expected.

Layout
Primary
3 x vSphere 5.5.0 servers (Essentials Plus) - 1Gb connected to LAN
1 x Coraid SRX6300 - 10Gb connected to servers (We'll call this CoraidA)

DR - 1Gb connected via private fiber at around 25ms
Same as above
(We'll call this storage CoraidB)
Veeam server running here
Veeam Proxy running here
Veeam replication job pulling the data from Primary to DR

Starting setup
We have 2 physical sites, Primary and DR. DR has had the equipment, but we are just starting to make it functional. We had an issue with CoraidA at Primary, so we brought CoraidB over with the intention of migrating to it and moving CoraidA back to DR. I won't go into the reasoning, as it is a bit of a rabbit trail.

With this vm being so large, I decided I'd simply use replication and cutover. It turns out this is about a 120 hour job run for an initial replication. The thing that surprised me, though, was that we were only achieving an average of around 280MB/s and it showed a bottleneck in the target (CoraidB), which had nothing else running on it. The replication was using all default settings (Optimal compression and local target storage optimization) so Veeam recommended we go to No compression level and Local Target (16TB+ backup files) since a) the data in this vm is highly non-dedupe friendly and b) the vm is so large. Still, having the target be the bottleneck seemed bizarre. The network or source would have made far more sense.

Now, we've got CoraidA at DR and I've set up replication again. Only now, it shows the Source as the bottleneck (CoraidB) and it is moving at a whopping 5MB/s. Obviously, I have an open support ticket on the CoraidB storage since the issues look to be following it. However, it does make me wonder:

1) Do I have the replication set up as I should?
2) How do I go about pinpointing bottlenecks? What tools should I be using?
3) What do I not know here that I don't know I don't know?

Thanks

Post by **Delo123** » Nov 16, 2015 10:06 am

Needless to say, a 40TB VM is crazy... Trying to think of a valid reason why a single VM would need to be so large....

You say you were getting around 280Mb/s (Mbit?), which actually isn't very bad considering your single 40TB VM probably doesn't consist of that many extents, so parallel processing isn't going to help you...

Post by **foggy** » Nov 16, 2015 10:16 am

Do you have proxies on both ends? What transport modes are being utilized by each of them? What repository is specified in the job settings as metadata storage?

Post by **mrpopgun** » Nov 16, 2015 6:15 pm

Thanks for the replies.

First, a 40TB vm != "crazy", it = "medical imaging". There isn't a ton of processing overhead, just a ton of need for storage of data that is very unfriendly to dedupe or compression. Quality = Size = Important.

I followed the instructions at http://helpcenter.veeam.com/backup/70/b ... ation.html, set up a Veeam Server and Proxy at both locations and created the replication job at the DR site so it pulls the data. Should I actually have 4 proxies? 1 at each location for each Veeam server?

Transport mode = 10Gb servers to storage, 1Gb fiber between sites

The metadata repo is the local repo at the DR site. I have one at Primary for the Primary server and one at DR for the DR server.

Post by **NightBird** » Nov 16, 2015 6:37 pm

120h seems correct for an initial replication of 40TB over a 1Gb/s line.
1Gb/s is about 100-120MB/s, which is roughly 400GB per hour, so 100-120 hours gets you to 40TB.
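
For anyone who wants to redo that estimate with their own numbers, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 1,000,000 MB) and a steady transfer rate, and it ignores compression, protocol overhead and changed-block savings. The function names are made up for illustration and are not part of any Veeam tooling.

```python
def link_mb_per_s(link_gbit: float, efficiency: float = 1.0) -> float:
    """Convert a link speed in Gbit/s to MB/s.

    efficiency is a hypothetical fudge factor (0-1) for protocol overhead;
    1.0 gives the theoretical line rate.
    """
    return link_gbit * 1000 / 8 * efficiency


def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a steady rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    size_mb = size_tb * 1_000_000  # decimal terabytes to megabytes
    return size_mb / rate_mb_per_s / 3600


if __name__ == "__main__":
    print(f"1 Gbit/s line rate: {link_mb_per_s(1):.0f} MB/s")
    # 100 and 120 MB/s are the realistic 1Gb/s figures quoted above;
    # 5 MB/s is the rate reported in the original post.
    for rate in (100, 120, 5):
        print(f"{rate:>4} MB/s -> {transfer_hours(40, rate):7.1f} h")
```

At 100-120MB/s this gives roughly 93-111 hours for 40TB, in line with the 120h job run, while 5MB/s would take over 2,000 hours, so that rate points at a fault rather than at the link.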

Post by **foggy** » Nov 16, 2015 8:49 pm

By transport mode I mean how data is retrieved from the source storage, you can see it in the job session log if you select the VM to the left and look for the proxy server name selected for processing ([hotadd] or [nbd]).

The repository for storing replica metadata should be located close to the source storage. As far as I can tell, this is not so in your case (please check the replication job settings).

Post by **Delo123** » Nov 17, 2015 8:59 am

Even if it's medical imaging, having 40TB on a single VM is still a major risk. What do you do if you need to restore? It will take days... Also, file corruption could be an issue. Is this a Windows VM? Ever thought about DFS? You can deploy the content on multiple servers and offer them as a single namespace (share)...

PS: we also have some servers with high-quality images in the 1TB-5TB range, used for catalogue production. Windows dedupe works quite well on that! (ca. 50% savings)

Post by **mrpopgun** » Nov 17, 2015 7:14 pm

Transport mode = [hotadd]

Repo = the repo for the DR Veeam Server is at the DR site, not the Primary site. Same for Primary. Do we need to have a repo and proxy for each Veeam Server at each site?

Post by **mrpopgun** » Nov 17, 2015 7:14 pm

Trust me, I do understand the concerns about large VMs, restores, etc. It is precisely why I am making sure we have replication working at our DR facility - no restores, we simply go hot at DR, which we own and operate as a warm standby. Our replication windows are such that we aren't too worried about replicated corruption to DR and this all falls within our RPOs and RTOs.

Otherwise, if you're familiar with PACS, you will understand why single machines are the current standard, though luckily options are becoming available. We'll be at 120TB live data within a year, so DFS replication gets pretty expensive when you need multiple copies. Data change rates are low and data infusion rates are pretty manageable, though they will skyrocket for a bit as we pull in a bunch of new and HUGE imagery for a current project underway. Oh, and PACS systems have very little in the way of duplicate data - they do not dedupe anything like I would have expected.

Post by **foggy** » Nov 17, 2015 7:26 pm

I'm talking about the repo that you select in the replication job itself; you should select the one that is close to the data source. If it is selected correctly and hotadd is utilized on both source and target, then you seem to have everything configured correctly, so it's worth contacting support for a closer look.

Post by **mrpopgun** » Nov 17, 2015 10:22 pm

I hadn't realized I could point to the same repo from both sites. I've got that set up now and have both my proxies added as I should have, i.e. I've got a proxy in the Primary site, a proxy in the DR site, and the repo being explicitly used is on the Primary side. Veeam tech walked me through all this and had me disable VSS in this case. I killed the last replication that was running, and once the snapshot is rolled back, I'll kick it all off again.

Thanks for the help!
