Real-world replication and DR failover examples?

Post by **bhagen** » Apr 08, 2019 11:54 pm

We have two VMware 6.7u1 clusters: one in production, and one I just built that's empty. Each is running vSAN. We run VBR 9.5u4. Our goal is to use VBR to replicate 60 VMs (about 100TB) from production to "empty", permanently fail over, then reverse the replication, so that our current production cluster becomes our DR cluster, and the new "empty" cluster becomes the production cluster.

Once all that is working, we then plan to shut down the DR cluster and move it to our DR building.

And the ultimate goal, of course, is the ability to spin up the DR cluster in case the production cluster (or the building that cluster is in) becomes a smoking hole.

The Veeam documentation for this scenario is pretty sparse, and very high-level...almost theoretical. I'm looking for resources that deep dive into real-world implementations of this type of setup.

If you know of a white-paper, case-study, or training, please drop a link!

Post by **HannesK** » Apr 09, 2019 9:41 am

Hello,
I'm not sure which information is missing in the Help Center, but in general customers use planned failover for this and schedule it according to business preferences. It is more or less "planned downtime".

I assume that you have the same network settings everywhere, as you did not mention anything.

Are you using one vCenter? Are you also doing backups, or only replication with Veeam?

Best regards,
Hannes

Post by **bdufour** » Apr 09, 2019 2:52 pm

It really depends on the bandwidth you have between the sites that will handle the replication traffic. That's going to be the biggest factor. Getting a full backup seed at the "empty" site will likely help you do this in a reasonable amount of time.
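
As a rough illustration of why bandwidth dominates, here is a minimal Python sketch that estimates how long an initial full replication would take. The 100 TB figure comes from the original post; the link speeds and the 70% usable-throughput factor are assumptions for illustration only, and compression, deduplicated transport, or seeding would change the result.

```python
def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the days needed to move data_tb terabytes over a link_gbps link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable for replication traffic.
    """
    data_bits = data_tb * 1e12 * 8             # decimal terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    seconds = data_bits / usable_bps
    return seconds / 86_400                    # seconds -> days


# Assumed link speeds, in Gbit/s
for gbps in (0.1, 1, 10):
    print(f"{gbps:>4} Gbit/s: {transfer_days(100, gbps):.1f} days")
```

Under these assumptions a 1 Gbit/s link needs roughly 13 days for the first full pass, which is why seeding the replicas while both clusters sit in the same building is attractive.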

Post by **bhagen** » Apr 25, 2019 10:28 pm

HannesK wrote: Apr 09, 2019 9:41 am Hello, not sure which information is missing in helpcenter, but in general customers use planned failover for this and schedule on the business preferences. More or less "planning downtime".

As I mentioned in my original post, "The Veeam documentation for this scenario is pretty sparse, and very high-level...almost theoretical. I'm looking for resources that deep dive into real-world implementations of this type of setup."

HannesK wrote: Apr 09, 2019 9:41 am I assume that you have the same network settings everywhere, as you did not mention anything.

Yes, networking is configured; we'll simply need to change gateways using the Veeam failover plan when we fail to the DR site.

HannesK wrote: Apr 09, 2019 9:41 am Are you using one VCenter? Are you doing also backups, or only replication with Veeam?

Yes, a single vCenter. We'll replicate to the DR cluster, then do backups of the replicated VMs to minimize I/O on the production cluster.

I'm still not finding any real-world examples of using Veeam to replicate a vmware environment to a DR site. No whitepapers, no Veeam case studies. I'd really like to see a few examples of this in the real world...

Post by **HannesK** » Apr 26, 2019 9:11 am

About the Veeam documentation: it has around 100 pages in the PDF. I assume that you have already tested your desired scenario with a test VM, so feel free to ask about what's missing.

About the whitepapers: I guess that's because the technology is about 10 years old and works more or less like normal backup. You will also not find any deep dive whitepaper for a normal backup job.

bhagen wrote: We'll replicate to the DR cluster, then do backups of the replicated VMs to minimize I/O on the production cluster.

This is where it becomes interesting for me (it was not mentioned in the initial post): backing up powered-off replica VMs. That is not a good idea, for the following reasons:

Additional license usage
No file indexing possible
No SQL Point in Time recovery
No CBT
More complex restore mechanisms

Post by **bhagen** » Apr 26, 2019 5:18 pm

Those 100 pages have all the *technical* information I need; but they don't adequately answer the "why" questions...so I suppose I should have just asked the specific questions that I couldn't find answers to.

I'll do that now:

1. What is a "normal" or "standard" way to set up replication jobs for 60 VMs: one job per VM? One job per "application tier" (Exchange, SQL, etc.)? One job per OS version (for better dedupe, like backup jobs...or does that even play into replication)? A combination of these? Some other way that I'm missing? Why would I use a particular method over another?

2. We're already running 13 nightly backup jobs that cover all 60 VMs, and we want to run secondary jobs to an offsite repository. That's a lot of jobs running overnight. What is a "normal" or "standard" way to schedule replication jobs when there are already backup jobs in play? Replicate during working hours? Replicate after hours, when backups are running? How badly do replication jobs affect the performance of Veeam jobs and/or production VMs? I'm sure a lot of this answer will depend on our RPOs, and on the fact that I will be "seeding" the DR cluster while that cluster is onsite, but then running the incremental replication jobs to that cluster once it's moved to the DR site (which is connected to our main site via an L2 connection, and is therefore in the same subnet as our main site). So I'm curious about our options here, and why we would choose one over another.

That's probably enough for now.

Oh...thank you for the information about backing up the replicas as opposed to the production vms. That's very valuable info...and though it's not what I was hoping to hear, at least now I know it and will have to adjust our expectations accordingly.

Post by **foggy** » Apr 28, 2019 9:15 pm

1. You may select any of the mentioned approaches, but another typical one is based on VM importance: you create a separate job for the most critical VMs and allow it to finish first, then group the other VMs by the same criterion. Dedupe doesn't play a role in replication (there isn't even a dedupe option in the job settings).

2. This depends on whether you're considering replication from backups or from production VMs. The first approach doesn't affect production VMs at all, and the jobs can be run during working hours. You will just need to stagger the backup copy jobs and the replication jobs accordingly, since both use backups as a source.
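
To make the importance-based grouping in point 1 concrete, here is a minimal Python sketch. The VM names and tier numbers are hypothetical, and the code only models the grouping; it does not call any Veeam API.

```python
from collections import defaultdict

# Hypothetical inventory: (VM name, priority tier); tier 1 is the most critical.
vms = [
    ("sql01", 1), ("exch01", 1),
    ("app01", 2), ("app02", 2),
    ("file01", 3), ("print01", 3),
]


def group_by_tier(inventory):
    """Return {tier: [VM names]} with the most critical tier first."""
    jobs = defaultdict(list)
    for name, tier in inventory:
        jobs[tier].append(name)
    return dict(sorted(jobs.items()))


# One replication job per tier, scheduled in this order
for tier, members in group_by_tier(vms).items():
    print(f"Replication job for tier {tier}: {', '.join(members)}")
```

The same tier order could later serve as the boot order in a failover plan, so the most critical VMs are both replicated and recovered first.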

Post by **skrause** » Apr 29, 2019 2:23 pm

On point number 1:
Since you have already categorized your workloads into 13 jobs, you could probably just use those same categorizations for your replication jobs. You can have per-VM (and even per-virtual-disk) destination settings inside the job. You can even choose to seed some VMs from backups or existing VMs while having others just do an initial full run. If you decide to seed jobs, make sure that you give them ample time to calculate digests on their first run (which, depending on the VM, may take longer than just pushing all the data from production).

On point number 2:
It sounds like you are not implementing Backup Copy jobs and want to run separate Backup jobs instead. If at all possible, you should use Backup Copy jobs for getting your data to the secondary location. This will cut down on your production impact; if you have ample bandwidth, it can even be set up to sync during "business hours" when your backup jobs would not be running. If you don't want the backup jobs and replication jobs to interfere with one another, you could (after the initial seeding of the replica) "chain" the replication jobs to run after the backup job completes. While job chaining is, in general, not an ideal practice, this seems like a case where it would be useful.

If you are replicating your vCenter and want to fail it over, you will need to have that replication job set up using the individual source/destination hosts directly, not the vCenter, to manage the job. Otherwise you can't orchestrate the failover with a failover plan, as Veeam has nothing to talk to. You may want to look into getting another vCenter license for the "DR" location, depending upon your RTO and RPO SLAs, as getting the vCenter up and running properly would have to be your first task in a failover situation. I think you might even be able to use vCenter HA to do this, but I have not read up a whole lot on how (or even if) it works in a multi-site configuration. The couple thousand a year in licensing costs for a second vCenter server is worth the simplification of the DR process, IMO.

Post by **bhagen** » Apr 29, 2019 6:20 pm

Thanks @foggy and @skrause!

It sounds like doing replication jobs from backup copy jobs that reside on a backup server other than our main backup server is something I should investigate, so I will.

I like the idea of grouping replication jobs by importance; I think that would also work well in the event of a full-scale failover. I'll experiment with that as well.

Good point about the vCenter appliance replication/failover. I do have another vCenter license specifically for a DR stack, but bought it so long ago that I'd forgotten about it until you mentioned it @skrause. So yes, now I need to investigate vCenter HA, and how to make that work in a failover scenario.

This will give me something to work on this week, now that I have all my vms (except for my veeam server and my vcsa's) over on our new cluster.

Thanks for the tips!

Post by **skrause** » Apr 29, 2019 7:00 pm

You can't run Backup Copy jobs on a different B&R server from the one that the original jobs are being run on. (I learned that one from experience.) You could use a Backup Copy job's backup files (.vbk) as a source for replica seeding on another server if you add the repository and import the existing backups to it. But you would want to make sure the copy job does not sync data while you are running your replica seeding.

In any situation, the best practice is to have the B&R server that runs your replication jobs and failover plans in a different location from your production workloads. That way, in the case of a failover, you can go straight to failing over your production workloads rather than having to first get Veeam back up and running.

Post by **bhagen** » Apr 29, 2019 7:18 pm

I would put my backup copy jobs on a different NAS, not run them from a different VBR server. Then run replication jobs from that NAS to the DR vmware cluster.

To your point, if my VBR server is running in the DR site, then it would be running even if my main site went down.

So:
1. Run the VBR server in the DR site.
2. Back up production VMs in the main site to a backup NAS in the main site.
3. Run Backup Copy jobs from the main site backup NAS to the DR site backup NAS - no hit on production I/O.
4. Replicate production VMs from the DR site backup NAS to the DR site vSphere cluster - no hit on production I/O.
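
The data flow in the plan above can be written down as a small Python model to check which steps read from production storage. The site and device labels are only descriptive strings taken from this post, and the `Step` class is a hypothetical helper, not part of any Veeam tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    job: str
    source: str
    target: str


PRODUCTION = "main-site vSphere cluster"

# The VBR server itself runs in the DR site and drives all three jobs.
plan = [
    Step("Backup", PRODUCTION, "main-site backup NAS"),
    Step("Backup copy", "main-site backup NAS", "DR-site backup NAS"),
    Step("Replication from backup", "DR-site backup NAS", "DR-site vSphere cluster"),
]


def reads_production(step: Step) -> bool:
    """True if the step pulls data directly from the production cluster."""
    return step.source == PRODUCTION


for step in plan:
    impact = "reads production" if reads_production(step) else "no production I/O"
    print(f"{step.job}: {step.source} -> {step.target} ({impact})")
```

Only the nightly backup touches the production cluster; the backup copy and the replication both work from data that has already left it.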

Post by **skrause** » Apr 29, 2019 8:18 pm

That would work.

I personally prefer to have my replicas run (after seeding) from my production VMs as a source since the impact on workloads that are not highly transactional is usually unnoticeable. It also means that there is only one "point of failure" for replicas being current: only the replication job state matters for my immediate failover solution rather than 3 jobs (Backup, Backup Copy, Replication).

It also makes setting up the jobs going back the other direction easier after a failover because all I have to do is use the pre-failover VM as the seeded target in the job going back the other way.

I do have a very fast low-latency link between my primary and secondary data centers though, so YMMV.
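
The trade-off between one job and a chain of three can be sketched as a simple upper bound on how old the replica's data may be. This is a simplification that assumes unsynchronized schedules and ignores job run times, and the intervals below are made-up values, not recommendations.

```python
def worst_case_staleness_hours(intervals_hours):
    """Upper bound on replica data age: the sum of every job interval in the chain."""
    return sum(intervals_hours)


# Assumed schedules, in hours between runs
scenarios = {
    "replicate from production": [4],                      # replication only
    "replicate from backup copy": [24, 24, 4],             # backup, backup copy, replication
}

for name, intervals in scenarios.items():
    hours = worst_case_staleness_hours(intervals)
    print(f"{name}: up to {hours} h behind, depends on {len(intervals)} job(s)")
```

Every extra job in the chain adds its own interval to the bound and is one more thing that has to succeed for the replica to be current.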
