Disaster Recovery Setup using Veeam

ysm · Nov 26, 2016 7:31 am

Hi folks,
We are in the process of designing our first Disaster Recovery (DR) site. We are currently using Veeam 8 and planning to upgrade to 9 before the DR implementation. So I will be basing my design on Veeam 9.
I just want some confirmation and ideas from you guys if we are doing it right.
Below is the description of our design:

Site A (production) connects via MPLS to Site B (DR).

MPLS line speed: 50Mbps
Daily changes in our VMs = ~43GB (I based this number on the daily incremental backup job - transferred size)
If my calculation is correct, assuming we dedicate this line to replication/backup traffic alone, it will take around 2hr+ to complete the transfer of data
If we are able to turn on the WAN accelerator, the time taken should in principle go down, but we are not sure by how much. Assuming our RPO is 1 day, the time required to transfer the data should be within reason, right?
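A rough check of the arithmetic above, as a minimal sketch. It assumes decimal units (1 GB = 8,000 megabits), and the `efficiency` parameter is a hypothetical knob for how much of the line rate is really achieved; neither comes from Veeam.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to push data_gb over a link_mbps line.

    efficiency is an assumed fraction (0, 1] of the line rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    megabits = data_gb * 8 * 1000  # decimal GB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


best_case = transfer_hours(43, 50)                  # dedicated line, no overhead
half_rate = transfer_hours(43, 50, efficiency=0.5)  # assumed 50% usable throughput
print(f"best case: {best_case:.2f} h, at 50% efficiency: {half_rate:.2f} h")
```

At full line rate the 43GB works out to a little under 2 hours; any overhead on the line pushes it up proportionally.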

The next thing is whether to use replication or backup replication to transfer data to the DR site.
Our backup server is just a server with direct attached disks. It will be connected to our SAN storage for backup. The backup job will run every day to this server. The question now is: do I configure a secondary backup replication job to copy the backup to my DR site (another SAN storage), or do I configure a separate replication job to replicate the VMs over?
If we do a backup replication job over to DR, I would assume the files are in the form of Veeam backup files, which are not directly usable for DR purposes, so I would have to restore them during DR?
If we do a replication job over to DR, I would assume that the files are in the form of VMware files (we are using VMware) and can be directly used, am I right?
If my assumption is correct, a replication job would make more sense as it saves me an additional step. However, it would mean that the replication will have to run from the production SAN and not from the backup server storage? That will have some impact on production performance.

The third question would be where to put our tape library. I am thinking of putting the tape library at our DR site. So essentially the replicated data will be transferred to the tape library at our DR site once the replication is complete. Is that an advisable design or should we put our tape library at production site?
Hope I am clear on my questions and hope to hear from you all soon! Thanks!

millardjk · Nov 28, 2016 5:01 am

Hi!

On a "good day" and assuming all the best-case efficiencies, you'll transfer the data you think you'll have in 2h. In practice, however, there are delays in how fast you can actually push the data across the WAN, so I'd actually double that time estimate at a minimum. And if you add the WAN accelerator, there's no way to know how much of that will be accelerated, but you *will* have one-at-a-time processing: the WAN Accel doesn't handle more than one object at a time and, due to any number of factors, won't ever process faster than your WAN connection. You also have additional time added for digest builds & comparisons, etc. While the WA will do a nice job of reducing the amount of data that has to be sent across the WAN, it won't necessarily speed things up significantly from a raw backup window. And, as in all things, your specific setup & data set will determine your performance variations from theoretical maximums.

All that leads me to ask: if your RPO is 1 day, and you're getting daily copies over within that period, do you *need* it to be faster?

Next, it may be pedantic, but "replication" is the copy of a whole VM using VMDK & snapshots to provide storage and recovery points. "Backup copies" are just that: copies of backups, which in turn are VMs stored in the VBK/VIB/VRB file format. You correctly identify the conundrum (send copies of VMs or copies of backups), but given sufficient I/O and a live Veeam server, you can use either to quickly "spin up" a VM in DR: the replica can be instantly powered on without a Veeam server, but a backup copy can be "powered on" with the help of a Veeam server and "Instant Recovery". Ultimately, it depends on whether your DR site will perform double-duty as an offsite backup repository (especially for longer-term archival copies) or purely for DR. In the former case, you'll want to send backup copies to the DR site; in the latter, you'll want to send replicas to DR. In either case, you'll have to balance the storage requirements against compute requirements: if you want to "spin up" your entire environment (which isn't specified) in DR, you'll need a like amount of compute resources; if that's not satisfied with a single host, you'll want a shared storage platform to land replicas on. If you work from backup copies, you can use local storage in the remote repository AND local storage in your compute environment: the remote repository can pull double-duty as a shared resource during "Instant Restore" followed by migration to local storage during the emergency. It's a shoestring budget option, but it will work. Ultimately, you need to determine what sort of outages you'll consider DR-worthy and determine how much capital outlay makes sense for the likelihood of needing to execute on the plan.

That brings up another important architectural decision: should you keep the Veeam server on the physical, or create a VM for the Veeam server that can be used as a restore proxy while the existing physical continues to be used as a backup proxy (for SAN-direct backups) and repository? I suggest this variant of the architecture because with a VM for the Veeam server, you can REPLICATE IT to the DR site and it's ready-to-run immediately in case of a DR event...

Finally, keep your tape library as close to the "mainline" backups as you can. Tape-out is a specialized form of backup copy, and unless your source storage (and proxies) have the performance to "keep the pipe full" you'll end up increasing the copy time due to tape "shoe shining" (the tape drive buffers empty, so the drive back-winds the tape a bit and waits until the buffers are full again before streaming forward again. It's time consuming & inefficient). Trying to get it to work over a 50Mbps WAN when tape ingest speeds are upwards of 140MB/s will result in LOTS of shoeshine, which in turn will not only make the jobs run longer, but also wear out the media (if reused) much faster.
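To put numbers on that mismatch, here is a small sketch using only the two figures from this thread (the 50Mbps line and the ~140MB/s ingest speed). The helper name is made up for illustration, and the real threshold to compare against is your drive's published minimum streaming rate, which is not known here.

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbps / 8


wan_mb_s = mbps_to_mb_per_s(50)   # WAN line from the thread
tape_mb_s = 140.0                 # tape ingest speed quoted above
shortfall = tape_mb_s / wan_mb_s  # how many times faster the drive wants data

print(f"WAN feeds at most {wan_mb_s:.2f} MB/s; the drive ingests ~{tape_mb_s:.0f} MB/s "
      f"({shortfall:.1f}x more than the WAN can supply)")
```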

ysm · Nov 29, 2016 1:31 am

Hi millardjk,
Thank you for your comprehensive reply. Based on your insight and explanation on WAN acceleration, if I understand you correctly, WAN acceleration will essentially increase the processing time of each backup job, and can only transfer each backup/replication job over the pipe one by one. While it decreases the amount of data sent over the pipe, the total time taken to complete the backup/replication job might not be faster, is that correct?
In an ideal scenario, we would definitely want a bigger pipe for the transfer of data, but we are constrained by budget to some extent. Unless the time taken to replicate/backup to DR is unacceptable from a business standpoint, it is unlikely more budget will be allocated to that. 4h to backup/replicate is indeed a bit long and I am not sure if the management can accept that. We will have to see.
Thanks for affirming my interpretation of the backup vs replication job. I am aware of the spinning up of VM directly from backup feature, but I am not entirely sure of the requirements to do that in terms of storage and compute. Based on your reply, it seems to me, for "instant restore", we will need to make use of the local storage of the host on top of the shared storage, is that right? If that is the case, then it would seem to me direct replication of VM makes a bit more sense in terms of resource requirements. We do intend to put a shared storage in our DR site and multiple hosts to support the spin up of probably at least 50% of the production VMs in the case of DR.
Veeam as a VM instead of physical? Now that's something that we didn't consider in our design. The ability to replicate it over for DR indeed makes a lot of sense. We will definitely look into that possibility. Thanks for the suggestion.
As for the tape library matter, the main consideration that we had is if we put our tape library at production site, is there a need to have a tape library at DR site as well for purpose like: seeding the initial DR setup, restore from tape under some circumstances? That would mean that we would need two tape drives, one at each location. That would no doubt increase cost. What's your take on this to address these issues?
To avoid the shoe-shining issue of tape, can we conduct the writing to tape after the replication/backup jobs are done? Of course that would increase the time of backups being written to tape, as the replication/backup will take at least 4hr, then we start writing to tape...that's something we need to think about if it is acceptable. Essentially we are writing to tape data that is at least 4-5hrs behind.
Thanks again and hope to hear from you soon.

millardjk · Nov 29, 2016 3:17 am

ysm wrote:WAN acceleration will essentially increase the processing time of each backup job, and can only transfer each backup/replication job over the pipe one by one. While it decreases the amount of data sent over the pipe, the total time taken to complete the backup/replication job might not be faster, is that correct?

A qualified "yes." It may NOT take longer than doing it without acceleration, especially once the seed is done, but there is a risk.

ysm wrote:4h to backup/replicate is indeed a bit long and I am not sure if the management can accept that.

Individual VMs may (will probably) take less time, but if they're upset with it taking 4h but satisfied with a 24h RPO then there's a disconnect between their tolerance and their understanding. 4h transfer time is really only a concern when you're looking at RPO of <4h (or, as some would argue, <8h, because the currently-transferring restore point can't be considered valid, making the last restore point which was transferred *at least* 4h old when the current one is being sent). For that reason, you can instead send multiple restore points each day; each restore point will be smaller because less change will have occurred, so the time to get it transferred will be similarly scaled down.
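A back-of-the-envelope model of that scaling, as a sketch only. It assumes changes are spread evenly through the day and that transfer time scales linearly with the amount of change; the function name is hypothetical. It estimates how old the newest *complete* restore point at DR can get just before the next one finishes landing (the interval between points plus the time to transfer one).

```python
def worst_case_age_hours(daily_transfer_h: float, points_per_day: int) -> float:
    """Oldest the newest complete restore point at DR can be, in hours.

    Assumes evenly spread change, so each of the points_per_day transfers
    takes daily_transfer_h / points_per_day hours.
    """
    if points_per_day < 1:
        raise ValueError("points_per_day must be at least 1")
    interval_h = 24 / points_per_day
    transfer_h = daily_transfer_h / points_per_day
    return interval_h + transfer_h


for n in (1, 2, 4, 6):
    print(f"{n} restore point(s)/day -> up to {worst_case_age_hours(4, n):.1f} h old")
```

Under these assumptions, going from one to six restore points a day brings the worst case from 28h down to under 5h with the same 4h of total daily transfer.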

ysm wrote:I am aware of the spinning up of VM directly from backup feature, but I am not entirely sure of the requirements to do that in terms of storage and compute. Based on your reply, it seems to me, for "instant restore", we will need to make use of the local storage of the host on top of the shared storage, is that right? If that is the case, then it would seem to me direct replication of VM makes a bit more sense in terms of resource requirements. We do intend to put a shared storage in our DR site and multiple hosts to support the spin up of probably at least 50% of the production VMs in the case of DR.

Compute for either Instant Restore or Replica is identical: it's the compute (CPU+Memory) required by the VM when running. While Instant Restore is running from the backup restore point, some "writable" storage--it can be local or shared--is required to act as a "delta disk" in conjunction with the restore point; the maximum required would be 100% of the source disk, but only in a worst-case scenario where every single block of the virtual disk is written to while running as an Instant Restore. Of course, migrating from Instant Restore into "production" (aka Full VM) does require sufficient datastore capacity to hold the VM.

Replication, holding a copy of the full VM--plus snapshots representing recovery points--can use local, but shared would be preferred so you don't have to directly manage both space and compute requirements: with shared, you only have to worry about compute, and if DRS is available, even that's minimal. The VMs you consider necessary to spin up during a DR event should be considered for replication; that will be the fastest failover, and sending multiple replicas each day will decrease your RPO from 24h to something appreciably lower.

ysm wrote:As for the tape library matter, the main consideration that we had is if we put our tape library at production site, is there a need to have a tape library at DR site as well for purpose like: seeding the initial DR setup, restore from tape under some circumstances? That would mean that we would need two tape drives, one at each location. That would no doubt increase cost. What's your take on this to address these issues?

Personally, I don't like relying on tape for anything other than archive; if I'm relying on a drive at the DR site, it's because not only Plan A failed, but Plan B (and possibly Plan C) failed as well. If you're getting replicas and backup copies to the DR site, then you're covered for an "instantaneous" disaster. You meet or exceed 24h RPO, and your infrastructure should provide a <1h RTO for 50% of your infrastructure--ie, the most business-critical functions. Keep in mind that one of the most compelling reasons for tape as primary backup media--$$/TB--has pretty well been superseded by well-designed backup repositories comprised of spinning disk and deduplicating file systems. However, I have run into organizations that insist on having current backups on tape, even when using dedupe storage; in that case, we've convinced them that a bare drive (no autoloader) is an acceptable accessory at the DR site for reading tapes in case of "plan D" emergency while using a full automated tape library (sometimes with 2 or more drives) makes sense at the production site, closest to the source data. Yes, that scenario does impose the requirement of a second physical host for the DR-site tape drive, but that was significantly less expensive than a host+autoloader for both sites.

ysm wrote:To avoid the shoe-shining issue of tape, can we conduct the writing to tape after the replication/backup jobs are done?

Your tape drive should indicate the minimum streaming performance; that's the rate at which data must be fed to the tape in order to avoid shoeshine. I don't know of any Veeam-compatible (ie LTO3 or better) drives that have a minimum streaming speed under the maximum speed of your stated WAN capacity. You'll have the problem irrespective of other activity happening. The only reasonable way to use the tape library at DR would be to replicate 100% of your VMs, then backup the VM replicas (not the original VM) to a repository in DR, then copy the backup to tape. In that scenario, 100% of the backup activity is occurring at the DR site, but it's also going to lag behind production by at least the time between replication passes.

When considering tape, you must remember: Veeam will NOT backup directly to tape; you must first backup to disk, then copy the backup to tape. Further, you can only send a first-line backup to tape; you cannot send a backup copy to tape.

ysm · Nov 30, 2016 1:50 am

Hi millardjk,
Thanks again for the reply.
Indeed, I was pondering the idea of full replication to DR and tape out from there. But with the kind of bandwidth we can afford, I don't think that's very feasible or practical. That leaves us with backup and tape out at production, replication of the important VMs to DR, and a tape drive at DR to seed the DR and for tape restore if necessary.
Ya, it is indeed rare nowadays for us to fall back to tape for backups since we have backups on spinning disks. But we still believe in tape for, as you mentioned, archival purposes (i.e. storing backups for longer terms like 1 month or 1 year) and as a form of offline backup (safe from ransomware if the backup is taken before it strikes), so the tape library is still in the design consideration.
As for the replication frequency over to DR, we will really have to test it out and see. We will most likely do what you have suggested, i.e. replicate more frequently within a day to reduce the transfer load. If the changes are consistent throughout the day, then the method would work. Otherwise, if there are sudden big changes (in our example of 40-50GB/day, say 20GB suddenly changes within one hour and has to be transferred over while we are replicating every hour), then we might not be able to meet the replication window.
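A quick sketch of that burst scenario, for what it's worth. It assumes decimal units (1 GB = 8,000 megabits), and `efficiency` is a hypothetical factor for real-world throughput versus line rate; the 50% value is only an illustration, not a measurement.

```python
def fits_window(changed_gb: float, link_mbps: float, window_h: float,
                efficiency: float = 1.0) -> tuple[bool, float]:
    """Return (fits, hours_needed) for sending changed_gb within window_h.

    efficiency is an assumed fraction (0, 1] of the line rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = changed_gb * 8 * 1000 / (link_mbps * efficiency)
    hours_needed = seconds / 3600
    return hours_needed <= window_h, hours_needed


# 20GB burst, 50Mbps line, hourly replication window
ok_ideal, h_ideal = fits_window(20, 50, window_h=1)
ok_half, h_half = fits_window(20, 50, window_h=1, efficiency=0.5)
print(f"line rate: fits={ok_ideal} ({h_ideal:.2f} h); "
      f"50% efficiency: fits={ok_half} ({h_half:.2f} h)")
```

So a 20GB burst only just fits an hourly window at the theoretical line rate, and misses it as soon as the usable throughput drops much below that.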
