Backup and Backup Copy off-site job design

Post by **dbr** » Apr 17, 2017 2:31 pm

Hi folks,

First of all: sorry for the long text, and thanks in advance for reading.

I'm facing some tough challenges. Let me explain the situation:

We have an environment with approx. 16TB of backed-up data (the sum across all datacenter jobs, taken from storedsize in the database view dbo.ReportSessionsView) and a change rate of ca. 10 percent. We keep one restore point per day, 30 restore points in total. Everything should be copied off-site over a WAN link whose bandwidth can just handle the daily change rate when copying at full capacity throughout a one-day backup cycle (150-300 Mbit/s targeted). At the moment there are 23 backup jobs processing 511 VMs overall. Every backup job is linked to a separate backup copy job.

Given environment:
- Veeam B&R 9.5 U1 Enterprise Plus
- Mixed VMware environment: 5.0, 5.5, 6.0
- 23 backup jobs
- 23 backup copy jobs
- 511 vms
- 16TB full backups (may be reduced by consolidating backup jobs and thus benefiting from more dedup)
- 10 percent change rate
- 1 WAN connection to process 1.6TB within 24 hours
- 2 backup repositories on-site with 54.6TB of capacity each (scale-out repository is not yet used)
- 2 backup copy repositories off-site with 54.6TB of capacity each (scale-out repository is not yet used)
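As a rough check of the WAN figures above, the sustained rate needed to move the daily increment can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), a link that is available for the full 24 hours, and no protocol overhead, compression or WAN acceleration. The function names are made up for illustration.

```python
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 Mbit = 10**6 bits.
BITS_PER_TB = 8 * 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def required_mbit_per_s(data_tb: float, window_s: float = SECONDS_PER_DAY) -> float:
    """Sustained rate (Mbit/s) needed to move data_tb within window_s seconds."""
    return data_tb * BITS_PER_TB / window_s / 10**6


def transfer_hours(data_tb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move data_tb over a link running at full speed."""
    return data_tb * BITS_PER_TB / (link_mbit_per_s * 10**6) / 3600


if __name__ == "__main__":
    daily_change_tb = 16 * 0.10  # 16TB of backups, 10 percent change rate
    print(f"required rate: {required_mbit_per_s(daily_change_tb):.0f} Mbit/s")
    for link in (150, 300):  # the targeted WAN range
        print(f"{link} Mbit/s -> {transfer_hours(daily_change_tb, link):.1f} h")
```

Under these assumptions the lower end of the targeted range leaves almost no slack within a day, so any time the link sits idle (waiting for a source job or for a merge) eats directly into the window.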

Goals to achieve:
- Decrease the number of backup jobs and copy jobs to reduce the load on the backup server and benefit more from dedup. The high number of jobs is due to the lack of parallel processing prior to version 7. Remark: We have around 100 backup / backup copy jobs in total. We back up additional VMs in our plants (one backup job and one backup copy job per plant).
- Copy all incremental data off-site within a backup cycle of one day.

Problems:
- A copy job only starts once the source backup job has finished, not when the first VM has finished processing, even if per-VM backup files are configured on the source repository.
- If we configure fewer backup jobs, a copy job will be idle until the first job has finished.
- Reverse incremental or forever forward incremental additionally extends the job duration because of the merge, which happens while the job runs (reverse) or at the end (forever forward).
- Weekly active fulls are only possible if we decrease the number of restore points, because of the large full backup size. The approach would be to run forward incremental with active fulls to avoid the merge, giving a shorter job duration so the backup copy can begin earlier, at the cost of additional capacity which we don't have (yet).
- With only 2 copy jobs the merge will take forever, and while a merge is active the WAN link is idle.
- We cannot use active GFS with a lower total number of restore points, because this would only help if we could copy with the "Read the entire restore point from source backup..." option enabled, and that is not possible due to the limited WAN link.
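The capacity concern behind the active-full item can be roughed out as follows. This is a simplified sketch, not Veeam's actual retention logic: it assumes a 16TB full and a 1.6TB daily increment with no further compression or dedup gains, and it assumes that a forward incremental chain is deleted only once the newer chains on their own still hold the configured number of restore points. The function names are hypothetical.

```python
def forever_forward_tb(full_tb: float, incr_tb: float, restore_points: int) -> float:
    """Steady-state size of one full plus (restore_points - 1) increments."""
    return full_tb + (restore_points - 1) * incr_tb


def periodic_full_peak_tb(full_tb: float, incr_tb: float, restore_points: int,
                          full_every_days: int, days: int = 365) -> float:
    """Peak on-disk size for forward incremental with a periodic active full.

    Assumed retention rule: the oldest chain is removed only when the
    remaining chains still contain at least restore_points points.
    """
    chains: list[list[float]] = []  # each chain starts with its full backup
    peak = 0.0
    for day in range(days):
        if day % full_every_days == 0:
            chains.append([full_tb])      # active full starts a new chain
        else:
            chains[-1].append(incr_tb)    # increment extends the newest chain
        # Measure before retention runs, since that is when the disk is fullest.
        peak = max(peak, sum(sum(chain) for chain in chains))
        while len(chains) > 1 and sum(len(c) for c in chains[1:]) >= restore_points:
            chains.pop(0)
    return peak


if __name__ == "__main__":
    capacity_tb = 2 * 54.6  # both on-site repositories combined
    print(f"forever forward: {forever_forward_tb(16, 1.6, 30):.1f} TB")
    print(f"weekly active full, peak: {periodic_full_peak_tb(16, 1.6, 30, 7):.1f} TB")
    print(f"on-site capacity: {capacity_tb:.1f} TB")
```

With these inputs the forever forward chain comes to roughly 62TB, while the weekly active full variant peaks at about 146TB, well above the 109.2TB of combined on-site capacity. That is consistent with the statement that weekly active fulls would only fit with fewer restore points or more disk.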

Considerations so far:
- Configure two copy jobs (one for every repository on-site), use the current copy repository as an on-site staging repository, and buy a NAS sized to hold the complete data of both copy jobs. Then copy with rsync instead of using backup copy jobs. Pros: no merge needed because both primary copy jobs have already merged all data; rsync should be able to copy only delta changes to the off-site NAS. Cons: unreliable; rsync isn't aware of running Veeam backup copy jobs, so rsync and backup copy jobs may overlap; I read "If rsync was any good, we would not have developed Backup Copy." -> I guess rsync wouldn't be a good option in any case, right?
- Configure two copy jobs (one for every repository on-site) that copy the data of all jobs to the current backup copy repository, but on-site, and use GFS with "Read the entire restore point from source backup..." if necessary to avoid the merge. Use two more backup copy jobs to copy the data over the WAN to a newly bought NAS on even days, and two additional copy jobs that copy the data over the WAN on odd days. In this scenario both copy pairs should have enough time to merge the 1.6TB incremental backup once the maximum number of restore points is reached.
- Configure just two backup copy pairs (2 copy jobs x 2 repositories) that copy over the WAN on alternating days, without an additional on-site copy repository. How would I configure alternating copy jobs, anyway? Just set the cycle to two days? What if the backup cycles of both copy jobs start at the same time (e.g. at backup service start)? Is it even possible to configure alternating backup copy jobs?
- Until now we haven't used WAN acceleration because of the current number of backup copy jobs and the WAN accelerator's limit of one task at a time. Would WAN acceleration help me in this situation as well?

Conclusion:
Are there any suggestions how to configure the complete backup and backup copy process? What would you do if you were in my place?

Any recommendations, suggestions or other ideas are welcome. Thanks in advance.

Daniel

Post by **ssjgogeta** » Apr 19, 2017 9:55 am

Hi Daniel,

What are the RPOs for each of the 511 VMs you are attempting to copy offsite daily? Are they all the same?

I'm assuming that you have critical production VMs which must be maintained offsite daily - but are all 511 of them the same priority?

If not, maybe only the VMs which need to be shipped offsite daily should be copied offsite. You're always going to have a problem with your backup window trying to offsite 16TB per day (as you need adequate time to back up the production data first before the new restore points are shipped offsite via the copy jobs). Unless of course you have super-fast storage end-to-end and a massive pipe to your DR site....

I don't think that 23 jobs for 511 VMs is excessive either...but others may beg to differ?

Without knowing the criticality of your data, I would look to keep most of your disk-based backups onsite at the production datacenter. Copy off to tape locally and move offsite daily. Replicate critical VMs to your DR site leveraging WAN acceleration.

Dave.

Post by **dbr** » Apr 19, 2017 10:47 am

Thanks for your reply, Dave.

Unfortunately, because we got no off-site RPO from the business, the off-site RPO is the same for everything: one day. All I was told is "Transfer all backed up data in datacenter off-site." Yes, I know it's a very general statement, but I cannot get a more specific one. Anyway, shouldn't it be possible to use a backup copy job to copy backups to our second datacenter (which is also on-site) via a 10G link and, in parallel, schedule two additional alternating backup copy jobs that copy the 1.6TB increment over the WAN link with a cycle of 2 days? In that case, on day 1 the first job has enough time to copy 1.6TB over the WAN and merge all data afterwards. On day 2 the copy cycle of the second backup copy job starts, and it also has 2 days to process both copy and merge.

Remarks:
- My boss doesn't want to use tapes. And neither do I...
- I only want to process 1.6TB via the WAN link once a day, not the full backup of 16TB. Seeding would be done on-site so that no full backup has to be copied through the WAN. Backup copy without GFS is always forever forward incremental, so there should be no need to copy all 16TB of data more than once.

Daniel

Post by **ssjgogeta** » Apr 19, 2017 12:09 pm

With the lack of well-defined recovery metrics you are always going to be pushing s**t uphill.

Go back to your boss and work toward classing/categorizing the systems within your environment according to their criticality.

Three categories which have always worked well for me are:

- Production, Critical (GOLD)
- Production, Non-Critical (SILVER)
- Non-Production, Non-Critical (BRONZE)

GOLD systems should be in-scope of DR (assuming you have a DR plan). These systems could be replicated/copied daily offsite (if required to suit a tight RPO). Whether you leverage tape or otherwise, you should be following the 3-2-1 backup rule. If all your backups are online, you're not sufficiently protected.

SILVER systems should have longer RPO & RTO timeframes. These systems could be replicated/copied offsite less frequently (e.g. Weekly to suit the longer RPO).

BRONZE systems would have even longer recovery timeframes, and perhaps they are stored only onsite as they are not required in a DR event?

From what you are trying to achieve, it sounds like you need either:

a) better/faster infrastructure in order to copy everything offsite, every day.

b) begin prioritizing your backups according to criticality and copy only what you need to offsite daily.

Either way you're screwed if your online backups are compromised somehow....

Post by **dbr** » Apr 19, 2017 1:40 pm

That is very helpful. Basically I know how to do backup right (3-2-1 etc.), and I have told my boss many times that we have to classify our systems into at least 3 groups. Categories will help ensure we do only what is needed for a system (not only for backup (RPO/RTO) but also in terms of availability questions and DR scenarios). I agree, and in my opinion the "one size fits all" approach is always more expensive than doing it "right", or in other words doing "what is really needed". Thanks a lot for your ideas.
