RTO RPO Discussion

Post by **gingerdazza** » Jun 16, 2017 7:14 am

Interested to have people's thoughts on the following:

For aggressive RTO and RPO, oftentimes we deploy highly available, highly replicated, highly automated failover. Think for instance about a stretched synchronous cluster for VMs. In theory that would achieve very aggressive RTO/RPO (sometimes 0).

And of course, you'd still back up those VMs with Veeam.

But then you think... my Veeam backups need a tertiary backup to tape, taken off site, for archive and for air-gapping.

So, what RTO/RPO in this scenario have we actually achieved?
From a design standpoint we have DESIGNED the platform for RTO/RPO 0 in site failover scenarios.
But if something goes wrong with the entire compute platform, we have to run full restores of all VMs from Veeam backup. As good as Veeam is, that could take, say, 8 hours. So, say RPO 12 hours and RTO 8 hours.
And if you've been ransomwared, and your Veeam backups are also encrypted, you may need to retrieve tapes, restore the tape data to Veeam, and then run a further 8 hours of Veeam restore.

How do we express this to the business in terms of RTO/RPO? Do we say it has an RTO/RPO of < 1 hour (but qualify this based on scenario)? Or do we say the worst-case scenario could be a 24-hour-plus recovery, so we call it out as an RTO of 24 hours?
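
One way to express this is as a short table of scenarios rather than a single number. The following is a minimal Python sketch using the figures from this post. The tape retrieval time, the tape-to-Veeam restore time (`TAPE_RETRIEVAL_H`, `TAPE_TO_VEEAM_H`) and the 24 hour RPO for the tape path are placeholder assumptions, since the post gives no figures for them.

```python
from dataclasses import dataclass

# Assumed placeholders: the post gives no figures for these two steps.
TAPE_RETRIEVAL_H = 8.0   # hypothetical: time to get tapes back on site
TAPE_TO_VEEAM_H = 8.0    # hypothetical: time to restore tape data into Veeam


@dataclass(frozen=True)
class Scenario:
    name: str
    rto_hours: float
    rpo_hours: float


scenarios = [
    # Figures for the first two rows come from the post.
    Scenario("Site failover (stretched cluster)", 0.0, 0.0),
    Scenario("Compute platform lost, restore from Veeam", 8.0, 12.0),
    # Tape path: retrieval + tape-to-Veeam restore + the same 8 h Veeam restore.
    # The 24 h RPO is an assumption (for example, a daily tape-out).
    Scenario(
        "Ransomware, Veeam backups encrypted, restore from tape",
        TAPE_RETRIEVAL_H + TAPE_TO_VEEAM_H + 8.0,
        24.0,
    ),
]


def worst_case(items):
    """Return (worst RTO, worst RPO) in hours across all scenarios."""
    return max(s.rto_hours for s in items), max(s.rpo_hours for s in items)


for s in scenarios:
    print(f"{s.name}: RTO {s.rto_hours:g} h, RPO {s.rpo_hours:g} h")

rto, rpo = worst_case(scenarios)
print(f"Worst case: RTO {rto:g} h, RPO {rpo:g} h")
```

Presenting both the per-scenario rows and the worst-case line lets the business see the designed figure and the last-resort figure side by side.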

Thoughts please?

Post by **Mike Resseler** » Jun 16, 2017 9:41 am

What I have done in the past is explain to my management that there are different layers of defense. As an example, I gave an RTO of a couple of hours for item-level recovery (ILR), and the RPO depended on the workload being protected. An outage of a workload, under certain requirements (think hardware or VM space/resources being available), also had a specific RTO. In fact, in many cases the RTO was stricter for some workloads than for others, all depending on their importance. This obviously applied when I could restore from the disk-based backups. Another example: I agreed with management that ILR of anything older than 30 days would take multiple days, because I needed to get it from tape as it was no longer on disk. The same applied to a bigger disaster where I would need to grab my tapes: we would start with a delay of many hours just to retrieve them from off-site storage. (Not to mention that I might need to purchase hardware and so on...)

But in conclusion, I always believe that it is best to discuss this with management, workload owners and so on. If you give them honest numbers for different scenarios and they don't agree with them, that is the moment to ask for additional funding.

My 2 cents
Mike

Post by **gingerdazza** » Jun 16, 2017 3:16 pm

Any other thoughts?

This must surely be a widely debated subject?!

Post by **larry** » Jun 16, 2017 3:26 pm

I have the buckets below, plus a spreadsheet listing every VM with its RTO/RPO in each bucket. The RTO is for a whole site, worst case, which is why the RTO from local backup is a couple of hours. A single VM has a low RTO from local backup, but restoring everything takes a couple of hours. I quote the worst case for each bucket to the IT committee.

All backups are application consistent. If a SAN snapshot is not application consistent but only crash consistent, then it is not considered a backup.

Local SAN Backup (Snapshot)
· Very Low RTO – As low as server boot time
· Very Low RPO – As low as 15 minutes
· Very Low Impact on Production
· Very Expensive
· Only couple days online
Remote DR site SAN Backup (SnapMirror)
· Very Low RTO – As low as server boot time
· Very Low RPO – As low as 15 minutes
· Very Low Impact on Production
· Impacts WAN
· Very Expensive
· Only couple days online
Local Disk to Disk (Veeam)
· Minimum of couple hours RTO
· 24 hour RPO
· Low Impact on Production
· Inexpensive
· 30 days spinning
Remote DR site Disk to Disk (Veeam)
· Minimum of couple hours RTO
· 24 hour RPO
· Low Impact on Production
· Inexpensive
· 30 days spinning
· No impact on WAN if done after SAN SnapMirror.
Tape – Locked in fireproof safe, Groton.
· Minimum 24 hours RTO
· 24 hour RPO
· Low Impact on Production
· Inexpensive
· 30 days offline
· No impact on WAN
· Offline Copy – protected from the constant risk of being corrupted, deleted, infected or hacked.
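
The buckets above can also be written down as data, which makes it easy to ask which bucket is the fastest one still usable after a given failure. This is only a sketch: the 0.1 hour figure for "server boot time" and the 2 hour figure for "a couple of hours" are assumed approximations, and `best_surviving_tier` and `online_tier_names` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    min_rto_hours: float   # best-case figure from the list above
    rpo_hours: float
    offline: bool          # True only for the offline (tape) copy


# "Server boot time" (0.1 h) and "a couple of hours" (2 h) are
# assumed approximations for illustration.
TIERS = [
    Tier("Local SAN snapshot", 0.1, 0.25, False),
    Tier("Remote SAN snapshot (SnapMirror)", 0.1, 0.25, False),
    Tier("Local disk to disk (Veeam)", 2.0, 24.0, False),
    Tier("Remote disk to disk (Veeam)", 2.0, 24.0, False),
    Tier("Tape", 24.0, 24.0, True),
]


def best_surviving_tier(lost: set[str]) -> Optional[Tier]:
    """Pick the lowest-RTO tier whose name is not in `lost`."""
    survivors = [t for t in TIERS if t.name not in lost]
    return min(survivors, key=lambda t: (t.min_rto_hours, t.rpo_hours), default=None)


def online_tier_names() -> set[str]:
    """Names of every tier that is not an offline copy."""
    return {t.name for t in TIERS if not t.offline}


# Nothing lost: the local SAN snapshot wins.
print(best_surviving_tier(set()).name)
# Every online copy lost: only tape is left.
print(best_surviving_tier(online_tier_names()).name)
```

Running the second query against each bucket in turn shows how quickly the quoted RTO degrades as copies are lost.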

Post by **gingerdazza** » Jun 16, 2017 3:35 pm

Thanks Larry

How would you deal with scenarios where multiple/all systems have been blown away? Because you may have an RTO of 4 hours for AppA on its own, but if AppA, AppB, AppC, ... AppZ have all been blown away, that RTO is generally not achievable. The aggregated RTO might be definable in days, especially from low cost options like tape. No?
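
The aggregation effect can be estimated with simple arithmetic. A rough Python sketch, assuming each app has a known standalone restore time, that the restore infrastructure can run a fixed number of restores at once, and that concurrent restores do not slow each other down. The 26 apps, the 4 hour duration per app and the `streams` values are hypothetical.

```python
import heapq


def aggregate_rto(restore_hours: list[float], streams: int) -> float:
    """Hours until the last restore finishes with `streams` parallel restores.

    Jobs are started in list order on whichever stream frees up first.
    """
    if streams < 1:
        raise ValueError("streams must be at least 1")
    # Each heap entry is the time at which one stream becomes free.
    free_at = [0.0] * min(streams, max(len(restore_hours), 1))
    heapq.heapify(free_at)
    finish = 0.0
    for hours in restore_hours:
        start = heapq.heappop(free_at)
        end = start + hours
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical example: 26 apps (AppA..AppZ), each 4 hours on its own.
apps = [4.0] * 26
print(aggregate_rto(apps, streams=1))   # fully serial: 104.0
print(aggregate_rto(apps, streams=4))   # four concurrent restores: 28.0
```

Even with four concurrent restores, the 4 hour per-app figure becomes more than a day for the whole estate under these assumptions.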

And if your Veeam backups have been encrypted, your RTO of say 2 hours might turn into 8 hours because you have to retrieve tapes - you had to default to a tertiary "last chance" backup. So do you tell the business they have a 2 hour RTO designed system, or an 8 hour RTO system in case you ever have to default to that last resort?

Post by **larry** » Jun 16, 2017 6:48 pm

My RTO is quoted for the loss of all services at a site. Knowing the order of restore scales the RTO to priority. We test whole-site losses every year, so I have real numbers from when IT has hands on a keyboard. A single-server RTO is only boot time from a local SAN snapshot, with a known RPO.
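
To illustrate how a restore order turns into per-workload RTOs, here is a small Python sketch. It assumes restores run strictly one after another in priority order; the workload names, priorities and durations are invented for the example and `restore_schedule` is a hypothetical helper.

```python
from itertools import accumulate


def restore_schedule(workloads: list[tuple[str, int, float]]) -> dict[str, float]:
    """Map each workload name to the hour at which its restore completes.

    `workloads` holds (name, priority, restore_hours); lower priority
    numbers are restored first, one after another.
    """
    ordered = sorted(workloads, key=lambda w: w[1])
    finish_times = accumulate(w[2] for w in ordered)
    return {w[0]: t for w, t in zip(ordered, finish_times)}


# Hypothetical workloads and durations, for illustration only.
plan = restore_schedule([
    ("file server", 3, 2.0),
    ("domain controller", 1, 0.5),
    ("ERP database", 2, 1.5),
])
for name, done_at in plan.items():
    print(f"{name}: restored by hour {done_at:g}")
```

The last entry in the plan is the whole-site figure, while the earlier entries show what the highest-priority workloads can expect.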

Post by **gingerdazza** » Jun 19, 2017 8:24 am

Thanks Larry. But I would hypothesise that it is still possible that someone/something could compromise your SAN snapshots in both DCs, therefore destroying your RTOs. You'd then have to have (hopefully) another form of backup to default back to (perhaps Veeam - which could still ALSO be compromised as there's no air gap) or even to a tertiary air gapped backup like tape?

For me, these days we still need Veeam type backups even for replicated SAN resources, and we also still need tape on top of that for air gapping.

Post by **larry** » Jun 19, 2017 1:07 pm

I use Veeam-created SAN snapshots for an RPO of 1 hour over the last 36 hours. RTO is 4 hours, but it can be done in 2; for a few VMs, the RTO is minutes.
Veeam-created remote snapshots: RPO 1 hour, last 36 hours, RTO 4 hours.
Veeam backup on disk: RPO 24 hours, 90 days, RTO 4 hours.
Veeam backup on disk off-site: RPO 24 hours, 30 days, RTO 6 hours.
Veeam tape: RPO 24 hours, RTO 24 hours, 7 years of data.
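
Because each of these copies has a different retention, the achievable RTO also depends on how old the data to be restored is. A minimal Python sketch using the figures from the list above; `fastest_tier_for` is a hypothetical helper and the year length ignores leap days.

```python
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    rpo_hours: float
    rto_hours: float
    retention_hours: float


DAY = 24.0
YEAR = 365 * DAY  # ignores leap days

# Figures taken from the list above.
TIERS = [
    Tier("SAN snapshot", 1, 4, 36),
    Tier("Remote SAN snapshot", 1, 4, 36),
    Tier("Disk backup", 24, 4, 90 * DAY),
    Tier("Off-site disk backup", 24, 6, 30 * DAY),
    Tier("Tape", 24, 24, 7 * YEAR),
]


def fastest_tier_for(age_hours: float) -> Optional[Tier]:
    """Lowest-RTO tier whose retention still covers data `age_hours` old."""
    candidates = [t for t in TIERS if t.retention_hours >= age_hours]
    return min(candidates, key=lambda t: (t.rto_hours, t.rpo_hours), default=None)


for age in (12, 10 * DAY, 200 * DAY):
    tier = fastest_tier_for(age)
    print(f"data {age:g} h old: {tier.name}, RTO {tier.rto_hours:g} h")
```

Anything older than 90 days falls through to tape, so the 24 hour RTO applies regardless of how healthy the other copies are.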
