R&D Forums

gingerdazza · Feb 04, 2022 8:12 am

Hi. I'm trying to understand how best to articulate RTO to an organisation. My query really centres around this:

If you have 1000 VMs across 100 different applications/services, and the business says "what RTO can you give us?" or "how long would it take to recover?", do you respond with the time it would take to recover each service (perhaps it would take 2 hours to recover App1 on its own), or do you respond with an RTO that reflects how long it would take to recover all 1000 VMs? After all, if you've been hit by a complex attack and all VMs need recovering, you're not going to be able to recover in the same 2-hour RTO you would have if just one app needed recovering.

And, separately, what if an attacker has managed to delete the backup data you were hoping to recover from, no matter what defense mechanisms you put in place? How do you articulate to the business that the RTO and RPO would not be met in this scenario?

Would appreciate people's thoughts.

Feb 04, 2022 9:12 am

Hi,

There are two different situations you ask about:
- RTO if a limited number of services fail or some data gets lost
- RTO if a whole datacenter (all 1000 VMs) fails or gets encrypted

Depending on which scenario you or your customer is looking at, the answers will be different.
What both have in common is that you will need to create an "emergency recovery concept" or "DR concept" that covers the required RTO and, just as important, the dependencies of each service.
It is almost useless to talk about "how fast can we recover" if, in such a situation, no one knows in what order the services need to be recovered (dependencies) and which of them are the most important.
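
As a minimal sketch of the dependency point above: once each service's prerequisites are written down, a valid recovery order can be derived mechanically. The service names and the dependency map below are hypothetical, and Python's standard `graphlib` module (3.9+) is only one way to do this.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services that must be up first.
dependencies = {
    "dns": set(),
    "active_directory": {"dns"},
    "database": {"active_directory"},
    "app1": {"database", "active_directory"},
    "web_portal": {"app1"},
}

def recovery_order(deps):
    """Return one valid order in which the services can be recovered."""
    return list(TopologicalSorter(deps).static_order())

order = recovery_order(dependencies)
print(order)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is itself a useful finding to have before a disaster rather than during one.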

When I do such designs we usually go with at least 3 service levels (SLA1, SLA2, SLA3 or Gold, Silver, Bronze, etc.) into which IT has to sort the services.
Each service level defines both the RTO and the RPO required for the services in that level.
Once that is done, it gets easier to tell what is needed to fulfill, e.g., an RTO of 1 hour for SLA1.
It makes a difference whether only 10 VMs out of 1000 are in SLA1 or whether you have 100 VMs in that level.
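
To illustrate why the number of VMs in a tier matters, here is a rough back-of-envelope sketch. It assumes restore time is dominated by total data volume divided by sustained restore throughput, which ignores dependencies, boot and verification time, and any decision time. All figures are made up for illustration.

```python
def estimated_restore_hours(vm_count, avg_vm_size_gb, throughput_gb_per_hour):
    """Rough restore time for a tier: total data / sustained throughput."""
    return vm_count * avg_vm_size_gb / throughput_gb_per_hour

# Hypothetical numbers: 200 GB per VM, 2000 GB/hour aggregate restore throughput.
small_tier = estimated_restore_hours(10, 200, 2000)    # 10 VMs in SLA1
large_tier = estimated_restore_hours(100, 200, 2000)   # 100 VMs in SLA1

print(f"10 VMs: {small_tier:.1f} h, 100 VMs: {large_tier:.1f} h")
```

With these assumed numbers, 10 VMs would just fit a 1-hour RTO, while 100 VMs on the same infrastructure would need around 10 hours.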

Only when that full picture is clear can you say "how long it will take" and "what RTO we have".
Without categorization you can only guess what it will look like, and most likely you will be wrong, because no one has ever really looked into the needs and the dependencies.
That was the case for ALL the companies I had to assist with a DR in the past that didn't have such a concept.
Reality will differ a lot from the guess.

My personal favorite in those designs and concepts is always the "DTO" (decision time objective): even if all of the above is clear, you still need someone, or a process, that tells you when and how to start recovery in situation X.
For example, with ransomware I know that most customers need 7-14 days until it is clear what to recover and they are sure they won't do anything wrong.

So, long story short:
In my eyes there is no way to answer "what RTO can you give us?" or "how long would it take to recover?" without a full picture and lots of details.

Hope that helps.

gingerdazza · Feb 04, 2022 10:27 am

Thanks, very useful.
The thing that's difficult to articulate is this: OK, I have the capability to give you an RTO of, say, 2 hours from backup/replication. However, if a cyber attack has wiped out the very mechanism you need to recover from (disk backup), you might still be able to recover, but it might take 2 days (i.e. from tape). And if the cyber attack is so complex that it has reset all your hardware to factory state, it may actually take 7-10 days to recover. No? What you don't want is to give the business an RTO of 2 hours, and then, when the worst happens, take 10 days and have them say "I thought you said 2 hours!" How do you articulate that in your experience, or don't you?

Feb 04, 2022 11:41 am

I always tell customers that there is a way to keep the 2 hours (by the way, 2 hours is already a very good RTO for a complex service) if they are ready to put a significant amount of money into the hardware and infrastructure needed to guarantee those 2 hours for, let’s say, their SLA 1 services. That would mean 3 copies of the backup on very fast disk systems (in combination with replication technologies).
For most customers it is not affordable to invest a lot of budget in something that, hopefully, never happens.
For them it’s about finding the right balance between investment and outcome, knowing that if the worst case happens they will need, in your example, 7-10 days and not 2 hours.
But as I said above, when such a disaster happens it is usually not about the pure recovery time but about all the surrounding tasks (where to get new hardware, who decides, is the DC still usable…).
As you said, it’s a very complex topic, and guaranteeing an RTO is never a good idea, as things will most likely look different once the disaster happens.
Try to have a good dialog and understand the needs, then propose solutions (maybe one that could hold the 2 hours but requires a lot of investment, and two that can’t hold it in every case but need less investment) and let them choose which way they want to go.
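
One way to put such options in front of the business is a small scenario table rather than a single number. The sketch below reuses the illustrative figures from earlier in this thread (2 hours, 2 days, 7-10 days); the scenario labels and the formatting helper are hypothetical.

```python
# Scenario -> (low, high) recovery estimate in hours, using figures from this thread.
scenarios = {
    "Backup/replication intact": (2, 2),
    "Disk backups lost, restore from tape": (48, 48),
    "Hardware reset to factory state": (7 * 24, 10 * 24),
}

def format_estimate(low_h, high_h):
    """Render an hour range in hours (under 2 days) or in whole days."""
    def fmt(hours):
        return f"{hours} h" if hours < 48 else f"{hours // 24} d"
    if low_h == high_h:
        return fmt(low_h)
    return f"{fmt(low_h)} to {fmt(high_h)}"

for name, (low, high) in scenarios.items():
    print(f"{name:40s} {format_estimate(low, high)}")
```

Presenting the estimates per scenario makes it explicit that the 2-hour figure only applies while the primary backup or replica is still available.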

soncscy · Feb 06, 2022 12:27 pm

gingerdazza wrote: Feb 04, 2022 10:27 am Thanks, very useful.
The thing that's difficult to articulate is this: OK, I have the capability to give you an RTO of, say, 2 hours from backup/replication. However, if a cyber attack has wiped out the very mechanism you need to recover from (disk backup), you might still be able to recover, but it might take 2 days (i.e. from tape). And if the cyber attack is so complex that it has reset all your hardware to factory state, it may actually take 7-10 days to recover. No? What you don't want is to give the business an RTO of 2 hours, and then, when the worst happens, take 10 days and have them say "I thought you said 2 hours!" How do you articulate that in your experience, or don't you?

I completely get your concern here, and the answer is: just be honest up front.

I've worked with a lot of clients, some whose infrastructure we manage and some we just consult for. When we're consulting, we set aside a special time to talk about Ransomware, and I usually make sure my team explains the following points:

1. Ransomware incidents probably won't go the way you expect in most cases
2. If the attackers got in to Ransom you in the first place, anything connected to a network potentially is fair game for them
3. True physical air-gaps are the only proven solution, so your last resort must accommodate the recovery time

Basically, we consistently have this (sometimes) difficult talk with clients who imagine perfect protection against ransomware or some perfect system they buy from vendors that will "never ever be circumvented or hacked". Only once we get it through that Ransomware needs to be treated differently, do we continue to discuss SLA.

The strategy I've noticed seems to go well is breaking ransomware down into two categories:

1. Some data lost, but otherwise functional. This can have normal SLAs.
2. All is lost

In the latter, I straight up recommend that they don't offer SLAs but instead simply share the Disaster Recovery plan and the estimates, and then share update points (at a reasonable spacing). Just be clear about how the recovery will go and when stakeholders can expect updates. Use the same tiering list to define the order of importance for restoring data, and decide in advance whether there should be a means for someone to "escalate" their application to a higher priority; just set it all in advance. In a real Ransomware scenario, you and your team are going to be tired and nervous and probably really emotionally wrecked too, so write your guidelines in advance and share them with the responsible teams. Keep it as simple as possible; try to avoid overly complex systems and scenarios, and instead make it something that you're relatively sure you can follow even when you're feeling at your absolute worst. (We joke "read your DR plan with a major hangover and see how well you process it." Now imagine also feeling emotionally awful, and try to estimate how well you would process it.)

Basically, give yourself the benefit of honesty and don't assume you and your team will be able to run business as usual for Ransomware.

For normal SLA development, test, test, test, and then design your SLAs based on that. Don't be upset if some people are shocked by the times; push back if they demand faster restores and ask calmly "how did you determine this was an appropriate time?", as often there isn't much basis for it. Demonstrate with real numbers and tests what reality looks like, and use that as the basis for your tier levels. All SLAs should be couched with an exception that in the event of unexpected difficulties (software issues, unexpected network issues, etc.), the SLA may need to be adjusted.
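
As a sketch of what basing tier targets on test results could look like: take the slowest measured restore, add a safety margin, and round up to a figure that is easy to communicate. The 25% margin, the 15-minute rounding, and the sample timings are all assumptions for illustration, not a recommendation from this thread.

```python
import math

def proposed_rto_minutes(measured_minutes, margin=0.25):
    """Slowest measured restore plus a safety margin, rounded up to 15 minutes."""
    worst = max(measured_minutes)
    padded = worst * (1 + margin)
    return math.ceil(padded / 15) * 15

test_runs = [95, 110, 102, 130]  # hypothetical App1 restore tests, in minutes
print(proposed_rto_minutes(test_runs))
```

Using the worst observed run rather than the average keeps the proposed figure tied to something the team has actually seen happen.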

Avoid mentioning guarantees except for communication marks, because as we know in IT, every minute can bring a new surprise. Focus instead on the process and the communication, not the result. The result is everyone's goal, so it's implied that everyone wants the same thing: a fast and successful restore. There's no value in over-promising on successes when it comes to recovering data; meeting that promise rarely pays out well, while violating it only causes huge issues in communication later on. So just game this and focus on the SLA being exactly what it should be: the process and the communication.

Hope it helps, but I don't envy the conversations you will likely have to have soon. (Since I have these a few times a day.)
