ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Feature Request - duplication protection

Post by ejenner »

Hi there,

Not sure if this feature has been suggested before, but I was wondering if it would be possible to solve the following problem:


Problem:

In unusual cases, the data held in the repository for a job can appear old or unrelated to the job, and Veeam will begin a fresh full backup. In the case of multi-TB jobs this can saturate bandwidth, processors, etc. It can also fill up the repository and stop all the other jobs from completing. It's also potentially a moment for Veeam to lose track of the restore points, making it more difficult to restore from a point before the problem occurred.

This has happened to me 2 or 3 times over my time with Veeam.

Most recently this happened with my Office 365 backup. In the past it occurred with a clustered file server. These are unusual cases, but severely disruptive when they do occur.


Solution

I'm unsure of the specific detail of how this could be implemented, but I imagine a system a bit like a single-line railway track, where the train has to collect a token from a booth before it can travel along the track: if for any reason Veeam is unable to detect the token, it does not proceed with the backup. The problem can then be investigated, and a duplication of all the existing data may be avoided. Of course, you could make this switchable for scenarios where it isn't required. A rough sketch of the idea as a pre-job guard follows below.
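
Something like this can be approximated outside the product today. Below is a minimal pre-job sketch, assuming the Veeam Backup PowerShell module; the job name is a hypothetical placeholder, and the 'token' is simply the presence of restore points for the job in the repository:

```powershell
# Pre-job guard sketch: only start the job if its existing backup chain
# is still visible; otherwise Veeam would fall back to a first-time full.
Import-Module Veeam.Backup.PowerShell

$jobName = "File Server Backup"       # hypothetical job name
$job     = Get-VBRJob -Name $jobName
$backup  = Get-VBRBackup -Name $jobName
$points  = if ($backup) { Get-VBRRestorePoint -Backup $backup }

if ($points) {
    # Token found: an existing chain is present, proceed as normal.
    Start-VBRJob -Job $job
}
else {
    # Token missing: the chain is gone or unrecognised, so a run would
    # re-copy everything from scratch. Hold the job and alert an operator.
    Disable-VBRJob -Job $job
    Write-Warning "No restore points found for '$jobName'; run withheld."
}
```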
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

Hello,

Thanks for the idea! I'm not sure this is a true "duplication": Veeam creates a new full backup, not just a copy of the existing one. The main purpose is to capture the current state of the data, and you won't have an up-to-date backup if we restrict job runs for whatever reason. Also, I'd dig deeper into the cause of these "unusual" cases; perhaps it would be better to prevent them rather than build logic aimed at handling this sort of problem?

Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Feature Request - duplication protection

Post by ejenner »

It's a 'duplication' if all the data is already backed up once.

I don't think fixing the root cause is going to be possible because it could be something different every time.

The file cluster example I gave earlier occurred because the Microsoft DFS server reset the data, making it all look new. The next time Veeam tried to back it up, it was going to do a first-time backup of our entire file server from scratch.

There could be other reasons why this would happen: failures in changed block tracking, for example. I'll try not to get too technical. Essentially, there are many ways Veeam can be fooled into thinking it has to perform a first-time backup.

I suppose another way of looking at this would be as an extension of the incremental-forever logic: some way for Veeam to decide that an incremental backup is improbable before it attempts to transfer a new first-time backup into a full repository which already holds the same data.

I know it's possible to say Veeam just backs up what it sees. But the idea of a backup product is to fine-tune the process and make it easy for the user to get fast and reliable backups. Otherwise, at the root of everything, we're saying all it does is copy files... which is underselling it slightly.
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Feature Request - duplication protection

Post by soncscy » 1 person likes this post

I think you're fundamentally misunderstanding the situations you've described; these are limitations of CBT.

The hypervisor has no idea what's happening in the guest when it comes to CBT; the B in CBT is "block" and the C is "changed". No matter how you slice it, when DFS resets the data, even if it's just updating a few bits that reside on different blocks, at the hypervisor level you have a changed block: not effectively, not from a certain point of view, but really a changed block.

There's not really a way to circumvent this from the hypervisor, even if you have an in-guest agent working in tandem. I suppose there might be a way to hack it and somehow do meta-blocks, or otherwise alter the data returned by the CBT API, but then you're using CBT in a way that's not supported by the hypervisor vendors, and that is a bigger risk than the extra space on the repository.
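
To illustrate why, here is roughly what a backup application sees, in a sketch assuming VMware PowerCLI and a hypothetical vCenter and VM (and that a backup snapshot exists). QueryChangedDiskAreas is the vSphere API call behind CBT-based backups, and it returns only offsets and lengths; nothing in the result says why a block changed:

```powershell
# Conceptual sketch: what CBT hands back at the hypervisor level.
Import-Module VMware.PowerCLI
Connect-VIServer vcenter.example.local               # hypothetical vCenter

$vmView = Get-VM -Name "FileServer01" | Get-View     # hypothetical VM
$snap   = $vmView.Snapshot.CurrentSnapshot           # assumes a snapshot exists
$disk   = 2000                                       # device key of the first virtual disk

# "*" asks for all currently allocated areas; a real incremental passes
# the change ID saved from the previous backup run instead.
$result = $vmView.QueryChangedDiskAreas($snap, $disk, 0, "*")

# Each entry is just an offset and a length on the virtual disk. A DFS
# metadata reset and a genuinely new file both show up here identically.
$result.ChangedArea | ForEach-Object {
    "{0,15:N0}  {1,12:N0}" -f $_.Start, $_.Length
}
```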

>Otherwise at the root of everything we're saying all it does is copy files... which is underselling slightly.

I think this is not exactly a fair assessment: CBT is very complex (try writing an application to make it work; the API is great, but conceptually you do have to think a bit about what's actually going on at the datastore level, and that can be a bit daunting). It's not a file copy, it's a data firehose, and CBT determines whether you get a glass of water or a swimming pool's worth of data from the hose. If it were just a file copy, you'd get full VMDK files each time, and the restore story would be a lot different.

I get what your overall point is, but when you go through the provided APIs (as you should; hacking someone else's toys is a recipe for disaster and a good way to end up with a non-functional product), you have to play by the rules. If anything, the request should go to VMware to figure out a way to avoid the traditional "big" CBT backups: they're known and predictable, and the fix should happen at the hypervisor level, not in the client application.
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

I get the point about fine-tuning; it's always a good idea to make the logic a little bit "smarter". But I'd guess this can be scripted, for example with PowerShell: interrupt the job if it creates a full instead of an incremental, or validate the relationship between a job and its backup and either run or restrict the job depending on the result. Nevertheless, I wouldn't go down this path myself, as I'd rather have a full backup than no backup at all.
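
For what it's worth, a rough sketch of the first variant, assuming the Veeam Backup PowerShell module; the job name is a hypothetical placeholder, and the "Full" restore-point type value should be verified against your Veeam version:

```powershell
# Post-run check sketch: if the newest restore point is a full, assume it
# was unscheduled (a real script would whitelist planned active fulls),
# stop the job and disable it before the next cycle fills the repository.
Import-Module Veeam.Backup.PowerShell

$jobName = "File Server Backup"       # hypothetical job name
$job     = Get-VBRJob -Name $jobName
$backup  = Get-VBRBackup -Name $jobName
$latest  = Get-VBRRestorePoint -Backup $backup |
           Sort-Object -Property CreationTime -Descending |
           Select-Object -First 1

if ($latest -and $latest.Type -eq "Full") {
    Stop-VBRJob -Job $job
    Disable-VBRJob -Job $job
    Write-Warning "'$jobName' produced an unscheduled full restore point."
}
```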

Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Feature Request - duplication protection

Post by ejenner »

Petr, you say you'd rather have a full backup than no backup at all. What often happens in our environment is that there isn't room for another full backup on the repository, and there isn't the time or processing resource either. So if Veeam does (for whatever reason) start copying all the data as if it were a first-time backup, it fills up the disk and eats all the bandwidth. While it might be doing a great job of backing up regardless, it puts too much load on the infrastructure, and that often prevents many other backups from completing. i.e. you don't get a full backup instead of no backup; you get the whole system failing: half of a full backup, and loads of other backups missed entirely.

Soncscy, this isn't only to do with VM backups: in one of my examples above we're talking about a Windows file server using the Windows Agent. I take your point about CBT being complicated, and I agree you shouldn't design software to misuse it.

I also made the point that this feature could be turned on or off. If you have small jobs which can easily complete within a backup cycle even when they start from scratch, there would be no use for the feature, and you could turn it off for small or important jobs.

I suppose another way of describing the problem would be to say you're trying to detect a resource-hogging job. I used the word 'improbable' above: if, by certain metrics, a job appears to be performing abnormally, the software could have a feature to kill it. Abnormal behaviour could be the job taking more than 50% longer than it usually would; i.e. if it has taken 10 minutes to complete for the last 50 cycles but suddenly one day it takes 20 or 30 minutes, it could be assumed that something has gone wrong. To protect the other backups and the storage, the job is killed for the good of the system as a whole. A rough sketch of that check follows below.
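
A minimal sketch of the duration check, assuming the Veeam Backup PowerShell module; the job name, the 50-session baseline, the 1.5x threshold and the session State values are assumptions to verify against your environment:

```powershell
# Kill-switch sketch: stop a job whose running session has exceeded its
# own historical average duration by more than 50%.
Import-Module Veeam.Backup.PowerShell

$jobName = "File Server Backup"       # hypothetical job name

# Baseline: average duration of the last (up to) 50 completed sessions.
$done = Get-VBRBackupSession |
        Where-Object { $_.JobName -eq $jobName -and $_.State -eq "Stopped" } |
        Sort-Object -Property CreationTime -Descending |
        Select-Object -First 50
$avgMin = ($done |
           ForEach-Object { ($_.EndTime - $_.CreationTime).TotalMinutes } |
           Measure-Object -Average).Average

# Stop any running session that is 50% over the baseline.
Get-VBRBackupSession |
    Where-Object { $_.JobName -eq $jobName -and $_.State -eq "Working" } |
    ForEach-Object {
        $elapsed = ((Get-Date) - $_.CreationTime).TotalMinutes
        if ($avgMin -and $elapsed -gt 1.5 * $avgMin) {
            Stop-VBRJob -Job (Get-VBRJob -Name $jobName)
            Write-Warning ("'{0}' stopped after {1:N0} min (baseline {2:N0} min)." -f $jobName, $elapsed, $avgMin)
        }
    }
```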
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Feature Request - duplication protection

Post by soncscy »

No, I get what you're saying, but regardless of the job type I think it's the wrong approach. The examples above were just cases where these __are__ predictable events that can be known in advance.

I'm not keen on hidden kill switches; the OOM killer on Linux is a perfect example. It's a fairly logical construct that is clearly documented, but it's still kind of a crap shoot what the OOM killer will sacrifice to keep the system running. You can adjust it with preference values for different processes, but then you get into conversations like "wait, why didn't the OOM killer take out this process instead of killing our MongoDB instance?", and it turns out someone tweaked the values without telling anyone.

I can see the same sort of problem occurring here as well: an invisible kill process that ends up killing jobs without much explanation, when a predictable event may have triggered the kill logic. Your example logic (I get that it's an example) doesn't work for me, because a longer backup could also be a simple network issue or an intentional operation that takes longer (a manual active full). You can write more logic to accommodate intentional operations, but that just gets even more complex. And even if you can turn it off, clients and coworkers can easily turn it back on, and it becomes a real headache in my opinion.

Instead, the answer I prefer is proactive monitoring and reporting:

1. Check the machines in question for the events that take longer, and create alerting ahead of time at that level.
2. Do capacity planning to accommodate the eventuality that a job misfires (a minimal free-space check along these lines is sketched below).
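
For point 2, something as simple as a scheduled free-space check can give early warning; the repository path here is a hypothetical placeholder:

```powershell
# Capacity-check sketch: warn if the repository volume could not absorb
# an unplanned full (roughly another chain's worth of data).
$repoPath  = "R:\Backups"             # hypothetical repository path
$drive     = (Split-Path $repoPath -Qualifier).TrimEnd(':')

# Size of what's already on the repository vs. remaining free space.
$chainSize = (Get-ChildItem $repoPath -Recurse -File |
              Measure-Object -Property Length -Sum).Sum
$free      = (Get-PSDrive -Name $drive).Free

if ($free -lt $chainSize) {
    $msg = "Free space {0:N0} GB is below the existing chain size {1:N0} GB; an unplanned full would fill this repository." -f ($free/1GB), ($chainSize/1GB)
    Write-Warning $msg
}
```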

The business rules of such a feature seem really hard to maintain as a user (I imagine that code-wise it's very simple, but the business rules likely get complex as competing rules accumulate), and knowing that such a beast can be unleashed just makes me very antsy.
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

Hello,

I fully understand the challenge. However, I have two ideas which could perhaps partially address this issue: the first is to plan for enough space for an active full, and the second is to leverage SOBR (Scale-Out Backup Repository) functionality. I also believe that our I/O control and throttling options might help to work around the excessive load on the infrastructure during a full backup.

Speaking of reasons for unplanned fulls, so far I see just a short list of rare technical issues, and it would be difficult to justify the engineering resources required to deliver a feature which would be turned off in most cases.

Anyway, the idea is interesting, and we can explore different options to resolve this problem once we have enough similar requests.

Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Feature Request - duplication protection

Post by ejenner »

I mustn't be explaining this very well.

The response: you shouldn't allow these unexpected events to occur, so it should be fine.

If that's how we ran things, then there would be no requirement for backup either, because you'd know about all unexpected events in advance and would not allow them to happen.
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

Hello,

In fact, that's not what I meant. The main point is that the issue described above is not common, and we don't have enough requests for this functionality. The requirement to have backups and follow the 3-2-1 rule is still valid, since the likelihood of encountering any of the potential issues with production data is statistically much higher than that of facing the specific problem we're discussing in this topic. Also, there is no way to prevent every issue that can happen for whatever reason, while root cause analysis of an unplanned full might be sophisticated but is by no means impossible.

Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Feature Request - duplication protection

Post by ejenner »

I've had another instance of duplication. It's 120 TB of data, and I don't have enough room on the repository to take it. So I'll have to move the old backups off the repository to tape, then write new backups. This will probably take two or three weeks! In the meantime it won't be backing up.

On this occasion it was instigated by, firstly, doing an in-place upgrade of the OS, which I don't think caused it, and then having to reset the permissions on the data drives after the upgrade. The drives became inaccessible after the upgrade due to some quirk of the process and our configuration. The data itself hasn't been touched, but it will have to be backed up fresh as if it were brand-new data.
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

Hello,

Thanks for the info; that looks like one more possible case.

Thanks!
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: Feature Request - duplication protection

Post by ejenner »

This has happened again recently: a recurrence of the file server job backing up from scratch, as per one of the examples above. Our file server hasn't backed up properly since 21st August. I've had to migrate jobs off the repository to wherever I could find sufficient space to store the files for these very large jobs. There was probably 60 TB free on the repository before the duplication started, but after a few of the jobs began to run it filled up and all the backups stopped. So for the last couple of weeks I've been moving jobs off the repository to try and clear some space for the backups to complete.

This time I've logged this as a technical support case (no. 04985706).

Although unrelated to the feature request in this topic, the error I logged the ticket for was: "05/09/2021 23:14:12 :: Full backup is required for cluster disk 6014441d-d5db-43b3-adad-676ab86326eb: cluster membership was changed"

In this case there was no resynchronisation of our DFS, so it isn't the same as the previous duplication of our file server. To be clear, although the particular instance I'm battling at present involves a cluster, there have been other examples of direct-attached, non-clustered servers doing the same thing.

At the end of the day, I believe a feature to prevent abnormal backups from 'blocking' other backups would be useful. We have 160 jobs running once a day, and if one of those jobs decides it is going to do a complete backup from scratch, it blocks other jobs from using the backup infrastructure resources and fills up the repository, so even if the other 159 jobs can get processing resources, they'll still fail because the repository is full.

All a feature like this would let you do is sacrifice one problematic job in favour of the others, rather than the current behaviour, where a problematic job can bring down the whole system, sacrificing all jobs in an attempt to complete the one job which isn't working properly.
PetrM
Veeam Software
Posts: 3626
Liked: 608 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Feature Request - duplication protection

Post by PetrM »

Hello,

Basically, I agree that it might be useful to restrict unplanned fulls to avoid redundant consumption of infrastructure resources, but we can only consider it as a potential improvement for future releases if we receive enough similar requests. For now, I'd suggest continuing to work with our support team to understand what exactly triggered the full backup to run out of schedule last time.

Thanks!
