Agentless, cloud-native backup for Microsoft Azure
Service Provider
Posts: 192
Liked: 38 times
Joined: Oct 28, 2019 7:10 pm
Full Name: Rob Miller

Failed State in Azure - No Alarm

Post by RobMiller86 »

We have now seen a couple of occurrences where a VM enters a failed state in Azure. The VM is still up and responding, so our other monitoring doesn't detect any issues. But in Azure, the VM shows as being in a failed state and you have to click the button to reboot and redeploy.

When this happens, Veeam can't see the VM. If you look at the sources in the backup policy, the VM in question has a blank ID/Value column. As soon as you redeploy in Azure, the column is populated and Veeam can back it up.

We haven't seen any alarms for this condition in our VSPC. Shouldn't Veeam trigger an alarm when a VM simply drops off like that? Is this a known issue, or do we perhaps have something misconfigured? It seems like Veeam considers the VM deleted, so it ignores the condition. Unfortunately, I don't have an example right now to open a ticket, but I will do so when this occurs again. I'm just wondering if anyone else has witnessed this condition.
Product Manager
Posts: 5869
Liked: 1230 times
Joined: Jul 15, 2013 11:09 am
Full Name: Niels Engelen

Re: Failed State in Azure - No Alarm

Post by nielsengelen »

Hey Rob,

Did you already contact support for this and do you have a case ID? Looks like we'll need to do some more troubleshooting to understand what is causing this.

Re: Failed State in Azure - No Alarm

Post by RobMiller86 »

Hi Niels,

Unfortunately, I do not. This was addressed by some other staff members over the past couple of months. I've been told it has happened 3 times across our client base. We are in the process of migrating a lot of customers over to Veeam. I will open a ticket the next time it's brought to my attention, and I have an example. I was just curious what the expected behavior of VBAZ is when an Azure VM enters a failed state like this, where VBAZ can no longer see the VM in Azure, but the VM is still up.

Re: Failed State in Azure - No Alarm

Post by RobMiller86 »

I just found another one of these "The virtual machine is in a failed state" VMs in Azure. Luckily for us, this deployment is still using our older method, where we manually added the VMs to the policy rather than adding them via tag. So the policy was failing, as it was manually configured with the VM and couldn't find it.

However, this is giving me more concern with our current method of adding VMs by tag. Can I get any confirmation from Veeam on what happens in the following scenario:

1. We are adding VMs to a policy with a specific tag. We then add the tag to the VMs we want to back up with the policy. We are not directly adding VMs to the policy, just the tag.
2. All VMs are discovered, and backups begin.
3. Some time later, a VM enters a failed state in Azure. The VM is still up and running, so other monitoring tools don’t detect the failure. However, if you pull up the VM overview in Azure, the banner at the top says "The last operation performed on this VM failed. The VM is still running. View error details". If you click that, Azure says "The virtual machine is in a failed state. The fabric operation failed. Reapplying the virtual machine may resolve the issue." Error code: InternalExecutionError. Provision state: Failed. If you click Reapply, the VM is reprovisioned in Azure, rebooted, and all is well again.
4. If you pull up the Veeam backup policy in this state, before applying the fix from step 3, the VM is still listed in the policy in the Name/Key column, but the ID/Value column (the Azure resource ID) is blank.

Will VBAZ still fail a policy in this state? If it previously found the VM to protect via tag, and now it can't find the VM via tag, can I have 100% confirmation that it will now fail the policy instead of thinking the VM was just removed and no longer needs to be backed up? Like I said before, I don't have a current example to open a ticket, but I can't recreate this issue to test, and it's very hard to find unless you are doing manual reviews. I'm considering reverting and not adding VMs via tags due to this issue that I have witnessed myself twice, and someone else witnessed it once. These were on V5, not V6. So I'm not sure if this has improved or not. Is it not safe to add a VM via tag due to this?

I do know the policy fails in this Azure failed state if the VM was directly added by "Resource types: virtual machine [name or id]". But I have seen it not fail the policy if the VM was added by "Resource types: tag [key] [value]", with the tag added to the VM. I know it's tough as I can't provide a current example. But it's a big concern, as we could go months or longer in this state without realizing it and having no backups. Confirmation of exactly how VBAZ will handle this condition would be much appreciated; otherwise, I guess we just shouldn't use the tag method at all, for peace of mind.
Veeam Software
Posts: 177
Liked: 54 times
Joined: Oct 04, 2021 4:08 pm
Full Name: Lyudmila Ezerskaya

Re: Failed State in Azure - No Alarm

Post by lyudmila.ezerskaya »

Hi Rob! We will investigate this issue and keep you updated on the results. Thank you!

Re: Failed State in Azure - No Alarm

Post by lyudmila.ezerskaya »

Hi! By design, Veeam Backup for Microsoft Azure skips VMs with the Failed provisioning state during synchronization.

If a VM was manually added to the backup scope, during the backup session, the policy will attempt to process it. However, since it cannot be reached, the policy will fail.

However, when VMs are added to the backup scope using tags, resource groups, or subscriptions, the backup scope adjusts dynamically with each synchronization with Azure. If a VM is in the Failed state during synchronization, it will be skipped and excluded from the backup scope. Since this VM is no longer included in the backup scope, the policy will not attempt to protect it and therefore will not fail.
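To make the difference concrete, here is a minimal Python toy model of the behavior described above. It is not VBAZ code: the `AzureVm` class, the function names and the `backup: daily` tag are hypothetical, and the model only assumes what this post states, namely that Failed VMs are skipped at sync time, that tag-based scopes are rebuilt on every sync, and that a policy fails when it cannot see a VM that is in its scope.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AzureVm:
    """Hypothetical, minimal stand-in for an Azure VM as seen at sync time."""
    resource_id: str
    provisioning_state: str  # e.g. "Succeeded" or "Failed"
    tags: dict[str, str] = field(default_factory=dict)


def visible_ids(inventory: list[AzureVm]) -> set[str]:
    # Per the explanation above, VMs in the Failed provisioning state
    # are skipped during synchronization, so the policy cannot "see" them.
    return {vm.resource_id for vm in inventory if vm.provisioning_state != "Failed"}


def resolve_tag_scope(inventory: list[AzureVm], key: str, value: str) -> set[str]:
    # Dynamic scope: rebuilt on every sync from the visible VMs only.
    visible = visible_ids(inventory)
    return {
        vm.resource_id
        for vm in inventory
        if vm.tags.get(key) == value and vm.resource_id in visible
    }


def run_policy(scope: set[str], inventory: list[AzureVm]) -> str:
    # The policy processes every VM in its scope; a VM it cannot see fails it.
    missing = scope - visible_ids(inventory)
    return "Failed" if missing else "Success"


inventory = [
    AzureVm("vm-app01", "Succeeded", {"backup": "daily"}),
    AzureVm("vm-db01", "Failed", {"backup": "daily"}),  # still running, but Failed
]

tag_scope = resolve_tag_scope(inventory, "backup", "daily")
print(tag_scope, run_policy(tag_scope, inventory))  # {'vm-app01'} Success

manual_scope = {"vm-app01", "vm-db01"}
print(manual_scope - tag_scope, run_policy(manual_scope, inventory))  # {'vm-db01'} Failed
```

The first result is the case reported earlier in the thread: the Failed VM simply drops out of the tag-based scope, so the policy succeeds and nothing in its result signals that the VM stopped being protected.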

Re: Failed State in Azure - No Alarm

Post by RobMiller86 »

Yup. Just found a couple more. Failed state in Azure. No alarm at all from VSPC, and the policy shows fine in VBAZ. This is a huge problem. No one should be using tags to back up VMs if they actually care about the backups. Personally, I think VBAZ should handle this more intelligently. As it stands, the only way to be sure you are backing up VMs is to manually add them. Nothing else can or should be trusted. We will now convert all of our VBAZ policies to manually added VMs. A bit irritating, honestly. This is going to burn someone at some point.
Vitaliy S.
VP, Product Management
Posts: 27483
Liked: 2831 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov

Re: Failed State in Azure - No Alarm

Post by Vitaliy S. »

Hi Rob,

In your case, I would recommend using the "VM without backup (RPO-based)" alarm in VSPC to be notified when a VM was skipped from processing for whatever reason.

As for VB itself, we will discuss how our handling of "failed VMs" can be improved. For example, we could add an optional warning that the VM is no longer backed up, so you can confirm this is intentional.


Re: Failed State in Azure - No Alarm

Post by RobMiller86 »

Thanks Vitaliy. We have the RPO alarms enabled; however, they don't alert for this condition, or at least not for us. I know in the past there were discussions about changing RPO alarms so they don't alarm for VMs that are no longer part of a job. Wouldn't this condition cause that? Veeam thinks they are removed from the job, so doesn't give you an RPO alarm.

Regardless, we are now directly assigning VMs. But I am pretty sure if I remove a VM from the policy, then VSPC won't hit it with an RPO alarm, as it was removed from the policy.
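One way to avoid depending on the job's own idea of its scope would be an independent RPO check whose expected set comes from the Azure tags rather than from the policy. The sketch below is a hypothetical illustration, not a VSPC or VBAZ feature: `AzureVm`, `rpo_violations` and the `last_backup` mapping are assumptions, and in practice you would have to populate them yourself from an Azure inventory export and from whatever restore point report you already collect.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class AzureVm:
    """Hypothetical record for a VM read from the Azure inventory."""
    resource_id: str
    provisioning_state: str  # e.g. "Succeeded" or "Failed"
    tags: dict[str, str] = field(default_factory=dict)


def rpo_violations(
    inventory: list[AzureVm],
    last_backup: dict[str, datetime],
    key: str,
    value: str,
    rpo: timedelta,
    now: datetime,
) -> set[str]:
    """Return IDs of tagged VMs with no restore point inside the RPO window.

    The expected set is built from the tag alone, ignoring provisioning
    state, so a VM that dropped out of the policy scope still counts.
    """
    expected = {vm.resource_id for vm in inventory if vm.tags.get(key) == value}
    return {
        vm_id
        for vm_id in expected
        # No restore point at all, or the newest one is older than the RPO.
        if vm_id not in last_backup or now - last_backup[vm_id] > rpo
    }


now = datetime.now(timezone.utc)
inventory = [
    AzureVm("vm-app01", "Succeeded", {"backup": "daily"}),
    AzureVm("vm-db01", "Failed", {"backup": "daily"}),
    AzureVm("vm-web01", "Succeeded", {"backup": "daily"}),
]
last_backup = {
    "vm-app01": now - timedelta(hours=6),
    "vm-db01": now - timedelta(days=166),
}
print(sorted(rpo_violations(inventory, last_backup, "backup", "daily", timedelta(days=1), now)))
# ['vm-db01', 'vm-web01']
```

Because `vm-db01` is Failed, it would no longer be in the tag-based policy scope, yet it still shows up here; `vm-web01` is flagged because it has no restore point at all.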

Re: Failed State in Azure - No Alarm

Post by Vitaliy S. »

Hey Rob,
RobMiller86 wrote: I know in the past there were discussions about changing RPO alarms so they don't alarm for VMs that are no longer part of a job. Wouldn't this condition cause that? Veeam thinks they are removed from the job, so doesn't give you an RPO alarm.
We wanted to make it as a checkbox in the alarm settings. Don't have a lab handy, but it was something about "ignore imported backups and VMs that are no longer part of the job". Do you have this setting in the alarm configuration enabled?


Re: Failed State in Azure - No Alarm

Post by RobMiller86 »

There is nothing like that on the Cloud VM RPO notification settings. There is an "ignore imported backups" option on the VM RPO notification for VBR, which we do have checked. The goal of checking that was to have it no longer alarm about VMs we remove from backups, as they will be decom'd.

But even that alarm acts in strange ways. For instance, it will still alarm about a decom'd server, sometimes. Then it will do things like open the alarm, then 10 minutes later auto resolve the alarm, and do this 2-3 times a day opening and closing tickets in CW. Server hasn't had a backup in 166 days yet it's opened and closed multiple times per day. And that's even with "ignore imported backups" checked.

We really struggle with monitoring backups with VSPC. It's good for general job status, but everything regarding how to handle situations like this is tough. Different RPO alarms having different settings. No clue why it suddenly alarms on something, or why it auto resolves and then opens again. No real clue how to handle removing VMs from jobs, and what to expect.

Another example. I removed SC VMM from Veeam and added the clusters directly. I then changed a job to remove the old VMs that were added via SC VMM and instead add them directly from the clusters. It took new fulls, ok. No alarms for 30 days. Then suddenly 30 days later, I'm getting RPO alarms for these VMs saying they haven't been backed up in 32 days. And I did this to many jobs on this VBR, but only 1 job alarms like this. It's a real problem that I think needs better explanations, better mgmt capabilities (no I don't want alarms for this VM, it's been decom'd) or (yes these VMs are in prod and I would like alarms if they aren't backed up). And more consistency as it just doesn't act the same across all servers or jobs. It seems very random.

Trying to use exclusion masks for only certain VMs is clunky, and with common names I could accidentally exclude something I didn't want to.

Re: Failed State in Azure - No Alarm

Post by Vitaliy S. »

RobMiller86 wrote:There is nothing like that on the Cloud VM RPO notification settings. There is an "ignore imported backups" option on the VM RPO notification for VBR, which we do have checked. The goal of checking that was to have it no longer alarm about VMs we remove from backups, as they will be decom'd.
Ah, I must have mixed it up with a similar alarm for the virtual infrastructure.

RobMiller86 wrote:But even that alarm acts in strange ways. For instance, it will still alarm about a decom'd server, sometimes. Then it will do things like open the alarm, then 10 minutes later auto resolve the alarm, and do this 2-3 times a day opening and closing tickets in CW. Server hasn't had a backup in 166 days yet it's opened and closed multiple times per day. And that's even with "ignore imported backups" checked.
Such behavior should be reported to the support team, as sometimes the fix needs to be made in the VBR APIs we use in the console.

RobMiller86 wrote:We really struggle with monitoring backups with VSPC. It's good for general job status, but everything regarding how to handle situations like this is tough. Different RPO alarms having different settings. No clue why it suddenly alarms on something, or why it auto resolves and then opens again. No real clue how to handle removing VMs from jobs, and what to expect.
I've asked our QA team to pay more attention to RPO-based alarms, and so far they cannot see the same behaviour in various internal labs, but they will keep trying to reproduce these situations.

RobMiller86 wrote:Another example. I removed SC VMM from Veeam and added the clusters directly. I then changed a job to remove the old VMs that were added via SC VMM and instead add them directly from the clusters. It took new fulls, ok. No alarms for 30 days. Then suddenly 30 days later, I'm getting RPO alarms for these VMs saying they haven't been backed up in 32 days. And I did this to many jobs on this VBR, but only 1 job alarms like this. It's a real problem that I think needs better explanations, better mgmt capabilities (no I don't want alarms for this VM, it's been decom'd) or (yes these VMs are in prod and I would like alarms if they aren't backed up). And more consistency as it just doesn't act the same across all servers or jobs. It seems very random.
If that's only for one job, then there must be something infrastructure-related (unique thing). Our support team should be able to investigate that.