YouGotServered
Service Provider
Posts: 170
Liked: 51 times
Joined: Mar 11, 2016 7:41 pm
Full Name: Cory Wallace

Veeam and ESXi snapshot meltdown caused by VM reboot?

Post by YouGotServered »

TL;DR - Questions are at the bottom, but the short version: if a VM is rebooted during a backup (while the VM is in a snapshotted state), can that cause the snapshot to become orphaned? I would think not, but I'm seeing evidence to the contrary.

Long version: I know some of this isn't Veeam-specific, but rather ESXi behavior. This incident happened during a Veeam process, though, so I figured I might not be the only one to have seen it. I'm not pointing fingers at Veeam or ESXi - I know there was a lot of human error here. Here's the story.

We had a client make a snapshot on their production Exchange server a month ago, and they forgot about it. Backups have been running fine since then, three times a day, every day, so no alarms were raised (and they were not using Veeam ONE, or we would have known about the snapshot). Today, they were doing maintenance on the VM and rebooted it during a Veeam backup - so the VM had a second active snapshot during the reboot.

The VM booted up, but shortly afterward it stalled completely - it was unmanageable by ESXi, had no network connectivity, etc. We logged into vCenter, and vCenter said that a snapshot consolidation had been triggered and that the VM needed consolidation. Wondering if Veeam had triggered a snapshot, I went to open the Veeam console locally on our backup server (we are on 9.5 U4), and it sat there at the loading screen. And sat there. For about 30 minutes. Usually, the console opens within 30 seconds. I checked the backup repository and noticed that the files were being modified by Veeam at that point in time.

After 30 minutes, the console finally opened, and as we suspected, our backup job was running, but it was now on the very last step, merging backup files. Upon inspecting the job statistics we had these two lines:
2/20/2019 3:14:54 PM :: Removing VM snapshot (this was there for 33 minutes and 33 seconds)
2/20/2019 3:48:28 PM :: SERVER NAME has stuck VM snapshot, will attempt to consolidate periodically

Suddenly, as soon as our backup console opened, our Exchange server became responsive again - network connectivity returned, mail started flowing, etc. However, vCenter events now showed that the snapshot consolidation had failed, and the VM's console was still unavailable (it had that corrupted icon over it). We got in touch with VMware support, and they could tell via SSH that the merge of the gigantic month-old snapshot was actually still happening in the background, despite the GUI message saying it had failed.

To recap, here are the facts:
We had a VM with a large snapshot that had been backing up fine for a month.
That VM was in the middle of a backup (and therefore had a second snapshot) when the guest was rebooted.
The VM AND Veeam console became unresponsive at the same time.
The VM and Veeam console became responsive again at the same time.
The job stats show that a consolidation was attempted for about the same amount of time that our VM was stalled.

Here's what I'm assuming happened:
The VM was rebooted during the backup job, somehow causing the snapshot to become orphaned.
At the end of the job, when Veeam's "Snapshot Hunter" ran, it found the recently orphaned snapshot (Veeam had never triggered on the old snapshot over the course of 90 previous runs, which makes sense since it was not orphaned).
Veeam told ESXi to perform a "Delete all snapshots" on the VM.
Due to some unknown issue with the very large first snapshot, this caused the VM to hang.
After 30 or so minutes, the attempt timed out, unfreezing the guest and giving us the snapshot consolidation failure in the GUI.
Veeam's Snapshot Hunter then tried to consolidate again (maybe using the hard consolidation method? I know it tries three times, with a different method each time).
One of those later consolidation attempts was successful, which led to VMware support seeing the (still ongoing) consolidation via SSH. (For what a consolidation call actually looks like, see the API sketch after this list.)
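
For anyone curious, here is roughly what "ask vSphere to consolidate" looks like at the API level. This is just a rough pyVmomi sketch to illustrate the idea, not how Veeam does it - the vCenter host, credentials, and VM name are placeholders for whatever your environment uses.

Code: Select all

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Lab-only: skip certificate verification; use proper certs in production
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "EXCHANGE01")  # placeholder VM name
    view.Destroy()

    if vm.runtime.consolidationNeeded:
        # Same operation as the GUI "Consolidate" action; this can run for a
        # long time when the delta disks are large, as they were in our case
        WaitForTask(vm.ConsolidateVMDisks_Task())
        print("Consolidation finished for", vm.name)
    else:
        print("No consolidation needed for", vm.name)
finally:
    Disconnect(si)

As far as I understand, the "needs consolidation" warning we saw in the vCenter GUI maps to that same consolidationNeeded flag.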

There were failures on many parts here:
1. Human error - snapshots were forgotten about, and they rebooted the VM during a backup. Yes, I know Veeam ONE would have helped here :) (so would a small scheduled check, like the sketch after this list).
2. Veeam error - it appeared the backup console would not load during the snapshot merge operation. As soon as the merge was done, the console opened right up. This caused additional panic, because we thought we might need to restore from backups and we couldn't access them.
3. ESXi error - A snapshot merge operation caused a VM to totally stall and lose management for over half an hour.
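
On the Veeam ONE point in #1 above - even without it, a small scheduled script can catch forgotten snapshots before they sit there growing for a month. A rough pyVmomi sketch (same placeholder connection details as the previous snippet; the age threshold is arbitrary):

Code: Select all

from datetime import datetime, timezone, timedelta
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

MAX_AGE = timedelta(days=3)  # flag anything older than this

def old_snapshots(snapshot_tree, now):
    # Recursively walk the snapshot tree, yielding (name, age) for old snapshots
    for snap in snapshot_tree:
        age = now - snap.createTime
        if age > MAX_AGE:
            yield snap.name, age
        yield from old_snapshots(snap.childSnapshotList, now)

ctx = ssl._create_unverified_context()  # lab-only, as before
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    now = datetime.now(timezone.utc)
    for vm in view.view:
        if vm.snapshot:  # None when the VM has no snapshots at all
            for name, age in old_snapshots(vm.snapshot.rootSnapshotList, now):
                print(f"{vm.name}: snapshot '{name}' is {age.days} days old")
    view.Destroy()
finally:
    Disconnect(si)

Run something like that on a schedule and email the output, and a month-old snapshot on a production Exchange server gets noticed on day three instead.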

Here are my questions:
1. What are the implications of a guest OS rebooting during a backup?
A) I *feel* like it shouldn't matter (except that Application-Aware Processing will fail). Plus, why would a guest operation matter to the hypervisor? Is there some sort of tie-in that I'm not aware of?
B) This experience, and past experiences where rebooting a guest VM during a backup has coincidentally led to CBT corruption, are making me wonder what I'm not understanding about rebooting guest VMs during a backup.
2. It seems like the VM stalling was also tied to the Veeam console stalling (I was unable to get past the loading screen when trying to launch the console locally on our Veeam server). How can that happen?
3. Has anyone seen this behavior before?

I know this is a lot of information. I'd rather give too much than too little. Can anyone provide insight here? I'm planning on opening a case tomorrow, but wanted to see what the great minds in the forums had to say. I am sure that one or more of my "assumptions" must be wrong. Please correct me so that I can fully understand what is happening.

Thanks in advance,
Cory
HannesK
Product Manager
Posts: 14287
Liked: 2877 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria

Re: Veeam and ESXi snapshot meltdown caused by VM reboot?

Post by HannesK » 1 person likes this post

Hello Cory,
without knowing the setup it is not easy to answer some of the questions (support could help here). But at least for the most important one: backup does not care about OS reboots inside the VM, as the backup is done from a snapshot. If an OS reboot breaks CBT and you can reproduce it, then contacting VMware support might help. The only thing that can happen if you reboot during a backup is that post-backup tasks inside the VM might not run as expected.

About the Veeam console stalling: this depends on the environment. There are some timeouts that can make the UI hang for a while if vCenter (or directly connected ESXi hosts) is not available during a backup / restore. How long the console waits for an ESXi / vCenter response depends on the Veeam version.

Best regards,
Hannes
YouGotServered
Service Provider
Posts: 170
Liked: 51 times
Joined: Mar 11, 2016 7:41 pm
Full Name: Cory Wallace

Re: Veeam and ESXi snapshot meltdown caused by VM reboot?

Post by YouGotServered » 1 person likes this post

Alright, so I worked with Veeam support and got to the bottom of it. Case number 03426575. I didn't want to leave everyone without an explanation!

First, I have to say how fantastic Veeam support proves itself to be, time and time again. As per usual, they went above and beyond, going through the logs with us to determine where the (very clearly VMware-based) issues were.

Second, the boring part. The console stalling appeared to be due to "<22539> Warning  [SatelliteGateway] Veeam Backup Service is under a heavy load. To many opened client sessions exist.". I'm not sure why, because we really only had 2-3 consoles open max. I didn't want to focus on that part of the case, because that wasn't the important part. It seemed to be a fluke that was solved by a reboot.

Third, the fun part. Essentially, it was a series of ill-timed coincidences. My underlying understanding was correct (as you all know) - reboots don't affect snapshots, and the stalled VM was not connected to the stalled Veeam console. I really thought those things were connected in this case because the timing lined up so well, but at the end of the day, it was all a coincidence. Here is the sequence of events, as I understand it:

1. Backup started as normal – successful
2. Veeam took a backup snapshot per normal – successful
3. Veeam backup ran – successful
4. Veeam called VMware to remove the snapshot per normal – successful
5. VM was rebooted
6. The snapshot removal triggered a VMware consolidation of the Veeam snapshot into the old snapshot from a month ago
7. 30 minutes later (during which the VM was stalled), VMware returned that the consolidation had failed.
 
Although the timing made it look related, the VM reboot apparently had nothing to do with the snapshot meltdown. It seems the time had simply come for that month-old snapshot to die (who would have guessed). We wanted to get VMware support to take a deeper look as well, but unfortunately we were not able to.