IBM storage integration: Veeam deleted datastores instead of snapshot volumes

Post by **pirx** » Mar 07, 2023 9:54 am

Case #05900716

We had an incident a week ago where Veeam deleted 5 IBM SVC production datastores with 126 VMs running, instead of their snapshot volumes. I don't know exactly how SVC storage snapshots work, but I know that they are different from NetApp's. In any case, I would never expect this to happen. I remember that there was a similar case right after the V11 release (we are on the latest V11 CU).

It looks like there was a timeout for one of our SSH commands sent to the storage device, for a reason that is still under investigation. This led to a retry of the command, which got the response for a previous request. From that moment on, each subsequent SSH request we made got the reply for the previous one, effectively creating a queue of responses. This led to incorrect IDs being used in requests they do not belong to.
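To make the failure mode described above concrete, here is a minimal toy model in Python. It is only a sketch of the general "shared stream goes out of step after a timeout" pattern, not Veeam's code and not real SVC behaviour; the class, the volume names and the IDs are all made up for illustration.

```python
from collections import deque


class DesyncedChannel:
    """Toy model of one shared request/response stream (hypothetical)."""

    def __init__(self, volumes, slow_calls=1):
        self.volumes = volumes        # name -> ID as the storage knows it
        self.pending = deque()        # replies written by the storage, not yet read
        self.slow_calls = slow_calls  # how many initial calls time out

    def ask_id(self, name):
        # The storage always answers the request it actually received ...
        self.pending.append(self.volumes[name])
        if self.slow_calls > 0:
            # ... but the client gives up before reading that answer.
            self.slow_calls -= 1
            raise TimeoutError(name)
        # The client reads the oldest unread reply, whatever request it belongs to.
        return self.pending.popleft()


def lookup_with_retry(channel, name):
    try:
        return channel.ask_id(name)
    except TimeoutError:
        return channel.ask_id(name)   # retry on the same stream


volumes = {"snap_a": 101, "datastore_1": 7, "snap_b": 102}
channel = DesyncedChannel(volumes)
resolved = {
    name: lookup_with_retry(channel, name)
    for name in ["snap_a", "datastore_1", "snap_b"]
}
```

In this model the single timeout leaves one stale reply in the stream, so every later lookup is answered with the ID of the request before it: `resolved["snap_b"]` ends up holding the ID of `datastore_1`. A delete-by-ID issued for the snapshot would then hit the datastore.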

This description frightens me a lot! Maybe I don't understand this correctly, but it sounds like commands are executed on the SVC side with no real check of the result or status (no real error handling). The solution described for V12 also doesn't sound like more error handling was added, only that the way the SSH connection is used was changed. There was maintenance on the storage side that weekend, which might have been the reason for the timeouts, but I'd still expect Veeam to be able to take care of the production datastores. This all leaves me wondering how risky this storage integration really is.

Post by **Gostev** » Mar 07, 2023 10:46 am

Statistically speaking, it's extremely safe: I can't remember many similar occurrences in the almost 10 years the feature has existed, with over 1 million active Veeam installs currently. Meaning we're talking about a corner case that takes a massive amount of bad luck to run into, perhaps due to some environment-specific issue.

Post by **pirx** » Mar 07, 2023 2:14 pm

Our issue sounds exactly like the fixed issue #3 of the V11 top issues. Anyhow, Veeam B&R should (must!) be aware of which volumes are "production" and which are snapshot copies. I'm pretty sure this behavior can be prevented. Maybe someone from product management can take some time to discuss with R&D whether the precautions are really sufficient. This is not a small issue; this was the biggest data loss and recovery event we have ever had. And the description from support of how this could happen does not sound very promising.

veeam-backup-replication-f2/top-issues- ... 72418.html

#3: Possible data loss with IBM storage snapshot integration
Symptoms: Wrong LUN may be deleted from IBM SAN during storage snapshot retention processing.
Cause: A certain sequence of storage snapshot management operations may result in a LUN with duplicate ID appearing on storage, that will be later deleted by the retention policy.
Status: Fixed in P20210319 or later.

Post by **Gostev** » Mar 07, 2023 2:25 pm

The ultimate solution would be IBM implementing a Universal Storage API plug-in for Veeam, like most other storage vendors did. Until then, our R&D will be at the mercy of SSH behavior peculiarities resulting in unforeseen corner cases which are impossible to run into in test labs. If you are a large IBM storage customer, perhaps you could apply pressure from your end as well. Meanwhile, our R&D will of course keep addressing new corner cases as they appear and learn from these to make the code more resilient to unexpected SSH problems.

Post by **pirx** » Mar 07, 2023 2:34 pm

We will talk to IBM about this too. I understand that it's not a very elegant way to do this, given what IBM provides. Agreed. My main point is still that whatever B&R does, there must be checks that prevent such issues. I really don't see this as a corner case. It's more about being protected when corner cases occur.
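As an illustration of the kind of check meant here, a minimal Python sketch of a pre-delete guard follows. Everything in it is hypothetical: the `VolumeInfo` fields, the inventory of IDs the backup tool created itself, and the idea that the storage can report whether a volume is a snapshot target are assumptions for the example, not something B&R or SVC is confirmed to expose in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeInfo:
    """What the storage reports about a volume (hypothetical fields)."""
    volume_id: int
    name: str
    is_snapshot_target: bool


class UnsafeDeleteError(Exception):
    """Raised when a volume does not look like one of our own snapshots."""


def check_safe_to_delete(requested_name, info, created_by_us):
    """Refuse deletion unless every independent check agrees."""
    # 1. The reply must describe the volume we actually asked about.
    if info.name != requested_name:
        raise UnsafeDeleteError(
            f"asked about {requested_name!r}, storage described {info.name!r}"
        )
    # 2. The ID must be one this tool recorded when it created the snapshot.
    if info.volume_id not in created_by_us:
        raise UnsafeDeleteError(f"volume {info.volume_id} was not created by us")
    # 3. The storage itself must classify the volume as a snapshot target.
    if not info.is_snapshot_target:
        raise UnsafeDeleteError(f"volume {info.volume_id} is not a snapshot")
```

The point of the sketch is that a delete goes ahead only when the name echoed by the storage, the tool's own inventory and the storage's classification all agree, so a single wrong ID coming back from a lookup is not enough to remove a production volume.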

Post by **Gostev** » Mar 07, 2023 3:19 pm

I've confirmed with the devs that the SSH command logic was changed completely in V12 to address this and similar scenarios of SSH misbehavior.

Mar 07, 2023 5:24 pm

pirx wrote: Mar 07, 2023 2:14 pm Our issue sounds exactly like fixed #3 of the top issues of V11. Anyhow, Veeam B&R should (must!) be aware which volumes are "production" and which are snapshot copies. I'm pretty sure this behavior can be prevented.

It is indeed a very specific corner case that is further complicated by the fact that it is caused by interaction with a third-party library (SSH). Agreed, better error handling on the Veeam B&R side could help to prevent issues like that, and we will discuss this. Currently, Veeam B&R relies fully on the storage to provide the volume IDs. The call to the storage to get the IDs is done via SSH. In v11, we used an SSH call to execute commands on the IBM storage that was not recreated upon retry after a timeout, but instead reused the same request/response thread for subsequent retried calls. When a number of requests were queued due to high load, this resulted in a wrong reply being delivered to one of those requests.

In v12 we have switched to a different SSH command that starts a new isolated thread for every new request, so delivering an answer intended for a previous request is simply not possible. v12 is therefore not affected.
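The difference between the two approaches can be sketched in Python roughly as follows. This is only a toy model of "one fresh channel per request", with an added correlation token as an extra safeguard; it is not the actual v12 implementation, and `FakeStorage`, `FakeChannel` and `lookup_isolated` are hypothetical names invented for the example.

```python
import uuid
from collections import deque


class FakeChannel:
    """One channel carries exactly one request and its reply (hypothetical)."""

    def __init__(self, storage):
        self.storage = storage
        self.replies = deque()

    def send(self, token, name):
        # The storage echoes the caller's token next to the volume ID.
        self.replies.append((token, self.storage.volumes[name]))

    def receive(self):
        return self.replies.popleft()


class FakeStorage:
    """Stand-in for the array, for illustration only."""

    def __init__(self, volumes):
        self.volumes = volumes

    def open_channel(self):
        return FakeChannel(self)


def lookup_isolated(storage, name, fail_first=False):
    for attempt in range(2):
        channel = storage.open_channel()  # fresh channel for every attempt
        token = uuid.uuid4().hex          # unique per request
        channel.send(token, name)
        if fail_first and attempt == 0:
            # Simulated timeout: the channel and its unread reply are dropped.
            continue
        reply_token, volume_id = channel.receive()
        if reply_token != token:
            raise RuntimeError("reply does not match request")
        return volume_id
    raise TimeoutError(name)
```

Because an abandoned channel is never read again, a late reply has nowhere to leak into, and the token comparison would turn any remaining mismatch into an error instead of a wrong ID.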

Post by **gmajestix** » Mar 07, 2023 7:54 pm

One option to prevent such a situation is to use Volume Protection on SVC. See more at https://www.ibm.com/docs/en/sanvolumeco ... protection.

Post by **pirx** » Mar 08, 2023 9:44 am

Not sure if this is really working with storage snapshots.

vmware-vsphere-f24/veeam-fails-to-delet ... 68203.html

Post by **foggy** » Mar 08, 2023 12:54 pm

Well, in general, it works to the extent that Veeam B&R fails to delete protected snapshots (during retention or auxiliary snapshots during backup) and proxy server mappings so you should keep an eye on the number of snapshots with storage limits in mind.

Post by **aceit** » Mar 14, 2023 12:07 pm

pirx wrote: Mar 07, 2023 9:54 am Case #05900716
We had an incident a week ago where Veeam deleted 5 IBM SVC production datastores with 126 VMs running instead of their snapshot volumes. I don't know exactly how SVC storage snaps work but I know that it is different than NetApp. In any case I would never expect this to happen. I remember that there was a similar case right after the V11 release (we are on latest V11 CU).

Sorry to hear that... but you could consider enabling IBM's volume protection on the important online production pools on the Storwize. It is usually enabled by default. With volume protection, a volume can be deleted only if it has served absolutely no I/O in the last X minutes, even if the delete operation was -force'd by command or API.
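The rule can be modelled in a few lines of Python. This is just a sketch of the behaviour as described above, not IBM's implementation; the function name and the 15-minute window are placeholders chosen for the example.

```python
from datetime import datetime, timedelta


def delete_allowed(last_io, now, protection_window=timedelta(minutes=15)):
    """Return True only if the volume has been idle for the whole window.

    last_io is the time of the most recent I/O served by the volume,
    or None if it never served any. The 15-minute default is an
    arbitrary placeholder for the "X minutes" setting.
    """
    if last_io is None:
        return True
    return now - last_io >= protection_window


now = datetime(2023, 3, 1, 12, 0)
busy_datastore = delete_allowed(now - timedelta(seconds=30), now)
idle_snapshot = delete_allowed(now - timedelta(hours=6), now)
```

A production datastore with running VMs serves I/O all the time, so under this rule the delete would be refused regardless of which ID the caller sent.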

That said, I agree that using plain SSH and the command line to automate operations on storage is kind of "scary"... even more so if we "shoot in the dark" an SSH command expecting a side effect, but an error may have occurred... and so we end up with the Veeam view of the world and the storage view of the world diverging during a session... not a great way to automate workflows... and even if the error does occur and is parsed by the software, the command line output might not be consistent between IBM Storwize sessions... the command line is not a protocol.

I recall that I stumbled upon this "shoot commands in the dark" approach in the past, when I wanted to test the integration... and I gave up pending further refinement, because it didn't even recognize the basic Storwize underlying storage taxonomy and possibilities... I even opened a thread: vmware-vsphere-f24/ibm-storwize-support ... 73546.html I don't know if the situation has changed in v12 in that regard; maybe I will give it another shot (instead of using only the VMware snapshot way).

Honestly, I personally don't think at this point that there are many representative Veeam customers requesting a real full integration with IBM stuff (maybe those who do just use IBM software for particular things...).

Anyway, just in case: Storwize also comes with a classic REST / JSON interface for sending commands, with proper error codes in return, if one wants to avoid SSH.
