Thought I had better highlight some strange behaviour we saw yesterday in our production systems. One of our main production hosts threw its toys out of the pram yesterday morning and failed. The VMs it hosted were brought down and VMware HA did its job. We had VMware on a call for the better part of the day trying to track down the cause, and it turned out that a second boot volume was present and the hypervisor was having a hard time working out which one it should use.
We traced this down to a Nimble snap of that host’s boot volume being online, which, as I am sure you will agree, was not something we wanted or expected to see. Once it was set to offline the host stopped complaining and started looking a lot better.
So I engaged Nimble’s support to try and track this down, and during the session with them, while scanning for online snaps (of which there were a fair few), we got to see what I think was the cause of the snaps being online. We use the Nimble integration with Veeam and it works very well. We can and do use the storage snaps to recover data all the time. Veeam scans the storage arrays and looks for VMs present in the snap volumes by putting them online, having a look inside and setting them to offline again. We watched this happen in front of us while we were tracking down these online snaps last night. As I was going through the GUI and setting the snaps to offline, Veeam was turning them online one at a time, doing its looksee and moving on to the next. We had to track down 21 online snaps that weren’t needed, one of them being the snap of our host’s boot volume.
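The hunt for online snaps described above was done by hand in the GUI, but the same check could be sketched in a few lines of Python. This is only a rough illustration, not anything Nimble or Veeam provide: it assumes the snapshot inventory has already been exported from the array into a list of records, and the field names (`name`, `vol_name`, `online`) and the "boot" naming marker are assumptions chosen for the example.

```python
from typing import Dict, List


def is_boot_volume(snap: Dict, boot_marker: str = "boot") -> bool:
    """Guess whether a snapshot belongs to a boot volume from its volume name.

    The naming convention (a 'boot' marker in the volume name) is an
    assumption; adjust it to however your boot volumes are named.
    """
    return boot_marker in snap["vol_name"].lower()


def find_online_snapshots(snapshots: List[Dict], boot_marker: str = "boot") -> List[Dict]:
    """Return only the snapshots flagged online, boot-volume snaps first."""
    online = [s for s in snapshots if s.get("online")]
    # False sorts before True, so boot-volume snaps end up at the top.
    return sorted(
        online,
        key=lambda s: (not is_boot_volume(s, boot_marker), s["vol_name"], s["name"]),
    )


if __name__ == "__main__":
    # Hypothetical inventory, made up purely for illustration.
    inventory = [
        {"name": "snap-0101", "vol_name": "vmfs-datastore01", "online": True},
        {"name": "snap-0102", "vol_name": "esx01-boot", "online": True},
        {"name": "snap-0103", "vol_name": "vmfs-datastore02", "online": False},
    ]
    for snap in find_online_snapshots(inventory):
        flag = "BOOT VOLUME" if is_boot_volume(snap) else "data"
        print(f"{snap['vol_name']}/{snap['name']} is online ({flag})")
```

Running something like this on a schedule would at least flag an online boot-volume snap before a host trips over it, rather than after.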
When I called Nimble and explained what I had seen, they mentioned that this had happened a few weeks before with another customer and that Veeam had been involved there as well. That is what led me to start looking a bit more closely at how Veeam interacts with the storage.
So, what I have done so far to stop this from recurring: I have given Veeam its own service account on the Nimble arrays. This helps with the logging, so Nimble can see who or what is interacting with what.
I have masked off the boot volumes from being scanned by the storage integration to make sure they are not picked up and touched by it again.
But I have come back to it this morning and I have another load of snaps that are online after being scanned. I have logged a support case with Veeam and this is the case ID: Case # 0239 [ ref:_00D30RWR._5000e1GOwLW:ref ]
I have uploaded the required logs from the main B&R server as well as from the physical proxies that do the Direct SAN access. And this needs to be said as well: it doesn’t appear to have anything to do with Direct SAN access to the volumes. The scanning only seems to be done by the B&R server once you add in the storage infrastructure.