Host-based backup of VMware vSphere VMs.
gregwatson
Influencer
Posts: 20
Liked: never
Joined: Aug 21, 2014 9:42 pm
Full Name: Greg Watson
Contact:

006428 - Problem restoring Exchg server (corrupted backup?)

Post by gregwatson »

Hi.

We've been trying to restore a backup of an Exchange server for weeks now without success.

The most recent Veeam backup was taken using version 7 (patched) while the Exchange VM was shut down. When we try to restore the backup, it seems to restore fine, but within a few minutes of powering the VM on we start to see messages in the event logs about the information store being corrupt: things like event ID 476 (database page read failed verification) and 203 (database copy appears to have an I/O error).

The source ESXi version was 4.1 and we were initially trying to restore onto a 5.5 host. We found various articles about silent data corruption with certain versions of the tg drivers etc., so we've updated those drivers on the target host with no joy.
We then rebuilt one of the hosts as an ESXi 5.1 host, but the same thing happens. We've tried:

- restoring the most recent active full backup (taken when the VM was powered off)
- restoring different points from the backup chain (taken whilst the VM was running)
- restoring onto SAS and SATA datastores
- restoring onto three different hosts (all PowerEdges, but different types, with the latest Broadcom tg drivers installed)
- restoring onto 5.5 and 5.1 versions of ESXi
- updating the VM hardware version on the VMs before powering them on


At this point we really cannot explain what on earth is going on. It really should not be this hard to get a backup restored. What's especially odd is that it seems to be the Exchange database that has the issue - the operating system seems fine. I don't know if that's because the Information Store is a database and therefore has some inherent structure that is somehow not being reproduced faithfully...??

Has anyone got any ideas what on earth might be going on?

Given that the Veeam backup files have been transferred to the new datacentre by copying them onto a NAS and then transporting the NAS, is there some way that the backups themselves are corrupt? If they were corrupt, would they restore at all (which they are doing), or would the Veeam restore process somehow notice any corruption of the backup file (by validating a checksum or something before it even attempts a restore)? Is there some way to validate the integrity of a backup file?

The files were originally written by Veeam directly to a QNAP NAS repository, which was then copied to a second QNAP using the real time replication facility. The second QNAP was then transported to the new site.
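
One way to rule the NAS-to-NAS copy in or out is to hash every backup file on both units and compare. The following is only a minimal sketch under assumptions: both repositories are reachable as ordinary directories, and the function names are made up for illustration. It checks that the copy matches the original byte for byte; it says nothing about whether the original backup content is itself sound.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, copy_dir):
    """Compare each file under source_dir with its counterpart under copy_dir.

    Returns a list of (relative_path, reason) for files that are missing
    from the copy or whose contents differ.
    """
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif sha256_of(src) != sha256_of(dst):
            problems.append((rel.as_posix(), "checksum mismatch"))
    return problems
```

An empty list from `compare_trees(first_nas_path, second_nas_path)` (both paths are placeholders) would mean the transported copy is identical to what Veeam originally wrote.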
foggy
Veeam Software
Posts: 21070
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by foggy »

Greg, did you try to restore from the first QNAP NAS? Also, the provided number does not seem to be a correct support case ID number, could you please check it?

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

OK I'm even more concerned now.

We drove to the second datacentre and transported the new hardware across to the old datacentre. We created a Windows share on a server on the new hardware, configured it within Veeam as a new repository, then did an active full backup of the DC and the Exchange server to that repository (on the Windows file share), so we aren't touching the QNAP NASs at all.

Restores fine. Power up the DC first, let it do its reboot, wait til it's settled down, then boot up the Exchange VM. After 5 minutes, no corruption. Go away, come back a few hours later - loads of messages in the Exchange logfile about corrupt information stores.

So as of this moment we do not have ANY working backups of any of the Exchange servers. Veeam backups are our primary line of defence, along with the replicated copies to another NAS. It now appears that NONE of them work, even when I remove the NAS units completely from the equation. So all of the Exchange servers, right now, are completely without any working backup at all.

What on earth do we do now??? Is there 24x7 support available because this has now put us in an incredibly vulnerable position.....
Gostev
Chief Product Officer
Posts: 31522
Liked: 6700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

Hi, Greg.

There is nothing we can do to facilitate the resolution until you provide us with the correct support case ID number.

But from your description, there is a possibility that the DR site storage is misbehaving and introducing corruption into the running VM as it performs writes. Basically, the full VM restore process implicitly validates that your backup is not corrupted, because it verifies each restored block against the checksum stored next to it. So you can be 100% sure that what you restored in the DR site was a bit-identical copy of what the backup job gathered from your production storage with an Active Full backup.
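
The idea of verifying each block against a checksum stored next to it can be sketched generically as follows. This is an illustration only and does not reflect Veeam's actual on-disk format or checksum algorithm; the block size, the use of CRC32 and the function names are all assumptions.

```python
import zlib

BLOCK_SIZE = 4096  # arbitrary block size chosen for the illustration


def store_blocks(data, block_size=BLOCK_SIZE):
    """Split data into blocks and keep a CRC32 next to each block."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append((block, zlib.crc32(block)))
    return blocks


def restore_blocks(blocks):
    """Reassemble the data, rejecting any block whose checksum no longer matches."""
    restored = bytearray()
    for index, (block, expected) in enumerate(blocks):
        if zlib.crc32(block) != expected:
            raise ValueError(f"block {index} failed checksum verification")
        restored.extend(block)
    return bytes(restored)
```

A scheme like this catches damage that happens after the checksum was computed (in the repository or in transit). It cannot flag data that was already inconsistent at the moment it was read from the source disks.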
gregwatson wrote: Is there 24x7 support available
Yes, there is, but I don't know if you have bought this option.

Thanks!

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

Hi Gostev.

I really appreciate your reply.

Sorry, I thought I had updated the ticket ref, which is 00642879.

I haven't put much info into the ticket yet for two reasons:
1. I wanted to know if there was some way to check the integrity of backups first, and
2. I wanted to carry out this test, which we have now completed.

I am not convinced that the backup is working properly. These are the same Exchange servers that I had issues restoring (I did log a previous ticket a few weeks ago) because the vmx file was wrong. It was something to do with the fact that both these servers have their snapshot folder changed from the default in the original VM, and it doesn't look like Veeam is handling that. We had to restore the VM files rather than do a "full VM restore". I was kinda surprised that Veeam didn't want to investigate that issue any further at the time - the response basically consisted of "try restoring the VM file and then editing the vmx file to remove references to the snapshot folder".

I don't know if that has anything to do with this, but given that Veeam doesn't appear to correctly recognise and cope with an alternate snapshot path in the source VM, perhaps it could?

Just to reiterate, we have tried these restores on 3 physical servers (two identical, one different) running ESXi 5.5 and ESXi 5.1, and they all produce this corruption after a while. And these servers are using local storage, so it's three different sets of drives. We've tried SATA and SAS datastores.

How can I check if we are entitled to 24x7, and are there options for paying if we aren't?

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

Greg - I have checked, and it looks like you are entitled to 24x7 (at least from what I understand; I am not very hands-on with our support system). I have sent your case to the internal escalation DL requesting attention.

An alternate snapshot path should not cause any problems as far as gathering correct/actual data from your VM disks is concerned. Veeam actually works at a much higher level: it is the VMware vStorage API that goes to the individual VM files to collect virtual disk data blocks.

Thanks!

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

Hang on, I am seeing some worrying errors in the ESXi vmkernel.log file.

2014-09-27T20:17:00.005Z cpu1:34056)World: 14296: VC opID hostd-f2d0 maps to vmkernel opID 307f037d
2014-09-27T20:17:37.587Z cpu6:34059)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x1a (0x412e832c6c40, 0) to dev "t10.DP______BACKPLANE000000" on path "vmhba0:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid
2014-09-27T20:17:37.587Z cpu6:34059)ScsiDeviceIO: 2337: Cmd(0x412e832c6c40) 0x1a, CmdSN 0x1422 from world 0 to dev "t10.DP______BACKPLANE000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20
2014-09-27T20:17:38.605Z cpu5:36677)World: 14296: VC opID hostd-623c maps to vmkernel opID 39c25adb
2014-09-27T20:17:40.005Z cpu16:34112)World: 14296: VC opID hostd-1a01 maps to vmkernel opID 28e65f7a
2014-09-27T20:18:00.004Z cpu8:34056)World: 14296: VC opID hostd-8cd1 maps to vmkernel opID d009f48b
2014-09-27T20:18:14.955Z cpu8:34056)World: 14296: VC opID hostd-d503 maps to vmkernel opID 885fb703
2014-09-27T20:18:34.626Z cpu20:34112)World: 14296: VC opID hostd-a0e6 maps to vmkernel opID 3c7e2e90
2014-09-27T20:18:37.992Z cpu0:33054)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x85 (0x412e80415180, 34569) to dev "naa.6842b2b05a28e6001b9505850b5a2dcc" on path "vmhba0:C2:T0:L0" Failed: H:0x0 D:0x2
2014-09-27T20:18:37.992Z cpu0:33054)ScsiDeviceIO: 2337: Cmd(0x412e80415180) 0x85, CmdSN 0x9f from world 34569 to dev "naa.6842b2b05a28e6001b9505850b5a2dcc" failed H:0x0 D:0x2 P:0x0 Valid sense data
2014-09-27T20:18:37.992Z cpu0:33054)ScsiDeviceIO: 2337: Cmd(0x412e80415180) 0x4d, CmdSN 0xa0 from world 34569 to dev "naa.6842b2b05a28e6001b9505850b5a2dcc" failed H:0x0 D:0x2 P:0x0 Valid sense data
2014-09-27T20:18:37.992Z cpu0:33054)ScsiDeviceIO: 2337: Cmd(0x412e80415180) 0x1a, CmdSN 0xa1 from world 34569 to dev "naa.6842b2b05a28e6001b9505850b5a2dcc" failed H:0x0 D:0x2 P:0x0 Valid sense data
2014-09-27T20:18:40.005Z cpu1:267397)World: 14296: VC opID hostd-1a01 maps to vmkernel opID 28e65f7a
2014-09-27T20:18:55.085Z cpu15:36548)World: 14296: VC opID hostd-3ce8 maps to vmkernel opID c130672e
2014-09-27T20:19:00.004Z cpu4:34056)World: 14296: VC opID hostd-8df2 maps to vmkernel opID 65c18209

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

2014-09-27T20:17:37.587Z cpu6:34059)ScsiDeviceIO: 2337: Cmd(0x412e832c6c40) 0x1a, CmdSN 0x1422 from world 0 to dev "t10.DP______BACKPLANE000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20
These indicate failed SCSI I/Os...

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

OK - although there seem to be some people who think they are spurious warnings and can be ignored.

I'm not so sure, since I found these errors by looking in the ESXi log files at the time the event logs started reporting problems. So at this stage I believe they are relevant.

However, bearing in mind we have had this issue on three individual servers, two identically specced and a third of a different spec (the first two had old Broadcom tg drivers until we upgraded them; the third was not using Broadcom), how likely is it that the problem is a SCSI issue with the hardware?

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

That D:0x2 part is "02h Check Condition" in SCSI, which means that an error occurred when attempting to execute a SCSI command. So, I'd say it is very likely and matches what you are observing perfectly. I have been wrong before though :D
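
To read these lines without looking each code up by hand, a small decoder along the following lines may help. It is a sketch only: the regular expression is written against the `ScsiDeviceIO` lines quoted in this thread, the lookup tables list just a handful of standard SCSI values, and the function names are illustrative.

```python
import re

# Values from the SCSI standards; only a few codes relevant to this log are listed.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {0x2: "NOT READY", 0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR",
              0x5: "ILLEGAL REQUEST", 0x6: "UNIT ATTENTION"}
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}

# Matches the ScsiDeviceIO failure lines; the sense data part is optional
# because some of the pasted lines are cut off before it.
FAILED_IO = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)'
    r'(?: Valid sense data: (?P<sense>0x[0-9a-f]+))?'
)


def _name(table, text):
    """Translate a hex string such as '0x1a' using the given lookup table."""
    value = int(text, 16)
    return table.get(value, f"unknown (0x{value:x})")


def decode(line):
    """Return a dict describing a failed SCSI command, or None for other lines."""
    match = FAILED_IO.search(line)
    if match is None:
        return None
    sense = match.group("sense")
    return {
        "device": match.group("dev"),
        "command": _name(OPCODES, match.group("op")),
        "device_status": _name(DEVICE_STATUS, match.group("d")),
        "sense_key": _name(SENSE_KEYS, sense) if sense else None,
    }
```

For the BACKPLANE line this gives command `MODE SENSE(6)`, device status `CHECK CONDITION` and sense key `ILLEGAL REQUEST`. The second sense value, 0x20, is the additional sense code for an invalid command operation code, which would usually mean the device rejected a command it does not implement rather than failing a read or write.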

One thing I have no doubt about, is that you are starting off from a good clean VM after full VM restore from an active full backup...

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

OK thanks Gostev.

That's really what I was trying to establish initially from the support ticket - how possible is it that the backup and restore would complete without errors IF corruption had been introduced somewhere along the line?

I appreciate your assistance.

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

I don't think those errors are related to the issue after all, for the following reasons:

1. The log is full of them, including from before the VMs were even restored
2. I have checked a different site which is running fine, and the vmkernel log file on that server has lots of them as well
3. The following article says messages like this can be ignored (http://www-947.ibm.com/support/entry/po ... gr-5091090)
4. The hardware components are all in the ESXi HCL, they have the correct driver versions and firmware
5. It's happened across three servers and two versions of ESXi 5

In which case I am back to square one, with no clues...
Any ideas anyone?

I suppose the temptation is to say it must be an ESXi issue and not a Veeam issue, but things are usually not that clear-cut or simple... Can we 100% rule out Veeam?

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

plus, as mentioned here:

http://kb.vmware.com/selfservice/micros ... Id=1036874

those SCSI "errors" happen every 5 minutes (sometimes 10), so they would indeed appear to be what is described in that article.
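
The "every 5 minutes" observation can be checked across the whole log rather than by eye. A rough sketch, assuming the vmkernel.log lines are in chronological order and use the timestamp format shown earlier in the thread (the function name is illustrative):

```python
import re
from datetime import datetime

# Timestamp at the start of the line, then a ScsiDeviceIO failure for a device.
FAILED_RE = re.compile(r'^(?P<ts>\S+)Z .*ScsiDeviceIO.*to dev "(?P<dev>[^"]+)" failed')


def failure_intervals(lines):
    """Map each device to the gaps, in seconds, between its consecutive failure times."""
    times = {}
    for line in lines:
        match = FAILED_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        seen = times.setdefault(match.group("dev"), [])
        # Several commands often fail at the same instant; count them once.
        if not seen or seen[-1] != stamp:
            seen.append(stamp)
    return {
        dev: [(later - earlier).total_seconds()
              for earlier, later in zip(stamps, stamps[1:])]
        for dev, stamps in times.items()
    }
```

Gaps that cluster tightly around 300 or 600 seconds for a device would point to a periodic poll; irregular gaps that line up with guest I/O would be more worrying.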

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

I've also noticed that when the corruption starts, it seems to happen in the exact same page within the information store every time. I need to repeat the test a few times; I've only done it twice since I started looking at the detail of the error message, but this is what I've tried so far:

1. Snapshot the Exchange server just after it's been restored (so we can repeat tests without having to restore the backup every time)
2. Power up the Exchange VM, run an integrity check on the database to trigger activity / corruption errors
3. Record the errors
4. Revert to snapshot
5. Repeat from step 2

I've only recorded the page number twice so far and I'll do it again shortly, but:

- If it is indeed the same page every time then I do not believe it can be a fault in the underlying hardware. If it were, the location would be more random
- If it is the same page every time, that suggests to me that the restored (and snapshotted) VM has the corruption already inside it, waiting to be revealed. There is some sort of problem inherent in the restored VM that triggers when it's booted, and that problem surfaces at the exact same page every time. There is no other explanation, is there, if the page is the same each time we boot having gone back to the initial snapshotted restored VM? It means that the problem is already in the VM before it's powered up, no?
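
Once a few more runs have been recorded, the comparison of page numbers across runs can be made mechanical. A minimal sketch, assuming the page numbers from each run are copied by hand out of the event log messages (the function name and data layout are made up for illustration):

```python
def summarize(runs):
    """Split the reported pages into those seen in every run and those seen in only some.

    `runs` maps a run label to the list of page numbers reported corrupt in that run.
    """
    page_sets = [set(pages) for pages in runs.values()]
    common = set.intersection(*page_sets) if page_sets else set()
    everything = set().union(*page_sets)
    return {
        "common": sorted(common),
        "only_some_runs": sorted(everything - common),
    }
```

If `only_some_runs` stays empty as runs are added, that supports the view that the bad pages are already present in the restored image; pages that come and go between runs would point back towards the host or its storage.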

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

gregwatson wrote: It means that the problem is already in the VM before it's powered up, no?
Yes, there is a possibility that you will encounter the same problem if you reboot your production VM. In fact, we used to see this kind of issue quite often in our support... so often that it even made us implement the SureBackup functionality back in v5.

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by gregwatson »

Hi Gostev.

What I mean is that the production VM is fine - I can run the integrity checks on that server and they come back perfectly and do not trigger any corruption messages.
What I meant was that the restored VM has some sort of corruption already inside it.

Re: 006428 - Problem restoring Exchg server (corrupted backu

Post by Gostev »

Did you try to reboot the production server?