Exchange DB recovery with VEX randomly failing

Post by **mdiver** » Aug 22, 2014 9:18 am

We are backing up a large Exchange server with 12 DBs (11 + PF, ~4 TB in total) twice daily with an incremental job (~20 min runtime). An active full is carried out on Saturdays (~4 h runtime).
The backup uses SAN integration (FC) from a DataCore-based storage array (very low latency). Exchange is 2010 on Windows 2008 R2. Veeam is 7.0.0.871 (patch 4).
VSS and log truncation are switched on. The backup job always completes successfully, including those two steps.

Recovery is another matter: when we try to recover with VEX, one or two DBs sporadically fail with: JetError -1018, JET_errReadVerifyFailure, Checksum error on a database page
Sometimes all DBs are fine in VEX as well. Strangely, it is always a different combination of DBs, and no error can be found on the production system.

We've already reset CBT and tried active-fulls many times to get rid of the behavior.

Here is an example showing the most recent chain:

| Date/Time | Type | DBs mountable in Veeam Exchange Explorer |
|---|---|---|
| 18.08. 18:00 | Active full | All |
| 19.08. 06:00 | Incremental | All |
| 19.08. 18:00 | Incremental | All except DB3 |
| 20.08. 06:00 | Incremental | All |
| 20.08. 18:00 | Incremental | All except DB2 & DB4 |
| 21.08. 06:00 | Incremental | All |
| 21.08. 18:00 | Incremental | All except DB3 & DB6 |
| 22.08. 06:00 | Incremental | All except DB5 & DB-PublicFolder |
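To make the pattern in the chain above easier to see, here is a small Python sketch that tallies the failures per DB and per restore point. The data is transcribed from the list above; the variable names are only illustrative.

```python
from collections import Counter

# Restore points from the chain above:
# (date/time, job type, DBs that could NOT be mounted in VEX)
restore_points = [
    ("18.08. 18:00", "Active full", []),
    ("19.08. 06:00", "Incremental", []),
    ("19.08. 18:00", "Incremental", ["DB3"]),
    ("20.08. 06:00", "Incremental", []),
    ("20.08. 18:00", "Incremental", ["DB2", "DB4"]),
    ("21.08. 06:00", "Incremental", []),
    ("21.08. 18:00", "Incremental", ["DB3", "DB6"]),
    ("22.08. 06:00", "Incremental", ["DB5", "DB-PublicFolder"]),
]

# How often each DB failed across the chain
failures_per_db = Counter(db for _, _, failed in restore_points for db in failed)

# Which restore points had at least one unmountable DB
affected_points = [when for when, _, failed in restore_points if failed]

print(f"{len(affected_points)} of {len(restore_points)} restore points affected")
for db, count in failures_per_db.most_common():
    print(f"{db}: failed in {count} restore point(s)")
```

In this chain the failures are spread over six different DBs, and a DB that fails in one restore point is mountable again in the next one.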

We've been working for several weeks on the case together with support. No solution so far. Case #00607414

Has anyone ever seen similar behavior?

It worries me that we can randomly end up with a corrupt restore point even though the backup run succeeded without any sign of an error.

Thank you and regards,
Mike

Post by **foggy** » Aug 22, 2014 9:47 am

Mike, am I understanding right that the failed DBs are always the same for each given restore point but different DBs fail for different restore points?

Have you tried to run eseutil against those databases after the restore point is mounted?

Post by **mdiver** » Aug 22, 2014 1:46 pm

Correct. The same restore point shows the same DBs as corrupt in every VEX session.
A different restore point shows other DBs, or none at all, as corrupt.
Currently all increments are based on the same active full, which shows no errors in VEX. But we have also had active fulls that showed errors while the increments based on them were fine.

ESEUTIL /k shows checksum errors in many pages on the DB in question.
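For anyone who wants to repeat this check across all DBs of a mounted restore point, here is a minimal Python sketch that wraps `eseutil /k` (checksum verification mode). It assumes that `eseutil` is on the PATH and that a non-zero exit code means the checksum pass did not come back clean; the function names and the `runner` parameter are hypothetical helpers, and the paths to the mounted `.edb` files depend on your environment.

```python
import subprocess

def build_checksum_command(edb_path, eseutil="eseutil"):
    """Return the argument list for an 'eseutil /k' checksum pass over one file."""
    return [eseutil, "/k", str(edb_path)]

def verify_databases(edb_paths, runner=subprocess.run):
    """Run 'eseutil /k' against each database file.

    Returns a dict mapping each path to True when the exit code is 0.
    Assumption: a non-zero exit code indicates a failed checksum pass.
    """
    results = {}
    for path in edb_paths:
        completed = runner(
            build_checksum_command(path), capture_output=True, text=True
        )
        results[str(path)] = completed.returncode == 0
    return results
```

Calling `verify_databases([...])` with the paths of the mounted `.edb` files gives a quick per-DB pass/fail overview for one restore point. The `runner` argument only exists so the logic can be exercised without `eseutil` installed.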

Post by **foggy** » Aug 22, 2014 2:44 pm

mdiver wrote: ESEUTIL /k shows checksum errors in many pages on the DB in question.

This means that the issue is at the backup stage, not during restore. I've asked your engineer to escalate the ticket so R&D could take a closer look at it.

Post by **mdiver** » Aug 22, 2014 3:15 pm

I agree that it's not VEX's fault. But it might still be the FLR process itself that presents corrupt data during the mount.

As I pointed out: In the same chain the next point has the same DB in good shape again.

If the backup process introduces the error with a single incremental shot, the next shot seems to clean up things again.
Anyway, I look forward to being contacted to sort out this strange behavior.

Feb 28, 2015 10:24 am

The issue could not be solved under V7. VEX was not the cause; the backup itself really did contain the corruption.

Support finally suggested upgrading to V8 once it reached GA, because of significant changes in the underlying backup engine.
That solved the issue. We checked around 60 restore points, each with 14 Exchange DBs. No further checksum errors were observed. Before, we had 1-2 DBs out of the 14 with a checksum error every second run.
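To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It simply extrapolates the old failure rate (1-2 DBs every second run) to the ~60 restore points checked under V8; treating the old rate as constant is an assumption.

```python
restore_points = 60   # restore points checked under V8 (approximate)
dbs_per_point = 14    # Exchange DBs per restore point

# Total number of DB copies verified under V8
total_checks = restore_points * dbs_per_point

# Under V7, roughly every second run had 1-2 DBs with a checksum error
affected_runs = restore_points // 2
expected_failures_low = affected_runs * 1
expected_failures_high = affected_runs * 2

print(f"DB copies checked under V8: {total_checks}")
print(f"Failures expected at the old rate: "
      f"{expected_failures_low}-{expected_failures_high}")
print("Failures observed under V8: 0")
```

At the old rate, somewhere between 30 and 60 of the 840 DB copies would have been expected to show a checksum error, so zero observed errors is a clear change.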

The rest of the system is still the same, so our Veeam implementation as well as the customer's Exchange implementation were generally fine.
That said, we are now going to break this huge, monolithic Exchange down into 4 servers, and the Veeam installation into a tiered 3-server cluster for the customer.

We have been working with Veeam B&R with many customers for almost seven years now. Usually "it just works!".
With this experience I can state that this really was a tricky issue, especially in isolating whether VSS (-> MS) or Veeam itself was causing the problems.

The case was worked on for more than half a year with constant efforts from Veeam support. Thank you for your very professional and patient work.
