adb98
Enthusiast
Posts: 63
Liked: 13 times
Joined: Jul 21, 2016 5:03 pm
Full Name: Aaron B

Data Domain - Restore Failures - Hot Fix

Post by adb98 » 1 person likes this post

Case # 04122956
Dev # 227127

I am posting this to hopefully save someone else the days and weeks of troubleshooting I have done to figure out this issue. It took us a month with support to finally find the fix.

Starting with Veeam 10, we began seeing random restore failures when restoring databases. They occurred on any large restore (75 GB or more), and there seemed to be no rhyme or reason to them. A restore would start and then, at a random point, fail with an I/O error and crash. We were also seeing SCSI disk I/O errors on whichever server was acting as the mount server at the time. It was as if the disk presented by Veeam simply disappeared in the middle of the restore, and we would see VeeamAgent crash events in the event logs. We could go several runs in a row with only one failure, or several runs with only one success. It was completely random.

Long story short: after the case got all the way to Level 2 support, who then worked with development, it turned out that there is a bug in VeeamAgent.exe that causes it to crash at random with deduplication devices. That is all I know, so I am not sure whether it affects devices other than Data Domain. I guess the dev change number is 227127, as that is the number they asked me to append to the old Veeam agents when I replaced them with the private fix I was given. Below are the instructions I received.

1. Make sure no jobs are running;
2. On the Veeam server, backup repositories/gateway servers, and mount servers, go to C:\Program Files (x86)\Veeam\Backup Transport;
3. Rename VeeamAgent.exe to VeeamAgent.exe_227127 in both the x64 and x86 subfolders;
4. Place the new VeeamAgent.exe files from the fix zip into the corresponding folders.
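
For anyone who has to repeat steps 2 to 4 on several servers, here is a rough Python sketch of the rename-and-replace part. It is not something support provided. The function name `swap_agents`, its parameters, and the assumption that the fix zip has been extracted into a folder with matching x64 and x86 subfolders are all mine. It does not check whether jobs are running, so step 1 is still on you.

```python
import shutil
from pathlib import Path


def swap_agents(transport_dir, fix_dir, suffix, subfolders=("x64", "x86")):
    """Rename each existing VeeamAgent.exe and copy the replacement in.

    transport_dir: the Backup Transport folder on the server being patched.
    fix_dir: hypothetical folder where the fix zip was extracted, assumed
             to contain the same x64/x86 subfolders.
    suffix: tag appended to the old binary name (e.g. the number support gives you).
    Returns the paths of the renamed (old) binaries.
    """
    transport_dir = Path(transport_dir)
    fix_dir = Path(fix_dir)

    # Validate everything up front so a missing file cannot leave
    # one subfolder patched and the other untouched.
    for sub in subfolders:
        for path in (transport_dir / sub / "VeeamAgent.exe",
                     fix_dir / sub / "VeeamAgent.exe"):
            if not path.is_file():
                raise FileNotFoundError(path)
        backup = transport_dir / sub / f"VeeamAgent.exe_{suffix}"
        if backup.exists():
            raise FileExistsError(backup)

    renamed = []
    for sub in subfolders:
        current = transport_dir / sub / "VeeamAgent.exe"
        backup = current.with_name(f"VeeamAgent.exe_{suffix}")
        current.rename(backup)                                  # step 3
        shutil.copy2(fix_dir / sub / "VeeamAgent.exe", current)  # step 4
        renamed.append(backup)
    return renamed
```

Run it once per server (Veeam server, repositories/gateway servers, mount servers), passing the Backup Transport path from step 2 as `transport_dir` and the tag support gave you as `suffix`. The old binaries stay next to the new ones, so rolling back is just renaming them back.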

If you are experiencing this issue, call support and reference my case or the dev number (I am assuming that is what the number on the fix was). I would also recommend that anyone with a deduplication device try a heavy file-level or application-level restore several times to make sure all is OK. It would suck to discover this in an emergency and have to deal with it then.
foggy
Veeam Software
Posts: 21070
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson

Re: Data Domain - Restore Failures - Hot Fix

Post by foggy » 1 person likes this post

Hi Aaron, thanks for sharing; this will definitely help future readers experiencing the same issue. To add more context: the problem shows up only on repositories with decompression enabled and is caused by a race condition that occurs while reading data during FLR (file-level restore). It does not depend on the size of the restored workload, nor is it specific to Data Domain. As a workaround, you can disable asynchronous mount by setting the EnableAsyncMount registry value to 0 to avoid the race condition, but in cases where this is not applicable (like in your environment), the mentioned hotfix is required. The fix will be included in one of the next updates. Thanks!
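
For readers who want to script the workaround, the following is only a sketch. The post does not say which registry key holds EnableAsyncMount or what type the value is, so the key path is left as a required argument (get it from support or the documentation) and the DWORD type is an assumption. The helper name `build_reg_command` is hypothetical; it only builds the argument list for the standard Windows `reg add` command and does not touch the registry itself.

```python
def build_reg_command(key_path, value_name="EnableAsyncMount", data=0):
    """Build a `reg add` argument list that sets a DWORD registry value.

    key_path: full registry key; NOT given in the thread, so the caller
              must supply it.
    The REG_DWORD type is an assumption, not something the post confirms.
    """
    if not key_path:
        raise ValueError("key_path must be supplied; the thread does not name it")
    return [
        "reg", "add", key_path,
        "/v", value_name,     # value name from the post
        "/t", "REG_DWORD",    # assumed value type
        "/d", str(int(data)), # 0 disables asynchronous mount per the post
        "/f",                 # overwrite without prompting
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from an elevated prompt on the affected server once the correct key has been confirmed.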