
Issues with large File Server backups

Post by jhladish »

Hey all,

I have a client with a job that processes roughly 7 TB of data across 7 different file server VMs. We are currently running a daily reverse incremental on the VMs. We have had Veeam set up for around a month now and have had nothing but issues with this job. It began with running out of space on the repository due to the large backup size, which we resolved by moving to a new repository dedicated to this one job. After that, we had a Zlib decompression error, which we resolved by running a new active full backup. That wasn't the most desirable solution, as we lost our entire chain of backups from before that point.

We are now getting a new error when processing one VM: "Client error: end of file. Failed to process [srcCopyLocal] command. Exception from server: RLE decompression error: [526296] bytes decoded to [524288]. Failed to process [srcCopyLocal] command."
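For my own understanding of what that message is reporting, here is a minimal sketch (my own illustration only; I don't know Veeam's actual block format or decoder, so the byte-pair encoding below is an assumption): an RLE decompressor knows how many bytes a block should expand to, and it flags the block as corrupt when the decoded length doesn't match.

```python
# Minimal, purely illustrative RLE decoder with an expected-size check.
# The encoding used here (a run-length byte followed by a value byte) is an
# assumption; it is not Veeam's actual on-disk format.

def rle_decode(compressed: bytes, expected_size: int) -> bytes:
    if len(compressed) % 2 != 0:
        raise ValueError("truncated RLE stream")
    out = bytearray()
    for i in range(0, len(compressed), 2):
        count, value = compressed[i], compressed[i + 1]
        out.extend([value] * count)
    if len(out) != expected_size:
        # A damaged block shows up as a length mismatch, much like
        # "RLE decompression error: [526296] bytes decoded to [524288]".
        raise ValueError(f"RLE decompression error: [{len(out)}] bytes "
                         f"decoded, expected [{expected_size}]")
    return bytes(out)

if __name__ == "__main__":
    sample = bytes([4, 0x41, 3, 0x42])          # 4 x 'A', 3 x 'B'
    print(rle_decode(sample, expected_size=7))  # b'AAAABBB'
```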

It seems as though the only explanation for these repeated failures is network packet loss that leaves the backup corrupt. Please correct me if I'm wrong in assuming that. Is there any way to set up the job so that, if there is a failure during one of the backups, we don't have to keep running new active fulls and losing our backup chain? Or is there an option for better backup verification before success is confirmed, so the job doesn't keep working from a corrupt restore point? Would changing the storage optimization to a WAN target help?
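To show the sort of extra verification I have in mind, here is a rough sketch (nothing Veeam-specific; the repository path and file extensions are placeholders). The idea is to run it right after a job finishes and again just before the next job starts: a backup file whose hash changed while no job was writing to it points at corruption on the repository volume itself.

```python
# Sketch: keep a SHA-256 manifest of the backup files so silent corruption on
# the repository volume can be spotted between job runs. The repository path
# and extensions below are placeholders for this example.
import hashlib
import json
import pathlib

REPO = pathlib.Path(r"E:\VeeamRepo\FileServers")   # hypothetical repository path
MANIFEST = REPO / "checksums.json"

def sha256(path: pathlib.Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def verify_and_update() -> None:
    old = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    new = {}
    for f in sorted(REPO.glob("*.vbk")) + sorted(REPO.glob("*.vrb")):
        new[f.name] = sha256(f)
        # If no job has written to this file since the last check, a changed
        # hash points at corruption on the repository volume itself.
        if f.name in old and old[f.name] != new[f.name]:
            print(f"WARNING: {f.name} changed since the last check")
    MANIFEST.write_text(json.dumps(new, indent=2))

if __name__ == "__main__":
    verify_and_update()
```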

FYI, we are currently running inline deduplication, dedupe-friendly compression, and LAN target storage optimization, and we are also running Windows deduplication on the target Windows Server 2012 repository box.

Thanks in advance,
Jordan

case# 00462685

Re: Issues with large File Server backups

Post by veremin »

I’m wondering what type of repository it is. Common Windows repository or CIFS share?

I'm asking because, starting with version 6.5, we do indeed have inline network traffic validation. With this functionality, blocks that become corrupted during network transfer are resent automatically.

Traffic is validated between two Veeam agents, however, so this doesn't apply to a CIFS share, which can't run the Veeam transport service.
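Conceptually it works roughly like this; the sketch below is only an illustration of the verify-and-resend idea, not our actual agent protocol: the source agent ships each block with a digest, the target agent recomputes it, and a mismatch triggers a resend of just that block.

```python
# Conceptual sketch of checksum-based inline traffic validation. This is NOT
# Veeam's actual agent protocol; it only illustrates the verify-and-resend idea.
import hashlib
import random

def send_block(block: bytes) -> tuple:
    """Source agent: ship the block together with its digest."""
    return block, hashlib.sha256(block).hexdigest()

def flaky_network(block: bytes, corruption_rate: float = 0.2) -> bytes:
    """Simulate occasional corruption in transit."""
    if random.random() < corruption_rate:
        return block[:-1] + b"\x00"        # damage the last byte
    return block

def receive_block(block: bytes, digest: str, max_retries: int = 3) -> bytes:
    """Target agent: verify the digest and request a resend on mismatch."""
    for attempt in range(1, max_retries + 1):
        received = flaky_network(block)
        if hashlib.sha256(received).hexdigest() == digest:
            return received
        print(f"block corrupted in transit, requesting resend #{attempt}")
    raise IOError("block still corrupt after retries")

if __name__ == "__main__":
    data, checksum = send_block(b"x" * 4096)
    receive_block(data, checksum)
```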

Thanks.

Re: Issues with large File Server backups

Post by jhladish »

The repository is a common Windows repository.

Re: Issues with large File Server backups

Post by jhladish »

I have now experienced this issue with a different Veeam job and have a case open for it, #00478940. I've searched the forums for an answer but haven't come across much besides my own post here.

Can anyone provide assistance? At first glance it looks like corruption during the network data transfer, but with inline traffic validation I would hope that wouldn't be the issue.

Re: Issues with large File Server backups

Post by kte »

Did you use a VMXNET3 adapter?

Re: Issues with large File Server backups

Post by jhladish »

Yes and no. We weren't sure whether only the proxies needed the network adapter change, so all of the 60+ VMs still have the default E1000E adapter while the proxies have the VMXNET3 adapter.

I was told that with inline traffic validation this shouldn't be an issue, though.

Re: Issues with large File Server backups

Post by tsightler »

I'd probably be more suspicious of the Windows 2012 dedupe, as I've seen several users report strange "corruption" style issues when using it. I'm not sure that's the problem, but there have been several threads about strange corruption issues, and in almost all cases the users had Windows 2012 dedupe enabled on the repository:

http://forums.veeam.com/viewtopic.php?f ... 94&start=0

If you're using reverse incremental, there's probably not a big advantage to using Windows 2012 dedupe anyway; just use Veeam compression and you'll probably get very similar savings.

Re: Issues with large File Server backups

Post by jhladish »

tsightler wrote: If you're using reverse incremental, there's probably not a big advantage to using Windows 2012 dedupe anyway; just use Veeam compression and you'll probably get very similar savings.

I am using reverse incremental for the job, and am currently deduping almost 2 TB of data using the Windows 2012 dedup capabilities.

I originally implemented this after reading the following article
http://www.veeam.com/blog/how-to-get-un ... ation.html

I'll try turning off deduplication and see if that resolves anything. It's just unfortunate this occurred, because the article, written/endorsed by Veeam, strongly recommends this option for the best performance of your backup repository.

Re: Issues with large File Server backups

Post by jhladish »

So all, I'm hoping to get some more thoughts on this issue.

To recap:
- I am backing up to a Windows Server 2012 repository with volume-level deduplication enabled.
- I received the RLE decompression error with per-job deduplication enabled AND disabled.
- Reducing compression from a higher setting to the dedupe-friendly level didn't help the situation.

I have opened a ticket about this issue a few times now, and each time the closing suggestion has been that one of the options above is the resolution. I end up restarting the backup chain, which of course resolves the issue right away, but it has happened again after some amount of time. This will be the second or third time I've experienced this with one of our jobs.

Based on the article I linked in the post above, I was confident in enabling volume-level AND job-level deduplication, and the space savings I'm seeing are phenomenal. Is disabling volume-level deduplication the only way to move forward with troubleshooting this issue, or are there other options I can explore tweaking? I would love not to have to remove the volume deduplication, given the increased retention length it allows us to have.

Thanks in advance for any input.

Also, I have opened a new case regarding this: #00487453.

Re: Issues with large File Server backups

Post by zoltank »

7 TB across 7 file servers? My first reaction would be to split it into multiple jobs, even one job per server, to reduce the size of each backup. Aside from speeding up the backup process, it would help you troubleshoot this issue. It would also keep you from having to do an active full on all 7 TB and all 7 servers if a job bombs.

Re: Issues with large File Server backups

Post by jhladish »

Thank you for the response.

The issue I'm most recently posting about involves a job that backs up a set of Lync servers totaling around 500 GB. For the file servers, there is a little under 7 TB of data spread across 7 different file server VMs. The job that backs up those file servers is split into two parts, both pointing to the same repository.

I appreciate your input about splitting the file servers into multiple jobs, though, and I think I will do this. You make a good point that it would keep me from having to rerun an active full on multiple file servers. The only downside is that if I do this, I will have to rerun each job afterwards, creating new active fulls on the repository, which we currently do not have space for.

Re: Issues with large File Server backups

Post by yizhar »

Hi.

I also endorse splitting such a large job (the 7 TB job) into multiple smaller jobs, in your case one file server per job.
Having several smaller VBK files will make your experience much better and safer, as the scope for corruption (and for troubleshooting) is more focused.
It will also help Windows 2012 dedup, which copes better with smaller files.

I also suspect that Windows 2012 dedup might be related to the problems, so I suggest first trying a plain non-deduplicated target volume, and only after you have good results adding dedup back into the picture on the same or a different volume.
I suggest starting with a fresh NTFS volume if possible, instead of trying to rehydrate the existing data. If you don't want to wipe your existing backups, you can take a different Windows PC with enough free space (even a PC with a single SATA drive, for testing) and configure it as another repository. Just make sure it is configured with 1 concurrent job max.
Then target a single job to that repository and check the backup results for several days.
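If it helps, something as simple as the sketch below, scheduled daily on the test box, records what each run leaves on the test repository (file names, sizes, timestamps), so after several days you can compare the deduped and non-deduped targets side by side. The path is just a placeholder.

```python
# Sketch: record the contents of a test repository once a day to a CSV log so
# several days of results can be compared. The path below is a placeholder.
import csv
import datetime
import pathlib

TEST_REPO = pathlib.Path(r"D:\TestRepo")       # hypothetical test repository
LOG_FILE = TEST_REPO / "daily_inventory.csv"

def record_inventory() -> None:
    today = datetime.date.today().isoformat()
    rows = []
    for f in sorted(TEST_REPO.glob("*.v*")):   # VBK/VIB/VRB files
        stat = f.stat()
        modified = datetime.datetime.fromtimestamp(stat.st_mtime).isoformat()
        rows.append([today, f.name, stat.st_size, modified])
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["date", "file", "size_bytes", "modified"])
        writer.writerows(rows)

if __name__ == "__main__":
    record_inventory()
```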

Yizhar