Zlib decompression error [-3] - Unable to Restore

tbsam · Post by **tbsam** » Oct 23, 2012 2:01 pm this post

Case # 00096180 / ID#5187896
Veeam 6.0 & 6.1

We backup to a remote DR site over a WAN.

When we tried to restore one of our servers, the restore failed with a Zlib decompression error [-3].
This gave us some concern, as the backup was still running every day and reporting success.
We asked for a tool so that we could check our backups. Some months later Veeam created the veeam.backup.validator tool.

We've now run this on several of our backups and it has reported that some others are also corrupt. This is despite the backups still running daily and reporting success, and despite the "Enable automatic backup integrity checks" option in the Advanced Settings being ticked.

What is also unusual about this is that you can run a surebackup task on the backup and it appears to work OK: the virtual server boots up and runs, and a chkdsk on the virtual machine comes back clean. However, if you happen to run a program that accesses data in the corrupt region of the backup, the whole virtual machine crashes.
It is impossible to restore the whole virtual machine as the restore process crashes out because of the zlib decompression error.
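
For reference, -3 is zlib's `Z_DATA_ERROR` code, which the library returns when a compressed stream is malformed or the decompressed data fails its checksum. The following is a minimal sketch using Python's standard `zlib` module (not anything from Veeam; the sample data and the `try_decompress` helper are made up for illustration). It reproduces the same error code by flipping one bit in a stream's Adler-32 trailer:

```python
import zlib


def try_decompress(blob: bytes) -> tuple[bool, str]:
    """Return (True, "") on success, or (False, error text) on failure."""
    try:
        zlib.decompress(blob)
        return True, ""
    except zlib.error as exc:
        return False, str(exc)


# Illustrative data standing in for a block of a virtual disk.
original = b"virtual disk block contents " * 1000
good = zlib.compress(original)

# Flip one bit in the last byte, which is part of the Adler-32 trailer.
damaged = good[:-1] + bytes([good[-1] ^ 0x01])

print(try_decompress(good))
print(try_decompress(damaged))
```

The point of the sketch is only that a single damaged bit is enough to make an otherwise valid stream unreadable, and that zlib reports this at decompression time.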

Veeam have advised that there is a custom agent that can bypass the corrupted blocks but that is not acceptable.

I think it's very important to point out these 4 points to the community.

1. Just because the Job reports success doesn't mean the backup was successful.
2. The "Enable automatic backup integrity checks" doesn't detect decompression errors.
3. The surebackup job will not verify that the backup image is error free or free of decompression errors / corruption.
4. The only way of validating the backup is successful is to perform a full restore or run the veeam.backup.validator tool.
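
Points 3 and 4 can be illustrated with a small sketch. It assumes a hypothetical backup layout in which every block is compressed independently with zlib (the real Veeam file format is not described in this thread, and `make_backup`, `read_block` and `validate_all` are invented names). A test that reads only a few blocks passes, while a scan of every block finds the damaged one:

```python
import zlib


def make_backup(blocks: list[bytes]) -> list[bytes]:
    """Compress each block independently (hypothetical layout)."""
    return [zlib.compress(b) for b in blocks]


def read_block(backup: list[bytes], index: int) -> bytes:
    """Decompress one block; raises zlib.error if it is damaged."""
    return zlib.decompress(backup[index])


def validate_all(backup: list[bytes]) -> list[int]:
    """Decompress every block and return the indexes that fail."""
    bad = []
    for i, blob in enumerate(backup):
        try:
            zlib.decompress(blob)
        except zlib.error:
            bad.append(i)
    return bad


backup = make_backup([bytes([i]) * 4096 for i in range(8)])
# Damage block 5 by flipping a bit in its checksum trailer.
backup[5] = backup[5][:-1] + bytes([backup[5][-1] ^ 0x01])

# A "boot test" that only touches blocks 0-2 sees nothing wrong.
boot_ok = all(read_block(backup, i) == bytes([i]) * 4096 for i in range(3))
print(boot_ok, validate_all(backup))
```

Any check that does not read every block can miss corruption in the blocks it skips, which is consistent with a VM that boots cleanly but crashes when one particular region is accessed.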

I would be interested to see if anybody is checking their backups any further than with a surebackup.
If not, I would advise them to do so pretty sharply.

The case is still grumbling on after about 6 months. I have also found a problem with the veeam.backup.validator tool that is going to slow down progress.

Post by **Gostev** » Oct 23, 2012 2:34 pm this post

tbsam wrote:We backup to a remote DR site over a WAN.

Yes, I have highlighted this possible issue with WAN backups in the past (see the forum digest from a couple of weeks ago). This issue is the reason why we are adding our own network traffic verification engine in the 6.5 release. Unfortunately, some network hardware vendors just do not follow TCP/IP specs... so in future we will no longer trust any NIC's opinion on whether the traffic received was corrupted or not.

tbsam · Post by **tbsam** » Oct 23, 2012 3:37 pm this post

The Host ESX Server has a BCM5709 Gigabit Network Card.
It is LAN connected to the Veeam Server which is running a Broadcom BCM5709C NetXtreme II GigE network card.
The Veeam Server is WAN connected to the repository server which is running an Intel Pro/1000 PT network card.

With the repository server running an Intel network card, is this setup still capable of allowing packets with invalid checksums through?

Post by **Gostev** » Oct 23, 2012 3:57 pm this post

While Broadcom is a suspect here, I cannot really comment on specific NIC makes and models (and actually, this also appears to depend on NIC firmware versions as per comments in the above-linked topic). We did confirm this issue with a couple of specific combinations - both involved some Broadcom NICs - but, of course, we cannot (and are unable to) test every possible NIC/firmware combination.

Bottom line is, our engine will no longer "trust" TCP/IP checksums, so it simply would not matter what combination you are using.
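
As an illustration of the general idea of not relying on TCP/IP checksums (the TCP checksum is only a 16-bit ones' complement sum, and it can be offloaded to the NIC), the sender can attach its own checksum to each payload and the receiver can verify it in software. This is a generic sketch, not the engine described above. The frame layout, the choice of CRC-32, and the `frame`/`unframe` names are all assumptions made for the example:

```python
import struct
import zlib

# Hypothetical frame header: payload length and CRC-32 of the payload,
# both as unsigned 32-bit integers in network byte order.
HEADER = struct.Struct("!II")


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails verification."""
    length, expected = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != expected:
        raise ValueError("payload failed application-level CRC check")
    return payload


message = frame(b"compressed backup block")
# Simulate a bit flipped somewhere between sender and receiver.
corrupted = message[:-1] + bytes([message[-1] ^ 0x01])
```

With a check like this, a payload damaged in transit is rejected on arrival and can be requested again, instead of being written to the repository and discovered only at restore time.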

tbsam · Post by **tbsam** » Oct 24, 2012 11:52 am this post

I believe this problem may not actually be related to running over a WAN link.
Some recent investigative work has highlighted that the compression corruption first appears after the Synthetic Full Backup operation.
This is an amalgam of the first full backup and subsequent incrementals, all of which individually check out OK with the veeam.backup.validator tool.
The current situation is that there has been 1 synthetic full backup on the 22-October, an incremental backup on the 23-October and another on the 24-October.
The incremental taken on the 24-October contains compression corruption.
I am in the process of checking the backup for the 22-October and 23-October.

I have taken MD5 checksums of all the backup files prior to the 22-October. If they check out OK then we can rule out storage corruption, which has also been blamed up to this point. I've always doubted storage corruption as it is enterprise class storage that we are using.
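
A check of this kind can be scripted. The sketch below assumes the earlier checksums were recorded as a simple mapping of file name to MD5 hex digest. The `md5_of` and `changed_files` helpers and the manifest format are illustrative, not part of any Veeam tooling:

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: dict[str, str], folder: Path) -> list[str]:
    """Return names whose current MD5 differs from the recorded one."""
    return sorted(
        name for name, recorded in manifest.items()
        if md5_of(folder / name) != recorded
    )
```

If `changed_files` returns an empty list for files that are not supposed to change, the bytes on disk are the same as when the checksums were taken. That rules out later corruption at rest for those files, but it says nothing about whether the data was already bad when it was first written.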

Post by **Gostev** » Oct 24, 2012 12:10 pm this post

Since synthetic full does not involve any data processing (such as decompression/recompression), and is just about moving the existing data into the new file, corruption appearing during synthetic full creation can only be caused by storage issues.

tbsam wrote:I've always doubted storage corruption as it is enterprise class storage that we are using.

I find it surprising you think so. I like to say that all of our best case studies come from disasters caused by production data corruption on the enterprise class storage people use to run their VMs on. Anything ranging from bit rot (ever heard of the term URE?) to massive corruptions of multiple volumes with random data written by a malfunctioning controller. The only difference is that when this happens on production storage, the result is usually easily noticeable (as soon as at least one application is impacted).

I also noticed one other incorrect statement that I missed earlier, and that is the SureBackup job not being able to detect this type of corruption. It is only a matter of proper SureBackup job setup. Of course, a SureBackup job will detect such corruption if the test involves reading every disk block (for example, having a custom test script run chkdsk with the corresponding option). This just takes unreasonably long, which makes it a bad default considering the chance of URE, and the fact that more common types of corruption will come up even with the basic SureBackup test.

Post by **Gostev** » Oct 24, 2012 12:25 pm this post

tbsam wrote:The current situation is that there has been 1 synthetic full backup on the 22-October, an incremental backup on the 23-October and another on the 24-October. The incremental taken on the 24-October contains compression corruption.

In this case, it definitely looks like a data corruption issue during data transfer over WAN. As far as I understand from your explanation, this is a "freshly brought" forward incremental backup file. Its data was collected at the source site - in other words, its production process did not use any data you have in the target site.

The data was collected at source, compressed and sent to target over WAN. The most likely cause is that part of that payload was corrupted during the transfer, so the data stored no longer matches the CRC we include with each backup block. A less likely cause is that the data came over without corruption and was written to backup storage correctly, but when read back by the restore process, corrupted data was provided by the storage.
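
The mechanism described here, a CRC of the source data stored with each compressed block and compared on restore, can be sketched generically as follows. This is not Veeam's actual format or code. The `StoredBlock` structure, the use of CRC-32, and the function names are assumptions for illustration:

```python
import zlib
from dataclasses import dataclass


@dataclass
class StoredBlock:
    compressed: bytes
    source_crc: int  # CRC-32 of the uncompressed source data


def backup_block(source: bytes) -> StoredBlock:
    """Compress a block at the source and record the CRC of the original."""
    return StoredBlock(zlib.compress(source), zlib.crc32(source))


def restore_block(block: StoredBlock) -> bytes:
    """Decompress a block and verify it against the recorded source CRC."""
    data = zlib.decompress(block.compressed)  # may raise zlib.error
    if zlib.crc32(data) != block.source_crc:
        raise ValueError("restored data does not match source CRC")
    return data


block = backup_block(b"source data " * 500)
```

In a scheme like this the comparison happens when a block is read back, so damage introduced in transit or at rest after `backup_block` has run stays invisible until something actually reads that block.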

tbsam · Post by **tbsam** » Oct 24, 2012 12:52 pm this post

The original backup was actually transferred to the repository server via an external usb drive.
Each file was MD5 checked at each stage to ensure it was copied to the repository server intact.

Post by **Gostev** » Oct 24, 2012 1:00 pm this post

As per my earlier explanation, this does not matter. Incremental run collects all data at the source site, so its production process does not use any raw block data from backup files you have in the target site.

tbsam · Post by **tbsam** » Oct 31, 2012 3:47 pm this post

I've been thinking about this further.

We have a separate system to Veeam that compresses files into a RAR archive on the source server. A CURL FTP client on the target server at the DR site downloads the file from the source server (running FileZilla Server) across the WAN. It then unpacks the RAR archive and extracts the files into a folder on the target server.
This operation happens every day, and the size of the archive being transmitted is between 40 and 60 GB.
The target server is the one which also acts as the veeam target repository, utilising the same storage.

We have never had an occurrence of a RAR decompression error on the target server. The system pre-dates the installation of veeam onto it and has been running for years.
If we had inherent problems with our storage or our WAN network, would we not have seen RAR reporting corruption by now?

Therefore, instead of pointing the finger at our WAN or storage, is it not within the realms of possibility that the compression/decompression algorithms used by Veeam are carrying a bug that, under certain unusual circumstances, results in corruption of the data or the inability to read the compressed data?

Post by **Gostev** » Oct 31, 2012 3:52 pm this post

tbsam wrote:Therefore, instead of pointing the finger at our WAN or Storage, is it not beyond the realms of possibility that the compression/decompression algorithms used by Veeam are carrying a bug that under a certain unusual particular circumstance results in corruption of the data or the inability to read the compressed data.

This one is actually completely beyond the realms of possibility with the algorithm we are using, zlib (the whole IT industry would be in huge trouble if it had a bug), the length of time we alone have been using it in our product without any changes to the libraries (5 years), and the amount of data processed during this timeframe by over 50'000 customers (remember there is CRC check built-in to detect corruption, so it would never go unnoticed). Law of big numbers is hard to argue with...

tbsam wrote:If we had inherent problems with our storage or our WAN network would we not have seen RAR reporting corruption by now.

Issues like sticky bits in RAM modules appear as electronic circuits wear. A router or switch can be working fine for years, and then start corrupting traffic all of a sudden.

tbsam · Post by **tbsam** » Oct 31, 2012 4:06 pm this post

That's a very quick and arrogant assessment.
The facts do speak volumes here. If we see no corruption in our other backup systems and it's fair to say that the only problem we have is with Veeam, then the problem is with Veeam.
The problem could equally be a bug in the communication between the Veeam server and the Veeam repository worker.
Veeam isn't infallible. It might be your job to defend it, but the list of errors that are fixed each quarter demonstrates that each previous release does carry bugs.

tbsam · Post by **tbsam** » Oct 31, 2012 4:08 pm this post

"remember there is CRC check built-in to detect corruption"

That doesn't work!

We have backups upon backups that reported success, only for us to discover later that the backup is corrupt and was corrupted several backups previously.

tbsam · Post by **tbsam** » Oct 31, 2012 4:12 pm this post

"Issues like sticky bits in RAM modules appear as electronic circuits wear."

Nope, the server hardware is 100% ok.

However, the target server is running an AMD processor. Could it be that there is an issue with AMD that your software hasn't accounted for?

Post by **Gostev** » Oct 31, 2012 4:32 pm this post

Sorry if you find my response arrogant - it was not meant to be, and I personally don't see it - but that might be due to the cultural difference (I am not a native speaker).

CRC check does work - reading through my explanation above, CRC check is performed upon restore by comparing restored data with the known CRC of source data. Considering the size of our customer base and the amount of protected data, we would have thousands of reports of failed restores over the course of the past 5 years - if the actual compression algorithm had an issue. And we are just not seeing this.

I never said Veeam is infallible (I am the one writing those patch notes anyway), but there are certain parts of code that are absolutely unlikely to have bugs. You happened to question such part - unchanged for years, and proven to be 100% reliable industry-standard compression algorithm. My response regarding bugs likelihood was to that specifically. Implying that every piece of code in the product may have bugs because we release plenty of fixes for the newly introduced functionality does not make any sense. There is new functionality, and there is functionality that is unchanged for years, and proven to be stable on all possible configurations (including all possible processors). Of course, the respective quality and stability of these two groups of features would be very different.

I think we've both made our points very clearly. Since no one else joined this discussion in the whole week, and the last posts show that discussion between us is going nowhere, I am locking the topic down until new FACTS appear based on our support or your own investigation.
