Host-based backup of Microsoft Hyper-V VMs.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Inline dedupe on active fulls

Post by RGijsen »

We are running 9.0.0.1715 for Hyper-V. I thought I'd ask here first before logging a ticket, as maybe I'm overlooking something very obvious. I have a separate job for our Exchange 2016 environment. It has always been just one Edge server and one mailbox server, with about 950GB of disks and about 700-800GB of actual data, including the OS itself. The Exchange stores are about 500-600GB. About two months ago I installed another Exchange mailbox server and formed a DAG for redundancy, and added it to the job. In the job I have inline dedupe enabled and the storage optimization set to WAN target, as I have to be restrictive on space and want the best dedupe results possible.

Now, here is something that looks strange to me. We run one full per week and incrementals the rest of the time. As both mailbox servers hold roughly the same data, the dedupe results of the incremental runs are, as expected, quite good:
[screenshot: job statistics for the incremental runs, showing a healthy dedupe ratio]

But when the full runs, it doesn't seem to dedupe at all:
[screenshot: job statistics for the full run, showing practically no dedupe]

This is also reflected in the .vbk file: the full is over 1.1TB, while there is about 600GB of 'identical' data in it. I don't see why the full run gets practically no dedupe at all, since even the incrementals show that quite a bit of each day's changes can be deduped quite well. If I get six days of decent dedupe, I would expect the full backup to achieve a ratio of noticeably more than 1.0x.
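To put a number on that expectation, here is a back-of-the-envelope calculation (a rough sketch; the 600GB overlap is my own estimate of the duplicated DAG copy, not a figure Veeam reports):

```python
# Rough upper bound on the dedupe ratio, assuming ~600GB of the 1.1TB full
# is a second, block-identical copy of the mailbox stores.
total_gb = 1100                      # size of the full backup as written
duplicate_gb = 600                   # estimated identical second DAG copy
ideal_stored_gb = total_gb - duplicate_gb

print(f"ideal ratio: {total_gb / ideal_stored_gb:.1f}x")  # ~2.2x, not ~1.0x
```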

Don't full backups dedupe at all, from Veeam's perspective?
alanbolte
Veteran
Posts: 635
Liked: 174 times
Joined: Jun 18, 2012 8:58 pm
Full Name: Alan Bolte
Contact:

Re: Inline dedupe on active fulls

Post by alanbolte »

This view doesn't fully reflect the deduplication results, particularly when identical data sits on different disks. Look under Backups > Disk, find your job name, then right-click > Properties.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hmm ok. That gives me the following:
[screenshot: backup properties dialog from Backups > Disk]

I've just checked: the total data size is 1301GB. As said, the mailstore is about 600GB, so those roughly 1200GB of store data should dedupe down to about 600GB, as the two copies are nearly identical. Windows dedupe shrinks these .vbk files tremendously (but that uses a dynamic block size; in this case 64K, I guess, as that's my cluster size on the mailstore disks, so no smaller 'blocks' there anyway). This makes me believe that even on the WAN target setting, with 256KB blocks, Veeam's dedupe is rather ineffective. But I still don't really get it. I know an Exchange DAG initially does an actual file copy of the store; after that it ships and replays the logs, and I can imagine the files don't stay identical that way. But by far the biggest part of these 600GB is unchanged, so those mailstore files on disk should be pretty much the same. I wonder why Veeam's dedupe works so badly on them?
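To illustrate one likely factor: with fixed-size blocks, a small insertion shifts every subsequent block boundary, so the block hashes no longer match even though nearly all of the content is identical. A minimal sketch with made-up data (not Veeam's actual implementation; variable-block engines such as Windows dedupe resynchronize their chunk boundaries and so survive this):

```python
import hashlib
import os

BLOCK = 256 * 1024  # fixed 256KB blocks, as with the WAN target setting

def block_hashes(data: bytes) -> set[bytes]:
    """Hash fixed-size blocks; dedupe only works where whole blocks match."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

base = os.urandom(4 * BLOCK)            # ~1MB of sample data
shifted = b"\x00" * 512 + base          # the same data, shifted by 512 bytes

shared = block_hashes(base) & block_hashes(shifted)
print(f"blocks shared after a 512-byte shift: {len(shared)}")  # 0
```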

The underlying reason is that, with Veeam 9.5 in mind, we are evaluating whether it's feasible to move to ReFS. However, as Windows dedupe doesn't work on ReFS, and Veeam's dedupe is not really effective, in terms of disk space it's a no-brainer and we'll have to stay on NTFS.
alanbolte
Veteran
Posts: 635
Liked: 174 times
Joined: Jun 18, 2012 8:58 pm
Full Name: Alan Bolte
Contact:

Re: Inline dedupe on active fulls

Post by alanbolte »

While it is expected that larger blocks will result in less deduplication, to properly compare space savings between Veeam B&R and Windows deduplication (or deduplication appliances), you should also enable compression in Veeam (there is both a job setting and a repository setting). Both Windows deduplication and every appliance I've worked with compress the data blocks after deduplicating between them.
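A minimal sketch of the pipeline Alan describes (illustrative only, not Veeam's actual code): duplicates are detected on raw block hashes first, and only the unique blocks are then compressed, so compression stacks on top of dedupe instead of defeating it.

```python
import hashlib
import zlib

def dedupe_then_compress(blocks: list[bytes]) -> dict[bytes, bytes]:
    """Store each unique block once, compressed after the dedupe decision."""
    store: dict[bytes, bytes] = {}
    for block in blocks:
        digest = hashlib.sha256(block).digest()   # identity from raw content
        if digest not in store:                   # dedupe decision first...
            store[digest] = zlib.compress(block)  # ...compression second
    return store

blocks = [b"A" * 65536, b"B" * 65536, b"A" * 65536]  # third block is a duplicate
print(len(dedupe_then_compress(blocks)), "unique blocks stored")  # 2
```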
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hmm, OK, valid point. I never enabled compression, as that kills dedupe. However, when using ReFS block cloning, the 'dedupe' happens at another level where compression doesn't matter, I guess, at least if compression is done per block. Is that the case?
Also, do you have any insights on encrypted backups versus ReFS block cloning? Encryption more or less implies every single block is unique, so it kills dedupe, but does it kill ReFS cloning as well?
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

You can have both compression and encryption enabled with no impact on our ReFS integration. It's important to understand that the ReFS integration does not do deduplication; instead, it prevents duplication from happening in the first place.
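For reference, the primitive ReFS exposes for this is the block-clone ioctl, which makes a new file reference clusters the volume already stores instead of writing a second copy. A minimal ctypes sketch of the public Windows API (my own wrapper, not Veeam's code; it assumes 64-bit Windows, both files on the same ReFS volume, cluster-aligned offsets, and a target file already extended to the required size):

```python
import ctypes
from ctypes import wintypes

FSCTL_DUPLICATE_EXTENTS_TO_FILE = 0x98344  # ReFS block clone (Server 2016+)
GENERIC_READ, GENERIC_WRITE = 0x80000000, 0x40000000
OPEN_EXISTING = 3

class DUPLICATE_EXTENTS_DATA(ctypes.Structure):
    _fields_ = [("FileHandle", wintypes.HANDLE),
                ("SourceFileOffset", ctypes.c_longlong),
                ("TargetFileOffset", ctypes.c_longlong),
                ("ByteCount", ctypes.c_longlong)]

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = [wintypes.LPCWSTR, wintypes.DWORD,
                                 wintypes.DWORD, wintypes.LPVOID,
                                 wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE]
kernel32.DeviceIoControl.argtypes = [wintypes.HANDLE, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     ctypes.POINTER(wintypes.DWORD),
                                     wintypes.LPVOID]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

def clone_block(src_path: str, dst_path: str, offset: int, length: int) -> None:
    """Point dst's clusters at src's for [offset, offset + length)."""
    src = kernel32.CreateFileW(src_path, GENERIC_READ, 0, None,
                               OPEN_EXISTING, 0, None)
    dst = kernel32.CreateFileW(dst_path, GENERIC_READ | GENERIC_WRITE, 0, None,
                               OPEN_EXISTING, 0, None)
    data = DUPLICATE_EXTENTS_DATA(src, offset, offset, length)
    returned = wintypes.DWORD(0)
    ok = kernel32.DeviceIoControl(dst, FSCTL_DUPLICATE_EXTENTS_TO_FILE,
                                  ctypes.byref(data), ctypes.sizeof(data),
                                  None, 0, ctypes.byref(returned), None)
    kernel32.CloseHandle(src)
    kernel32.CloseHandle(dst)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
```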
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hi,
yes, I understand the difference between ReFS block cloning and dedupe perfectly well. Still, one could see the resemblance in the result: both dedupe and ReFS prevent identical blocks from being stored multiple times. With forward incrementals, ReFS will shine. I just don't understand yet how it interacts with things like compression and encryption. How does Veeam compress and encrypt? Is that also done per block? How else could the ReFS 'dedupe' work, as the blocks would never be identical otherwise?
If it works per block, that would (or could) also mean encryption does not necessarily kill regular dedupe. My assumption (shared by several forum posts) that encryption DOES kill dedupe made me choose not to encrypt my backups, which is something I'm still not too happy about. Actually, even the Veeam GUI states that compression kills regular dedupe, so I don't understand how it wouldn't kill ReFS block cloning as well?

Please shed some light on this :)
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Veeam tracks identical blocks based on their raw content, so we know whether two blocks are identical regardless of their encryption or compression status. The main difference here is that we don't need to analyze the actual content of a block to figure out whether it is identical to some other block we have observed before, which is something deduplicating storage must do in order to function.
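A minimal sketch of what that ordering implies (the names and the toy cipher are mine, purely illustrative): the dedupe identity is the hash of the raw block, taken before compression and encryption, so the stored payload can be transformed freely without affecting duplicate detection.

```python
import hashlib
import os
import zlib

def store_block(index: dict[bytes, int], log: list[bytes], raw: bytes) -> int:
    """Dedupe on the raw-content hash; transform only what gets stored."""
    digest = hashlib.sha256(raw).digest()       # identity computed on raw bytes
    if digest in index:                         # duplicate: nothing new stored
        return index[digest]
    payload = zlib.compress(raw)                # compression AFTER the decision
    payload = bytes(b ^ 0x5A for b in payload)  # toy stand-in for encryption
    log.append(payload)
    index[digest] = len(log) - 1
    return index[digest]

index, log = {}, []
block = os.urandom(4096)
first, second = store_block(index, log, block), store_block(index, log, block)
print(first == second, len(log))  # True 1: duplicate found despite transforms
```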
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

I know it's an older thread, but I still have some questions. I wonder when, or if, Veeam will ever use something like fast clone / identical block tracking / ReFS 'dedupe' on active fulls. On our offsite backup we use ReFS, as that job runs incremental-forever, and ReFS works great for us there. However, on our main site I do weekly active fulls. While I have no hard facts, my gut tells me I don't want to run incremental-forever on my production site. That makes ReFS a very raw-storage-intensive file system there, just like using NTFS without dedupe. Since ReFS can use pointers when merging backup files, the file system should be able to do that with any files; it's a ReFS feature now.

So, will there ever be some implementation for full backups as well? That would be a real killer feature for us.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Can you share the exact reasons why active fulls satisfy your gut feeling? In other words, which specific properties of an active full do you require for your periodic fulls on the production site? Depending on what they are, I may have an idea of how to merge the best of both worlds.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Well, it's more or less 'just to be safe'. We used reversed incrementals before, in 7.x or 8.x, but at the time the merging simply took too long; ReFS would fix that. But we don't have THAT much data (a few TB), so doing active fulls fits our backup window perfectly fine. Although Veeam has all kinds of verifications and health checks, one reason I don't fully rely on them is that since Server / Hyper-V 2016 we have to rely on Microsoft's change tracking rather than Veeam's own. With Server 2016, apart from two or three months, every cumulative update has broken either applications or OS-related things for us, sometimes even both (like this month's KB4013429). There have been issues with Resilient Change Tracking before, which ought to be fixed now, but I'm still not fearless. For example, I still can't back up our SQL guest cluster with Veeam because of bugs in the VHDS snapshot system. So yes, the main reason is that I don't trust MS on this.

As I have the time in my backup window, I just want the extra safety of reading everything again. Windows dedupe in 2016 works reasonably well (it should be improved by KB4013429; ah, except that with that update installed on our hosts we can't back up anything at all anymore, issues again), but of course it would be better if the duplication did not occur in the first place.

By the way, I am fully aware of the trade-off in terms of restore times, which would be noticeably longer (due to more random IO), but honestly we don't have to do that many restores anyway. We are only a small company, and the cost savings on storage won that 'battle'. Our backup SAN only accepts SAS drives, so large arrays are relatively expensive for us.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

OK. In that case, would it work for you if, instead of doing an active full, we did only half of it, so to speak: reading the entire latest VM state from production storage as an active full normally does, but not transferring and storing it in its entirety on the target storage? The latter is not really needed, as we can tell right at the backup proxy whether a block we just read has the same content as the block we would put into a synthetic full built from data already in the backup repository. This approach guarantees catching any VMware CBT or Hyper-V RCT bugs, as effectively we're still doing an active full (reading the entire VM image); we're just not transferring it over the network and not storing its blocks again in the backup repository.
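A sketch of how that proxy-side decision could look (my illustration of the idea, not an actual Veeam design): every block is still read from production storage, but only blocks whose raw hash the repository doesn't know are shipped; known blocks become clone instructions.

```python
import hashlib
from typing import Iterable

def semi_active_full(read_blocks: Iterable[bytes],
                     repo_hashes: set[bytes]) -> tuple[int, int]:
    """Read everything (catching CBT/RCT bugs), transfer only unknown blocks."""
    shipped = cloned = 0
    for block in read_blocks:                  # full read of the VM image
        digest = hashlib.sha256(block).digest()
        if digest in repo_hashes:              # repository already has it:
            cloned += 1                        # send a clone instruction only
        else:
            repo_hashes.add(digest)            # genuinely new data: ship it
            shipped += 1
    return shipped, cloned
```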
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Gostev,
yes, I think that comes down to the same thing. I suppose a synthetic full, combined with (in our case) a weekly 'full source scan', would end up producing the same full backup as an old-school active full. So that option would work well for me. Maybe I'm paranoid, but I just don't trust MS anymore after the last few years; too many severe bugs have popped up in our environment.

Yet I wonder: you already have ReFS block cloning in place. What's the difficulty or difference in calling the ReFS block-cloning API during 'regular' active fulls? The bottom line comes down to the same thing, but you skip the synthetic full creation step, since the pointer creation would already be offloaded to ReFS during the full read. Just trying to understand :)
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Yes, that's basically where I was leading with my semi-active full backup idea. We do need to read and hash the content of each source block, then figure out which existing full or incremental backup file has the same block. Only then can we clone that block into the new semi-active full backup file in the backup repository.
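Putting that description into code form (a sketch under my own naming, assuming a block-clone primitive like the ctypes wrapper shown earlier): the repository keeps a map from raw block hash to the backup file and offset that already holds that block, and the new full is assembled from clone calls plus writes for genuinely new blocks.

```python
import hashlib
from typing import Callable, Iterable

# hypothetical index: raw block hash -> (existing backup file, offset)
BlockRef = tuple[str, int]

def build_semi_active_full(read_blocks: Iterable[bytes],
                           index: dict[bytes, BlockRef],
                           clone: Callable[[str, int, int, int], None],
                           write: Callable[[int, bytes], None],
                           block_size: int) -> None:
    """Assemble the new full: clone known blocks, write genuinely new ones."""
    for i, block in enumerate(read_blocks):
        target_offset = i * block_size
        ref = index.get(hashlib.sha256(block).digest())
        if ref is not None:
            src_file, src_offset = ref
            clone(src_file, src_offset, target_offset, block_size)  # fast clone
        else:
            write(target_offset, block)   # new content: store it once
```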
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Gostev, maybe my question wasn't clear. I'm still trying to understand what the difference would be between a normal active full on ReFS with fast clone, as opposed to a synthetic full with a full disk scan. Either way all data would be read, and either way you would need to hash and compare. But with a synthetic full you would still need to merge the VIBs into a VBK. That's fast on ReFS, but you end up writing just as many pointers to existing blocks as with an active full. With an active full you would skip the synthetic step, 'offloading' the pointer creation to ReFS, while with a synthetic full the whole chain since the last full has to be processed again by Veeam.

Either way, I'd really appreciate this feature at some point. The MS bugs in 2016 keep piling up for us.