Host-based backup of Microsoft Hyper-V VMs.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Inline dedupe on active fulls

Post by RGijsen »

We are running 9.0.0.1715 for Hyper-V. I thought I'd ask here first before logging a ticket, as maybe I'm overlooking something very obvious. I have a separate job for our Exchange 2016 environment. It has always been just one Edge server and one mailbox server, with about 950GB of disks and about 700-800GB of actual data, including the OS itself. The Exchange stores are about 500-600GB. About two months ago I installed another Exchange mailbox server and formed a DAG for redundancy, and added it to the job. In the job I have inline dedupe enabled and the storage optimization set to WAN target, as I have to be restrictive on space and want the best dedupe results possible.

Now, here is something that looks strange to me. We run one full per week and incrementals the rest of the time. As both mailbox servers hold roughly the same data, the dedupe results of the incremental runs are, as expected, quite good:
[screenshot: job statistics for the incremental runs, showing a healthy dedupe ratio]

But when the full runs, it doesn't seem to dedupe at all:
[screenshot: job statistics for the full run, showing practically no dedupe]

This is also reflected in the .vbk file: the full is over 1.1TB, while there is about 600GB of 'identical' data in it. I don't see why the full run gets practically no dedupe at all, since even the incrementals show that quite a bit of each day's changes can be deduped quite well. If I get six days of decent dedupe, I would expect the full backup to achieve a ratio of noticeably more than 1.0x.
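To put a number on that expectation, here is a back-of-the-envelope calculation (a rough sketch; the 600GB overlap is my own estimate of the duplicated DAG copy, not a figure Veeam reports):

```python
# Rough upper bound on the dedupe ratio, assuming ~600GB of the 1.1TB full
# is a second, block-identical copy of the mailbox stores.
total_gb = 1100                      # size of the full backup as written
duplicate_gb = 600                   # estimated identical second DAG copy
ideal_stored_gb = total_gb - duplicate_gb

print(f"ideal ratio: {total_gb / ideal_stored_gb:.1f}x")  # ~2.2x, not ~1.0x
```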

Don't full backups dedupe at all, from Veeam's perspective?
alanbolte
Veteran
Posts: 635
Liked: 174 times
Joined: Jun 18, 2012 8:58 pm
Full Name: Alan Bolte
Contact:

Re: Inline dedupe on active fulls

Post by alanbolte »

This view doesn't fully reflect the deduplication results, particularly when identical data sits on different disks. Look under Backups > Disk, find your job name, then right-click > Properties.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hmm ok. That gives me the following:
[screenshot: backup properties dialog from Backups > Disk]

I've just checked: the total data size is 1301GB. As said, the mailstore is about 600GB, so those roughly 1200GB of store data should dedupe down to about 600GB, as the two copies are nearly identical. Windows dedupe shrinks these .vbk files tremendously (but that uses a dynamic block size; in this case 64K, I guess, as that's my cluster size on the mailstore disks, so no smaller 'blocks' there anyway). This makes me believe that even on the WAN target setting, with 256KB blocks, Veeam's dedupe is rather ineffective. But I still don't really get it. I know an Exchange DAG initially does an actual file copy of the store; after that it ships and replays the logs, and I can imagine the files don't stay identical that way. But by far the biggest part of these 600GB is unchanged, so those mailstore files on disk should be pretty much the same. I wonder why Veeam's dedupe works so badly on them?
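To illustrate one likely factor: with fixed-size blocks, a small insertion shifts every subsequent block boundary, so the block hashes no longer match even though nearly all of the content is identical. A minimal sketch with made-up data (not Veeam's actual implementation; variable-block engines such as Windows dedupe resynchronize their chunk boundaries and so survive this):

```python
import hashlib
import os

BLOCK = 256 * 1024  # fixed 256KB blocks, as with the WAN target setting

def block_hashes(data: bytes) -> set[bytes]:
    """Hash fixed-size blocks; dedupe only works where whole blocks match."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

base = os.urandom(4 * BLOCK)            # ~1MB of sample data
shifted = b"\x00" * 512 + base          # the same data, shifted by 512 bytes

shared = block_hashes(base) & block_hashes(shifted)
print(f"blocks shared after a 512-byte shift: {len(shared)}")  # 0
```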

The underlying reason is that, with Veeam 9.5 in mind, we are evaluating whether it's feasible to move to ReFS. However, as Windows dedupe doesn't work on ReFS, and Veeam's dedupe is not really effective, in terms of disk space it's a no-brainer and we'll have to stay on NTFS.
alanbolte
Veteran
Posts: 635
Liked: 174 times
Joined: Jun 18, 2012 8:58 pm
Full Name: Alan Bolte
Contact:

Re: Inline dedupe on active fulls

Post by alanbolte »

While it is expected that larger blocks will result in less deduplication, to properly compare space savings between Veeam B&R and Windows deduplication (or deduplication appliances), you should also enable compression in Veeam (there is both a job setting and a repository setting). Both Windows deduplication and every appliance I've worked with compress the data blocks after deduplicating between them.
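A minimal sketch of the pipeline Alan describes (illustrative only, not Veeam's actual code): duplicates are detected on raw block hashes first, and only the unique blocks are then compressed, so compression stacks on top of dedupe instead of defeating it.

```python
import hashlib
import zlib

def dedupe_then_compress(blocks: list[bytes]) -> dict[bytes, bytes]:
    """Store each unique block once, compressed after the dedupe decision."""
    store: dict[bytes, bytes] = {}
    for block in blocks:
        digest = hashlib.sha256(block).digest()   # identity from raw content
        if digest not in store:                   # dedupe decision first...
            store[digest] = zlib.compress(block)  # ...compression second
    return store

blocks = [b"A" * 65536, b"B" * 65536, b"A" * 65536]  # third block is a duplicate
print(len(dedupe_then_compress(blocks)), "unique blocks stored")  # 2
```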
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hmm, OK, valid point. I never enabled compression, as that kills dedupe. However, when using ReFS block cloning, the 'dedupe' happens at another level where compression doesn't matter, I guess, at least if compression is done per block. Is that the case?
Also, do you have any insights on encrypted backups versus ReFS block cloning? Encryption more or less implies every single block is unique, so it kills dedupe, but does it kill ReFS cloning as well?
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

You can have both compression and encryption enabled with no impact on our ReFS integration. It's important to understand that the ReFS integration does not do deduplication; instead, it prevents duplication from happening in the first place.
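For reference, the primitive ReFS exposes for this is the block-clone ioctl, which makes a new file reference clusters the volume already stores instead of writing a second copy. A minimal ctypes sketch of the public Windows API (my own wrapper, not Veeam's code; it assumes 64-bit Windows, both files on the same ReFS volume, cluster-aligned offsets, and a target file already extended to the required size):

```python
import ctypes
from ctypes import wintypes

FSCTL_DUPLICATE_EXTENTS_TO_FILE = 0x98344  # ReFS block clone (Server 2016+)
GENERIC_READ, GENERIC_WRITE = 0x80000000, 0x40000000
OPEN_EXISTING = 3

class DUPLICATE_EXTENTS_DATA(ctypes.Structure):
    _fields_ = [("FileHandle", wintypes.HANDLE),
                ("SourceFileOffset", ctypes.c_longlong),
                ("TargetFileOffset", ctypes.c_longlong),
                ("ByteCount", ctypes.c_longlong)]

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = [wintypes.LPCWSTR, wintypes.DWORD,
                                 wintypes.DWORD, wintypes.LPVOID,
                                 wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE]
kernel32.DeviceIoControl.argtypes = [wintypes.HANDLE, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     ctypes.POINTER(wintypes.DWORD),
                                     wintypes.LPVOID]
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

def clone_block(src_path: str, dst_path: str, offset: int, length: int) -> None:
    """Point dst's clusters at src's for [offset, offset + length)."""
    src = kernel32.CreateFileW(src_path, GENERIC_READ, 0, None,
                               OPEN_EXISTING, 0, None)
    dst = kernel32.CreateFileW(dst_path, GENERIC_READ | GENERIC_WRITE, 0, None,
                               OPEN_EXISTING, 0, None)
    data = DUPLICATE_EXTENTS_DATA(src, offset, offset, length)
    returned = wintypes.DWORD(0)
    ok = kernel32.DeviceIoControl(dst, FSCTL_DUPLICATE_EXTENTS_TO_FILE,
                                  ctypes.byref(data), ctypes.sizeof(data),
                                  None, 0, ctypes.byref(returned), None)
    kernel32.CloseHandle(src)
    kernel32.CloseHandle(dst)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
```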
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Hi,
yes, I understand the difference between ReFS block cloning and dedupe perfectly well. Still, one could see the resemblance in the result: both dedupe and ReFS prevent identical blocks from being stored multiple times. With forward incrementals, ReFS will shine. I just don't understand yet how it interacts with things like compression and encryption. How does Veeam compress and encrypt? Is that also done per block? How else could the ReFS 'dedupe' work, as the blocks would never be identical otherwise?
If it works per block, that would (or could) also mean encryption does not necessarily kill regular dedupe. My assumption (shared by several forum posts) that encryption DOES kill dedupe made me choose not to encrypt my backups, which is something I'm still not too happy about. Actually, even the Veeam GUI states that compression kills regular dedupe, so I don't understand how it wouldn't kill ReFS block cloning as well?

Please shed some light on this :)
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Veeam tracks identical blocks based on their raw content, so we know whether two blocks are identical regardless of their encryption or compression status. The main difference here is that we don't need to analyze the actual content of a block to figure out whether it is identical to some other block we have observed before, which is something deduplicating storage must do in order to function.
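A minimal sketch of what that ordering implies (the names and the toy cipher are mine, purely illustrative): the dedupe identity is the hash of the raw block, taken before compression and encryption, so the stored payload can be transformed freely without affecting duplicate detection.

```python
import hashlib
import os
import zlib

def store_block(index: dict[bytes, int], log: list[bytes], raw: bytes) -> int:
    """Dedupe on the raw-content hash; transform only what gets stored."""
    digest = hashlib.sha256(raw).digest()       # identity computed on raw bytes
    if digest in index:                         # duplicate: nothing new stored
        return index[digest]
    payload = zlib.compress(raw)                # compression AFTER the decision
    payload = bytes(b ^ 0x5A for b in payload)  # toy stand-in for encryption
    log.append(payload)
    index[digest] = len(log) - 1
    return index[digest]

index, log = {}, []
block = os.urandom(4096)
first, second = store_block(index, log, block), store_block(index, log, block)
print(first == second, len(log))  # True 1: duplicate found despite transforms
```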
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

I know it's an older thread, but I still have some questions. I wonder when, or if, Veeam will ever use something like fast clone / identical block tracking / ReFS 'dedupe' on active fulls. On our offsite backup we use ReFS, as that job runs incremental-forever, and ReFS works great for us there. However, on our main site I do weekly active fulls. While I have no hard facts, my gut tells me I don't want to run incremental-forever on my production site. That makes ReFS a very raw-storage-intensive file system there, just like using NTFS without dedupe. Since ReFS can use pointers when merging backup files, the file system should be able to do that with any files; it's a ReFS feature now.

So, will there ever be some implementation for full backups as well? That would be a real killer feature for us.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Can you share the exact reasons why active fulls satisfy your gut feeling? In other words, which specific properties of an active full do you require for your periodic fulls on the production site? Depending on what they are, I may have an idea of how to merge the best of both worlds.
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Well, it's more or less 'just to be safe'. We used reversed incrementals before, in 7.x or 8.x, but at the time the merging simply took too long; ReFS would fix that. But we don't have THAT much data (a few TB), so doing active fulls fits our backup window perfectly fine. Although Veeam has all kinds of verifications and health checks, one reason I don't fully rely on them is that since Server / Hyper-V 2016 we have to rely on Microsoft's change tracking rather than Veeam's own. With Server 2016, apart from two or three months, every cumulative update has broken either applications or OS-related things for us, sometimes even both (like this month's KB4013429). There have been issues with Resilient Change Tracking before, which ought to be fixed now, but I'm still not fearless. For example, I still can't back up our SQL guest cluster with Veeam because of bugs in the VHDS snapshot system. So yes, the main reason is that I don't trust MS on this.

As I have the time in my backup window, I just want the extra safety of reading everything again. Windows dedupe in 2016 works reasonably well (it should be improved by KB4013429; ah, except that with that update installed on our hosts we can't back up anything at all anymore, issues again), but of course it would be better if the duplication did not occur in the first place.

By the way, I am fully aware of the trade-off in terms of restore times, which would be noticeably longer (due to more random IO), but honestly we don't have to do that many restores anyway. We are only a small company, and the cost savings on storage won that 'battle'. Our backup SAN only accepts SAS drives, so large arrays are relatively expensive for us.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

OK. In that case, would it work for you if, instead of doing an active full, we did only half of it, so to speak: reading the entire latest VM state from production storage as an active full normally does, but not transferring and storing it in its entirety on the target storage? The latter is not really needed, as we can tell right at the backup proxy whether a block we just read has the same content as the block we would put into a synthetic full built from data already in the backup repository. This approach guarantees catching any VMware CBT or Hyper-V RCT bugs, as effectively we're still doing an active full (reading the entire VM image); we're just not transferring it over the network and not storing its blocks again in the backup repository.
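A sketch of how that proxy-side decision could look (my illustration of the idea, not an actual Veeam design): every block is still read from production storage, but only blocks whose raw hash the repository doesn't know are shipped; known blocks become clone instructions.

```python
import hashlib
from typing import Iterable

def semi_active_full(read_blocks: Iterable[bytes],
                     repo_hashes: set[bytes]) -> tuple[int, int]:
    """Read everything (catching CBT/RCT bugs), transfer only unknown blocks."""
    shipped = cloned = 0
    for block in read_blocks:                  # full read of the VM image
        digest = hashlib.sha256(block).digest()
        if digest in repo_hashes:              # repository already has it:
            cloned += 1                        # send a clone instruction only
        else:
            repo_hashes.add(digest)            # genuinely new data: ship it
            shipped += 1
    return shipped, cloned
```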
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Gostev,
yes, I think that comes down to the same thing. I suppose a synthetic full, combined with (in our case) a weekly 'full source scan', would end up producing the same full backup as an old-school active full. So that option would work well for me. Maybe I'm paranoid, but I just don't trust MS anymore after the last few years; too many severe bugs have popped up in our environment.

Yet I wonder: you already have ReFS block cloning in place. What's the difficulty or difference in calling the ReFS block-cloning API during 'regular' active fulls? The bottom line comes down to the same thing, but you skip the synthetic full creation step, since the pointer creation would already be offloaded to ReFS during the full read. Just trying to understand :)
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Inline dedupe on active fulls

Post by Gostev »

Yes, that's basically where I was leading with my semi-active full backup idea. We do need to read and hash the content of each source block, then figure out which existing full or incremental backup file has the same block. Only then can we clone that block into the new semi-active full backup file in the backup repository.
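Putting that description into code form (a sketch under my own naming, assuming a block-clone primitive like the ctypes wrapper shown earlier): the repository keeps a map from raw block hash to the backup file and offset that already holds that block, and the new full is assembled from clone calls plus writes for genuinely new blocks.

```python
import hashlib
from typing import Callable, Iterable

# hypothetical index: raw block hash -> (existing backup file, offset)
BlockRef = tuple[str, int]

def build_semi_active_full(read_blocks: Iterable[bytes],
                           index: dict[bytes, BlockRef],
                           clone: Callable[[str, int, int, int], None],
                           write: Callable[[int, bytes], None],
                           block_size: int) -> None:
    """Assemble the new full: clone known blocks, write genuinely new ones."""
    for i, block in enumerate(read_blocks):
        target_offset = i * block_size
        ref = index.get(hashlib.sha256(block).digest())
        if ref is not None:
            src_file, src_offset = ref
            clone(src_file, src_offset, target_offset, block_size)  # fast clone
        else:
            write(target_offset, block)   # new content: store it once
```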
RGijsen
Expert
Posts: 124
Liked: 25 times
Joined: Oct 10, 2014 2:06 pm
Contact:

Re: Inline dedupe on active fulls

Post by RGijsen »

Gostev, maybe my question wasn't clear. I'm still trying to understand what the difference would be between a normal active full on ReFS with fast clone, as opposed to a synthetic full with a full disk scan. Either way all data would be read, and either way you would need to hash and compare. But with a synthetic full you would still need to merge the VIBs into a VBK. That's fast on ReFS, but you end up writing just as many pointers to existing blocks as with an active full. With an active full you would skip the synthetic step, 'offloading' the pointer creation to ReFS, while with a synthetic full the whole chain since the last full has to be processed again by Veeam.

Either way, I'd really appreciate this feature at some point. The MS bugs in 2016 keep piling up for us.