Dramatic Increase in Time to Index After Turning on NTFS Deduplication

Post by **mkaec** » May 06, 2022 6:04 pm

I had a disk with 8 TB of data and about 6.8 million files. I turned on NTFS deduplication. Disk usage dropped 46% to 4.6 TB. The dedup store only contains 17,400 files, nothing compared to the original 6.8 million. But indexing time in the job went from 1-2 hours to 20-40 hours.

Post by **jorgedlcruz** » May 07, 2022 2:36 am

Hello,
What is the Veeam role, or the connection to Veeam?

Is this NTFS volume a Backup Repository? Or is this NTFS volume part of a NAS job?

We need to be mindful of what deduplication is: essentially, pointers to similar blocks, used to save storage. It is meant for adding more data, not for working much with the existing data, to avoid these "issues". Every time you query data (and the indexing you are mentioning will perhaps query all of it), your system needs to rehydrate a whole bunch of data, and I am betting the RAM/CPU/disk I/O is quite high at that time.

Please clarify what this volume is and how it is linked to Veeam.
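To make the "pointers to blocks" idea concrete, here is a toy sketch. It is not how NTFS deduplication is implemented (the real feature uses its own chunking and a chunk store under System Volume Information); the class name, the 4-byte fixed chunk size, and the method names are all made up for illustration. It only shows why listing names can be cheap while reading contents means following every pointer.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, purely for illustration


class ToyDedupStore:
    """Toy chunk store: each unique chunk is kept once, files hold pointers."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of chunk hashes

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            pointers.append(digest)
        self.files[name] = pointers

    def list_names(self):
        # Metadata only: never touches chunk data.
        return sorted(self.files)

    def read(self, name):
        # "Rehydration": follow every pointer and reassemble the content.
        return b"".join(self.chunks[d] for d in self.files[name])


store = ToyDedupStore()
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBDDDD")

print(store.list_names())
print(len(store.chunks), "unique chunks stored for 6 chunk references")
```

In this toy model, `list_names()` never reads chunk data, while `read()` has to resolve every pointer, which is the cost being referred to as rehydration.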

Thanks

Post by **mkaec** » May 09, 2022 2:18 pm

The volume in question is neither a backup repository nor a NAS job. It's a Hyper-V VM disk being backed up by Veeam. I see how that was ambiguous in the original post.

I believe Veeam indexing is not actually looking at file contents, but rather just directories and file names. So, there shouldn't need to be much rehydration. I did try adding an exclusion of the dedup folder in System Volume Information to the indexing settings. That hasn't resulted in a noticeable improvement. CPU/Memory/Disk on the backup server, Hyper-V host, and VM aren't high during the indexing. I wouldn't even know the backup was running if I didn't check in the B&R console.
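As a rough illustration of what a names-only index looks like, the sketch below walks a directory tree and records paths and sizes from directory metadata without opening any file. It is a generic Python example, not Veeam's indexer; `index_names` is a hypothetical helper name, and the demo runs against a throwaway temporary directory.

```python
import os
import tempfile


def index_names(root):
    """Yield (path, size) for every file under root without opening any file."""
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        size = entry.stat(follow_symlinks=False).st_size
                        yield entry.path, size
        except OSError:
            # Unreadable or vanished directory: skip the rest of it and move on.
            continue


# Small demo on a temporary tree.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "docs"))
    with open(os.path.join(demo_root, "docs", "report.txt"), "wb") as fh:
        fh.write(b"hello")
    for path, size in index_names(demo_root):
        print(os.path.relpath(path, demo_root), size)
```

Because only directory entries and their metadata are consulted, a walk like this should not need file contents at all, which is the reasoning behind expecting little rehydration.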

I thought I read that Veeam reads the NTFS MFT to optimize indexing performance. Tools like MasterSeeker and WizTree can read the MFT from this volume in a few minutes. I looked at VeeamGuestIndexer.exe in Process Explorer on the guest and I see it enumerating through the files. So, maybe it's not using the MFT. What's interesting is that there aren't any volume shadow copies active during the indexing process. So, the final result probably doesn't match what's actually in the backup because 10-20 hours of changes are going to occur during the indexing. It would be a nice implementation if B&R retrieved the MFT during backup and used that to do the indexing. Performance would likely be much improved and the index would exactly match the state of the file system at the time of backup.

Post by **jorgedlcruz** » May 09, 2022 3:46 pm

Hello,
Thank you for coming back with some answers. It is a bit strange to take this long for a normal Hyper-V VM Backup, considering you have all CBT enabled, and volume snapshots, etc.

Another option to check, although I thought it would not affect the VM, is the File Indexing setting on the job.

Nevertheless, if I were you, I would open a Support ticket so we can clarify whether this is the expected behavior or whether something is wrong with some settings, etc.

Please let us know the case id.

Thank you

Post by **mkaec** » May 09, 2022 4:40 pm

The backup itself is not taking any longer than expected. It's the file indexing piece. The backup completes except for that, and then spends several more hours completing the indexing (and, as I noted, outside of any sort of VSS snapshot).

Post by **jorgedlcruz** » May 09, 2022 4:52 pm

Thank you, Marc,
Please kindly open a support ticket, and let us know the case number. It might perhaps end up as an addition to the Help Center. Thank you

Post by **mkaec** » May 17, 2022 11:13 pm

Case 05429864.

Post by **mkaec** » Aug 03, 2022 1:22 am

The support case did not answer the original question of why indexing time increased so much. But it did uncover something else. It seems there are two algorithms for indexing. One reads the MFT from the volume and parses it. The other algorithm recursively walks the file system. I've used those MFT tools and they can read the full listing of a volume in a fraction of the time it would take to do it via a recursive walk using the API. What was discovered is that VMs hosted on Hyper-V can never use that faster algorithm. It seems that since the Veeam agent doesn't own the VSS shadow copies in Hyper-V like it does in VMware, use of the faster algorithm is off-limits for some reason. The support tech is going to put in a feature request to fix this problem and allow indexing on Hyper-V to access the faster algorithm. Personally, I'd like something a little different. I'd like B&R to not need an agent at all to index NTFS volumes. It seems reasonable that B&R should be able to read the MFT directly out of the image and use that to create the index.

Post by **mkaec** » Aug 16, 2022 5:56 pm

I used contig to defragment the MFT. I thought maybe the dedup process just caused fragmentation there. But the index process did not return to prior levels and is still taking over 12 hours.

Post by **jorgedlcruz** » Aug 17, 2022 10:48 am

Hello Marc,
Thanks for coming back to us. I can see you are working with Tier 2, and you are trying different approaches. I can also see the feature request created by the Engineer working with you on this.

I think that for the moment we will need to wait for the feature request to be reviewed and picked up. It might land as a hotfix, or in one of the future cumulative updates, depending on the complexity, etc. We cannot say at the moment.

I would recommend continuing to work with your Engineer, as he will be notified as soon as this is picked up.

I understand now by reading on the ticket, that you are relying heavily on Veeam indexing as you are using Enterprise Manager file-search daily.

-- Possible idea --
I am starting to think that perhaps a NAS Backup Job for that specific volume or Windows server might help you, as the indexing works very nicely in Enterprise Manager. This means:

1. You keep doing backups of this server daily to your normal repo, without File Indexing, as you said that works fast and sweet.
2. You create a NAS Backup Job for the volumes or folders you require, with even more granularity in terms of time: you can do hourly backups with NAS Backup and keep as many days as you usually use to retrieve files, perhaps 7 days. Then see how it behaves in EM.

Before you turn off the indexing, give NAS Backup a try, of course. NAS Backup consumes VUL; in case you do not have enough VUL, or just want to give it a try, please engage your Systems Engineer to obtain trial licenses and to help you with the whole process.

Please let us know.

Post by **mkaec** » Aug 18, 2022 5:17 am

Thanks for the idea. I appreciate it. But I don't think NAS backup is an option. The current licensing is socket, not VUL, and I'm not a fan of per GB licensing. The fewer purchase requests I have to submit, the better.

It's not correct that I'm using Enterprise Manager file search daily. But that is what is preferred when a restore is needed from Veeam. It tends to take 60 - 90 seconds to mount a restore point. If a user says "I lost file X, please restore the latest version", I might have to look through 10 - 15 different mount points to find when the file was deleted. If I do it through the main console, that's 10 - 20 minutes of sitting around staring at the Veeam wait cursor, in 10 - 15 painful increments. A conventional backup application will allow me to browse a catalog quickly and find the file before mounting anything. That process is a lot faster, and it is why indexing was turned on: to be able to use Enterprise Manager.
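The catalog-first workflow can be sketched in a few lines. This is only an illustration of the idea, not Enterprise Manager's data model: the `catalogs` dictionary, the dates, the paths, and the function name `latest_restore_point_with` are all invented for the example.

```python
from datetime import date


def latest_restore_point_with(catalogs, path):
    """Return the newest restore-point date whose catalog lists path, or None."""
    for point in sorted(catalogs, reverse=True):
        if path in catalogs[point]:
            return point
    return None


# Hypothetical per-restore-point catalogs: date -> set of indexed file paths.
catalogs = {
    date(2022, 8, 15): {r"D:\share\report.xlsx", r"D:\share\notes.txt"},
    date(2022, 8, 16): {r"D:\share\report.xlsx", r"D:\share\notes.txt"},
    date(2022, 8, 17): {r"D:\share\notes.txt"},
    date(2022, 8, 18): {r"D:\share\notes.txt"},
}

found = latest_restore_point_with(catalogs, r"D:\share\report.xlsx")
print(found)
```

With an index like this, only the one restore point that actually contains the file has to be mounted, instead of mounting each candidate in turn at 60 - 90 seconds apiece.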

I do hope the feature request gets picked up, for the benefit of all Veeam customers that use Hyper-V.

Post by **Mgamerz** » Aug 22, 2022 10:06 pm

I have this same issue (indexing takes about 5 hours), but my Hyper-V server has tens of millions of files, about 30TB of mostly deduplicated (in-VM) data. I think that's pretty normal for me since doing a right click -> properties on the drive with the most files takes almost 4 hours to calculate the size and count of all files. Just pray I never have to update permissions again on the folder as it takes like 5 hours to make 1 change!

Also I use indexing because mounting a restore point to see if a file exists on it takes way too long (like 8+ mins for me). It'd take me all day to find a file if I had to do it without indexing. As for NAS option, I can't even imagine how expensive it would be to purchase enough storage for our servers, it'd be astronomical compared to our current cost.

Due to how long indexing takes, I have to set my backup interval to every 8 hours. Otherwise I'd need to split the VM or turn indexing off, since it's like 5-6 hours per indexing run. This is on a mix of SSD and HDD in multiple different RAID arrays.

Post by **mkaec** » Aug 23, 2022 2:44 am

Mgamerz,

As a fun experiment, try WizTree or MasterSeeker on your system. You'll be in disbelief after it indexes the volume (WizTree) or entire server (MasterSeeker) in < 15 minutes. (...unless the disks are ReFS. Then the tools won't work.)

Post by **mkaec** » Oct 06, 2022 5:07 am

I think the support case has come to a close. The technician the case ended up with was pretty cool. He was genuinely interested in trying to solve the issue, had a curiosity about the specifics of what was going on, and put in a great deal of thought to devise potential work-arounds. We learned a few things.

1. Veeam has two algorithms for indexing. One algorithm slowly traverses the file system and is like an algorithm a CS student might write for a class project. The second is a more optimized algorithm that pulls the file information from the MFT. The second algorithm is probably orders of magnitude faster, but I'm not able to measure it at this time.
2. Veeam used to be able to use both algorithms to index VMs hosted by Hyper-V. However, when Microsoft redesigned things for Hyper-V 2016, this broke the Veeam optimized implementation and it started falling back to the slower algorithm. This means anyone running Hyper-V on an OS still in mainstream support is getting very bad Veeam index performance. VMware users continue to get the best indexing performance.
3. Part of the problem is that indexing occurs inside the VM. If Veeam wanted to be really slick, it could pull the MFT out of the backup image and do the indexing on the backup server without worrying about what is going on with VSS inside the VM.
4. The technician provided some PowerShell code to test with (Get-ChildItem "C:\" -Recurse). This completes in about 90 minutes on the volume in question while Veeam indexing has been taking 12+ hours. It seems something in the Veeam indexing algorithm is calling FSCTL_GET_REPARSE_POINT, which may be unnecessarily rehydrating the data. This appears to be an unfortunate side effect of the way Microsoft chose to implement deduplication, using a file system feature designed years earlier for something very different. The Veeam index algorithm is checking the reparse point to avoid some problem that can occur with traditional reparse points. The algorithm could know to skip this check if it sees the IO_REPARSE_TAG_DEDUP flag set, but it is not doing that.
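The check described in item 4 can be sketched as a small decision function. This is a guess at the shape of the logic, not Veeam's code: the function name `should_query_reparse_data` and the policy are hypothetical. The numeric constants are the documented Windows values for the reparse-point file attribute and the symlink, mount point, and dedup reparse tags.

```python
import os

# Documented Windows constants (winnt.h).
FILE_ATTRIBUTE_REPARSE_POINT = 0x0400
IO_REPARSE_TAG_MOUNT_POINT = 0xA0000003
IO_REPARSE_TAG_SYMLINK = 0xA000000C
IO_REPARSE_TAG_DEDUP = 0x80000013


def should_query_reparse_data(file_attributes, reparse_tag):
    """Decide whether an indexer needs the expensive reparse-point query.

    Hypothetical policy: entries without the reparse attribute, and files
    whose tag marks them as dedup-managed, are treated as ordinary files.
    Everything else (symlinks, junctions, unknown tags) still gets checked.
    """
    if not file_attributes & FILE_ATTRIBUTE_REPARSE_POINT:
        return False
    if reparse_tag == IO_REPARSE_TAG_DEDUP:
        return False  # the single extra "if" being argued for
    return True


def path_needs_query(path):
    """Apply the policy to a real path.

    On Windows with Python 3.8+, os.lstat() exposes st_file_attributes and
    st_reparse_tag; elsewhere both default to 0 here.
    """
    st = os.lstat(path)
    return should_query_reparse_data(
        getattr(st, "st_file_attributes", 0),
        getattr(st, "st_reparse_tag", 0),
    )


print(should_query_reparse_data(0x0420, IO_REPARSE_TAG_DEDUP))
print(should_query_reparse_data(0x0410, IO_REPARSE_TAG_MOUNT_POINT))
```

The point of the sketch is that the tag is already available from directory metadata, so deciding to skip the per-file reparse query for dedup-managed files would not require any extra I/O.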

The conclusion the technician got back from the higher tier is that everything is operating as designed and thus fixing either item would be considered an enhancement request (which has been logged). I, of course, have a differing opinion. To me, the enhancement request would be doing something new like having the backup server do the indexing by getting the MFT out of the backup image; not fixing things that are broken.

I suppose one could argue that Microsoft breaking the optimized indexing algorithm is not a Veeam bug. But one could also argue that the enhancement to make Veeam compatible with Windows Server 2016 was not fully completed. The same can be said for the task to make Veeam compatible with NTFS deduplication. The software engineer(s) that handled that task left in performance-crippling code, where a likely 6x performance improvement could come from the addition of a single "if" statement.

Such seems to be the norm in the software industry these days. The higher tiers like to classify things as enhancement requests so there is maximum flexibility in dealing with them (or ignoring them). But I have some optimism that the Veeam product team might actually show an interest in getting indexing on Hyper-V fixed. Veeam seems to be a different kind of software company. (The existence of these forums is proof of that.) And there's nothing special about my environment. A year from now, the only way to get speedy Veeam index performance on Hyper-V will be to host the VMs on an EOL OS.

Post by **mkaec** » Apr 28, 2023 3:23 pm

FYI - There was no improvement with regard to this issue in V12. Not that it was expected. The issue went into the general feature request log.

While turning on NTFS deduplication showcased the problem by amplifying it, it is still an issue with regular volumes. I had a job run last night that doesn't involve deduplication; it read the changed blocks in 5 minutes 21 seconds, but then spent 33 minutes doing indexing.

Post by **natehun** » Jun 06, 2023 2:02 am

I experienced this as well: sometimes the indexing would take only 10-15 minutes, and sometimes it would go on for 3 hours or more. I'm on VMware and the guest OS is Windows Server 2012 R2, which is a file server with data deduplication enabled. I managed to work around this by changing the I/O priority (as seen in some posts here in the forums) for VeeamGuestIndexer.exe using the Process Hacker 2 application. So far so good since yesterday. Average indexing time is about 10 minutes.
