Any VMware CBT experts out there?

Post by **cfizz34** » Jul 15, 2020 9:47 pm

I am trying to find out why VMware is telling Veeam there are so many changed blocks. The environment is pretty small and not much changes, yet hundreds of GB must be read by Veeam and transferred to my repositories each day, even on weekends when we are closed. According to this article, no .ctk files should be left after a backup is performed, yet right after a full backup completed I still see many .ctk files listed in the VMFS datastore. I'm just trying to get a grasp on whether this is normal or whether something is not configured right. Thanks for any help.

https://kb.vmware.com/s/article/1020128

Post by **veremin** » Jul 15, 2020 9:59 pm

Just out of curiosity: is there any activity scheduled inside the VMs in question? Deduplication garbage collection, an antivirus scan updating the last access time on each file, etc.? Or are these VMs running highly transactional applications like databases? Thanks!

Post by **cfizz34** » Jul 15, 2020 11:41 pm

The ones with the biggest reported change are SQL servers, so I'm also curious whether any SQL gurus out there can shed light on something we are doing that might be causing this self-inflicted pain.

Post by **cfizz34** » Jul 15, 2020 11:47 pm

If anyone wants to throw out their thoughts on items to be on the lookout for, shoot them my way; I'd appreciate it.

Disk Defrag?
Database Index rebuild?
SQL backups?
Virus Scanning?

What tools can I use to show exactly what changed, or is that deep inside the VMware API where regular customers don't get to see it?

But what makes me go hmmmm is why there are always .ctk files, even after a backup job.

BTW, I am NOT quiescing the VM when taking a backup, if that matters. At what point does VMware know the VM has been backed up, consolidate the files, and remove the .ctk files?

Post by **cfizz34** » Jul 15, 2020 11:55 pm

Per VMware:
Note: After a successful backup and full snapshot consolidation, there should be no snapshot related .ctk files remaining in the datastore. For example: vmname-000001-ctk.vmdk.

Post by **Egor Yakovlev** » Jul 16, 2020 7:01 am

I would start by capturing daily Disk Monitor data (MS or any 3rd party) and analyzing the results. Countless times we have seen applications going nuts in the guest OS, from the obvious ones (AV, SQL and so on) to the rarest business-specific apps that decide to rebuild their cache every night, copy their database once an hour, re-sort their indexes every so often, and so on.
/Cheers!

Jul 16, 2020 7:54 am

All operating systems make a lot of block-level changes because of background optimizations and the usual services running and changing things.
Think as well about the swap area and the Windows pagefile, which change things on disk a lot.
So a change rate of ~1% for systems that do "nothing" is expected. Btw., if you backed up the same system at file level you would end up backing up much more data, because at image level you back up only the changed blocks.

Antivirus scans, disk defrag and index services (Windows and others) can create huge change rates on disk as well.
SQL Server stores all changes in the transaction log and is also constantly reading and writing things between memory and disk. Even without any real load on a SQL Server, this processing and index work can cause around a 10% change rate, depending on configuration. A database under load can easily have a change rate of 20%. Data warehouse applications can have a far bigger change rate.

You can use the Veeam ONE 30-day trial to let Veeam ONE identify the systems with the most IO changes. Then you can have a look yourself at what is running there and whether everything is OK.

Another important point is to look at the Veeam job statistics and see whether Veeam was able to use changed block tracking or not. If it was not usable, we always read 100% of the data, identify the changes and then write only those changes into the backup file.

Post by **cfizz34** » Jul 16, 2020 1:22 pm

Thanks for the input. I'm going to see what Veeam ONE has to say.

So tell me if what I am doing is a valid test or insanity.

I have a full backup of my file server. Then I ran an incremental. The next day, I ran this command to find the changed block tracking files.

[root@esxr3sl08:/vmfs/volumes/5d03bbcc-95f6fa46-43d8-a4badb1e0aaa/FILESHARE] ls -lh *ctk*
-rw------- 1 root root 6.3M Jul 16 01:04 FILESHARE_1-ctk.vmdk
-rw------- 1 root root 4.2M Jul 16 01:04 FILESHARE_2-ctk.vmdk
-rw------- 1 root root 3.8M Jul 16 01:04 FILESHARE_4-ctk.vmdk
-rw------- 1 root root 5.4M Jul 16 01:04 FILESHARE_6-ctk.vmdk
-rw------- 1 root root 6.5M Jul 16 01:04 FILESHARE_7-ctk.vmdk

Then I ran an incremental again. While it was running, I ran the command again.
[root@esxr3sl08:/vmfs/volumes/5d03bbcc-95f6fa46-43d7-a4badb1e0aaa/FILESHARE] ls -lh *ctk*
-rw------- 1 root root 6.3M Jul 16 13:01 FILESHARE_1-000001-ctk.vmdk
-rw------- 1 root root 6.3M Jul 16 13:00 FILESHARE_1-ctk.vmdk
-rw------- 1 root root 4.2M Jul 16 13:01 FILESHARE_2-000001-ctk.vmdk
-rw------- 1 root root 4.2M Jul 16 13:00 FILESHARE_2-ctk.vmdk
-rw------- 1 root root 3.8M Jul 16 13:01 FILESHARE_4-000001-ctk.vmdk
-rw------- 1 root root 3.8M Jul 16 13:00 FILESHARE_4-ctk.vmdk
-rw------- 1 root root 5.4M Jul 16 13:01 FILESHARE_6-000001-ctk.vmdk
-rw------- 1 root root 5.4M Jul 16 13:00 FILESHARE_6-ctk.vmdk
-rw------- 1 root root 6.5M Jul 16 13:01 FILESHARE_7-000001-ctk.vmdk
-rw------- 1 root root 6.5M Jul 16 13:00 FILESHARE_7-ctk.vmdk

Veeam reports that it processed 4.9TB, read 132.4GB and transferred 63.5GB. So does that mean it thinks there was 132GB worth of changed blocks?

Jul 16, 2020 1:31 pm

Actually much more.
CTK files reference a much bigger block matrix. For example, if you have a 32MB block matrix and one bit changes, VMware marks the full 32MB as changed.

Overall, we have mechanisms to reduce the data on the source side. We exclude pagefiles and blocks belonging to deleted data, and we also do source-side deduplication, which filters out a lot of the duplicates.

The ctk file size has nothing to do with the actual backup size.

And a 132 GB change rate on a 5TB server is expected (normal Windows processing on disk).
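To put the quoted job statistics in proportion, here is a minimal sketch of the arithmetic. It assumes binary units (1 TB = 1024 GB); the function names are made up for illustration and are not part of any Veeam or VMware tool.

```python
def change_rate_percent(read_gb: float, processed_tb: float) -> float:
    """Share of the processed disk size that was read as changed, in percent."""
    processed_gb = processed_tb * 1024  # assumption: binary units
    return 100.0 * read_gb / processed_gb


def reduction_ratio(read_gb: float, transferred_gb: float) -> float:
    """GB read from the source for every GB actually transferred."""
    return read_gb / transferred_gb


if __name__ == "__main__":
    # Figures taken from the job statistics quoted in the thread.
    rate = change_rate_percent(read_gb=132.4, processed_tb=4.9)
    ratio = reduction_ratio(read_gb=132.4, transferred_gb=63.5)
    print(f"read as changed: {rate:.1f}% of processed size")
    print(f"read vs. transferred: {ratio:.2f}x")
```

With the thread's numbers the read amount comes out at under 3% of the processed size, and roughly half of what was read ended up being transferred.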

Post by **cfizz34** » Jul 16, 2020 1:50 pm

Some background Info:
Guest OS: Microsoft Windows Server 2016 (64-bit)
Compatibility: ESXi 6.5 and later (VM version 13)
VMware Tools: Running, version:11270 (Current)
PURE Array
Thin Provision

The file server is 95% unstructured data. It's a small shop and an underutilized file server, but I will research whether I can find the culprit behind the changed blocks.

You are correct, the ctk files are just a list, and I should be looking at the 000001 files and comparing those between snapshots to get a real number for the amount of change.

So in just 45 minutes, I ran another backup and had 2.5GB of change.

Jul 16, 2020 1:59 pm

That is a change rate below 1MB/s, and far less if you consider the bigger changed-block matrix size used because of the huge server size.
Maybe Windows diagnostics will help to identify the changes, but I guess it is just the Windows index service, pagefile and some other background services.

I think the size of the CTK file is unrelated to the amount of data changed.
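The "below 1MB/s" figure can be checked with a short calculation. This is only a sketch: it assumes binary units and, for the daily extrapolation, a constant write rate, which a real server is unlikely to have.

```python
def average_change_mb_per_s(changed_gb: float, minutes: float) -> float:
    """Average change rate in MB/s over the interval (1 GB = 1024 MB assumed)."""
    return changed_gb * 1024 / (minutes * 60)


def extrapolate_gb_per_day(changed_gb: float, minutes: float) -> float:
    """Naive projection to 24 hours, assuming the rate stays constant."""
    return changed_gb * (24 * 60) / minutes


if __name__ == "__main__":
    # 2.5 GB of change observed 45 minutes after the previous backup.
    print(f"{average_change_mb_per_s(2.5, 45):.2f} MB/s")
    print(f"{extrapolate_gb_per_day(2.5, 45):.0f} GB/day if the rate held")
```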

DonZoomik · Jul 16, 2020 2:49 pm

Also keep in mind that Veeam's granularity is usually much larger than the file system's. If your app changes a 4k block on disk, a whole 1MB block (the default) will be read by Veeam. Correct me if I'm wrong.
For example, we have a database application that has a ~20% incremental every 6 hours and we consider it quite normal (a monitoring database with a lot of small semi-random writes).

With thin provisioning, even file deletes might be included in the incremental. If the guest OS UNMAPs the deleted data (even without actually deallocating it, due to alignment problems), the blocks get marked by CBT. During backup, Veeam will show a lot of reads with no actual disk access, little to no repository writes, and absurd compression ratios.

Even monitoring disk writes with, for example, ProcMon would probably show the culprit.
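The granularity effect can be illustrated with a toy model. This is not how CBT or Veeam is implemented; it only shows how small scattered writes inflate the amount flagged as changed when tracking happens in fixed-size blocks. The block size is a parameter, and the 1 MiB value below is just the default mentioned in this post.

```python
from typing import Iterable, Set, Tuple

MiB = 1024 * 1024


def dirty_blocks(writes: Iterable[Tuple[int, int]], block_size: int) -> Set[int]:
    """Indices of tracking blocks touched by (offset, length) writes in bytes."""
    blocks: Set[int] = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return blocks


def amplification(writes: Iterable[Tuple[int, int]], block_size: int) -> float:
    """Bytes flagged as changed divided by bytes actually written."""
    writes = list(writes)
    # Overlapping writes are counted once per write; good enough for a sketch.
    written = sum(length for _, length in writes if length > 0)
    flagged = len(dirty_blocks(writes, block_size)) * block_size
    return flagged / written if written else 0.0


if __name__ == "__main__":
    # Hypothetical workload: 100 scattered 4 KiB writes, 8 MiB apart.
    scattered = [(i * 8 * MiB, 4096) for i in range(100)]
    print(f"{amplification(scattered, MiB):.0f}x more data flagged than written")
```

In this made-up workload, 400 KiB of writes flags 100 MiB of blocks, which is the kind of gap between "real" change and reported change discussed above.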

Post by **Gostev** » Jul 16, 2020 3:22 pm

cfizz34 wrote: ↑Jul 15, 2020 11:41 pm The ones with the biggest reported change are SQL servers, so I'm also curious whether any SQL gurus out there can shed light on something we are doing that might be causing this self-inflicted pain.

I answered this for SQL Server just a few weeks ago in another similar topic.

Post by **cfizz34** » Jul 16, 2020 4:00 pm

I'm working with the SQL guys on the SQL servers, but in this example I think it should be easier to find the culprit... or so I hope.

So over two hours, Veeam states that VMware reported 13.5GB of data to read, and it transferred 13.5GB of data to the Data Domain (no compression or dedup... geez, that stinks!).

Looking on the vmware side, I see this when the snapshot is taken:
-rw------- 1 root root 6.3M Jul 16 15:49 FILESHARE_1-000001-ctk.vmdk
-rw------- 1 root root 1.6G Jul 16 15:52 FILESHARE_1-000001-sesparse.vmdk
-rw------- 1 root root 395 Jul 16 15:49 FILESHARE_1-000001.vmdk
-rw------- 1 root root 4.2M Jul 16 15:49 FILESHARE_2-000001-ctk.vmdk
-rw------- 1 root root 8.5G Jul 16 15:49 FILESHARE_2-000001-sesparse.vmdk
-rw------- 1 root root 396 Jul 16 15:49 FILESHARE_2-000001.vmdk
-rw------- 1 root root 3.8M Jul 16 15:49 FILESHARE_4-000001-ctk.vmdk
-rw------- 1 root root 277.4M Jul 16 15:52 FILESHARE_4-000001-sesparse.vmdk
-rw------- 1 root root 395 Jul 16 15:49 FILESHARE_4-000001.vmdk
-rw------- 1 root root 5.4M Jul 16 15:49 FILESHARE_6-000001-ctk.vmdk
-rw------- 1 root root 1.4G Jul 16 15:49 FILESHARE_6-000001-sesparse.vmdk
-rw------- 1 root root 395 Jul 16 15:49 FILESHARE_6-000001.vmdk
-rw------- 1 root root 6.5M Jul 16 15:49 FILESHARE_7-000001-ctk.vmdk
-rw------- 1 root root 26.4G Jul 16 15:52 FILESHARE_7-000001-sesparse.vmdk

So the ones that jump out at me are the 26.4G file, FILESHARE_7-000001-sesparse.vmdk, and the 8.5G file, FILESHARE_2-000001-sesparse.vmdk.

So does that really mean those two disks saw that much change since the last snapshot/backup taken two hours earlier?
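For bookkeeping, a listing like the one above can be totalled per file type with a small script. This is a sketch that assumes the `ls -lh` column layout shown in this post and file names without spaces. The size `ls` reports for a sparse snapshot file may not equal the amount of data that changed, so treat the totals as a rough indication only.

```python
import re
from collections import defaultdict
from typing import Dict

UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '6.3M' or '395' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def sizes_by_kind(listing: str) -> Dict[str, float]:
    """Sum file sizes in bytes, grouped by the kind of .vmdk file."""
    totals: Dict[str, float] = defaultdict(float)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 9:  # perms, links, owner, group, size, date x3, name
            continue
        size, name = fields[4], fields[-1]
        if name.endswith("-sesparse.vmdk"):
            kind = "sesparse"
        elif name.endswith("-ctk.vmdk"):
            kind = "ctk"
        else:
            kind = "descriptor"
        totals[kind] += parse_size(size)
    return dict(totals)


if __name__ == "__main__":
    sample = """\
-rw------- 1 root root 4.2M Jul 16 15:49 FILESHARE_2-000001-ctk.vmdk
-rw------- 1 root root 8.5G Jul 16 15:49 FILESHARE_2-000001-sesparse.vmdk
-rw------- 1 root root 396 Jul 16 15:49 FILESHARE_2-000001.vmdk
-rw------- 1 root root 26.4G Jul 16 15:52 FILESHARE_7-000001-sesparse.vmdk
"""
    for kind, size in sorted(sizes_by_kind(sample).items()):
        print(f"{kind:10s} {size / 1024**3:8.2f} GiB")
```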

Post by **cfizz34** » Jul 16, 2020 5:14 pm

This might be the culprit on this server in this instance: System Volume Information. It's a very large disk with VSS turned on and going back to 4/22, so there you go!
The report shows 50GB of change for the day for that folder.

Post by **cfizz34** » Jul 16, 2020 5:39 pm

Is there a way to skip/ignore the System Volume Information (SVI) folder? In addition, it seems compression and dedupe can't handle whatever is in that folder, and that really stinks.

DonZoomik · Jul 16, 2020 7:14 pm

Dedupe lives in the SVI folder: the whole chunk store is there.
Some dedupe CBT discussion here: post366241.html#p366241

Post by **cfizz34** » Jul 16, 2020 8:01 pm

The root problem is that I have about 15 servers (mostly SQL and one file server) where VMware is flagging changed blocks (hundreds of GB a night per server), and Veeam and Data Domain can't do anything with them (meaning they can't dedupe them, I suppose). The data gets sent over and over each night, and this backs up replication from the primary DD to my secondary Data Domain. I only have a 100Mbps MPLS pipe between the two sites to work with.
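A rough calculation shows why a 100Mbps link becomes the bottleneck. This is a sketch under stated assumptions: decimal units, a link that is fully available to replication, and a hypothetical 100 GB per server per night (the thread only says "hundreds of GB"). Real throughput will be lower because of protocol overhead and other traffic.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb over a link of link_mbps.

    efficiency is the usable fraction of the nominal link speed (assumption).
    """
    bits = data_gb * 8 * 1000**3  # decimal GB to bits
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    servers = 15          # from the post
    gb_per_server = 100   # hypothetical lower bound for "hundreds of GB"
    nightly_gb = servers * gb_per_server
    print(f"one server:  {transfer_hours(gb_per_server, 100):.1f} h")
    print(f"all servers: {transfer_hours(nightly_gb, 100):.1f} h")
```

Even with these optimistic assumptions, the nightly total would take longer than a day to replicate if none of it were reduced before crossing the link.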

soncscy · Jul 16, 2020 8:08 pm

Agree with DonZoomik, and I would further suggest that dedup inside virtual guests is a bad trade in virtually every situation I'm aware of.

It's tempting for sure; I get the idea that you want a small VM footprint, but this line of thought is classic physical-machine thinking. With image-level *anything*, it should be understood that the hypervisor and the ultimate storage are always going to handle the size far better than Windows dedup ever will. I don't want to fault Windows Dedup, it has its purpose, but you need to understand that if you're deduping the source data, every deduplication/compression operation at the higher abstraction layers is affected. The benefit of saving dozens to hundreds of GB at the guest level means you forgo space savings at the hypervisor/backup level.

And especially for hypervisors, which have no visibility into the guest OS files at all, this defeats CBT. (That's what the C and B in CBT are counting on.)

It's an annoying discussion to have with clients for sure, but I will fight this battle against in-guest dedup as long as people pay me to handle their IT infrastructure.

DonZoomik · Jul 16, 2020 8:38 pm

I disagree about guest dedupe being bad. Not everyone has (nor needs) a Pure SAN with good SAN level dedupe. And even less featured SANs tend to be quite expensive. I noticed that you have Pure - I really don't understand why you would use in-guest dedupe on it. After rehydrating data, with SAN having good dedupe and primary backup going to DD... the tradeoff would be more data moved between systems but that's probably not much of a problem.

On other SANs - NTFS dedupe has very good efficiency over large datasets (sometimes surprisingly so) and it has hugely reduced our storage costs with negligible performance impact for file server roles. Backups tend to be a bit bigger but that's an easy tradeoff for me as backup goes to relatively cheap storage servers.

Also, DDs/StoreOnces are not that magical on incremental backups. Sure, many fulls compress well, but perhaps your incrementals actually have unique data. Or your SQL servers have data compression enabled, or something else that messes with DD.

Jul 17, 2020 11:52 am

Just did some testing in my lab. I installed a Windows Server 2019 machine with SQL Express edition. I have a script that creates an empty table every minute and then deletes it. I backed up the VM and then backed it up again after 10 minutes. The job read 4GB, which was deduped/compressed to 125MB.
What you are experiencing is just the usual changes that SQL/Windows/the index service write to disk.

Post by **ChrisGundry** » Jul 20, 2020 8:04 am

Not sure if this is the case for you or not, but... If you have VSS snapshots enabled within a VM you can end up with roughly double the CBT showing. Say you change 10GB on D:, which has VSS configured. D: then takes a VSS snapshot and saves that 10GB change into its VSS snapshot. You then have the actual change, taking up 10GB, and the 10GB within VSS. This can result in a 20GB 'change' recorded within the CBT. For SQL servers I would generally turn off VSS within the VM, at least on the SQL data/log drives. I only really have it turned on for file servers these days.

Veeam will generally 'see' the VSS snapshot data within the VMs, but ignore it. So you might see 100GB of changes in total; Veeam will read 100GB, but then only transfer, say, 30GB. The other 70GB could have been saved by a combination of skipped blocks for deletions, page files etc., then dedup and finally compression of the remaining data. If you take a copy of the job logs and send it to Veeam support, they can break down a bit more what was excluded/skipped, but they can't tell you which blocks were changed/read and why.

Also note that things like SQL are a PITA for 'changes' because they are constantly changing things due to maintenance processes, scheduled jobs etc. I have seen a job before that dropped and re-created a table, which was around 150GB. Because it dropped and re-created it, the CBT tracked this and every night we had 200GB+ change for that VM. Veeam dedup and compression couldn't do much with that, which meant we ended up transferring most of it each night.
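The "roughly double" effect of in-guest VSS can be written down as a back-of-the-envelope estimate. This is a deliberately simplified model based only on the description in this post (each changed GB on a VSS-enabled volume is also copied into the shadow storage); the volume names and numbers are hypothetical.

```python
from typing import Dict, Tuple


def estimated_cbt_change_gb(volumes: Dict[str, Tuple[float, bool]]) -> float:
    """Estimate the CBT-reported change for a VM.

    volumes maps a volume name to (changed_gb, vss_enabled).
    Simplifying assumption: a VSS-enabled volume doubles its change.
    """
    total = 0.0
    for changed_gb, vss_enabled in volumes.values():
        total += changed_gb * (2 if vss_enabled else 1)
    return total


if __name__ == "__main__":
    vm = {"D:": (10.0, True), "E:": (5.0, False)}  # hypothetical volumes
    print(f"{estimated_cbt_change_gb(vm):.0f} GB flagged for 15 GB of real change")
```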

DonZoomik · Jul 20, 2020 11:34 am

True, VSS can be the problem. I'd suggest moving VSS to a separate disk and using at least 16k clusters if possible. VSS works at 16k granularity, and aligning the clusters with it greatly reduces file system fragmentation and the VSS delta. As VSS seems to be excluded by BitLooker, this disk with VSS data will be effectively excluded from backup.

Post by **mkaec** » Jul 20, 2020 4:41 pm

DonZoomik wrote: ↑Jul 16, 2020 8:38 pm I disagree about guest dedupe being bad.
...
On other SANs - NTFS dedupe has very good efficiency over large datasets (sometimes surprisingly so) and it has hugely reduced our storage costs with negligible performance impact for file server roles. Backups tend to be a bit bigger but that's an easy tradeoff for me as backup goes to relatively cheap storage servers.
...

I love NTFS dedup on my physical servers. It has given years more life to servers that were previously projected to run out of disk space. But it has wreaked havoc on guest VMs backed up by Veeam. No matter what I tried, garbage collection would regularly come to deliver massive incremental backups involving data that hasn't changed in years. I finally gave up and bought some extra disks to hold the uncompressed data.

It would have been killer if Microsoft implemented ReFS dedup using the block cloning API. That would have completely eliminated the need for garbage collection and all of the problems that come with it.

Post by **DonZoomik** » Jul 20, 2020 6:15 pm

Just disable full GC! https://support.microsoft.com/en-us/hel ... cause-perf
By default, every 4th GC is a full one that rewrites a whole chunk (1GB) even if only one byte in it is no longer referenced, and needless to say, that's bad from a CBT perspective.

For example, I run daily GC but never a full one. I'm not sure where the threshold for rewriting a chunk is, but my incrementals are reasonable. You could keep the default weekly schedule with little to no ill effects.

Post by **mkaec** » Jul 21, 2020 2:22 pm

Have you been running with full GC disabled for a long time? I tried that, but I worried about leaving it off permanently.

Post by **DonZoomik** » Jul 21, 2020 7:10 pm

Years... I'm probably losing *some* efficiency but I can afford to lose that last 5% (according to MS doc).

Post by **mkaec** » Jul 21, 2020 10:05 pm

Thanks. That's good to know. If you're losing 5% efficiency, you're still probably gaining 20% - 25% in saved storage space.

Post by **DonZoomik** » Jul 22, 2020 7:32 am

That's in the right ballpark, yes - if not higher on larger servers. I'm not aware of any way to check unreferenced chunk store size though so efficiency loss is just a guesstimate.

Post by **mkaec** » May 25, 2021 11:54 am

DonZoomik, Are you still rocking dedupe with full GC disabled? Is it still working out to your satisfaction?
