What is your "feeling" about CDP?

Post by **pfrancoeur** » Jan 25, 2022 4:49 pm

We have a client that recently asked us to consider CDP for their replication instead of the current VBR Replication. The idea was to reduce the RPO for some VMs.

We went and transitioned some VMs gradually. It worked well for some time, but it mostly went downhill in the last 3 weeks. Here is what happened:

- For a reason we don't know, one VM's snapshot/meta/tlog files grew to the point that they completely filled the datastore on the destination. The VM itself hasn't had that much modification that we could track. It went from a 4 TB VM to over 7 TB of disk space usage. We were keeping 4 hours of short-term retention and 3 days of snapshots taken every 8 hours. This caused all CDP replications to that datastore to fail. We reduced the short-term retention to 15 minutes and the snapshots to one every 24 hours (kept for 1 day only). We also removed another replica VM that took 1 TB on the destination datastore, hoping to give it back some space. The problematic VM bloated to 8 TB of storage space (again filling the datastore completely) in the blink of an eye. We opened a case with Veeam, and it seems that the only solution was to completely delete the replica VM and restart from scratch. And that was for all VMs on that datastore. Veeam case #05226297
- Since then, we have twice stopped the CDP policy to do some maintenance not related to Veeam, and when restarting CDP we got errors like "VM configuration for CDP completed with errors Error: NULL virtual disk UUID detected for the VM: disk ID Scsi.0.1, disk label Hard disk 2, disk name VMNAME_replica_1-interim.vmdk". If you look at the VM, you can see that one of its drives sits at 0 bytes (the original is 300 GB). And it just goes into a loop from there. I guess we will have to delete these VMs and start from scratch as well.

Considering that we're replicating only something like 6 VMs, I must say that I am not feeling all that confident in CDP right now, and we are not planning on proposing this solution again, for now.

Were we just out of luck? Or were we just too bleeding edge and we should have waited before putting this into production? How is your CDP deployment going?

Jan 25, 2022 5:14 pm

It's doing very well from a support statistics perspective. And yet we see it used heavily according to our support big data.

Are you not considering the possibility that something happened in your environment 3 weeks ago that broke CDP? Because there's no reason for a real-time system that "worked well" for a few weeks not to continue working for months to come... a few weeks is basically an eternity for such systems. When we had actual stability bugs during CDP development, replication would not survive overnight or over a weekend.

Post by **pfrancoeur** » Jan 26, 2022 3:59 pm

Well, something clearly happened to the first VM for it to get so big on the destination datastore. We couldn't track down what happened, as there was no noticeable change to the size of the VM itself. Still, given the way CDP works, with changes going into a transaction log, it is possible that multiple changes (like overwrites) happened on the VM that wouldn't have an effect on the VM size itself, or on the backup size either.

Going from 4 TB to 8 TB means that most of the VM would have been "re-written", and I would be quite surprised if this were the case. It might be something else, but we couldn't find it.

Still, filling up the entire datastore that fast and having no other option but to completely delete the VM and restart the synchronisation definitely isn't nice. Maybe there could be some option to "cleanup" the VM to recover space and then restart the sync from that instead of losing everything (more on that below)?

As for the other problems, they seem to be unrelated, and they all happened at different times over 3 weeks, so I don't think a single event or change would have caused this. And not much has changed, honestly.

As for the possibility of "cleaning up" a CDP replica: since the VM that had filled the space was dead anyway, I decided to try to clean it up and see how it goes. I removed the VM from inventory, deleted all meta and tlog files (over 4000 of them!), removed the delta, sparse and interim vmdk files, and then pointed the VMX file to the "original" vmdk. I registered the VM again and tried to start it... and it worked! I then re-added this VM to the CDP policy using the "repaired" VM as the replica seed. It did the initial sync and now seems to be working normally. This is probably not "support" approved, and we'll do a clean re-sync anyway, but there definitely seems to be something that can be done there.

Post by **veremin** » Jan 27, 2022 11:09 am

> Still, given the way CDP works, with changes going into a transaction log, it is possible that multiple changes (like overwrites) happened on the VM that wouldn't have an effect on the VM size itself, or on the backup size either.

CDP has a mechanism that tracks the different block states during the first part of the RPO interval and sends only the latest one to the target during the second part.

Say the RPO interval is 20 seconds and a block A changes three times during the first 10 seconds. Once the 11th second comes, CDP will start transferring only the third state of the block to the target.

This helps to prevent sending and storing unnecessary blocks.
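As a rough illustration of the behaviour described above, the coalescing can be sketched in a few lines of Python. This is only a toy model and not Veeam's actual implementation: the function name `coalesce_writes`, the `(time, block_id, data)` tuple layout, and the fixed half-interval cutoff are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Each write is (time_in_seconds, block_id, data).
# Hypothetical layout, chosen only for this illustration.
Write = Tuple[float, str, bytes]


def coalesce_writes(writes: List[Write], rpo_seconds: float) -> Dict[str, bytes]:
    """Keep only the latest state of each block written during the
    first half of the RPO interval (assumed cutoff: rpo_seconds / 2)."""
    cutoff = rpo_seconds / 2
    latest: Dict[str, bytes] = {}
    # Process writes in time order so that later states overwrite earlier ones.
    for t, block_id, data in sorted(writes, key=lambda w: w[0]):
        if t <= cutoff:
            latest[block_id] = data
    return latest


# Block A changes three times in the first 10 seconds of a 20-second interval;
# block B changes once.
writes = [
    (2.0, "A", b"state-1"),
    (5.0, "A", b"state-2"),
    (9.0, "A", b"state-3"),
    (4.0, "B", b"only-state"),
]
to_send = coalesce_writes(writes, rpo_seconds=20)
print(to_send)
```

In this model only two block states would be transferred (the third state of A and the single state of B), even though four writes happened on the source.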

> Going from 4 TB to 8 TB means that most of the VM would have been "re-written", and I would be quite surprised if this were the case. It might be something else, but we couldn't find it.

This does not look like expected behaviour, so if the issue is still present, kindly collect all required debug logs, including the vCenter ones, and contact our support team. From our side, we can make sure that the case is given due attention and closely followed by the R&D team.

> Maybe there could be some option to "cleanup" the VM to recover space and then restart the sync from that instead of losing everything (more on that below)?

Have you tried to decrease the short-term and long-term retention?

Thanks!

Post by **pfrancoeur** » Jan 27, 2022 9:48 pm

> Have you tried to decrease the short-term and long-term retention?

Yes, we went from 4 hours of short-term retention and a snapshot every 8 hours kept for 3 days to 15 minutes and 1 snapshot every 24 hours kept for 1 day only.

At first, it didn't do anything, possibly because the datastore had literally 0 bytes left. We deleted a 1 TB VM replica from that datastore and restarted CDP. Clearly there was some "backlog", snapshot merge or something else still going on with the VM, as it filled that terabyte quickly and we ended up at the same point: no space left on the datastore.

If you want to see the logs, they are attached to case #05226297.

Post by **veremin** » Jan 28, 2022 3:55 pm

The ticket was created for reclaiming space from the overfilled datastore, for which the only solution was starting the replica anew.

You seem to have gone this route, and the ticket owner is waiting for a final email from you before he can close the case.

What got me interested was the situation with the CDP replica unexpectedly swelling and filling the datastore, so if you see similar symptoms in the future, kindly create a separate ticket for it and post its number here. We will investigate this issue internally.

Thanks!

Post by **pfrancoeur** » Feb 04, 2022 9:18 pm

I believe I have something similar going on with another VM.

One of the VMs being replicated through CDP is replicating "normally" but recently started giving this warning:

2022-02-04 8:00:55 AM :: Target Daemon Replicator error: Failed to create long-term restore point; Failed to create snapshot of delta disk: {0000003 2022-02-01T01:15:00.000Z (--) ReadyToSnapshot/--}; Failed to read from virtual journal. Offset: 0x:20530cc00. Count: 2048; Failed to read file: "/vmfs/volumes/volume_id/VMNAME_repl/VMNAME_2-0000003.tlog". Offset: 8677325824 (8677325824). Size: 1048576: Input/output error

Note that the file does exist in the folder.

If I look at the replica size, it has now reached 1 TB of used space. The original VM is 600 GB. The size difference might be normal due to snapshots, but what caught my attention is the number of tlog files in the replica VM folder. There are 70 of them for a 3-drive VM. If I compare that with another VM that is working correctly, that VM has 12 tlog files, 3 per drive (that VM has 4 drives).

I get the feeling that this is a similar problem, where the CDP replication can't "merge" the tlog files into the snapshot and just keeps adding them, making the space used by the replica grow indefinitely.

Considering that last time (original post above) the VM had 4000 of those tlog files, this would explain why it went from a 4 TB VM to 8 TB, filling up the entire datastore.
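For anyone who wants to make the same comparison on their own replicas, a small Python sketch along these lines can count tlog files per disk from a folder listing. It assumes the file naming seen in the warning above (`<disk name>-<sequence>.tlog`, e.g. `VMNAME_2-0000003.tlog`); that pattern is inferred from a single log line and is not a documented format, and the helper names are made up for this example. It only reads file names and changes nothing.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable

# Assumed naming pattern, inferred from one log line: "<disk name>-<sequence>.tlog"
TLOG_RE = re.compile(r"^(?P<disk>.+)-(?P<seq>\d+)\.tlog$")


def count_tlogs(file_names: Iterable[str]) -> Counter:
    """Count .tlog files per disk name from an iterable of file names."""
    counts: Counter = Counter()
    for name in file_names:
        match = TLOG_RE.match(name)
        if match:
            counts[match.group("disk")] += 1
    return counts


def count_tlogs_in_folder(folder: str) -> Counter:
    """Same count, reading the names from a folder visible to this machine."""
    return count_tlogs(p.name for p in Path(folder).iterdir() if p.is_file())


# Example with a hand-written listing (file names are placeholders).
sample = [
    "VMNAME_1-0000001.tlog",
    "VMNAME_2-0000003.tlog",
    "VMNAME_2-0000004.tlog",
    "VMNAME_replica_1-interim.vmdk",  # not a tlog, ignored
]
per_disk = count_tlogs(sample)
for disk, count in sorted(per_disk.items()):
    print(f"{disk}: {count} tlog file(s)")
```

Going by the healthy VM mentioned above (about 3 tlog files per drive), a disk showing a much higher count might be worth mentioning in a support ticket, although the count alone does not prove anything about the cause.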

I have another ticket open to look at this: #05256988

Post by **veremin** » Feb 07, 2022 11:25 am

We'd better not jump to conclusions based on preliminary information. Let our support do their job, find the root cause, and escalate it to R&D if it's confirmed to be a bug, broken logic or similar.

You are also welcome to provide your findings within the existing ticket.

Thanks!
