Host-based backup of VMware vSphere VMs.
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV »

I have a VMware VM with several large disks attached to it; in total there are 6 disks ranging from 100GB to 60TB, and I have separate backup jobs, one for each disk. Because this is normally a high-I/O VM, I decided to test out storage snapshots. My issue is that after the VMware snapshot is taken, the backup hangs at the "Collecting disk files location data" stage for a long time. For the 100GB OS drive the task takes less than 1 minute. For the 10TB drive it seems to take 1-2 hours. For the 60TB drive I have never let it finish; I killed the Veeam job after 4 hours and spent the next 2 weeks consolidating.

I did Google the issue and found two main responses. Most of them say I must have a bunch of snapshots. I do not; I have checked the datastore files and the snapshot tab in VMware, and the only snapshot active during this phase of the backup is the one VMware creates for the current job. The second post I ran across was a word from Gostev from 2018 about disk fragmentation and too many disk regions to process. I have not gone through the process of a Storage vMotion yet, as I am waiting to hear what support (case 05280227) can find in the logs. My 60TB disk is on a VMFS volume of its own, so I wonder whether that could even be an issue at all. My uneducated thought process is that if the thin disk expands, it's still going to be contiguous because it's the only thing on the datastore.

My final test this weekend was to make sure all the snapshots were consolidated, power down the VM, and run the jobs so the snapshots couldn't get out of hand, but all the symptoms were the same.
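(In case anyone wants to verify the same thing from a script rather than the UI, this is the kind of check I mean by "no leftover snapshots"; a rough pyVmomi sketch, untested, with the vCenter host, credentials, and VM name as placeholders.)

```python
# Rough sketch: list any snapshots still registered on a VM via pyVmomi.
# vCenter host, credentials and VM name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use proper certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "BIGVM")
    view.Destroy()

    def walk(tree, depth=0):
        # Recursively print the snapshot tree, if one exists.
        for node in tree:
            print("  " * depth + f"{node.name}  (created {node.createTime})")
            walk(node.childSnapshotList, depth + 1)

    if vm.snapshot:
        walk(vm.snapshot.rootSnapshotList)
    else:
        print("No snapshots registered on this VM.")
finally:
    Disconnect(si)
```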

Does anyone have any experience with this issue, ideas, things to check, etc.?
Regnor
VeeaMVP
Posts: 940
Liked: 291 times
Joined: Jan 31, 2011 11:17 am
Full Name: Max
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by Regnor »

What storage do you have? And does it use deduplication, compression, and/or thin provisioning?
Depending on that, it's quite possible that your datastore and the disk are indeed fragmented, especially if the VM is I/O intensive.

Here's another thread on this topic/issue: vmware-vsphere-f24/ten-minute-backup-de ... 47934.html
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV »

It's an EMC Unity Hybrid XT, with mainly all-flash in use. No dedupe or compression is in use. Right now my entire production environment, including the VM I am trying to back up, is shut down except for an AD VM and vCenter. It has been sitting at this collecting disk info stage for 4+ hours.

As I mentioned in my OP, while it is thin provisioned, I am not sure how fragmentation would come into play if the disk is the only VMDK on the VMFS datastore being presented to VMware. On my smaller thin disks this is not an issue; it's only an issue on my 10, 30, and 60TB volumes that are 75% or more full. I may have to Storage vMotion, but I wanted to get more insight prior to moving 60TB.
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by soncscy »

Heya Frederick,

> My uneducated thought process is that if the thin disk expands, it's still going to be contiguous because it's the only thing on the datastore.

This isn't a safe assumption. As far as I know, you have two types of fragmentation to consider:

Internal fragmentation, where the blocks aren't fully used
External fragmentation, where the storage doesn't place the blocks as contiguously as it could

https://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf
https://www.vmware.com/content/dam/digi ... -WP-EN.pdf

Give these a read and check the fragmentation section. Even though it talks about multiple thin-provisioned VMs, I would expect the same behavior even on a dedicated LUN, as VMFS is opportunistic and wants to _write_ fast, not necessarily read nicely. I suspect that, as the second article says, it comes down to how the blocks were allocated as the disk grew, and that's what's causing the issue.

While ESXi can handle this well for normal VM operation, it's a bit different when using Direct SAN, because a single DiskRegionMapping call doesn't necessarily get all of the block data, especially with an internally fragmented disk: https://kb.vmware.com/s/article/2148199
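To give a feel for what that region mapping looks like at the API level, here's a rough pyVmomi sketch that walks a snapshot disk's allocated areas with QueryChangedDiskAreas. I'm not claiming this is the exact call the backup stack makes, it needs CBT enabled, and the device key is an assumption, but it shows why a heavily fragmented thin disk means far more regions to walk:

```python
# Rough sketch: count how many allocated regions a snapshot disk reports.
# Not necessarily the exact call Veeam/VDDK uses, but it illustrates why a
# heavily fragmented thin disk returns many more regions to process.
# Requires CBT enabled on the VM; "snap" is an existing vim.vm.Snapshot and
# 2000 is assumed to be the deviceKey of the first virtual disk.
from pyVmomi import vim  # pip install pyvmomi

def count_allocated_regions(vm, snap, device_key=2000):
    regions, offset = 0, 0
    capacity = next(d.capacityInBytes for d in vm.config.hardware.device
                    if isinstance(d, vim.vm.device.VirtualDisk)
                    and d.key == device_key)
    while offset < capacity:
        info = vm.QueryChangedDiskAreas(snapshot=snap, deviceKey=device_key,
                                        startOffset=offset, changeId="*")
        regions += len(info.changedArea)
        if info.length == 0:
            break  # nothing more reported for this disk
        offset = info.startOffset + info.length
    return regions

# e.g. print(count_allocated_regions(vm, vm.snapshot.currentSnapshot))
```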

As a simple test, check whether NBD or Hot-Add is faster; if it is, check in vCenter (and the Veeam logs) for the same call referenced in the article. If you're seeing that, then you need to figure out how to "defragment" the VM. Keep in mind that vMotion won't do this if you're moving between datastores with the same block size.
Regnor
VeeaMVP
Posts: 940
Liked: 291 times
Joined: Jan 31, 2011 11:17 am
Full Name: Max
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by Regnor »

Harvey is right; you also need to consider the underlying storage.
If your datastores/LUNs are thin provisioned on the storage, they will get fragmented over time as they grow. Also, the datasheet of the EMC storage mentions "Inline unified data reduction", which sounds a lot like deduplication and will make this even worse.
Try Direct SAN without storage snapshots, or a different processing mode, and see how it goes. Also check with the storage vendor whether there are any optimization tasks for the SAN itself.
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV »

Would converting these disks to thick disks during a vMotion force a defragmentation? It seems like a simple question, but I am not finding a clear answer.

Also, the second link you provided states that "Currently there is no tool to measure the degree of fragmentation that exists in vSphere. And the only utility to defragment a VMDK file is VMotion – to move the VMhome to another datastore and then SVM it back to the original datastore." I have not seen anything about that not being the case on datastores with the same block size. Am I still missing something here when it comes to using vMotion to defragment?
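If I do end up converting during the move, I'd probably script it roughly like this rather than clicking through the UI; an untested pyVmomi sketch, where the target datastore and the lazy/eager zero choice are just placeholders:

```python
# Rough sketch: Storage vMotion a VM to another datastore and convert its
# disks to thick in the same move. Target datastore and zeroing mode are
# placeholders, not a recommendation; try it on a lab VM first.
from pyVmomi import vim
from pyVim.task import WaitForTask

def svmotion_to_thick(vm, target_ds):
    spec = vim.vm.RelocateSpec()
    spec.datastore = target_ds
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualDisk):
            locator = vim.vm.RelocateSpec.DiskLocator()
            locator.diskId = dev.key
            locator.datastore = target_ds
            backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo()
            backing.diskMode = "persistent"
            backing.thinProvisioned = False   # convert to thick
            backing.eagerlyScrub = False      # lazy zeroed; True = eager zeroed
            locator.diskBackingInfo = backing
            spec.disk.append(locator)
    WaitForTask(vm.RelocateVM_Task(spec=spec))
```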
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by soncscy »

Hey Frederick,

I get that it's confusing, and it's because VMware has really hidden this stuff, in my opinion.

https://kb.vmware.com/s/article/2004155

The same concept for the null blocks applies: basically, you can end up with an extremely (internally) fragmented VMDK even if you vMotion.

I'm not really sure about converting to thick disks; part of me says "yes", but another part of me says VMware may have optimized this process in a way that leaves the same penalties in place.

I think before we theory-craft about VMware, can you confirm whether Direct SAN without storage snapshots, or hotadd/NBD, does the disk collection part faster? My thinking is based solely on the fact that everything you describe sounds like it's related to the storage integration, but before we send you running to find storage for a vMotion to a datastore with a different block size, let's first make sure we have the right issue.

You don't have to run the whole backup; just see if the same step takes as long and then stop the job.

If that's the case, then you know you're looking at fragmentation of the VMDK, and we can put our brains together and see if we find some inspiration.
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV »

Fortunately, space is not my issue right now since I've only started moving my environment's data onto this storage. I decided to vMotion my smaller 12TB VMDK to a new datastore this morning with no format changes. Once that finishes I'll run a backup test and see if there's any change; if not, I'll move it back and change it to thick.

Once it's moved I'll also test what you suggested, running without the storage snapshot.
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV » 1 person likes this post

After moving the 12TB disk to a new, mostly empty datastore, keeping it as a thin disk, the "Collecting disk files location data" step took 1 minute versus the 1 hour and 8 minutes it took in the prior run. It seems like this made the difference. Now to start moving the other 120TB.
Regnor
VeeaMVP
Posts: 940
Liked: 291 times
Joined: Jan 31, 2011 11:17 am
Full Name: Max
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by Regnor »

That's a huge difference 😉
So the thin disks are causing fragmentation in your case, probably somewhere at the storage level.
FECV
Enthusiast
Posts: 41
Liked: 7 times
Joined: Mar 24, 2016 2:23 pm
Full Name: Frederick Cooper V
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by FECV »

So I have finally moved my disks around to new storage. I've only made one of them a thick disk so far; I plan to do the other two when I move them back. But I decided to kick off a Veeam job in the meantime to see where things stand now that everything has been moved once. As I said originally, the job would sit at "Collecting disk files location data" for 4 hours before I stopped it. Today that task took less than 3 minutes. So moving the disks must indeed have removed some fragmentation and resolved my issue. Thanks again for your input!
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Issues With "Collecting Disk Files Location Data" Hanging

Post by soncscy »

Hey Frederick,

I'm really glad this helped out. While it's regrettable you got hit by this, it shows just how much subtle underlying storage factors affect backup performance, and the lengths vendors go to in order to work around them for production while not extending the same courtesy to users trying to protect their data.

Maybe there's a technical reason for this, and I would absolutely believe it, but with ransomware scares, or even just common IT disasters (dead hosts/disks), if production can work around it, the backup API should be able to as well.

This is my personal take and frustration with the major hypervisor vendors: it always feels like the backup API is a second-class citizen, and you can only troubleshoot it if you have a "good" knowledge of the underlying tech.

I'm glad you had a success story here, and thank you for sharing the results/testing! :)