Agent-based backup of Windows, Linux, Mac, AIX and Solaris machines.
a-tome
Influencer
Posts: 10
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Feature Request - Streamline Cluster Processing

Post by a-tome »

I back up a Microsoft fileserver cluster with five physical nodes and nine virtual nodes/namespaces. The cluster holds more than 100 TB of data, so I have to subdivide the backup job in order to get segments of data completely backed up while reducing the chance of a backup failure due to network disconnects or other anomalous errors that "just happen". I have subdivided the whole backup job based on the virtual fileserver names.

Each job starts; attaches to all of the physical nodes that are not locked by a running job; finds which physical node the disks associated with the fileserver instance are attached to; completes a "dummy backup" of the rest of the physical nodes; and backs up the appropriate LUNs for the virtual fileserver name. For example, fs1 and fs3 could be running on clusternode1. The job for fs1 attaches to cn1, cn2, cn3, cn4 and cn5 to perform the backup of the LUNs that are presented to cn1 only.

Looking in Microsoft Failover Cluster Manager, one can see which roles are attached to each physical node, so having the backup software attach to all nodes seems superfluous. Moreover, it takes almost 20 minutes for the software to attach to a cluster node, perform the faux backup, and unlock the resource. While a job will process as many nodes as possible in parallel, running separate jobs for each virtual fileserver name eventually forces this process to serialize, adding hours to the overall processing time for the backup. I run a separate job for the cluster that backs up the O/S and local disks of all of the physical nodes.

Here is the request:
Since there is a Veeam Physical Workload client on each server, it could run the appropriate commands from the "FailoverClusters" module to identify the physical node that hosts each role and only perform the backup on that server. The cluster name resolves to one of the physical nodes, so it could be used to get the node/role association at the beginning of the job.
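To make the requested behavior concrete, here is a minimal sketch of the scoping logic. It assumes the job already has the role-to-owner-node map that a query like `Get-ClusterGroup` returns; the function name and data shapes are hypothetical illustrations, not Veeam's implementation.

```python
# Sketch of the requested job-scoping logic (illustration only; the
# function name and inputs are hypothetical, not Veeam's actual code).

def nodes_to_attach(role_owner, job_roles):
    """Return only the physical nodes that own a role covered by this job.

    role_owner: mapping of cluster role -> owning physical node, e.g.
        built from `Get-ClusterGroup | Select-Object Name, OwnerNode`.
    job_roles: the virtual fileserver names this job is responsible for.
    """
    return {role_owner[role] for role in job_roles if role in role_owner}

# Example from the post: fs1 and fs3 are hosted on cn1, fs2 on cn2.
owners = {"fs1": "cn1", "fs2": "cn2", "fs3": "cn1"}
print(nodes_to_attach(owners, ["fs1"]))  # only cn1, not cn2..cn5
```

With this map resolved once at job start, the fs1 job would attach to cn1 alone instead of locking all five physical nodes.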
Dima P.
Product Manager
Posts: 14396
Liked: 1568 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by Dima P. »

Hello Jan,

Thank you for your post! May I ask which job type you are referring to? For agent jobs managed by the backup server with the failover cluster type, we work with cluster accounts and resolve those into the associated nodes during the job run. Shared resources (i.e. cluster disks) are processed via the owner node only once, and we do not attach them to every node during backup. Cheers!
a-tome
Influencer
Posts: 10
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

I am running a job on a Microsoft fileserver cluster, using the domain object to identify the cluster and allowing Veeam Backup & Replication to identify the nodes in the cluster. The whole cluster backup is divided into many smaller jobs because I need the work completed within the backup window. Several behaviors of Veeam Backup & Replication make this very difficult.

While the backup job only backs up data on the specified LUNs, and each LUN is associated with a specific physical node, the job startup still has to check every node in the cluster. That check can take more than 30 minutes, and it locks all of the available nodes until they have been processed. When a second job starts, it might have to wait as much as 30 minutes for a preceding job to release the node that holds the disks for the current job. This becomes a problem when I have five physical nodes and each node has to pass through all of the jobs.

This week I hit another limit. There is a process at the end of the backup, after all of the physical nodes have been searched (even if the data has already been backed up, the job has to verify with all of the physical nodes that there are no LUNs attached to anything else), that consolidates the oldest incremental into the latest image. For a 24 TB image, the consolidation was still running after 3 days. That kind of destroys the concept of "daily" backups.

That meant I had to increase the number of jobs so the incrementals and the base image were not so large and could be consolidated in a few minutes or a few hours. Rather than being able to back up the whole cluster in a single job, I now have almost 20 jobs. With as much as a 30-minute delay in releasing resources before the next job can run, and with every job having to touch every node, I can add 10 hours of wait time to my backups, and my backup window is only 15 hours.
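The arithmetic behind that 10-hour figure can be sketched as follows; the function name is made up for illustration, and the inputs are the approximate numbers from this post (roughly 20 jobs, up to 30 minutes each spent attaching to and releasing nodes that hold no relevant LUNs).

```python
# Rough model of the serialized per-job overhead described above.
# Numbers are the post's estimates, not measured values.

def serialized_overhead_hours(job_count, per_job_overhead_min):
    """Total hours lost if each job's node attach/release serializes."""
    return job_count * per_job_overhead_min / 60

print(serialized_overhead_hours(20, 30))  # 10.0 hours of a 15-hour window
```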

To be more specific: if I run a job against the LUNs associated with Fileserver #1, the job will connect to all of the cluster nodes and look for the LUNs. It will back up the LUNs of Fileserver #1 on whichever physical cluster node is hosting the namespace. The job will also create a snapshot and look for LUNs to back up on every other cluster node; state that "there is no work to be done"; finalize the backup, which can take as much as 30 minutes; and release the cluster resource.

I am asking for the software to connect to the cluster, identify the specific node that hosts each LUN specified in the job, and perform backup tasks only on that node. It should not create a snapshot on any other node, or run any of the other tasks associated with the backup, if nothing in the job pertains to them. That would allow me to start multiple jobs on a very short timeline, and to have each following job start shortly after the preceding one finishes, without waiting for hours for a resource to be released from every other node that could possibly touch it. I have pictures to help make this clearer.
HannesK
Product Manager
Posts: 14287
Liked: 2877 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by HannesK »

Hello,
I edited your post and added paragraphs to it, as it was hard to read.
The process consolidates the oldest incremental into the latest image. For a 24TB image, the consolidation was still running after 3 days
sounds like you do not use ReFS / XFS, or your backup storage is too slow. https://www.veeam.com/blog/advanced-ref ... suite.html

Are you using VBR version 10? Because we added parallel disk processing in V10 / VAW4, which should make it possible to back up a 100 TB cluster with one job (assuming that you have hardware that can deal with that).

(I know that I'm just suggesting workarounds and not answering the question about the functionality. But maybe it helps as a "quick fix", because changes to the agent behavior would not be implemented within a short time frame.)

Best regards,
Hannes
a-tome
Influencer
Posts: 10
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

>"sounds like you do not use REFS / XFS or your backup storage is too slow." / "VBR v10"

Correct. I am backing up LUNs that use MS Dedupe, and I am not sure what happens when they land on space that is not NTFS, so I am being cautious. On the other hand, I am using VBR v10, which can process the entire data set; however, it cannot process it fast enough to avoid other issues that occur on the network.

I started this process by using the "whole cluster" option and letting that backup process all LUNs and all physical nodes in the cluster as a single job with 50+ TB of raw disk space. The initial backup finished somewhere between 72 and 96 hours, and the subsequent incrementals were completing overnight … until an update of the antivirus clients on the servers invoked the AV firewall during the update (at night, during the backup window, because employees are not trying to access the data at that time). The job, which had been zooming along with each LUN moving data at 25 to 40 MB/sec, went into recovery mode for all of the server instances: every LUN, whether it had already finished or not, on every server that had been running active backups at the time of the disruption, restarted at a rate of 5 MB/sec. The recovery from that disruption was going to take at least another 72 hours, and nothing was going to be backed up during that timeframe.

Breaking the backup into smaller jobs meant that I could get successful backups on some portion of the fileserver cluster LUNs and only experience recovery on a few of them if another disruption occurred. The result is that I changed the mode of the backup from "whole cluster" to "volume level" and created several jobs to distribute the LUNs and reduce recovery time. Now I face delays from the job start-up and image consolidation processes. The most devastating time predator of these is the start-up behavior that accesses every node in the cluster for every job that starts. That is the reason for this feature request.
I like the product and the confidence that I have when it is time to restore data, so I am moving backups from other products that we use into VBR. Recovery, while it is the real reason that we do backups, is only as good as the data that is present in the backups (a concept that everyone reading this post already knows, but added for emphasis anyway). I am trying to improve the intake process so the data I need is present.
a-tome
Influencer
Posts: 10
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

Just as I thought I had crested this wave and had my cluster backups predictable, someone moved a couple of "roles" to new servers and totally blew out my backup drive space. I discovered that rather than needing 60 or 70 TB of landing space for my backups, I need 55 TB times the number of nodes (about 330 TB), because I will get a full backup of everything any time even one cluster role moves. Why is this necessary? I had to manually fail 10 of my backups today because I ran out of drive space on the backup target, and I ran out of time. Eight backup jobs have a chance of completing out of the whole cluster backup. Fortunately, my backups were broken into smaller groups rather than backing up the whole cluster as one job, so I at least got eight backups. I am not sure how I will recover from this. It is a real mess.
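The worst case described here scales as one fresh full per node a role can land on. A toy calculation with the post's own figures (a 55 TB full, "about 330 TB" total, which implies roughly six nodes' worth of fulls); the function name is made up for illustration.

```python
# Worst-case landing space if a role failover to any node can trigger a
# new full backup there. Figures are the post's, not verified sizing.

def worst_case_landing_tb(full_backup_tb, node_count):
    """Space needed if the target must hold up to one full per node."""
    return full_backup_tb * node_count

print(worst_case_landing_tb(55, 6))  # 330 TB, vs. the 60-70 TB planned
```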
HannesK
Product Manager
Posts: 14287
Liked: 2877 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by HannesK »

Hello,
as I already said: ReFS / XFS block cloning fixes the merge issues. Whatever is inside the backups is not relevant.
I need 55 TB x the number of nodes ( about 330 TB)
I don't believe that unless I see it confirmed in a support case (please post the case number for reference). The only case where data is duplicated is when you use per-VM backup chains and you do a backup copy job of your cluster backup. (I assume that we are not talking about Exchange DAG or SQL Always On.)

With proper hardware and configuration, you should see 500-1000 MByte/s. At least that's what I saw some years ago at a customer (and the network was the bottleneck), and I expect current versions to be faster.

Best regards,
Hannes
a-tome
Influencer
Posts: 10
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

I tried ReFS on a different backup store for the fileserver cluster. I ran out of space because the backups never recovered empty blocks. In fact, that is a persistent problem whether I use ReFS or NTFS; I just run out of space faster with the recommended 64K block size. I appreciate the note about transfer speeds, but if I revert to whole-cluster processing, consolidations still take too long: the backup that begins on Saturday might not be available for a restore until about Thursday, and no daily jobs will have run in the interim.

A faster process than consolidating blocks would be to map blocks in the backup files (perhaps in a separate file in the backup structure). Blocks and increment filenames would be removed as they age. If the desire is to have a recover-ready, most-current image, that could be handled with a background job designed to manage data and database integrity, and to be interrupted if a restore was required. Small enough increments of change could be scheduled so that the "image ready for restore" state was within two to five minutes of the job initiation request.
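The block-map proposal above amounts to resolving each block to its newest copy across the chain at restore time, instead of physically merging files. A toy sketch of that idea (purely illustrative; this is not how VBR stores backups):

```python
# Toy block map: a restore resolves each block to its newest copy across
# the increment chain, so no multi-day physical merge is needed.
# (Illustration of the poster's proposal, not Veeam's actual format.)

def build_block_map(chain):
    """chain: increments oldest-first; each maps block_id -> block data."""
    block_map = {}
    for increment in chain:
        block_map.update(increment)  # newer increments overwrite older blocks
    return block_map

full = {0: "A0", 1: "B0", 2: "C0"}
inc1 = {1: "B1"}   # day 1 changed block 1
inc2 = {2: "C2"}   # day 2 changed block 2
print(build_block_map([full, inc1, inc2]))  # {0: 'A0', 1: 'B1', 2: 'C2'}
```

Aging out the oldest increment then becomes bookkeeping: drop any block no newer file supersedes from the map, rather than rewriting a multi-terabyte image.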
HannesK
Product Manager
Posts: 14287
Liked: 2877 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by HannesK »

I ran out of space because the backups never recovered empty blocks.
Not sure what that means. Is it about thin-provisioned LUNs?

If the difference between 64K and 4K makes you run out of space, then something went wrong with sizing. Yes, 64K uses a little more space (less than 5%), but nobody forces you to use 64K. Chances are it can also work with 4K.