Agent-based backups for Windows and Linux, centralized agent management
a-tome
Novice
Posts: 3
Liked: never
Joined: Jul 18, 2020 8:13 pm
Full Name: Jan Levin
Contact:

Feature Request - Streamline Cluster Processing

Post by a-tome »

I back up a Microsoft file server cluster with five physical nodes and nine virtual nodes/namespaces. The cluster holds more than 100 TB of data, so I have to subdivide the backup job in order to get segments of data completely backed up while reducing the chance of a backup failure due to network disconnects or other anomalous errors that "just happen". I have subdivided the whole backup based on the virtual file server names. Each job starts; attaches to all of the physical nodes that are not locked by a running job; finds which physical node the disks associated with the file server instance are attached to; completes a "dummy backup" of the rest of the physical nodes; and backs up the appropriate LUNs for the virtual file server name.

For example, fs1 and fs3 could be running on clusternode1. The job for fs1 attaches to cn1, cn2, cn3, cn4, and cn5 to perform the backup of the LUNs that are presented to cn1 only. Looking in Microsoft Failover Cluster Manager, one can see which roles are attached to each physical node, so having the backup software attach to all nodes seems superfluous. Moreover, it takes almost 20 minutes for the software to attach to a cluster node, perform the faux backup, and unlock the resource. While a job will process as many nodes as possible in parallel, running separate jobs for each virtual file server name eventually forces this process to serialize, adding hours to the overall processing time for the backup. I run a separate job for the cluster that backs up the OS and local disks of all of the physical nodes. Here is the request:
Since there is a Veeam physical workload agent on each server, it could run the appropriate commands from the "FailoverClusters" PowerShell module to identify the physical node that hosts each role and perform the backup only on that server. The cluster name resolves to one of the physical nodes, so it could be used to obtain the node/role association at the beginning of the job.
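For reference, the lookup being proposed amounts to a couple of cmdlets from the FailoverClusters module. This is only a sketch of the idea, not Veeam's implementation; it assumes it runs on a cluster node with the Failover Clustering tools installed:

```powershell
# Sketch only (assumes a live failover cluster and the Failover
# Clustering feature with its PowerShell tools installed).
Import-Module FailoverClusters

# Map each clustered role (e.g. the virtual file server names) to the
# physical node that currently owns it.
Get-ClusterGroup | Select-Object Name, OwnerNode, State

# Map each physical disk resource to its owning role and node, so a job
# scoped to one virtual file server could connect only to that node.
Get-ClusterResource |
    Where-Object ResourceType -eq 'Physical Disk' |
    Select-Object Name, OwnerGroup, OwnerNode
```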

Dima P.
Product Manager
Posts: 11623
Liked: 1009 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by Dima P. »

Hello Jan,

Thank you for your post! May I ask which job type you are referring to? For agent jobs managed by the backup server with the failover cluster type, we work with cluster accounts and resolve them into the associated nodes during the job run. Shared resources (i.e., cluster disks) are processed via the owner node only once; we do not attach them to every node during backup. Cheers!


Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

I am running a job on a Microsoft file server cluster using the domain object to identify the cluster and allowing Veeam Backup & Replication to identify the nodes in the cluster. The whole cluster backup is divided into many smaller tasks because I need to get the job completed within the backup window. Several behaviors of Veeam Backup & Replication make this very difficult.

Although the backup job only backs up data on the specified LUN, and that LUN is associated with a specific physical node, the job startup has to check every node in the cluster. That check can take more than 30 minutes, which locks all of the available nodes until they have been processed. When a second job starts, it might have to wait as much as 30 minutes for a preceding job to release the node that holds the disks for the current job. This becomes a problem when I have five physical nodes and each node has to pass through all of the jobs.

This week I hit another limit. There is a process at the end of the backup, after all of the physical nodes have been searched (even if the data has already been backed up, the job has to verify with all of the physical nodes that there are no LUNs attached anywhere else). This process consolidates the oldest incremental into the latest image. For a 24 TB image, the consolidation was still running after 3 days. That rather destroys the concept of "daily" backups.

That meant I had to increase the number of jobs so the incrementals and the base image were not so large and could be consolidated in a few minutes or a few hours. Rather than being able to back up the whole cluster in a single job, I now have almost 20 jobs. With as much as a 30-minute delay for releasing resources before the next job can run, and with every job having to touch every node, I have the potential to add 10 hours of wait time to my backups, and my backup window is only 15 hours.

To be more specific: if I run a job against the LUNs associated with Fileserver #1, the job will connect to all of the cluster nodes and look for the LUNs. It will back up the LUNs on Fileserver #1 on whichever physical cluster node is hosting the namespace. But the job will also create a snapshot and look for LUNs to back up on every other cluster node; state that "there is no work to be done"; finalize the backup for as much as 30 minutes; and release the cluster resource.

I am asking for the software to connect to the cluster, identify the specific node that is hosting each LUN specified in the job, and perform backup tasks only on that node: no snapshot, and none of the other tasks associated with the backup, on any node to which nothing in the job pertains. That would allow me to start multiple jobs on a very short timeline, and to have each following job start shortly after the preceding one finishes, without waiting hours for a resource to be released from every other node that could possibly touch it. I have pictures to help make this clearer.

HannesK
Veeam Software
Posts: 5887
Liked: 810 times
Joined: Sep 01, 2014 11:46 am
Location: Austria
Contact:

Re: Feature Request - Streamline Cluster Processing

Post by HannesK »

Hello,
I edited your post and added paragraphs, as it was hard to read.
The process consolidates the oldest incremental into the latest image. For a 24TB image, the consolidation was still running after 3 days
That sounds like you are not using ReFS / XFS, or your backup storage is too slow: https://www.veeam.com/blog/advanced-ref ... suite.html
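To illustrate the ReFS suggestion (this is an assumption on my part, not a prescription from the thread; the drive letter is a placeholder, and Veeam's repository guidance should be checked first): the point of ReFS here is block cloning, which lets the merge of the oldest incremental be a metadata operation instead of a full data copy.

```powershell
# Illustration only: format a Windows backup repository volume with ReFS
# and a 64 KB allocation unit size, so block cloning (fast clone) can be
# used for synthetic operations such as incremental merges.
# Drive letter R: is a placeholder for the repository volume.
Format-Volume -DriveLetter R -FileSystem ReFS -AllocationUnitSize 65536
```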

Are you using VBR version 10? We added parallel disk processing in V10 / VAW 4, which should make it possible to back up a 100 TB cluster with one job (assuming you have hardware that can deal with that).

(I know that I'm just suggesting workarounds and not answering the question about the functionality. But maybe this helps as a "quick fix", because changes to the agent behavior would not be implemented within a short time frame.)

Best regards,
Hannes


Re: Feature Request - Streamline Cluster Processing

Post by a-tome »

>"sounds like you do not use REFS / XFS or your backup storage is too slow." / "VBR v10"

Correct. I am backing up LUNs that use MS Dedupe, and I am not sure what happens when they land on space that is not NTFS, so I am being cautious. On the other hand, I am using VBR v10, which can process the entire data set; however, it cannot process it fast enough to avoid other issues that occur on the network.

I started by using the "whole cluster" option and letting that backup process all LUNs and all physical nodes in the cluster as a single job with 50+ TB of raw disk space. The initial backup finished somewhere between 72 and 96 hours, and the subsequent incrementals were completing overnight … until an update of the antivirus clients on the servers invoked the AV firewall during the update (at night, during the backup window, because employees are not trying to access the data at that time). The job, which had been zooming along with each LUN moving data at 25 to 40 MB/s, went into recovery mode for all of the server instances: every LUN, whether it had already finished or not, on every server that had been running active backups at the time of the disruption, restarted at a rate of 5 MB/s. Recovery from that disruption was going to take at least another 72 hours, and nothing was going to be backed up during that timeframe.

Breaking the backup into smaller jobs meant that I could get successful backups on some portion of the file server cluster LUNs and only experience recovery on a few of them if another disruption occurred. So I changed the mode of the backup from "whole cluster" to "volume level" and created several jobs to distribute the LUNs and reduce recovery time. Now I face delays from the job start-up and image consolidation processes. The most devastating time predator of these is the start-up behavior that accesses every node in the cluster for every job that starts. That is the reason for this feature request.
I like the product and the confidence I have when it is time to restore data, so I am moving backups from the other products we use into VBR. Recovery, while it is the real reason we do backups, is only as good as the data that is present in the backups (a concept that everyone reading this post already knows, but added for emphasis anyway). I am trying to improve the intake process so the data I need is present.

