-
- Service Provider
- Posts: 29
- Liked: 8 times
- Joined: Dec 05, 2019 1:51 am
- Full Name: Daniel Judge
- Contact:
Veeam Backup Proxy Selection Flaw
I've encountered an interesting dilemma with Veeam Backup Linux proxies that caused a cascading failure of scheduled backup jobs. Neither the FAQ nor the help guide clearly describes how this situation is handled.
The environment uses several Linux backup proxies, all using hot-add mode to back up VMs from vSphere.
All jobs have proxy selection set to auto.
The issue is that if the /etc/VeeamAgentConfig file has an incorrect configuration saved, the veeamtransport systemd service fails to start. That much could be expected. However, it leads to the following.
A job attempts to run using this proxy and fails with the error:
Error: TCP stream was closed Failed to start 'veeamagent' executable. Failed to create '/opt/veeam/transport/VeeamAgent
The Veeam logs show system exception errors that the agent could not be started on the proxy.
With all proxies under no load, it appears that for every job scheduled to run, Veeam B&R tries to use the first proxy in the list of available proxies. (Is that normal behaviour, as I understand it, or should it in fact be selecting the proxy with the least load in a round-robin fashion?)
The problem is that Veeam does not mark the proxy as unavailable when it fails to start the veeamagent, so subsequent retries go through the same decision process and try the first proxy again.
This in turn leads every scheduled job to do the same thing, resulting in a cascading failure of all scheduled jobs.
I expected some logic to see that the veeamagent on the proxy had failed 'x' times and mark the proxy as unavailable.
Or that Veeam would see the job failed for this reason and, on the subsequent retry, select another proxy to mitigate the issue with the first proxy it attempted to use.
It doesn't seem to apply any of this logic to mitigate the problem at hand.
Could someone please provide feedback on this issue and advise how to deal with it, or explain how auto proxy selection behaves in this scenario?
It appears that if the first configured proxy has a systemic issue with the systemd service and cannot start it, no logic is applied to deal with the situation and select an alternate proxy.
A case was raised for this dilemma: #07602576
-
- Chief Product Officer
- Posts: 32297
- Liked: 7645 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Could you clarify why the /etc/VeeamAgentConfig file has an incorrect configuration saved?
-
- Service Provider
- Posts: 29
- Liked: 8 times
- Joined: Dec 05, 2019 1:51 am
- Full Name: Daniel Judge
- Contact:
Re: Veeam Backup Proxy Selection Flaw
An incorrect line entry was saved, but this is irrelevant. The major concern is the issue it caused and the logic VBR used (or did not use) to work around the agent executable failing repeatedly on the proxy. Veeam should have selected another proxy for the retries, or isolated the proxy as unavailable.
-
- Chief Product Officer
- Posts: 32297
- Liked: 7645 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Just to be clear, did VBR save the incorrect line into the config, or did someone manually edit the advanced data mover configuration settings file?
I'm asking because if that's the latter case then there are millions of other ways to mess up any software by putting invalid settings into internal config files, and it would be a complete waste of time to implement in-code protection from such mistakes. This will inflate code dramatically, making it harder to maintain, secure and support - all this only to ensure the code behaves well in case of an extremely unlikely situation.
On the other hand, if a similar proxy selection issue can be reproduced without erroneous modifications to config files, then it would certainly be a high priority bug to address.
-
- Service Provider
- Posts: 29
- Liked: 8 times
- Joined: Dec 05, 2019 1:51 am
- Full Name: Daniel Judge
- Contact:
Re: Veeam Backup Proxy Selection Flaw
To provide some context...
In order to address log file growth on the Linux proxies, a cleanup script was added to cron, and VeeamAgentConfig was updated on a number of proxies to add the following settings:
AgentMaxLogSize=10485760
AgentMaxLogCount=5
The issue that occurred (which we've outlined in the case, and which wasn't addressed to our satisfaction, to be honest) is that the entries in the VeeamAgentConfig file were duplicated and saved in error on the first proxy, leaving that proxy's file with:
AgentMaxLogSize=10485760
AgentMaxLogCount=5
AgentMaxLogSize=10485760
AgentMaxLogCount=5
All other proxies were correctly configured.
This double entry caused the issue reported and prevented the systemd veeamtransport service on the proxy from starting.
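A pre-flight check along the following lines could catch this kind of duplication before the file is saved on a proxy. This is only a minimal sketch: it assumes the file is plain key=value lines and that lines starting with # are comments (both assumptions on my part, the thread does not describe the file format), and find_duplicate_keys is a hypothetical helper, not a Veeam tool.

```python
from collections import Counter
from typing import Dict


def find_duplicate_keys(config_text: str) -> Dict[str, int]:
    """Return {key: count} for every key that appears more than once.

    Assumes plain key=value lines; '#' comment handling is an assumption.
    """
    keys = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, _value = line.partition("=")
        keys.append(key.strip())
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # The duplicated entries reported on the first proxy.
    sample = (
        "AgentMaxLogSize=10485760\n"
        "AgentMaxLogCount=5\n"
        "AgentMaxLogSize=10485760\n"
        "AgentMaxLogCount=5\n"
    )
    duplicates = find_duplicate_keys(sample)
    if duplicates:
        print("Duplicate keys found:", duplicates)
```

Running something like this from the same script that pushes the settings to each proxy would flag the bad file before the service is restarted.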
The Veeam log files showed that a system exception was encountered with the veeamagent on the proxy, which quickly allowed us to identify the problem during our investigation. After correcting the config and manually running systemctl restart veeamtransport, the service started without issues, and restarted jobs ran using the first proxy without issues.
The environment has 6 proxies, named proxy-001 through proxy-006.
Backup jobs are all configured to use auto proxy selection. All proxies have equal weight for selection.
The issue that arose is that every job, and every job retry, tried to use the first proxy in the list of proxies to be 'auto' selected from, causing a cascading failure of every job. To our understanding, and as we've been advised, this was because the first proxy was the least loaded and the first in the list.
The issue is that VBR does not handle this veeamagent system exception properly, and that is a critical problem affecting backup job operations.
It seems that VBR only checks whether the proxy is available for the backup job by reachability, authentication and an attempt to use the veeamagent, but disregards any system exception errors if the veeamagent process fails.
There should be some system integrity process that, in the event that the proxy fails (such as with a system exception error!), reports that a problem exists with the proxy and deals with it appropriately, so that subsequent scheduled backup jobs and their retries can auto-select another proxy where appropriate. And/or even make the proxy unavailable so it isn't chosen until the veeamagent stops reporting a system exception!
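To make the request concrete, the kind of bookkeeping being asked for could look roughly like the sketch below. It is purely illustrative and says nothing about how VBR is actually implemented: ProxyQuarantine, the threshold of 3 and the method names are all hypothetical.

```python
from typing import Dict, Iterable, List


class ProxyQuarantine:
    """Tracks consecutive agent-start failures per proxy (hypothetical helper)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        # The threshold is the 'x' from the post; 3 is an arbitrary default.
        self.failure_threshold = failure_threshold
        self._consecutive_failures: Dict[str, int] = {}

    def record_failure(self, proxy: str) -> None:
        """Call when the agent fails to start on this proxy."""
        self._consecutive_failures[proxy] = (
            self._consecutive_failures.get(proxy, 0) + 1
        )

    def record_success(self, proxy: str) -> None:
        """A successful start clears the failure history for this proxy."""
        self._consecutive_failures.pop(proxy, None)

    def is_quarantined(self, proxy: str) -> bool:
        return self._consecutive_failures.get(proxy, 0) >= self.failure_threshold

    def usable(self, proxies: Iterable[str]) -> List[str]:
        """Filter out proxies that have hit the failure threshold."""
        return [p for p in proxies if not self.is_quarantined(p)]


if __name__ == "__main__":
    quarantine = ProxyQuarantine(failure_threshold=3)
    for _ in range(3):
        quarantine.record_failure("proxy-001")
    print(quarantine.usable(["proxy-001", "proxy-002", "proxy-003"]))
```

The point of the design is that a successful agent start resets the counter, so a proxy that has been repaired returns to the pool without manual intervention.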
Is there a watchdog process that checks the veeamagent status, to ensure all systems are working and available?
If more than one proxy exists, why doesn't VBR randomly select a proxy with no or lesser load when backup jobs are configured to auto-select the proxy? This might have mitigated the issue to some degree, but it's not a solution.
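For illustration, random tie-breaking among equally loaded proxies might be sketched as follows. The load numbers and the pick_least_loaded function are hypothetical; this is what the question proposes, not a description of the VBR scheduler.

```python
import random
from typing import Dict, Optional


def pick_least_loaded(
    load_by_proxy: Dict[str, int], rng: Optional[random.Random] = None
) -> str:
    """Pick a proxy with the lowest load, breaking ties at random.

    load_by_proxy maps a proxy name to its number of busy task slots
    (a hypothetical load measure chosen for this sketch).
    """
    if not load_by_proxy:
        raise ValueError("no proxies to choose from")
    rng = rng or random.Random()
    lowest = min(load_by_proxy.values())
    # Sort the tied candidates so the result depends only on the RNG state.
    candidates = sorted(p for p, load in load_by_proxy.items() if load == lowest)
    return rng.choice(candidates)


if __name__ == "__main__":
    idle = {"proxy-%03d" % i: 0 for i in range(1, 7)}
    print(pick_least_loaded(idle))
```

With six idle proxies, a broken first proxy would then take roughly one job in six rather than all of them, which is why this softens the problem without solving it.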
The issue remains that if a proxy's veeamagent fails, as occurred here where the systemd service cannot start, the backup proxy selection appears to have no ability to work around the problem, and all backup jobs are doomed to fail as in this situation.
This is not a problem of the incorrect configuration causing a system process to fail. It's that a system exception is not being handled appropriately to maintain reliable and continued operation of backups as a whole.
-
- Chief Product Officer
- Posts: 32297
- Liked: 7645 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Yes, of course there's a watchdog that updates the availability of each backup infrastructure component periodically (otherwise the task scheduler would not know which backup resources are available for it to use in principle, and what the current load of each is). Which is why I suspect you need to mess up a backup proxy in a very special way to run into the issue you observed, at which point we're deviating into "unsupported data mover modifications" territory, which as I've said can be a rabbit hole to pursue and try to handle in the code. But this is certainly something for QA to review more closely before making the final call about this situation.
-
- Service Provider
- Posts: 29
- Liked: 8 times
- Joined: Dec 05, 2019 1:51 am
- Full Name: Daniel Judge
- Contact:
Re: Veeam Backup Proxy Selection Flaw
And can you comment on the behaviour of the proxy selection process? Why doesn't it use random selection among the proxies with no or lesser load?
That leaves two outstanding questions then...
First, if there is a periodic check, how often does it run, and why didn't it pick up that the veeamagent was generating system exceptions and therefore make the proxy unavailable?
Second, in case it wasn't clear in my responses: the focus isn't on unsupported modifications. It is that backup operations fail even when more than one proxy is available, because VBR isn't smart enough to decide that "one proxy is having veeamagent issues, use another proxy for the retry and subsequent jobs if needed... and if that problem proxy is still having issues, make it unavailable and exclude it from operations".
And I guess third, why doesn't the job, during its start procedures, treat the system exception as a condition for deciding how to proceed to a successful outcome, and try another proxy? If 'x' further attempts fail, then fail the job (which would indicate 'all' proxies are encountering issues).
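The retry behaviour described in that third point could be sketched like this. Everything here is hypothetical (AgentStartError, run_with_failover, the start_task callback and the limit of 3 attempts); it illustrates the requested logic and is not a claim about what VBR does internally.

```python
from typing import Callable, Dict, Optional, Set


class AgentStartError(Exception):
    """Raised by start_task when the agent cannot be started (hypothetical)."""


def run_with_failover(
    load_by_proxy: Dict[str, int],
    start_task: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Try the least-loaded proxy first; on failure, retry on a different one.

    Returns the name of the proxy that accepted the task, or raises
    RuntimeError once max_attempts proxies (or all of them) have failed.
    """
    tried: Set[str] = set()
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        candidates = {p: n for p, n in load_by_proxy.items() if p not in tried}
        if not candidates:
            break  # every proxy has already failed once
        # Sorting first makes ties resolve to the first name, as in the thread.
        proxy = min(sorted(candidates), key=candidates.__getitem__)
        try:
            start_task(proxy)
            return proxy
        except AgentStartError as exc:
            tried.add(proxy)  # never pick this proxy again for this job
            last_error = exc
    raise RuntimeError("task failed on proxies: %s" % sorted(tried)) from last_error


if __name__ == "__main__":
    def start_task(proxy: str) -> None:
        # Simulate the broken first proxy from this thread.
        if proxy == "proxy-001":
            raise AgentStartError("Failed to start 'veeamagent' executable")

    proxies = {"proxy-001": 0, "proxy-002": 0, "proxy-003": 0}
    print(run_with_failover(proxies, start_task))
```

The key difference from the behaviour observed is the tried set: the retry remembers which proxy just failed instead of re-running the same selection and landing on it again.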
Of course, if all proxies had this issue, Veeam would make all of them unavailable, and that is a larger issue at hand.
But having one proxy (the first of many) with veeamagent service issues causing all retries and backup operations to fail and fall over like a line of dominoes is another thing entirely.
The whole point of having more than one proxy should be not only to support more concurrent tasks, but also to provide some 'high availability' mechanism should one of those proxies encounter problems?
Surely this is a pretty important feature for the continued success of backup job operations?
-
- Chief Product Officer
- Posts: 32297
- Liked: 7645 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Exactly, high availability was one of the two reasons for multi-proxy design, with the other one being scalability of course.
Generally speaking, there are no known issues with this [almost 15 years old] functionality of proxy selection. You're probably the very first person to have some complaints about it in all these years and across well over a million VBR installs, which is why I suspected from the beginning that some extremely environment-specific problem is in play. Otherwise we would have thousands of support cases created about a similar issue every day, due to how fundamental it is to backup performance and fitting the backup window. Yet, there are none.
So as I've said, Support/QA will need to review your specific situation more closely and determine whether there are any bugs that need fixing, including with what you claim to be a "random" proxy selection... though it is of course extremely unlikely that our algorithm uses Rnd() when selecting a backup proxy while completely disregarding its current load. Rather, the task scheduler has always used available task slots to determine the best backup proxy, assuming all other things are equal (i.e. all the same processing modes are available for all proxies etc.).
-
- Service Provider
- Posts: 29
- Liked: 8 times
- Joined: Dec 05, 2019 1:51 am
- Full Name: Daniel Judge
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Thanks for the information.
OK, I have found a snippet about this in the documentation. But this clearly is not operating correctly.
https://helpcenter.veeam.com/docs/backu ... ty&ver=120
quote "Another advantage of the advanced deployment scenario is that it contributes to high availability — jobs can migrate between proxies if one of them becomes overloaded or unavailable."
There is no mention of how this operates, however, and if an SSH/ICMP check is the only test of unavailability, rather than a check of veeamagent service operability, then this is an issue.
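As a rough idea of what a service-level check could look like, compared with a reachability-only check, here is a minimal sketch built on systemctl is-active. It assumes it runs on the proxy itself (or that the run callback executes the command there, for example over SSH); service_is_healthy and the run parameter are hypothetical, and nothing here reflects how the VBR watchdog actually works.

```python
import subprocess
from typing import Callable, Sequence


def _run_local(cmd: Sequence[str]) -> int:
    """Run a command locally and return its exit code."""
    try:
        return subprocess.run(list(cmd), check=False).returncode
    except FileNotFoundError:
        # systemctl is not installed on this host; treat as unhealthy.
        return 127


def service_is_healthy(
    unit: str, run: Callable[[Sequence[str]], int] = _run_local
) -> bool:
    """True if 'systemctl is-active --quiet <unit>' exits with status 0.

    The run callback is a hypothetical hook so the same check could be
    executed remotely instead of on the local machine.
    """
    return run(["systemctl", "is-active", "--quiet", unit]) == 0


if __name__ == "__main__":
    # 'veeamtransport' is the unit name mentioned earlier in this thread.
    print("veeamtransport healthy:", service_is_healthy("veeamtransport"))
```

A host can answer ping and accept SSH logins while this check still fails, which is exactly the gap being described.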
We'll continue to work with the support case.
-
- Veeam ProPartner
- Posts: 592
- Liked: 114 times
- Joined: Dec 29, 2009 12:48 pm
- Full Name: Marco Novelli
- Location: Asti - Italy
- Contact:
Re: Veeam Backup Proxy Selection Flaw
Nasty problem... the proxy logic seems flawed here, it can definitely be improved.
Ciao,
Marco