After 5 months of trying to find out why Veeam cannot back up a couple of Azure VMs running Microsoft SQL Server, I have decided to post on the forum.
We are facing major issues with Veeam, which intermittently falls into a state where only a reboot of the whole VBR server partially helps. The whole Veeam infrastructure is built in Azure and protects Azure VMs running Microsoft SQL Server.
First our database team found problems with transaction log backups, then with backups in general, and finally the server had to be rebooted because no jobs could be started or stopped. Nothing is actually running, yet the database is full of records saying that jobs are already running, and there is no way around this. The problem usually appears at times when many changes are being made to our Azure environment. It now occurs intermittently, but at least once or twice a week. We currently have 80+ individual SQL Servers located in different Azure subscriptions, using dedicated storage accounts for backups, with a copy to a storage account in a different Azure region.
After 4 months and an exchange of 0.5 TB of Veeam logs we have come to nothing. Completely nothing: we do not know the root cause of the problem, we do not know how to fix it, and we do not know how to continue. Not to mention that it is causing trouble for production SQL systems, whose transaction logs keep growing until the systems fail.
The worst part is a crash in the Veeam process:
CLR exception type: System.TimeoutException
"Cloud instance is unresponsive."
Call stack snippet:
Veeam.Backup.ServiceLib.CPublicCloudQueueHubService.ReceiveResponse
→ Veeam.Backup.Core.CCloudMessageServiceSendQueueClient.ReceiveResponse
→ ...
→ System.Threading.ThreadHelper.ThreadStart
My internal review (I was not fully involved in the investigation) is as follows:
- Where the Timeout Happens
The exception arises in Veeam.Backup.ServiceLib.CPublicCloudQueueHubService.ReceiveResponse, which is part of Veeam’s logic for talking to cloud services (e.g., Azure Blob, Amazon S3, or a Veeam Cloud Connect provider). In other words, the “hub service” is attempting to send or receive data and has not gotten a response in time.
- “Cloud instance is unresponsive.”
This is Veeam’s message stating that the remote endpoint (storage, queue, or cloud connect server) is not replying before the configured .NET/Veeam timeout.
- In Azure scenarios, this can be triggered by high latency or an unavailable storage service, network interruption, or throttling on the Azure side.
- If you run large or frequent backups, or have concurrency set high, you might hit ephemeral network issues that cause the Veeam call to stall until it times out.
- Likely Root Causes
- Azure transient network or storage latency. Even minor network blips can cause timeouts if Veeam’s default limit is exceeded.
- Azure resource throttling (e.g., exceeding storage IOPS or egress limits).
- Firewall / NSG misconfiguration that intermittently blocks or slows traffic.
- DNS / name resolution delays.
- Timeout / Retry settings in Veeam are too low to handle sporadic high-latency calls.
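To help narrow down the network-related causes listed above (latency, DNS delays, intermittent blocking), a small probe like the following could be run from the VBR server against each storage account's queue endpoint. This is only a minimal sketch: `probe_endpoint` and `slow_samples` are names I made up, the thresholds are arbitrary assumptions, and it only measures DNS resolution and TCP connect time. It says nothing about Azure-side throttling or about Veeam's own timeout values.

```python
import socket
import time


def probe_endpoint(host, port=443, timeout=5.0):
    """Return (dns_seconds, connect_seconds) for one attempt against host:port.

    Raises OSError (including socket timeouts and DNS failures) on error.
    """
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_seconds = time.perf_counter() - start

    # Use the first address returned by the resolver.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.perf_counter()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
    connect_seconds = time.perf_counter() - start
    return dns_seconds, connect_seconds


def slow_samples(samples, dns_limit=0.5, connect_limit=1.0):
    """Return the samples whose DNS or connect time exceeds the limits.

    The limits are arbitrary assumptions, not values taken from Veeam or Azure.
    """
    return [s for s in samples if s[0] > dns_limit or s[1] > connect_limit]
```

One way to use it would be to call `probe_endpoint("<account>.queue.core.windows.net")` once a minute for each storage account, collect the results, and compare the timestamps of anything returned by `slow_samples` with the times when the jobs hang.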
Anyway, none of the above should cause trouble for Veeam at all. Its architecture should be reasonably resilient and durable, because we all know how the cloud behaves: sometimes it is "Too Many Requests", sometimes something else. It is clear, however, that the way "Cloud Machines" are handled needs proper attention. We know that Veeam previously used "Service Bus"; unfortunately our implementation missed that option by 2 months, so we are now using Azure Storage queues. I do not know whether this was really a good approach; in my eyes it looks like a complete disaster. There must be something wrong with Veeam, the .NET Framework or something else, and it has to be fixed. It is not acceptable for the complete backup system to crash just because a timeout occurred somewhere.
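To illustrate what I mean by resilient handling of a timeout, here is a minimal, generic sketch of retrying with exponential backoff and jitter. It is written in Python only for readability and has nothing to do with Veeam's actual code: `call_with_retry`, its parameters and the delay values are all hypothetical.

```python
import random
import time


def call_with_retry(operation, attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Run operation(); on TimeoutError wait and try again.

    The wait doubles after every failed attempt (capped at max_delay) and gets
    up to 50% random jitter added. The last TimeoutError is re-raised once all
    attempts are used up, so the caller can fail one job instead of crashing.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

The point of the sketch is only that a single unanswered call ends up as a delayed or failed operation, not as a crash of the whole service.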
Has anyone experienced this issue? Can someone help? We would appreciate any help here.
Backups run daily, with transaction log backups every hour (for databases in the FULL recovery model), and there are individual copy jobs that make a secondary copy to a different Azure region. We have storage queues in each subscription where the resources are located. The storage accounts are accessible to both the backed-up VMs and the VBR server. There are almost 80 individual subscriptions for 80+ individual Azure VMs. The Veeam B&R server is a fairly powerful Azure VM and should handle even more; it has a dedicated remote MSSQL server, also an Azure VM, which has no performance issues. The Azure Cloud Machines (agents) are sized as required for running Microsoft SQL Server. Usually no more than one backup runs to a given storage account at a time.