Gostev wrote: First tests are done and we could not confirm an issue using multiple VMs with 11 disks. To be continued...
There has to be something to this. You can look at our ticket, which will have logs uploaded. This customer has three backup jobs with 10, 13 and 26 VMs and two replication jobs with 9 and 10 VMs. Only two VMs run into the hotadd "pausing job" issue. One is a SQL server in job #1 with 10 VMs. The other is a backend Exchange server in job #2 with 13 VMs. The third job with 26 VMs is not affected. Neither are the two replication jobs, but as I mentioned, neither the two replication jobs nor the third backup job contains those two VMs (or any other VMs with a high disk count). No matter where we put those two VMs in their jobs (first, middle, last), the jobs tank as soon as they reach those two VMs.

The issue has to be related to drive quantity. Or could it be total drive size? Those two VMs are their largest VMs. If you add up all 8/9 drives, the SQL server is 2.2TB and the Exchange server is 3.5TB, but this is all spread out over 100GB, 200GB and 500GB drives for both VMs. They have three other VMs that are file servers with hundreds of thousands of files, and those back up fine. They are 1.7TB to 2.0TB in size. Each of those three has only two drives: a small 60GB OS drive and then the 1.7TB to 2.0TB drive.

I'll list the things I think might influence this.
#1 - VMs with high quantity of disks
#2 - All their VMs including the two affected ones use the VMware Paravirtual SCSI controller. Not sure if you tested this in conjunction with high quantity of disks.
#3 - VMs with large disks. Those two VMs are their biggest ones, well over 2TB in total when the disks are added up. Their next largest VMs are exactly 2.0TB and down. Not sure if there is a magic threshold of 2TB for this issue.
#4 - VMs with application-aware processing. All their VMs run with this on. The two affected VMs are a SQL server and a backend Exchange server that have it enabled for obvious transaction log reasons. Again, not sure if you tested this in conjunction with a high quantity of disks.
#5 - The current alignment of planets and moon causing this. I'm at a loss with this issue. No matter what we do to the backup jobs, when it hits those two VMs, they blow up.
Hopefully you can get some insight from your testing. If you need any specific changes tested on our end, just let me know and we'll try them. Everybody just wants to get to the source of this evil.