Khue
Enthusiast
Posts: 67
Liked: 3 times
Joined: Sep 26, 2013 6:01 pm

Backup Jobs Failing After Migration to New Veeam Server

Post by Khue »

Hey everyone,

I have a ticket open with Veeam already on this issue, but I was wondering if anyone else has seen it in the wild. For the better part of 4 years we had been running Veeam on a DL380 G7 running Windows Server 2012 R2 with some sort of rebranded HPE QLogic HBA. With the exception of this one server, our environment is completely virtual. We'd run into the occasional issue, but nothing major; Veeam was very reliable. We ran the DL380 G7 up until sometime around mid-February, when it was decommissioned and replaced with a standalone Cisco C-Series C220 M5SX with a QLogic QLE2692 16Gb dual-port HBA. The C220 is currently running Windows Server 2016 with Veeam 9.5 Update 3 (I have not had an opportunity to move to Update 4 yet).

Ever since the system swap, I have RARELY had a successful evening of backups. I have a total of 4 backup jobs, all using SAN-integrated backups, pulling via FC from 2 HPE StoreServs. After a bunch of initial troubleshooting with Veeam, we kicked this over to Cisco, who recommended updating firmware on the C220 and updating the Windows driver. Using Cisco's HCL, I identified the most current CIMC firmware, BIOS firmware, QLE2692 firmware, and Windows driver and updated everything across the board. Still no luck. I am running into the following errors:
  • Processing SERVERNAME Error: Cannot process pending I/O request (offset: 893444423680, size: 4194304): async reader terminated. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}. Exception from server: Cannot process pending I/O request (offset: 893444423680, size: 4194304): async reader terminated Unable to retrieve next block transmission command. Number of already processed blocks: [0]. Failed to download disk.
  • Processing SERVERNAME Error: The request could not be performed because of an I/O device error. Asynchronous read operation failed Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}. Exception from server: The request could not be performed because of an I/O device error. Asynchronous read operation failed Unable to retrieve next block transmission command. Number of already processed blocks: [0]. Failed to download disk.
  • Error: Cannot process pending I/O request (offset: 893444423680, size: 4194304): async reader terminated Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}. Exception from server: Cannot process pending I/O request (offset: 893444423680, size: 4194304): async reader terminated Unable to retrieve next block transmission command. Number of already processed blocks: [0]. Failed to download disk.
  • Processing SERVERNAME Error: NTFS file: failed to read 3265504 bytes at offset 0 Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.
I have a total of 4 jobs. Each job pulls from the HPE StoreServs and then places the backups on a DD2500 device. The results of the jobs are wildly inconsistent. Jobs 1 and 2 back up 22 and 61 VMs respectively (and pretty much always have). Jobs 3 and 4 each back up a single LARGE VM. For example:
  1. Backup job 1 ran as expected last night. The job size is about 4.5 TB, and on an incremental evening it typically reads and transfers between 50 and 100 GB of data. It ran in 24 minutes last night, which is what we expect.
  2. Backup job 2 ran successfully but completed with many errors. It started around 8:40 PM and backed up all but 20 servers on the first pass. Most of the failed servers failed with the "Error: NTFS..." message from above; a few failed with "Error: Cannot process pending I/O request" or "Error: The request could not be performed because of an I/O device error." The first pass ended at 10:00 PM. The second pass started at 10:00 PM, backed up 8 more VMs, and finished at 11:52 PM. The third and final pass started at 12:02 AM and backed up all remaining VMs successfully, finishing at 12:42 AM.
  3. Backup job 3 completely failed.
  4. Backup job 4 completely failed.
Under normal running conditions, I expect all incremental jobs to finish prior to 12 AM. Our incremental backups aren't HUGE by comparison to our total data size. Job 1 usually runs in about 20-45 minutes. Job 2 usually takes a bit longer, typically around 90 minutes. Job 3 usually takes about 20-45 minutes and job 4 usually takes about 20-30 minutes.
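For anyone who wants to eyeball the same pass/fail pattern without clicking through the console, recent session results can be pulled with Veeam's PowerShell snap-in. Roughly like this — 9.5 ships a snap-in rather than a module, and this is from memory, so treat it as a sketch:

```
# Veeam B&R 9.5 exposes a PowerShell snap-in (not a module)
Add-PSSnapin VeeamPSSnapin

# Last few backup sessions with their result and timing
Get-VBRBackupSession |
    Sort-Object CreationTime -Descending |
    Select-Object -First 8 JobName, Result, CreationTime, EndTime
```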

Any additional thoughts would be great. For the record, the current Veeam case number is 03413450.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London

Re: Backup Jobs Failing After Migration to New Veeam Server

Post by ejenner » 1 person likes this post

What sort of performance options have you enabled? Is it all set to standard, or have you tweaked things? I'm thinking of numbers of concurrent tasks on the proxies and repositories, parallel processing, bandwidth limits... that sort of thing.

Maybe try slowing it down a bit? Is it possible the new setup is too fast for the existing storage to keep up?
Khue
Enthusiast
Posts: 67
Liked: 3 times
Joined: Sep 26, 2013 6:01 pm

Re: Backup Jobs Failing After Migration to New Veeam Server

Post by Khue »

Everything is pretty standard, nothing too special. Backend storage is pretty fast: all-flash over 8 Gbps FC. The DD2500 is most likely the throttle point for backups at this point, and it hasn't really changed. I'll take a look at some of the jobs and see if reducing the concurrency fixes the issue. Good thought. I'll report back if anything changes.
nitramd
Veteran
Posts: 297
Liked: 85 times
Joined: Feb 16, 2017 8:05 pm

Re: Backup Jobs Failing After Migration to New Veeam Server

Post by nitramd » 1 person likes this post

@Khue, do you have any throttling rules in place?
Khue
Enthusiast
Posts: 67
Liked: 3 times
Joined: Sep 26, 2013 6:01 pm

Re: Backup Jobs Failing After Migration to New Veeam Server

Post by Khue »

No, no throttling rules that I know of. As an update to this, I got to thinking about my configuration, and one of the biggest changes with the new server is the HBA: I went from an 8 Gbps HBA to a 16 Gbps HBA. Could there be some sort of incompatibility with my Fibre Channel switch? The switch is a Brocade SilkWorm 300E and all ports are 8 Gbps. I looked at the compatibility guide and everything on the Brocade side suggested the 300E should be happy with the newer QLogic HBA.
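If it helps anyone in a similar spot, the negotiated speed and per-port error counters are easy to sanity-check from the 300E's CLI. Something like the following — standard FOS commands, from memory, so double-check syntax against your FOS version:

```
switchshow      # port state and the speed each port actually negotiated (the 16G HBA should show as 8G here)
porterrshow     # watch for climbing crc, enc_out, or disc_c3 counters on the HBA-facing ports
sfpshow 0       # SFP diagnostics (TX/RX power, temperature) for port 0 - substitute your real port number
portcfgshow     # per-port config; the fill word setting can matter on 8G links to newer HBAs
```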
Khue
Enthusiast
Posts: 67
Liked: 3 times
Joined: Sep 26, 2013 6:01 pm

Re: Backup Jobs Failing After Migration to New Veeam Server

Post by Khue »

I'd like to follow up on this and let everyone know what I found. Veeam level 2 engineering went through some good troubleshooting with me over the phone yesterday. What we found is that the dual-port HBA might be the source of the problem. Here's the testing scenario we ran through:
  • When both HBA ports are active, some of our backups work and others don't. It's very hit or miss, with no common repeatable pattern.
  • During testing with a temporary job and a few isolated VMs, backups over the standard Ethernet network work just fine.
  • During testing with a temporary job and a few isolated VMs, backups over the SAN with both HBA ports enabled fail pretty regularly, with intermittent successes.
  • During testing with a temporary job and a few isolated VMs, backups over the SAN with only the HBA port connected to fabric A are consistently successful.
  • During testing with a temporary job and a few isolated VMs, backups over the SAN with only the HBA port connected to fabric B are consistently unsuccessful.
  • The Brocade 300E port connected to the HBA port for fabric A reports no errors.
  • The Brocade 300E port connected to the HBA port for fabric B reports no errors.
Last night, with the HBA port on fabric B disconnected/disabled, all backups ran fine. I think this leaves me with 2 potential issues:
  1. Something is wrong with the HBA.
  2. I need to install some sort of MPIO driver or something.
My only thought as to why it's not an MPIO issue is that the HBA port on fabric B just straight up doesn't seem to work at all. If it were an MPIO issue, I would assume that running only on the HBA port connected to fabric B (with the fabric A port disconnected) would still result in a successful backup, and that's not the case. I am also somewhat curious whether I have some sort of larger issue with my B fabric in general (I will look at that today). I will report back with more findings later. I am still interested in the community's thoughts on this. Thanks in advance!
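For anyone wanting to rule MPIO in or out quickly, the built-in Windows tooling will show whether the feature is even installed and whether anything is being claimed — nothing Veeam-specific here:

```
# Elevated PowerShell on the backup server
Get-WindowsFeature Multipath-IO   # is the MPIO feature installed at all?
mpclaim -s -d                     # disks currently claimed by MPIO (no entries = no multipathing in play)
mpclaim -e                        # storage MPIO can see but has not claimed
```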