Disk latency alerts and SCSI aborts during Backup

Post by **labsy** » May 05, 2024 11:51 pm

Hi,

I have a small setup:
- ESXi 6.7 with a dozen VMs at one location
- My home PC and a NAS disk array with Veeam B&R and Veeam ONE at another location
- In between, there's a 500/500 Mbps internet connection

This has been happening for more than a year: every weekly full backup generates dozens of alerts via e-mail:

Alarm - Host disk SCSI aborts (state: Error)
Alarm - Host disk SCSI aborts (state: Reset/resolved)
Alarm - Datastore write latency (state: Warning)
Alarm - Datastore write latency (state: Reset/resolved)
Alarm - VM total disk latency (state: Warning)
Alarm - VM total disk latency (state: Reset/resolved)

I never found out exactly what's wrong. I can see some errors on the ESXi host, but my diagnostic options are limited. It looks like all RAID arrays have problems at that time:
vmkwarning.log

WARNING: SVM: 5761: scsi0:1 VMX took 2283 msecs to send copy bitmap for offset 1260572901376. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: SVM: 5761: scsi0:1 VMX took 1352 msecs to send copy bitmap for offset 1282047737856. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: SVM: 5761: scsi0:1 VMX took 1009 msecs to send copy bitmap for offset 1288490188800. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: SVM: 5761: scsi0:1 VMX took 1882 msecs to send copy bitmap for offset 1297080123392. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: SVM: 5761: scsi0:1 VMX took 1234 msecs to send copy bitmap for offset 1301375090688. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: SVM: 5761: scsi0:1 VMX took 1439 msecs to send copy bitmap for offset 1324997410816. This is greater than expected latency. If this is a vvol disk, check with array latency.
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf02094118fc22c20b0" state in doubt; requested fast path state update...
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf020941096b3552071" state in doubt; requested fast path state update...
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf020941246cd1b61f9" state in doubt; requested fast path state update...
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf0209412edd702b94e" state in doubt; requested fast path state update...
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf0209412a6d2c6875b" state in doubt; requested fast path state update...
WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60030057027cabf020941246cd1b61f9" state in doubt; requested fast path state update...
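
For anyone reading along, a minimal Python sketch for condensing warnings like the ones above into per-disk and per-device numbers. It assumes the log lines have exactly the shape shown in this post; the function name `summarize_vmkwarning` is made up for illustration and is not part of any VMware or Veeam tool.

```python
import re
from collections import Counter
from statistics import mean

# Patterns are written against the excerpt in this post (an assumption:
# other ESXi builds may word these warnings differently).
SVM_RE = re.compile(
    r"SVM: \d+: (?P<disk>scsi\d+:\d+) VMX took (?P<ms>\d+) msecs "
    r"to send copy bitmap"
)
NMP_RE = re.compile(r'NMP device "(?P<naa>naa\.[0-9a-f]+)" state in doubt')


def summarize_vmkwarning(lines):
    """Return (per-disk SVM latency stats, per-device NMP warning counts)."""
    latencies = {}
    nmp_counts = Counter()
    for line in lines:
        svm = SVM_RE.search(line)
        if svm:
            # Collect the reported copy-bitmap latency for this virtual disk.
            latencies.setdefault(svm["disk"], []).append(int(svm["ms"]))
            continue
        nmp = NMP_RE.search(line)
        if nmp:
            # Count "state in doubt" warnings per NAA device identifier.
            nmp_counts[nmp["naa"]] += 1
    svm_stats = {
        disk: {"count": len(ms), "max_ms": max(ms), "mean_ms": mean(ms)}
        for disk, ms in latencies.items()
    }
    return svm_stats, dict(nmp_counts)
```

Feeding it the lines of vmkwarning.log (for example `summarize_vmkwarning(open("vmkwarning.log"))`) shows at a glance whether the "state in doubt" warnings cluster on one device or spread across all of them, which is the distinction between a single bad disk and a controller-wide stall.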

vmkernel.log also shows errors from time to time, but none of these devices is the mapped backup NAS drive:

ScsiDeviceIO: 3435: Cmd(0x459b4efbf7c0) 0x85, CmdSN 0x87238 from world 2099828 to dev "naa.60030057027cabf020941246cd1b61f9" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
ScsiDeviceIO: 3435: Cmd(0x459b4efbf7c0) 0x1a, CmdSN 0xb35ed3 from world 0 to dev "naa.60030057027cabf0209412a6d2c6875b" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
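
The two lines above can be decoded with the standard SCSI tables. The following is a rough Python sketch that covers only a small, hand-picked subset of opcodes, status codes, sense keys, and additional sense codes, enough for these two entries. The function name `decode_scsi_failure` is hypothetical, and the regular expression assumes the vmkernel.log line shape shown in this post.

```python
import re

# Small subsets of the standard SCSI tables; extend as needed.
OPCODES = {0x1A: "MODE SENSE(6)", 0x85: "ATA PASS-THROUGH(16)"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}

LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+), .*?'
    r'to dev "(?P<dev>naa\.[0-9a-f]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)'
    r'(?: Valid sense data: (?P<key>0x[0-9a-f]+) '
    r'(?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+))?'
)


def decode_scsi_failure(line):
    """Decode one ScsiDeviceIO failure line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op = int(m["op"], 16)
    status = int(m["d"], 16)
    result = {
        "device": m["dev"],
        "opcode": OPCODES.get(op, hex(op)),
        "device_status": DEVICE_STATUS.get(status, hex(status)),
        "sense_key": None,
        "asc": None,
    }
    if m["key"] is not None:
        key = int(m["key"], 16)
        asc = (int(m["asc"], 16), int(m["ascq"], 16))
        result["sense_key"] = SENSE_KEYS.get(key, hex(key))
        result["asc"] = ASC_ASCQ.get(asc, "0x%02x/0x%02x" % asc)
    return result
```

Read this way, both entries are CHECK CONDITION responses with sense key ILLEGAL REQUEST: the device rejected an ATA pass-through command and a MODE SENSE request. As far as I know, logical volumes behind a RAID controller commonly answer like this when the host probes for features they do not support, so on their own these two lines probably do not indicate a failing disk.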

Since this is an enthusiast setup, I cannot afford a paid service engineer, so I'm asking for a clue. Maybe something is wrong with the Veeam B&R config on my home PC for this "over-the-WAN" setup? I would suspect a single disk, but the errors and warnings do not point to one disk; they point to all of them. So is the RAID controller maybe faulty? I would expect at least some error on the RAID controller, but I can't find any.
Could someone kick me in the right direction? Thanks!

May 06, 2024 3:43 am

If the iSCSI connection runs over the WAN, please change that.

May 06, 2024 5:33 am

Hi Andrej

That sounds like a technical issue. We cannot investigate such issues through a forum topic.
Please provide a support case ID for this issue, as requested when you click New Topic. Without a case number, the topic will eventually be deleted by moderators.

Unfortunately, we cannot investigate log files through a forum post. But one thing I would like to ask: did you deploy a Veeam VMware proxy on the ESXi host? The proxy should be on the same side as the ESXi infrastructure.

Best regards,
Fabian

PS: support can only help if you upload logs https://www.veeam.com/kb1832

May 06, 2024 6:51 am

I think this is simply a consequence of how the infrastructure is designed and which components were selected.

In general, when you add I/O or throughput load to a disk controller, you eventually hit a point where the controller or the disks cannot keep up with the demand, and latency rises significantly.
Backups transport a lot of data, so this can happen when the backup load overloads the controller.
You can adjust the Veeam task slots on the proxy to reduce the number of parallel reads needed for the backup. This might help to avoid the situation.

Post by **labsy** » May 06, 2024 7:00 am

Hi all!

Thank you very much for your responses!

@karsten123: No, iSCSI is not over WAN. The NAS is connected to my home PC on the same LAN and mapped as an SMB network share on my PC.

@Mildur: Understood! But before I raise a (probably paid) ticket, I will try the hint you provided. I did NOT deploy a Veeam VMware proxy on the ESXi host, as I did not know about it. I will look into that; maybe this is what sends all communication over the WAN and slows things down.

@Andreas Neufert: Yes, that's logical. Besides, I did not set up a proxy on the host side, and possibly 4 parallel tasks over WAN are too much. Thanks for the hint!

May 06, 2024 11:12 am

To be honest, if you run ESXi 6.7, your hardware might be 10 years old or so, and the question is why it starts stalling when a VM runs on a snapshot or when the snapshots are deleted. It looks like your disks aren't fast enough. What's the status of the hardware (bad blocks and error counts on the disks, battery of the RAID controller)? Is write-through active?

Please post an esxtop capture where we can see the device latency during backup.
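
If the capture is taken in esxtop batch mode (for example `esxtop -b -d 5 -n 120 > esxtop.csv` on the host), the resulting CSV can be reduced to peak latencies with a short script. This is only a sketch: the counter names listed in `LATENCY_COUNTERS` are what I would expect in the header row and should be checked against the actual file, the 20 ms threshold is just a common rule of thumb, and `peak_latency` is an illustrative name.

```python
import csv
import io

# Assumed counter names (roughly DAVG, KAVG, GAVG in the interactive view).
# Verify them against the header row of your own capture.
LATENCY_COUNTERS = (
    "Average Driver MilliSec/Command",
    "Average Kernel MilliSec/Command",
    "Average Guest MilliSec/Command",
)


def peak_latency(csv_text, threshold_ms=20.0):
    """Return {column name: peak value} for latency columns at or above threshold_ms."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Keep only the columns whose name ends with one of the latency counters.
    wanted = {
        i: name for i, name in enumerate(header)
        if name.endswith(LATENCY_COUNTERS)
    }
    peaks = {}
    for row in reader:
        for i, name in wanted.items():
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip empty or malformed samples
            if value > peaks.get(name, 0.0):
                peaks[name] = value
    return {name: v for name, v in peaks.items() if v >= threshold_ms}
```

Calling `peak_latency(open("esxtop.csv").read())` on a capture taken during the weekly full backup should show which devices, if any, exceed the threshold while the job runs.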

Regards,
Joerg
