Comprehensive data protection for all workloads
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

timeouts on LVM - linux guys please help

Post by mcz »

Hi everyone,

on my hardened repository server there are two LUNs. I've joined them into a single volume via LVM and everything works fine, except the health checks, which always fail. Support first suggested that it might be due to storage corruption (which is not the case), but after a while I got confirmation that it is caused by the mix of a fast and a slower LUN (HDDs vs SSDs). The error I always get is:
Jun 8 22:01:45 host kernel: blk_update_request: I/O error, dev sdc, sector 407982080 op 0x0:(READ) flags 0x0 phys_seg 13 prio class 0
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=6s
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 Sense Key : Aborted Command [current]
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 Add. Sense: Timeout on logical unit
Now support has said that this makes sense because the SSDs would wait for the HDDs... I don't understand this explanation. If a process asks to read a certain offset within a file, the process doesn't know which LUN the data was written to. Only the file system / LVM knows that, and it then reads from the specific LUN(s). But those two LUNs don't interact with each other, so why should one wait for the other? The OS waits until all data has been fetched from both, yes, but I don't see how they could deadlock each other...
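For what it's worth, you can check exactly how LVM has laid the logical volume out across the two LUNs, i.e. which physical device backs which segment. A quick sketch (the device names below are taken from the kernel log above; your PV and VG names will differ):

```shell
# Show which physical volumes (LUNs) back each logical volume,
# segment by segment. The 'devices' column reveals whether a read
# at a given offset lands on the HDD LUN or the SSD LUN.
lvs -o lv_name,seg_start,seg_size,devices

# Block-device view of the same stack: LUNs, partitions, LVs.
# /dev/sdc is the device from the error messages; adjust as needed.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```

If the volume is linear (not striped), each extent lives on exactly one LUN, which matches your intuition that the two LUNs shouldn't need to wait for each other on a single read.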

Anyway, my goal is to get this working, and support mentioned that there are no options to fine-tune it, which surprises me. My assumption is that there is a certain timeout, and once it expires the read operation fails. It's just interesting that it worked before with only the HDD LUN, so adding the faster SSD LUN to the volume has changed the dynamics.
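There is indeed such a timeout on the Linux side: the kernel's per-device SCSI command timer, typically 30 seconds by default, after which the command is aborted ("Timeout on logical unit"). It is tunable via sysfs; a sketch, assuming sdc is the affected device as in the log:

```shell
# Inspect the kernel's per-device SCSI command timeout (seconds).
# The default is usually 30.
cat /sys/block/sdc/device/timeout

# Raise it to 180 s as an experiment. This is not persistent
# across reboots; a udev rule would be needed to make it stick.
echo 180 > /sys/block/sdc/device/timeout
```

One caveat: whether this helps depends on where the abort originates. The `cmd_age=6s` in your log suggests the command was aborted after only ~6 seconds, which may point at the RAID controller firmware rather than the kernel timer, and controller timeouts are not tunable from sysfs.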

Any ideas why this might suddenly fail and what I could do to get it working again? We tried changing the read throughput limit on the repository, but the health check ignores it.
Could the IO on the volume itself perhaps be throttled? Less IO would result in better latency...
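Outside of the backup software, the OS can throttle IO per block device via cgroup v2 (`io.max`). A hedged sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup with the io controller enabled, and using placeholder limits; the cgroup name and the PID are hypothetical:

```shell
# Find the major:minor number of the slow LUN (sdc from the log).
MAJMIN=$(lsblk -nd -o MAJ:MIN /dev/sdc | tr -d ' ')

# Create a cgroup and cap reads at ~100 MB/s and 2000 read IOPS
# on that device (example limits; tune for your hardware).
mkdir -p /sys/fs/cgroup/backup-throttle
echo "$MAJMIN rbps=104857600 riops=2000" > /sys/fs/cgroup/backup-throttle/io.max

# Move the data mover process into the cgroup ($AGENT_PID is a
# placeholder for the actual PID of the repository agent).
echo "$AGENT_PID" > /sys/fs/cgroup/backup-throttle/cgroup.procs
```

Lower queue depth on the HDD LUN generally means lower per-request latency, which is presumably what you are after; whether it avoids the aborts in practice would need testing.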

Thanks!
david.domask
Veeam Software
Posts: 2688
Liked: 620 times
Joined: Jun 28, 2016 12:12 pm
Contact:

Re: timeouts on LVM - linux guys please help

Post by david.domask »

Hi mcz,

Thank you for the write-up; can you please share the case number for us to review? I'm not quite sure I follow the explanation either, and it would be great to review the discussion in the case.

As for repository IO throttling: indeed, health checks are not covered by this option. If you could share the case number, we can better comment on the behavior.

Thanks!
David Domask | Product Management: Principal Analyst
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

Re: timeouts on LVM - linux guys please help

Post by mcz »

Hi David,

sure! Case: #07295030

Thanks!
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

Re: timeouts on LVM - linux guys please help

Post by mcz »

OK, I think I can share the outcome of my conversation with David: we disabled asynchronous data processing on that particular repository. That obviously generates less IO, which no longer stresses the HDD LUN as much, so it's working as before.

Thank you David!