Comprehensive data protection for all workloads
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

timeouts on LVM - linux guys please help

Post by mcz »

Hi everyone,

on my hardened repository server there are two LUNs. I've joined them into a single volume via LVM and everything works fine, except the health checks, which always fail. Support first suggested that it might be due to storage corruption (which is not the case), but after a while I got confirmation that it is caused by the mix of a fast and a slower LUN (HDDs vs SSDs). The error I always get is:
Jun 8 22:01:45 host kernel: blk_update_request: I/O error, dev sdc, sector 407982080 op 0x0:(READ) flags 0x0 phys_seg 13 prio class 0
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=6s
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 Sense Key : Aborted Command [current]
Jun 8 22:01:45 host kernel: sd 0:1:0:2: [sdc] tag#98 Add. Sense: Timeout on logical unit
Now support has said that this makes sense because the SSDs would wait for the HDDs... I don't understand this explanation. If a process asks to read a certain offset within a file, the process doesn't know which LUN the data was written to. Only the file system / LVM knows that, and it then reads from the specific LUN(s). But those two LUNs don't interact with each other, so why should one wait for the other? The OS waits until all data has been fetched from both, yes, but I don't see how they could deadlock each other...
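For what it's worth, you can check exactly how LVM has laid the logical volume out across the two LUNs, i.e. which physical device backs which segment. A quick sketch (the device names below are taken from the kernel log above; your PV and VG names will differ):

```shell
# Show which physical volumes (LUNs) back each logical volume,
# segment by segment. The 'devices' column reveals whether a read
# at a given offset lands on the HDD LUN or the SSD LUN.
lvs -o lv_name,seg_start,seg_size,devices

# Block-device view of the same stack: LUNs, partitions, LVs.
# /dev/sdc is the device from the error messages; adjust as needed.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```

If the volume is linear (not striped), each extent lives on exactly one LUN, which matches your intuition that the two LUNs shouldn't need to wait for each other on a single read.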

Anyway, my goal is to get this working, and support mentioned that there are no options to fine-tune it, which surprises me. My assumption is that there is a certain timeout, and once it expires the read operation fails. It's just interesting that it worked before with only the HDD LUN, so adding the faster SSD LUN to the volume has changed the dynamics.
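There is indeed such a timeout on the Linux side: the kernel's per-device SCSI command timer, typically 30 seconds by default, after which the command is aborted ("Timeout on logical unit"). It is tunable via sysfs; a sketch, assuming sdc is the affected device as in the log:

```shell
# Inspect the kernel's per-device SCSI command timeout (seconds).
# The default is usually 30.
cat /sys/block/sdc/device/timeout

# Raise it to 180 s as an experiment. This is not persistent
# across reboots; a udev rule would be needed to make it stick.
echo 180 > /sys/block/sdc/device/timeout
```

One caveat: whether this helps depends on where the abort originates. The `cmd_age=6s` in your log suggests the command was aborted after only ~6 seconds, which may point at the RAID controller firmware rather than the kernel timer, and controller timeouts are not tunable from sysfs.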

Any ideas why this might suddenly fail and what I could do to get it working again? We tried changing the read throughput limit on the repository, but the health check ignores it.
Could the IO on the volume itself perhaps be throttled? Less IO would result in better latency...
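Outside of the backup software, the OS can throttle IO per block device via cgroup v2 (`io.max`). A hedged sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup with the io controller enabled, and using placeholder limits; the cgroup name and the PID are hypothetical:

```shell
# Find the major:minor number of the slow LUN (sdc from the log).
MAJMIN=$(lsblk -nd -o MAJ:MIN /dev/sdc | tr -d ' ')

# Create a cgroup and cap reads at ~100 MB/s and 2000 read IOPS
# on that device (example limits; tune for your hardware).
mkdir -p /sys/fs/cgroup/backup-throttle
echo "$MAJMIN rbps=104857600 riops=2000" > /sys/fs/cgroup/backup-throttle/io.max

# Move the data mover process into the cgroup ($AGENT_PID is a
# placeholder for the actual PID of the repository agent).
echo "$AGENT_PID" > /sys/fs/cgroup/backup-throttle/cgroup.procs
```

Lower queue depth on the HDD LUN generally means lower per-request latency, which is presumably what you are after; whether it avoids the aborts in practice would need testing.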

Thanks!
david.domask
Veeam Software
Posts: 2688
Liked: 620 times
Joined: Jun 28, 2016 12:12 pm
Contact:

Re: timeouts on LVM - linux guys please help

Post by david.domask »

Hi mcz,

Thank you for the write-up; can you please share the case number for us to review? I'm not quite sure I follow the explanation either, and it would be great to review the discussion in the case.

As for repository IO throttling: indeed, health checks are not covered by this option. If you could share the case number, we can better comment on the behavior.

Thanks!
David Domask | Product Management: Principal Analyst
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

Re: timeouts on LVM - linux guys please help

Post by mcz »

Hi David,

sure! Case: #07295030

Thanks!
mcz
Veteran
Posts: 948
Liked: 223 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

Re: timeouts on LVM - linux guys please help

Post by mcz »

OK, I think I can share the outcome of my conversation with David: we disabled asynchronous data processing on that particular repository. That obviously generates less IO, which no longer stresses the HDD LUN as much, so it's working as before.

Thank you David!