LSI controller warning since upgrade to VBR 11

Post by **dbr** » Jul 27, 2021 8:37 pm

Hi guys,

I'm facing a strange issue (case #04923012). Since upgrading to VBR 11 in March this year, we have regularly received the following event log entries on our physical backup repository server:

Code: Select all

Log: Application
Source: LSA_Monitor
Event ID: 268
Message: Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : Reset. Type: 3, Path: 0x5000c500adb26111.

Log: Application
Source: LSA_Monitor
Event ID: 267
Message: Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : Command timeout; Additional Sense Info: No additional sense information. CDB:  0x88  0x00  0x00  0x00  0x00  0x01  0xc0  0x55  0x90  0x00  0x00  0x00  0x00  0x80  0x00  0x00 .

The issue seems to occur only under load, but we had no issues at all during the 15 months before the upgrade. It occurs only during read peaks, and mainly (though not every time) while backups are running and writing to the repository. We also see even higher read peaks with no issues. The vendor has run out of ideas as well.

The repository server is already limited to 99 tasks. We previously had an issue with a copy job where the repository server couldn't start more than 755 veeamagent.exe processes, and the resolution from support was simply to limit the tasks (case #03440861). Unfortunately, I couldn't set a value greater than 99. Anyway, since version 11 Veeam is either doing something different compared to earlier versions or simply performs better and therefore may now run close to the controller's capacity. Veeam support told me to set the limit to an even lower value.

Does anyone know what these errors mean? For our Veeam engineers in the forum with access to the case: I've uploaded a bunch of Veeam logs, performance monitor data collectors and event logs. Maybe someone can have a look. Meanwhile, I will lower the task limit to, let's say, 50 and monitor whether the issue occurs again.
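As a reading aid for the event 267 message above, here is a minimal Python sketch that decodes the quoted CDB. It assumes the standard 16-byte SCSI layout for opcode 0x88, READ(16): an 8-byte big-endian logical block address in bytes 2-9 and a 4-byte transfer length in bytes 10-13. The function name `decode_cdb16` is hypothetical and not part of any LSI or Veeam tool.

```python
import re

# SCSI opcodes relevant here: 0x88 is READ(16), 0x8A is WRITE(16).
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}


def decode_cdb16(cdb_text):
    """Decode a 16-byte READ(16)/WRITE(16) CDB given as '0x..' tokens."""
    raw = bytes(int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]{2}", cdb_text))
    if len(raw) != 16:
        raise ValueError(f"expected 16 CDB bytes, got {len(raw)}")
    return {
        "opcode": OPCODES.get(raw[0], hex(raw[0])),
        "lba": int.from_bytes(raw[2:10], "big"),      # bytes 2-9: logical block address
        "blocks": int.from_bytes(raw[10:14], "big"),  # bytes 10-13: transfer length
    }


# CDB copied from the event 267 message in this post.
cdb = ("0x88  0x00  0x00  0x00  0x00  0x01  0xc0  0x55  "
       "0x90  0x00  0x00  0x00  0x00  0x80  0x00  0x00")
info = decode_cdb16(cdb)
print(info["opcode"], hex(info["lba"]), info["blocks"])
```

Under those assumptions, the command that timed out was a read of 128 blocks starting at LBA 0x1C0559000, which fits the observation that the warnings appear during read peaks. The size in bytes depends on the drive's logical block size (64 KiB if the drive uses 512-byte blocks, which is not stated in the log).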

Thanks, Daniel.

Post by **PetrM** » Jul 27, 2021 9:27 pm

Hi Daniel,

This sounds like a complicated technical problem that most probably needs to be investigated by our engineers together with the storage vendor's support. Anyway, let's wait for a recap from our support team based on their analysis of the latest logs. I'm sure our engineers will be able to figure out the most suitable approach to isolate the issue.

Thanks!

Post by **dbr** » Jul 27, 2021 9:35 pm

Hi Petr,

I opened this thread because it sounds like the engineer is going to close the case, having offered a lower task limit as the solution. I would appreciate it if Veeam would work on this issue together with the vendor. I have replied to clarify whether the case is really going to be closed or whether we can continue with it. I will keep you posted.

Post by **PetrM** » Jul 27, 2021 9:43 pm

Don't hesitate to request an escalation of the support case if you have a feeling that it moves in the wrong direction.

Thanks!

Jul 28, 2021 1:05 am

Out of curiosity, do the reported messages cause any actual issues? Also, do you see them with different devices or is it always the same enclosure and device reported?

I've personally seen these errors, or similar, on all types of LSI controllers over the years, especially when the devices are running at maximum capacity, although interestingly I've seen them more on SSD devices than HDD. In my lab I can use a simple benchmark to produce nearly identical messages with my SuperMicro server with 12x SSD drives, but they've never seemed to actually cause any issue.

For example, here's a nearly identical issue with Cisco UCS with Samsung SSD drives: https://quickview.cloudapps.cisco.com/q ... CSCvf13045

Post by **dbr** » Jul 28, 2021 7:26 am

I have to admit that these errors aren't causing any obvious problems. But I generally don't just ignore that kind of error; instead, I always try to find the root cause. OK, that's not always possible, but I try. Therefore I still want to find out what's causing these errors and why they have only occurred since the update to VBR 11. We have configured all our LSI controllers to send alerts if anything is wrong. Even though this is only a warning, I don't want to filter those emails in Outlook, nor do I want my colleagues to do so. As far as I know, I cannot disable only this type of message without suppressing actual errors. Coincidentally, our repository server is a SuperMicro server, too. I'm not sure if I understand you correctly, but I've seen this error only on this particular machine. However, it's the only machine we use for primary backup. We have a similar machine for copy jobs; the difference is that during backups the affected machine is the only one that is written to and read from in parallel. On the backup copy machine there's only write activity.

I tried to force the issue with multiple diskspd jobs to simulate a copy job with many read tasks and a few write tasks. This should be close to reality, but I couldn't trigger it. You said you were able to produce those messages with a simple benchmark. Do you have something special in mind? The problem at the moment is that I cannot reproduce the issue other than by waiting for it to occur while a backup is running. Note: I decreased the task limit on the repository from 99 to 50. The error still occurred.

Jul 28, 2021 3:25 pm

I understand not wanting to ignore it, but it's difficult to see how Veeam support could be of much help. Veeam is just a heavy filesystem consumer. Yes, we generate a lot of I/O, but we don't have any insight into the hardware underneath; our knowledge stops at OS-level system calls.

Now, I could theorize why you might see the issue with v11 vs prior versions. In v11 we switched to using unbuffered I/O, which, as its name implies, bypasses the OS buffer cache. This generally improves performance for systems with hardware controllers with good buffers (i.e. commercial/enterprise grade hardware), but also puts significantly more stress on the storage subsystem vs buffered I/O. Also, as always, every version includes performance optimizations, so it's possible v11 is simply stressing the I/O more than prior versions. You could potentially even try the UseUnbufferedAccess=0 registry key to go back to v10 I/O behavior and see if that changes anything.

Earlier I asked if the message was always with the same enclosure/device or if it's always different devices. If it's the same one all the time (I'm assuming it's not) then I'd suspect the specific device might be having an issue. Otherwise it seems more likely to be a bus timeout issue which could be triggered by load. Regardless, it's difficult to see how anyone other than the storage support engineers could help solve the issue as the Veeam workload is just the catalyst.
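To check the same-device versus random-device question against exported event log text, a small tally like the following Python sketch could help. The regular expression is modeled only on the two LSA_Monitor messages quoted at the top of this thread, so other controller or driver versions may need a different pattern, and `tally_by_device` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern modeled on the two LSA_Monitor messages quoted in this thread;
# other controller or driver versions may word their events differently.
PD_EVENT = re.compile(
    r"EnclosureId:\s*(?P<enc>\d+);\s*DeviceId:\s*(?P<dev>\d+)\)\s*:\s*"
    r"(?P<kind>Reset|Command timeout)"
)


def tally_by_device(messages):
    """Count reset/timeout events per (enclosure, device) pair."""
    counts = Counter()
    for msg in messages:
        m = PD_EVENT.search(msg)
        if m:
            counts[(int(m["enc"]), int(m["dev"]))] += 1
    return counts


# The two messages from the first post (CDB omitted for brevity).
sample = [
    "Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : "
    "Reset. Type: 3, Path: 0x5000c500adb26111.",
    "Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : "
    "Command timeout; Additional Sense Info: No additional sense information.",
]
counts = tally_by_device(sample)
for (enc, dev), n in counts.most_common():
    print(f"enclosure {enc} device {dev}: {n} event(s)")
```

If one pair dominates the tally over a longer period, that points toward the single-device explanation; an even spread across devices would be more consistent with a load-triggered bus timeout.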

In my case, the messages happened to random devices/drives, always during heavy I/O, but not with much consistency. Sometimes it would go a few days without a message, other times I'd get multiple messages in an hour, always on a random drive, sometimes dozens or even hundreds in a night. I tried new drive firmware, latest controller firmware, even tweaks to timeout settings in the LSI BIOS, but I never managed to impact the messages in any measurable way. I eventually just ignored them and it's been operating that way for years.

Another question, do you have any SSDs in this setup at all? Just curious.

For my case I was able to produce the issue by using the iozone benchmark in throughput mode with many parallel tasks, usually something like:

Code: Select all

iozone -I -r 512k -t 8 -s 2g

This is only 8 parallel threads, each reading/writing a 2G file, but you can tweak those. The -r 512k is the record size, which matches an average Veeam block size, and the -I parameter tells it to use direct I/O, similar to how v11 works. Try that at 99 (or 50) tasks and see how it works, although you will probably want to shrink the file size (-s) if you don't want to wait forever.
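One way to scale the command to higher task counts without the run taking forever is to hold the total amount of data fixed and divide it across the threads. The Python sketch below does only that arithmetic and prints the resulting command lines; the 16 GiB total (8 x 2g, as in the command above) is an assumption, `iozone_cmd` is a hypothetical helper, and nothing here guarantees the warning will actually be reproduced.

```python
import shlex


def iozone_cmd(tasks, total_mib=16 * 1024, record="512k"):
    """Build an iozone throughput-mode command line.

    total_mib is the combined size of all per-thread files; the default of
    16 GiB equals the 8 x 2g of the command above. Splitting it across
    `tasks` threads keeps the total amount of data written and read fixed.
    """
    per_file_mib = max(1, total_mib // tasks)
    return [
        "iozone",
        "-I",                      # direct (unbuffered) I/O
        "-r", record,              # record size
        "-t", str(tasks),          # number of parallel threads
        "-s", f"{per_file_mib}m",  # per-thread file size
    ]


for tasks in (8, 50, 99):
    print(shlex.join(iozone_cmd(tasks)))
```

For 8 tasks this yields the same per-thread size as the original command (2048m is 2g); for 99 tasks each thread gets a 165 MiB file.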

These are difficult problems to solve for sure, so I wish you luck and I hope you keep us updated on what you find.

Post by **dbr** » Jul 29, 2021 7:00 am

I fully understand what you're saying, and I had exactly the same in mind (Veeam's performance is getting better and better, Veeam is just using the underlying hardware, and so on). That's why we contacted our vendor first. I agree that this is not a Veeam problem, and I really appreciate the help from support and the forums, in general and not only for this case. Anyway, I'm still in contact with Veeam support. In parallel, I will try to force the issue with iozone on the primary system. Once the error is thrown, I will test a similar system to see whether it is related to a single machine or is a common issue. Thanks for your input so far. I will keep you posted on this.

Post by **dbr** » Jul 29, 2021 1:02 pm

FYI: We already have UseUnbufferedAccess=0 configured, due to a bug related to ReFS / FastCloning. My case #04843767 should be related to this. I set the task limit to 4, just to see if the error still occurs, and I still have iozone on my list.

Post by **dbr** » Aug 02, 2021 2:56 pm

Meanwhile, I tested with the following syntax, with no luck:

Code: Select all

iozone -I -i 0 -i 1 -r 512k -t 99 -s 2g

With a task limit of 4 the issue doesn't seem to occur, but that's not an option for us because a second, offsite copy job is "occupying" 2 tasks with only low throughput. That slows down overall performance for the onsite backup copy job. Anyway, the case has been escalated to second level and I will continue working with the engineer there. Once I have news, I will post it here.
