dbr
Expert
Posts: 118
Liked: 16 times
Joined: Apr 06, 2017 9:48 am
Full Name: Daniel Brase
Contact:

LSI controller warning since upgrade to VBR 11

Post by dbr »

Hi guys,

I'm facing a strange issue (case #04923012). Since upgrading to VBR 11 in March this year, we have regularly been receiving the following event log entries on our physical backup repository server:

Log: Application
Source: LSA_Monitor
Event ID: 268
Message: Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : Reset. Type: 3, Path: 0x5000c500adb26111.

Log: Application
Source: LSA_Monitor
Event ID: 267
Message: Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : Command timeout; Additional Sense Info: No additional sense information. CDB:  0x88  0x00  0x00  0x00  0x00  0x01  0xc0  0x55  0x90  0x00  0x00  0x00  0x00  0x80  0x00  0x00 .
The issue seems to occur only under load, but we had no issues during the 15 months before that. It occurs only during read peaks, and mainly (but not always) while backups are running and writing to the repository. We also see even higher read peaks with no issues at all. The vendor has run out of ideas as well.

The repository server is already limited to 99 tasks. We previously had an issue with a copy job where the repository server couldn't start more than 755 veeamagent.exe processes. The resolution from support was just to limit the tasks (case #03440861). Unfortunately, I couldn't set a value greater than 99. Anyway, my guess is that since version 11 Veeam either does something differently compared to earlier versions or simply performs better and therefore now runs close to controller capacity. Veeam support told me to set the limit to an even lower value.

Does anyone know what these errors mean? For the Veeam engineers in the forum with access to the case: I've uploaded a bunch of Veeam logs, performance monitor data collectors and event logs. Maybe someone can have a look. Meanwhile, I will lower the task limit to, let's say, 50 and monitor whether the issue occurs again.
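A side note on reading the Event ID 267 message above: the CDB it logs can be decoded by hand. Here is a minimal Python sketch, assuming the standard SCSI 16-byte read/write layout (opcode in byte 0, big-endian LBA in bytes 2-9, transfer length in blocks in bytes 10-13). The function names are just illustrative.

```python
def parse_cdb(text):
    """Turn a string like '0x88  0x00 ...' into a list of byte values."""
    return [int(token, 16) for token in text.split()]


def decode_rw16(cdb):
    """Decode a 16-byte READ(16)/WRITE(16) command descriptor block."""
    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    if len(cdb) != 16 or cdb[0] not in opcodes:
        raise ValueError("not a 16-byte READ(16)/WRITE(16) CDB")
    return {
        "command": opcodes[cdb[0]],
        # Bytes 2-9: logical block address, big-endian
        "lba": int.from_bytes(bytes(cdb[2:10]), "big"),
        # Bytes 10-13: transfer length in logical blocks, big-endian
        "blocks": int.from_bytes(bytes(cdb[10:14]), "big"),
    }


# CDB copied from the Event ID 267 message
CDB_TEXT = ("0x88  0x00  0x00  0x00  0x00  0x01  0xc0  0x55  "
            "0x90  0x00  0x00  0x00  0x00  0x80  0x00  0x00")

decoded = decode_rw16(parse_cdb(CDB_TEXT))
print(decoded["command"], hex(decoded["lba"]), decoded["blocks"])
```

For this message the result is a READ(16) of 128 blocks starting at LBA 0x1c0559000, which would fit the observation that the timeouts show up during read peaks. The size of the transfer in bytes depends on the drive's logical block size, which the log does not show.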

Thanks, Daniel.
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: LSI controller warning since upgrade to VBR 11

Post by PetrM »

Hi Daniel,

This sounds like a complicated technical problem that most probably needs to be investigated by our engineers together with the storage vendor's support. Anyway, let's wait for a recap from our support team based on their analysis of the latest logs. I'm sure our engineers will be able to figure out the most suitable approach to isolate the issue.

Thanks!
dbr

Re: LSI controller warning since upgrade to VBR 11

Post by dbr »

Hi Petr,

I opened this thread because it sounds like the engineer is going to close the case, having offered a lower task limit as the solution. I would appreciate it if Veeam worked on this issue together with the vendor. I have replied to clarify whether the case is really going to be closed or whether we can continue with it. I will keep you posted.
PetrM

Re: LSI controller warning since upgrade to VBR 11

Post by PetrM »

Don't hesitate to request an escalation of the support case if you have a feeling that it moves in the wrong direction.

Thanks!
tsightler
VP, Product Management
Posts: 6013
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: LSI controller warning since upgrade to VBR 11

Post by tsightler » 1 person likes this post

Out of curiosity, do the reported messages cause any actual issues? Also, do you see them with different devices or is it always the same enclosure and device reported?

I've personally seen these errors, or similar, on all types of LSI controllers over the years, especially when the devices are running at maximum capacity, although interestingly I've seen them more on SSD devices than HDD. In my lab I can use a simple benchmark to produce nearly identical messages with my SuperMicro server with 12x SSD drives, but they've never seemed to actually cause any issue.

For example, here's a nearly identical issue with Cisco UCS with Samsung SSD drives: https://quickview.cloudapps.cisco.com/q ... CSCvf13045
dbr

Re: LSI controller warning since upgrade to VBR 11

Post by dbr »

I have to admit that these errors aren't causing any obvious problems. But I generally don't just ignore that kind of error; instead, I always try to find the root cause. OK, that's not always possible, but I try. Therefore I still want to find out what's causing these errors and why they have only occurred since the update to VBR 11. We have configured all our LSI controllers to send alerts if anything is wrong. Even though this is only a warning, I don't want to filter those emails in Outlook, nor do I want my colleagues to do so. As far as I know, I cannot disable only this type of message without suppressing actual errors.

Coincidentally, our repository server is a SuperMicro server, too. I'm not sure if I understand you correctly, but I've seen this error only on this particular machine. However, it's the only machine we use for primary backup. We have a similar machine for copy jobs; the difference is that during backups the affected machine is the only one that is written to and read from in parallel. On the backup copy machine there's only write activity.

I tried to force the issue with multiple diskspd jobs to simulate a copy job with many read tasks and a few write tasks. This should be close to reality, but I couldn't force the issue. You said you were able to produce those messages with a simple benchmark. Do you have something special in mind? The problem at the moment is that I cannot reproduce it; I can only wait for it to happen while a backup is running. Note: I decreased the task limit on the repository from 99 to 50. The error still occurred.
tsightler

Re: LSI controller warning since upgrade to VBR 11

Post by tsightler » 7 people like this post

I understand not wanting to ignore it, but it's difficult to see how Veeam support could be of much help. Veeam is just a heavy filesystem consumer. Yes, we generate a lot of I/O, but we don't have any insight into the hardware underneath; our knowledge stops at OS-level system calls.

Now, I could theorize why you might see the issue with v11 vs prior versions. In v11 we switched to using unbuffered I/O, which, as its name implies, bypasses the OS buffer cache. This generally improves performance for systems with hardware controllers with good buffers (i.e. commercial/enterprise grade hardware), but also puts significantly more stress on the storage subsystem vs buffered I/O. Also, as always, every version includes performance optimizations, so it's possible v11 is simply stressing the I/O more than prior versions. You could potentially even try the UseUnbufferedAccess=0 registry key to go back to v10 I/O behavior and see if that changes anything.

Earlier I asked if the message was always with the same enclosure/device or if it's always different devices. If it's the same one all the time (I'm assuming it's not) then I'd suspect the specific device might be having an issue. Otherwise it seems more likely to be a bus timeout issue which could be triggered by load. Regardless, it's difficult to see how anyone other than the storage support engineers could help solve the issue as the Veeam workload is just the catalyst.
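The same-device-or-random question can also be answered from the logs themselves. Below is a rough Python sketch, assuming the LSA_Monitor message texts have already been exported as plain strings (how they are exported is left open, and `count_by_device` is a made-up helper name). It simply tallies messages per controller, enclosure and device ID.

```python
import re
from collections import Counter

# Pulls controller, enclosure and device IDs out of an LSA_Monitor message.
PD_RE = re.compile(
    r"Controller ID:\s*(\d+).*?EnclosureId:\s*(\d+);\s*DeviceId:\s*(\d+)"
)


def count_by_device(messages):
    """Count messages per (controller, enclosure, device) tuple."""
    counts = Counter()
    for msg in messages:
        match = PD_RE.search(msg)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts


# The two messages quoted at the top of this thread
SAMPLE = [
    "Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : "
    "Reset. Type: 3, Path: 0x5000c500adb26111.",
    "Controller ID: 0 PD - C0  :1:14 (EnclosureId: 67; DeviceId: 8) : "
    "Command timeout; Additional Sense Info: No additional sense information.",
]

for key, n in count_by_device(SAMPLE).most_common():
    print(key, n)
```

If one tuple dominates over weeks of logs, that points at the specific drive; if the counts are spread across many devices, a load-triggered bus timeout looks more likely.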

In my case, the messages happened to random devices/drives, always during heavy I/O, but not with much consistency. Sometimes it would go a few days without a message, other times I'd get multiple messages in an hour, always on a random drive, sometimes dozens or even hundreds in a night. I tried new drive firmware, latest controller firmware, even tweaks to timeout settings in the LSI BIOS, but I never managed to impact the messages in any measurable way. I eventually just ignored them and it's been operating that way for years.

Another question, do you have any SSDs in this setup at all? Just curious.

For my case I was able to produce the issue by using the iozone benchmark in throughput mode with many parallel tasks, usually something like:

iozone -I -r 512k -t 8 -s 2g
This is only 8 parallel threads, each reading/writing a 2G file, but you can tweak those. The -r 512k is the record size, which matches an average Veeam block size, and the -I parameter tells it to use direct I/O, similar to how v11 works. Try that at 99 (or 50) tasks and see how it works, although you will probably want to shrink the file size (-s) if you don't want to wait forever.
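To size such a run before starting it, here is a small Python sketch. The helper names are made up for illustration, and it only builds the command line and estimates the total data set; it does not run iozone. It assumes each task works on its own file of the given size, as described above.

```python
import shlex


def iozone_command(tasks, file_size_mib, record_kib=512, tests=()):
    """Build an iozone throughput-mode command line with direct I/O (-I)."""
    cmd = ["iozone", "-I"]
    for test in tests:  # optional: restrict the tests via -i (0 = write, 1 = read)
        cmd += ["-i", str(test)]
    cmd += ["-r", f"{record_kib}k", "-t", str(tasks), "-s", f"{file_size_mib}m"]
    return cmd


def total_footprint_gib(tasks, file_size_mib):
    """Each task uses its own file, so the data set is tasks * file size."""
    return tasks * file_size_mib / 1024


# 8 tasks at 2 GiB each, then 99 tasks at 2 GiB and at 256 MiB per file
for tasks, size_mib in [(8, 2048), (99, 2048), (99, 256)]:
    line = shlex.join(iozone_command(tasks, size_mib))
    print(f"{line}  -> {total_footprint_gib(tasks, size_mib):g} GiB")
```

At 99 tasks the 2 GiB file size adds up to 198 GiB of test data, while 256 MiB per task keeps it under 25 GiB.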

These are difficult problems to solve for sure, so I wish you luck and I hope you keep us updated on what you find.
dbr

Re: LSI controller warning since upgrade to VBR 11

Post by dbr » 2 people like this post

I fully understand what you're saying, and I had exactly the same in mind (Veeam's performance is getting better and better, Veeam is just using the underlying hardware, and so on). That's why we contacted our vendor first. I agree that this is not a Veeam problem, and I really appreciate the help from support and the forums; I have high praise for both in general, not only for this case. Anyway, I'm still in contact with Veeam support. In parallel, I will try to force the issue with iozone on the primary system. Once the error is thrown, I will test a similar system to see whether it is related to a single machine or is a common issue. Thanks for your input so far. I will keep you posted on this.
dbr

Re: LSI controller warning since upgrade to VBR 11

Post by dbr »

FYI: We already have UseUnbufferedAccess=0 configured, due to a bug related to ReFS / FastCloning. My case #04843767 should be related to this. I have set the task limit to 4, just to see if the error still occurs, and iozone is still on my list.
dbr

Re: LSI controller warning since upgrade to VBR 11

Post by dbr » 2 people like this post

Meanwhile, I tested with the following syntax, with no luck:

iozone -I -i 0 -i 1 -r 512k -t 99 -s 2g
With a task limit of 4 the issue doesn't seem to occur, but that's not an option for us because a second, offsite copy job is "occupying" 2 tasks at only low throughput. That slows down the overall performance of the onsite backup copy job. Anyway, the case has been escalated to second level, and I will continue working with the engineer there. Once I have news, I will post it here.