Equallogic issue

Post by **theflakes** » Dec 03, 2010 6:43 pm

# Fix for an issue which may have reported healthy SATA drives to be erroneously marked as failed.

We've been getting calls from Equallogic about the above issue; they want us to upgrade to 5.0.2 or 4.3.7. We are currently on 4.2.1. Do any Equallogic customers have insight into whether this is a universal problem or one that occurs only in specific circumstances?

I seriously despise upgrading storage-related hardware except to fix a known, existing, serious problem.

Off topic I know so please forgive me, but this is a very good and knowledgeable forum for this question.

Post by **theflakes** » Dec 03, 2010 8:26 pm

This only affects the 4.3 and 5 firmware versions for anyone out there who is also wondering.
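A minimal sketch of that rule as a version check, for anyone who wants to script it across several groups. It assumes, based only on what is said in this thread, that the affected trains are 4.3.x and 5.0.x and that the fix landed in 4.3.7 and 5.0.2; the function names and the table are hypothetical, so check the vendor release notes before relying on it.

```python
# Fix levels per firmware train. ASSUMPTION: taken from this thread only
# (4.3.x fixed in 4.3.7, 5.0.x fixed in 5.0.2), not from vendor documentation.
FIXED_IN = {
    (4, 3): (4, 3, 7),
    (5, 0): (5, 0, 2),
}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '4.2.1' into (4, 2, 1)."""
    return tuple(int(part) for part in version.split("."))


def possibly_affected(version: str) -> bool:
    """Return True if the version is in an affected train and below its fix level."""
    parsed = parse_version(version)
    fixed = FIXED_IN.get(parsed[:2])
    # Trains not listed above (for example 4.2.x) are treated as unaffected.
    return fixed is not None and parsed < fixed


if __name__ == "__main__":
    for candidate in ("4.2.1", "4.3.6", "4.3.7", "5.0.1", "5.0.2"):
        print(candidate, possibly_affected(candidate))
```

By this reading, a group on 4.2.1 is outside the affected trains, which is why the upgrade is not urgent for that issue alone.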

Post by **joergr** » Dec 03, 2010 10:59 pm

Yeah, these firmware versions sometimes report 100% healthy drives as dead ;-( but it's extremely rare.

Anyhow - I suggest you upgrade. 4.3.7 and also 5.0.2 are EXTREMELY stable and reliable firmware versions. BOTH. In my opinion the best EQL firmwares ever when speaking of stability.

best regards,
Joerg

Post by **tsightler** » Dec 04, 2010 1:15 am

joergr wrote:Anyhow - I suggest you upgrade. 4.3.7 and also 5.0.2 are EXTREMELY stable and reliable firmware versions. BOTH. In my opinion the best EQL firmwares ever when speaking of stability.

Funny, I don't know how this could even be determined yet. 4.3.7 has been out for 90 days, 5.0.2 for 60 days. Stability of enterprise hardware is measured in years. We really won't know if it is stable for quite some time yet.

That being said, I was bitten by the "good drive marked bad" issue (although it caused no serious problem other than major concern) and I seriously doubt that 4.3.7 has any chance of being less stable than any of the previous versions of the 4.x series of firmware. It was a 4.1.7 firmware that ate our RAID just because of a failed drive.

I'm holding off on upgrading to the 5.0 train because of the incredibly bad history of the 4.x code in our environment. It certainly didn't instill confidence when the first release of 5.0 had serious issues and had to be recalled, although admittedly only with a new feature. 4.3.7 is at the tail end of a branch that has finally become stable, so that's my pick for a truly stable release. Of course, there are good reasons to use 5.0.2, namely VAAI, but I've been bitten too many times by EQL firmware, so I'm waiting.

Post by **joergr** » Dec 04, 2010 8:54 am

Hi Tom,

Yeah, sorry, that is my fault, and by the way it sounded very smart-alecky. I have to explain this:

We have had some test arrays from Dell-EQL for many years now, used only in our labs for testing and "torturing". We also get the firmware early but are not allowed to talk about it before it is released. So we like to do aggressive research on the general stability of these versions (for example, we found some of the very first reported bugs in the firmware of the new 10 Gb controllers and worked them out together with EQL, even sending controllers by express across the Atlantic and testing, testing, and testing). We do all this because we think EQL is such an innovative product that it deserves our effort, and besides, I like doing research (as I do with Veeam, also a very innovative product).

And my opinion (or my feeling, I don't know?) is that 4.3.7 and 5.0.2 are both extremely stable.

They survived all of our really nasty tests without any issues. I thought I had already mentioned this in a post a few months ago.

But then again, I must admit I mainly measure stability after a few weeks in our testing labs and not over a long timeline in production. Maybe I have to adjust my definition and my view on this one to also include a good and long enough timeframe.

Regarding the 4.1.7 problem you described: did you manually adjust the number of hot spares via CLI tweaking, or did you completely remove all hot spares via the CLI? That might have caused the trouble you just described, and it was a HUGE bug, to say the least. And you are right. There were many firmware versions which I didn't install because they had serious issues.

Personally, I have 4.3.7 on our huge production group (10 Gb, 120 TB native, 50 TB logical, RAID 10) and 5.0.2 on our "not soooo important" groups and all the replication target groups. And I never, ever do firmware upgrades live in production: either I shut everything down, replicate, and then do the upgrade, or I svMotion everything to a completely different group before any update.

best regards,
Joerg

Post by **tsightler** » Dec 04, 2010 10:49 pm

Didn't mean to be a smart-aleck, I'm just not a big fan of calling something "stable" when it's only been out a few months and is likely only running on a very small subset of their deployed systems. In my opinion, the stability of a given firmware is proven by wide deployment across a large number of environments for a reasonable amount of time. No matter how expansive your lab, it's still only a very small sampling of deployed systems, and it likely provides almost no sampling (or certainly only minimal) of how the firmware handles various real-world failures with drives and controllers, which is where most problems seem to be for all vendors, not just EQL.

We also have EQL arrays just for testing that we pound on, but pounding still doesn't do much for proving long term stability since most of the problems show up when you have "unexpected" issues, like a controller restart because of a link or drive failure, then having poor throughput after said failure. I've been involved with EQL beta firmware since before Dell purchased them.

On the array we had that ate our data, we were running RAID6 with no spares. We found that we typically replaced drives before they had finished rebuilding onto the hot spare anyway, so we decided that RAID6 with no spares would be just as safe as RAID50 with spares while providing additional capacity and only a slight performance degradation in our environment. We had a drive fail, but it took a few hours before we were able to swap it. Shouldn't be any big deal. Unfortunately, after just a couple of hours, the volumes on the array began to show massive corruption. There were no indications of other errors on the array, no additional failed drives, but there were corrupt blocks all over the place for all of the volumes hosted on it. There's really no excusing that, since RAID6 with a single failed drive is still a redundant configuration (that's the whole point).
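The capacity and redundancy trade-off described above can be sketched with some simple arithmetic. This is only an illustration under stated assumptions: a hypothetical 16-drive shelf, RAID50 built from two RAID5 sets with two hot spares, and RAID6 as a single set with no spares. The real EQL layouts and the actual drive count of the array in question may differ.

```python
def raid6_data_drives(total_drives: int, spares: int = 0) -> int:
    """Drives' worth of usable capacity in one RAID6 set (two parity drives)."""
    return total_drives - spares - 2


def raid50_data_drives(total_drives: int, spares: int = 2, raid5_sets: int = 2) -> int:
    """Drives' worth of usable capacity in RAID50 (one parity drive per RAID5 set)."""
    return total_drives - spares - raid5_sets


def remaining_guaranteed_tolerance(guaranteed: int, failed: int) -> int:
    """How many further arbitrary drive failures are still guaranteed survivable."""
    return max(guaranteed - failed, 0)


if __name__ == "__main__":
    total = 16  # hypothetical shelf size, not taken from the thread
    print("RAID6, no spares:", raid6_data_drives(total), "data drives")
    print("RAID50, 2 spares:", raid50_data_drives(total), "data drives")
    # RAID6 guarantees survival of any two failures; after one failed drive
    # and before any rebuild, one more failure is still survivable.
    print("RAID6 after one failure:", remaining_guaranteed_tolerance(2, 1))
    # RAID50 only guarantees one arbitrary failure until a spare has rebuilt.
    print("RAID50 after one failure:", remaining_guaranteed_tolerance(1, 1))
```

Under these assumptions RAID6 without spares yields two more drives of capacity and still has guaranteed redundancy with one drive down, which is the reasoning behind the configuration choice described in the post.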

Post by **joergr** » Dec 04, 2010 11:10 pm

No, no, I meant that I myself must have sounded like a smart-aleck.

Oh my god, data corruption? That is extremely bad. Did you get it solved with EQL, or was your data left corrupt/destroyed? We, too, have been using EQL for a very long time. I can't remember when I ordered the first PS300; these were the ones with the SATA drives arranged sideways, long before the Dell acquisition. But back to the corruption: the only cause I knew of was that the array (when it had no hot spares) tried to add a hot spare before any other action, couldn't find one, hit a code glitch and, boom, set all volumes offline. But data corruption is much, much worse. What happened? Did they cure it?

best regards,
Joerg

Post by **Gostev** » Dec 04, 2010 11:16 pm

AFAIK they did not, Veeam Backup came to the rescue...
Here is the whole story > Equallogic Eats Our Data

Post by **tsightler** » Dec 04, 2010 11:50 pm

Because Equallogic was so slow in responding we ended up reinitializing the array after 3-4 days of "we don't see anything wrong" and "how do you know the array caused the corruption?". I wrote the blog entry that Anton posted largely out of frustration, but it didn't take long after that before high level people at Equallogic were calling me. They eventually sent me a replacement PS6000E (a nice upgrade) and took the PS400E back for "analysis". Never really came up with anything though.

Interestingly, it was the PS6000E on which I experienced the "failed drive that's really good" scenario. We had just placed it into production and put a few small VM's on it to let it run for a few days. Over the weekend I received an alert that there was a drive failure on the array, however, when I logged into the interface, the RAID status showed "Normal" even though it was showing a failed drive. Turns out the problem in this scenario was that the "bad block" was actually in the reserved area, not in the RAID area, so the management controller was reporting the area, but not acting to remove the drive from the RAID. Turns out not a big deal, but still not comforting.

It's funny because we purchased Equallogic after becoming extremely dissatisfied with EMC due to several major issues with controller failures and a couple of cases where we experienced a few blocks of corruption due to a firmware bug. We ran Equallogic for three years without a single stability issue, then everything went bad in the span of a few months.

All of our problems happened after upgrading to 4.1.x firmware and using the new RAID6 features. In the end, it might have been bad luck, but it made us gun shy for new firmware upgrades and new features so now we let them sit out there for a while. Of course obviously someone has to be the early adopter, so we thank you for all your testing.

Post by **joergr** » Dec 05, 2010 9:02 am

Holy shit, that sounds very bad indeed. I never tested RAID6 with EQL; I only use RAID6 on the replication-target side, but I guess, even if there was some bad block over the years, I wouldn't have noticed it. In production we always use RAID10, but yeah, that's quite expensive, I have to admit. I will read the whole story in a few minutes; it sounds extremely interesting and also important.

Best regards,
Joerg

Post by **Gostev** » Dec 05, 2010 10:48 am

joergr wrote: I guess, even if there was some bad block over the years, I wouldn't have noticed it.

This is why you need SureBackup! Back up your VMs, and test anything you want on these VMs "offline", without loading production storage.

Post by **joergr** » Dec 05, 2010 11:18 am

Yeah, Anton is right.

Tom, when this corruption occurred, were you also replicating these LUNs? Have you checked whether the replicas got corrupted, too? This would be very interesting for me to know.

God. LUN corruption is my personal nightmare No. 1 ;-(

Post by **tsightler** » Dec 05, 2010 4:28 pm

We haven't used Equallogic replication in years. Because of its relatively high overhead, in both WAN and storage usage, we calculated its cost as too high even though it's included for free (which is funny, because we paid some real money to use replication with our EMC arrays, which was many times more efficient, especially on the WAN). We decided we would never again lock ourselves in to a storage vendor by becoming so dependent on their underlying features, and we now perform replication either via application-specific capabilities (like Data Guard for Oracle) or by using Veeam. LUNs that were replicated and/or backed up with Veeam in the hour or so before we received alarms did indeed show corrupt blocks, but I have no idea whether it would have impacted an Equallogic-replicated LUN. For a few "less critical" hosts (i.e. hosts that only get backed up or replicated once a day) we had to revert to the previous day's backups.

For the most part we were only minimally impacted for our production environment. Of course, some of this was just lucky because our most critical systems were not on the affected array, and our DR and failover plans mostly worked, but we learned a lot that day, like not to assume that you won't lose a lot of storage during such an event (one of our biggest challenges was to find and allocate 6+TB of data from somewhere else as we didn't have that much space on our primary storage).

It was actually the first time I had ever experienced the complete loss of an array due to corruption in my 20+ years of system administration. We had an issue with our EMC array years ago where we had two unrecoverable blocks due to a firmware bug, but in that case we were able to trace the two blocks to the actual filesystem; one was in a deleted file and the other was in a non-critical system file, so it had no real production impact other than being a lot of trouble to track down.

Post by **joergr** » Dec 05, 2010 5:53 pm

Yeah, I see. The reason why I replicate a lot with EQL is that you don't have to take a VMware snapshot and thus don't interfere with the guest in any way. OK, your guest, your whole LUN, is anything but consistent, BUT anyhow: via a 10 Gb link the replication is extremely fast and short and gives me another shot in case of a real emergency. Every time I brood over this concept I say to myself: it's one more safety measure, it's free of charge, so why not use it? But then again, I won't use it over WAN, only via a minimum of 1 Gb. Better is 10 Gb.

best regards,
Joerg

Post by **tsightler** » Dec 05, 2010 6:25 pm

We didn't consider it free, since you have to pay for all the storage it uses and it's about 5X less efficient than our previous two storage vendors in its on-disk storage requirements, which somewhat offsets the "free". Also, we have to replicate to a site >700 miles away and don't have the budget for a 1Gb link that far. We're lucky to have 10Mb and some WAN acceleration.

I do understand the VMware snapshot issue though, but it's a concern for only a handful of our VM's and we use application specific transactional replication for those systems.
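To put the link-speed constraint in rough numbers, here is a back-of-the-envelope sketch of how long a replication pass would take on a 10Mb link versus a 1Gb link. The 50 GB change size is a hypothetical figure, not something stated in the thread, and the calculation ignores protocol overhead and any benefit from WAN acceleration.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb (decimal gigabytes) over a link of link_mbps.

    efficiency is the usable fraction of the nominal bandwidth (1.0 = ideal).
    """
    bits_to_send = data_gb * 1e9 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


if __name__ == "__main__":
    changed_gb = 50  # hypothetical amount of changed data per replication pass
    for label, mbps in (("10Mb WAN", 10), ("1Gb link", 1000)):
        print(f"{label}: {transfer_hours(changed_gb, mbps):.2f} hours")
```

With these assumed numbers the same pass takes roughly 11 hours on the 10Mb link and under 7 minutes on a 1Gb link, so the efficiency of the replication format matters far more on the slow link.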

Post by **ctchang** » Dec 20, 2010 4:51 am

tsightler wrote:Because Equallogic was so slow in responding we ended up reinitializing the array after 3-4 days of "we don't see anything wrong" and "how do you know the array caused the corruption?". I wrote the blog entry that Anton posted largely out of frustration, but it didn't take long after that before high level people at Equallogic were calling me. They eventually sent me a replacement PS6000E (a nice upgrade) and took the PS400E back for "analysis". Never really came up with anything though.

Interestingly, it was the PS6000E on which I experienced the "failed drive that's really good" scenario. We had just placed it into production and put a few small VM's on it to let it run for a few days. Over the weekend I received an alert that there was a drive failure on the array, however, when I logged into the interface, the RAID status showed "Normal" even though it was showing a failed drive. Turns out the problem in this scenario was that the "bad block" was actually in the reserved area, not in the RAID area, so the management controller was reporting the area, but not acting to remove the drive from the RAID. Turns out not a big deal, but still not comforting.

It's funny because we purchased Equallogic after becoming extremely dissatisfied with EMC due to several major issues with controller failures and a couple of cases where we experienced a few blocks of corruption due to a firmware bug. We ran Equallogic for three years without a single stability issue, then everything went bad in the span of a few months.

All of our problems happened after upgrading to 4.1.x firmware and using the new RAID6 features. In the end, it might have been bad luck, but it made us gun shy for new firmware upgrades and new features so now we let them sit out there for a while. Of course obviously someone has to be the early adopter, so we thank you for all your testing.

While I am still reading this thread, I would like to post something I came across recently related to the EQL false alarm issue.
I've seen a blog showing exactly what you described on SATA drives, but I've forgotten where it is now. It appears to be a known issue on all E series (i.e. SATA) arrays.
