SCSI timeouts on Dell EqualLogic iSCSI SAN Arrays

VM Squad posted today about a problem he perceives exists with Dell EqualLogic arrays. It's not a problem, but that's what one of our competitors is telling people.  The issue is the time it takes to upgrade firmware in our systems, which is 15 seconds in the systems we are selling today.

What's interesting is that Derek Schwab posted about the exact same thing yesterday and how it worked so well. 

So what's up with that?  

Systems and storage solve intermittent communication problems through SCSI timeouts. If the host system can't communicate, it keeps trying patiently for a long time before giving up. This is a lot longer than 15 seconds.  The amount of time depends on the host system implementation, but it is usually more than a minute and can be five or more minutes (or so I'm told). FWIW, this is the same mechanism that is used for multi-pathing.  After a SCSI timeout, the system tries to reconnect using an alternative path.

So a 15 second delay (not an outage) a few times a year is not a very big deal. If you have to, you can schedule it for non-peak hours. Everything will work, applications will stay up and end users will see a short temporary hang - if they see anything at all. 

 

Comments

equallogic user said:

so what's the official word?  Are online updates now supported?  I attended one of your user groups you had started back in December, where you mentioned this as a feature under development.  I've seen no announcement...

Luke said:

Another interesting point is that as a customer we somewhat rely on what is told to us by salesmen about their product. When we asked the sales rep last week when he was onsite about data corruption during the upgrade period we were only told it won't happen. We weren't told why this won't happen or even offered to have it proven to us that it won't happen.

I will not blindly put faith in something just because a rep tells me it's so. If that was the case then I'd be wearing a Bill Gates t-shirt and running Vista on all my desktops.

I have more to respond with but I'll keep your comments section clean and just leave it on my blog.

Thanks Marc!

Marc Farley, Inside IT Lead Blogger said:

So, first the question on online updates:   the firmware that's in beta test now restarts the controllers in sequence, as opposed to at the same time.  When the beta test is completed this feature should be a part of it.

As to wearing a Bill Gates shirt and running Vista.... well that's  kind of funny - and I'll get back to it at the end of this.

The question of how timeouts and restarts work is almost always an "it depends" kind of thing.  Switches and network routing are involved - as are details of the data in controller cache, how cache mirroring and failure recognition between the controllers works and whether or not there is any disk checkup/cleanup work that needs to be done to ensure safe operations.  I hate to say it, but it really is a "mileage may vary" answer and so my discussion in the post above should have said as much.  It just messes up a simple discussion when you have to dive into the caveats that inevitably exist for all things we work on.

In the case of our competitor, who says there is no stoppage of I/Os during a firmware upgrade, there are several questions you want to look into.  The first is whether or not they actually recommend running that way as a best practice.  The reason for asking is that I/Os will be processed without full redundancy, which might be OK and it might not.  You need to know whether or not the I/O that is still going on has write caching turned off.  You probably don't want to run in single-point-of-failure mode while write caching is on.  Even so, you can still save yourself from data loss with battery backup for the cache, but then you want to know how long the battery will hold dirty cache data.  You might want to know how long the entire software upgrade will take - knowing that mileage will vary, but the list of tasks includes taking the system offline, loading the new software, restarting the system and syncing it with the other system so it can start accepting I/Os again.  I don't know what it is for this vendor, but I'm pretty sure it's over a minute - maybe a lot more.  Then you want to know how remirroring works, how long it takes, and whether there are any limitations tied to remirroring that could impact how long you run in single-point-of-failure mode.  After all, every I/O that's written during the upgrade has to be copied over to the other system - and during that process there is still a single point of failure issue, because a catastrophic data loss in the system that stayed awake may mean that the dirty data that is on its disks and hasn't been re-silvered yet is lost.

In other words, it's one thing for them to say that Dell EqualLogic has a controller restart problem and they don't, but it's another thing altogether for them to explain how everything actually does work when things start sliding down the slippery slope.  That's why I was surprised they brought it up, because it's actually not a very good story for them.  And it's why the question about best practices is the first one to ask. Then maybe the next question to ask is what the default settings are for running in degraded mode and whether or not they exercise those best practices or violate them.  You can't really have your cake and eat it too when it comes to data protection.

Pertaining to Vista, I know where you are coming from.  It wasn't the best implementation perhaps, but Microsoft is trying to solve a few thorny problems:  Reducing the number of security upgrades needed - which is a huge issue - and getting drivers outside of the kernel - another big deal.  Vista made big strides in both areas.   Maybe it's the sacrificial lamb OS, but I'd hate to see those key advancements lost because of UI complaints.

Anyway, thanks for reading down this far.

 There's a lot of FUD about iSCSI making the rounds. One large switch vendor who is pushing FCoE talks about the SCSI timeout issue and suggests that 10 second long timeouts are inevitable and frequent with iSCSI. Of course, if this were actually the case, all the companies selling iSCSI would have long since gone out of business, pursued by customers carrying pitchforks, but then again why let reality spoil a good piece of FUD :-)

Marc Farley, Inside IT Lead Blogger said:

Yeah, that's a load of rubbish.  Thanks Nik.

Marc,

While I normally cut some slack for lack of technical details because of all the caveats involved, on this one I think Equallogic has to bear some of the responsibility.  Posts like this significantly undermine the reputation of Equallogic in the storage world.

Let's start with the easiest quote:

"The issue is the time it takes to upgrade firmware in our systems, which is 15 seconds in the systems we are selling today"

Perhaps this is true in a lab, on an empty box, but I've yet to see this speed in the field, at least if you count the time from service unavailable until the device serves I/O again.  We own 4 Equallogic arrays and I've upgraded them all several times, and an array with many volumes and snapshots and dozens of active connections can take quite a bit longer to accept connections and actively serve data (this is noted in the firmware upgrade documentation).  Even worse, the SAS arrays seem to be slower than the SATA arrays to return to active status even though they have a tendency to hold the most critical data.

To make matters worse, Equallogic's own technotes suggest setting KATO values to 60 (or in Boot-from-SAN environments 120) which is ridiculously high and seems to assume a single, non-multipathed host.  When I tested failover with these values I found that path failures took very long to be recognized by my host OSes (Linux and VMware ESX) and sometimes caused delays of several minutes before systems failed over to alternate paths.

When I asked support why the KATO values should be set so high I was told it was set high to be sure that systems would survive a firmware upgrade.  That seems to indicate that even Equallogic does not believe in 15 second firmware upgrades.

Even worse, setting such a high KATO value with a VMware ESX 3.x server in Boot-from-SAN is an almost sure way to cause an "ext3 journal abort" of the console OS.  Why?  Because the console OS virtual SCSI devices have a 60 second device timeout.  If during a path failure the console OS has any write to a device that does not complete within 60 seconds, the device will time out and you will generally lose the ability to manage the ESX host.  However, a path failure on ESX takes exactly KATO*2+5 seconds to actually fail over.  That means that if you set the KATO value on the boot device to 60 seconds it will take 125 seconds to actually perform the failover, but the console OS virtual SCSI device will almost certainly time out by that time (assuming the path doesn't come back).  This does not affect your running VMs, since VMFS somehow manages to fail over anyway, but you will no longer be able to manage the ESX server itself, meaning you can't migrate the VMs off to another server with VMotion, so you'll eventually have to shut them down to get the ESX server functional again.

Interestingly, this setting does not even affect firmware upgrades, because a firmware upgrade is basically an "all paths down" event and that triggers a different response from the ESX failover driver, which simply queues all I/O until at least one path returns.  In other words, the default KATO value (14, which is a failover in 33 seconds) is fine.
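The timing argument above can be checked with a few lines of arithmetic. This is only a minimal sketch, assuming the KATO*2+5 failover rule and the 60 second console OS device timeout exactly as quoted in this comment; the function and constant names are made up for illustration and do not come from any VMware or EqualLogic tool.

```python
# Hypothetical helper names; the formula and the 60 s timeout are taken from the comment above.
CONSOLE_DEVICE_TIMEOUT_S = 60  # console OS virtual SCSI device timeout, as stated above


def failover_seconds(kato: int) -> int:
    """Seconds for ESX 3.x to fail over to an alternate path, per the KATO*2+5 rule."""
    return kato * 2 + 5


def console_survives(kato: int, device_timeout: int = CONSOLE_DEVICE_TIMEOUT_S) -> bool:
    """True if failover completes before a pending console OS write would time out."""
    return failover_seconds(kato) < device_timeout


# Default value (14) versus the values attributed to the technotes (60 and 120).
for kato in (14, 60, 120):
    print(kato, failover_seconds(kato), console_survives(kato))
```

Under these assumptions only the default value of 14 fails over (33 seconds) before the 60 second device timeout; 60 and 120 give 125 and 245 seconds.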

Equallogic really needs to firm up their documentation with regards to network configurations, iSCSI configurations, HBA settings, etc and truly document the best practices associated with creating an enterprise class iSCSI network.  That is one place where the other major storage vendors seem to truly trump EQL.  So far I've found that to be the weakest part of the Equallogic solution because, if configured properly and robustly, the EQL boxes really rock!!

Later, Tom

Marc Farley, Inside IT Lead Blogger said:

Hi Tom,

Thanks for weighing in. I know you have tested the heck out of failover in VMware environments.  Otherwise, OMG, it was a dumba$%$ mistake on my part to say the controller reboot time is 15 seconds. The firmware we are beta testing now and plan to ship this summer (with all the usual software release caveats) is designed to restart in 15 seconds on a relatively clean machine.  The current firmware restarts in something like 35-40 seconds I think.  So I managed to screw that up pretty badly, but Tom, at least I did say it was a "mileage may vary" phenomenon.  As you point out, the number of volumes and snapshots makes a difference because each of them needs to be verified.

So it doesn't surprise me that technical support advises customers to keep KATO (keep alive time out) values at the levels they do. As you know there are other variables too, such as network routing convergence, that have to be accounted for. SCSI timeouts are one of those very gray areas of storage that involve a number of non-integrated SAN components (servers, networks and storage). SCSI timeout values have been all over the map, including for FC storage. I don't think anybody publishes this as a spec, because there is no way to reasonably support it.

But I'm really confused by your discussion of ESX console and firmware upgrades.  First you seem to say it's totally hosed and then you say it doesn't matter.  I'm pretty sure it's not totally hosed, or it would be a much bigger deal in forums and in the blogosphere.

As to your final point, I agree completely.  Maybe you can help us understand what sort of information would be the most valuable.  There are lots of problems involved in doing this, not the least of which is characterizing what constitutes an "enterprise environment." Figuring out where to draw the lines between different service levels is very slippery.  It's not OK to assume that all customers need enterprise capabilities because the cost differences are pretty high.

Can I call you?  I'm pretty sure I have your number.

BTW, the screenshot on Derek's blog shows 12 pings missing, but he doesn't show the command he ran or mention what version of Windows he was using.  Assuming Windows XP or Windows 2003 and the default timeout (not using the -w option), 12 timed out pings would be ~60 seconds, since the default ping timeout in those Windows versions is 4 seconds (see http://technet.microsoft.com/en-us/library/bb490968(TechNet.10).aspx) and Windows ping always adds a one second interval between pings which does not include the timeout period ((4 second timeout + 1 second interval) * 12 pings is ~60 seconds).  Even if he used the -w 1000 option that would still indicate about 25 seconds.  Of course, he might have used a different ping program or different options, but since he didn't say, we're left to speculate.
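For anyone who wants to redo the ping arithmetic, here is a minimal sketch. It assumes, as the comment above does, that every missed ping costs its full timeout plus a one second interval; the function name and parameters are hypothetical and only illustrate the estimate.

```python
def ping_outage_seconds(missed: int, timeout_s: float = 4.0, interval_s: float = 1.0) -> float:
    """Rough outage estimate: each missed ping costs its timeout plus the interval."""
    return missed * (timeout_s + interval_s)


# 12 missed pings with the default 4 second timeout assumed above
print(ping_outage_seconds(12))
# 12 missed pings with a 1 second timeout (the -w 1000 case)
print(ping_outage_seconds(12, timeout_s=1.0))
```

With these assumptions the two cases come out to 60 and 24 seconds, which is where the "~60 seconds" and "about 25 seconds" figures come from.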

Also, a system which pings is not the same as a system serving requests, so there's nothing on this blog that would indicate only a 15 second outage, even though you seem to imply that's all the outage there is in your entry.

Later, Tom

Marc Farley, Inside IT Lead Blogger said:

Tom,  the grovelling continues.....

Don't throw Derek in the grovel pit with me.  What he published was of his own doing and not some sort of restart benchmark.  I assume he is a very good guy who loves his EqualLogic products - as you do. 

There was no intention to throw Derek under the bus, but your post referenced the 15 second time several times, including the false claim that the current Equallogic products require only a 15 second outage and then you linked to his blog with a screenshot showing the loss of 12 pings.  This made it appear you were trying to tie the two together.

I only wanted to point out that Derek's screenshots likely back up the reality of the currently shipping Equallogic firmware, which is more like 45-60 seconds.  Now, don't get me wrong, compared to the 5-10 minutes it took for our EMC controllers to boot that's pretty fast, but it's not 15 seconds.

As far as your confusion regarding KATO timeouts, I did not say it "doesn't matter".  What I said is that, assuming a system is using multipath drivers configured to queue requests when there are no paths (which most are), the KATO value will have no impact during a firmware upgrade, since that's a "no path event" and the multipath driver will actually queue I/O while all paths are gone.  However, it will have a detrimental effect on the far more common scenario of an actual path failure.

If you don't believe this is a problem a quick search on the VMware communities forum will turn up how many times this issue has been reported and just how many times I've had to instruct people on how to lower the KATO values for their boot-from-SAN volumes regardless of what EQL says.  Many have reported success with my settings while many others have simply reverted to not doing boot-from-SAN.  Several people thought their setup was working but I provided steps on how to reproduce the failure 100% of the time with the Equallogic "recommended" settings and a simple script because with the default settings a path failure would sometimes work, but would many times fail horribly.

I suspect it's not a major blogosphere issue simply because there aren't that many people running iSCSI Boot-from-SAN with Equallogic and VMware ESX 3.x, and even many of those that do don't test their failover function very well.  Still, I opened a case with both EQL and VMware support and VMware support verified this issue and the failover behavior.

Later, Tom

BTW, of course you can call me.  If you don't have my number just email me and I'll provide it.  I'm hoping that one of the things Dell provides to EQL is a little more "enterprise" focus.  I understand your comments that the term can mean different things to different companies, but I still think there are ways to provide some good strong guidelines that will help customers create robust environments for running iSCSI.

Tom

 

In the screenshot posted in my blog, that's a 1 second ping timeout with a 1 second interval.  As Tom mentioned, that's a total of about 25 seconds.  So yes, it's not a 15 second reboot.  But, 25 seconds is pretty darn good.

Also, although not intended to be a very scientific benchmark - I was just pleased with the very quick reboot time - I can tell you that the array was actively serving I/O within a few seconds of the pings coming back.  All in all, I'd say that the total time between the controller beginning the reboot and seeing active I/O again was about 35 seconds, which, in my opinion, is pretty impressive.

On a totally separate note, I have had an interesting problem with Exchange happen twice over the last 7 months or so.  Both times I got a call bright and early in the morning that certain users were unable to access Exchange.  Further exploration revealed that the drive that particular database was located on had completely disappeared.  Logs showed that about the time a nightly snapshot was triggered on the volume, Exchange started complaining about high latency and then dismounted the database.  Apparently Windows was also unhappy and dismounted the entire volume.  Interestingly, it was the same volume both times (I have the Exchange database spread between multiple volumes).

I haven't opened a ticket with Dell about this yet.  I figured I'd wait and see if it happened with the new firmware.  Maybe Marc can provide some insight on this though?

 

