Wednesday, January 23, 2008

I'm going to go off on a bit of a (somewhat grumpy) lecture here in hopes that people will stop long enough to listen. A little Gestalt therapy, if you will. Ultimately I hope at least one person recognizes a need and acts on it.

If I had a dime for every time I have personally seen this one issue bite someone in the backside, I'd be a rich man. There are a zillion things that can go wrong on a mission-critical network, but of those things there are actually just a few that account for a substantial portion of the issues that typically bring critical services down.

So, if you run a network and have not addressed the one issue I will describe below, please take the time out of your day to start a plan to remediate the problem ASAP. Along the same lines, if you are not sure where you stand on this issue - or if you have never checked, but feel confident because everything works today and always has, so it can't possibly be a problem - again, please just take the time to inspect your infrastructure and put a plan in place.

I should also say that if I had a dime for every time I've said exactly what you just read in the paragraph above, I'd be a rich man. I lost count long, long ago of the number of hours spent watching people try to avoid - in any way possible - checking the obvious and addressing it. Usually that's due to the egg-on-face concerns that go along with being the guy who missed something so simple and critical (albeit not too obvious) when it came time to learn the detailed intricacies of running a high-availability network.

Okay, enough with the harshness. Time for the issue at hand.

The number one network mistake I have seen people make on IP networks, over and over again, is leaving their switches and servers at the default settings that cause the network interfaces to auto-negotiate speed and duplex.

Seriously, if your requirement is to provide high availability and your SLAs require your services to be up, do not neglect the critical (but often skipped) process of manually configuring your NICs and switches to the proper settings. Just because the interface says it's running at 100 Mbps and full duplex doesn't mean it's working, and when your network takes a dive and you start losing packets you'll be sorry.
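
If you want a quick way to sanity-check what an interface actually negotiated, here's a minimal sketch for Linux hosts. It just reads the values the kernel exposes under /sys/class/net; the interface name (eth0) is a placeholder, and the ethtool command mentioned in the comments is only an example of how you'd pin the settings once you know what they should be, so adapt all of it to your own gear and your vendor's documentation.

#!/usr/bin/env python3
"""Report what a Linux NIC actually negotiated (speed and duplex).

A minimal sketch: it reads the values the kernel exposes under
/sys/class/net/<iface>/ on Linux. The interface name and the warning
below are assumptions; substitute your own. To pin the settings you
would typically use something like
`ethtool -s eth0 speed 100 duplex full autoneg off` (plus the matching
configuration on the switch port), per your vendor's documentation.
"""
import sys
from pathlib import Path

def read_attr(iface: str, attr: str) -> str:
    """Return a sysfs attribute for the interface, or 'unknown'."""
    path = Path("/sys/class/net") / iface / attr
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown"

def main() -> None:
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"  # placeholder name
    speed = read_attr(iface, "speed")    # in Mb/s, e.g. "100" or "1000"
    duplex = read_attr(iface, "duplex")  # "full" or "half"
    state = read_attr(iface, "operstate")
    print(f"{iface}: state={state} speed={speed}Mb/s duplex={duplex}")
    if duplex == "half":
        print("WARNING: half duplex; check both the NIC and the switch port settings.")

if __name__ == "__main__":
    main()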

Along the same lines, never assume that half a percent of packet loss is no big deal. Seriously, if you are seeing retransmits on your network interfaces, something is likely wrong. Also, chances are that 0.5% loss is not scattered evenly across your traffic. It may all be happening at once in bursts, and that hurts - a lot.
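
And if you want to put a number on that loss instead of guessing, here's another minimal sketch for Linux hosts: it samples the per-interface error and drop counters over a short window and flags anything above a threshold. The interface name, the ten-second window, and the 0.5% threshold are all assumptions on my part, so tune them for your own environment.

#!/usr/bin/env python3
"""Rough per-interface error and drop rate check (Linux).

A minimal sketch: samples the counters under
/sys/class/net/<iface>/statistics/ twice, a short window apart, and
flags the interface if the error plus drop rate exceeds a threshold.
The interface name, the ten-second window, and the 0.5% threshold are
assumptions; tune them for your environment.
"""
import time
from pathlib import Path

IFACE = "eth0"          # placeholder interface name
WINDOW_SECONDS = 10
THRESHOLD_PCT = 0.5     # half of one percent is already worth chasing

COUNTERS = ["rx_packets", "tx_packets", "rx_errors", "tx_errors",
            "rx_dropped", "tx_dropped"]

def snapshot(iface: str) -> dict:
    """Read the current value of each counter for the interface."""
    stats = Path("/sys/class/net") / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}

def main() -> None:
    before = snapshot(IFACE)
    time.sleep(WINDOW_SECONDS)
    after = snapshot(IFACE)
    delta = {k: after[k] - before[k] for k in COUNTERS}

    packets = delta["rx_packets"] + delta["tx_packets"]
    bad = (delta["rx_errors"] + delta["tx_errors"] +
           delta["rx_dropped"] + delta["tx_dropped"])
    pct = (100.0 * bad / packets) if packets else 0.0

    print(f"{IFACE}: {packets} packets, {bad} errors/drops "
          f"({pct:.3f}%) over {WINDOW_SECONDS}s")
    if pct > THRESHOLD_PCT:
        print("WARNING: loss above threshold; check speed/duplex on both "
              "the NIC and the switch port.")

if __name__ == "__main__":
    main()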

Again, if I had a dime for every time I (or someone working with me) recommended inspecting the interface settings, recommended changing them, and flagged interfaces where traffic analysis showed data transmission loss that was obviously causing network apps to fail... Well, let's just say it's amazing how hard it is to convince some people that their network is the cause of the issue.

Why am I being so blatantly blunt about this? Because I hope the message will carry, that administrator egos will be set aside, and that people will understand what years of real-world experience, proven over and over again, bears out: if you have not already taken the steps to ensure it doesn't, this will eventually happen to you. Don't let it. Protect that ego now, rather than waiting for it to be damaged.

Finally, don't fall prey to the idea that just because you have high-grade HP, IBM and Dell servers and Cisco switches, the money you (smartly) spent negates the need to set things up the right way, or that these vendors have everything figured out for you and set as defaults. In point of fact, this issue occurs just as often (if not more so) with your expensive, data-center-class hardware. Cisco switches, in fact, have been somewhat famous for requiring this kind of manual configuration. Cisco even has a troubleshooting support article here that you can refer to for your configuration needs.

You have been advised. Now go do something about it. And forward this to every network administrator you know. The network (and ego) you save may be theirs. :)



Wednesday, January 23, 2008 3:52:33 PM (Pacific Standard Time, UTC-08:00)
I'm pretty sure I read this yesterday:

http://joelonsoftware.com/items/2008/01/22.html

=)
Wednesday, January 23, 2008 3:57:33 PM (Pacific Standard Time, UTC-08:00)
Phillip - Now that's interesting, heh. Someone called me today, right in the middle of a network emergency, asking me to remind them what the network setting was that had caused an outage on another network some time back. While I really hate rolling my eyes, you have to admit it's strange that something that occurs so often is also so frequently overlooked and unknown. Maybe it's the phase of the moon.
Thursday, January 24, 2008 8:52:14 AM (Pacific Standard Time, UTC-08:00)
Joel recently posted on the same subject on his blog, Joel on Software. It's funny to me because I was reading your blog through Google Reader and his entry was next in my list :).

Here is the link: http://www.joelonsoftware.com/items/2008/01/22.html

Quote:

"Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. There are several possible speeds that a switch can use to communicate (10, 100, or 1000 megabits/second). You can either set the speed manually, or you can let the switch automatically negotiate the highest speed that both sides can work with. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn't."
Thursday, January 24, 2008 9:04:33 AM (Pacific Standard Time, UTC-08:00)
Hi Dmitry -

Yep, Phillip also pointed that blog entry out in his comments. I'm posting a hyperlinked version here to make it easier for people to get there. It's a great example of where that problem can bite you. I've seen it happen at banks, ecommerce companies, startups, enterprises, you name it. So I think it's good to get the word out. Clickable link below:

http://www.joelonsoftware.com/items/2008/01/22.html

greg
Thursday, January 24, 2008 11:37:18 AM (Pacific Standard Time, UTC-08:00)
I am by no means an expert, but I attended a seminar a year or two ago at our local user group, given by Microsoft MVP Mike Pennacchi of Network Protocol Specialists. He said that the average network was better off if everything was set to AUTO, and that the early problems we old hands remember (remember 3Com, anybody?) were dealt with in modern equipment. The real problem is mismatches between AUTO and FULL duplex, if I recall.

He has a video about it at his website, if you are interested.
Jeremy Renton
Thursday, January 24, 2008 12:14:58 PM (Pacific Standard Time, UTC-08:00)
Hi Jeremy -

If Mike Pennacchi says something about troubleshooting networks, I say listen. Here is a link to the page where people can find his video (great resource with detailed information, thanks for pointing it out). I agree with what he has to say. My point is that often there are situations with older equipment or mismatched equipment where the settings are the culprit, and that this is avoidable.

It's true that a lot of the time AUTO-AUTO works well (as Mike points out). He also states that there are times when it doesn't work, and that it's important to set both the switch and the ethernet card or other device to an identical setting. Because there are times when leaving the auto settings in place clearly doesn't work, engineers and admins who aren't sure what their specific network hardware environments need should dig in and take a look. My goal is for people to realize the importance of checking what's needed in their infrastructure, if they haven't already, in order to avoid outages and performance hits. Assuming the defaults will work is never a safe bet.
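
For what it's worth, here's a rough sketch of the kind of audit I mean, for Linux hosts: write down what each interface is supposed to be running and compare it to what the kernel says was actually negotiated. The expected-settings table below is made up for illustration, so fill in whatever your switch ports are really configured for.

#!/usr/bin/env python3
"""Compare negotiated NIC settings against what you intended (Linux).

A sketch only: the EXPECTED table below is hypothetical. Replace it
with the speed and duplex your switch ports are actually configured
for, run it on each host, and chase any mismatch it reports.
"""
from pathlib import Path

# Hypothetical policy: interface name -> (speed in Mb/s, duplex)
EXPECTED = {
    "eth0": ("100", "full"),
    "eth1": ("1000", "full"),
}

def read_attr(iface: str, attr: str) -> str:
    """Return a sysfs attribute for the interface, or 'unknown'."""
    try:
        return (Path("/sys/class/net") / iface / attr).read_text().strip()
    except OSError:
        return "unknown"

def main() -> None:
    for iface, (want_speed, want_duplex) in EXPECTED.items():
        speed = read_attr(iface, "speed")
        duplex = read_attr(iface, "duplex")
        ok = (speed == want_speed and duplex == want_duplex)
        status = "OK" if ok else "MISMATCH"
        print(f"{iface}: got {speed}/{duplex}, "
              f"want {want_speed}/{want_duplex} -> {status}")

if __name__ == "__main__":
    main()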
Monday, January 28, 2008 11:08:40 AM (Pacific Standard Time, UTC-08:00)
Well said Greg. As a load tester I've seen this time and time again. If I had a quarter for every time....

Auto/Auto has proven to be a horrible thing in my years of testing. I've seen this hose environments under load when set on the NIC and/or switch. What happens is that traffic is cruising along fine at 10 or 100 or 1000, and then one of the items (switch or NIC) decides to either bump up or bump down, leaving the other item out of sync, resulting in a huge percentage of lost or redirected traffic. This causes my load tests to fail (>1% was a failed test). Taking auto/auto out and setting it to 100/100 solves the problem every time.
Monday, January 28, 2008 1:52:24 PM (Pacific Standard Time, UTC-08:00)
Greg, you are so right about this issue! The insidious thing is that networks with this kind of configuration problem will appear to function normally under low loads. When you encounter higher loads, however, the number of network errors and retransmits can spiral out of control. This could account for an interesting human-behavior side of this problem I've observed: the blind faith of network engineers who are so sure it can't be something wrong with *their* network!
Alex Ginos