I got a call from a customer asking for help with a Subversion server that had suddenly gone offline. The system was running SUSE 10.1, Apache 2 and Subversion, and almost nothing else. It had been operating for two years in the server room in a headless configuration. The developers who used it connected from their Windows machines, and everything worked flawlessly the whole time. Then, last week, it became unresponsive. They tried rebooting the server, but that didn’t help. Nothing had changed in the configuration, and since the machine was only accessible on the LAN and protected by a firewall, they hadn’t even done any upgrades. Their attitude could best be described as “if it’s not broken, don’t fix it”.
Before I was called in, the customer had installed a new Ethernet card, on the assumption that the on-board card had failed. They came to that conclusion because the server couldn’t ping out and no other machine could ping the server. A good assumption, I think. However, the new Ethernet card didn’t fix the problem. That’s when they called me in.
We went through the configuration with a fine-tooth comb. We disabled NetworkManager, since I’ve known it to cause issues on some systems. Everything seemed fine, but the network still didn’t work. The network configuration was static, so we switched to DHCP. The server did get an address that way, but we still couldn’t ping in either direction, from the server or to the server. We looked for alternate kernels to boot, to see if it was a kernel-specific problem, but there was only one kernel available. As I recall, there wasn’t even a failsafe option in the boot menu.
As part of the diagnostic process we put a laptop on the wired network and tried pinging it from the server. No response. We checked the ARP table on the laptop and there was an entry for the server we were working on, with its IP and MAC addresses. We concluded that the server could “talk” but couldn’t “hear”. We checked for firewall rules that might be blocking traffic. There were none. We checked the ARP table on the server, and there were entries, without host names, for other machines on the network. None of us had seen this before. The plot thickens.
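For anyone repeating the ARP check on a Linux box, here is a minimal sketch of reading the kernel's ARP table from a script. It assumes the usual six-column layout of /proc/net/arp; the function name and the sample addresses are made up for illustration and are not from the customer's network.

```python
import os
from typing import Dict, List

ATF_COM = 0x2  # flag bit the kernel sets once an entry has been resolved


def parse_arp_table(text: str) -> List[Dict[str, object]]:
    """Parse text laid out like Linux's /proc/net/arp into a list of entries."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries.append({
            "ip": ip,
            "mac": mac,
            "device": device,
            "complete": bool(int(flags, 16) & ATF_COM),
        })
    return entries


# Invented sample data, used when /proc/net/arp is not available.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.10     0x1         0x2         00:11:22:33:44:55     *        eth0
192.168.1.20     0x1         0x0         00:00:00:00:00:00     *        eth0
"""

if __name__ == "__main__":
    path = "/proc/net/arp"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    for entry in parse_arp_table(text):
        state = "resolved" if entry["complete"] else "incomplete"
        print(f'{entry["ip"]:<16} {entry["mac"]:<18} {entry["device"]:<6} {state}')
```

Running it on both ends of a failing ping and comparing the two listings is a quick way to see which side has learned the other's MAC address.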
Next we decided to try a Knoppix live CD to test the hardware. We figured that if Knoppix worked, then there was a software configuration problem. Sure enough, Knoppix booted and networking was working. We could ping out, and other machines on the network could ping the server. So we concluded that it must be something in the software.
We went through the network configuration files again. Suspecting that the files might be corrupted, we manually rewrote them from scratch. We thought unprintable characters in the configuration files might be keeping them from working properly. We also flushed the firewall rules with iptables -F for good measure, in case there were rules that iptables -L was not reporting. Still no change.
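We rewrote the files by hand, but a short script can check for stray bytes without retyping anything. The sketch below is an assumption about how one might do it, not what we ran on site: it treats anything outside printable ASCII, tab and newline as suspicious, so it will also flag carriage returns from DOS line endings and any non-ASCII text.

```python
import sys
from typing import List, Tuple

ALLOWED_CONTROL = {0x09, 0x0A}  # tab and newline are normal in config files


def find_unprintable(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (line, column, byte value) for each suspicious byte, 1-based."""
    hits = []
    line, col = 1, 0
    for byte in data:
        col += 1
        printable = 0x20 <= byte <= 0x7E
        if not printable and byte not in ALLOWED_CONTROL:
            hits.append((line, col, byte))
        if byte == 0x0A:  # a newline starts the next line
            line, col = line + 1, 0
    return hits


if __name__ == "__main__":
    # Usage: pass one or more file paths as arguments.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for line, col, byte in find_unprintable(handle.read()):
                print(f"{path}:{line}:{col}: byte 0x{byte:02x}")
```

Pointing it at the interface configuration files gives a line and column for each odd byte, and no output means the files are clean by this definition.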
Still operating under the assumption that there was a configuration error somewhere, we decided to try an upgrade to openSUSE 11.1. We backed up all the configuration files and data to an external USB drive, then ran the upgrade. Upgrading is a pretty slow process compared with a fresh install. Nothing changed.
Next we tried a fresh install. No change. Then I tried another distribution, CentOS 5. Still no change. Finally, as my frustration mounted, I ran ifup eth0 and a root console popped up displaying an error message: “…. irq #66 disabled”. We googled the error and found that it appears when there are spurious interrupts on the bus. After a couple more tests we concluded that there was some kind of hardware problem that we hadn’t seen before. Since the system was under warranty, we contacted the manufacturer and they replaced the motherboard. After a fresh install, the system came up and ran perfectly.
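In hindsight, searching the kernel log for interrupt complaints early on would have saved us several reinstalls. Here is a rough sketch of that search. The patterns are my guess at the usual wording, which varies between kernel versions, and the sample log lines are illustrative rather than copied from this server.

```python
import re
from typing import List, Tuple

# Matches "irq 66", "IRQ #66", "irq#66" and similar spellings.
IRQ_NUMBER = re.compile(r"irq\s*#?\s*(\d+)", re.IGNORECASE)
# Words that usually accompany an interrupt problem (assumed, not exhaustive).
TROUBLE = re.compile(r"disabl|nobody cared|spurious", re.IGNORECASE)


def find_irq_trouble(log_text: str) -> List[Tuple[int, str]]:
    """Return (irq number, log line) for lines that look like IRQ complaints."""
    hits = []
    for line in log_text.splitlines():
        number = IRQ_NUMBER.search(line)
        if number and TROUBLE.search(line):
            hits.append((int(number.group(1)), line.strip()))
    return hits


# Illustrative log text; replace it with saved dmesg output to use for real.
SAMPLE_LOG = """\
eth0: link up
irq 66: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #66
"""

if __name__ == "__main__":
    for irq, line in find_irq_trouble(SAMPLE_LOG):
        print(f"IRQ {irq}: {line}")
```

If the same IRQ number keeps turning up across different distributions, as it would have here, that points at the hardware rather than the configuration.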
This is one of the things I both love and hate about this business. Every day there is a new problem to solve that we’ve never seen before. I hope this saves someone else some time diagnosing their hardware problems.