Is the Internet down? Troubleshooting network issues
If you work with people who aren’t tech-savvy, this is probably a sentence you have heard before. They sit down at their computer, try to do some work, something unexpected happens, and someone invariably exclaims, “The Internet is down!” That user most likely tried to pull up a website or read their email and received a connection error, but the actual problem can be in many different places. The website could be down, or the local computer could have a connectivity problem. Or, in some cases, the fault could be somewhere in between, and that’s where things become much harder to diagnose. Let’s look at a couple of tools and services that can be used for just that purpose.
When diagnosing a connection issue, the first thing to do is check the local computer. That check is easy, and the local machine is often where the problem lies. Can that user access other websites? Are you able to see that site on your own PC? If everything turns out okay on the local end, then you need to find out where the issue is on the network. As an IT pro, your first reaction would probably be to open a terminal window, or a command line, and use the traceroute command. Traceroute can give you a lot of information, more than just which routers are between you and the target host.
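If you find yourself running traceroute often, it can be wrapped in a small script. The following is a minimal Python sketch, not a definitive tool. The function names are made up for illustration, and it assumes the command is called tracert on Windows and traceroute elsewhere, with the -d and -n options used only to skip reverse DNS lookups.

```python
import platform
import subprocess


def traceroute_command(host, system=None):
    """Build the traceroute command line for the current (or a given) OS."""
    system = system or platform.system()
    if system == "Windows":
        # Windows ships the tool as "tracert"; -d skips reverse DNS lookups.
        return ["tracert", "-d", host]
    # Most Unix-like systems call it "traceroute"; -n skips reverse DNS lookups.
    return ["traceroute", "-n", host]


def run_traceroute(host, timeout=120):
    """Run traceroute and return its text output.

    Raises FileNotFoundError if the tool is not installed and
    subprocess.TimeoutExpired if it runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        traceroute_command(host),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


# Usage (example.com is only a placeholder target):
#     print(run_traceroute("example.com"))
```

Building the command separately from running it keeps the platform difference in one place and makes that part easy to check without touching the network.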
The output of traceroute lists the hops between you and the remote host, with three columns of reply times for each hop. That’s because the tool sends three probe packets to each hop: ICMP ECHO REQUEST packets with Windows’ tracert, UDP datagrams by default with most Unix traceroute implementations. The times shown can be useful to see if there’s a bottleneck somewhere. As a rough rule of thumb, a good reply time is below 100 ms. A very high latency may indicate that something is wrong at that particular hop, especially if every hop after it is slow as well. Any star you see means the reply to that probe did not come back. If there are three stars for a particular hop, then the tool considers that hop timed out. Often you start seeing timeouts that continue until the maximum hop count is reached. This doesn’t necessarily mean that a router is down. Usually the next host in the path has a firewall and is blocking your probes. A much worse type of reply to see is “Destination host unreachable”. It generally means that the next router in the path is down or can’t be reached, either because the device itself is no longer up or because a routing issue keeps your packets from getting there. If that happens, then the problem is most likely at that location.
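These reading rules can be sketched in a few lines of Python. The sketch assumes Unix-style output from traceroute -n, where each hop line starts with the hop number and each reply time is followed by “ms”. Windows’ tracert formats its lines differently and is not handled. The function names and sample lines are made up for illustration, the addresses come from the ranges reserved for documentation, and the 100 ms threshold is simply the rule of thumb mentioned above.

```python
def parse_hop(line):
    """Parse one hop line into (hop_number, reply_times_ms, lost_probes).

    Returns None for lines that do not start with a hop number,
    such as the header line that traceroute prints first.
    """
    fields = line.split()
    if not fields or not fields[0].isdigit():
        return None
    # Each reply time is the token immediately before an "ms" token.
    rtts = [float(fields[i - 1]) for i, tok in enumerate(fields)
            if tok == "ms" and i > 0]
    # Each "*" is a probe whose reply never came back.
    lost = fields.count("*")
    return int(fields[0]), rtts, lost


def classify_hop(line, slow_ms=100.0):
    """Label a hop line as "ok", "slow" or "timeout" (None if not a hop line)."""
    hop = parse_hop(line)
    if hop is None:
        return None
    _, rtts, _ = hop
    if not rtts:
        # Three stars: no probe was answered. The hop may simply be filtered.
        return "timeout"
    return "slow" if max(rtts) > slow_ms else "ok"


# Hypothetical sample lines, not real measurements.
sample = [
    " 1  192.0.2.1  1.2 ms  1.1 ms  1.3 ms",
    " 2  198.51.100.7  180.4 ms  175.0 ms  *",
    " 3  * * *",
]
for hop_line in sample:
    print(classify_hop(hop_line), parse_hop(hop_line))
```

A “timeout” label here only says that no probe was answered. As noted above, that is often a firewall rather than a dead router.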
If the problem is outside of your own network, whether that’s a corporate network, a school, or some other type of organization, then it’s much harder to diagnose and correct. Often, the problem is with your Internet provider. If traceroute shows high latency or dropped packets, and it seems to happen just as the packets leave your network, then perhaps your provider has a congested network. This often happens as local ISPs take on more clients without spending the money to upgrade their networks. Cable customers used to suffer from this type of slowdown much more than DSL customers, because a cable connection is shared among your local neighborhood, whereas a DSL link runs directly to the local telco office. At peak times, when everyone was online and trying to view HD videos, everything would slow down, although in recent years that’s less the case with the amount of throttling going on. That’s why companies typically pay more for dedicated connections.
If the problem seems to be further away than your local ISP, and a call to their support line is unlikely to help much, you can go to the Internet Health Report, a site that monitors performance between various backbones in the US, or the Internet Traffic Report for worldwide charts. Usually, backbones are pretty good at staying up, but outages do sometimes occur. If you’re dealing with a large network and have to support a large number of users, then keeping up with the latest issues at the backbone level may be important. The best way to do that is to subscribe to the NANOG mailing list. NANOG is the North American Network Operators’ Group. Each region of the world has its own organization like that.
Finally, the problem can sometimes be inside your network. If one or more users have problems accessing the web, especially if everything seems slow, but you know that your outside connection is fine, then something may be wrong somewhere within your building. A lot of things can go wrong. An unshielded Ethernet cable may be too close to a power cord, causing interference. Perhaps someone plugged in a device incorrectly, creating a loop. Or maybe a computer or device on the network is using up all the bandwidth. The best way to find out exactly what’s happening is to use a network tap. If you manage a network, you should always have a tap available.
Network taps almost always work the same way. A tap is a hardware device with at least three ports, which you can connect anywhere in your organization, between points A and B, and all it does is copy all the data that goes through it to a monitor port. In many cases, this is preferable to using a software tool, because you get far more information. The problem may not be at the TCP/IP level. It could be happening at the frame level, and only a tap could show you that.
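To make “the frame level” concrete, here is a minimal Python sketch that decodes the header of a single Ethernet frame, the kind of raw data a tap’s monitor port delivers to a capture machine. It assumes the bytes start at the destination MAC address and that the frame is plain Ethernet II without a VLAN tag. The function names and the sample frame are made up for illustration, and only three well-known EtherType values are listed.

```python
import struct

# A few well-known EtherType values; many others exist.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}


def format_mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_ethernet_header(frame):
    """Decode the 14-byte Ethernet II header at the start of `frame`."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # Network byte order: 6-byte destination, 6-byte source, 2-byte EtherType.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": format_mac(dst),
        "src": format_mac(src),
        "ethertype": ethertype,
        "protocol": ETHERTYPES.get(ethertype, "unknown"),
    }


# Hypothetical frame: a broadcast ARP header followed by zero padding.
sample_frame = bytes.fromhex("ffffffffffff" "020000000001" "0806") + bytes(28)
print(parse_ethernet_header(sample_frame))
```

In practice a packet analyzer does this decoding for you. The point of the sketch is only that addresses and protocol types at this layer sit below anything an IP-level tool such as traceroute reports.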
Diagnosing network trouble should always start with finding out where the problem is. Once you know whether it’s on the computer itself, the local subnet, your greater network, your ISP or the Internet at large, you can decide whether you have the ability to solve it. A large number of issues can cause the Internet to seem slow, or simply not work. You may have a misconfigured switch that isn’t running spanning tree correctly, there may be a physical issue with the cables, an intruder or attacker may be causing a denial of service, or perhaps it’s just maintenance going on at an upstream provider, or a million other things. But the bottom line is that you have to be able to tell your users what’s happening, and your boss whether or not it’s something you can fix.