Facebook and its properties, including WhatsApp and Instagram, went offline for about five hours yesterday.
Many small businesses throughout the world depend on these services to reach customers. They got a rude shock.
Why did this outage happen?
There was nothing wrong with Facebook's computers; they were running fine. The problem was their Internet connectivity.
The Internet has no government; it is made up of hundreds of thousands of sovereign kingdoms. One of these is the Kingdom of Facebook. (Actually, Facebook runs a handful of kingdoms, but they're managed as one single empire.)
Kingdoms like Facebook advertise Internet services, such as whatsapp.com, that run on their own computers. To access any Internet service, your computer must be in one of these kingdoms.
If you're using mobile data, then your phone is in your phone company's kingdom. If you're on WiFi, then it's in your Internet Service Provider's (ISP's) kingdom.
In either case, when you type a WhatsApp message, your phone sends data packets addressed to "whatsapp.com" to your phone's own kingdom. Your phone company or ISP knows that the "whatsapp.com" service is maintained by the Facebook kingdom, so it forwards these data packets to Facebook.
In Internet lingo, these kingdoms are called Autonomous Systems (AS).
Imagine the Internet as a sprawling landscape where the kingdoms (the Autonomous Systems) have their computers.
These kingdoms coordinate packet exchanges between themselves on a massive highway interchange situated on a mountain. The mountain is in the center of all the kingdoms, and it has exits to all the kingdoms.
Each kingdom owns and manages the signposts for its own exit, to show all the services it maintains. For example, Facebook's exit has signposts for "whatsapp.com", "instagram.com", and "facebook.com".
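To make the signpost idea concrete, here is a minimal sketch in Python. It is only a toy model of the analogy: the kingdom names and the "example-isp" entry are made up for illustration, and real Autonomous Systems advertise ranges of numeric addresses rather than service names.

```python
# Toy model of the interchange: each kingdom (Autonomous System) posts
# signposts for the services it maintains. All names are illustrative.
signposts = {
    "facebook": {"whatsapp.com", "instagram.com", "facebook.com"},
    "example-isp": {"mail.example-isp.example"},
}

def find_exit(service):
    """Return the kingdom whose exit advertises the service, or None."""
    for kingdom, services in signposts.items():
        if service in services:
            return kingdom
    return None

print(find_exit("whatsapp.com"))     # facebook
print(find_exit("unknown.example"))  # None
```

A packet for a service that no kingdom advertises has nowhere to go, which is why the signposts matter so much.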
One of the services that Facebook advertises is a Domain Name System (DNS) server, which translates names like "whatsapp.com" into the numeric addresses that data packets are sent to.
When you type "facebook.com" in your browser, the browser first needs to "resolve" this name to an address. It asks your ISP's DNS service.
Your ISP's DNS service knows that Facebook has advertised a DNS service that can translate "facebook.com". The ISP asks Facebook's DNS service to translate the name.
Once your browser has the numeric address, it sends the actual web page request to that address.
Your ISP again forwards the request to Facebook, because Facebook has advertised that address as a web server.
That's how you can browse facebook.com.
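The two steps described above, resolving the name and then sending the request, can be sketched roughly like this. Everything here is hypothetical: the function names are invented for the example, and the address 192.0.2.10 comes from a range reserved for documentation, not from Facebook.

```python
# Toy walk-through of browsing: resolve the name, then send the request.
# The address is from a documentation-only range, not a real Facebook address.
FACEBOOK_DNS = {"facebook.com": "192.0.2.10"}  # what Facebook's DNS service knows

# Signposts at Facebook's exit: its DNS service and its web server address.
ADVERTISED = {"facebook-dns": "facebook", "192.0.2.10": "facebook"}

def isp_resolve(name):
    """The ISP's DNS service asks Facebook's DNS service, if it can find it."""
    if "facebook-dns" not in ADVERTISED:
        raise LookupError(f"cannot resolve {name}")
    return FACEBOOK_DNS[name]

def browse(name):
    address = isp_resolve(name)    # step 1: name -> numeric address
    kingdom = ADVERTISED[address]  # step 2: ISP forwards to the advertiser
    return f"request for {name} sent to {address} in kingdom {kingdom}"

print(browse("facebook.com"))
```

Note that both steps depend on signposts: one to find the DNS service, and one to find the web server itself.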
Yesterday, Facebook engineers were updating their signposts with new ones, but they accidentally left out an important destination: their domain name servers.
Data packets could no longer find the signposts to Facebook DNS servers. No one could resolve "facebook.com" or "whatsapp.com", etc. This is what caused the outage.
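A small simulation shows why one missing signpost was enough. This is a sketch of the analogy under made-up names and a documentation-only address, not a reconstruction of Facebook's actual configuration.

```python
# Toy model of the outage. All names and the address are illustrative.
signposts = {"facebook-dns", "facebook-web"}  # what Facebook's exit advertises
dns_records = {"facebook.com": "192.0.2.10"}  # held by Facebook's DNS servers

def resolve(name):
    """Resolve a name; this only works while the DNS signpost is up."""
    if "facebook-dns" not in signposts:
        return None  # packets cannot find the DNS servers
    return dns_records.get(name)

before = resolve("facebook.com")
signposts.discard("facebook-dns")  # the faulty update drops the DNS signpost
after = resolve("facebook.com")

print(before)  # 192.0.2.10
print(after)   # None
```

In this model the DNS records are still intact and the web signpost is still up after the update. Nothing is broken on the computers themselves; the names simply can no longer be turned into addresses.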
As soon as they replaced their signposts, the engineers themselves could no longer reach their own computers, because they couldn't resolve their domain names to addresses.
The employees who were on site, and who could log into the routers directly, did not have the authorization to make the changes.
It took them some time to resolve this comedy of errors, but now all seems well in the kingdom of Facebook.