In the previous blog article, "The People's Librarian," we broke down how Google searches its index when you're looking for "sweet potato casserole."
If you look closely, though, you'll find a problem in performing this search.
What does it take to search the index?
The index is organized by words. Each word has a list of all the pages that contain that word.
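To make the idea concrete, here is a minimal sketch of such a word-to-pages index in Python. The three toy pages and the `build_index` helper are made up for illustration; a real search engine's index is vastly larger and more elaborate than this.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the sorted list of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    # Sorted lists make it easy to compare the lists for different words later.
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy "web" of three pages, keyed by page ID.
pages = {
    1: "sweet potato casserole recipe",
    2: "potato salad",
    3: "sweet tea",
}

index = build_index(pages)
print(index["potato"])  # [1, 2]
```

Looking up a word is now a single dictionary access, which is the whole point of organizing the index by words rather than by pages.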
Since the World Wide Web has about sixty billion pages, the list for each word can get quite long. And as hundreds of pages are being added every second, the lists are likely to keep getting longer over time.
Today, the three lists for our three words weigh in as follows:
When you type the entire phrase "sweet potato casserole" into the search box, you want pages that contain all three words.
Google's search program sifts through the three lists above to pick out just the pages that are common to all three. Today, that's roughly 50 million pages.
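Picking out the common pages amounts to intersecting the lists. The sketch below shows one simple way to do it, assuming each list holds page IDs in sorted order. The function names and the tiny lists are invented for illustration, and this is not necessarily how Google does it.

```python
def intersect(a, b):
    """Return the IDs present in both sorted lists, walking each list once."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def pages_with_all(lists):
    """Intersect any number of sorted lists (expects at least one list)."""
    # Starting from the shortest list keeps the intermediate results small.
    ordered = sorted(lists, key=len)
    common = ordered[0]
    for other in ordered[1:]:
        common = intersect(common, other)
    return common

# Hypothetical page-ID lists for the three words.
sweet = [1, 3, 4, 7, 9]
potato = [1, 2, 4, 7, 8]
casserole = [1, 4, 5, 9]

print(pages_with_all([sweet, potato, casserole]))  # [1, 4]
```

Even this efficient approach has to step through the lists entry by entry, so the work grows with the length of the lists.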
And this is just the first step. From these 50 million pages, it needs to find the top ten or twenty most relevant ones to show you in your results page.
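Finding the best handful out of millions doesn't require sorting everything. Here is a small sketch using Python's standard `heapq` module. It assumes every matching page already has a relevance score. The scores below are invented, and how a real search engine computes relevance is a separate topic.

```python
import heapq

def top_pages(scores, k):
    """Return the k page IDs with the highest scores, best first."""
    # nlargest keeps only k candidates at a time instead of sorting every page.
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical relevance scores, keyed by page ID.
scores = {1: 0.42, 4: 0.91, 7: 0.10, 9: 0.77, 12: 0.55}

print(top_pages(scores, 3))  # [4, 9, 12]
```

The program still has to look at every candidate's score once, which is why 50 million candidates remains a lot of work.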
While the search program is doing all this work, you're shaking your mouse impatiently. Unless you see results in a second or two, you might lose interest and go away.
Google's entire advertising business depends on showing you the results page. How can the search program finish quickly enough?
The problem is that computers are too slow.
A computer's processor runs program instructions in sequence, one by one. Each instruction takes a tiny amount of time, but when you're processing millions of items, you might have to run billions of instructions, and the time does add up.
It takes even more time when the search program needs to read the index. The lists are stored on some kind of storage medium, for example, spinning magnetic disks. These disks take much longer than one instruction to read data.
Processors and disks are getting faster by the year, but still, there's a limit to how much work even a top-of-the-line computer can do in a second.
For an index this large, a single computer would take far too long to finish the search in the second or two you're willing to wait.
This is not a job for an individual computer, but for clusters of hundreds, or thousands, of computers working together.
Just as digging a ditch goes faster if many people pitch in, the search can get done sooner if many different computers take on different portions of it. They perform their tasks simultaneously and combine their results at the end.
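Here is a toy sketch of that divide-and-combine idea. It assumes the pages are split into "shards," each searched by its own worker. Threads stand in for separate machines, and the shard contents, scores, and function names are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, words, k):
    """Return this shard's k best (score, page_id) pairs among pages containing every word."""
    hits = []
    for page_id, (text, score) in shard.items():
        page_words = set(text.lower().split())
        if all(word in page_words for word in words):
            hits.append((score, page_id))
    return heapq.nlargest(k, hits)

def search(shards, query, k):
    """Search all shards at the same time, then merge their partial results."""
    words = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, words, k), shards)
        merged = [hit for part in partials for hit in part]
    # Each shard sent back at most k hits, so this final merge is cheap.
    return [page_id for _, page_id in heapq.nlargest(k, merged)]

# Hypothetical shards: page ID -> (page text, relevance score).
shards = [
    {1: ("sweet potato casserole recipe", 0.8), 2: ("potato salad", 0.9)},
    {3: ("easy sweet potato casserole", 0.6), 4: ("sweet tea", 0.7)},
    {5: ("grandma's sweet potato casserole", 0.95)},
]

print(search(shards, "sweet potato casserole", 2))  # [5, 1]
```

Each worker only has to deal with its own slice of the pages, and the final step only has to compare a few candidates from each worker rather than every matching page.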
Google operates dozens of data centers around the world. Each is a big building packed floor to ceiling with machines.
When you type a search query into your browser, the nearest data center handles it. Together, the machines in this data center are able to search the index and give you a results page within a fraction of a second.
Getting the answer right is not enough. It also needs to be timely.
Sometimes, to speed a program up, you have to write it in such a way that many computers can run parts of it in parallel. This is called parallel or distributed computing, and your Thanksgiving dinner might depend on it.