The People's Librarian

Similar to a librarian, Google can find documents containing the words you're thinking about.

2020-11-08

If you're like most people preparing for Thanksgiving, you might turn to Google to search for a recipe.

But before the Internet, or if you're like Hermione Granger, you'd go about it differently.

When in doubt, go to the library

Library stacks

In the Dewey Decimal Classification, the code 641.5 is for cooking. You'd locate that shelf and then browse the recipe books.

In my town, the librarians are helpful and nice to talk to. I'd be more likely to just tell them I'm thinking of making sweet potato casserole.

People find it more natural to search for information by what they're thinking about, rather than trying to follow somebody else's classification system.

As early as 1945, scientist Vannevar Bush speculated in The Atlantic that computing machinery might let you find documents in this way.

These ideas were the basis of today's "full-text search" software like Google's search function.

The index

One might imagine that Google goes out to find the websites that contain the words you typed.

But obviously, searching through all websites for your words would take too long and would be impractical. Instead, Google is only looking up your words in an index that it has already prepared.

The index at the back of a textbook lists, for each word, the page numbers where the word is mentioned. The Google index is similar, except that instead of page numbers, it lists links to websites.

Programs called Googlebots continually "crawl" the web, automatically browsing the websites they know about and revisiting them from time to time. These pages link to other pages, which link to yet others, which all get retrieved for indexing.
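To make the "pages link to other pages" idea concrete, here is a minimal sketch of a crawl over a made-up, in-memory web. The page names and the TOY_WEB dictionary are invented for illustration, and a real crawler would fetch pages over the network; this only shows how following links discovers new pages.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
    "e.example": ["a.example"],  # no page links here, so the crawl never finds it
}

def crawl(seeds, links):
    """Visit every page reachable from the seed pages, breadth-first."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real crawler would hand the page to the indexer here
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl(["a.example"], TOY_WEB))
```

Note that e.example is never visited: a page that nobody links to, and that the crawler doesn't already know about, stays invisible.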

Whenever a googlebot finds a page, an indexing program extracts all the words in it. For each of these words, the program adds the page to the list for that word. For example, the page you're reading contains about 300 unique words; it will have to be added to 300 lists.

The index thus consists of one list per word, each list containing many website links.
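As a rough sketch of such an index (often called an inverted index), the snippet below builds one list per word from three made-up pages. The URLs and texts are invented, and splitting text on letters only is a simplifying assumption; real indexing does far more.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for crawled websites.
PAGES = {
    "recipes.example/casserole": "Sweet potato casserole with pecans",
    "garden.example/potato": "How to grow a potato",
    "candy.example/sweet": "Sweet treats and sweet snacks",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lowercase the text and keep each distinct word once per page.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index[word].add(url)
    return index

index = build_index(PAGES)
print(sorted(index["sweet"]))
print(sorted(index["potato"]))
```

Each page appears once in the list of every distinct word it contains, just as the page you're reading would be added to about 300 lists.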

The lookup

If you type the three words "sweet potato casserole" into Google's search box, the search program retrieves the lists for these three words from the index.

Only those links that are in all three lists will lead to the pages that contain all three words. That's the result set.
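In code, "in all three lists" is a set intersection. The sketch below uses a small hand-written index with invented site names; it illustrates the idea only, not how Google's search program is written.

```python
# Hypothetical index: one set of links per word.
INDEX = {
    "sweet": {"recipes.example", "candy.example", "blog.example"},
    "potato": {"recipes.example", "garden.example", "blog.example"},
    "casserole": {"recipes.example", "blog.example"},
}

def lookup(query, index):
    """Return the links that appear in the list of every query word."""
    words = query.lower().split()
    if not words:
        return set()
    # A word missing from the index has an empty list, so nothing can match.
    lists = [index.get(word, set()) for word in words]
    return set.intersection(*lists)

print(sorted(lookup("sweet potato casserole", INDEX)))
```

Here only blog.example and recipes.example survive, because candy.example and garden.example are each missing at least one of the words.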

This basic result set will still contain a lot of links. In order to give you only the most relevant results, the search program uses many criteria to select a few of these links.

One criterion is that some pages are more important than others. During indexing, Google analyzes all the pages and their links to work out which pages seem the most authoritative, using popularity as the measure: a page that lots of other pages point to is treated as more authoritative. And, as in real life, links from important pages count for more, so if more important pages link to you, you're deemed more important.

Famous recipe sites will find themselves being picked out from the index more often than a hobbyist's blog page.
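The "important pages make you important" rule is circular, but it can be computed by repeating a simple update until the scores settle. Below is a simplified sketch in the spirit of the well-known PageRank idea; it is not Google's actual ranking. The link graph, the damping value of 0.85, and the iteration count are assumptions chosen for illustration.

```python
# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "famous-recipes.example": [],
    "hobby-blog.example": ["famous-recipes.example"],
    "news.example": ["famous-recipes.example"],
    "forum.example": ["famous-recipes.example", "news.example"],
}

def rank_pages(links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more."""
    pages = list(links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            # A page with no outgoing links spreads its score over all pages.
            receivers = targets if targets else pages
            share = damping * score[page] / len(receivers)
            for target in receivers:
                new[target] += share
        score = new
    return score

scores = rank_pages(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the famous recipe site comes out on top because three pages link to it, and the news site outranks the hobbyist's blog because the forum links to it.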

Some other criteria are used to modify the search based on external information. For example, since it's Thanksgiving season, you're more likely to be looking for pages containing the phrase "sweet potato" rather than random pages containing both "sweet" and "potato".
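The difference between a page containing the phrase and a page that merely contains both words can be sketched as follows. The helper names and sample texts are made up, and scanning the full text is a simplification; a real search engine would more plausibly store word positions in its index.

```python
import re

def words_of(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of the phrase appear next to each other, in order."""
    words = words_of(text)
    target = words_of(phrase)
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

print(contains_phrase("Sweet potato casserole with pecans", "sweet potato"))
print(contains_phrase("A sweet dessert and a potato salad", "sweet potato"))
```

Both texts contain "sweet" and "potato", but only the first contains them as a phrase, so a ranking step could prefer it.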

To make the search faster, the program uses several strategies. Since rarer words have shorter lists, the search program can start with the rarest word's list (here it's "casserole") and discard from it the pages that don't match the other words.
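A minimal sketch of the rarest-word-first strategy, again over a hand-written index with invented site names: the lists are sorted by length, and the shortest one supplies the candidates that the other lists then filter.

```python
# Hypothetical index: "casserole" is the rarest word, so its list is shortest.
INDEX = {
    "sweet": {"recipes.example", "candy.example", "blog.example", "shop.example"},
    "potato": {"recipes.example", "garden.example", "blog.example"},
    "casserole": {"recipes.example", "diner.example"},
}

def lookup_rarest_first(words, index):
    """Intersect the word lists, starting from the shortest one."""
    lists = sorted((index.get(word, set()) for word in words), key=len)
    if not lists:
        return set()
    candidates = set(lists[0])  # start from the rarest word's list
    for other in lists[1:]:
        # Keep only the candidates that also appear in this list.
        candidates = {url for url in candidates if url in other}
        if not candidates:
            break  # nothing left to match, so stop early
    return candidates

print(lookup_rarest_first(["sweet", "potato", "casserole"], INDEX))
```

Starting from the two "casserole" pages means only two candidates ever need checking against the longer lists, instead of walking the whole "sweet" list.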

As a result of all these arrangements, Google is able to build your first search results page within a fraction of a second.

Perhaps not as pleasant as talking to my librarian, but the next best thing.