Sunday, 12 October 2003

Google's reliability in question

The word "Google" has long since become synonymous with "search" in the Internet context. I used to believe wholeheartedly that this moniker was richly deserved by the king of search engines because the quality of its product, search results, was unparalleled. When Google bought the most extensive usenet archive in existence from Déjà and applied its search technology to it, the usenet community relished having a more reliable and efficient search form for the archive. Google has grown so important to the Internet community that otherwise-honorable businesses engage in shenanigans (and occasionally outright scams) to boost their "rank" in the search engine's hit list.

Against this backdrop, serious questions about Google's reliability have been raised in recent weeks.

In one notable example, Google misreported the number of pages in its index that matched certain search criteria. One particular series of searches revealed a systemic flaw in Google's reporting. On 30 September 2003, a search for the keywords "quote dog cat stone" (without the quotes) yielded the following reported result: "Results 1 - 10 of about 75,600." On that same day, a search for the keywords "quote dog stone" (again, without the quotes) yielded the following reported result: "Results 1 - 10 of about 48,700."

Note the difference between these two searches. The first query had four keywords, and the second had three: the word "cat" was removed. Google's default boolean operator is AND, meaning that when you search for more than one word, Google automatically looks for documents containing all of your search terms. You can change this behavior by typing "OR" or some other operator between the words. Under the default, however, removing a keyword should never reduce the number of results, because every page that contains all four words necessarily contains any three of them.

It seems likely that many pages on the web will have at least one of our keywords, since quote, dog, cat, and stone are all relatively common words. But how many will contain two of the words? Cognitively, dog and cat go together, but it is easy to imagine many pages devoted to dogs that do not mention cats at all. Similarly, how many pages devoted to Craig Venter's poodle will mention "quote" or "stone"? The number will be lower still if we look for pages that contain all four words. In sum, the fewer keywords we use in the query, the more documents we should retrieve, or at the very least the same number. However, Google's reported results were the opposite of what we expected: 75,600 hits for the four-word query and 48,700 for the three-word query. Meanwhile, very few documents were actually returned for these searches: fewer than ten for each.
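The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four sample "pages" are invented, the function names and_search and counts_are_consistent are hypothetical, and nothing here reflects how Google actually builds or queries its index. It shows only that, under implicit AND, the matches for a four-word query are always a subset of the matches for the same query with one word removed, and that the two counts reported on 30 September 2003 break that rule.

```python
def and_search(documents, keywords):
    """Return the indices of documents that contain every keyword (implicit AND)."""
    wanted = {k.lower() for k in keywords}
    return {
        i for i, text in enumerate(documents)
        if wanted <= set(text.lower().split())
    }


def counts_are_consistent(count_fewer_keywords, count_more_keywords):
    """Under AND, the query with fewer keywords must match at least as many pages."""
    return count_fewer_keywords >= count_more_keywords


# Invented sample pages, used only to illustrate the subset property.
documents = [
    "quote of the day my dog sat on a stone",
    "a quote about a dog and a cat on a stone wall",
    "the cat ignored the dog",
    "stone soup recipe",
]

four_word_hits = and_search(documents, ["quote", "dog", "cat", "stone"])
three_word_hits = and_search(documents, ["quote", "dog", "stone"])

# Every page matching the four-word query also matches the three-word query.
print(four_word_hits <= three_word_hits)

# The counts reported in the post: 48,700 for three keywords, 75,600 for four.
print(counts_are_consistent(48_700, 75_600))
```

The first check holds for any collection of pages, not just this sample, which is why the reported figures cannot both be accurate counts of AND matches.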

Why does all this matter?

First, the public trusts Google to return search results reliably and impartially. Some civil libertarians fear that Google's position in the Internet search industry may eventually grow into a monopoly. Imagine having only one search engine available: it could, for example, direct everyone to its advertising partners, as opposed to the web pages that are really the best matches for the queries it gets. (A Machiavellian future, to be sure, but a possible one.)

Second, researchers rely on Google. This is simultaneously the easiest and the hardest example to understand. Everyone has experience searching for information with an Internet search engine. When you do your research, you rely on the search engine to return accurate results. On top of this straightforward problem, consider the dilemma of the linguists in alt.usage.english. These academics and amateur enthusiasts rely on Google's reported result counts to determine how widely words and phrases are used. If, for example, 1 million web pages contain the word "cool" but only 10 thousand contain "groovy," this is evidence of a change taking place in our language. This technique also extends into demographic research. Google reports 1,490,000 documents containing "Filipino" but only 97,400 documents containing "Pilipino." This has some bearing on the number of people from the Philippines who are publishing information on the web, because they are much more likely than non-Filipinos to use "Pilipino" in English text.
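As a rough sketch of the arithmetic such researchers perform, the comparison boils down to a ratio of reported counts. The helper name usage_ratio is hypothetical, and the figures are simply the ones quoted above; the point is that any conclusion drawn this way is only as trustworthy as the counts fed into it.

```python
def usage_ratio(count_a, count_b):
    """Return how many times more often term A is reported than term B."""
    if count_b <= 0:
        raise ValueError("count_b must be positive")
    return count_a / count_b


# Reported document counts quoted in the post.
filipino_count = 1_490_000
pilipino_count = 97_400

ratio = usage_ratio(filipino_count, pilipino_count)
print(f"'Filipino' is reported roughly {ratio:.1f} times as often as 'Pilipino'")
```

If either reported count is inflated or deflated by tens of thousands, as in the keyword example earlier, the ratio shifts with it.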

We have documented one instance where Google's reported results differ markedly from its actual results, so it is reasonable to suspect that other examples exist. The company guards its search algorithms as proprietary, so it is unlikely that we, the public, will ever know exactly what causes these discrepancies. And it is not always possible to catch Google red-handed. Today, the queries I described above ("quote dog cat stone" and "quote dog stone") yield actual results that appear commensurate with the reported results. The company has evidently heeded the complaints it has received over the last few weeks and taken action to correct this particular problem. You, the reader, must rely on my good word (and that of a few usenet posters) that this discrepancy really did exist at the end of September. Since the largest usenet archive is under Google's exclusive control, the company might conceivably alter its contents to erase all dated posts that mention this problem.

Please note that I do not believe such a scenario is likely. I also believe that, at this point, the problems I have outlined above remain relatively minor and affect only a small group of Internet users. That said, we should remain vigilant, so that we are not surprised by bigger failures in the future.

Topics: Technology
