Millions of Digitized Books, Hundreds of Fascinating Conclusions

Jean-Baptiste Michel et al.’s short and sweet article Quantitative Analysis of Culture Using Millions of Digitized Books raises a number of striking points that show just how valuable Google’s bold (and originally considered foolhardy) Google Books project has been to historians.  The project searches nearly 5.2 million books (about 4% of those ever published, a very significant sample) containing over 500 billion words.  Let’s pause for a minute and think about what that means.  25 years ago, or even 10 years ago, if you said you wanted to search a sizable sample of every book ever written for certain words, you would have had your head examined.  The paper points out that it would take a human 80 years just to read all the books written in one year, 2000.  Here’s this device that can go through the entire corpus quicker than the blink of an eye.  Roughly 2/3rds of the 500 billion words are in English and there’s only a significant sample size for books from 1800 on (though there are a fair number from 1600-1800), but even with these limitations, the work allowed the researchers to come to some bold conclusions.
“What conclusions?” you ask?  Try this one on for size: they estimate that most dictionaries might only contain as little as 52% of the living lexicon at any given moment.  They estimate the total lexicon of 1-grams (single words, excluding symbols, numbers, typos, etc.) at 544k in 1900, 597k in 1950, and 1,022,000 in 2000 (counting 1-grams that make up more than 1/1,000,000,000 of all English words).  Some of these are not in dictionaries due to dictionaries’ traditional dislike of compound words, but other omissions are inexcusable (they point to “deletable” as a particularly ironic example).  This lexical “dark matter,” in their charming expression, consists of words that are ripe for research.  No OED biography has ever examined every facet of these words, and no amount of looking up will find them.  The n-gram has saved these potentially valuable expressions from invisibility.
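
For the curious, the counting rule described above (a word belongs to the lexicon if it makes up more than one billionth of all words) can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the function name, the toy sentence, and the loose 0.1 threshold are all made up, and the sketch skips the paper's additional step of filtering out numbers, typos, and other non-words.

```python
from collections import Counter

def lexicon_size(counts, threshold=1e-9):
    """Count distinct 1-grams whose share of all tokens exceeds `threshold`.

    `counts` maps each word to the number of times it appears.
    The default threshold mirrors the one-per-billion cutoff described above.
    """
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for c in counts.values() if c / total > threshold)

# Hypothetical toy corpus: 11 tokens, 8 distinct words.
toy_counts = Counter("the cat sat on the mat and the dog sat too".split())

# A real corpus needs billions of tokens for the 1e-9 cutoff to matter,
# so the toy example uses a much looser (made-up) threshold of 0.1.
print(lexicon_size(toy_counts, threshold=0.1))
```

With the loose threshold, only “the” and “sat” are frequent enough to count; with the default cutoff, every word in so small a sample clears the bar.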


Another thing the n-gram allows is tracing the rise and fall of terms over time.  Much has been made of the example of the n-gram for “World War I” vs “Great War,” where Great War holds strong until 1939, then falls off, while World War I rises to pick up the slack, but it’s hardly the only example.  You can do the n-gram test yourself and see the decline of a good many words and phrases, and the introduction of others.  Ever been curious to see if anyone said “Yadda-yadda-yadda” before Jerry Seinfeld?  Want to map “Reality Television” vs. “Situational Comedy” and see if you can identify the year Survivor was released?  Want to compare Jean-Baptiste Lamarck with Charles Darwin or Karl Marx with Sigmund Freud?  The world is your oyster.
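
If you would rather see the logic behind such a comparison than click through the viewer, here is a rough Python sketch.  Everything in it is an assumption for illustration: the function names are mine and the yearly counts are invented, not taken from the actual Google Books data.  The idea is simply to divide each phrase's count by the total words printed that year, then look for the first year the newer phrase pulls ahead.

```python
def relative_frequency(phrase_counts, total_counts):
    """Return {year: phrase share of all words that year}."""
    return {
        year: phrase_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }

def first_year_overtaken(freq_old, freq_new):
    """Return the first year in which the new phrase is more frequent, or None."""
    for year in sorted(freq_old):
        if freq_new.get(year, 0) > freq_old[year]:
            return year
    return None

# Invented numbers, purely for illustration.
totals = {1930: 1_000_000, 1940: 1_200_000, 1950: 1_500_000}
great_war = {1930: 50, 1940: 36, 1950: 15}
world_war_i = {1930: 0, 1940: 24, 1950: 60}

old = relative_frequency(great_war, totals)
new = relative_frequency(world_war_i, totals)
print(first_year_overtaken(old, new))
```

Dividing by the yearly total matters because far more books were printed in later decades; raw counts alone would make almost every phrase look like it was on the rise.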


The n-gram can also detect the death of older, archaic forms of words.  “Spilled” is becoming the past tense of “to spill,” but there is no use in crying over spilt milk about it; spilt had a long run.  Contemporary spouters of aphorisms think that all that glitters is not gold, but their fathers sagely opined that all that glisters is not gold.  Indeed, past-tense verbs that end in “t” are fighting a slow, steady, losing battle against “ed.”  Can they survive?  I feel I’ve spoilt the ending of this struggle, but I’ve been burnt on these predictions before.


The final section of the article struck (or will it become “striked?”) a more somber note: repression.  Examining the use of the word “Trotsky” in Russian-language sources through the 1920s tells a harrowing tale, but everyone expected as much.  (I wanted to run a similar test on “New Economic Policy” vs “Five Year Plan,” but, alas, I speak no Russian, and the English results are pretty meaningless.)  What is more interesting is the revelation of people never before suspected of having been repressed.  The Nazi regime’s list of degenerate artists was apparently far more extensive than generally known, as people never included in the traditional narrative saw their mentions in the German press fall off the face of the earth in the late 1930s.  Again, this was just a cursory exercise: this n-gram search opens up the possibility of a new way of looking both at the more blatant Nazi/Soviet repression and at the more subtle blacklisting preferred in the West.  There are millions of possibilities that n-grams open up for these millions of books.

Wikipedia: The Good, the Bad, and the Ugly

As Bonnie’s post below adroitly demonstrates, Wikipedia is a site with a deeply ingrained ethos and traditions that might not be familiar to the casual user, a tribal society that debates the content of articles on talk pages that regular users rarely see, and ends up producing articles that are more dependent on consensus than on expertise.  Sometimes, the pages that result are admirable, written in clear English and with a large number of citations at the end for further scholarly pursuit.  Often, these are prominent subjects with quality articles in many of Wikipedia’s innumerable language versions (including the admirable but somewhat bizarre “Simple English” Wikipedia, which tries to present topics like quantum mechanics at a sixth grade reading level).  Since this practicum called on me to analyze three pages on Wikipedia, I decided to present them in a classic format: the good, the bad, and the ugly.  I found one article that was especially praiseworthy, one that was stunningly poor, and a talk page that was, to put it mildly, ugly.  Without further hesitation or preamble, let us examine Wikipedia.

How can you tell what the best articles on Wikipedia are?  Wikipedia itself has a handy answer: a “Featured Articles” category that lists what the site considers its best articles.  There are currently over 3,100 featured articles, and roughly one is added each day.  These articles are Wikipedia’s self-proclaimed cream of the cream, the roughly 0.1% of its over three and a half million articles that it’s willing to say it stands by.  Indeed, the Featured Article I have chosen is an admirable encyclopedia article.  Slavery in Ancient Greece has sections analyzing every aspect of slavery, from a detailed examination of the various terms the Greeks used for slavery and their different connotations to an examination of the origins of Greek slavery from the Mycenaean age through the Homeric period, tracing references to slaves in pre-Classical Greece all the way down through Draco and Solon.  The article struggles to quantify the number of slaves in Classical Greece, arguing that though the wide-scale slavery of the Romans in terms of number of slaves per master was unknown, there was a widespread usage of slaves in most classes, and a rich man could have up to fifty slaves.  It argues that intentional slave breeding was a rare, if not unknown, phenomenon, and that the “slave/citizen” line was far blurrier than the strict separation of the antebellum American South.  It goes on to detail classical views of slavery and then, amazingly, gives a short modern historiography of the subject and even poses discussion questions.  This admirable article is followed by a lengthy list of twenty-nine sources, 170 endnotes, and fifteen books for further reading.  This article’s ending list of sources would be ideal for an undergraduate writing a paper on Ancient Greek slavery and needing academic sources: an amazing amount of historiography is present in the works listed (though, admittedly, 1/3rd of the books mentioned are in French).
All in all, this article is a great example of what a Wikipedia page can offer scholars.

At the other end of the spectrum, we find Wikipedia’s article on the 19th-century Taiping general Loyal Prince Lee, or, as he’s known on Wikipedia, Li Xiucheng.  I will admit that this is the third occasion on which I have cited this page as an example of Wikipedia’s defects, and it has changed every time, yet no matter how much it changes, it remains unacceptable, year after year.  Wikipedia put up a disclaimer, almost apologetically, saying that “This article is a rough translation from Chinese. It may have been generated by a computer or by a translator without dual proficiency. Please help to enhance the translation.”  Now, read that quote back over.  The translation was generated by a computer or a translator without dual proficiency.  It’s no wonder the article is a shambles, an incomprehensible mishmash.  The level of incomprehensibility is best demonstrated by the section labeled “Write”: “In Zhong Prince Li Xiucheng Describes Himself (《忠王李秀成自述》), the autobiographical account of a prince of the Heavenly Kingdom written shortly before his execution(Pseudohistory saying Li was suicide admitted by Zeng Guofan gave Li a sword because Zeng respected Li, even Li Hongzhang had been read this describes and praised Li Xiucheng was a hero on a letter to Zeng).”  Is this at all comprehensible to anyone?  The faults are further demonstrated in the final section, which gives the name of a professor at the University of London as “柯文南.”  One must be skeptical that that is, in fact, how he prefers his name to be rendered in English.  In a final confusing move, under children it lists a son, “Li Ronfar Battle of Shanghai (1861).”  Did he die in the Battle of Shanghai?  Was he born in the Battle of Shanghai?  What does this mean?
The Loyal Prince Lee article demonstrates a major shortcoming of Wikipedia: articles featuring figures that are mainly of interest to speakers of non-English tongues can be extraordinarily poor, even if their article on the Wikipedia of the native language is fine or even exceptional.

Most of Wikipedia’s deliberations happen behind the scenes, on its talk pages.  Talk pages are attached to every article, yet are rarely seen by casual users (many do not even notice them), leaving talk page conversations usually dominated by hard-core Wikipedians or cranks (and the two categories often overlap).  Many articles are subject to perennial flame wars: whether Wikipedia’s trickster sister Encyclopedia Dramatica deserves an article (warning: the author of this post strongly encourages you not to visit Encyclopedia Dramatica), whether a formerly-German, now-Polish city on the Baltic should have its name rendered “Danzig” or “Gdansk,” and whether the astronomer Nicolaus Copernicus should be a “Pole” or a “German” (a distinction Copernicus would not have understood).  Yet some of the most contentious flame wars are on a subject that one would not expect: race in antiquity.  See the talk page of the Ancient Egyptian Race Controversy page.  For over a century, there has been vigorous academic debate on the subject, and the popular debate on Wikipedia makes that academic debate look positively civilized by comparison.  The page comes with an astounding twenty-three archives of discussion and warnings telling you that the Arbitration Committee has placed the article under probation, that the subject is controversial and in dispute, that the article has been through Wikipedia peer review (such a thing does, in fact, exist), that the page survived a vote on deletion, and, amusingly enough, a little dove image telling the user to remember etiquette.  The article’s first archive alone is enough to give one a major headache, and the thought that there are twenty-two more spanning half a decade of running argument boggles the mind.  That this much discussion hides in the shadow of a relatively modest article shows both how much work goes into Wikipedia and how much controversy the past can create even after a gap of two and a half millennia.

Wikipedia shows that history is alive and well on the Internet, still arousing passions and still leading to ferocious debates.  It does, however, demonstrate that not all articles are created equal, that one should not presume the average Wikipedia article is of equal caliber to the ones with that tell-tale star, and that maybe, just maybe, one should look at the talk page before accepting the article’s contents as truth.