The Corpus of Historical American English and Google Books

The Corpus of Historical American English and Google Books/Culturomics


First, a definition. A corpus is a large or complete collection of written texts, gathered so that the language in them can be studied. This blog post compares three recently created resources that serve as corpora of American English: the Corpus of Historical American English (COHA), Google Books Standard, and Google Books Advanced.

Next, a little background on each corpus. COHA was created by Mark Davies of Brigham Young University (BYU) with funding from the US National Endowment for the Humanities and was released in 2009. It contains more than 400 million words of American English text from 1810 to 2009 and is one of the largest structured corpora of historical English. With this online resource you can see how words, phrases, and grammatical constructions have changed in frequency, how words have changed meaning over time, and how stylistic changes have taken place in the language. You can also download an offline interface to use.

Google Books Standard was also created by Mark Davies of BYU and was released in October 2010. It contains 155 billion words but does not offer as wide a range of searches as COHA. In May 2011, the Google Books BYU/Advanced version was released. This newer interface searches the same 155 billion words of American English, including 62 billion words from 1980-2009. It is a hybrid of COHA and Google Books Standard, and it is much more advanced than the original Google Books interface: you can search by word, phrase, substring, lemma, part of speech, synonyms, and collocates, and you can easily compare the data in two different sections of the corpus. Although this corpus is based on Google Books data, it is NOT an official product of Google or Google Books.

COHA is a lot smaller than both Google Books interfaces but offers an extremely wide range of searches. For exact words and phrases, all three resources give nearly the same results, so COHA is probably sufficient for those searches. With the Standard Google Books interface, however, you get less information on frequency and related phrases than with the other two. The Standard interface is also limited when it comes to related words and cultural insights, whereas COHA and Advanced Google Books allow more interesting and useful searches, such as finding all words with the suffix “ism.”
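To make the idea of a suffix search concrete, here is a minimal Python sketch that counts words ending in "ism" by decade. It is only an illustration of the concept, not how COHA or the Google Books interfaces work internally: the tiny `TOY_CORPUS` and the function name `suffix_counts_by_decade` are invented for this example, and the real corpora are far larger and are searched through their own web interfaces.

```python
from collections import Counter

# Toy stand-in for a dated corpus: (year, text) pairs.
# These sentences are invented purely for illustration.
TOY_CORPUS = [
    (1850, "abolitionism and mesmerism filled the newspapers"),
    (1920, "socialism and capitalism were debated"),
    (1925, "critics of capitalism spoke out"),
]

def suffix_counts_by_decade(corpus, suffix):
    """Count every word ending in `suffix`, grouped by decade."""
    counts = {}
    for year, text in corpus:
        decade = year // 10 * 10  # e.g. 1925 -> 1920
        for word in text.lower().split():
            # Require at least one character before the suffix,
            # so the bare string "ism" itself is not counted.
            if word.endswith(suffix) and len(word) > len(suffix):
                counts.setdefault(decade, Counter())[word] += 1
    return counts

result = suffix_counts_by_decade(TOY_CORPUS, "ism")
for decade in sorted(result):
    print(decade, dict(result[decade]))
```

Grouping by decade mirrors the kind of over-time comparison described above; a real search would also have to deal with punctuation, spelling variants, and much more text.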

You can also search for concepts, not just exact words and phrases. With COHA and the Advanced Google Books interface you can use built-in synonyms to search for the frequency of a concept, while with the Standard interface you can only look for exact words and phrases. COHA and Advanced Google Books also let you investigate changes in meaning (through collocates), shifts in the function of words, grammatical change, and how language change varies by genre.
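A concept search can be pictured as adding up the frequencies of a whole set of synonyms instead of a single word. The Python sketch below shows that idea under stated assumptions: the synonym set in `SYNONYMS`, the sample sentences, and the function name `concept_frequency_per_million` are all hypothetical, and the actual synonym lists built into COHA and Advanced Google Books are not reproduced here.

```python
# Hypothetical synonym set standing in for an interface's built-in synonyms.
SYNONYMS = {"beautiful": {"beautiful", "lovely", "pretty", "handsome"}}

# Invented (year, text) pairs used only to exercise the function.
SAMPLE_TEXTS = [
    (1850, "a handsome house and a lovely garden"),
    (1990, "a pretty good day"),
]

def concept_frequency_per_million(corpus, concept, synonyms=SYNONYMS):
    """Return, per decade, how often any synonym of `concept` occurs,
    normalized to occurrences per million words."""
    targets = synonyms[concept]
    hits, totals = {}, {}
    for year, text in corpus:
        decade = year // 10 * 10
        words = text.lower().split()
        totals[decade] = totals.get(decade, 0) + len(words)
        hits[decade] = hits.get(decade, 0) + sum(w in targets for w in words)
    # Normalizing by decade size keeps large and small decades comparable.
    return {d: hits[d] / totals[d] * 1_000_000 for d in totals}

freq = concept_frequency_per_million(SAMPLE_TEXTS, "beautiful")
for decade in sorted(freq):
    print(decade, round(freq[decade]))
```

Reporting a rate per million words rather than a raw count matters because the amount of text differs from decade to decade; without normalization, a decade with more text would always appear to use a concept more.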

Overall, the Standard Google Books interface is very neat, but all it does is let you search for the frequency of words or exact phrases over time, whereas COHA and the Advanced Google Books interface allow for much broader and more interesting searches. Why are these interfaces important? Comparing words and phrases gives us, as historians, great insight into the cultural, social, and historical changes reflected in American English across different periods of history. These interfaces provide a valuable source for a part of history that many people ignore: the history of words, and with it the history of American English. Check out the interfaces and you might just be surprised at what you find!
