Google Books has so far digitized 15 million books (and counting), or 12 percent of all books ever published. While it is impossible for anyone to read this vast amount of literature, the Google Ngram tool lets any user search linguistic trends spanning centuries with the click of a button.
Created by a team of researchers centered at Harvard, Google Ngram draws on 5,195,769 books (roughly 4 percent of all that have been published) for quantitative analysis of their contents, an approach the team calls “culturomics.” Both the concept and the interface of Google Ngram are easy to grasp.
Below is a search for the word “history” in American English. Ngram lets you switch between languages such as Russian, Spanish, or Chinese. It also allows comparisons between American and British English, and between general English and English fiction, which can make for fascinating results. The “English One Million” option narrows your search to 6000 books per year from 1500 to 2008 for a more focused search. You can also choose which years to search. Finally, Ngram gives you the option of “smoothing” the yearly results, which evens out the graph by replacing each year's value with the average of that year and the years immediately before and after it; a smoothing of three averages each year with the three years on either side. I found the default smoothing of three to be effective. Perhaps someone with a better grasp of statistics could better explain this to the class?
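For anyone who wants to see what smoothing does to the numbers, here is a minimal sketch of a centered moving average in Python. It is an illustration of the general idea, not Ngram's actual code: the function name smooth and the sample values are made up, and I am assuming that years near the edges of the range are simply averaged over whatever neighbors exist.

```python
def smooth(values, smoothing=3):
    """Centered moving average (illustrative sketch, not Ngram's own code).

    Each yearly value is replaced by the average of itself and up to
    `smoothing` neighbors on each side. Assumption: at the edges of the
    range, the window shrinks to the neighbors that actually exist.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                 # first index in the window
        hi = min(len(values), i + smoothing + 1)   # one past the last index
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up yearly frequencies with a single spike in the third year.
yearly = [0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 0.0]
print(smooth(yearly, smoothing=0))  # no smoothing: the raw values
print(smooth(yearly, smoothing=3))  # the spike is spread over its neighbors
```

A smoothing of zero returns the raw yearly values, while larger settings flatten one-year spikes, which is why a higher setting makes long-term trends easier to see at the cost of hiding short-lived ones.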
Ngram also lets users download the datasets on which it is built. For each language, the datasets for each “gram,” from one to five terms long, can be downloaded and manipulated for further experiments. Although I had some difficulty doing this, it seems admirable that the creators of Ngram make the data available to the public. Currently, the datasets available are from July 15, 2009, when Ngram was first created.
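For classmates who want to try the downloads, here is a rough sketch of how one of the files might be read in Python. It rests on an assumption about the layout: that each line is tab-separated and begins with the n-gram, the year, and the number of matches, followed by further count columns. Please check the format description on the download page before relying on it. The function name yearly_counts and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def yearly_counts(lines, term):
    """Sum the match counts per year for one term.

    Assumption: each line is tab-separated and starts with
    ngram, year, match_count (any later columns are ignored).
    """
    counts = defaultdict(int)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ngram, year, match_count = row[0], row[1], row[2]
        if ngram == term:
            counts[int(year)] += int(match_count)
    return dict(counts)


# Invented sample rows standing in for a downloaded 1-gram file.
sample = io.StringIO(
    "history\t1900\t120\t100\t80\n"
    "history\t1901\t130\t110\t85\n"
    "historian\t1900\t15\t14\t12\n"
)
print(yearly_counts(sample, "history"))
```

With a real file you would pass an open file object instead of the sample, for example the result of open(path, encoding="utf-8", newline=""). The real files are large, so reading them line by line like this is safer than loading one into a spreadsheet.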
Ngram opens fascinating research possibilities for historians. As mentioned in the Michel et al. article, Ngram makes it possible to track the usage and evolution of words through printed history, and even to detect government censorship, as in Nazi Germany. Below is another Ngram I ran comparing the use of the terms “USSR,” “Cold War,” and “Nuclear” in American English. It is interesting to see that by 2000, after the fall of the Soviet Union, the term “Cold War” had surpassed the other two.
After having experimented with this tool, what are your impressions? Are there shortcomings to how Ngram can be of use to historians? Ngram does acknowledge that information before the 1700s can be skewed because few books were published during this time. How do you envision using a tool like Ngram in your projects in the future?