Google Ngram Viewer: A marriage of textual analysis and a Google search

Before sharing my thoughts on the Google Ngram Viewer, here is some basic information on how to use it. If you’d like to follow along as you read through this first part of the post, open the Google Ngram Viewer in another tab. Let’s get started!

The Basics

What is it? Google Ngram Viewer—hereafter referred to as Google Ngram—is a text analysis and data visualization tool that allows users to see how often a certain word, phrase, or variation of a word or phrase is found in books and other digitized texts. Given a set of simple parameters, it combs through all text sources available on Google Books.

Given your input and designated parameters, this tool maps how often (y-axis) your word(s) or phrase(s) appear among all of the words in texts published year-to-year (x-axis). It is important to note that it does not show how many times such words or phrases were used in a corpus of books. (“How many times” will be addressed at the end of this post.) Rather, it displays a percentage.
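To make the difference between a percentage and a raw count concrete, here is a minimal sketch in Python. The function name and the counts are my own made-up illustrations, not Google's code or real Google Books figures.

```python
def relative_frequency(match_count: int, total_ngrams: int) -> float:
    """Return matches as a percentage of all same-length ngrams in one year."""
    if total_ngrams <= 0:
        raise ValueError("total_ngrams must be positive")
    return 100.0 * match_count / total_ngrams


# Hypothetical counts for a single year (not real Google Books data).
matches_in_year = 250
all_unigrams_in_year = 5_000_000

print(f"{relative_frequency(matches_in_year, all_unigrams_in_year):.4f}%")  # 0.0050%
```

The same 250 matches would produce a much higher percentage in a year with fewer published words, which is why the y-axis cannot be read as a count.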

These parameters include: A) a word or words, which are referred to as grams. For example, “car” is a unigram or 1-gram. “Shirley Chisholm” is a bigram or 2-gram. “Garden of Eden” is a trigram or 3-gram. “X marks the spot” is a quadgram or 4-gram. Phrases of up to five words (5-grams) are searchable; longer phrases are not. The “n” in “ngram” stands for the number of words in the sequence, so “Garden of Eden” is an ngram with n = 3. You can also enter several words or phrases at a time, separated by commas.
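If it helps to see grams in action, here is a small sketch that splits a phrase into its n-grams, where n is the number of consecutive words in each gram. The `ngrams` helper is hypothetical, and it splits on whitespace only, which is far simpler than whatever tokenization Google actually uses.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words in text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "X marks the spot"
for n in range(1, 5):
    print(n, ngrams(phrase, n))
```

For example, `ngrams("Garden of Eden", 2)` yields the two bigrams “Garden of” and “of Eden”, while the whole phrase is a single 3-gram.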

B) the years between which you want to view the frequency of your grams. You can gather data from available materials dating from as far back as 1500 to as recently as 2008. This is a limitation if the materials you wish to search were published before or after this period, though the range will change as Google digitizes more books and other texts.

C) a corpus in which you would like to search for your grams. These include (but are not limited to) English, Spanish, and Chinese (simplified). The options are extensive, yet limited to what Google Books has made available for searches.

D) smoothing. Essentially, this replaces each data point on your graph with an average of its neighbors. For example, if you select a smoothing of 6, then the value shown for, say, 1680 will be the average of the raw values for 1680 itself, the six years before it, and the six years after it (13 values in all).

Before smoothing the data by 6 for “industry”
After smoothing the data by 6 for “industry”
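Smoothing is a centered moving average, and a short sketch may make that clearer. The `smooth` function below is my own illustration. How it handles the first and last years (averaging only the values that exist) is an assumption, not a statement of how Google computes its graphs.

```python
def smooth(values: list[float], window: int) -> list[float]:
    """Average each value with up to `window` neighbors on either side."""
    if window < 0:
        raise ValueError("window must be zero or greater")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window)                # clip at the first year
        stop = min(len(values), i + window + 1)   # clip at the last year
        neighborhood = values[start:stop]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Made-up yearly values, one per consecutive year.
raw = [1, 2, 3, 4, 5]
print(smooth(raw, 1))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

With a window of 0 the data come back unchanged, which corresponds to the “before smoothing” view.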

E) case-insensitive. By default, Google Ngram searches for your ngrams in exactly the case that you’ve input unless you check the “case-insensitive” box.

There are many more search parameters that limit, expand, or diversify the variants of the grams for which you want to search. These are referred to as wildcard searches, inflection searches, part-of-speech tags, and ngram compositions, and Google Ngram provides a useful tutorial on its info page. I will demonstrate the wildcard search as one example.

To conduct a wildcard search, put an asterisk in place of a word in your phrase, for instance before or after the words you enter. The resulting graph will display the ten most frequent words that fill the asterisk’s position, each shown along with your word or phrase. For example, try “Sea of *” in the English corpus from 1500 to 2008 with no smoothing and case-sensitive input.
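The idea behind a trailing wildcard can be sketched on a toy scale: find every occurrence of the fixed words, tally whichever word follows, and keep the most common ones. The `wildcard_fill` helper and the three sample sentences are invented for illustration and say nothing about how Google implements the feature.

```python
from collections import Counter


def wildcard_fill(texts: list[str], prefix: str, top: int = 10) -> list[tuple[str, int]]:
    """Count the words that directly follow `prefix` (case-sensitive)."""
    prefix_words = prefix.split()
    k = len(prefix_words)
    counts = Counter()
    for text in texts:
        words = text.split()
        # Stop early enough that a following word always exists.
        for i in range(len(words) - k):
            if words[i:i + k] == prefix_words:
                counts[words[i + k]] += 1
    return counts.most_common(top)


# A made-up three-sentence "corpus".
corpus = [
    "the Sea of Galilee lay calm",
    "beyond the Sea of Japan",
    "by the Sea of Galilee",
]
print(wildcard_fill(corpus, "Sea of"))  # [('Galilee', 2), ('Japan', 1)]
```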

What can Google Ngram tell us, or not?

Simple searches using Google Ngram can tell us a lot, and open opportunities for further inquiry. You can enter any term or phrase that is relevant to your research. Let’s use an example from my own research: the quadgram phrase “dollar of our daddies”.

I’d like to answer the following question. Was the phrase “dollar of our daddies” used in American English books published before 1873?

You’re probably asking, “What in the world does ‘dollar of our daddies’ mean? And why does Jonah care about its use in American English published text around 1873?”

Well, I’ll tell you.

The “dollar” refers to the silver one-dollar coin, which ceased to be minted and lost its legal-tender status after passage of the Coinage Act of 1873. (The US Mint began minting the silver dollar again starting in 1878.) The “daddies” are members of the American Revolutionary generation who saw the Constitution’s ratification and the first silver dollar minted in 1794. The Act is often referred to as the “Crime of 1873” because white American men (overwhelmingly) viewed demonetization of the silver dollar as deprivation of a part of their ancestral American heritage and manhood. There remains much research and writing to be done on the history of money in American life. And at its core, the above question asks when American specie came to represent a kind of memorial to America’s founding generation. Google Ngram gives me a fairly conclusive answer to my initial question and thus to part of the broader project.

I am immediately struck by a second question. What is the first book that uses this phrase? A quick Google Books search for books and documents published between 1873 and 1900 reveals that volume 11 of the Brotherhood of Locomotive Engineers’ Monthly Journal, published in January 1877, uses the phrase on page 486. This information (the date, the type of publication, and the publishing organization) is useful. Using Google Ngram helped me to narrow the scope of my research inquiry and encouraged me to search for books and documents, published from 1873 onward, that are available through Google Books. It provided strong evidence to support my hypothesis that “dollar of our daddies” is a product of a very specific moment in American history.

And still I am left wanting more. Why was this phrase coined after 1873 and not earlier, or later for that matter? When? What? But, why? As a textual analysis tool, Google Ngram goes a long way toward answering “when” and “what” questions and then leaves me wanting answers to “why.” I am left with the urge to dig deeper, to read the books and magazines that turned up in my Google Books search, and to visit the library and borrow books on memory, money, and identity in nineteenth-century America. I’ve done a minimal amount of quantitative analysis on this phrase, and now I want to pursue qualitative analysis. As I read through sources, I will undoubtedly come across other terminology to enter into Google Ngram’s search box. And the cycle continues.

Google Ngram is a powerful tool for anyone. It lets scholars conduct the sort of quantitative analysis that otherwise would be prohibitively time-consuming and expensive. Text analysis on this scale would have been impossible before the arrival of computers at universities in the 1960s. Furthermore, people don’t often think of Google Books as a database unto itself. It is, however, a conglomeration of textual material from a range of libraries and other parties that partner with Google, and it makes published texts available to anyone with unrestricted internet access. It grants access to a truly vast amount of text with a few parameters and the click of a mouse. (I haven’t tried, but see if you can calculate the number of parameter combinations that can be entered. It will take a long, long while.)

On the other hand, it is like any other conglomeration of textual material: a growing but limited resource. Google is sometimes thought of as the idealized modern reincarnation of the Library at Alexandria, a sort of one-stop shop for all of the information one could desire. Just think about the names “Google” and “Google Ngram.” A googol, or 10^100, is the funny number that is Google’s namesake. Ngram, as in “x” to the “nth” degree. Yet for all the seemingly unlimited knowledge to be had, Google Ngram is limited in one major way: it mines only Google Books for data, because both are Google products. Unlike American University’s library database, Google Ngram does not search across databases such as the Internet Archive, which similarly gives open access to printed texts. Furthermore, Google Books only had approximately 500,000 books dating to before the nineteenth century, so a frequency that appears high in those early years for a word like “car” may give a distorted impression or confuse some users (as well as require some additional archival research).

Remember: Google Ngram displays the frequency, not the number of times, that a word or phrase appears in texts dated to a specific year. Unless your question ends with “in textual sources available through Google Books,” research should never stop with Google Ngram. Ask more questions.

Google Ngram has the power to get people thinking about the endlessly intriguing knowledge to be had simply by clicking “search lots of books” as a start. And that’s just it. It has the power to get people thinking.

An AP World History teacher may use this Google Ngram to point to the spread of new ideas about leadership in the lead-up to the twentieth century:

An LGBTQ+ Studies professor may use this Google Ngram to show students the relationship between terms such as “gay” and “homosexual” in American culture after 1840:

For students who are more inclined to learn visually than textually, visual representations of textual analysis may go a long way toward driving home a lesson.

Are you a researcher who wants access to the raw corpus data? There is an option to download a .zip file containing that data.

And for those who have ever google’d “Google,” this is for you:

Google Ngram Viewer is an immense tool. A fun tool. And one that raises more questions than it answers. Ask away!

Visualizing Your Data With IBM’s Many Eyes

Many Eyes is a powerful tool that enables a user to create visualizations from any kind of data set.

Here’s where it gets fun: while a user can upload their own data set, Many Eyes is a community-powered tool. There are over 150,000 data sets to choose from, and many are pre-visualized.

Another (seemingly underused) feature is Topic Centers. Topic Centers allow teams of people to collaborate on visualizations. They are organized around certain topics (makes sense, right?), as well as around teams of people at organizations and classes (like this one).

Here are some examples:

Average Time Spent Commuting by State

Number of arrests by age and type of crime

News Blogs Dominated By A Few Startups

But selecting a dataset from the community is not always the best option: the metadata associated with many of the datasets is inaccurate or incomplete. Fortunately, what makes Many Eyes such a versatile tool is that it accepts any type of data, so long as it is in a structured format. Data needs to be formatted first in Microsoft Excel (or similar spreadsheet software) and then pasted into Many Eyes’ Web interface.
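If your data starts life in a script rather than a spreadsheet, a sketch like the following produces the same kind of structured text: one header row, then one row per record, with columns separated by tabs (which is what copying cells out of a spreadsheet typically gives you). The helper name, column names, and numbers are all made up, and I am assuming that tab-separated text is an acceptable paste format.

```python
import csv
import io


def to_tab_delimited(header: list, rows: list) -> str:
    """Render a header and rows as tab-separated text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; not taken from any real dataset.
header = ["State", "Average commute (minutes)"]
rows = [
    ["Example State A", 25],
    ["Example State B", 18],
]
print(to_tab_delimited(header, rows))
```

Keeping every row the same length, with a single header row on top, is what makes the data “structured” in the sense above.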

Then the user is presented with an array of visualization options, from tag clouds and word trees to assorted graphs and even maps.

A few potential uses for historians:

  • Take a historical text or speech (e.g., the Gettysburg Address) and create a tag cloud from it, where the more frequently a word is used, the larger it will appear.
  • Create a network diagram to visualize a historical figure’s family tree.
  • Use a map to show population trends over time.
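The tag cloud idea from the first bullet boils down to counting words and scaling a font size by the count. Here is a rough sketch of that logic. The function, the size range, and the linear scaling are my own assumptions, not a description of how Many Eyes draws its tag clouds.

```python
import re
from collections import Counter


def tag_cloud_sizes(text: str, min_size: int = 12, max_size: int = 48, top: int = 5) -> dict:
    """Map the most frequent words to font sizes between min_size and max_size."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts tie
    return {
        word: min_size + (max_size - min_size) * (count - lowest) // span
        for word, count in counts
    }


# A well-known phrase from the Gettysburg Address.
phrase = "government of the people, by the people, for the people"
print(tag_cloud_sizes(phrase))
```

In practice you would probably drop common words like “the” and “of” before counting, so that the meaningful words stand out.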

Over the summer, I took air traffic control data and visualized it using Many Eyes, for fun. It was easy to use every step of the way. In fact, it’s so easy to use, the hardest part should be finding the data in the first place.

It is beyond imperative to have good visuals when working on the Web, since readers hate long blocks of static text. Bringing a history project to the Web calls for the use of visualizations like those that can be generated using Many Eyes. It will make your work more attractive, and will certainly help your readers understand things better. At the end of the day, it’s all about them!