Google Ngram Viewer: A marriage of textual analysis and a Google search

Before sharing my thoughts on the Google Ngram Viewer, let’s cover some basic information on how to use it. If you’d like to follow along as you read through this first part of the post, visit https://books.google.com/ngrams. Let’s get started!

The Basics

What is it? Google Ngram Viewer—hereafter referred to as Google Ngram—is a text analysis and data visualization tool that allows users to see how often a certain word, phrase, or variation of a word or phrase is found in books and other digitized texts. Given a set of simple parameters, it combs through all text sources available on Google Books.

Given your input and designated parameters, this tool maps how often (y-axis) your word(s) or phrase(s) appear among all of the words in texts published year-to-year (x-axis). It is important to note that it does not show how many times such words or phrases were used in a corpus of books. (“How many times” will be addressed at the end of this post.) Rather, it displays a percentage.
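To make the difference between a raw count and a percentage concrete, here is a minimal Python sketch. The two-year toy corpus and the function name `relative_frequency` are invented for illustration. This is not Google's code or data, only the kind of arithmetic the y-axis is described as showing.

```python
def relative_frequency(word, text):
    """Share of all words in `text` that are `word`, as a percentage."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 100 * words.count(word.lower()) / len(words)


# Hypothetical toy corpus: all of the text "published" in each year.
corpus_by_year = {
    1900: "the car stopped near the old mill",
    1901: "a car and another car passed the mill while the miller slept by the river",
}

for year, text in corpus_by_year.items():
    print(year, round(relative_frequency("car", text), 2))
```

In this made-up data, “car” appears more times in 1901 than in 1900 but makes up a smaller share of that year's words, which is why the height of the line cannot be read as a count.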

These parameters include: A) a word or words, which are referred to as grams. For example, “car” is a unigram or 1-gram. “Shirley Chisholm” is a bigram or 2-gram. “Garden of Eden” is a trigram or 3-gram. “X marks the spot” is a quadgram or 4-gram. Phrases of more than five words are not searchable. The “N” in “ngram” stands for the number of consecutive words in the sequence, so “Garden of Eden” counts as a single 3-gram rather than as a set of word pairs. You can also enter several words or phrases at a time, separated only by a comma (no spaces).
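For readers who think in code, here is a small Python sketch of what a gram is: a run of n consecutive words. The function name `ngrams` and the whitespace-only splitting are my own simplifications, not a description of how Google tokenizes its books.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in `text`, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "X marks the spot"
for n in range(1, 5):
    # n = 1 gives the single words, n = 4 gives the whole phrase.
    print(n, ngrams(phrase, n))
```

A four-word phrase therefore contains four 1-grams, three 2-grams, two 3-grams, and one 4-gram.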

B) years between which you want to view the frequency of your grams. You can gather data from available materials dating from as far back as 1500 to as recently as 2008. Unfortunately, this is a limitation if the materials you wish to search date to before or after this period. This will change as Google digitizes more books and other texts.

C) a corpus in which you would like to search for your grams. These include (but are not limited to) English, Spanish, and Chinese (simplified). Here are your options, which are vast yet limited based on what Google Books has made available for searches:

D) smoothing. Essentially, this replaces each data point on your graph with an average of its neighbors. For example, if you select a smoothing score of 6, then the data point for, say, 1680 will be the average of the values for 1680 itself and for the six years on either side of it (thirteen values in all).

Before smoothing the data by 6 for “industry”
After smoothing the data by 6 for “industry”
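The averaging behind a smoothed chart can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's implementation: the function name `smooth` is hypothetical, the year itself is included in the average, and near the first and last years I simply average whatever neighbors exist.

```python
def smooth(values, window):
    """Average each point with up to `window` neighbors on each side."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)                # clip the window at the first year
        hi = min(len(values), i + window + 1)  # clip the window at the last year
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Made-up yearly values with a single spike in the middle year.
raw = [0.0, 0.0, 9.0, 0.0, 0.0]
print(smooth(raw, 0))  # a window of 0 leaves the data unchanged
print(smooth(raw, 1))  # the spike is spread across its neighbors
```

A larger window flattens one-year spikes, which makes long trends easier to see but can hide exactly the kind of sudden appearance a historian may be looking for.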

E) case-insensitive. Google Ngram will by default search for your ngrams in the case that you’ve input unless you select this box:

There are many more search parameters that limit, expand, or diversify the variants of the grams for which you want to search. Google Ngram has provided a useful tutorial on its info page. The additional options are referred to as wildcard searches, inflection searches, part-of-speech tags, and ngram compositions. I will demonstrate the wildcard search as one example.

To conduct a wildcard search, place an asterisk before or after a word or phrase. The resulting graph will display the frequency of the ten most common words that precede or follow your word or phrase. For example: “Sea of *” in the English corpus from 1500 to 2008 with no smoothing and case-sensitive input:

What can Google Ngram tell us, or not?

Simple searches using Google Ngram can tell us a lot, and open opportunities for further inquiry. You can enter any term or phrase that is relevant to your research. Let’s use an example from my own research: the quadgram phrase “dollar of our daddies”.

I’d like to answer the following question. Was the phrase “dollar of our daddies” used in American English books published before 1873?

You’re probably asking, “What in the world does ‘dollar of our daddies’ mean? And why does Jonah care about its use in American English published text around 1873?”

Well, I’ll tell you.

The “dollar” refers to the silver one-dollar coin, which ceased to be minted and lost its legal-tender status after passage of the Coinage Act of 1873. (The US Mint began minting the silver dollar again starting in 1878.) The “daddies” are members of the American Revolutionary generation who saw the Constitution’s ratification and the first silver dollar minted in 1794. The Act is often referred to as the “Crime of 1873” because white American men (overwhelmingly) viewed demonetization of the silver dollar as deprivation of a part of their ancestral American heritage and manhood. There remains much research and writing to be done on the history of money in American life. And at its core, the above question asks when American specie came to represent a kind of memorial to America’s founding generation. Google Ngram gives me a fairly conclusive answer to my initial question and thus to part of the broader project.

I am immediately struck by a second question. What is the first book that uses this phrase? A quick Google Books search for books and documents published between 1873 and 1900 reveals that volume 11 of the Brotherhood of Locomotive Engineers’ Monthly Journal, published in January 1877, uses the phrase on page 486. This information (the date, the type of publication, and the publishing organization) is useful. Using Google Ngram helped me to narrow the scope of my research inquiry and encouraged me to search for books and documents, published beginning in 1873, that are available through Google Books. It provided strong evidence to support my hypothesis that “dollar of our daddies” is a product of a very specific moment in American history.

And still I am left wanting more. Why was this phrase coined post-1873 and not earlier, or later for that matter? When? What? But, why? As a textual analysis tool, Google Ngram goes a long way toward answering “when” and “what” questions and then leaves me wanting answers to “why.” I am left with the urge to dig deeper, to read the books and magazines that turned up in my Google Books search, to visit the library and borrow books on memory, money, and identity in nineteenth-century America. I’ve done a minimal amount of quantitative analysis on this phrase, and now I want to pursue qualitative analysis. As I read through sources, I will undoubtedly come across other terminology to enter into Google Ngram’s search box. And the cycle continues.

Google Ngram is a powerful tool for anyone. It lets scholars conduct the sort of quantitative analysis that otherwise would be prohibitively time-consuming and expensive. Text analysis on this scale would have been impossible before the arrival of computers at universities in the 1960s. Furthermore, people don’t often think of Google Books as a database unto itself. It is, however, a conglomeration of textual material from a range of libraries and parties that partner with Google, and it makes published texts available to anyone with unrestricted internet access. It grants access to a truly vast amount of text with a few parameters and the click of a mouse. (I haven’t tried, but see if you can calculate the number of parameter combinations that can be entered. It will take a long, long while.)

On the other hand, it is, like any other conglomeration of textual material, a growing but limited resource. Google is sometimes thought of as the idealized modern reincarnation of the Library at Alexandria, a sort of one-stop shop for all of the information one could desire. Just think about the names “Google” and “Google Ngram.” A googol, or 10^100, is the funny number that is Google’s namesake. Ngram, as in “x” to the “nth” degree. And for all the seemingly unlimited knowledge to be had, Google Ngram is limited in one major way: it mines only Google Books for data, because they are both Google products. Unlike American University’s library database, Google Ngram does not search across databases such as the Internet Archive, which similarly gives open access to printed texts. Furthermore, Google Books held only approximately 500,000 books dating to before the nineteenth century, so a frequency that appears high in those early years for a word like “car” may give a distorted impression or confuse some users (and may call for some additional archival research).

Remember: Google Ngram displays the frequency, not the number of times that a word or phrase appears in texts dated to a specific year. Unless your question is appended with “in textual sources available through Google Books,” research should never stop with Google Ngram. Ask more questions.

Google Ngram has the power to get people thinking about the endlessly intriguing knowledge to be had simply by clicking “search lots of books” as a start. And that’s just it. It has the power to get people thinking.

An AP World History teacher may use this Google Ngram to point to the spread of new ideas about leadership in the lead-up to the twentieth century:

An LGBTQ+ Studies professor may use this Google Ngram to show students the relationship between terms such as “gay” and “homosexual” in American culture after 1840:

For students who are more inclined to learn visually than textually, visual representations of textual analysis may go a long way towards driving home a lesson.

Are you a researcher who wants access to the raw corpus data? There is an option to download a .zip file containing that data.

And for those who have ever google’d “Google,” this is for you:

Google Ngram Viewer is an immense tool. A fun tool. And one that raises more questions than it answers. Ask away!

The Google Custom Engine: Refining Searching in a Few Steps

Sometimes it is a frustrating experience to search for a topic through the internet, only to have the search engine turn up results that are not related to what you are looking for. This problem is similar to what the Bing commercials looked to address with “search overload” during internet searches.

The Google Custom Search Engine provides its users with a search engine to put on their website; the main feature is that it is customizable to refine its search results based upon parameters set by the user.

This makes it easy to find information because the search engine will only look through the user-set websites and pages, and not through other places that are not topic-related.

Setting up a Google Custom Search Engine is an easy three-step process. In the first step, the user sets the parameters of the search engine by listing the websites it will search. The second step sets how the engine will appear on the website, and the third step provides the code to paste into the user’s website.

There are tons of smaller options that allow the search engine to be customized even further, from choosing sites to emphasize during the search, to making money from Google’s AdSense program.

One problem I could see with the search engine is that its usefulness is only as good as the sites that the user lists for the engine to use; if they do not know enough sites to put on the list, the search results may not be as complete.

One solution is that the search engine allows collaboration with invited users with limited access, letting them add sites and labels to the list as needed. The search engine can also choose instead to search through all pages, but emphasize the list of websites provided by the user.

The Google Custom Search Engine is basic in what it is used for, but can be further customized for advanced use in user interaction and how results are shown. Easy to set up, this search engine is one way for websites to ensure that their users are finding search results that are topic-related.

External Link to Example Search Engine
Smithsonian and DC Museums

Victorian Researcher Finds Google Makes His Life A Lot Easier

If you thought “Googling the Victorians” was about something else, you’ll be disappointed. In this article, Patrick Leary discusses how Google has made his life as a researcher of the Victorian era so much easier.

That’s to be expected with anything in digital history — wouldn’t our lives as historians be so much harder without Google?

But what is so surprising and unique about Leary’s article is how he views Google’s usefulness as something of an accident.

Leary writes about his search for a phrase that appeared in the Sunday Review. His search turned up the phrase in a number of other sources as well.

Leary writes: “Such experiences reinforce the conviction that the very randomness with which much online material has been placed there, and the undiscriminating quality of the search procedure itself, gives it an advantage denied to more focused research.”

While Google has helped his work, Leary also writes that it is no silver bullet and that one should always verify the authenticity of a source that is returned in a Google search.

“A great many legitimate scholarly purposes can nevertheless be served by an array of online texts that are, to one degree or another, corrupt,” he writes.

Later in the paper, Leary writes with excitement about the prospects of expanded digitization projects as well as improvements in optical character recognition, or OCR, the technology that enables the searching of 19th-century Victorian documents. He is also excited about the growing number of non-profit digitization initiatives, like the Internet Archive.

He then discusses how new generations will take this kind of research for granted.

“What we are seeing is arguably not merely an electronic supplement to traditional library and archival research, but a more fundamental shift in our relationship to the textual universe on which our research depends,” he writes.

In all, this paper is not at all surprising. It could be extrapolated and made applicable to other topics within history, or even other fields. But what makes it important is Leary’s anecdotes about how this has changed his life — and his field.