Clew or Clue?: Analyzing Two Web-Based Language Evolution Tools

In the 1987 comedy-romance Roxanne with Steve Martin and Daryl Hannah, one man (C.D.) attempts to help another (Chris) woo a woman in whom they both have a romantic interest (Roxanne). After a disastrous initial encounter, Chris requests that C.D. help him talk to Roxanne. Standing under her window at night, C.D. hides in the shadows and attempts to feed romantic lines to Chris. Roxanne questions why Chris said the things he’d said, and C.D. whispers “Tell her you were afraid.” Chris shouts up at the window, “Because I was afraid.” “Of me? Afraid of what?” Roxanne responds. “Tell her you were afraid of words,” C.D. hisses to Chris in the dark. “What?” Chris asks. “Words.” Chris looks up to Roxanne’s window and passionately exclaims, “Because I was afraid of worms, Roxanne, worms!”

In both love and scholarship, words matter. In scholarship, changes over time in which words are used, and how, can reveal shifting intellectual, social, and cultural trends. However, historians who want to analyze the evolution of language are likely to find combing primary sources and recording the data by hand too tedious to manage beyond a very small sample. That’s where two web-based tools can be of great help: TIME Magazine Corpus and Google Ngram.

TIME Magazine Corpus

According to the TIME Corpus landing page, this tool allows users to search approximately 100,000,000 words of text in 275,000 articles from TIME, 1923-2006. The corpus is hosted by BYU, which also hosts other corpora spanning more contexts and periods, and it provides a brief tour of its offerings. The tour notes that you can search for terms used during a particular time period to discover social and cultural change or significant events; the use of particular words and phrases as language trends; word roots, prefixes, and suffixes; types of words that were or were not used in a particular period; and the change of meaning over time.

In my own research, a language change I often encounter in print is clew vs. clue. I entered “clew” into the search bar, and the tool returned 12 instances of the word in TIME between 1924 and 1964:

Searching for “clue” returns 919 hits. The tool helpfully breaks down the use of the searched-for word by decade to reveal potential trends, and provides brief context (with the ability to click through for full context) for each use of the search term:

This tool is sophisticated. It can show results in either a list or a chart, search for words that appear near one another (collocates), and compare how two search terms are positioned in sentences. It also offers a slightly-too-advanced-for-my-understanding feature called KWIC, or Keyword in Context display. The TIME Corpus stands out as a tool for researchers because it not only can search a massive amount of data, but can do so intelligently, taking into account parts of speech, synonyms, and even context.
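For readers as puzzled by KWIC as I was, the basic idea can be sketched in a few lines of Python: find each occurrence of a keyword and line it up with a fixed amount of text on either side. This is only a rough illustration of the general technique, not how the TIME Corpus itself works; the function name kwic, the width parameter, and the sample sentence are all invented for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per whole-word match of `keyword`,
    with up to `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, not taken from TIME.
sample = ("Police found a clew near the river. "
          "A second clew turned up a day later, but no clue pointed to a suspect.")

for line in kwic(sample, "clew", width=20):
    print(line)
```

Because the keyword sits in the same column on every line, patterns in the neighboring words are easier to spot by eye than they would be in running text.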

Google Ngram

Ngram is a more user-friendly tool than TIME Corpus, and it draws on an even larger data set (a large collection of books digitized for Google Books, in multiple languages). While the searches cannot be quite as sophisticated as those in TIME Corpus (one cannot, for example, search by context), the data available allows historians to make more general claims about language than a search of a single magazine would. The user guide is a little more helpful and thorough as well.

Unlike TIME Corpus, searches do not provide context; there is no way to click through to a specific instance of usage without first breaking the results down into smaller time periods. However, hovering over the line graph created by entering search terms shows the year and each term’s share, as a percentage, of all the words in the data set for that year.
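The percentage on the graph is simple arithmetic: the number of times a term appears in a given year divided by the total number of words in that year's sample. A minimal sketch, assuming that definition; the function name share_of_words and all of the counts below are hypothetical and invented purely for illustration, not real Ngram figures.

```python
def share_of_words(match_count, total_words):
    """Percentage of all words in one year's sample accounted for by a term."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * match_count / total_words


# Hypothetical counts for a single year (invented, not real data).
hypothetical_year = {"clew": 150, "clue": 50, "total_words": 10_000_000}

for term in ("clew", "clue"):
    pct = share_of_words(hypothetical_year[term], hypothetical_year["total_words"])
    print(f"{term}: {pct:.6f}%")
```

Dividing by the yearly total is what makes different years comparable, since far more text survives from some years than from others.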

The resulting charts are aesthetically pleasing and make it easy to compare terms. Using the clew/clue example again, but this time searching American English from 1629-2000, this was the result:

Changing to British English created this chart:

Isolating to English fiction shows a few interesting spikes:

This allows historians to make claims about trends in language based on a much larger data set, one that can then be further analyzed by time period or region of the world.

Since Ngram is part of Google, it is unsurprising that there are options to tweet search results or embed them on a webpage. Google also offers the raw data for download so that researchers can incorporate it into other tools or experiments.
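As a rough idea of what working with the downloaded data might look like, here is a sketch that totals a term's occurrences by decade. It assumes tab-separated rows of ngram, year, match count, and volume count; the exact layout differs between releases of the data set, so check the download page before relying on it. The function name counts_by_decade and the sample rows are invented for this example.

```python
import csv
import io
from collections import defaultdict


def counts_by_decade(rows, term):
    """Sum match counts per decade for one term.

    Assumes each row is (ngram, year, match_count, volume_count).
    """
    totals = defaultdict(int)
    for ngram, year, match_count, _volume_count in rows:
        if ngram.lower() == term:
            decade = (int(year) // 10) * 10
            totals[decade] += int(match_count)
    return dict(totals)


# Invented sample rows standing in for a downloaded file.
sample_tsv = (
    "clew\t1901\t12\t9\n"
    "clew\t1905\t8\t6\n"
    "clue\t1901\t5\t4\n"
    "clew\t1912\t3\t3\n"
)
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE))

print(counts_by_decade(rows, "clew"))
```

With a real file, the io.StringIO wrapper would be replaced by an opened file handle; the grouping logic stays the same.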

Both TIME Corpus and Ngram allow historians to search through significant amounts of data quickly and fairly simply, and to substantiate arguments about language over time and the importance of words (and “worms,” should you choose to enter it as your search term).
