A True Social Science: Review of “Quantitative Analysis of Culture Using Millions of Digitized Books”

Imagine being able to examine how certain aspects of culture have changed over two hundred years. Michel et al. argue that this is not only possible but can be done in just a few seconds. The methodology they employ in “Quantitative Analysis” has the potential to redraw the landscape of the social sciences completely. Michel et al. do something truly amazing: they use the processing power of computers to observe what they claim are changing cultural trajectories. To do this, they examine the frequency of words in the millions of digitized books available to them. The context in which the words are found does not matter at all, and the genres of the books are not even considered; only the frequency of the words themselves counts. Michel et al. call their new field “Culturomics.”

The format of “Quantitative Analysis” will be unfamiliar to most historians because it was published in a scientific journal and conforms to the standards of a science publication. As such, it begins by laying out the methodology and some of the more basic mathematics employed. The most significant point to emerge from this segment is that the authors draw on what they estimate to be approximately 4% of all books ever published: over 5 million books containing over 500 billion words. Most of these books are concentrated in recent years, with the number gradually declining the earlier the publication date.

Millions of Books!

The first assertion the authors make using this vast corpus is that cultural changes guide the language we use. They illustrate this with the frequency of the word “slavery,” which, as expected, rises around the Civil War. The logic can also be run in reverse: changes in language within the corpus can be used to make arguments about changes in culture.
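To make the idea of a word's frequency over time concrete, here is a minimal sketch in Python. It is not the authors' code: the toy corpus, the function name `relative_frequency`, and the whitespace tokenization are all assumptions made for illustration. The sketch divides the number of times a word appears in a given year by the total number of words published that year, which is roughly what a “trajectory” plots.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, text) pairs standing in
# for digitized books. The real corpus holds millions of volumes.
toy_corpus = [
    (1850, "the debate over slavery divided the states"),
    (1861, "slavery and the war over slavery"),
    (1861, "the war began in the spring"),
    (1900, "the new century brought new machines"),
]

def relative_frequency(corpus, word):
    """Return {year: occurrences of word / total words published that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}

trajectory = relative_frequency(toy_corpus, "slavery")
for year, freq in trajectory.items():
    print(year, round(freq, 3))
```

Dividing by the yearly total matters because far more books survive from recent years; raw counts alone would make almost every word look like it is on the rise.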

The authors also examine how frequently a year, such as 1951, occurs in books, and use the data to assert that the past fades rapidly from collective memory. Along the same lines, the authors examine the time it takes for an invention to become widely discussed and argue that people now accept technological innovation more rapidly than ever. They then apply similar examinations to people to trace how long they remain in public memory.

Perhaps the most fascinating segment of the article deals with the use of the corpus to detect censorship. First, the authors set out to examine whether censoring someone leaves a noticeable mark on that person's cultural impact. By tracking how often the names of censored individuals appear, the authors observe that, particularly for politicians (and less so for historians), censorship resulted in a sharp decline in the appearance of their names (what the authors would call cultural impact). The authors then ask whether one could detect censorship without knowing in advance whether someone was censored. Using statistical comparisons on Germany during the period of Nazi censorship, the authors found that they could easily see it. The ability to detect censorship without the censorship itself being recorded has the potential to be an extremely useful tool for historians.
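One way to picture this kind of statistical comparison is to set a name's average frequency during a suspect period against its average in the years on either side. The Python sketch below is a simplified ratio of my own, not the exact statistic the authors use; the function names, the year windows, and the numbers (occurrences per million words for an invented name) are all hypothetical.

```python
def mean_frequency(trajectory, start, end):
    """Average frequency over the years start..end (inclusive) present in the data."""
    values = [freq for year, freq in trajectory.items() if start <= year <= end]
    return sum(values) / len(values)

def suppression_ratio(trajectory, during, before, after):
    """Frequency during a period divided by the average of the surrounding periods.

    A value well below 1 suggests the name was mentioned far less than its
    baseline would predict. This is an illustrative measure only.
    """
    baseline = (mean_frequency(trajectory, *before)
                + mean_frequency(trajectory, *after)) / 2
    return mean_frequency(trajectory, *during) / baseline

# Hypothetical trajectory: occurrences per million words for an invented name.
name_trajectory = {
    1930: 4.0, 1931: 4.0,   # before
    1935: 1.0, 1940: 1.0,   # during the suspect period
    1950: 4.0, 1951: 4.0,   # after
}

ratio = suppression_ratio(
    name_trajectory,
    during=(1933, 1945),
    before=(1925, 1932),
    after=(1946, 1955),
)
print(ratio)
```

A low ratio only flags a name for a closer look. A drop could also come from a person simply falling out of fashion, which is exactly the kind of context the numbers alone cannot supply.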

The authors also use their corpus to estimate the size of the English language and attempt to explain why dictionaries fail to include many words. They even examine the rate at which new words emerge and other words die out. One of the more interesting points they make in their discussion of the lexicon is that irregularities in the English language often result in new forms coming into use, such as Americans moving from saying “sneaked” to “snuck.” While this is not as useful to historians, the authors assert that lexicographers will find it extremely useful.

As you would expect, the authors think “culturomics” should be applied to more than just books. According to them, “culturomics” can be applied to manuscripts, art, and more in the future. They also see “culturomics” as being useful to both the social sciences and the humanities. While there is disagreement over whether history is a social science or one of the humanities, “culturomics” certainly has potential uses for it either way. With a statistically significant sample size, historians now have the ability to make sweeping examinations of cultural changes for society as a whole.

Still, there are a number of issues with “culturomics.” In reality, you are not examining the cultural changes of an entire culture, just those of the literate elite. At the same time, the information acquired is devoid of all context, so historians may draw erroneous conclusions or connections. For example, a historian trying to gauge the significance of a speech might see a cultural change that coincides with the speech and connect the two, when in reality the change was caused by something else. The massive size of the corpus also brings its own limitations. Only sweeping generalizations can be made. If your examination is limited to a certain group or place, “culturomics” would be of little help. Sweeping generalizations about all the speakers of a certain language are of little use and are more likely to be misleading.

Do you think “culturomics” is a useful method? Do you think this is a tool for historians, or for other disciplines? Do you agree with the pitfalls I pointed out? Do you see any other potential pitfalls in using this method? I think the idea of bringing quantitative analysis to history has merit, but we are still a long way from finding a truly useful way of doing so. Historians will also most likely be resistant to quantitative analysis, especially given the postmodernist critique of scientific rationality. Hopefully one day history, mathematics, and science will work together to create a more robust and nuanced field.

If you would like to see the full dataset of 2 billion “culturomic” trajectories, go to www.culturomics.org, or you can just use the Google Ngram Viewer.

One Reply to “A True Social Science: Review of “Quantitative Analysis of Culture Using Millions of Digitized Books””

  1. Great post, Nathan! After reading Michel et al.’s article, I was completely hooked on “culturomics,” but you bring up important drawbacks to this methodology, particularly, as you stated, that “culturomics” is a study of the literate population. My hope is that if and when manuscripts, newspapers, and art are incorporated into these searches, as Michel et al. suggested, historians can gain a greater sense of sweeping cultural trends across a wider sample of the population.

    Either way, historians and scholars cannot rely on “culturomics” alone. Historians still need background knowledge of the context and, as you mentioned, Nathan, need to understand the pitfalls of taking words out of context. For example, I did a Google Ngram search of “pirate.” After 1930, the use of the word “pirate” drops, but it rises sharply again after 2000 and continues to rise. Is this because the blockbuster Pirates of the Caribbean movies were released throughout the 2000s? Or is it because there has been a rise in piracy off the coast of Somalia? Or is it a rising interest in the Pittsburgh Pirates baseball team? We know there has been a rise in the usage of “pirate,” but historians will need to delve further to uncover why this is so.

    With that said, “culturomics” gives historians unprecedented access to millions of books and trends among those books. This methodology can be a great way to make new discoveries, but more importantly can also stimulate more questions. Just by my quick search of “pirate,” I now have many interesting questions about the history of pirates and culture. Research on these questions would surely turn into a great research paper!
