Finding Patterns in Your Text with Voyant Tools

Voyant Tools is a text analysis platform that allows you to analyze a single text document or compare multiple documents. It is particularly useful for analyzing large amounts of language-based data over time.


Single Text Document: Paste the text into the text box. This prevents extra text from a webpage from being included in your analysis. For example, if I want to look at the language in the Constitution, I should paste the plain-text Constitution into the text box rather than the URL of the National Archives Constitution Transcript webpage, to prevent “archives” from appearing as a top word in my analysis, as seen below.

Multiple Webpages: Paste the links to each webpage in the textbox with each URL on a separate line.

Multiple Documents: Move all the documents you want to analyze into a shared folder for easy access. Click “Open” under the text box and select all the documents you want to analyze. You have to select everything you want to include at once, as you cannot add more documents later. This is why it is easiest to put your materials in one central location.

The Main Tools:

Cirrus: Essentially a word cloud. Check out Ike’s blog post on Wordle for more information on how to use word clouds in your research.

Reader: Reader allows you to view the full uploaded text and select individual words to reveal frequency data across the text(s). For example, I uploaded transcripts of three oral histories I conducted on abortion last semester to analyze how much each narrator talked about health insurance in her interview. To do this, I clicked on the word “insurance” in the Reader section, and Voyant Tools generated the data below.

The data shows that the first narrator talked about insurance during one segment of the interview, the second discussed insurance throughout the interview, and the third narrator did not directly talk about insurance at all. The data appears in three different formats. In the top-left section, you can see where in each interview the word “insurance” was used; in the top-right section, you can see its frequency of use; and in the bottom section, there are links to the passages where “insurance” appears, providing context for your analysis.
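The kind of distribution data Reader surfaces can be approximated in a few lines of Python. The sketch below is a rough stand-in for Voyant's behavior, not its actual implementation, and it uses a made-up miniature transcript rather than my interview data: it splits a text into ten equal segments and counts how often a target word appears in each, revealing the same kind of clustering that showed up in the first narrator's interview.

```python
import re

def segment_counts(text, word, segments=10):
    """Count occurrences of `word` in each of `segments` equal slices of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    size = max(1, len(tokens) // segments)
    counts = []
    for i in range(segments):
        # The last segment absorbs any leftover tokens.
        chunk = tokens[i * size:(i + 1) * size] if i < segments - 1 else tokens[i * size:]
        counts.append(chunk.count(word.lower()))
    return counts

# Hypothetical miniature "transcript": the target word clusters in one
# stretch of the text, much like Narrator 1's discussion of insurance.
transcript = ("we talked about family " * 30 +
              "insurance coverage insurance cost insurance " * 5 +
              "we talked about school " * 30)
print(segment_counts(transcript, "insurance"))  # counts cluster in the middle segments
```

A real transcript would be read from a file instead of built from repeated strings, but the per-segment counts are the same idea Voyant plots as a sparkline.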

While this data provides little insight on its own, it is useful as evidence within the context of the interviews. Since I conducted the interviews, I know each narrator has a different understanding of her abortion: Narrator 1 was concerned with her own experience and aware of her privilege, Narrator 2 was concerned about access to abortion for women less privileged than herself, and Narrator 3 was still grappling with her relationships with her former partners and parents. Given this knowledge, I can use the frequency of the term “insurance” in each interview to show that each woman conceptualized her experience with abortion differently; without that knowledge, there is little direct meaning in the data.

Trends: The Trends function allows you to compare the relative frequency of words (useful for comparing texts of varied lengths) across a single text or between multiple texts, in various graph forms. It can help show variation in focus between different texts or how language has changed over time.
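Relative frequency is simple arithmetic: occurrences of a word divided by the total number of tokens, often scaled per 1,000 words, so that a short text and a long text can be compared fairly. A minimal sketch, using naive whitespace tokenization and invented example texts rather than Voyant's actual tokenizer:

```python
def relative_frequency(text, word):
    """Occurrences of `word` per 1,000 tokens; comparable across texts of different lengths."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word.lower()) / len(tokens)

# Two hypothetical texts of very different lengths, each with 10 uses of "liberty":
short_text = "liberty and justice " * 10           # 30 tokens
long_text = "liberty " * 10 + "the state " * 495   # 1,000 tokens

print(relative_frequency(short_text, "liberty"))   # ~333.3 per 1,000 tokens
print(relative_frequency(long_text, "liberty"))    # 10.0 per 1,000 tokens
```

The raw count (10) is identical in both texts; only the relative figure reveals that the short text is far more focused on the word, which is exactly why Trends plots relative rather than absolute frequency by default.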


I found Voyant Tools to be unwieldy. It gave me a lot of data that I did not need or that had little relevance to any analysis, and it was hard to sort through the data to find meaningful relationships between texts. Perhaps this is because I was comparing only two to four relatively short texts at a time, rather than the large inputs of data the platform seems intended for. Or perhaps it is because I did not know what I was looking for and was hoping it would reveal interesting relationships on its own. However, that does not mean it is not useful for historical research.

Voyant Tools is exactly what it sounds like—a tool for historical research. It is useful to help bolster an argument or provide evidence for an argument, but it does not make an argument for you. A researcher has to know what they are looking for and have a deep understanding of their subject to use Voyant Tools in a truly productive manner. But, if that is done, it can help provide visual or numerical evidence for a historical argument.

Voyant Tools created a great “Getting Started” guide that I recommend exploring if you want to use Voyant Tools in your research. It has helpful details on the plethora of tools available on the platform, as well as a guide on how to save your corpus for future reference.

What do you think of Voyant Tools? Do you think you could use Voyant Tools in your research? Did you have any success using Voyant Tools yourself?

History and the Internet

Let me come clean. Despite being what many would call a “digital native,” I found myself surprisingly hesitant about what producing history in as unregulated a space as the internet might do to the field. I automatically presumed dangers in leaving historical interpretation to those with little or no professional training or expertise in a given field—my point of reference, of course, is domestic politics. Databases with transcription tools, genealogy sites that connect people to their ancestors, and websites and exhibits built and maintained by professional historians (in concert with design and tech professionals), on the other hand, seemed like safe and responsible ways to let people engage with history online. I often championed what Roy Rosenzweig refers to as “possessive individualism” because it was the most familiar principle to me in the historical profession. It is as though this week's readings gave me permission to find excitement in what may well be the limitless possibility of crowdsourcing history on the internet.

People from all walks of life, with a host of different skill sets and interests, have a tremendous capacity to make historical information available to the masses. Causer and Wallace's piece on Transcribe Bentham was a fascinating example of the enormous potential for volunteer contributors and staff to work together to make a voluminous sum of archival material available to any interested party. (As an aside, in 2017, TIME magazine named Steven Pruitt, a.k.a. Ser Amantio de Nicolao, a volunteer Wikipedia editor and contributor, one of the 25 most influential people on the internet.) As they themselves indicated, crowdsourcing for projects like Transcribe Bentham is almost entirely volunteer-based out of necessity. The humanities have never (at least to my knowledge) been funded so generously that faculty and students need not compete for relatively scarce resources. Crowdsourcing is, in these examples, collaborative, not exploitative.

I have no doubt that volunteers would benefit from fair pay for their contributions to community-based projects, transcription or not. Brabham's argument is a convincing one. The overwhelming majority of the crowd's members are equal in qualification and skill to those who have the good fortune of being tenured or tenure-track. His argument is principled and one that I not only do not dispute but wholly support. Graduate students such as myself often lunge at opportunities to assist researchers or contribute to projects, in part for the much-needed and much-appreciated additional income. And for students who may be preparing for lives in academia, recognition of such work acts like capital in a marketplace; it raises our profile and further integrates us into a system that excludes those without major publications or service to the academy. That said, there is something to be said for volunteerism. The humanities have always relied on committed individuals who understand the value of imparting knowledge to present and future generations. Thankless (unpaid) work is no less important to the humanities, and especially to the digital humanities, than the seminal article or paradigm-shifting monograph. “Enthusiasm,” write Causer and Wallace, is the steady force behind the “formidable” contributions of volunteer crowds.

That said, the internet is as much a Wild West as it is a sort of terra nullius. (Unlike the land west of the Mississippi River, the space of the internet actually is unclaimed and open for cultivation.) Vigilance is needed to direct and coordinate advancements in the field of digital history. According to Edson, the notion that museums are meant to be physical structures visited by commuters and travelers is as false as the scientific argument that the universe is only what is visible to the naked eye. Museums and their staff have the whole of the internet to play with and put to work in service of their institutional missions. Using the example of the Vlogbrothers, Edson seems to suggest, rightly, that “memory institutions” like the Louvre need to be open to experimentation. Vulnerability and a welcoming attitude toward the crowd, coupled with institutional safeguards against squirreling away the institution's resources, are necessary if museums expect to compete with the likes of YouTube's biggest celebrities.

This week’s readings present a favorable and much-deserved narrative of crowds. The exception is Ford’s synthesis of WWIC—“Why wasn’t I consulted?” He gives the example of Wikipedia: “It tapped into the basic human need to be consulted and never looked back.” Yet there is a difference between people who think they understand the subject with which they are engaged and those who actually understand the profound implications of their contributions for how people use the past. The internet can be collaborative so long as its collaborators are limited to groups of people more interested in getting work done than in argument or contrarianism. For that matter, historians and digital humanities professionals should continue to develop new methods of historical inquiry, of fostering intellectual curiosity, and of inspiring bravery among the majority of Transcribe Bentham's site visitors who did not begin or complete a single transcription. It may be that collaboration is only possible if the digital humanities can convince larger and larger crowds to commit some of their leisure time in service of the humanities. It may also be that professional historians have yet to fully embrace the same sense of adventure and enthusiasm that Wikipedia editors and Transcribe Bentham contributors bring to the project of carrying the humanities into the digital world. But my guess is that it is only a matter of time until this happens.

Web 2.0 and the Human Element

“We’ll need to rethink a few things.”

That’s the closing thesis of The Machine is Us/ing Us, a YouTube video about Web 2.0 produced by Michael Wesch, an Assistant Professor of Cultural Anthropology at Kansas State University. Among the things that will need rethinking, according to Wesch, are copyright, authorship, identity, aesthetics, governance, privacy, commerce, love, family, and ourselves. Quite the list.

We can add one more thing to that list: how historic memory institutions (museums, historic sites, archives, etc.) create, maintain, or re-capture relevance in the age of smartphone apps, YouTube videos, and Wikipedia pages, in Alison Miner’s words, “created [for free] by someone who’s just f*****g off at their corporate office job.” Miner’s piece, titled “if everything on the internet has to be free, why isn’t my healthcare, too?”, lays out one concern: “institutions will pay for a 2 year digitization project, and fancy equipment for that, but don’t want to employ another archivist so that there is actual CONTEXT to the things they digitize.”

When I was going through training to work at historic sites, one thing my trainers stressed was that the modern visitor has so many options to replace spending money to travel and visit a historic memory institution. Why waste a weekend and possibly hundreds of dollars traveling to watch someone demonstrate how a musket is loaded and fired when there are thousands of videos posted by re-enactors on YouTube? Why go to a museum when fifteen minutes on Google might answer all your questions? Michael Peter Edson says in Dark Matter that “It’s likely that the public doesn’t think of what memory institutions often do as being sufficiently accessible, smart, joyous, attentive, generous, welcoming, imaginative, bold, educational or meaningful to merit much of their attention.”

The web can be a huge source of publicity, visitors, and revenue. But these benefits only accrue to historic memory institutions that can engage and harness that potential in productive ways. Using the web to just increase the number of people who can view content fails to grasp the thing the web, especially Web 2.0, has best enabled: the ability to feel like you are being heard. In Why Wasn’t I Consulted, Paul Ford says “The web is not, despite the desires of so many, a publishing medium. The web is a customer service medium… Humans have a fundamental need to be consulted, engaged, to exercise their knowledge (and thus power), and no other medium that came before has been able to tap into that as effectively.”

Historic memory institutions can piggyback on that tapped need by making their online presence interactive and engaging, instead of just viewing it as an extension of their brick-and-mortar offerings. An excellent example of this kind of engagement is Fort Ticonderoga’s social media presence: Fort Ticonderoga presents ways for people to exercise their knowledge by posting weekly challenges asking people to look at a partial image of an artifact from the Fort’s collection and attempt to guess what the artifact is.

An important caution to remember is that this sort of engagement cannot rely solely on capturing the brief spotlight of social media fame. The Museum of English Rural Life might make headlines for being funny on Twitter, but memes cannot replace the work of historians. To quote Alison Miner: “we can’t rely on the hot flash of meme-popularity to justify our existence, because our jobs require a long period of time to be done well. and fundamentally, the archives and other collections deserve better than a momentary blitz of attention.” Historic memory institutions must endeavor to use the web in ways that spark long-term interest, not just momentary acknowledgement.

It is also important to remember that there are some things that online engagement cannot do. Barring massive improvements in virtual reality, viewing the most engaging online content cannot replicate the physical sensations that come with in-person visits to historic sites. Watching a video of a musket firing cannot fully replicate the ways that watching that same demonstration in person affects the senses. Web content is also incapable of having a conversation, which is critical to effective historic education. One historic site where I worked had floated the idea of adding QR codes to various exhibit spaces. A visitor who scanned the code with their smartphone would be able to watch a video of an interpreter in historic clothing present information about the exhibit. One reason this project was scrapped, apparently, was that members of the interpretation team pointed out that replacing on-site interpreters with videos would remove the ability of visitors to ask questions and have extended conversations. That human element cannot become devalued by the flashiness and novelty of digital content, no matter how essential an online presence is.

Wikipedia: How does it work and why don’t we teach it?

Those of us who have been educated in the age of Wikipedia knew that our grade was at risk if we cited the website. This refrain resulted in my assuming that Wikipedia was inherently flawed, inevitably incorrect, and otherwise detrimental to my education. Still, if I needed a quick answer to a question, Google would lead me first to the Wikipedia page, and I was naively confident in its accuracy.

This practicum is intended to teach us how Wikipedia works, reconciling our teacher’s warnings with our own understandings of Wikipedia to better understand its strengths and weaknesses as a digital encyclopedia, available at our fingertips.

Wikipedia is a free, web-based encyclopedia, meaning that we don’t have to pay for the information we access, nor are there many restrictions on reusing its content in our own publications. Wikipedia is also built on open-source principles: its software and content are made available to the online community, which can contribute to and change them. The result is several million articles written by multiple volunteer authors, available to anyone with access to the internet.

Wikipedia holds a nebulous position in the digital history field. Roy Rosenzweig describes the role of Wikipedia in the field in “Can History Be Open Source? Wikipedia and the Future of the Past”.

In a few short years, it has become perhaps the largest work of online historical writing, the most widely read work of digital history, and the most important free historical resource on the World Wide Web. (119)

So how does it work? When we do a Google search, a Wikipedia page often lands somewhere near the top of the list. Search the word “museum” and select the Wikipedia search result. The following page opens.

The two links we’ll focus on are “Talk” and “View History”. Both of these pages are Wikipedia administrative pages where editing and quality control happen. On the “Talk” page, you’ll find reviews from the Editorial Team indicating that it is low in quality (C-Class). Other interesting features include a to-do list and conversations among the article’s editors.

Clicking “View History” leads us to a page detailing the history of revisions made to the article about museums. This page identifies who made each edit and when. We notice that this museum article is not often edited—twice in 2018 and only a handful of times in 2017—perhaps contributing to its low quality rating. To better understand the array of quality ratings and to find higher-rated pages, I next clicked on “Louvre,” where we immediately see that this page features a green plus to the far right, indicating that it is a “good article.”
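For anyone who wants this revision data in bulk rather than page by page, Wikipedia also exposes it through the public MediaWiki API. The sketch below only builds the query URL (no network request is made), using the standard `action=query`/`prop=revisions` endpoint; the page title and revision limit are just example values.

```python
from urllib.parse import urlencode

def revision_history_url(title, limit=10):
    """Build a MediaWiki API URL listing who edited `title` and when."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",  # the same who/when detail shown on View History
        "rvlimit": limit,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(revision_history_url("Museum"))
```

Fetching that URL returns the recent revisions as JSON, which makes it possible to compare edit activity across many articles at once instead of eyeballing each View History page.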

You can follow the same steps to read the “Talk” and “View History” pages for this article and see how the conversation differs on a higher-quality page. In pursuit of another high-quality article to compare, I clicked through several pages until I found a “Featured Article” specifically about history, identified as one of the best articles Wikipedia has to offer. “Middle Ages” was the winner, as both a featured and a “semi-protected” article, meaning it is at risk of vandalism or otherwise having its quality diminished by careless edits. Contributions to this page are more closely monitored and must be reviewed prior to publication.

The Middle Ages article is highly curated and carefully edited. The organization of the page makes logical sense, the history is more detailed, and while the writing is fact-based without being focused on creating a narrative, it is more pleasant and informative to read than the Museum article. The discussion on the “Talk” page is brief but more detail-oriented, and the volunteers appear to be frequent contributors to Wikipedia articles. There are also a significantly higher number of edits visible on the “View History” page, indicating that it is more carefully maintained than the Museum article.

Having learned a bit about Wikipedia, I am realizing the benefits of learning how to read the quality ratings and history of Wikipedia pages. Roy Rosenzweig would agree with my high school English teacher that I can’t cite Wikipedia in a paper, but I would argue that instead of terrifying students into not using the website at all, we should be teaching students how articles are created, how to read Wikipedia, and how to understand the differences between articles and their quality levels. With this information, I feel better informed and more able to use Wikipedia effectively.

While I am very new to digital history, I’m wondering: did anyone already know that these pages exist and that it’s possible to read Wikipedia this way? Did you ever seriously consider how Wikipedia articles are written and edited? If so, how did you learn about them and do you think we should be teaching students these skills?

Where Historians Fit in the Age of Convenient Open Sources

It’s a warm summer’s day, June 19th to be exact. You’re scrolling down your Instagram feed and start to see a flood of posts from your friends with the hashtag #HappyJuneteenth. Based on the tone of the posts, you get the sense that the hashtag relates to some historic milestone. Curiosity draws you to Google: “What is Juneteenth?” Wikipedia appears at the top of the results, before PBS, Vox, and other accredited news outlets. You decide to start your research on the topic with Wikipedia. This choice takes you down a rabbit hole of other fascinating historical articles about the Emancipation Proclamation, American Negro spirituals, and Galveston Island. You spend a good amount of time researching all the events, policies, places, and people that produced what is now known as Juneteenth. Like many inquisitive minds searching the web for information, you took all the content found on Wikipedia as evidence and did not consider fact-checking your source. Although the information came from an open source, you trusted that every article you read was accurate.

According to Roy Rosenzweig, this tendency to trust open sources uncritically is one concern historians raise about open sourcing. While their objections may have some merit, he believes academic research and open sources can find a way to coexist, rather than remaining in constant competition for users’ attention.
Rosenzweig, an American historian and digital history pioneer, examines the complexity of user-generated content as a substitute for original academic research. In “Can History Be Open Source? Wikipedia and the Future of the Past”, Rosenzweig dissects the successes and shortcomings of Wikipedia and how the platform’s presence affects historians’ work. Wikipedia is a free online encyclopedia with content generated by public participation. It has become a well-recognized reference for anything and everything that draws your interest. The benefits of using Wikipedia are that it is convenient, free, and, thanks to the General Public License (GPL), users have the ability to share its content however they choose. There are even rules in place to keep the style, behavior, and content respectable. However, this accessibility and these guidelines still do not quite prevent glitches and misinformation from circulating throughout the platform.
Wikipedia is not regulated by an academic institution or an accredited group of scholars. Since there is no single author or editor, there is a lot of room for misinformation and plagiarism. Rosenzweig suggests the information could also be biased, considering the lack of diversity among Wikipedia authors. Furthermore, because Wikipedia’s content is created and edited by the public, continuous edits are necessary; the history on Wikipedia is always subject to change. However, since participation is driven by popularity, not every article gets the same attention for revisions. Regardless of the flaws in this model, students and knowledge seekers still prefer to use Wikipedia as a reference.
The trend of students using this open-source platform instead of history books in a library is not going away, and it is not necessarily a problem. As Rosenzweig articulates in the reading, the problem is not the open source itself; it is the approach to using it. There are a few suggestions for historians and teachers looking to get the best outcomes from engaging with open sources.
Most students stop their research on a topic at Wikipedia’s website. They cite it without verifying against other sources whether the information they found is accurate. In this era of new media, people prefer “predigested and prepared information” without the additional sources needed to validate it. This is not solely a problem with Wikipedia. Teachers and historians should stress the importance of critical analysis of all primary and secondary sources rather than isolating Wikipedia as the problem.
Wikipedia is free and accessible, while scholarly journals are, well, not. If historians take issue with the misinformation sprinkled in with the facts on Wikipedia, if they believe open sources bear the fruit of poor, low-quality referencing, then why not make their own historical work more accessible? Rosenzweig suggests two options for historians looking to compete with the misinformation in open sources: contribute to and revise articles on Wikipedia, or make professional scholarly journals more accessible. The brand of Wikipedia is that anyone can write and edit information in good taste; if historians find the information on Wikipedia misleading, they can change it. Second, subscription-based historical journals are not accessible to everyone. Only people in the know, with the means to afford them, can benefit from the high-quality information available through these subscriptions. If scholars and academic institutions want to compete with open sources like Wikipedia, then they need to become just as accessible and feasible for users. Otherwise, people will keep gravitating to what is most convenient for them: the free, collaborative encyclopedia.
Historians can use open sources like Wikipedia to their advantage. There does not have to be competition between academic, single-authored research and content generated by public participation.