Final Project: Topic Modeling Enslavement Narratives

I really enjoyed dabbling with MALLET this semester. To be completely honest, I may have bitten off more than I could chew – especially once Python slithered into the mix. While I learned much, I mostly realized that I still have so much to learn. Even with the most basic technical grasp, though, MALLET can be useful as a brainstorming tool in the early stages of a research project.

If I could restart this project from the beginning, I would narrow my scope considerably and focus on familiarizing myself with existing scholarship. There is so much cool research being done with MALLET and with enslavement narratives, and I have only begun to breach the surface. In terms of my technical understanding, I’m sure that reading more MALLET studies would have answered many of my questions. And regarding the history of slavery, I can only ever keep learning and growing. I ultimately would have benefitted from spending more time with that scholarship, and less time trying to revive my programming skills.

That said, I loved being able to write code for a history class. It was simple code, and it took me far, far longer than it should have, but I enjoyed every minute of it. While my history project may have been better off with more historiography and less technical practice, having up-to-date programming skills can only ever be a good thing in the long run.

My poster is presented below, and relevant files (including my final paper) have been uploaded to my GitHub. I look forward to continuing my work on this research in the future.

Presentation Poster for “Topic Modeling Enslavement Narratives”

– Jessica Shainker

Mapping the Antebellum Cotton Industry

For my digital project proposal, I’m thinking of creating an ArcGIS StoryMap of the international cotton industry in the mid-1800s.

Map of the Civil War Cotton Trade. Library of Congress.

Multiple historians have written about “King Cotton,” slavery, American capitalism, and the global economy:

  • Empire of Cotton: A Global History by Sven Beckert
  • Accounting for Slavery: Masters and Management by Caitlin Rosenthal
  • The Half Has Never Been Told: Slavery and the Making of American Capitalism by Edward Baptist
  • Many, many more.

Based on this existing historical research, I want to create a digital project that allows viewers to follow cotton around the globe, from the plantations of the American South to the factories of England to the markets of America, Europe, and beyond.

Questions to Consider:

Where was cotton grown? What was slavery like on cotton plantations, and how did it change over time? How was cotton transported? Where was it processed? Where was it sold? Who profited? How did the cotton industry change over time, especially as the Civil War approached? To what extent were people in the American South, the American North, Britain, and elsewhere implicated in the persistence of American slavery?

The StoryMap format will allow visitors to interact with maps and historical information in a far more engaging way than a standard paper or blog post. Using a StoryMap will also allow me to include biographies of enslaved and free people who were involved in the cotton industry, adding an essential, personal dimension to a historical subject – American slavery – that’s inherently dehumanizing.

The goal here is to show general readers how slavery birthed modern economies. Even today, many economists continue to frame slavery as an antiquated system that was incompatible with our modern, industrialized economy. They try to draw a hard line between slavery and capitalism, a line that doesn’t hold up once you study the historical reality of the 19th century economy. Nor does it hold up when you examine the modern economy, where multinational corporations continue to exploit underpaid and enslaved workers. Slavery and capitalism is a popular topic among historians right now for a reason – our global economy is still deeply exploitative, and slavery is still practiced around the world. I hope that this project will prompt readers to do their own research into this topic.

— Jessica Shainker

Topic Modelling DocSouth’s North American Slave Narratives

I propose to topic model DocSouth’s archive of North American Slave Narratives, a digitized archive that’s already optimized for data analysis. In topic modelling, a computer sifts through large bodies of text and identifies “topics,” or groups of words that often occur near one another. The computer doesn’t understand the words or their meaning – it just notes their frequency and groups them based on that frequency.

There are some great examples of digital history projects online that use topic modeling…

  • Cameron Blevins’s article on Modelling Martha Ballard’s Diary
  • Matthew McClellan’s interesting work on Olaudah Equiano’s diary
  • Sarita Alami, Moya Bailey, Katie Rawson, and Sara Palmer’s project analyzing sermons given on the occasion of Lincoln’s assassination
  • Mining the Dispatch by Robert K. Nelson, which analyzes Civil War Richmond through a newspaper archive

… but very few scholars seem to have engaged with this particular archive. I only found three examples. Laura Tilton’s analysis references DocSouth’s archive among other sources and focuses on racialized dialect in narratives recorded by the Federal Writers’ Project. Jed Dobson’s github offers his code, but no output data or historical interpretation. It was done as part of his book project, which is definitely going on my reading list. The other example I found, a Gephi visualization by Jim Casey, doesn’t interpret the data either.

It’s super cool though.

It seems odd that so few people have worked with this data considering how rich and easily accessible it is. If anyone has come across a topic modelling project that uses DocSouth’s Slave Narratives, I’d love to hear about it.

To get started, I did a quick tutorial on MALLET from the Programming Historian. The tutorial was easy enough to follow, but I’ll definitely be keeping it and MALLET’s documentation handy for a while.

The Programming Historian is an amazing resource. Highly recommend.

I ran my data through MALLET a couple different times, trying out different parameters and seeing how they affected the output. The command I settled on for this proposal was:

bin\mallet train-topics –input slavenarr.mallet –num-topics 30 –optimize-interval 30 –output-state topic-state.gz –output-topic-keys enslaved_keys.txt –output-doc-topics enslaved_composition.txt

So, I fed MALLET the slave narratives (bundled into the MALLET-friendly file slavenarr.mallet) and told it to give me 30 weighted topics (written in enslaved_keys.txt) and some metadata.

A screenshot of MALLET's output
A screenshot of my command terminal after the program finished running. As you can see at the bottom of the image, this command took MALLET nearly 6 minutes to run. If I want to run it paragraph-by-paragraph – which I think I do – it will take significantly longer. Hope my computer can handle it.

The chart below shows the 5 topics that MALLET found to be most prevalent in descending order. I added the final column as a (very tentative!) attempt to assign names to these topics.

NumberWeightKeysName
31.338told time man good day thought home made years asked back house place make found give money put night peopleMemory?
151.276time made found part day received called state present means person large number hands case purpose long return great makeDoing? Society?
140.996heart poor long children eyes life mother felt friends master child mind thought friend hope face tears dear light kindEmotion
10.827master night time work house man place men day road large slaves miles people plantation water woods great cotton cornEscape
90.802life years great good time place work man character young high history knowledge strong influence early public true service churchAspirations
In order, each column represent the number MALLET assigned to the topic (arbitrary, as far as I know), the topic’s weight (how often it occurred), the words in the topic, and my sketchy attempt to name each topic.

Topic 14 hit me hard. While the two topics weighted most heavily are pretty vague, topic 14 is obviously about intense, embodied emotion. The memoirs of former slaves are of course deeply emotional – that shouldn’t surprise anyone. But the fact that the computer picked up on that emotion so quickly and clearly – and weighted it so heavily – gives me hope that this could be a worthwhile project.

My goal here isn’t necessarily to make a novel historical argument, although that would be a nice bonus. Instead, I just want to get familiar with this software and learn how to best tailor it to a specific data set. How does MALLET work, and what pitfalls do I need to be aware of? How many topics do I need to achieve a good model, and how do I find that number? What are the benefits of modelling paragraph-by-paragraph rather than memoir-by-memoir? How might I factor in temporal data – like date published – to these texts? What other parameters do I need to include in my commands? How do I best interpret the data MALLET gives back to me? How do I model and visualize that data?

I have no idea what the answers are to any of these questions, but I’d like to find out.

Digital Searching by Michaela Fehn

(Michaela’s had some trouble accessing the blog. Here is her post for this week!)

So we’re talking data and mining this week. We’ve got a great lineup up scholarly pieces, so we’re jumping in.

The History of Walking and the Digital Turn: Stride and Lounge in London, 1808–1851” by Joanna Guldi

Searching is Selection. Scholars have to be creative when searching. Guldi argues that true scholarly work is in the nuance. In fact, Google Books, as well as other search engines, offer a curated list. The question scholars must ask is How can I create nuanced search words that curate a list of diamonds in the rough?

As historians, it becomes vital to find common links – as databases become more interconnected and search engines evolve, results will become varied. For now, a good historian will follow a trail and find hidden gems. Like Guldi, we must ask How do certain words or jargon lead to different results? How can understanding the language within sources or research topics make a difference?

“Space, Nation, and the Triumph of Region: A View of the World from Houston” by Cameron Blevins

Space and Place are difficult notions – even for a seasoned historian. SO what are these notions made of? Well, according to Blevins, space is ever-changing. It’s dynamic and usually associated with processes of power. Place is built on locations. There’s often an emotional response to place. Space and Place can have tension. But what happens when the two coexist within literature? Well, that’s what Blevins is exploring.

So, Blevins discusses in depth this terminology called “distant reading.” For Blevins, this looked like noting the frequency of place names mentioned in a newspaper over a prolonged time period. By noting this, Blevins is able to blend a digital reading with a traditional reading, better understanding the political implications of space. This leads the reader to ask How can distant reading enhance other projects? Are there other types of projects where distant reading can be helpful? 

“Topic Modeling Martha Ballard’s Diary” by Cameron Blevins

Blevins again with some fantastic lessons to learn. 27 years worth of diaries leaves the reader to wonder: How is Blevins going to comb through that vast number of articles? Most important to note is Blevins’ use of topic modeling. And again, a question emerges – What is topic modeling? Topic modeling according to Blevins is “a method of computational linguistics that attempts to find words that frequently appear together within a text and then group them into clusters.” The software, specifically Mallet, produces a list of topics with words within each topic based on how the word is used rather than what a word means. This gets the audience to ask Does the software relate topics to each other? Can the software work in other languages? And how does a software like Mallet affect sources other than diaries? Would it be helpful for archives of other types?

“Digital Visualization as a Scholarly Activity” by Martyn Jessop

Graphic aids aren’t new. But digital technology enhances them. Jessop raises these three questions: What role have visualizations played in humanities scholarship in the past? If the majority of images in print are to be regarded as ‘illustrations’ what is the distinction between ‘visualization’ and ‘illustration’? How has the emergence of digital media affected the development of visualization?

By looking at visualization within the digital humanities, Jessop begins thinking about these questions. He notes that there are different categories visualization affects: space, quantitative data, text, time, and 3D visualization. This asks the audience to think How can different types of visualization affect different projects? When are the best times to bring visualization into a scholarly subject? And which category is most compelling based on Jessop’s article?

We’re going to look at some of these questions in class as we tackle talking about digital data, searching, and mining. Until then, happy reading.

Week 3 Readings: Data Feminism

Data Feminism by Catherine D’Ignazio and Lauren F. Klein is a readable and thorough entry into how data science needs feminism and how feminists and scholars can use data science to further their goals. Each chapter focuses on one of D’Ignazio and Klein’s seven principles of data feminism.

1. Examine Power

Data science is deeply influenced by unequal power structures, or matrices of domination. Readers are encouraged to ask, “Who?” when thinking about data collection and analysis: Who is doing data science? Who benefits from data science? And whose interests and goals are being served by data science?

By asking “who” questions, we can spot gaps in data collection and analysis and begin to fill those gaps.

2. Challenge Power

D’Ignazio and Klein offer four methods of challenging unjust data science:

  1. Collect: Compile counterdata.
  2. Analyze: Audit algorithms.
  3. Imagine: Imagine a future of co-liberation.
  4. Teach: Engage and empower people to use data science as a tool.

As part of their “imagine” method, the authors also advocate for a shift from data ethics, which tends to frame problems as the result of a few “bad apples” and technological glitches, to data justice, which acknowledges that injustice is structural.

Table 2.1 From data ethics to data justice

Concepts That Secure Power
Because they locate the source of the problem in individuals or technical systems
Ethics
Bias
Fairness
Accountability
Transparency
Understanding Algorithms

Concepts that Challenge Power
Because they acknowledge structural power differentials and work toward dismantling them
Justice
Oppression
Equity
Co-liberation
Reflexivity
Understanding history, culture, and context
Table 2.1 presents principles of data ethics alongside alternative, parallel concepts of data justice (60).
Why is the shift from data ethics to data justice so radical?

3. Elevate Emotion and Embodiment

Data science is weighed down by the false binary of reason vs. emotion. As historians, though, we know that there is no such thing as a neutral perspective. Instead, the feminist approach to data science is to embrace emotion and affect as a valid type of data.

4. Rethink Binaries and Hierarchies

False binaries and unjust hierarchies lead to flawed classification systems that overlook or discriminate against certain groups. Problems with classification must be evaluated on a case-by-case basis. Ethical solutions might include adding categories to a classification system, making certain data categories optional, or avoiding gathering some types of data in the first place.

How data is presented is just as important as how it is categorized. Feminist approaches to data visualization, like Amanda Montañez’s infographic on gender and sex in the Scientific American, can challenge false binaries.

5. Embrace Pluralism

Traditional data science focuses on clarity and control, sometimes to the detriment of minoritized voices. Data cleaning is sometimes necessary to prepare data for computational analysis, but it can also enact epistemic violence, perpetuating unjust hierarchies by separating data from their context.

Feminist data scientists, on the other hand, embrace multiple perspectives. Focusing on team projects and community-driven work can give us better, more complete information than the work of a single individual.

What does embracing pluralism look like in digital/public history? What are the benefits? The challenges? Are there any situations in which we should reject pluralism?

6. Consider Context

Data is meaningless without context. In this chapter, D’Ignazio and Klein coin the term Big Dick Data to refer to “big data projects that are characterized by masculinist, totalizing fantasies of world domination as enacted through data capture and analysis” (151). Big Dick Data projects overstate their scope and importance and ignore essential context. These inaccuracies can in turn lead to massively erroneous reporting, like in this FiveThirtyEight article on kidnappings in Nigeria.

Data are never raw. They are inherently cooked by their sociopolitical and historical context, and that context is essential to accurate data collection, interpretation, and visualization. Institutions need to invest significant funding into documenting, restoring, and communicating context, especially in instances involving discrimination and inequity.

What might “big dick history” look like? Can you think of any examples?

7. Make Labor Visible

Much of the effort goes into data science is invisible labor, paid, underpaid, and unpaid. Data feminism requires that we make labor visible and always give credit where credit is due.

What are some ways labor can be hidden in academia and public history? How do we rectify this?

Conclusions

D’Ignazio and Klein’s data textbook is built on a foundation of Black feminism, an intersectional ideology that prioritizes humanity and process over profit. This is a great and easy intro into data science for humanities scholars and into feminist thought for data scientists. It’s a long read, but well worth the journey. In class, we’ll think about how we can apply these principles to digital history projects. Happy reading!