Topic Modelling DocSouth’s North American Slave Narratives

I propose to topic model DocSouth’s archive of North American Slave Narratives, a digitized archive that’s already optimized for data analysis. In topic modelling, a computer sifts through large bodies of text and identifies “topics,” or groups of words that often occur near one another. The computer doesn’t understand the words or their meaning – it just notes their frequency and groups them based on that frequency.
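To make that "frequency, not meaning" point concrete, here's a toy Python sketch of my own (not how MALLET actually works internally, just an illustration) that counts how often word pairs co-occur in the same sentence, which is the kind of raw signal a topic model builds its groupings from:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each pair of words appears in the same sentence.
    A topic model works from frequency signals like these, with no
    understanding of what any word means."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair is stored one way, alphabetically.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Invented example sentences:
sentences = [
    "the master worked the plantation",
    "the plantation grew cotton",
    "mother wept for her child",
]
counts = cooccurrence_counts(sentences)
print(counts[("cotton", "plantation")])
```

A real model like MALLET's LDA is far more sophisticated than pair counting, but the input is the same: which words turn up together, and how often.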

There are some great examples of digital history projects online that use topic modeling…

  • Cameron Blevins’s article on Modelling Martha Ballard’s Diary
  • Matthew McClellan’s interesting work on Olaudah Equiano’s diary
  • Sarita Alami, Moya Bailey, Katie Rawson, and Sara Palmer’s project analyzing sermons given on the occasion of Lincoln’s assassination
  • Mining the Dispatch by Robert K. Nelson, which analyzes Civil War Richmond through a newspaper archive

… but very few scholars seem to have engaged with this particular archive. I only found three examples. Lauren Tilton's analysis references DocSouth's archive among other sources and focuses on racialized dialect in narratives recorded by the Federal Writers' Project. Jed Dobson's GitHub repository offers his code, written as part of his book project (which is definitely going on my reading list), but no output data or historical interpretation. The third example, a Gephi visualization by Jim Casey, doesn't interpret the data either.

It’s super cool though.

It seems odd that so few people have worked with this data considering how rich and easily accessible it is. If anyone has come across a topic modelling project that uses DocSouth’s Slave Narratives, I’d love to hear about it.

To get started, I did a quick tutorial on MALLET from the Programming Historian. The tutorial was easy enough to follow, but I’ll definitely be keeping it and MALLET’s documentation handy for a while.

The Programming Historian is an amazing resource. Highly recommend.

I ran my data through MALLET a couple different times, trying out different parameters and seeing how they affected the output. The command I settled on for this proposal was:

bin\mallet train-topics --input slavenarr.mallet --num-topics 30 --optimize-interval 30 --output-state topic-state.gz --output-topic-keys enslaved_keys.txt --output-doc-topics enslaved_composition.txt

So, I fed MALLET the slave narratives (bundled into the MALLET-friendly file slavenarr.mallet) and told it to give me 30 weighted topics (written in enslaved_keys.txt) and some metadata.
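The topic-keys file MALLET writes is just tab-separated text: one topic per line, with the topic number, its weight, and then its top key words. That makes it easy to pull apart for charting. Here's a small Python sketch of mine, run against a made-up two-line sample in the same shape as enslaved_keys.txt:

```python
def parse_topic_keys(text):
    """Parse MALLET's --output-topic-keys format: one topic per line,
    tab-separated as <topic number> <weight> <top key words>."""
    topics = []
    for line in text.strip().splitlines():
        number, weight, keys = line.split("\t")
        topics.append((int(number), float(weight), keys.split()))
    return topics

# Invented sample rows (not my real output):
sample = "14\t0.996\theart poor long children eyes\n9\t0.802\tlife years great good time"
for number, weight, keys in parse_topic_keys(sample):
    print(number, weight, keys[:3])
```

From here it's a short step to sorting topics by weight or dumping them into a spreadsheet.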

A screenshot of MALLET's output
A screenshot of my command terminal after the program finished running. As you can see at the bottom of the image, this command took MALLET nearly 6 minutes to run. If I want to run it paragraph-by-paragraph – which I think I do – it will take significantly longer. Hope my computer can handle it.

The chart below shows the 5 topics that MALLET found to be most prevalent in descending order. I added the final column as a (very tentative!) attempt to assign names to these topics.

| Topic | Weight | Words | Name? |
| --- | --- | --- | --- |
| 3 | 1.338 | told time man good day thought home made years asked back house place make found give money put night people | Memory? |
| 15 | 1.276 | time made found part day received called state present means person large number hands case purpose long return great make | Doing? Society? |
| 14 | 0.996 | heart poor long children eyes life mother felt friends master child mind thought friend hope face tears dear light kind | Emotion |
| 1 | 0.827 | master night time work house man place men day road large slaves miles people plantation water woods great cotton corn | Escape |
| 9 | 0.802 | life years great good time place work man character young high history knowledge strong influence early public true service church | Aspirations |
In order, the columns show the number MALLET assigned to the topic (arbitrary, as far as I know), the topic's weight (how often it occurred), the words in the topic, and my sketchy attempt to name each topic.

Topic 14 hit me hard. While the two topics weighted most heavily are pretty vague, topic 14 is obviously about intense, embodied emotion. The memoirs of former slaves are of course deeply emotional – that shouldn’t surprise anyone. But the fact that the computer picked up on that emotion so quickly and clearly – and weighted it so heavily – gives me hope that this could be a worthwhile project.

My goal here isn’t necessarily to make a novel historical argument, although that would be a nice bonus. Instead, I just want to get familiar with this software and learn how to best tailor it to a specific data set. How does MALLET work, and what pitfalls do I need to be aware of? How many topics do I need to achieve a good model, and how do I find that number? What are the benefits of modelling paragraph-by-paragraph rather than memoir-by-memoir? How might I factor in temporal data – like date published – to these texts? What other parameters do I need to include in my commands? How do I best interpret the data MALLET gives back to me? How do I model and visualize that data?

I have no idea what the answers are to any of these questions, but I’d like to find out.
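One small starting point on the interpretation question: the composition file ultimately reduces to a set of topic proportions for each document, and finding the topic each narrative leans on most heavily is a few lines of Python. A hedged sketch (the exact layout of enslaved_composition.txt varies between MALLET versions, so the documents and numbers below are invented):

```python
def dominant_topic(proportions):
    """Given one document's topic proportions (index = topic number),
    return the topic with the largest share, and that share."""
    best = max(range(len(proportions)), key=proportions.__getitem__)
    return best, proportions[best]

# Invented per-document topic proportions, in the spirit of
# what --output-doc-topics produces:
docs = {
    "narrative_01.txt": [0.05, 0.60, 0.10, 0.25],
    "narrative_02.txt": [0.40, 0.10, 0.45, 0.05],
}
for name, props in docs.items():
    topic, share = dominant_topic(props)
    print(f"{name}: topic {topic} ({share:.0%})")
```

Sorting narratives by their dominant topic, or by their share of a single topic like the "Emotion" one, seems like a natural first visualization.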

One Reply to “Topic Modelling DocSouth’s North American Slave Narratives”

  1. Hi Jessica,

    This is a compelling idea for a project. Also, huge kudos for already having been able to get MALLET up and running with your corpus of text.

I really like your research questions on this! I think focusing on exploring what different kinds of issues are surfaced when you use the tool with different parameters is compelling. Given that the number of topics to generate is itself a somewhat arbitrary choice, comparing how the topics shift and change as you play with those kinds of parameters has the potential to be really useful for further developing how topic modeling can/should be used for analysis of sources. Similarly, your question about paragraph by paragraph vs memoir by memoir is really interesting.

I have no doubt that delving into this textual data will end up surfacing a lot of interesting questions/points about methods for this kind of work. Along with that, it’s great that you already have identified a body of scholarship and some focused work on working with this data. That is going to be really useful for identifying nuanced and targeted kinds of issues to focus your analysis on.

    If you do run with this project, I would encourage you to reach out to folks behind the project and some of the projects you identified where people have been engaging in analysis on this. I imagine that there is a good chance the folks behind those projects might have good input for you in what to focus on and they will likely also be interested in further sharing out some of the results of the work you are doing.

    Best, Trevor
