I propose to topic model DocSouth’s archive of North American Slave Narratives, a digitized archive that’s already optimized for data analysis. In topic modeling, a computer sifts through large bodies of text and identifies “topics,” or groups of words that often occur near one another. The computer doesn’t understand the words or their meaning – it simply tracks which words tend to co-occur and groups them on that basis.
There are some great examples of digital history projects online that use topic modeling…
- Cameron Blevins’s article on Topic Modeling Martha Ballard’s Diary
- Matthew McClellan’s interesting work on Olaudah Equiano’s narrative
- Sarita Alami, Moya Bailey, Katie Rawson, and Sara Palmer’s project analyzing sermons given on the occasion of Lincoln’s assassination
- Mining the Dispatch by Robert K. Nelson, which analyzes Civil War Richmond through a newspaper archive
… but very few scholars seem to have engaged with this particular archive. I found only three examples. Laura Tilton’s analysis references DocSouth’s archive among other sources and focuses on racialized dialect in narratives recorded by the Federal Writers’ Project. Jed Dobson’s GitHub repository offers his code, but no output data or historical interpretation; the code was written as part of his book project, which is definitely going on my reading list. The other example I found, a Gephi visualization by Jim Casey, doesn’t interpret the data either.
It seems odd that so few people have worked with this data considering how rich and easily accessible it is. If anyone has come across a topic modeling project that uses DocSouth’s Slave Narratives, I’d love to hear about it.
To get started, I did a quick tutorial on MALLET from the Programming Historian. The tutorial was easy enough to follow, but I’ll definitely be keeping it and MALLET’s documentation handy for a while.
I ran my data through MALLET a couple different times, trying out different parameters and seeing how they affected the output. The command I settled on for this proposal was:
bin\mallet train-topics --input slavenarr.mallet --num-topics 30 --optimize-interval 30 --output-state topic-state.gz --output-topic-keys enslaved_keys.txt --output-doc-topics enslaved_composition.txt
So, I fed MALLET the slave narratives (bundled into the MALLET-friendly file slavenarr.mallet) and told it to give me 30 weighted topics (written to enslaved_keys.txt) along with some metadata.
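For what it’s worth, the keys file is plain text and easy to pull into a script. Here’s a minimal Python sketch, assuming the tab-separated layout that `--output-topic-keys` typically produces (topic id, then weight, then the space-separated key words); the sample lines below are made up for illustration, not real output:

```python
# Sketch: loading MALLET's topic-keys output into Python.
# Assumed layout per line: topic_id <TAB> weight <TAB> key words
# The sample lines below are illustrative, not real output.

def load_topic_keys(lines):
    """Parse topic-keys lines into (topic_id, weight, [words]) tuples."""
    topics = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics

sample = [
    "3\t1.338\ttold time man good day thought home",
    "14\t0.996\theart poor long children eyes life mother",
]
topics = load_topic_keys(sample)
# Sort heaviest topic first, like the chart below
topics.sort(key=lambda t: t[1], reverse=True)
print(topics[0][0])  # prints 3: the heaviest topic in this toy sample
```

From there it’s easy to sort, filter, or feed the topics into a visualization tool.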
The table below shows the 5 topics that MALLET found to be most prevalent, in descending order of weight. I added the final column as a (very tentative!) attempt to assign names to these topics.
| Topic | Weight | Top words | Tentative name |
| --- | --- | --- | --- |
| 3 | 1.338 | told time man good day thought home made years asked back house place make found give money put night people | Memory? |
| 15 | 1.276 | time made found part day received called state present means person large number hands case purpose long return great make | Doing? Society? |
| 14 | 0.996 | heart poor long children eyes life mother felt friends master child mind thought friend hope face tears dear light kind | Emotion |
| 1 | 0.827 | master night time work house man place men day road large slaves miles people plantation water woods great cotton corn | Escape |
| 9 | 0.802 | life years great good time place work man character young high history knowledge strong influence early public true service church | Aspirations |
Topic 14 hit me hard. While the two topics weighted most heavily are pretty vague, topic 14 is obviously about intense, embodied emotion. The memoirs of former slaves are of course deeply emotional – that shouldn’t surprise anyone. But the fact that the computer picked up on that emotion so quickly and clearly – and weighted it so heavily – gives me hope that this could be a worthwhile project.
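One way to chase that feeling about topic 14 is to ask which narratives carry it most heavily, which is what the composition file records. Here’s a hedged Python sketch, assuming the newer doc-topics layout (doc index, doc name, then one proportion column per topic); older MALLET versions emit sorted topic/proportion pairs instead, so check your file first. The rows here are invented:

```python
# Sketch: finding which documents lean hardest on one topic, using
# MALLET's doc-topics output. Assumed layout per row:
# doc_index <TAB> doc_name <TAB> proportion for topic 0, topic 1, ...
# (Older MALLET versions write sorted topic/proportion pairs instead.)

def docs_by_topic(lines, topic_id, top_n=3):
    """Return the top_n (doc_name, proportion) pairs for topic_id."""
    scored = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip MALLET's header line and blanks
        fields = line.strip().split("\t")
        name = fields[1]
        proportions = [float(p) for p in fields[2:]]
        scored.append((name, proportions[topic_id]))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

sample = [
    "#doc name topic proportions",
    "0\tnarrative_a.txt\t0.10\t0.60\t0.30",
    "1\tnarrative_b.txt\t0.50\t0.20\t0.30",
]
print(docs_by_topic(sample, topic_id=1, top_n=1))
# narrative_a.txt carries topic 1 most heavily in this toy sample
```

Pointing this at enslaved_composition.txt would surface the memoirs where that embodied emotion is most concentrated, which seems like a promising place to start close reading.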
My goal here isn’t necessarily to make a novel historical argument, although that would be a nice bonus. Instead, I just want to get familiar with this software and learn how best to tailor it to a specific data set. How does MALLET work, and what pitfalls do I need to be aware of? How many topics do I need to achieve a good model, and how do I find that number? What are the benefits of modeling paragraph-by-paragraph rather than memoir-by-memoir? How might I factor temporal data, like publication date, into the model? What other parameters do I need to include in my commands? How do I best interpret the data MALLET gives back to me? How do I model and visualize that data?
I have no idea what the answers are to any of these questions, but I’d like to find out.
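On the how-many-topics question, one rough heuristic (among many, and no substitute for actually reading the topics) is to compare models by how much their topics’ top-word lists overlap: if many topic pairs share key words, the model may be splitting one theme across too many topics. A sketch, assuming each candidate model’s topics have already been parsed into lists of top words; the toy models here are invented:

```python
# Sketch: a crude "topic distinctness" score for comparing models with
# different numbers of topics. Lower average Jaccard overlap between
# topics' top-word lists suggests more distinct topics. One heuristic
# among many, not a substitute for reading the topics yourself.
from itertools import combinations

def avg_pairwise_overlap(topics):
    """Mean Jaccard overlap between every pair of top-word lists."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        set_a, set_b = set(a), set(b)
        total += len(set_a & set_b) / len(set_a | set_b)
    return total / len(pairs)

# Invented toy models: distinct topics vs. heavily overlapping ones
distinct = [["heart", "tears", "mother"], ["road", "woods", "miles"]]
muddled = [["time", "day", "man"], ["time", "day", "house"]]
assert avg_pairwise_overlap(distinct) < avg_pairwise_overlap(muddled)
```

Running MALLET at several topic counts and comparing scores like this might at least narrow the search before the close reading begins.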