This week’s readings are on the computational analysis of texts and the interpretation of abstractions of those texts. They cover a broad range of subjects and formats, which prompts the reader to consider how each author presents their subject through a different medium, be it a book, article, or blog. The readings discuss both the capabilities that data science and digital history provide and the potential weaknesses and injustices they may carry.
D’Ignazio and Klein, Data Feminism
The first reading is the book Data Feminism by Catherine D’Ignazio and Lauren Klein. To quote the authors, “Data Feminism offers strategies for data scientists seeking to learn how feminism can help them work toward justice, and for feminists who want to focus their efforts on the growing field of data science. But Data Feminism is about much more than gender. It is about power, about who has it and who doesn’t, and about how those differentials of power can be challenged and changed.” It is a work based in intersectional feminism focused on both how data science needs feminism and how data science can be used by feminists. The book is divided into seven principles of Data Feminism.
Principle 1: Examine Power.
This first principle exhorts readers to examine who holds power in data science at its very foundation, and how the structures of power continue to affect the discipline in the present. The authors refer to the concept of the matrix of domination, which describes the ways that power is entrenched in data science and in the world at large. One example of this power in data science is the story of an MIT grad student who, after discovering that a facial recognition service could not detect her Black face even though it could detect her white colleagues, investigated the foundations of the service. She found that the data underlying the service consisted overwhelmingly of white subjects, which left the service poorly equipped to recognize nonwhite faces. This is what it means to examine power and its effects.
To restate the question of the authors: What other power structures in data science do we take for granted, and who established them?
Principle 2: Challenge Power.
This second principle follows logically once the first is accepted as a premise: if there are power structures that favor particular groups of people over others, they ought to be challenged by anyone who can. The authors demand that their readers shift their entire conception of power and responsibility, a shift best exemplified by their table contrasting data ethics with data justice. Data ethics is founded on the idea that the system is generally good and that problems and injustice only occur when individuals act unethically. The authors charge their readers to denounce this worldview as complicit in the injustice itself and to move beyond ethical concerns toward justice, which understands the system to be hopelessly corrupted by the structure of power that yields inequality.
Is this radical change necessary? Are there any downsides to such absolutism?
Principle 3: Elevate Emotion and Embodiment.
This principle, put succinctly, asks data scientists to reject the idea that emotion and embodiment do not belong in data science. According to the authors, emotion is an essential feature of a feminist approach to data science.
Principle 4: Rethink Binaries and Hierarchies.
In this principle the authors ask data scientists to examine how cultural preferences and conceptions extend even into the realm of data science. For a truly intersectional feminist approach to data science, the authors say one must be willing to attack these cultural issues at every level. To them, one must see every structure in data science through the lens of postmodern gender theory and strive to remake the field in accordance with that philosophy.
Principle 5: Embrace Pluralism.
This is a fairly simple and uncontroversial principle: the most holistic and complete knowledge comes from the synthesis of multiple perspectives and different methods of data compilation. The specific idea that the authors articulate is that data scientists who analyze a data set they are foreign to, perhaps because they have no personal experience with the people or place being studied, are most at risk of missing essential elements of the data necessary for good interpretation.
Principle 6: Consider Context.
Again, this is a very benign-sounding principle. The authors say that data is never totally neutral and always carries some context that will provide more insight. As intersectional feminists, they go further and clarify that the context almost always indicates some prior unequal social relationships, and that it is obligatory to discover these relationships. They use the example of a mistake 538 made about kidnappings in Nigeria, where the reporter misread a data source and published a story indicating that there were hundreds of kidnappings in Nigeria per day, which was just flat-out wrong.
Principle 7: Make Labor Visible.
This principle states that much of the labor done in data science is invisible, meaning it is uncredited, unpaid, or underpaid. To bring about a feminist revolution in data science it is compulsory that all labor is visible, credited, and properly compensated.
Is this problem of invisible labor limited to fields like data science?
Guldi, The History of Walking and the Digital Turn: Stride and Lounge in London, 1808–1851
The next reading is by Joanna Guldi, about the plight of digital historians in the era of databases. She first explains how the advent of digital historical databases has created new opportunities for research, but also great opportunities for overconfidence in the new technology. She shows how limited and biased most of the search engines are by demonstrating that their sources usually prioritize the voices of the privileged and are often constrained both by the limits of the underlying software, which is not perfect, and by the human curation of results. Her point is that nuance, patience, and a healthy skepticism of results are all necessary to safely navigate the brave new world of digital history. Although she begins by showing how flawed many contemporary datasets are, she also argues that if one calibrates a search according to the specific needs and idiosyncrasies of the data, useful information can be obtained. Thus, historians must be more vigilant, not less, when using the extraordinarily powerful digital tools at their disposal, and must always remember that databases and word searches are not infallible oracles.
What is the safest way to use historical databases? How does this relate to data feminism?
Blevins, Space, Nation, and the Triumph of Region: A View of the World from Houston
Continuing on the theme of extraordinarily powerful digital tools, Cameron Blevins discusses how the difficulties of studying large-scale concepts like space, nation, and place in relation to specific topics can be dealt with by an approach known as “distant reading.” For example, if one were studying utopianism, instead of scanning a particular 19th-century book for passages on the subject, one could use digital tools to examine hundreds of 19th-century texts on utopianism to get a “distant read” on the subject. This is something historians could never practically do before, and if it is combined with traditional reading, it can produce amazing new works of scholarship. Of course, the problems articulated by Guldi still apply. So,
What are the dangers of distant reading? Can it be done in a diligent and nuanced way?
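To make the idea of a “distant read” a little more concrete, here is a minimal sketch in Python. It is not the method used in the article. It only counts how often a handful of chosen words appear in each text of a collection, scaled per 1,000 words so that long and short texts can be compared. The function name, the word list, and the two tiny sample texts are all invented for illustration.

```python
import re

def term_rate_by_document(corpus, terms):
    """Return, for each document, how often the terms occur per 1,000 words."""
    wanted = {t.lower() for t in terms}
    rates = {}
    for title, text in corpus.items():
        # Crude tokenizer: lowercase the text and keep runs of letters.
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for w in words if w in wanted)
        rates[title] = 1000 * hits / len(words) if words else 0.0
    return rates

# Invented stand-ins for a real collection of digitized texts.
corpus = {
    "Text A": "The utopian colony dreamed of a perfect utopia",
    "Text B": "The harvest was poor and the winter long",
}
rates = term_rate_by_document(corpus, ["utopia", "utopian"])
for title, rate in rates.items():
    print(title, rate)
```

Even a toy like this shows why Guldi’s warnings matter: the result depends entirely on which words are chosen and on how cleanly the texts were digitized.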
Blevins, Topic Modeling Martha Ballard’s Diary
This final text looks at the problem of analyzing a large volume of text, specifically a woman’s diary. In the past, someone would have had to analyze this text qualitatively and make huge generalizations, considering the sheer vastness of the source material. Now, however, Blevins shows how one can use the power of topic modeling to analyze the text in a quantitative way. The program groups words that tend to appear together into topics, which the researcher then labels, and it can nearly instantly analyze the diary and present its contents in useful graphs and statistics. For example, a cluster of words like worship, sermons, and the pastor can be labeled “church,” and the program can then show how prominent that topic is in each diary entry. This can be anecdotally tested by taking the topic “cold weather” and graphing its frequency in her diary by the months of the year. Sure enough, she wrote about cold weather far more in the winter months than in the summer. Amazing.
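The “cold weather” check can be imitated with a much simpler stand-in. A real topic model infers its word groupings from the text itself; the sketch below, in Python, skips that step and just counts the diary entries in each month that mention any word from a hand-picked list. The function name, the keyword list, and the sample entries are all invented for illustration and are not quotations from the diary.

```python
import re
from collections import Counter
from datetime import date

def topic_entries_by_month(entries, keywords):
    """Count entries per month (1-12) that mention at least one keyword."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for day, text in entries:
        # Compare whole words so that "cold" does not match "scolded".
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & wanted:
            counts[day.month] += 1
    return counts

# Invented sample entries, each a (date, text) pair.
entries = [
    (date(1790, 1, 5), "Cold and snow all day"),
    (date(1790, 1, 20), "A very cold morning"),
    (date(1790, 7, 4), "Clear and warm. Worked in the garden"),
    (date(1790, 12, 15), "Snow fell at night"),
]
cold_weather = topic_entries_by_month(entries, ["cold", "snow", "frost"])
for month in range(1, 13):
    print(month, cold_weather[month])
```

Plotting those twelve numbers would give a rough version of the seasonal graph described above, though a hand-picked keyword list will miss anything the researcher did not think to include.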
What are the broader implications of topic modeling in historical research? What are the potential weaknesses of such an approach?
One Reply to “Data Analysis: Distant reading, Text Analysis, Visualization”
I really appreciate how you related the readings to each other with your questions. I’m definitely glad I tackled Data Feminism before I read any of the other chapters/articles because it really helped frame all of the big-picture ideas of the readings for the week. The book really opened my eyes to biases in data and to considering the sources. After summarizing Guldi’s article you asked how it relates to Data Feminism. I think Guldi proves exactly how important nuance is to understanding computer-generated data. She highlights how humanity and the different layers of language can skew data, and therefore it shouldn’t be taken at face value. I think it also proves why there should be more collaboration between the STEM world and the Humanities world because they can each point out blind spots for each other.