My Life as a Warning to Others

Hello, everyone. My name is Andy Cleavenger, and I am beginning my fourth year of this two-year program.

My life up to this point has been spent as a photographer and multimedia specialist at a government contractor. I work in their Communications department. My interest in this class stems from my role as the sole caretaker of our department’s image collection. For over 17 years I have been the only one capable of performing image searches, and the only one concerned with the preservation of those images. I’m in the Digital Curation track to learn how to effectively turn my collection into a self-service resource available to all employees. And I’m in this class specifically to make sure I’m doing everything possible to ensure the long-term preservation of our image collection.

I must admit that the first axiom listed in Owens – “A repository is not a piece of software” – just about made me stand up from my chair and shout “see, I told you!” at my former boss. We have always treated the image collection as a problem that can be solved with a magic-bullet purchase of DAM (digital asset management) software.

“We bought it… we’re done!”

This is, of course, extremely common. Like most offices, they forget about the systems that will come after the present one, and about the unceasing march of technological progress that dictates both the increasing complexity of the images and the expanding diversification of their use. This was nicely summed up in Owens’ last axiom: “Doing digital preservation requires thinking like a futurist.” I fear that they may regret some of the decisions they’ve made, such as stripping all filenames from their videos, throwing everything into a single directory, and then depending on an external proprietary catalog file to hold all related metadata.

We are now married to that system… and it’s failing us.

The remaining articles on either side of the digital dark age debate made some equally compelling points. Ultimately, I felt that Lyons and Tansey both came closest to hitting the mark on what form a digital dark age would take, as well as the forces that would drive it. Lyons frames the problem as one of cultural blindness. That is to say that institutions that exist within and serve a particular society tend to have difficulty in recognizing the value in – or even being aware of – the records of other communities. As such, the digital dark age will manifest itself in the silence of these socio-politically disadvantaged communities within the archival record.

This is not an unfamiliar argument, but I tend to think the motivation is less a conspiratorial omission than a sad pragmatism driven by extremely finite resources. This point was reflected well in Tansey, who makes the point that the long trend of cuts to budgets and staff forces institutions to set priorities that obviously leave gaps in the archival record. In other words, even if an institution is aware of fringe communities, and perhaps even has a sympathetic collections policy for including their records, the pragmatism of limited resources may still dictate their omission as the institution focuses on its highest priorities.

I have certainly seen this in my position in the Communications department. I’m curious whether others in class have seen examples like this in their own workplaces.

Some Thoughts on Visualization in the Humanities, or the worst blog post title ever (sorry)

This week’s readings expressed a wide and deeply conflicted range of attitudes regarding the assorted uses of computers, computer modeling, and the data-ization of the humanities. The authors were all for it, but some of the arguments against the idea that they discussed were interesting – and valid. This validity is incredibly important; having been discussing diversity and cultural inclusion in LBSC 631 this week, I found myself hyper-aware of the attitudes some of the authors were displaying toward their techno-tentative brethren. However, this is a blog, and I am going to make some grand and sweeping statements – which I will then try to back up… hopefully using memes.

Grand Statement #1: Let’s not be that guy.

You know the guy I mean.

Grand Statement #2: “Computers allow you to go further.”

If there is to be a rallying cry for the Digital Humanities, this might well be it. Yesterday, I was whining to a mechanical engineer of my acquaintance about Underwood’s observations of the reluctance in the Academy to embrace digital technologies, and how they fear a total seismic shift* in their world.

I would like to assuage those fears. According to my mechanical engineer, “Computers allow you to go further. They don’t take away the work.” I was scrambling for a pen here, so the next bit is a paraphrase: computers make more work, and they make what you’ve got more accessible.

Take the work done with MALLET: Blevins describes how computer modeling validates itself. The Ballard diary, chosen in part (one assumes) for its completeness, shows how well the computer can model. Blevins even relates how surprised he was that it initially worked so well. But it worked. The tool did the thing it was designed to do. That’s great! And now there’s all this data to play with. If you wanted to focus only on the number of babies born when the crocuses were in bloom, it’s a simple matter of correlating your data (a rough sketch of the idea follows below). If you want to take up the argument discussed in Graphs, Maps, and Trees – that there is no such thing as a “gothic novel” – and dissolve that grouping from Moretti’s chart of genres to see what the effects are, you can do so. Vistas abound; new peaks arise to be surmounted.
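To make “correlating your data” concrete, here is a minimal sketch in Python – emphatically not Blevins’s actual pipeline. It assumes you have already exported MALLET’s per-entry topic proportions to a CSV with hypothetical columns date, midwifery, and gardening (one row per diary entry), and it asks how strongly those two topics move together month by month.

```python
import pandas as pd

# Hypothetical export of MALLET's per-document topic proportions:
# one row per diary entry, with a date and a weight for each topic.
entries = pd.read_csv("ballard_topics.csv", parse_dates=["date"])
entries["month"] = entries["date"].dt.to_period("M")

# Average each topic's weight by month to smooth out day-to-day noise.
monthly = entries.groupby("month")[["midwifery", "gardening"]].mean()

# A single correlation coefficient: do deliveries track the gardening season?
print(monthly["midwifery"].corr(monthly["gardening"]))
```

Swap in whichever topics or external series you care about; the point is that once the model has run, questions like this cost minutes, not years.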


The problem, I think, is that humanists see things like “correlating data” and “data manipulation” and freak out because these are STEM things; they are not scientists, but humanists. Theirs is a world of logic and rhetoric. Well, yes, fine, but notice how scientists get all the grants?

Science vs. Humanities… See, and my family thought I was crazy for being a Humanities major.

We don’t need to spend years waiting on graduate students to count everything by hand. We can load that puppy, or important literary work, into a computer and run analysis – any analysis, all analyses. And then tomorrow, we can do it again, go further and deeper. Instead of relying on grad students, you can partner with other academics on the other side of the world as easily as in the next building over, à la Graham, Milligan, and Weingart. Where privacy and exclusivity are a concern, there is no need to make work public as they did, but doing so opens a work up to more input, catches mistakes earlier, and brings in multiple points of view; no one publishes the book to make money anyway. You publish the book to get tenure.

Grand Statement #3: This is not the Singularity.

Technology is moving at a brisk clip, but we’re not in any danger of being replaced by robots today or tomorrow or the next day. For whatever reason, and I’m going to guess it has a lot to do with not being good with computers 10+ years ago, some humanists aren’t on board with putting the digital into their work. This is a massive disappointment for the rest of us, because the kind of work they’re doing – work like breaking down the linguistic anachronisms in Downton Abbey (a point of much personal vindication for me) and examining the Ballard diary – is really interesting. And doing it with graphs means that people who don’t have PhDs can understand it too. Perhaps therein lies the fear: that if outsiders can see – and understand – what we’re doing, we’ll all revert to the seventh grade and get made fun of by the popular kids for liking to evaluate complexly and dig a little deeper. So how do we embrace our intelligence, and how do we share the fruits of our enthusiasm in the best possible way? I would argue that charts and graphs – visualizations of complex data – are the way forward.


*To be fair, the idea of a seismic shift as representative of a complete overhaul of any working system was no doubt active before Mr. M. Watkins published his article “How Managers Become Leaders,” but it is from him that I got the idea, so I have linked to it in Google Scholar: Michael D. Watkins, “How Managers Become Leaders,” Harvard Business Review 90 (June 2012): 65-72.

 

The “True” Corpus of American English

In an analysis of the Corpus of Historical American English (COHA) and Google Books, Professor Mark Davies of Brigham Young University studies the effectiveness of both corpora and their ability to properly read the English language. Both are corpora of American English, but Google Books contains 500 billion words compared to COHA’s 400 million. Davies argues that although COHA has a significantly smaller database, its trending patterns still mirror those of Google Books. Because of this, Davies argues that COHA is actually the more effective corpus: a smaller database means far less data to sift through, which means quicker searches and faster results.

COHA’s “toys” are what make it the more useful database in Davies’s eyes. While Google provides the same basic function (showing the frequency of word usage across the decades), COHA is able to track concepts, related words, and changes in meaning. Whereas Google tends to have a one-track mind, just like its general search engine, COHA manages to “think” about relations among the words placed into the search. Because it can look at relations, forms, root words, and even cultural shifts, its searches are much more comprehensive (a rough sense of what tracking “related words” involves is sketched below).
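To give a sense of what “related words” means in practice, here is a rough, hand-rolled Python sketch of the simplest version of the idea: count which words show up near a target word in a plain-text corpus. The filename and the target word are purely illustrative, and this is nothing like COHA’s actual machinery.

```python
import re
from collections import Counter

# Hypothetical plain-text slice of a corpus.
text = open("corpus_1900s.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z']+", text)

target = "awful"   # a classic change-of-meaning example
window = 4         # how many words on either side count as neighbors

neighbors = Counter()
for i, w in enumerate(words):
    if w == target:
        neighbors.update(words[max(0, i - window):i] + words[i + 1:i + 1 + window])

# The most frequent neighbors hint at how the word is being used in this slice.
print(neighbors.most_common(20))
```

Run the same count against slices from different decades, and the shifting company a word keeps starts to show its change in meaning.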

Design is a huge issue for some researchers, and with this in mind, Google definitely has the upper hand. True, as Davies puts it, COHA is able to portray the same statistics effectively, but I had a genuinely hard time navigating the site. Bar graphs are nice for portraying how many dark-haired, blue-eyed people are in a class, but for a corpus of American English I found them rather ineffective. Google Books has a much more pleasing site: nicer to the eye and easier for following the patterns of the language. Yes, COHA’s tables are nice for their alternative searches, but as I said, traversing the site is actually rather difficult. Google provides a much more streamlined site, with actions that are easy to follow.

Unfortunately for Google, I think the sheer number of words in the database has made it impossible to create proper analyses of the grammar, word meanings, and word foundations that COHA is able to analyze successfully. If Google Books created an efficient way to sift through all of that information quickly enough, it would immediately become the preferred site. As it stands, however, its inability to process such large amounts of data (largely its own fault) has rendered it ineffective as a “true” corpus of the American English language.

 

Millions of Digitized Books, Hundreds of Fascinating Conclusions

Jean-Baptiste Michel et al.’s short and sweet article Quantitative Analysis of Culture Using Millions of Digitized Books raises a number of bold points that show just how valuable Google’s (originally considered foolhardy) Google Books project has been to historians.  The project takes nearly 5.2 million books (over 4% of those ever written, a very significant sample) containing over 500 billion words and searches them.  Let’s pause for a minute and think about what that means.  25 years ago, or even 10 years ago, if you said you wanted to search a sizable sample of every book ever written for certain words, you would have had your head examined.  The paper points out that it would take a human 80 years just to read all the books written in one year, 2000.  Here’s a device that can go through the entire corpus quicker than the blink of an eye.  Roughly two-thirds of the 500 billion words are in English, and there is only a significant sample size for books from 1800 on (though there are a fair number from 1600-1800), but even with these limitations, the work allowed the researchers to come to some bold conclusions.
“What conclusions?” you ask?  Try this one on for size: they estimate that most dictionaries might contain as little as 52% of the living lexicon at any given moment.  They estimate the total lexicon of 1-grams (single words, excluding symbols, numbers, typos, etc.) at 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000 (counting only 1-grams that appear more often than once per billion English words – a cutoff sketched in code below).  Some of these words are missing because of dictionaries’ traditional dislike of compound words, but other omissions are inexcusable (they point to “deletable” as a particularly ironic example).  This lexical “dark matter,” in their charming expression, consists of words that are fresh for research.  No OED entry has ever examined every facet of these words, and no amount of looking them up will find them.  The n-gram has saved these potentially valuable expressions from the invisibility of their hidden nature.
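For the curious, here is a toy Python version of that once-per-billion-words cutoff. It is emphatically not the paper’s code, and the filename is hypothetical; on a small file the threshold filters almost nothing, but against hundreds of billions of words it is what separates the working lexicon from noise.

```python
import re
from collections import Counter

# Hypothetical slice of the corpus for a single year.
text = open("books_2000.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z]+", text)   # 1-grams only: no numbers or symbols

counts = Counter(words)
total = sum(counts.values())
threshold = total / 1_000_000_000     # "more than 1 per billion words" cutoff

lexicon = {w for w, c in counts.items() if c > threshold}
print(f"{len(lexicon):,} distinct 1-grams above the frequency cutoff")
```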

 

Another bold feature the n-gram allows is tracing the rise and fall of terms over time.  Much has been made of the example of the n-gram for “World War I” vs. “Great War,” where “Great War” holds strong until 1939, then falls off, while “World War I” rises to pick up the slack, but it’s hardly the only example.  You can do the n-gram test yourself (a rough sketch of the idea follows below) and see the decline of a good many words and phrases, and the introduction of others.  Ever been curious to see whether anyone said “Yadda-yadda-yadda” before Jerry Seinfeld?  Want to map “Reality Television” vs. “Situational Comedy” and see if you can identify the year Survivor was released?  Want to compare Jean-Baptiste Lamarck with Charles Darwin, or Karl Marx with Sigmund Freud?  The world is your oyster.
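Here is a minimal sketch in Python of what “doing the n-gram test yourself” might look like, assuming you have already pulled the two phrases’ yearly counts out of the downloadable Google Books n-gram files into a hypothetical CSV with columns phrase, year, and count.

```python
import csv
from collections import defaultdict

# counts[phrase][year] = number of occurrences that year
counts = defaultdict(dict)
with open("war_ngrams.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["phrase"]][int(row["year"])] = int(row["count"])

# Watch "Great War" fall off after 1939 while "World War I" picks up the slack.
for year in range(1914, 1950):
    great = counts["Great War"].get(year, 0)
    ww1 = counts["World War I"].get(year, 0)
    print(year, "Great War:", great, "World War I:", ww1)
```

Normalizing each count by the total words published that year would give the same relative-frequency curves the Ngram Viewer draws.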

 

The n-gram can also detect the death of older, archaic forms of words.  “Spilled” is becoming the past tense of “to spill,” but there is no use in crying over spilt milk about it; “spilt” had a long run.  Contemporary spouters of aphorisms think that all that glitters is not gold, but their fathers sagely opined that all that “glisters” is not gold.  Indeed, past-tense verbs that end in “t” are fighting a slow, steady, losing battle against “ed.”  Can they survive?  I feel I’ve spoilt the ending of this struggle, but I’ve been burnt on these predictions before.

 

The final section of the article struck (or will it become “striked”?) a more somber note: repression.  Examining the use of the word “Trotsky” in Russian-language sources through the 1920s tells a harrowing tale, but everyone expected as much.  (I wanted to run a similar test on “New Economic Policy” vs. “Five Year Plan,” but, alas, I speak no Russian, and the English results are pretty meaningless.)  What is more interesting is the revelation of people never before suspected of having been suppressed.  The Nazi regime’s list of degenerate artists was apparently far more extensive than generally known, as people never included in the traditional narrative saw their mentions in the German press fall off the face of the earth in the late 1930s.  Again, this was just a cursory exercise: this kind of n-gram search opens up the possibility of a new way of looking both at the more blatant Nazi/Soviet repression and at the more subtle blacklisting preferred in the West.  There are millions of possibilities that n-grams open up for these millions of books.