Time Travel and Word Clouds: A Short Guide to Three Fun and Informative Digital History Websites

Practicums Review Post for January 21:

a) PhilaPlace is an interactive website about space, time, and Philly. Created by the Historical Society of Pennsylvania in 2005, it connects stories to places across time in Philadelphia’s neighborhoods. It presents material in several formats: text, pictures, audio and video clips, and podcasts. It also includes community programs and publications, ranging from workshops for teachers to trolley tours and exhibits.

PhilaPlace focuses on two areas, Old Southwark and the Greater Northern Liberties, which have always been home to immigrants and the working class. Philadelphia was known as a multi-ethnic “workshop of the world.” By using the landscape as a lens, PhilaPlace reveals how each population that arrives in a neighborhood creates new histories, traditions, and memories tied to place. Residents of Philly are encouraged to interact with and contribute to this project. Studies showed that younger users of this website wanted to experience the neighborhoods on their own, while older audiences wanted to continue to have a guided experience.

Perhaps the most fun aspect of this website is its map. By clicking on any pin, you are given well-written and easily digestible information about the place. You feel like you are walking through the city with a very knowledgeable friend who tells you about Philadelphia’s past and present. While using it, you become ever more aware of the concept of space in an exciting way.

b) Historypin is a website that collects, curates, and structures stories to bring people together, one story at a time. It hosts 365,951 stories pinned across 27,844 projects and tours in 2,600 cities. It is built by a community of 80,000+ storytellers, archivists, and citizen historians. Historypin is a not-for-profit organization. It no longer has a community forum due to technical issues, and probably also because of online harassment. To sign up, go to the top right corner; the easiest way to do so is through Facebook. Everyone with a profile can create a collection and upload images and stories to the website. To add a pin, go to the profile page; “add a pin” and “create a tour” are on the right side. One of the most popular collections is the San Francisco MTA archival collection. By navigating with the arrow on the map, you can view pins, which then appear as old archival photos. It feels like you are traveling into the past, but with useful context provided in text. This website is useful for small organizations that want a platform or want to create an easily accessible tour.

c) Wordle helps you generate “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes. Because the Wordle web toy does not work, you should install the desktop version on your laptop. Do not try to use the web one; even after downloading Firefox Extended Support Release, it does not work. Instead, download and install Java if you do not already have it. Then download Wordle for Mac or Windows; the link is on the website’s main page. It’s pretty straightforward after that: you just copy and paste the text.
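
The idea of giving more prominence to more frequent words can be sketched in a few lines of Python. This is only an illustration of the general principle, not Wordle's actual code: the word_weights function name, the tiny stop-word list, and the font-size range are all assumptions made up for this example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real tool would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def word_weights(text, min_size=10, max_size=72):
    """Map each word to a font size that grows with how often it appears."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if not counts:
        return {}
    top = max(counts.values())
    # Linear scaling: the most frequent word gets max_size.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


if __name__ == "__main__":
    sample = "moths and more moths: the moth generator draws moths"
    ranked = sorted(word_weights(sample).items(), key=lambda kv: -kv[1])
    for word, size in ranked:
        print(f"{word:10s} {size:5.1f}")
```

A real word cloud also has to lay the words out without overlap, which this sketch leaves out entirely.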

Featured Image: René Magritte, Golconda (Golconde), 1953. Oil on canvas, 31 1/2 × 39 1/2 in. (80 × 100.3 cm). The Menil Collection, Houston. © 2017 C. Herscovici / Artists Rights Society (ARS), New York

moth sides now: a statement of preservation intent

The purpose of this collection is to document the Twitter bot Moth Generator by capturing its potential as a machine for moth creation; its interactivity in a social media context; and its outputs as a collection of digital images.

In previous writing about Moth Generator, I had framed this project as a way to preserve the potential for new moths and the bot’s interactivity because the artists valued these qualities most. But perhaps it’s possible, or even preferable, to reframe these qualities as aspects to capture on behalf of users.

A comment on the statement of significance pointed to the role of scarcity in the significance and preservation worthiness of digital material. In some ways, the artists’ ability to experience Moth Generator is less “fragile” than that of users. The artists own their code; they control the means of moth production, including the ability to generate images at a higher quality than the bot publishes. Users, on the other hand, have zero control over the moth-drawing algorithm, limited insight into how it works (and then at a conceptual level only), and no real sense of when a moth will come again. These elements characterize the delight and anxiety of Moth Generator from an audience perspective, and their longevity is less sure. After considering possible ways to capture moths as people encounter them, I’ve chosen instead to focus on collecting the elements of Moth Generator that will make it possible, in future, to recreate the sense of serendipity, beauty, and strangeness that users found compelling about the Twitter bot.

The process began with sketching a quick take on the authenticity- and access-related qualities of Moth Generator, following the grid created here for Geocities. It was quickly apparent that no solution might exist or emerge to cover the gamut of levels of authenticity, much less provide easy access. Given the tools and resources that exist today, it appears that collection and access strategies will both need to rely on triangulation. I’ve looked to Henry Lowood’s assertion that “digital repositories should consider the Authentic Experience as more of a reference-point than a deliverable, as a research problem rather than a repository problem,” as support for a triangulation approach in documenting user experiences.

An early pass at identifying levels of authenticity and access for @mothgenerator

Preserving Moth Generator’s potential means that the collection will include the source code for the drawing program behind the generator. Preserving interactivity means crawling or recording the Twitter feed as it looks and behaves today, as well as acquiring tweet datasets reflecting how others interact with the bot. Preserving Moth Generator as a collection of digital images means extracting all images published to the Twitter feed, and maybe even versions of Moth Generator outputs published elsewhere (perhaps as GIFs). Finally, material such as artist interviews and screenshots of programming errors (collected by the artists) provides creative context for the bot and its output.

Four groups of material comprise the contents of this collection. The following sections walk through one possible plan for collecting and preparing these materials for long-term preservation and reuse, including a number of challenges and decision points.

Preserving potential

As described in the statement of significance for Moth Generator, the artists see this project as, at heart, a drawing program that creates images from text input according to certain rules. Preserving the source code for this algorithm, along with any artist-produced documentation, is necessary to maintain the potential for creating new moths — a key to the experiences of both artists and audience. Managing the source code also allows for repurposing or recontextualizing all or part of the drawing program in future. Whether embedding moth generation in a different publishing program or game, or remixing the program and rules to create new kinds of drawings, viable and well-documented source code can help extend the reach of the Twitter bot.

A version-control system like Git could be extremely useful for maintaining both code and documentation. GitHub is a popular Git repository hosting service for making code available, and publishing the drawing program there would make it easier for others to update, fork, and reuse. However, since the artists have thus far kept the code behind Moth Generator mostly hidden, and plan to reuse it themselves in future projects, it’s unclear if an open, public repository would be acceptable. Moreover, as has been frequently noted by archivists, putting something on GitHub is not digital preservation; it can’t be the only way to keep source code for the long term.

Documenting interactivity

Preserving interactivity means capturing the Twitter feed as it looks and behaves today, as well as gathering evidence of how Moth Generator and its audience interacted. While it may not be possible at this time to preserve in full what it’s like to encounter and interact with Moth Generator as a Twitter user, it’s possible to triangulate the experience by adopting several complementary strategies.

First, capturing a WARC (web archive) file of the @mothgenerator Twitter feed using the WebRecorder application produces a file equivalent to a recording that can be played back in a browser. While not all websites render correctly in WebRecorder, Twitter feeds usually benefit from its ability to capture and render dynamic content. WebRecorder’s autoscroll feature allows one to capture an entire Twitter feed by automating the process of scrolling to the end of the page and prompting Twitter to load more data. WebRecorder helpfully records individual tweets as well as the overall feed, which means that comments, retweets, and “favorites” (formerly a star, now a heart icon) are also documented. A sense of shared appreciation with other Twitter users is important to Twitter bots’ interactive appeal, whether or not users actually connect with one another beyond favoriting or retweeting the same thing. Recording the @mothgenerator Twitter feed grabs one view, from the users’ perspective, of how Moth Generator might have engaged its audience.

Collecting and preserving tweet data is another way to document Moth Generator’s interactivity. Think of it as collecting the evidence of ongoing interactivity. A large number of tools exist to help capture social media data for preservation and research. For this project, twarc — a versatile command-line tool for searching and filtering tweets by making calls to Twitter’s Search API — suggested itself as a good fit. Tweet data is returned as JSON, a structured format from which it’s possible to extract different kinds and combinations of tweet content and metadata. Here’s how to submit a call for tweets mentioning “mothgenerator”:

twarc.py --search mothgenerator > tweets.json

Twitter users submit text to be transformed into moths, retweet moths, and mention @mothgenerator in tweets. Each type of interaction could potentially be captured in a tweet data set.
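
As a rough sketch of how those interaction types might be told apart once the data is in hand, the Python below tallies the tweets in a file like tweets.json. It assumes the file holds one tweet per line as JSON, and that each tweet follows the shape of Twitter's v1.1 tweet object (retweeted_status, in_reply_to_screen_name, entities.user_mentions). The function names and the four labels are my own and purely illustrative.

```python
import json
from collections import Counter


def classify_interaction(tweet, bot="mothgenerator"):
    """Label one tweet (a dict in the assumed v1.1 JSON shape)."""
    bot = bot.lower()
    # Any retweet counts here, whatever account was retweeted.
    if "retweeted_status" in tweet:
        return "retweet"
    if (tweet.get("in_reply_to_screen_name") or "").lower() == bot:
        return "reply"
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m.get("screen_name", "").lower() == bot for m in mentions):
        return "mention"
    return "other"


def count_interactions(lines, bot="mothgenerator"):
    """Tally interaction labels from an iterable of JSON lines."""
    tally = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            tally[classify_interaction(json.loads(line), bot)] += 1
    return tally


if __name__ == "__main__":
    with open("tweets.json", encoding="utf-8") as handle:
        print(count_interactions(handle))
```

Replies are checked before mentions because a reply to the bot normally mentions it as well; without that ordering every reply would be counted as a plain mention.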

It’s also possible to use twarc to acquire data for tweets published by the @mothgenerator account as they are posted. The follow option, which streams tweets from a given user ID through Twitter’s streaming API rather than the Search API, can produce a data set with the potential to support derivative works:

twarc.py --follow "3277928935" > tweets.json

Unfortunately, data collection tools that rely on Twitter’s API are not a viable option for capturing tweet data from the lifetime of the @mothgenerator account. Twitter’s Search API “searches against a sampling of recent Tweets published in the past 7 days.” Two key items to note are the retrospective time limit — no tweets older than 7 days can be retrieved — and the word “sampling.” As Ed Summers (the developer behind twarc) points out in a blog post, this troublingly opaque statement lets us know that limited tweets are available, but not how what’s available is selected.

If the artists grant permission to access the @mothgenerator account and download an archive of past tweets, it will at least be possible to fill in outbound tweet data created by the bot. Twitter archives provide both JSON and CSV formats. twarc can be used to capture tweets to and from @mothgenerator going forward, but past interactivity may be mostly limited to what can be obtained indirectly through WebRecorder. Twitter offers access to a Full-Archive Search API as a paid service, but investing in it may not make sense unless a larger social media preservation program were in play.

Looking ahead to access to tweet data, Twitter’s Developer Agreement & Policy places limits on the quantities of complete tweet data that can be made publicly available. Up to 50,000 public tweets can be provided for download in a spreadsheet, PDF, etc. When it comes to JSON data, the policy reads, “If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.” Users of this data would need to use these IDs to retrieve additional data from the API. (See the Be a Good Partner to Twitter section for details.) While the number of tweets involved in this project may not even come close to 50,000, providing data in both forms offers the best combination of manipulability and completeness under the circumstances.
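
Reducing a JSON data set to shareable tweet IDs might look something like the sketch below. It assumes a file with one tweet per line as JSON and relies on the id and id_str fields of Twitter's v1.1 tweet object; the tweet_ids function and the tweet_ids.txt output file are hypothetical names chosen for this example.

```python
import json


def tweet_ids(lines):
    """Yield the string ID of each tweet in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Prefer id_str; large integer IDs can lose precision in some tools.
        yield tweet.get("id_str") or str(tweet["id"])


if __name__ == "__main__":
    with open("tweets.json", encoding="utf-8") as src, \
            open("tweet_ids.txt", "w", encoding="utf-8") as dst:
        for tweet_id in tweet_ids(src):
            dst.write(tweet_id + "\n")
```

Anyone receiving the resulting list would still need their own API access to turn the IDs back into full tweets, and tweets deleted in the meantime would not come back.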

Creating a digital collection

According to this map of a tweet, images posted to Twitter appear in the text field, their URLs truncated to pic.twitter.com… With access to @mothgenerator’s tweet archive and/or to JSON tweet data going forward, it’s possible to filter data sets down to tweets containing images, and to extract those links from tweet texts. Each image will also be recorded in the WARC file of the @mothgenerator feed, and will appear in the browser replay. But I’m still looking for a reliable and efficient way to acquire nearly 4,000 individual images as JPEGs, either from WARC files or from their URLs.
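
One possible starting point, offered as a sketch rather than a tested workflow: besides the truncated link in the tweet text, Twitter's v1.1 tweet JSON also lists attached media under entities and extended_entities, each item carrying a media_url_https value. Assuming that shape and a file with one tweet per line, the Python below gathers the unique photo URLs. The function names are hypothetical, and the actual downloading (and any rate or size limits) is left out.

```python
import json


def image_urls(tweet):
    """Return the photo URLs attached to one tweet dict (assumed v1.1 shape)."""
    # extended_entities lists every photo; entities may list only the first.
    container = tweet.get("extended_entities") or tweet.get("entities") or {}
    return [
        media["media_url_https"]
        for media in container.get("media", [])
        if media.get("type") == "photo" and "media_url_https" in media
    ]


def all_image_urls(lines):
    """Collect unique photo URLs, in first-seen order, from JSON lines."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for line in lines:
        line = line.strip()
        if line:
            for url in image_urls(json.loads(line)):
                seen.setdefault(url, None)
    return list(seen)


if __name__ == "__main__":
    with open("tweets.json", encoding="utf-8") as handle:
        for url in all_image_urls(handle):
            print(url)
```

Deduplicating matters here because retweets in the data set repeat the media of the original tweet, so the same moth could otherwise be fetched many times.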

JPEGs are not a highly recommended long-term preservation format for images but, as pointed out in this comment, the choice of Twitter as a publishing platform means that only JPEGs of a certain (low) quality are made available to @mothgenerator’s audience. Batch migrating low-quality JPEGs to TIFF won’t improve their resolution, but it may be necessary for long-term preservation. While individual images garner varying reactions on Twitter, the moths also matter to users in aggregate. So it’s important to ensure that they survive in bulk, as a collection, rather than cherry-picking personally compelling examples.

Committing to preserve the JPEGs as they appear on Twitter may also forestall any concerns the artists may have about ownership of @mothgenerator and its outputs. If restrictions are established on access to and reuse of the drawing program, the artists would maintain the exclusive ability to produce and sell high-quality prints of moths in future. (This is only an example.) Lower-quality images would not then be seen as detracting from future work and its profitability.

Revealing creative context

Finally, this collection intends to contextualize Moth Generator through the acquisition of material documenting the creative process behind it. Interviews with and essays by the artists, many of which were used to research the statement of significance, will be captured via WebRecorder along with any reader comments. Lists of 10,000 Latin moth names and 4,000 English names were collected via web crawler and currently seed the random generation of moth names. The lists and the web crawler (if built from scratch or customized in any way) may also contribute to this collection. Artist Katie Rose Pipkin has referred to a personal trove of programming error screenshots she has collected throughout the bot-building process. These screenshots offer a glimpse behind the curtain at the intellectual and emotional labor that led to Moth Generator.