In assembling an Archival Information Package for material documenting @mothgenerator, I had high hopes of being able to put together a METS file with descriptive, administrative, technical, and structural metadata for the entire AIP. I had been looking at Archivematica’s documentation for METS implementation in AIPs as a potential model but quickly realized that, between the variety of content and material types in the AIP and my beginner-level understanding of METS elements and attributes, I would have a nearly impossible time trying to piece together on my own what a full-service digital preservation system could produce relatively quickly.
Another compromise came in assigning and not assigning item-level descriptive metadata. This would be relatively easy for some digital objects — a folder full of JPEGs, for example — while less efficient for others — such as the CSS and Javascript files accompanying a downloaded Twitter archive. I have to confess that time became a factor here, as I hadn’t previously researched exactly what each of the Twitter archive files was for, and am still working through what WARC files are made of. “May contain: Data of various types.” Consider the solutions here to be in the spirit of extensible processing: one pass at baseline description, with plans to take a closer look in the future.
What should constitute baseline description, anyway? Each of the material groupings described in the statement of preservation intent needed its own folder; ultimately six total:
1_Drawing_Program. Files associated with the drawing algorithm behind Moth Generator.
2_Twitter_WARC. Captures of the @mothgenerator Twitter feed recorded with WebRecorder as WARC files.
3_Twitter_Archive. Tweet archive downloaded from the @mothgenerator account. Includes tweet data as both JSON and CSV.
4_Digital_Images. 4,000 digital images previously generated by the drawing algorithm and published by @mothgenerator, captured and stored as JPEGs.
5_Process_Docs. A collection of tools, texts, and images created in the process of building Moth Generator.
6_Artist_Interviews. Captures of online interviews, news coverage, and essays related to Moth Generator, the artists’ bodies of work, and Twitter bots in general.
An overview of folders in the Moth Generator AIP.
Carrying out a preservation plan in full would require a lot of artist participation, which I happen not to have for various reasons. As a result many of these folders include dummy files standing in for one or several thousand like them. 1_Drawing_Program contains a text file representing Javascript files and libraries needed for a complex text processing and drawing algorithm. I don’t have actual tweet data for the @mothgenerator account in 3_Twitter_Archive. And 4_Digital_Images includes 25 JPEGs of more than 4,000 published to date on the @mothgenerator Twitter feed. Because this AIP isn’t really where it needs to be to substantively document Moth Generator, I’ve made it available to download here rather than at the Internet Archive — which would have implied a certain readiness for public consumption.
One of several ways in which FITS and I didn’t get along.
Adventures in metadata creation continued as I experimented with different tools for generating file inventories and checksums. Having had repeated bad luck running FITS on its own, I next tried DataAccessioner, a GUI tool developed at the Duke University Libraries. To use DataAccessioner, one identifies a source and target directory, excludes material not for accessioning, enters Dublin Core description at the collection, folder, or file level, and clicks Migrate. DataAccessioner will then migrate the selected material from the source directory to the target directory and output an XML file with technical and administrative metadata such as file formats, file sizes, and checksums.
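For a sense of what a bare-bones file inventory with checksums involves, here is a minimal Python sketch. It is not how DataAccessioner or FITS work internally; the function names and the choice of MD5 as the default algorithm are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root, algorithm="md5"):
    """List every file under root with its relative path, size, and checksum."""
    root = Path(root)
    inventory = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            inventory.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                algorithm: file_checksum(path, algorithm),
            })
    return inventory
```

Calling build_inventory on the top-level AIP folder would return one dictionary per file, which could then be written out as CSV or compared against a later run to check fixity.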
Entering metadata (left) and looking over file structure (right) in DataAccessioner.
Here’s a look at the file I ended up with, after some false starts. It includes all description assigned at folder level, plus PREMIS data for each folder and file in the collection:
Collection-level metadata (DC) created with DataAccessioner.
File-level technical and administrative metadata (PREMIS) created with DataAccessioner.
This file is the key outcome, for me, of using DataAccessioner in the first place. DataAccessioner uses FITS for file identification, and the ability to add description is helpful, if super slow by hand. The XML output can be transformed with XSLT or with this handy-sounding GUI, which I have not yet had a chance to try. There are other ways to transfer files (if this were 100% archival material I might have tried creating a bag with BagIt or Exactly), but this XML output is really helpful.
Although I had started “cataloging” AIP contents at item-level, the prospect of re-entering it all in DataAccessioner or writing XML more or less by hand did not fill me with joy. A heavy-duty repository system would have let me ingest metadata from a spreadsheet (don’t hate, appreciate). I settled on assigning Dublin Core (15 elements) to the collection as a whole and to each of the six high-level folders. For the last folder, 6_Artist_Interviews, I went one level further down to distinguish between the rights situations of articles saved as HTML pages versus those recorded as WARC files. I used the Getty Art & Architecture Thesaurus for subject terms, in part to see how far Moth Generator could stretch it.
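The spreadsheet-to-metadata idea can be approximated without a repository system. The sketch below assumes a CSV whose column headers are Dublin Core element names (title, creator, format, and so on). The wrapper elements "records" and "record", the function name, and the sample row are hypothetical and do not match DataAccessioner's output or any particular schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def rows_to_dublin_core(csv_text):
    """Build one <record> per spreadsheet row, one dc: element per non-empty cell."""
    root = ET.Element("records")  # hypothetical wrapper element
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Skip overflow cells (column is None) and empty or missing values.
            if column is None or not value or not value.strip():
                continue
            tag = f"{{{DC_NS}}}{column.strip().lower()}"
            ET.SubElement(record, tag).text = value.strip()
    return root


# Hypothetical sample row; the file name is made up for illustration.
sample = """title,creator,format
moth_0001.jpg,Moth Generator,image/jpeg
"""
print(ET.tostring(rows_to_dublin_core(sample), encoding="unicode"))
```

Keeping the description in a spreadsheet and generating XML from it means item-level metadata only has to be typed once, whatever tool ends up packaging the files.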
Overall, this stage of the project has renewed my appreciation for the batshit crazy world of metadata creation and reconciliation in digital preservation, much of which is now accomplished by cleverly designed tools and therefore taken for granted by the blissfully ignorant rest of us. I subscribe to the prevailing (?) wisdom that sometimes it’s best to let the bits describe themselves, but also need the occasional reminder that blood / sweat / tears makes this possible.
I had also been feeling pretty smug about how good @mothgenerator was looking in WebRecorder, and thought I had things wrapped up. But the Digital Preservation Moth politely begged to differ.
The purpose of this collection is to document the Twitter bot Moth Generator by capturing its potential as a machine for moth creation; its interactivity in social media context; and its outputs as a collection of digital images.
In previous writing about Moth Generator, I had framed this project as a way to preserve the potential for new moths and the bot’s interactivity because the artists valued these qualities most. But perhaps it’s possible, or even preferable, to reframe these qualities as aspects to capture on behalf of users.
A comment on the statement of significance pointed to the role of scarcity in the significance and preservation worthiness of digital material. In some ways, the artists’ ability to experience Moth Generator is less “fragile” than that of users. The artists own their code; they control the means of moth production, including the ability to generate images at a higher quality than the bot publishes. Users, on the other hand, have zero control over the moth-drawing algorithm, limited insight about how it works (and then at a conceptual level only), and no real sense of when a moth will come again. These elements characterize the delight and anxiety of Moth Generator from an audience perspective, and their longevity is less sure. After considering possible ways to capture moths as people encounter them, I’ve chosen instead to focus on collecting the elements of Moth Generator that will make it possible, in future, to recreate the sense of serendipity, beauty, and strangeness that users found compelling about the Twitter bot.
The process began with sketching a quick take on the authenticity- and access-related qualities of Moth Generator, following the grid created here for Geocities. It was quickly apparent that no solution might exist or emerge to cover the gamut of levels of authenticity, much less provide easy access. Given the tools and resources that exist today, it appears that collection and access strategies will both need to rely on triangulation. I’ve looked to Henry Lowood’s assertion that “digital repositories should consider the Authentic Experience as more of a reference-point than a deliverable, as a research problem rather than a repository problem,” as support for a triangulation approach in documenting user experiences.
An early pass at identifying levels of authenticity and access for @mothgenerator
Preserving Moth Generator’s potential means that the collection will include the source code for the drawing program behind the generator. Preserving interactivity means crawling or recording the Twitter feed as it looks and behaves today, as well as acquiring tweet datasets reflecting how others interact with the bot. Preserving Moth Generator as a collection of digital images means extracting all images published to the Twitter feed, maybe even versions of Moth Generator outputs published elsewhere (perhaps as GIFs). Finally, material such as artist interviews and screenshots of programming errors (collected by the artists) provides creative context for the bot and its output.
Four groups of material comprise the contents of this collection. The following sections walk through one possible plan for collecting and preparing these materials for long-term preservation and reuse, including a number of challenges and decision points.
Preserving potential
As described in the statement of significance for Moth Generator, the artists see this project as, at heart, a drawing program that creates images from text input according to certain rules. Preserving the source code for this algorithm, along with any artist-produced documentation, is necessary to maintain the potential for creating new moths — a key to the experiences of both artists and audience. Managing the source code also allows for repurposing or recontextualizing all or part of the drawing program in future. Whether embedding moth generation in a different publishing program or game, or remixing the program and rules to create new kinds of drawings, viable and well-documented source code can help extend the reach of the Twitter bot.
A version-control system like Git could be extremely useful for maintaining both code and documentation. GitHub is a popular Git repository hosting service for making code available, and publishing the drawing program there would make it easier for others to update, fork, and reuse. However, since the artists have thus far kept the code behind Moth Generator mostly hidden, and plan to reuse it themselves in future projects, it’s unclear if an open, public repository would be acceptable. Moreover, as has been frequently noted by archivists, putting something on GitHub is not digital preservation; it can’t be the only way to keep source code for the long term.
Documenting interactivity
Preserving interactivity means capturing the Twitter feed as it looks and behaves today, as well as gathering evidence of how Moth Generator and its audience interacted. While it may not be possible at this time to preserve in full what it’s like to encounter and interact with Moth Generator as a Twitter user, it’s possible to triangulate the experience by adopting several complementary strategies.
First, capturing a WARC (web archive) file of the @mothgenerator Twitter feed using the WebRecorder application produces a file equivalent to a recording that can be played back in a browser. While not all websites render correctly in WebRecorder, Twitter feeds usually benefit from its ability to capture and render dynamic content. WebRecorder’s autoscroll feature allows one to capture an entire Twitter feed by automating the process of scrolling to the end of the page and prompting Twitter to load more data. WebRecorder helpfully records individual tweets as well as the overall feed, which means that comments, retweets, and “favorites” (formerly a star, now a heart icon) are also documented. A sense of shared appreciation with other Twitter users is important to Twitter bots’ interactive appeal, whether or not users actually connect with one another beyond favoriting or retweeting the same thing. Recording the @mothgenerator Twitter feed grabs one view, from the users’ perspective, of how Moth Generator might have engaged its audience.
Collecting and preserving tweet data is another way to document Moth Generator’s interactivity. Think of it as collecting the evidence of ongoing interactivity. A large number of tools exist to help capture social media data for preservation and research. For this project, twarc — a versatile command-line tool for searching and filtering tweets by making calls to Twitter’s Search API — suggested itself as a good fit. Tweet data is returned as JSON, a structured format from which it’s possible to extract different kinds and combinations of tweet content and metadata. Here’s how to submit a call for tweets mentioning “mothgenerator”:
twarc.py --search mothgenerator > tweets.json
Twitter users submit text to be transformed into moths, retweet moths, and mention @mothgenerator in tweets. Each type of interaction could potentially be captured in a tweet data set.
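As a rough illustration of how those interaction types might be separated, the sketch below reads line-oriented JSON (one tweet per line, which is how twarc writes its output) and labels each tweet. The field names come from Twitter's standard tweet JSON; the function names, the labels, and the order of the checks are assumptions made for this example.

```python
import json


def classify_interaction(tweet, bot="mothgenerator"):
    """Label a tweet as a retweet of, reply to, or mention of the bot (or None)."""
    bot = bot.lower()
    retweeted = tweet.get("retweeted_status")
    if retweeted and retweeted.get("user", {}).get("screen_name", "").lower() == bot:
        return "retweet"
    if (tweet.get("in_reply_to_screen_name") or "").lower() == bot:
        return "reply"
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m.get("screen_name", "").lower() == bot for m in mentions):
        return "mention"
    return None


def count_interactions(lines, bot="mothgenerator"):
    """Tally interaction types across an iterable of JSON lines."""
    counts = {}
    for line in lines:
        if line.strip():
            kind = classify_interaction(json.loads(line), bot)
            if kind:
                counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Passing an open tweets.json file to count_interactions would give a quick tally. A reply is checked before a plain mention because a reply to the bot also mentions it.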
It’s also possible to use twarc to collect tweets published by the @mothgenerator account as they are posted. The stream-from-user option, which relies on Twitter’s streaming API rather than the Search API, can produce a data set with the potential to support derivative works:
twarc.py --follow "3277928935" > tweets.json
Unfortunately, data collection tools that rely on Twitter’s API are not a viable option for capturing tweet data from the lifetime of the @mothgenerator account. Twitter’s Search API “searches against a sampling of recent Tweets published in the past 7 days.” Two key items to note are the retrospective time limit — no tweets older than 7 days can be retrieved — and the word “sampling.” As Ed Summers (the developer behind twarc) points out in a blog post, this troublingly opaque statement lets us know that limited tweets are available, but not how what’s available is selected.
If given permission from the artists to access the @mothgenerator account and download an archive of tweets past, it will at least be possible to fill in outbound tweet data created by the bot. Twitter archives provide both JSON and CSV formats. twarc can be used to capture tweets to and from @mothgenerator going forward, but past interactivity may be mostly limited to what can be obtained indirectly through WebRecorder. Twitter offers access to a Full-Archive Search API as a paid service, investing in which may not make sense unless a larger social media preservation program were in play.
Looking ahead to access to tweet data, Twitter’s Developer Agreement & Policy places limits on the quantities of complete tweet data that can be made publicly available. Up to 50,000 public tweets can be provided for download in a spreadsheet, PDF, etc. When it comes to JSON data, the policy reads, “If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.” Users of this data would need to use these IDs to retrieve additional data from the API. (See the Be a Good Partner to Twitter section for details.) While the number of tweets involved in this project may not even come close to 50,000, providing data in both forms offers the best combination of manipulability and completeness under the circumstances.
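Reducing a JSON data set to a shareable list of tweet IDs is a small step. This is a minimal sketch, assuming line-oriented JSON with one tweet per line and the standard id_str field; the function names and file names are placeholders.

```python
import json


def tweet_ids(lines):
    """Yield the id_str of each tweet in line-oriented JSON, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)["id_str"]


def write_tweet_ids(source_path, dest_path):
    """Write one tweet ID per line; both paths are supplied by the caller."""
    with open(source_path, encoding="utf-8") as source, \
            open(dest_path, "w", encoding="utf-8") as dest:
        for tweet_id in tweet_ids(source):
            dest.write(tweet_id + "\n")
```

For example, write_tweet_ids("tweets.json", "tweet_ids.txt") would produce a plain text file of IDs. Using id_str rather than the numeric id avoids precision problems in tools that treat large integers as floating-point numbers.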
Creating a digital collection
According to this map of a tweet, images posted to Twitter appear in the text field, their URLs truncated to pic.twitter.com… With access to @mothgenerator’s tweet archive and/or to JSON tweet data going forward, it’s possible to filter data sets down to tweets containing images, and to extract those links from tweet texts. Each image will also be recorded in the WARC file of the @mothgenerator feed, and will appear in the browser replay. But I’m still looking for a reliable and efficient way to acquire nearly 4,000 individual images as JPEGs, either from WARC files or from their URLs.
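One possible starting point for the JSON route: besides the shortened link in the tweet text, Twitter's tweet JSON also lists attached media under entities (and extended_entities, when present), each with a media_url_https value. The sketch below only gathers those URLs and has not been run against the @mothgenerator data; the function names are made up for this example.

```python
import json


def image_urls(tweet):
    """Return media_url_https for each photo attached to a tweet."""
    entities = tweet.get("extended_entities") or tweet.get("entities", {})
    return [
        media["media_url_https"]
        for media in entities.get("media", [])
        if media.get("type") == "photo" and "media_url_https" in media
    ]


def collect_image_urls(lines):
    """Gather photo URLs from line-oriented tweet JSON, preserving order."""
    urls = []
    for line in lines:
        if line.strip():
            urls.extend(image_urls(json.loads(line)))
    return urls
```

The resulting list could be handed to a download tool or fetched one URL at a time. Whether the files retrieved this way match what users actually see in the feed would still need checking against the WARC replay.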
JPEGs are not a highly recommended long-term preservation format for images but, as pointed out in this comment, the choice of Twitter as a publishing platform means that only JPEGs of a certain (low) quality are made available to @mothgenerator’s audience. Batch migrating low-quality JPEGs to TIFF won’t improve their resolution, but it may be necessary for long-term preservation. While individual images garner varying reactions on Twitter, the moths also matter to users in aggregate. So it’s important to ensure that they survive in bulk, as a collection, rather than cherry-picking personally compelling examples.
Committing to preserve the JPEGs as they appear on Twitter may also forestall any concerns the artists may have about ownership of @mothgenerator and its outputs. If restrictions are established on access to and reuse of the drawing program, the artists would maintain the exclusive ability to produce and sell high-quality prints of moths in future. (This is only an example.) Lower-quality images would not then be seen as detracting from future work and its profitability.
Revealing creative context
Finally, this collection intends to contextualize Moth Generator through the acquisition of material documenting the creative process behind it. Interviews with and essays by the artists (many of which were used to research the statement of significance) will be captured via WebRecorder along with any reader comments. Lists of 10,000 Latin moth names and 4,000 English names were collected via web crawler and currently seed the random generation of moth names. The lists and the web crawler (if built from scratch or customized in any way) may also contribute to this collection. Artist Katie Rose Pipkin has referred to a personal trove of programming error screenshots she has collected throughout the bot-building process. These screenshots offer a glimpse behind the curtain at the intellectual and emotional labor that led to Moth Generator.
Moth Generator (@mothgenerator) is an interactive, multi-faceted, collaborative digital artwork by Katie Rose Pipkin and Loren Schmidt. The following statements illustrate its complexity and set the stage for an eventual preservation plan for this work:
Moth Generator is:
A Javascript drawing program that creates images of imaginary moths by translating text into numbers
A Twitter feed where moths are regularly published and @replies are used as moth-generating text
A collection of computer-generated moth images and names, including looping animations created from generated moths and reused for other purposes
An element of a complex virtual world project
A collaboration between a game designer and an artist whose work deals in large part with code and bots