Digital Preservation Policy: Web Archiving for the Washingtoniana Collection

Introduction:

In my previous posts on this blog I have surveyed the digital preservation state of the District of Columbia Public Library's Washingtoniana collection. This survey was performed via an interview with Digital Curation Librarian Lauren Algee, using the NDSA Levels of Digital Preservation as a reference point.

In our survey we discovered that the DCPL Washingtoniana collection has very effective digital preservation practices: through a combination of knowledgeable staff and the Preservica service (an OAIS-compliant digital preservation service), it nearly reaches the fourth level in every category of the NDSA Levels of Digital Preservation. With this in mind, my next-step plan for the archive looks at a number of areas in which the archive has been interested in expanding, and presents some thoughts on where it could begin taking steps toward preserving those materials.

Of particular interest in this regard is the collecting of website materials. Because websites are dynamic objects in a relatively new medium, collecting them can be fairly complex: it is hard to pin down precisely when a website has been sufficiently collected. Websites may appear differently in different browsers, they may contain many links to other websites, they change rapidly, and they often contain multimedia elements. Outlined below is a policy that discusses these issues and offers a digital preservation plan for websites.

Website Digital Preservation Policy for the Washingtoniana collection

The Washingtoniana collection was founded in 1905 when library director Dr. George F. Bowerman began collecting materials on the local community. The collection stands as one of the foremost archives on the Washington, D.C. area and its community, history, and culture. With DC social life and culture increasingly moving to online or born-digital platforms, it is natural for the Washingtoniana collection to consider collecting websites.

Selection

The same criteria used for selecting other Washingtoniana materials should apply here. Websites should be considered if they pertain to Washington, D.C. or its surrounding areas; to events that take place in or discuss that area; to prominent Washington, D.C.-related persons or institutions; or otherwise to Washington, D.C. community, arts, culture, or history.

As with any physical preservation decision, triage is an essential process. Websites that are likely to be at risk should be high priority, and in a sense all web content is at risk. Websites built for a specific purpose or a specific event may have a limited operational window. Sites for defunct businesses, political election sites, and even an existing website on a specific day may be vulnerable, and thus candidates for capture. In addition, the materials in question should not already be collected elsewhere, and should be considered in relation to the rest of the collection.

Although automation tools may be used for identification, final selection decisions rest with librarians. In addition, suggestions from patrons relevant to the collection should be considered, and a system for managing and encouraging such suggestions may be put in place.

Metadata

A metadata standard such as MODS (Metadata Object Description Schema) should be used to describe each website. MODS is a flexible schema expressed in XML, is fairly compatible with library records, and allows more complex metadata than Dublin Core, so it may work well here. Metadata should include, but not be limited to, website name, content producers, URL, access dates, and fixity information, as well as technical information that may be generated automatically by web crawlers, such as timestamps, URI, MIME type, and size in bytes. Extraction information, file format, and migration information should also be maintained.
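As a rough illustration of how such a record might be generated at capture time, the sketch below builds a minimal MODS description with Python's standard library. The function name and the specific element choices (title, URL, a capture note carrying date and fixity) are my own illustrative assumptions, not a full MODS application profile.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def build_mods_record(title, url, access_date, checksum):
    """Sketch of a minimal MODS description for an archived website.

    Element choices are illustrative; a production profile would map
    many more of the fields listed above (producers, MIME type, size).
    """
    ET.register_namespace("", MODS_NS)  # serialize MODS as the default namespace
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title

    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    ET.SubElement(location, f"{{{MODS_NS}}}url").text = url

    # Capture context: access date and fixity, kept together in a note
    note = ET.SubElement(mods, f"{{{MODS_NS}}}note", attrib={"type": "capture"})
    note.text = f"Captured {access_date}; SHA-256 {checksum}"

    return ET.tostring(mods, encoding="unicode")
```

A record produced this way could then be stored alongside the crawler's automatically generated technical metadata.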

Collection

A variety of collection tools exist for web archiving. The tool selected should be capable of the tasks below, as outlined by the Library of Congress web archiving page:

  • Retrieve all code, images, documents, media, and other files essential to reproducing the website as completely as possible.
  • Capture and preserve technical metadata from both web servers (e.g., HTTP headers) and the crawler (e.g., context of capture, date and time stamp, and crawl conditions). Date/time information is especially important for distinguishing among successive captures of the same resources.
  • Store the content in exactly the same form as it was delivered. HTML and other code are always left intact; dynamic modifications are made on-the-fly during web archive replay.
  • Maintain platform and file system independence. Technical metadata is not recorded via file system-specific mechanisms.

A variety of tools are capable of these tasks; a web crawler such as Heritrix, the open-source archival crawler, or a subscription solution such as Archive-It should be used. Both are maintained by the Internet Archive: the first is an open-source tool, while the second is a subscription-based service that includes storage on Internet Archive servers.

Upon initial collection, fixity information should be generated using a checksum. This can be automated either with a staff-written script or with a tool like BagIt, which generates fixity information automatically. This information should be maintained with the rest of the metadata for the digital object.
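A staff-written script of the kind mentioned above might look like the following sketch, which streams each file through SHA-256 and writes a BagIt-style manifest (one "&lt;hash&gt;  &lt;relative path&gt;" line per file). The function and file names are illustrative assumptions; BagIt itself would also add bag metadata and tag files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute a SHA-256 fixity value by streaming the file in chunks,
    so large WARC files are not read into memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory, manifest_name="manifest-sha256.txt"):
    """Write a BagIt-style manifest: one '<hash>  <relative path>' line per file."""
    directory = Path(directory)
    lines = []
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(directory)}")
    (directory / manifest_name).write_text("\n".join(lines) + "\n")
```

The resulting manifest can travel with the object and be re-verified at any later point in the workflow.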

Websites should be kept in the most stable web archival format available. As of this post's writing, that format is WARC (Web ARChive). WARC allows multiple digital resources to be combined into a single file, which is useful because many web resources are complex and contain many items. Other file formats may be accepted if archived webpages are received from donors.
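To make the format concrete, the sketch below hand-assembles a single WARC/1.0 "resource"-style record: a block of named headers, a blank line, the payload, and a trailing record separator. This is purely illustrative of the structure; real captures should come from a crawler such as Heritrix, which also writes request/response records and richer header metadata.

```python
import uuid
from datetime import datetime, timezone


def make_warc_record(target_uri, payload: bytes):
    """Illustrative sketch of one WARC/1.0 record (not a full WARC writer).

    A WARC file is simply a concatenation of such records, which is why
    a whole crawl of many resources can live in a single file.
    """
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",  # length of the payload block in bytes
    ]
    # Header block, blank line, payload, then the two-CRLF record separator
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"
```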

Preservation

Upon initial ingestion, items may be kept on internal drives and copied to at least one other location. Before an item is moved into any further storage system, the file should be scanned for viruses, malware, and any other undesirable or damaging content, using safety standards agreed upon with the division of IT services. At this point fixity information should be taken as described above and entered into the metadata record.

Metadata should be created as soon as possible, at which point the object with its attached metadata should be uploaded into the Washingtoniana's instance of Preservica.

Although Preservica automates much of the preservation process, a copy of the web archive should be kept on external hard drives. At a yearly interval, a selection of the items on those drives should be checked against the items in Preservica to ensure that Preservica's fixity checks and obsolescence monitoring are working as desired.
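The yearly spot-check could be scripted along these lines: draw a random sample of items, rehash the local copies, and report any that no longer match the recorded fixity values. The function name and the shape of `recorded_fixity` (relative path mapped to expected SHA-256 digest, e.g. exported from Preservica's metadata) are assumptions for the sketch.

```python
import hashlib
import random
from pathlib import Path


def spot_check(directory, recorded_fixity, sample_size=10, seed=None):
    """Re-hash a random sample of local files and compare against recorded
    SHA-256 values. Returns the relative paths whose current hash no longer
    matches the record; an empty list means the sample checked out."""
    rng = random.Random(seed)  # seed allows a repeatable audit sample
    paths = list(recorded_fixity)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    mismatches = []
    for rel in sample:
        data = (Path(directory) / rel).read_bytes()
        if hashlib.sha256(data).hexdigest() != recorded_fixity[rel]:
            mismatches.append(rel)
    return mismatches
```

Any mismatch would then trigger a full check of both copies to determine which side has degraded.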

References

Jack, P. (2014, February 27). Heritrix-Introduction. Retrieved November 14, 2016, from https://webarchive.jira.com/wiki/display/Heritrix/Heritrix#Heritrix-Introduction
Web Archiving-Collection development. (n.d.). Retrieved November 16, 2016, from https://library.stanford.edu/projects/web-archiving/collection-development
The Washingtoniana Collection. (n.d.). Retrieved November 16, 2016, from http://www.dclibrary.org/node/35928
Web Archiving at the Library of Congress. (n.d.). Retrieved November 16, 2016, from https://www.loc.gov/webarchiving/technical.html
Niu, J. (2012). An Overview of Web Archiving. Retrieved November 16, 2016, from http://www.dlib.org/dlib/march12/niu/03niu1.html
AVPreserve » Tools. (n.d.). Retrieved November 17, 2016, from https://www.avpreserve.com/avpsresources/tools/
Kunze, J., Boyko, A., Vargas, B., Littman, J., & Madden, L. (2012, April 2). The BagIt File Packaging Format (V0.97), draft-kunze-bagit-07. Retrieved November 17, 2016, from http://www.digitalpreservation.gov/documents/bagitspec.pdf
MODS: Uses and Features. (2016, February 1). Retrieved November 17, 2016, from http://loc.gov/standards/mods/mods-overview.html
About Us. (2014). Retrieved November 17, 2016, from https://archive-it.org/blog/learn-more/


The Three C’s of Digital Preservation: Contact, Context, Collaboration

Three big themes I will take from learning about digital preservation: every contact leaves a trace, context is crucial, and collaboration is the key.

“Every Contact leaves a trace”

Matt Kirschenbaum and an optical disk cartridge in 2013.

Matt Kirschenbaum's words (or at least his interpretation of Locard's words) will stick with me for a long while. When we look at a digital object for preservation, we need to consider what it is we are looking at, and know that what we see is not necessarily all that there is. Behind the screen there is a hard drive, and on that hard drive are physical traces of that digital object. Digital objects have both a forensic and a formal materiality: what is actually going on in the mechanical and physical sense, versus what we see and interpret as those mechanical processes are converted into digital outputs. We cannot fall into the trap of screen essentialism, of focusing only on the digital object as it is shown on our screens without taking into consideration the hardware, software, code, and everything else that runs underneath it.

Which leads into my next point, about platform studies. I am really intrigued by the idea that as digital media progresses, we are seeing layers upon layers of platforms underlying any given digital object. The Google Doc in which I wrote this blog draft was created in Google Drive (a platform), running in my Chrome browser (a platform), running on Windows 7 (a platform). These platforms can be essential to running a particular digital object, and yet, with platforms constantly obsolescing, upgrading, or changing, they cannot be relied upon to preserve all digital objects, especially since most platforms are proprietary and can disappear in an instant. For example, my Pottermore project was spurred by the fact that the original website (hosted on the Windows Azure platform as well as PlayStation Home) had vanished and been replaced with a newer version. If I had more time, I would have liked to develop the project further by exploring the natures of the different platforms used by Pottermore, like Windows Azure and PlayStation Home, and how those platforms influenced the experience of the game.

Context is Crucial

If content is king, context is queen!

There's no use in saving everything about a digital object if we don't have any context to go with it. Future researchers who have access to the Pottermore website files could examine them thoroughly and still have no idea why Pottermore was so important. For this reason it is important to capture the human experience of digital objects. Whether using oral history techniques or dance performance preservation strategies, there need to be records that try to capture the experience of using the digital work. These can include interviews with the creators, stories from the users, Let's Play videos, and the annotated "musical score" approach, so that a work can be re-run in a different setting.

This is really what the Pottermore project was about: providing context for a website that is all but lost to us. Even if the game never reappears, there will be materials like the Pottermore Wiki and the Let's Play videos that can explain how the game was played. Furthermore, this context can help future researchers appreciate the sense of community among Pottermore users, and why they reacted so negatively when the old website was replaced.

Collaboration is the Key

Pottermore was a collaboration of many different entities, including JKR, Sony, and Microsoft.

There are a number of roles played by different people in digital preservation, and these roles increasingly converge and overlap. The preservationist may be a user who is nostalgic for an old game and so creates an emulator for it. The artist may take feedback from users and incorporate it into their next work. The technical expertise of IT staff may be needed to understand how best to save some works: in what formats, on which storage devices, and so on. Archivists and librarians may be fans themselves, contributing to the fanfiction community they are trying to preserve. With funding only getting tighter and the digital world growing more complex, collaboration is going to become essential for a lot of digital preservation projects.

What next?

We’ll get here eventually… right?

Of course this leaves us with many unanswered questions. How do we balance the roles of different experts? How do we handle the large scale of digital works on a limited budget? How much context do we need to give a certain work? In almost all cases the answer is going to be "it depends." But these are questions that I am excited to figure out as I go on in the field.