Digital Preservation Policy: Web Archiving for the Washingtoniana Collection

Introduction:

In my previous posts on this blog I surveyed the state of digital preservation in the District of Columbia Public Library's (DCPL) Washingtoniana collection. The survey was performed through an interview with Digital Curation Librarian Lauren Algee, using the NDSA Levels of Digital Preservation as a reference point.

The survey found that the Washingtoniana collection has a very effective digital preservation program: through a combination of knowledgeable practices and Preservica (an OAIS-compliant digital preservation service), it nearly reaches the fourth level in every category of the NDSA Levels of Digital Preservation. With this in mind, my next-step plan for the archive looks at several areas the collection has been interested in expanding into and offers some thoughts on where it could begin taking steps toward preserving those materials.

Of particular interest in this regard is the collecting of websites. Because they are dynamic objects in a relatively new medium, collecting them can be fairly complex: it is hard to pin down precisely when a website has been sufficiently collected. Websites may appear differently in different browsers, they may contain many links to other websites, they change rapidly, and they often contain multimedia elements. The policy outlined below discusses these issues and offers a digital preservation plan specifically for websites.

Website Digital Preservation Policy for the Washingtoniana collection

The Washingtoniana collection was founded in 1905 when library director Dr. George F. Bowerman began collecting materials on the local community. The collection stands as one of the foremost archives of the Washington, D.C. area and its community, history, and culture. As DC social life and culture increasingly move to online or born-digital platforms, it makes sense for the Washingtoniana collection to consider collecting websites.

Selection

The same criteria used for selecting other Washingtoniana materials should apply here. Websites should be considered if they pertain to Washington, D.C. or its surrounding areas; to events that take place in or discuss that area; to prominent D.C.-related persons or institutions; or otherwise to the community, arts, culture, or history of Washington, D.C.

As with any physical preservation decision, triage is an essential process. Websites likely to be at risk should be a high priority, and in a sense all web content is at risk. Sites created for a specific purpose or a specific event may have a limited operational window; the websites of defunct businesses, political campaign sites, and even an existing website as it appears on a particular day may be vulnerable and thus candidates for capture. In addition, the materials in question should not already be collected elsewhere and should be considered in relation to the rest of the collection.

Although automated tools may be used for identification, discretion over selection remains in librarians' hands. In addition, patron suggestions relevant to the collection should be considered, and a system for encouraging and managing such suggestions may be put in place.

Metadata

A metadata standard such as MODS (Metadata Object Description Schema) should be used to describe each website. MODS is a flexible schema expressed in XML, is fairly compatible with library records, and allows more complex description than Dublin Core, so it may work well here. Metadata should include, but not be limited to, website name, content producers, URL, access dates, and fixity information, as well as technical information that may be generated automatically by web crawlers, such as timestamps, URI, MIME type, and size in bytes. Extraction, file format, and migration information should also be maintained.
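
As an illustration only, a minimal MODS record for a single website capture could be generated with a short script. The Python approach below, the placeholder values, and the choice to carry fixity in a typed note are assumptions for the sketch, not the collection's actual practice:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def mods_record(title, url, capture_date, checksum):
    """Build a bare-bones MODS description for a single website capture."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title

    # The URL and the date it was accessed/captured go in location/url.
    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    url_el = ET.SubElement(location, f"{{{MODS_NS}}}url",
                           dateLastAccessed=capture_date)
    url_el.text = url

    # Fixity is carried here as a typed note; a fuller record might pair
    # MODS with PREMIS for technical and preservation metadata.
    ET.SubElement(mods, f"{{{MODS_NS}}}note", type="fixity").text = checksum

    return ET.tostring(mods, encoding="unicode")

# Placeholder values for illustration only.
print(mods_record("Example D.C. community site", "http://example.org/",
                  "2016-11-16", "sha256:..."))
```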

Collection

A variety of collection tools exist for web archiving. The tool selected should be capable of the tasks below, as outlined on the Library of Congress web archiving page (a brief capture sketch follows the list):

  • Retrieve all code, images, documents, media, and other files essential to reproducing the website as completely as possible.
  • Capture and preserve technical metadata from both web servers (e.g., HTTP headers) and the crawler (e.g., context of capture, date and time stamp, and crawl conditions). Date/time information is especially important for distinguishing among successive captures of the same resources.
  • Store the content in exactly the same form as it was delivered. HTML and other code are always left intact; dynamic modifications are made on-the-fly during web archive replay.
  • Maintain platform and file system independence. Technical metadata is not recorded via file system-specific mechanisms.
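
As a rough illustration of the "store content exactly as it was delivered" requirement above, here is a minimal sketch using the open-source warcio and requests Python libraries (my assumption for illustration, not necessarily the collection's chosen tooling). It records an HTTP exchange, headers and all, directly into a WARC file:

```python
from warcio.capture_http import capture_http
import requests  # imported after capture_http so its traffic can be recorded

# The file name and URL below are placeholders for illustration only.
with capture_http("dc-site-capture.warc.gz"):
    requests.get("http://example.org/")
```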

Several tools are capable of these tasks; either a web crawler such as Heritrix, the open-source archival crawler, or a subscription solution such as Archive-It should be used. Both come from the Internet Archive: the first is an open-source application, while the second is a subscription-based service that includes storage on Internet Archive servers.

Upon initial collection, fixity information should be generated using checksums. This can be automated either with a staff-written script or with a tool like BagIt, which generates fixity information automatically. This information should be maintained with the rest of the metadata for the digital object.
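
A staff-written script along these lines could generate that fixity information; the SHA-256 algorithm and the JSON manifest layout here are assumptions for illustration, and a tool like BagIt would produce an equivalent manifest automatically:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large WARCs never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_fixity(capture_dir, manifest_path):
    """Write a {filename: checksum} manifest for every WARC in a capture directory."""
    manifest = {p.name: sha256_of(p) for p in Path(capture_dir).glob("*.warc.gz")}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```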

Websites should be kept in the most stable web archival format available. As of this post's writing, that format is the WARC (Web ARChive) file format, which combines multiple digital resources into a single file. This is useful because many websites are complex and contain many items. Other file formats may be accepted if archived webpages are received from donors.
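
To illustrate how a single WARC file bundles many resources, here is a minimal sketch, again assuming the warcio Python library and a placeholder file name, that lists every response record in a capture by its original URL and MIME type:

```python
from warcio.archiveiterator import ArchiveIterator

# "dc-site-capture.warc.gz" is a placeholder file name.
with open("dc-site-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            # Each captured resource (HTML page, image, stylesheet, etc.)
            # is its own record inside the single WARC file.
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.http_headers.get_header("Content-Type"))
```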

Preservation

Upon initial ingest, items may be kept on internal drives and copied to at least one other location. Before an item is moved into any further storage system, the file should be scanned for viruses, malware, or other damaging content, following safety standards agreed upon with the division of IT services. At this point fixity information should be generated as described above and entered into the metadata record.

Metadata should be created as soon as possible, at which point the object, with its attached metadata, should be uploaded into the Washingtoniana's instance of Preservica.

Although Preservica automates much of the preservation process, a copy of the web archive should be kept on external hard drives. Once a year, a selection of the items on the hard drive should be checked against the items in Preservica to ensure that Preservica's fixity checks and obsolescence monitoring are working as desired.
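
A sketch of that yearly spot check, again assuming SHA-256 checksums and the JSON manifest layout from the earlier fixity example, might look like this:

```python
import hashlib
import json
import random
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Recompute the SHA-256 checksum of a file stored on the external drive."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(drive_dir, manifest_path, sample_size=25):
    """Return any sampled files whose current checksum no longer matches the record."""
    manifest = json.loads(Path(manifest_path).read_text())
    sample = random.sample(sorted(manifest), min(sample_size, len(manifest)))
    return {name: manifest[name] for name in sample
            if sha256_of(Path(drive_dir) / name) != manifest[name]}
```

Any mismatches returned by such a check would prompt comparison against the copy held in Preservica.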

References

Jack, P. (2014, February 27). Heritrix-Introduction. Retrieved November 14, 2016, from https://webarchive.jira.com/wiki/display/Heritrix/Heritrix#Heritrix-Introduction
Web Archiving-Collection development. (n.d.). Retrieved November 16, 2016, from https://library.stanford.edu/projects/web-archiving/collection-development
The Washingtoniana Collection. (n.d.). Retrieved November 16, 2016, from http://www.dclibrary.org/node/35928
Web Archiving at the Library of Congress. (n.d.). Retrieved November 16, 2016, from https://www.loc.gov/webarchiving/technical.html
Niu, J. (2012). An Overview of Web Archiving. Retrieved November 16, 2016, from http://www.dlib.org/dlib/march12/niu/03niu1.html
AVPreserve » Tools. (n.d.). Retrieved November 17, 2016, from https://www.avpreserve.com/avpsresources/tools/
Kunze, J., Boyko, A., Vargas, A., Littman, B., & Madden, L. (2012, April 2). Draft-kunze-bagit-07 – The BagIt File Packaging Format (V0.97). Retrieved November 17, 2016, from http://www.digitalpreservation.gov/documents/bagitspec.pdf
MODS: Uses and Features. (2016, February 1). Retrieved November 17, 2016, from http://loc.gov/standards/mods/mods-overview.html
About Us. (2014). Retrieved November 17, 2016, from https://archive-it.org/blog/learn-more/


A Glass Case of Emotion: User Motivation in Crowdsourcing

The web is inherently made up of networks and interactions among its users. But what is the nature of these interactions – participatory? collaborative? exploitative? These questions play out when cultural heritage institutions take to the web and attempt to engage the vast public audience that is now accessible to them. Crowdsourcing is a means of allowing everyday citizens to participate and become more involved with historic materials than ever before. At the same time, these volunteer projects can overcome institutional monetary and time constraints to create products not otherwise possible. What most interested me in the readings are the motivations of those involved in these projects. Why do citizens choose to participate? Why are institutions putting these projects out there? How do they play on the motivations of their users? These questions link back to the overarching ideas about the nature of interactions on the web.

Why Wasn’t I Consulted?

Paul Ford describes the fundamental nature of the web with the phrase "Why wasn't I consulted," or WWIC for short. Ford claims that the web runs on feedback and on giving users a voice about content. When people are given a voice, even through the basest forms of expression such as likes, favorites, +1's, or "the digital equivalent of a grunt," they are satisfied that they were consulted and that they can register their approval or disapproval.

User experience, in Ford's mind, centers on users' emotional need to be consulted. These expressions of approval are also what drive other users to create content, since they receive a positive emotional response from those who consume their work. Organizations create spaces that shrink the vast web down into communities where the WWIC problem can be solved. Essentially, these structures create a glass case of emotion.

Ron Burgundy in a Phone Booth

Libraries, archives, and museums have to deal with users' emotions when creating their crowdsourcing ventures. How do we create places where users will feel consulted and want to participate? Like Ford, Causer & Wallace, describing the Transcribe Bentham project at University College London, and Frankle, writing on the Children of the Lodz Ghetto project at the United States Holocaust Memorial Museum, emphasize that understanding users and volunteers, as well as finding the appropriate medium, is important in these undertakings.

Causer & Wallace identify a much more detailed set of motivations among their user groups than Ford's WWIC idea suggests. Many of their participants claimed an interest in the project's subject matter, such as history, philosophy, Bentham, or crowdsourcing in general. Beyond these categories, the next most common reason for joining was a desire to be part of something collaborative. The creators of Transcribe Bentham failed to create an atmosphere where users felt comfortable collaborating, which may be why the project declined in popularity over time. The Children of the Lodz Ghetto project, on the other hand, is much more collaborative, with administrators guiding researchers through each step of the process; eventually they hope to have advanced users take over the role of teaching newcomers. The Holocaust Museum's project is the more sustainable model and could lead to lasting success.

Crowdsourcing (For Members Only)

While collaboration and an interesting topic are key factors in motivating participation, how do online history sites get the public's attention and persuade people to join in the first place? The push for openness in both the internet and cultural institutions is something I greatly support, but I think motivating the populace to get involved in these projects requires a return to exclusivity. There is still a prevailing notion that archives and other cultural organizations are closed spaces that only certain people can access; in many European institutions this is still the case. Why don't we use the popular notion of exclusivity to our own benefit?

Hear me out. What these articles lacked was the idea that many people desire what they cannot get, or what only a few can. I'm not advocating putting collections behind a paywall or keeping them from being freely available online. Instead, I think participation in crowdsourcing projects should be competitive or exclusive in order to generate the initial excitement needed to build a following and spur desire for inclusion.

Social media platforms such as early Facebook and, more recently, Ello, or new devices such as Google Glass, have limited membership or ownership, creating enormous desire for each. In these examples, the majority of the populace is asking "why wasn't I consulted?" and therefore wants to be included. Thus, limiting the initial rounds of participation to a first-come, first-served, invite-only platform would spark desire for the prestige of being among the few with access to the project.

In his article, Edson wrote about the vast stretches of the internet that cultural institutions do not engage, what he called "dark matter." While there are huge numbers of people out there who are "starving for authenticity, ideas, and meaning," I think the first step should be creating a desire to participate and then growing the project. Without something to catch the public's attention, create a community, and build an emotional desire to participate, another crowdsourcing website would simply be white noise to the large number of internet users in the world. Users who visit the websites looking for a way into the projects but are turned away could still discover the free and open collections that are there right now. After this first limited period, once the attention is there, I think scaling up would be easier. Of course, these ideas will only work if the institution has created a place that understands the emotional needs of its users and provides a collaborative and social environment where users are comfortable participating.


Bridget Sullivan Print Project Proposal

In recent years, museums and archives have made a concerted effort to take advantage of digital media in connecting with public audiences. These institutions have undertaken a multitude of projects to make their collections available to a greater audience through digital access. For my print project, I would like to take a closer look at some of these approaches to presenting historic material culture to a public audience and at how digitization efforts have affected the way the public engages with historical narratives through material culture.


Specifically, I would like to focus on the digital offerings of the National Archives and the Library of Congress. Historically, these are two of the most widely used research facilities for American history. As such, they have fallen into the same category as most archives, which tend to discourage visits from anyone other than serious historical researchers. There is little opportunity to explore the holdings of such institutions, and they can be intimidating for newer researchers.


However, digitization has broken down the barrier between the public and these repositories of American public knowledge. Both have taken great strides to make portions of their collections available to all types of researchers through the Internet. Further, these efforts have been targeted at different audiences. The National Archives and the Library of Congress have both made documents and finding aids available through general search features of their websites. However, they have also gone beyond the basics of digitization. Each has created online offerings that are more suited to general exploration of their collections, as opposed to research with a specific focus and mission.


The National Archives offers the Digital Vaults, a way to wander digitally through its collections. Documents are linked by categorical tagging, and explorers can create their own collections of documents and artifacts that interest them. Similarly, the Library of Congress has created MyLOC, where explorers can register for an account and create collections of interest to them. These collections can incorporate all aspects of the website, including general information about visiting the Library of Congress as well as online exhibits.


I will compare and contrast these two sites, focusing on the audiences they target and the various pathways these audiences have to interact with the collections of these institutions. Additionally, I will address how the ability to interact with collections online has affected the demographics of those who take an interest in these collections.

On the Potential Benefits of “Many Eyes”

In 2007 IBM launched the site Many Eyes, which allows users to upload data sets, try out various ways of visualizing them, and, most importantly, discuss those visualizations with anyone who sets up a (free) account on Many Eyes. As Professor Ben Shneiderman, paraphrased in the New York Times review of Many Eyes, says, "sites like Many Eyes are helping to democratize the tools of visualization." Instead of leaving visualizations to highly trained academics, anyone can make them and discuss them on Many Eyes, which is a pretty neat idea.

Many Eyes allows users to upload data sets and then create visualizations of them. It offers 17 different ways to visualize data, ranging from Wordle-style word clouds to maps, pie charts, bubble graphs, and network diagrams, just to name a few. Other sites or programs, Microsoft Excel for example, allow users to create some of these kinds of charts, but Many Eyes offers the advantage of having many types of visualization in one place.

Additionally,  people in disparate locations can talk about the data sets and visualizations through comments.  The comment feature even allows for the “highlighting” of the specific portion of a visualization you might be referencing. The coolest feature of Many Eyes is that anyone can access and play with data uploaded by anyone else, in the hopes that “new eyes” will lead to surprising and unexpected conclusions regarding that data.

If you create an account on Many Eyes, you can access its list of "Topic Centers," where people who are interested in data sets and visualizations relating to specific topics can interact and comment with one another, as well as link related data sets and visualizations. However, a quick perusal of the topic centers shows that the vast majority of topics are followed by only one user. The few topics that have more than one user seem to belong to pre-established groups with specific projects in mind.

Unfortunately, it appears that a crowdsourcing mentality, where people who don't know each other collaborate to understand and interpret data, hasn't really materialized. In this IBM research article, the authors even hint that Many Eyes "is not so much an online community as a 'community component' which users insert into pre-existing online social systems." Part of the difficulty in realizing the democratizing aspect of Many Eyes might be a simple design problem: the data sets, visualizations, and topic centers are displayed by what was most recently created, rather than by what is most frequently tagged or talked about. This clutters the results with posts in other languages or tests that aren't interesting to a broader audience. The Many Eyes developers might adopt a more curatorial method, linking to their top picks for the day on the front page in order to spur interest in certain universal topics. But maybe the problem is more profound; what do you think?

Ultimately, I'm not sure how relevant Many Eyes is to historians. Based on the site's usage history, it seems unlikely that a democratized collection of strangers will collaborate on visualizing your data. However, groups of researchers who already have a data set to visualize and discuss might be able to make use of the site for cliometrics-style research. Classrooms and course projects in particular can benefit from it, since it's relatively easy for people with a low skill level to use. What do you think? What other applications do you see Many Eyes having? How relevant will it be for your work in the digital humanities?

Flickr

Flickr is a free photo-sharing site. It allows you to create a profile and upload photos in a format that makes them easy to share with friends, family, and the general public. Flickr makes it easy to get started: in addition to step-by-step instructions when creating a profile, it provides a tour of the site that explains all of its features. Aside from uploading photos, you can comment on other users' uploads or mark images that especially interest you as favorites, allowing you to return to them easily. Flickr also lets you tag people in photos to alert other users who may like that image. One feature I found interesting was the guest list, which lets people who do not have a Flickr account view images that you choose. On that note, Flickr also has privacy settings that limit who can see photos on an individual basis.

Two features that I thought were especially useful were the map and linking. Flickr allows you to upload collections of photos from your account to a separate website. This feature is helpful for institutional accounts because they can connect the photos on Flickr to their main webpage. It also could be used by bloggers to share Flickr collections through that medium. The map feature allows you to attach photos to a specific location. Again, this type of technology could be utilized by historical institutions to teach about events or themes through photos.

The search feature is a great way to explore the Flickr world. A search brings up photographs tagged with that term, as well as associated groups, individual photographers, and places. Flickr also allows you to comment on photos; one interesting piece of this feature is that you can comment directly on the photo itself.

The Flickr Commons is the most obviously historical aspect of the site. The Commons gives users the opportunity to help describe photo collections from institutions across the globe, such as NASA, the National Archives, the New York Public Library, and the Smithsonian. Users can add tags and comments to any of the photos available in the Commons.

Flickr also allows you to organize photos into sets and collections, as well as create groups to aggregate photos with a common theme. Some examples of historically minded groups and collections are:

http://www.flickr.com/photos/nersess/sets/72157603339444029/with/2066890192/

http://www.flickr.com/photos/usnationalarchives/4166259453/