OVERVIEW
The Archeology Program Office of the Prince George’s County Department of Parks and Recreation was established in 1988 to excavate, preserve and protect archeological sites in county parks. It is part of the Maryland-National Capital Park and Planning Commission. As part of its mission, the program curates millions of artifacts, and over the years, related documentation has been created in various formats and on disparate media. Documentation is primarily in the form of digital and/or physical copies of reports, catalogs, slides, print photographs, negatives, drawings, maps, and videos. All artifacts and documentation are stored in the same facility. The program’s goal is to have all digital content centralized on one shared network, which is currently in development.
This report outlines recommended next steps to preserve the program’s digital content following a framework designed by the National Digital Stewardship Alliance (NDSA). NDSA’s Levels of Digital Preservation presents a series of recommendations based on five areas of concern for digital preservation: storage and geographic location, file fixity and data integrity, information security, metadata, and file format. Level 1 recommendations relate to the most urgent activities needed for preservation and serve as a prerequisite to the higher levels.
STORAGE AND GEOGRAPHIC LOCATION
Establish a central storage system for all digital content and maintain at least one complete copy stored in another location.
Digital content is currently stored locally on five desktop computers, two laptops, a back-up drive of files from former staff, approximately 300 compact discs and about 700 3.5” floppy disks. Staff have been working on a shared drive to organize content in one place. While staff are backing up their own drives locally, there is no complete copy of all digital content.
Following NDSA’s Level 1 recommendation, staff should move all files off disparate media and into a storage system and create a complete copy of files that should be stored in a different location. This serves a number of purposes:
– Transferring to a new storage system will guard against potential loss of data on old media.
– Minimizing the number of storage locations will facilitate data integrity monitoring.
– Creating a complete copy will provide a back-up in case files are corrupted or lost.
– Storing a copy in a different location will guard against loss specific to one location such as damage to equipment as a result of severe weather or a catastrophic event.
– This process is also a necessary first step in achieving a goal for the program, which is to have all files organized and accessible without extensive searching.
1. Create a complete copy of what is accessible right now.
The shared network being developed in coordination with the program’s IT department will likely serve as the established storage system. A portion of files are currently inaccessible on floppy disks. However, the bulk of digital content is stored locally on various hard drives. Any content that is currently accessible on hard drives should be copied and stored in another location to safeguard against loss. See “File Fixity and Data Integrity” below for checking data integrity and creating a file manifest. In the short term, storage can be on an external hard drive that is kept in another location in a secure place. Alternatively, staff could use a cloud storage service such as Dropbox, which would have the added benefit of storing files offsite. Once files are integrated into the shared drive, a copy can be created from this set of content. Staff can work with the IT department to see if they can use a back-up storage system that is already in place with other departments.
2. Establish a file structure in the shared directory
Staff can begin integrating current files that they work with into the networked drive in order to develop a workable structure that will be practical for finding files. Historical documents can then be organized into that structure. Once a structure and naming system has been established, staff should document and follow this structure. This can be done concurrently with the following step.
3. Copy files from compact disks and 3.5” floppy disks using an unnetworked workstation
It will likely take a while to sift through decades’ worth of files. Given the urgency of potential loss from old media and a potential risk of viruses from disks used by former staff, first copy all the files from compact and floppy disks onto an unnetworked drive and run a virus scan before combining them with current files.
a. Drive space needed: Assuming approximately 700 MB of data per CD and 1.44 MB per floppy disk, there is approximately 211 GB of data if all the disks are full (about 210 GB on the 300 CDs and about 1 GB on the 700 floppy disks).
b. Install anti-virus software on unnetworked workstation: Speak with the IT department about installing antivirus software so that it can run in the background while copying over files. Establish a schedule to run full virus scans depending on how regularly files are being copied over.
c. Install drivers for 3.5” external floppy drive on unnetworked drive: At the time of the survey, staff had an external drive to read 3.5” floppy disks but lacked the software to run the drive. This step is necessary in order to access any of this information.
d. Install fixity software on unnetworked and networked drives: See the section “File Fixity and Data Integrity” below.
e. Transferring files from CDs and floppy disks to unnetworked hard drive: Currently staff use context such as file extensions, file location, file name, and the author name to find content. Therefore, it’s recommended that they preserve these elements and related descriptive information that may be written on the disks as much as possible until they can sort out all the content. Disks were labeled in a number of ways. The majority of floppy disks were organized in boxes of about 10 disks with labels on the boxes suggesting that there were groups of related disks. Some disks had vague or no labeling. Related files kept together on disks or boxes of disks should be copied into a directory folder with the descriptive information from the label. This information can either be transcribed into a text file or captured in a picture of the label that is included in the directory.
f. Timing: While each floppy disk may hold a relatively small amount of data, the process of copying over hundreds of disks can be labor intensive. During this window, the work is not being backed up. Therefore, consider either setting a short window of time to copy over all of these files, or setting a schedule where small batches of files are copied over from disks and scanned for viruses before being copied to another location. That location can be either the shared drive or another external drive that is stored elsewhere. Once files have been safely transferred, they can be integrated into the file structure on the shared drive.
4. Set a schedule of back-ups to maintain a complete copy.
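The drive-space estimate in step 3a can be reproduced with a short calculation. The sketch below is illustrative only: it uses the disk counts and per-disk capacities stated in this report, assumes every disk is full, and treats 1 GB as 1,000 MB. The function and parameter names are hypothetical.

```python
def estimate_total_gb(cd_count, floppy_count, cd_mb=700, floppy_mb=1.44):
    """Estimate the worst-case space needed, in GB, if every disk is full.

    Assumes 1 GB = 1,000 MB; the capacities default to the figures
    used in step 3a of this report.
    """
    total_mb = cd_count * cd_mb + floppy_count * floppy_mb
    return total_mb / 1000


# Counts taken from the survey: about 300 CDs and about 700 floppy disks.
estimate = estimate_total_gb(cd_count=300, floppy_count=700)
print(f"Worst-case space needed: about {estimate:.0f} GB")
```

Because many disks are unlikely to be full, the actual requirement is probably lower, but sizing the unnetworked drive for the worst case avoids running out of space partway through the transfer.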
METADATA
The NDSA Level 1 recommendation is to create an inventory of digital content and its storage location. Like the digital content itself, the inventory should be backed up and stored in another location. As mentioned in the section “File Fixity and Data Integrity” below, AVP’s Fixity software includes a function to generate a manifest of file paths that will assist with inventory. Staff can maintain this inventory as they continue to develop the file structure on the shared drive.
For file and directory naming, consider creating a controlled vocabulary and syntax to make it easier for staff to find files. This can include specific terms for archeological site names, document type (e.g., site form, report), and a version, year or other modifier (e.g., draft, final) when needed.
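One way to make a naming convention easy to follow is to generate names from the controlled vocabulary rather than typing them by hand. The sketch below is a hypothetical illustration, not a recommended standard: the vocabulary terms, the underscore-separated syntax, and the function name are all assumptions that staff would replace with their own choices.

```python
import re

# Hypothetical controlled vocabulary; staff would define their own terms.
DOCUMENT_TYPES = {"site-form", "report", "catalog", "map", "photo"}
MODIFIERS = {"draft", "final"}


def build_file_name(site, doc_type, year, modifier=None, extension="pdf"):
    """Build a name of the form site_type_year[_modifier].extension."""
    if doc_type not in DOCUMENT_TYPES:
        raise ValueError(f"Unknown document type: {doc_type!r}")
    if modifier is not None and modifier not in MODIFIERS:
        raise ValueError(f"Unknown modifier: {modifier!r}")
    # Lowercase the site name and replace spaces and punctuation with hyphens.
    site_slug = re.sub(r"[^a-z0-9]+", "-", site.lower()).strip("-")
    parts = [site_slug, doc_type, str(year)]
    if modifier is not None:
        parts.append(modifier)
    return "_".join(parts) + "." + extension


# "Example Site" is a placeholder, not a real site name from the program.
print(build_file_name("Example Site", "report", 1998, "final"))
```

Rejecting terms that are not in the vocabulary is a deliberate choice here: it surfaces new document types as a decision for staff to document rather than letting variants accumulate.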
FILE FIXITY AND DATA INTEGRITY
File fixity is a way of ensuring that files have not changed. It is recommended to run fixity checks whenever files are transferred (Owens, p. 110). This will generate an alphanumeric string called a checksum that can be compared before and after the transfer. Changing the content of the file including the format will change the checksum value. If the IT department does not already use fixity software, Fixity is a free tool from AVP that can be used to generate and compare checksum values and ensure that all files have been transferred. The software also generates a manifest of file paths along with the checksums that could prove useful in establishing an inventory of digital content.
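To make the idea of comparing checksums before and after a transfer concrete, here is a minimal sketch using only the Python standard library. It is not how AVP’s Fixity tool is implemented; it simply illustrates the underlying technique. SHA-256 is an assumed choice of checksum algorithm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of one file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(before, after):
    """List files that went missing or whose checksum changed."""
    missing = sorted(name for name in before if name not in after)
    changed = sorted(
        name for name in before
        if name in after and before[name] != after[name]
    )
    return {"missing": missing, "changed": changed}
```

In use, staff would build one manifest for the source directory, copy the files, build a second manifest for the destination, and pass both to `compare_manifests`. Two empty lists indicate that every file arrived unchanged.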
Level 2 recommends virus checking high-risk content, while Level 3 recommends virus checking all content. Virus checking high-risk content is addressed in step 3b of “Storage and Geographic Location.” Staff should have antivirus software installed at their workstations and run scheduled scans.
Level 3 also recommends checking fixity at fixed intervals to ensure data integrity over time. Consider establishing a yearly schedule of validating fixity. Any corrupt or missing files can be replaced with a copy that passes fixity validation.
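A scheduled fixity check amounts to recording checksums once and re-validating the files against that record later. The following is a rough sketch under stated assumptions: the manifest is stored as a JSON file kept outside the directory being checked, SHA-256 is the checksum, and all function names are hypothetical. Dedicated fixity software would normally handle this, including the scheduling.

```python
import hashlib
import json
from pathlib import Path


def checksum(path):
    """Return the SHA-256 checksum of a file as a hex string.

    Reads the whole file into memory, which is fine for a sketch but
    not ideal for very large files.
    """
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def save_manifest(root, manifest_path):
    """Record a checksum for every file under root in a JSON manifest."""
    root = Path(root)
    manifest = {
        p.relative_to(root).as_posix(): checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_against_manifest(root, manifest_path):
    """Return files that are missing or no longer match the manifest."""
    root = Path(root)
    expected = json.loads(Path(manifest_path).read_text())
    problems = {"missing": [], "corrupt": []}
    for name, recorded in expected.items():
        target = root / name
        if not target.is_file():
            problems["missing"].append(name)
        elif checksum(target) != recorded:
            problems["corrupt"].append(name)
    return problems
```

A check of this kind cannot tell an intentional edit from corruption, which is one reason to run it on historical files that are not expected to change.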
INFORMATION SECURITY
This step will outline who has access to the content and what they can do with it. This will prevent files from getting deleted or changed by unauthorized staff. NDSA Level 1 recommends identifying who is authorized to read, write, move or delete individual files. Related to this, Level 4 in the section on file fixity also recommends that no one person have write access to all copies. This reduces the likelihood of changing or deleting all copies of one or more files.
Staff have taken initial steps in the process by creating three different directories with different levels of access for their users: one directory for onsite archeology staff, one directory for the rest of the Prince George’s County Parks Department, and one directory for Dinosaur Park, another program of the Prince George’s County arm of the Maryland-National Capital Park and Planning Commission that shares the same workspace.
However, more levels of access may be necessary if only one person in the office is allowed to have read or write permissions. In addition, staff should clearly delineate working files from historical files that should not change. This will help to prevent historical documents from being changed or deleted. It will also help with fixity validation, since working files will likely involve changes in content, which will change the checksum value. This can be accomplished by setting permissions on specific subdirectories or on specific sets of files. Document the access restrictions and store that documentation in a location that all users can access.
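As a small illustration of protecting historical files, the sketch below clears the write permission on every file under a given directory. This is an assumption-laden example: on a networked drive, access is normally controlled by the IT department through the network’s own permission settings, and file-mode changes like this are only a local safeguard (on Windows they merely set the read-only flag). The function name is hypothetical.

```python
import stat
from pathlib import Path


def make_read_only(directory):
    """Clear the write bits on every file under directory.

    Returns the list of files that were processed. Directories
    themselves are left unchanged.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    processed = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            # Keep the existing mode but remove write permission for everyone.
            path.chmod(path.stat().st_mode & ~write_bits)
            processed.append(path)
    return processed
```

Running this only on the subdirectory that holds historical files keeps working files editable while making accidental changes to historical files less likely.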
FILE FORMAT
File formats can become obsolete. In some cases, once the format is obsolete the file might not be able to be opened in another format or will not be rendered in exactly the same way. The purpose of this section is to minimize these problems by using formats that are less likely to become obsolete, or that can be effectively rendered in another format. Widely used formats are generally considered more likely to remain accessible because there will be a demand to either keep them accessible or develop a means of migrating them (Owens, p. 121).
Since some media have not yet been accessed, a current inventory of file formats that have been used is not available. Formats currently in use are jpegs and files generated from different versions of Microsoft Word, Excel and Access. Mapping files are created using GIS (geographic information systems) technologies. Older files were created using WordPerfect, CAD (computer-aided drafting) software, and the Paradox relational database. Staff are currently having trouble opening Paradox files, since the database is no longer supported and its files cannot be opened using current versions of Access or Excel.
Formats such as jpeg and Microsoft Word and Excel are commonly used, although the latter two undergo regular updates which may render slight changes if a file is opened in a new version. As the files are incorporated into the new directory structure on the shared drive, staff should develop an inventory of formats that they are using, work with the IT department to monitor them for obsolescence, and be prepared to migrate as needed.
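A first-pass format inventory can be produced by counting file extensions in a directory. The sketch below is a simple illustration with a hypothetical function name. It assumes the extension reflects the actual format, which is not always true for older files, so the result is a starting point for review rather than a definitive identification.

```python
from collections import Counter
from pathlib import Path


def inventory_formats(root):
    """Count files under root by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Group files with no extension under a readable label.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


def print_inventory(root):
    """Print extensions from most to least common."""
    for extension, count in inventory_formats(root).most_common():
        print(f"{extension}: {count}")
```

Re-running the count as files are added to the shared drive gives staff and the IT department a simple way to see which formats dominate and which older formats may need migration first.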
FUTURE STEPS
NDSA Level 2 for storage recommends creating a third copy of the content and Level 4 recommends at least three copies stored in locations with different disaster threats. The Archeology Program Office could combine this recommendation with a means of sharing some of their content. This could be through a subject-specific repository for archeology or something more general like the Internet Archive.
Levels 2-4 address steps to maintain storage media so that files continue to be accessible in the long term. Staff should work with their IT department to document storage used for their shared drive and back-up copies, monitor for obsolescence, and have a plan in place for updating systems.
Staff also expressed an interest in resuming digitization of their physical documentation. Some reports and slides have already been digitized. As a starting point, staff can discuss their experiences with past efforts and lessons learned to establish goals for the program and how this will fit into the file structure that they are creating for current digital content. The Still Image and Audiovisual Working Groups of the Federal Agencies Digital Guidelines Initiative can be a good resource to establish best practices for digitization.
This is a nicely structured and well-articulated report. Great work! Your initial suggestion is really critical. They need to get any of that legacy content that matters to them off of the CDs and floppy disks and into a managed digital environment that they can then start running batch actions to create copies of and check files in.
This comes through at several points, but I think a big part of making the best use of their resources is going to be figuring out what kinds of support or assurances come through with various IT systems that they get and work in from the County. There is a good chance that the County could either meet many of the needs they have for digital storage or at least clarify what levels of assurance come with the services they provide.
My sense is that given the extent of what they have on hand today, a big part of their work going forward is going to need to be about doing triage and prioritizing what really is the most essential of this content to bring under more managed care.
Again, great observations and suggestions.