Digitization, Digital Preservation, and File Formats

So I successfully scared everyone off from blogging about The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials. If you get a bit lost in this reading, I suggest reading over mwest24’s post on it from last year.

Along with that post, I would suggest the following vocabulary terms that we should all become familiar with. I will weave these terms together into a discussion of how computer files and storage work, which will help set us up for Kirschenbaum’s book.

As this is technical info, I am just going to crib most of it from Wikipedia. Don’t get too lost in the details, but please read these over and click the links out to Wikipedia if you have no idea about some of the terms.

Key Terms for File Characteristics 

Dots per inch (DPI) is a measure of spatial printing or video dot density, in particular the number of individual dots that can be placed in a line within the span of 1 inch (2.54 cm). The DPI value tends to correlate with image resolution, but is related only indirectly.
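To make the link between DPI and image size concrete, here is a rough sketch in Python. It assumes one pixel per dot, which is how scanner settings usually use the term, and the function name is just something I made up for illustration.

```python
def pixels_for_scan(width_in, height_in, dpi):
    """Return (width_px, height_px) for scanning a page of the given size in inches."""
    # Assumption: the scanner records one pixel per "dot".
    return round(width_in * dpi), round(height_in * dpi)


# A US letter page (8.5 x 11 inches) scanned at two common settings.
letter_300 = pixels_for_scan(8.5, 11, 300)
letter_600 = pixels_for_scan(8.5, 11, 600)

print(letter_300)  # (2550, 3300)
print(letter_600)  # (5100, 6600)
```

Doubling the DPI doubles each dimension, so the scan holds four times as many pixels for the same sheet of paper.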

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a bit pattern. In our work, the most useful things to know about are ASCII and Unicode.
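If you want to see the pairing of characters with bit patterns for yourself, this small Python sketch shows it. UTF-8 is one common way of storing Unicode text as bytes; the sample word is just my own example.

```python
# "caf\u00e9" is the word "café"; the last letter is outside the ASCII repertoire.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))  # 4 5  (four characters, five bytes)
print(utf8_bytes)                  # b'caf\xc3\xa9'

# ASCII only covers 128 characters, so this word cannot be stored as ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Plain English letters get the same numbers in ASCII and in Unicode.
print([ord(c) for c in "Hi"])  # [72, 105]
```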

Data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by identifying marginally important information and removing it.
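Here is a quick Python sketch of the difference. The lossless half uses the standard zlib module. The lossy half is only a toy stand-in that I made up (throwing away the low bits of each value); it is not how JPEG or MP3 actually work, but it shows why lossy compression cannot be undone.

```python
import zlib

# Lossless: repetitive data has lots of statistical redundancy to squeeze out.
original = b"preservation " * 200
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # the compressed copy is much smaller
print(restored == original)            # True: every bit comes back

# Lossy (toy example): keep only the top four bits of each sample value.
samples = bytes([200, 201, 203, 198, 57, 58])
quantized = bytes(s & 0xF0 for s in samples)

print(list(quantized))  # [192, 192, 192, 192, 48, 48]
# The small differences between 200, 201, 203, and 198 are gone for good.
```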

Embedded Metadata: Metadata that is embedded inside a given file instead of existing outside the file, for example in a separate database. Examples include ID3 tags for MP3 audio files and Exif for image files.
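To show what "inside the file" means, here is a minimal Python sketch that reads an ID3v1 tag, the older and simpler flavor of ID3 that sits in the last 128 bytes of an MP3. Most current files use ID3v2 instead, which is stored at the start of the file and is more complicated. The function name and the fake file contents are mine, purely for illustration.

```python
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre (128 bytes total).
ID3V1_LAYOUT = "<3s30s30s30s4s30sB"


def read_id3v1(data):
    """Return a dict of ID3v1 fields, or None if the data has no ID3v1 tag."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    _, title, artist, album, year, comment, genre = struct.unpack(
        ID3V1_LAYOUT, data[-128:]
    )

    def clean(raw):
        # Fields are padded with zero bytes (or spaces) out to their fixed width.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": clean(title),
        "artist": clean(artist),
        "album": clean(album),
        "year": clean(year),
        "comment": clean(comment),
        "genre": genre,
    }


# Build a fake "MP3" in memory: stand-in audio bytes followed by a tag.
fake_audio = b"\xff\xfb" + b"\x00" * 100
fake_tag = struct.pack(
    ID3V1_LAYOUT, b"TAG", b"Oral history interview", b"Example Narrator",
    b"Example Collection", b"1975", b"", 255,
)
info = read_id3v1(fake_audio + fake_tag)
print(info["title"], info["year"])  # Oral history interview 1975
```

The point is that the descriptive information travels with the audio itself: copy the file and the metadata comes along.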

Storage Terms

A binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text.

A bit (a contraction of binary digit) is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states. These may be the two stable states of a flip-flop, two positions of an electrical switch, two distinct voltage or current levels allowed by a circuit, two distinct levels of light intensity, two directions of magnetization or polarization, the orientation of reversible double-stranded DNA, etc.

The byte (/ˈbaɪt/) is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer, and for this reason it is the basic addressable element in many computer architectures.
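Bits and bytes are easier to grasp once you have seen one. This short Python sketch takes a single character, looks up its number, and writes that number out as the eight bits of one byte.

```python
letter = "A"
code = ord(letter)          # the number paired with "A" in ASCII and Unicode
bits = format(code, "08b")  # that number written as eight binary digits

print(code, bits)    # 65 01000001
print(int(bits, 2))  # 65: reading the bits back gives the same number

# Eight bits, each in one of two states, give 2**8 possible byte values.
print(2 ** 8)  # 256
```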

A disk image is a single file or storage device containing the complete contents and structure representing a data storage medium or device, such as a hard drive, tape drive, floppy disk, optical disc, or USB flash drive. A disk image is usually made by creating a complete sector-by-sector copy of the source medium, thereby perfectly replicating the structure and contents of a storage device.
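The sector-by-sector idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: I use an in-memory stand-in for the source disk, a 512-byte sector size, and a function name of my own. Real imaging tools also have to deal with read errors and write blocking, which this skips entirely.

```python
import hashlib
import io

SECTOR_SIZE = 512  # assumption: a classic sector size for floppies and older hard drives


def image_device(source, image, sector_size=SECTOR_SIZE):
    """Copy source into image one sector at a time; return a SHA-256 of the copied bytes."""
    digest = hashlib.sha256()
    while True:
        sector = source.read(sector_size)
        if not sector:
            break  # reached the end of the source medium
        image.write(sector)
        digest.update(sector)
    return digest.hexdigest()


# A stand-in "floppy": 2048 bytes, i.e. four sectors.
fake_floppy = io.BytesIO(bytes(range(256)) * 8)
image = io.BytesIO()
checksum = image_device(fake_floppy, image)

print(image.getvalue() == fake_floppy.getvalue())                # True
print(checksum == hashlib.sha256(image.getvalue()).hexdigest())  # True
```

Because the copy ignores files and folders and just takes every sector in order, things like deleted-but-not-overwritten data come along too. Recording a checksum lets you confirm later that the image has not changed.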

File Types:

Documents: Of particular importance for us are .txt .doc .pdf .xml and .html

Images: Of particular importance for us are .jpg, .tiff, and JPEG 2000 (.jp2)

Audio: Of particular importance for us are .mp3 .wav

Digital Video Encoding: This one is tricky; we will talk about .mov .mpg .swf .mp4 and .avi
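One thing worth knowing about these extensions: the extension is only a label, and a file's real format shows up in its first few bytes (often called a signature or magic number). Here is a small Python sketch that checks a handful of well-known signatures. It covers only some of the formats above, and the function name is my own.

```python
def sniff_format(data):
    """Guess a file format from its opening bytes; returns 'unknown' if nothing matches."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    # WAV and AVI are both RIFF containers; bytes 8-11 say which kind.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"RIFF" and data[8:12] == b"AVI ":
        return "avi"
    return "unknown"


print(sniff_format(b"%PDF-1.4\n"))                     # pdf
print(sniff_format(b"RIFF\x24\x00\x00\x00WAVEfmt "))   # wav
print(sniff_format(b"Just some plain text"))           # unknown
```

Plain text files like .txt have no signature at all, which is part of why character encoding matters so much for them.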


4 Replies to “Digitization, Digital Preservation, and File Formats”

  1. This might also be a good place to talk about optical character recognition (OCR).

    For most scanned documents, what you’re producing is an image of text, which is not exactly useful for building a text-searchable database or doing any sort of data-driven text analysis. The problem originates from the fact that a desktop scanner doesn’t really know what text is, so viewing the file on a computer is analogous to looking at any photo you’ve taken.

    That’s where OCR steps in.

    OCR scans for recognizable characters in an image of text and (depending on the method) translates or overlays it into selectable, searchable text — like a Word document. The method I’m most familiar with for doing OCR is using Adobe Acrobat. Acrobat 8+ includes a simple button for running OCR with reasonable accuracy depending on the quality of the source image. Acrobat overlays selectable text over the original image and rotates each page so the text is in a straight line. From there, you can control-F your way to success with a fully-searchable text document.

    Here’s the rundown for Acrobat 9 Pro, but the process is nearly identical across versions.
