TIME Magazine Corpus Practical Practicum

Mark Davies’ TIME Magazine Corpus of American English is a search tool for the online archives of TIME Magazine from the 1920s through the 2000s.  The tool is free and can be found here.  Once you have played around on the site for a while, it will ask you to create a free username so that BYU can keep track of how the site is being used.

On the front page of the website, Davies claims, “You can see how words, phrases and grammatical constructions have increased or decreased in frequency and see how words have changed meaning over time”.  The website certainly lives up to this mission statement; however, the site can be a little complicated to navigate.  The examples on the first page are good for beginners to play around with.  One of the examples given is –gate and how its use changed in the 1990s (e.g., Monicagate).  Click on –gate and the top box will show words that use –gate.  Scroll down to Monicagate (number 5 on the right); this will bring up the year and the magazine articles, which you can click for further context.

Another useful feature is the option to compare multiple features in one search.  For example, you can compare two words like ‘husband’ and ‘wife,’ then limit the search further by adding the collocate ‘divorce,’ and restrict it even more by choosing a time range in which to search.  Once you pick an actual article, the TIME Magazine Corpus directs you to the TIME Magazine website, where you can email the document to yourself, print it, or share it via blog, Twitter, Facebook, etc.
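
To make the comparison concrete, here is a toy sketch of what a collocate-restricted frequency count is doing under the hood.  This is purely my own illustration, not how Davies’ tool is actually implemented, and the miniature “articles” below are invented stand-ins for real TIME text:

```python
# Toy illustration (my own sketch, not the BYU tool's implementation):
# compare how often 'husband' and 'wife' appear near the collocate
# 'divorce', decade by decade, in a tiny stand-in corpus.
from collections import Counter

documents = [  # (year, text) pairs standing in for TIME articles
    (1931, "the wife sought a divorce from her husband last week"),
    (1958, "a divorce left the husband with the house"),
    (1963, "the wife filed for divorce citing neglect"),
]

def near_collocate(tokens, target, collocate, window=4):
    """Count occurrences of `target` within `window` words of `collocate`."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok == target and collocate in tokens[max(0, i - window):i + window + 1]:
            hits += 1
    return hits

counts = Counter()
for year, text in documents:
    decade = year // 10 * 10  # e.g., 1931 -> 1930
    tokens = text.split()
    for word in ("husband", "wife"):
        counts[(decade, word)] += near_collocate(tokens, word, "divorce")

for (decade, word), n in sorted(counts.items()):
    print(f"{decade}s  {word!r} near 'divorce': {n}")
```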

You have to be familiar with the specific ways to search the site in order to really be able to use it.  There are plenty of ways to find help on the site; take a look at the information that pops up when you click the question marks next to the search boxes.


Even with this help, the site takes some getting used to and can be rather time-consuming to use.  It is certainly easier than trying to go through the texts yourself to see how words have changed over time.

As far as complexity goes, the TIME Magazine Corpus is similar to Voyeur.  It is also reminiscent of the Library of Congress’ Chronicling America website, though I find Chronicling America much easier to use.  The examples page is great, but a short instructional video to go along with it, or at least a tutorial, would be helpful.

Though the site is limited to TIME Magazine, the amount of material is huge, ‘100 million words,’ and still growing as TIME keeps publishing.  A researcher could use this site to study almost anything.  I conducted random searches in gender studies, film media, parts of speech, phrases, etc., and the search rarely concluded with fewer than three examples to pick from.  In fact, the amount of information that normally pops up can be overwhelming.

Please play around on the site and let me know whether you think it is useful.  Do you find it a bit difficult to navigate?

The Google Custom Search Engine: Refining Searching in a Few Steps

Searching the internet for a topic can be a frustrating experience when the search engine turns up results unrelated to what you are looking for. This is the same problem the Bing commercials set out to address with their “search overload” ads.

The Google Custom Search Engine provides users with a search engine to put on their own websites; its main feature is that it can be customized to refine search results based on parameters set by the user.

This makes it easy to find information, because the search engine looks only through the websites and pages the user has specified, not through unrelated corners of the web.

Setting up a Google Custom Search Engine is an easy three-step process. In the first step, the user sets the parameters of the search engine, listing the websites it will search. The second step simply sets up how the engine will appear on the website, and the third step provides the code to paste into the user’s own site.
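
The code Google hands you in that third step is a small HTML/JavaScript snippet, but a custom engine can also be queried programmatically through Google’s Custom Search JSON API. Here is a minimal sketch, assuming you have already created an engine and obtained an API key; the key, engine ID, and query below are placeholders, not real values:

```python
# Minimal sketch: querying a Google Custom Search Engine via the
# Custom Search JSON API (standard library only).
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"      # placeholder: issued by the Google API console
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder: the 'cx' id of your custom engine

params = urllib.parse.urlencode({
    "key": API_KEY,
    "cx": ENGINE_ID,
    "q": "Smithsonian exhibits",  # the search runs only over your listed sites
})
url = f"https://www.googleapis.com/customsearch/v1?{params}"

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

# Each result item carries a title, link, and snippet.
for item in results.get("items", []):
    print(item["title"], "->", item["link"])
```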

There are tons of smaller options that allow the search engine to be customized even further, from choosing sites to emphasize during the search, to making money from Google’s AdSense program.

One problem I could see with the search engine is that it is only as useful as the list of sites the user gives it; if the user does not know enough relevant sites to put on the list, the search results may not be as complete.

One solution is that the search engine allows collaboration with invited users who have limited access, letting them add sites and labels to the list as needed. The engine can also be set to search all pages while emphasizing the list of websites provided by the user.

The Google Custom Search Engine is basic in what it is used for, but can be further customized for advanced use in user interaction and how results are shown. Easy to set up, this search engine is one way for websites to ensure that their users are finding search results that are topic-related.

External Link to Example Search Engine
Smithsonian and DC Museums

Victorian Researcher Finds Google Makes His Life A Lot Easier

If you thought “Googling the Victorians” was about something else, you’ll be disappointed. In this article, Patrick Leary discusses how Google has made his life as a researcher of the Victorian era so much easier.

That’s to be expected with anything in digital history — wouldn’t our lives as historians be so much harder without Google?

But what is so surprising and unique about Leary’s article is how he views Google’s usefulness as something of an accident.

Leary writes about his search for a phrase that appeared in the Sunday Review. His search revealed that the phrase had appeared in a number of other sources as well.

Leary writes: “Such experiences reinforce the conviction that the very randomness with which much online material has been placed there, and the undiscriminating quality of the search procedure itself, gives it an advantage denied to more focused research.”

While Google has helped his work, Leary also writes that it is no silver bullet and that one should always verify the authenticity of a source that is returned in a Google search.

“A great many legitimate scholarly purposes can nevertheless be served by an array of online texts that are, to one degree or another, corrupt,” he writes.

Later in the paper, we hear with excitement about the prospects of expanded digitization projects, as well as improvements in optical character recognition, or OCR, the technology that makes scanned 19th-century Victorian documents searchable. Leary is also excited about the growing number of non-profit digitization initiatives, like the Internet Archive.

He then discusses how new generations will take this kind of research for granted.

“What we are seeing is arguably not merely an electronic supplement to traditional library and archival research, but a more fundamental shift in our relationship to the textual universe on which our research depends,” he writes.

In all, this paper is not at all surprising. It could be extrapolated and made applicable to other topics within history, or even other fields. But what makes it important are Leary’s anecdotes about how this has changed his life — and his field.

9/11 Archive

All of us here at AU and in this class were alive on September 11, 2001. Not only that, we all have direct firsthand knowledge of the event, whether we lost loved ones or just remember hearing about it on the news for the first time. However, as the years roll by and generations grow up with no memory of it, how will history describe 9/11? What will learning about that event look and feel like? In previous times, people went to libraries to read books or hear recordings of radio broadcasts and televised speeches. What will it be for our children?
If they use the September 11 Digital Archive, they will be in good hands. The organization, “funded by a major grant from the Alfred P. Sloan Foundation and organized by the American Social History Project at the City University of New York Graduate Center and the Center for History and New Media at George Mason University”, organizes and records stories and facts from 9/11 (“About the September 11 Digital Archive”).
What’s so great about this database? What struck me was how precisely it is organized. From the front page, go to the top and click on Browse. You are taken to another page that breaks the available information down into categories like stories, documents, etc. Within these categories there are further subdivisions; stories, for example, are broken down by where they came from. This precision makes finding information easy because of the way everything flows. It’s quick, efficient, and to the point, perfectly suited for the internet age.
That being said, the site isn’t perfect. The fact that some of the same pieces of information appear under the Browse link and the Research link seems a little redundant. Also, there is information about flyers that were on the streets of New York during 9/11. It seems meant to capture the mood, but the website puts so much emphasis on the attack itself, rather than establishing what a day in New York would have been like back then, that the flyers make little sense out of context…They seem like something that belongs in an exhibit made years after the attack. But overall, the website is a solid way to record and present information on 9/11 that could serve our children well.

Bibliography

“About the September 11 Digital Archive”, American Social History Project (2004): http://911digitalarchive.org/about/index.php

Digitization 101

“The National Initiative for a Networked Cultural Heritage (NINCH) is a US-based coalition of some 100 organizations and institutions from across the cultural sector: museums, libraries, archives, scholarly societies, arts groups, IT support units and others. It was founded in 1996 to ensure strong and informed leadership from the cultural community in the evolution of the digital environment. Our task and goal, as a leadership and advocacy organization, is to build a framework within which these different elements can effectively collaborate to build a networked cultural heritage.”

This guide promotes itself as a long-term, collaborative effort among professionals in the business of cultural heritage preservation and the technical support professionals who make it possible to digitize historical materials. This comprehensive survey of and guide to digitization programs can, and probably should, be used as a fundamental reference for any serious effort to digitally preserve cultural history. The six core ‘Good Practices’ put forth by NINCH are:

1) Optimize interoperability of materials

2) Enable broadest use

3) Address the need for preservation of original materials

4) Indicate strategy for life-cycle management of digital resources

5) Investigate and declare intellectual property rights and ownership

6) Articulate intent and declare methodology.

This comprehensive guide is laden with jargon, technical references and anecdotal evidence about digitization projects for professional historians. When your time comes to manage a digitization project, I encourage you to read this guide in full, but for now let’s stick to the basics.

At the beginning of Chapter V, the author lays out some common questions and concerns: which format(s) are best, how much detail is necessary, and which user activities should we support when digitizing? We’re told we should also consider the nature of the original materials, the purpose of digitizing them, and the availability of the expertise, tech support, and funding needed for a given project to succeed.

Different original materials will come in different shapes and sizes. Let’s briefly consider some of the issues, variations, tools, etc. that accompany each format of original material.

Text-based manuscript material:

  • Issues– ‘Proprietary software’: word-processing/imaging platforms like Microsoft Word and Adobe whose licensing and longevity are unreliable
  • Solution– “Standards-based methods”: encoding languages like ‘Standard Generalized Markup Language’ (SGML) and ‘Extensible Markup Language’ (XML), which “avoid the problems of proprietary software, offering data longevity and the flexibility to move from platform to platform freely” (see the sketch after this list)
  • Variation– Page image vs. full text
  • Tools– Scanners, optical character recognition (OCR) software, data-capture services
  • Formats– SGML, XML, TEI, ASCII, HTML, EAD, DTD, METS
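
To make the “standards-based methods” point concrete, here is a minimal sketch of wrapping a transcribed manuscript page in TEI-flavored XML using only Python’s standard library. The element names loosely follow TEI conventions but are my own simplification; a real project would validate against the actual TEI schema:

```python
# Minimal sketch: encoding a transcribed page in simplified TEI-style XML.
# Element names loosely follow TEI conventions; a real project would
# validate against the official TEI schema.
import xml.etree.ElementTree as ET

tei = ET.Element("TEI")

# Header: minimal bibliographic description of the digital file.
header = ET.SubElement(tei, "teiHeader")
title_stmt = ET.SubElement(ET.SubElement(header, "fileDesc"), "titleStmt")
ET.SubElement(title_stmt, "title").text = "Sample manuscript, page 1"

# Body: the transcription itself, one <div> per page.
body = ET.SubElement(ET.SubElement(tei, "text"), "body")
page = ET.SubElement(body, "div", attrib={"type": "page", "n": "1"})
ET.SubElement(page, "p").text = "Transcribed text of the page goes here."

# Plain-text XML remains readable as platforms change, which is the
# longevity argument the guide makes against proprietary formats.
ET.ElementTree(tei).write("page1.xml", encoding="utf-8", xml_declaration=True)
```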


Images/ 2D art:

  • Issues– Delicacy/irregularity of materials; quality of the digital image; consistent standards
  • Solution– ‘Intermediaries’; prioritization of researchers’ needs; investment in quality digitization tools
  • Variation– The needs of different mediums to produce the best digital rendering. For example, digitizing an oil painting has a different set of requirements from digitizing a black and white photograph.
  • Tools– High quality scanners or cameras, adequate storage space, specialized software
  • Formats– TIFF, JPEG, PDF


Audio/Visual materials:

  • Issues– Extinction of recording equipment, transmission of files, time, storage and money constraints
  • Solutions– Deal with it
  • Variation– Many different recording methods over the history of audio material come with their own machines, vices and challenges.
  • Tools– Analog playback devices, analog-to-digital converter, editing software
  • Formats– Audio: WAVE, MP3, RealAudio; Video: MPEG, QuickTime, RealVideo; Metadata: METS, SMIL


The NINCH Guide also discusses issues of quality control and quality assurance, which are basically the promises that contributors to digitization projects make to their researchers and audiences. Quality-assurance teams are responsible for “the procedures and practices that [are] put in place to ensure the consistency, integrity and reliability of the digitization process.” Progress and quality standards in a digitization project should be built in from the start and vetted regularly.

The primary goal of digitization is to preserve the original materials by taking them out of regular circulation. But much foresight and specificity are required to make a digitization project worth the time and money. The idea is that digitization should only have to happen once, with the file format remaining usable as technology evolves.