OCLC took the lead in developing a suite of Web archiving tools to identify, select, describe, and harvest Web-based content. Named the Web Archives Workbench, the suite bridges the gap between manual selection and automated capture of Web-based content by transforming collection policies into software-based rules and configurations. Based on the Arizona selection model, the tools will help information professionals implement Web collection policies, add metadata to harvested objects, and package harvested objects for ingest into a digital repository. Descriptions of the four tools in the suite follow below. [Note that indented information in each section is excerpted from the article "ECHO DEPository Project" by Judy Cobb, Richard Pearce-Moses, and Taylor Surface, IS&T's 2005 Archiving Conference, Washington, DC; April 26, 2005; p. 175-178.]
The first step in building a Web collection is to identify the websites that hold content your organization wants to collect. The Discovery Tool will provide machine-assisted identification of Web domains that need to be analyzed for potential collecting work.
The tool is based on the assumption that the vast majority of websites will be referenced on at least one other related website. Thus, by analyzing the links on all pages, it is possible to discover related domains. Starting with a seed list of URLs, a spider builds a list of all links on those pages and analyzes those links to create a list of distinct domains. In Arizona, the initial scan of four large websites captured some 10,000 links but fewer than 700 domains.
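The reduction from thousands of links to a few hundred distinct domains can be sketched in a few lines of Python. This is an illustrative sketch, not the Discovery Tool's actual implementation; the sample URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def distinct_domains(pages):
    """Reduce every link found on the given pages to a set of distinct domains."""
    domains = set()
    for html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            netloc = urlparse(link).netloc
            if netloc:
                domains.add(netloc.lower())
    return domains

# Three links, but only two distinct domains (URLs are made up for illustration).
sample = ('<a href="http://www.azlibrary.gov/a.html">A</a>'
          '<a href="http://www.azdot.gov/b">B</a>'
          '<a href="http://www.azlibrary.gov/c">C</a>')
print(distinct_domains([sample]))
```

In practice the spider would fetch each seed page over HTTP before feeding it to the parser; the many-links-to-few-domains collapse is what makes the subsequent manual review tractable.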
The list of domains is then manually evaluated. Some domains will hold content that is within the scope of the organization’s collecting policy; other domains will be out of scope. The user will mark each domain as “in scope” or “out of scope” and, based on those indications, the Discovery tool will continue to monitor the list, looking for new domains that need to be evaluated and domains that no longer exist.
Building the list of domains is merely a means to an end. The ultimate goal is a list of content providers and their websites. Each domain is associated with a content provider, and the content providers are organized into a taxonomy that documents the relationships between content providers and links content providers to their websites. Further descriptive information about the content provider can be added and will be inherited by captured Web content using the Analysis Tool.
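The provider taxonomy described above can be modeled as a tree in which each node's descriptive metadata is merged with its ancestors'. The sketch below is an assumption about how such inheritance might work, not the tool's actual data model; the provider names and fields are illustrative.

```python
class ContentProvider:
    """Node in a content-provider taxonomy; children inherit descriptive metadata."""
    def __init__(self, name, parent=None, metadata=None):
        self.name = name
        self.parent = parent
        self.metadata = metadata or {}
        self.domains = []          # websites associated with this provider

    def effective_metadata(self):
        """Walk up the taxonomy; values set on a child override the parent's."""
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.metadata}

# Hypothetical taxonomy: an agency nested under the state government.
state = ContentProvider("State of Arizona", metadata={"jurisdiction": "Arizona"})
agency = ContentProvider("Dept. of Transportation", parent=state,
                         metadata={"creator": "ADOT"})
agency.domains.append("www.azdot.gov")
print(agency.effective_metadata())
```

Because captured content is linked to a provider node, metadata entered once at the state level flows down to every agency, series, and document beneath it.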
Once a state website has been identified, the second step is to determine which documents on that website should be acquired for the collecting program. Using an archival approach, selection is done at the series level, rather than considering each document individually.
An archival series is “a group of similar records that are arranged according to a filing system and that are related as the result of being created, received, or used in the same activity.”1 The Arizona Model’s presumption that series exist on websites is founded on the common human behavior of organizing related materials into groups to help manage them. Because this is a general behavior rather than a requirement, different Web masters will organize their sites differently and with varying degrees of consistency. Those idiosyncrasies mean that a series-level approach to selection will have varying degrees of success.
In order to appraise and select at the series level, the Analysis Tool will provide a site analysis to help users visualize and understand the directory structure of a website. Once the user understands a website's structure, it is possible to make selection decisions. The user will be able to identify existing series, name and describe each series, and then indicate specific criteria for how often a spider should search for new and changed content within each series. When the spider identifies new or changed content, the user may choose to be notified or have the tool automatically capture the content into the collection. Each series is associated with its content provider's properties, and those properties are inherited by the series and by the individual documents within it.
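A series with a revisit interval and change detection could be sketched as below. This is a minimal illustration under assumed mechanics (content hashing to detect changes); the field names and the `Series` class itself are hypothetical, not the Analysis Tool's API.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Series:
    """A named group of related documents under one directory of a website."""
    name: str
    path_prefix: str                 # e.g. "/press-releases/"
    check_interval_days: int         # how often the spider should revisit
    notify_only: bool = True         # notify the user vs. capture automatically
    seen: dict = field(default_factory=dict)  # url -> content hash

    def detect_changes(self, fetched):
        """Compare fetched {url: content} against stored hashes.

        Returns the URLs that are new or whose content has changed,
        and updates the stored hashes.
        """
        changed = []
        for url, content in fetched.items():
            digest = hashlib.sha256(content.encode()).hexdigest()
            if self.seen.get(url) != digest:
                changed.append(url)
                self.seen[url] = digest
        return changed
```

On the first crawl everything is reported as new; on later crawls only documents whose content hash differs are flagged, which is what lets the tool either notify the user or capture automatically.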
Once series have been identified and the content within each series has been captured and acquired by the system, the Packager tool creates an information package that contains all the files necessary to reconstruct any document within the series. Descriptive metadata taken from the document's parent collection and series is created, and administrative and preservation metadata is added to the package.
The Packager tool will likely use the Metadata Encoding and Transmission Standard (METS) as the basis of the package structure. The information packages created will be usable by different types of digital collections software, such as Greenstone, Fedora, DSpace, and the OCLC Digital Archive.
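The general shape of such a package can be sketched with Python's standard XML library. This produces a METS-shaped wrapper for illustration only; it omits most required sections and is not guaranteed to validate against the METS schema, and the function name and inputs are assumptions rather than the Packager tool's interface.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def build_package(series_title, files):
    """Assemble a minimal METS-shaped package: a descriptive metadata
    section plus a file section listing every captured file."""
    mets = ET.Element(f"{{{METS_NS}}}mets")

    # Descriptive metadata inherited from the collection and series would
    # go here; a bare title stands in for it in this sketch.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", MDTYPE="OTHER")
    ET.SubElement(wrap, f"{{{METS_NS}}}xmlData").text = series_title

    # One <file> entry per captured file needed to reconstruct the document.
    filesec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(filesec, f"{{{METS_NS}}}fileGrp")
    for i, path in enumerate(files, 1):
        f = ET.SubElement(grp, f"{{{METS_NS}}}file", ID=f"file{i}")
        ET.SubElement(f, f"{{{METS_NS}}}FLocat",
                      {f"{{{XLINK_NS}}}href": path}, LOCTYPE="URL")
    return ET.tostring(mets, encoding="unicode")
```

Because the package is plain METS-style XML, repositories such as Greenstone, Fedora, DSpace, or the OCLC Digital Archive can each map it into their own ingest formats.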