Tools Development

OCLC took the lead in developing a suite of Web archiving tools to identify, select, describe, and harvest Web-based content. Named the Web Archives Workbench, the suite bridges the gap between manual selection and automated capture of Web-based content by transforming collection policies into software-based rules and configurations. Based on the Arizona selection model, the tools will help information professionals implement Web collection policies, add metadata to harvested objects, and package harvested objects for ingest into a digital repository. Descriptions of the four tools in the suite follow below. [Note that indented information in each section is excerpted from the article "ECHO DEPository Project" by Judy Cobb, Richard Pearce-Moses, and Taylor Surface, IS&T's 2005 Archiving Conference, Washington, DC; April 26, 2005; pp. 175-178.]

  1. Discovery tool
    Will provide machine-assisted identification of potentially relevant Web domains by crawling 'seed' Web sites and extracting domains of possible interest. Domains are then manually evaluated as in- or out-of-scope based on the collection policies of the collecting organization. The list of domains will continue to be monitored and updated automatically, with new entries flagged for human review.
    Beta testing began in May 2005.

    [Excerpt from IS&T article]

    The first step in building a Web collection is to identify the Web sites that have content your organization wants to collect. The Discovery Tool will provide machine-assisted identification of Web domains that need to be analyzed for potential collecting work.

    The tool is based on the assumption that the vast majority of websites will be referenced on at least one other related website. Thus, by analyzing the links on all pages, it is possible to discover related domains. Starting with a seed list of URLs, a spider builds a list of all links on those pages and analyzes the links to create a list of distinct domains. In Arizona, the initial scan of four large websites captured some 10,000 links, but fewer than 700 domains.

    The list of domains is then manually evaluated. Some domains will hold content that is within the scope of the organization’s collecting policy; other domains will be out of scope. The user will mark each domain as “in scope” or “out of scope” and, based on those indications, the Discovery tool will continue to monitor the list, looking for new domains that need to be evaluated and domains that no longer exist.
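
    A minimal sketch (in Python, not the actual Workbench code) of the link-analysis step described above: a spider reads each seed page, collects its links, and reduces them to a set of distinct domains for manual review. The seed URL is a placeholder and the function names are illustrative assumptions.

        # Hypothetical sketch of the Discovery tool's domain-extraction step;
        # standard library only.
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urlparse
        from urllib.request import urlopen

        class LinkExtractor(HTMLParser):
            """Collect href targets from anchor tags on a single page."""
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def discover_domains(seed_urls):
            """Scan seed pages and reduce all outbound links to distinct domains."""
            domains = set()
            for seed in seed_urls:
                html = urlopen(seed).read().decode("utf-8", errors="replace")
                parser = LinkExtractor()
                parser.feed(html)
                for link in parser.links:
                    absolute = urljoin(seed, link)   # resolve relative links
                    host = urlparse(absolute).netloc
                    if host:
                        domains.add(host.lower())
            return sorted(domains)                   # thousands of links collapse to far fewer domains

        if __name__ == "__main__":
            seeds = ["https://www.example.gov/"]     # placeholder seed list
            for domain in discover_domains(seeds):
                print(domain)                        # each domain then gets an in/out-of-scope decision

    In practice the tool would also rescan periodically and keep track of which domains a reviewer has already marked in or out of scope, so that only new or disappeared domains need attention.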

  2. Properties tool
    Will facilitate the entry of metadata about the content providers associated with the in-scope domains identified by the Discovery tool; this metadata will be inherited by content captured from those domains. The tool also organizes Web sites hierarchically according to the relationships between content providers.
    Beta testing began in May 2005.

    [Excerpt from IS&T article]

    Building the list of domains is merely a means to an end. The ultimate goal is a list of content providers and their websites. Each domain is associated with a content provider, and the content providers are organized into a taxonomy that documents the relationships between content providers and links content providers to their websites. Further descriptive information about the content provider can be added and will be inherited by captured Web content using the Analysis Tool.
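
    The inheritance described above can be pictured as a small data structure: each content provider is a node in a taxonomy, and the metadata that applies to a captured object is the merge of its ancestors' metadata with its own. The Python sketch below is illustrative only; the provider names and fields are assumptions, not the Workbench's actual schema.

        # Hypothetical sketch of a content-provider taxonomy with metadata
        # inheritance; field names are illustrative, not the Workbench schema.
        from dataclasses import dataclass, field
        from typing import Optional

        @dataclass
        class ContentProvider:
            """A node in the provider taxonomy; children inherit the parent's metadata."""
            name: str
            metadata: dict = field(default_factory=dict)   # e.g. {"publisher": ..., "subject": ...}
            parent: Optional["ContentProvider"] = None
            domains: list = field(default_factory=list)    # Web domains associated with this provider

            def effective_metadata(self):
                """Merge metadata down the hierarchy; local values override inherited ones."""
                inherited = self.parent.effective_metadata() if self.parent else {}
                return {**inherited, **self.metadata}

        # Illustrative taxonomy: an agency nested under a state government
        state = ContentProvider("State of Arizona", {"publisher": "State of Arizona"})
        agency = ContentProvider("Department of Water Resources",
                                 {"subject": "water resources"},
                                 parent=state,
                                 domains=["azwater.gov"])

        print(agency.effective_metadata())
        # {'publisher': 'State of Arizona', 'subject': 'water resources'}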

  3. Analysis tool
    Will determine which documents should be harvested by providing a site analysis to help users visualize and understand the directory structure of a Web site. Based on this, users will make decisions about selection at the series level, input series metadata, and control how often the site should be checked for new and changed series data.
    Beta testing planned for the first quarter of 2006.

    [Excerpt from IS&T article]

    Once a state website has been identified, the second step is to determine which documents on that website should be acquired for the collecting program. Using an archival approach, selection is done at the series level, rather than considering each document individually.

    An archival series is “a group of similar records that are arranged according to a filing system and that are related as the result of being created, received, or used in the same activity.”1 The Arizona Model’s presumption that series exist on websites is founded on the common human behavior of organizing related materials into groups to help manage them. Because this is a general behavior rather than a requirement, different Web masters will organize their sites differently and with varying degrees of consistency. Those idiosyncrasies mean that a series-level approach to selection will have varying degrees of success.

    In order to be able to appraise and select at the series level, the Analysis tool will provide a site analysis to help users visualize and understand the directory structure of a website. When the user is able to understand a website’s structure, it is possible to make decisions about selection. The user will be able to identify existing series, name and describe each series, and then indicate specific criteria for how often a spider should search for new and changed content within each series. When the spider identifies new and changed content, the user may choose to be notified or to have the tool automatically capture the content into their collection. Each series is associated with its content provider’s properties, and those properties are inherited by the series and by the individual documents within the series.
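
    A simple way to picture series-level selection is to group a site's URLs by their directory paths, since a series typically lives in its own directory. The Python sketch below illustrates that grouping step under that assumption; the URLs are invented, and this is not the Analysis tool's actual algorithm.

        # Hypothetical sketch of grouping a site's URLs by directory path so that
        # candidate series become visible for naming and description.
        from collections import defaultdict
        from urllib.parse import urlparse

        def group_by_directory(urls, depth=2):
            """Group URLs by up to `depth` leading directory names (a rough proxy for a series)."""
            series = defaultdict(list)
            for url in urls:
                segments = [s for s in urlparse(url).path.split("/") if s]
                directories = segments[:-1]                 # drop the file name itself
                key = "/".join(directories[:depth]) or "(site root)"
                series[key].append(url)
            return series

        # Invented example URLs from a hypothetical agency site
        urls = [
            "https://www.example.gov/reports/annual/2004.pdf",
            "https://www.example.gov/reports/annual/2005.pdf",
            "https://www.example.gov/newsletters/jan-2005.html",
            "https://www.example.gov/index.html",
        ]

        for directory, members in sorted(group_by_directory(urls).items()):
            print(f"{directory}: {len(members)} document(s)")

    A user looking at this kind of grouping could recognize "reports/annual" as one series and "newsletters" as another, name and describe each, and set a recheck interval for the spider, as described above.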

  4. Packaging tool
    Will create an information package containing all files necessary to reconstruct any document within the series. The package will include the Web content itself, descriptive metadata inherited from the document's parent collection, and administrative and preservation metadata. The Metadata Encoding and Transmission Standard (METS) will be used as the basis of the package structure.

    [Excerpt from IS&T article]

    Once series have been identified and the content within each series has been captured and acquired by the system, the Packager tool creates an information package that contains all the files necessary to reconstruct any document within the series. Descriptive metadata is created from the document’s parent collection and series, and administrative and preservation metadata is added to the package.

    The Packager tool will likely use the Metadata Encoding and Transmission Standard (METS) as the basis of the package structure. The information packages created will be usable by different types of digital collections software, such as Greenstone, Fedora, DSpace, and the OCLC Digital Archive.
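
    As a rough illustration of what such a package involves, the Python sketch below assembles a minimal METS-style document for a single captured file: a descriptive metadata section inherited from the parent collection and series, a file section listing the captured content, and the structural map METS requires. The element choices, helper names, and example values are assumptions for illustration, not the Packager tool's actual METS profile.

        # Hypothetical sketch of assembling a minimal METS-style package for one
        # captured document; element choices are illustrative only.
        import xml.etree.ElementTree as ET

        METS = "http://www.loc.gov/METS/"
        XLINK = "http://www.w3.org/1999/xlink"
        ET.register_namespace("mets", METS)
        ET.register_namespace("xlink", XLINK)

        def build_package(title, provider, captured_files):
            """Return a METS tree with descriptive metadata, a file section, and a structMap."""
            mets = ET.Element(f"{{{METS}}}mets")

            # Descriptive metadata inherited from the parent collection and series
            dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
            wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="OTHER")
            xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
            ET.SubElement(xml_data, "title").text = title
            ET.SubElement(xml_data, "provider").text = provider

            # File section listing every captured file needed to reconstruct the document
            file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
            group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="web-capture")
            div = ET.Element(f"{{{METS}}}div", TYPE="document")
            for i, location in enumerate(captured_files, start=1):
                file_el = ET.SubElement(group, f"{{{METS}}}file", ID=f"FILE{i}")
                ET.SubElement(file_el, f"{{{METS}}}FLocat",
                              {f"{{{XLINK}}}href": location, "LOCTYPE": "URL"})
                ET.SubElement(div, f"{{{METS}}}fptr", FILEID=f"FILE{i}")

            # METS requires a structural map; this one simply points at every file
            struct_map = ET.SubElement(mets, f"{{{METS}}}structMap")
            struct_map.append(div)
            return ET.ElementTree(mets)

        tree = build_package("Annual Report 2005", "Department of Water Resources",
                             ["https://www.example.gov/reports/annual/2005.pdf"])
        ET.indent(tree)          # Python 3.9+
        tree.write("package.xml", xml_declaration=True, encoding="utf-8")

    Because the package is a self-describing XML document plus the captured files, repositories such as Greenstone, Fedora, DSpace, and the OCLC Digital Archive can each ingest it with their own METS-aware tooling.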

Further resources