This methodology differs from the approaches used to date. Manual, item-level selection fails because information professionals cannot keep up with the enormous number of resources on the Web. A fully automated approach that captures the entire Web buries substantive materials under a mountain of ephemeral, redundant, or irrelevant information.

Instead, this selection methodology takes an archival approach to the Web. Materials are managed as they are in paper-based archives: as a hierarchy of aggregates rather than as individual items. This reduces the sheer volume of Web materials to be preserved to a more practical size while maintaining a scalable degree of human involvement. Developed by the Arizona State Library, the approach is the guiding model for the OCLC tools suite described in the next section.

Introduction—The Arizona Model for Managing Web Content

In many ways, the Web can be a boon for a library or archives responsible for collecting, managing and providing ongoing access to resources. The increased number of documents on the Web means a vastly richer collection of reports and publications, and the Web has made it much easier to locate and capture documents that may have never been received in print. However, Web documents present a number of challenges to traditional ways of curating a print-based collection. The Web is used to distribute ephemeral documents, in addition to official reports and publications, making it difficult to clearly distinguish documents that should be added to a collection. Web documents often lack the formal elements of printed reports and publications; without a cover sheet or title page, finding the information necessary to describe the documents can be a challenge. Where printed documents have a simple and familiar structure – ink on paper sheets with a binding that defines the content’s sequence and boundaries – Web documents are often created using specialized software and may contain links that blur the document’s boundaries.

To realize the potential benefits of the Web, the collecting organization must find ways to identify, select, acquire, describe, and provide access to the enormous amount of digital information that is now online. What we do will remain fundamentally the same, but how we do those things in a digital environment will change significantly. Those changes are reflected in the vernacular as the words publication and document are replaced by information.

To date, institutions building a collection of Web publications have generally followed one of two models. The “bibliocentric” model is based on the traditional library process of selecting documents one by one: identifying appropriate documents for acquisition; downloading each document to a server or printing it to paper; then cataloging, processing, and distributing it like any other paper publication. This approach can capture a low volume of high-quality content. However, it cannot be scaled to the massive number of Web publications without a large increase in human resources.

The “technocentric” model focuses on software applications that can capture virtually everything with automatic Web crawls. This approach trades human selection of significant documents for the hope that full-text indexing and search engines will be able to find documents of lasting value among the clutter of other, ephemeral Web content captured in the process. In effect, it transfers the work of selection from the curating organization to the patron.

An Archival Approach

The ECHO DEPository project is investigating another approach to curating collections of Web publications. The model is based on an approach developed by the Arizona State Library and will be implemented by tools developed by the Online Computer Library Center, Inc.

This model is based on the observation that a website is similar to an archival collection. Both are collections of documents that have common provenance. Both group related documents together; on the Web, the groups are called directories and subdirectories, while in archival collections they are called series and subseries.

The approach is based on the following archival tenets:

  • Materials are managed as a hierarchy of aggregates. In general, archivists do not manage collections at the item level unless the individual items are of great importance.

  • Respect for provenance requires that documents from one source are not mixed with documents from another source.

  • Respect for original order requires that documents be kept in the order that the creator used to manage the materials.

  • Respect for provenance and original order ensures that documents remain in context, and that the context can yield a richer understanding of the individual documents.

An archival approach to curating a collection of Web documents focuses first on aggregates (collections and series) rather than on individual documents, which reduces the size of the problem to a more practical number. Spending just five minutes each to process the 300,000-plus Web documents would take twelve years to complete. By contrast, an archivist who spends ten hours analyzing the series (directories) on each of the 200 collections (websites) could do the work in a year. To the extent that series are stable on a website, the amount of work after the initial analysis will be substantially less in subsequent years.
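The comparison above can be checked with a little arithmetic. The following is a minimal sketch rather than part of the Arizona Model itself: it assumes a staff year of roughly 2,000 working hours, a figure the text does not state, and the function name staff_years is hypothetical.

```python
# Rough effort estimate for item-level versus aggregate-level processing.
# ASSUMPTION: one staff year is about 2,000 working hours (not stated in the text).
HOURS_PER_STAFF_YEAR = 2000


def staff_years(units: int, hours_per_unit: float) -> float:
    """Return the staff years needed to process `units` at `hours_per_unit` each."""
    return units * hours_per_unit / HOURS_PER_STAFF_YEAR


# Item level: 300,000 documents at five minutes each.
item_level = staff_years(300_000, 5 / 60)

# Aggregate level: 200 websites at ten hours each.
aggregate_level = staff_years(200, 10)

print(f"item level: {item_level:.1f} staff years")
print(f"aggregate level: {aggregate_level:.1f} staff years")
```

Under that assumption the item-level figure comes to about twelve and a half staff years and the aggregate-level figure to one, which agrees with the estimates given in the text.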

The Craft of Curating a Collection

Curating a collection of Web documents using archival principles is relatively straightforward. The archivist approaches the documents on a website as an organic whole and then, moving down the hierarchy, looks at each series in the collection as a whole. The archivist stops when further subdivision of the hierarchy is no longer useful.

The challenge of curating a collection of Web documents is in understanding the structure of the website. In particular, the archivist may have access to the documents through the website, but may not have direct access to the underlying server or its file system.
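When only the public website is available, one way to approximate its structure is to group document URLs by their directory paths. The sketch below is illustrative only and does not describe the OCLC tools: the function series_counts, the depth parameter, and the example URLs are all hypothetical, and it assumes that URL paths mirror the directories the creator used, which is not true of every website.

```python
import posixpath
from collections import Counter
from urllib.parse import urlparse


def series_counts(urls, depth=1):
    """Count documents per (host, series).

    A series is approximated by the first `depth` directory levels of the
    URL path. Keeping the host in the key keeps sources separate.
    """
    counts = Counter()
    for url in urls:
        parsed = urlparse(url)
        directory = posixpath.dirname(parsed.path)  # drop the file name
        parts = [p for p in directory.split("/") if p]
        series = "/" + "/".join(parts[:depth])
        counts[(parsed.netloc, series)] += 1
    return counts


# Hypothetical URLs, as might be gathered from a crawl of one website.
sample_urls = [
    "https://agency.example.org/index.html",
    "https://agency.example.org/reports/2004/annual.pdf",
    "https://agency.example.org/reports/2005/annual.pdf",
    "https://agency.example.org/news/release1.html",
]

for (host, series), n in sorted(series_counts(sample_urls).items()):
    print(host, series, n)
```

Raising depth moves one level further down the hierarchy, so the same listing can be reviewed first at the series level and then at the subseries level. Deciding where to stop remains the archivist's judgment.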

Specialized software can facilitate this process of curating a collection of Web documents as an archival collection. However, the tools alone will not guarantee success. First and foremost, the Arizona Model focuses on craft rather than technology. It seeks to articulate a rational way to perform tasks and to use tools in an integrated fashion to produce a reasonable result.

