Extracting Metadata for Preservation (EMP)


The Extracting Metadata for Preservation (EMP) project, new in ECHO DEPository Phase II, developed a tool for metadata creation and extraction.

As the amount of digital content grows, so does the need to create metadata more efficiently. Our approach was to provide machine assistance for metadata creation using linguistic technology. Building on work at OCLC, at the Illinois Department of Computer Science, and at the University of Maryland, EMP developed stand-alone open-source tools, or web services, for automated metadata extraction.

Specifically, we developed a generalized metadata tool architecture and built a Named Entity Metadata Extraction tool. Development was based on two approaches:
  • to extract names from existing structured marked-up text (metadata extraction)
  • to extract names from free text (metadata creation)
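
The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample record, the `persName` tag, the function names, and the capitalized-word heuristic are hypothetical and are not taken from the EMP tools. In particular, the regular expression is only a toy stand-in for the linguistic and machine-learning techniques the project used for free text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, invented for this illustration only.
MARKED_UP = (
    "<record><title>Letters</title>"
    "<persName>Jane Addams</persName>"
    "<persName>John Dewey</persName></record>"
)
FREE_TEXT = "In 1899 Jane Addams met John Dewey at Hull House in Chicago."


def names_from_markup(xml_text, tag="persName"):
    """Metadata extraction: read names the markup already identifies.

    The tag name is an assumption; a real collection would use whatever
    element its own schema defines for personal names.
    """
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]


# Toy heuristic (NOT the EMP method): two or more consecutive
# capitalized words are treated as a candidate name.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def names_from_free_text(text):
    """Metadata creation: propose candidate names from unmarked text."""
    return CANDIDATE.findall(text)


if __name__ == "__main__":
    print(names_from_markup(MARKED_UP))
    print(names_from_free_text(FREE_TEXT))
```

In this sketch the marked-up record yields exactly the names its markup declares, while the free-text heuristic also proposes "Hull House", which is not a person. Candidates of that kind are one reason free-text output benefits from being checked against an authority file.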

The key deliverables in EMP were:

  • Development of a documented open-source tool for high-quality Named Entity metadata extraction and creation
    • This work encompassed mapping to authority files (specifically, WorldCat Identities, created by OCLC) and involved developing external metadata profiles and a machine-learning approach.
    • In addition, use cases/scenarios were explored to drive system development.
  • Development of a general metadata tool architecture extensible to other types of metadata tools
  • An evaluation and analysis of the new tools in comparison with existing Named Entity metadata tools