Jean Godby <firstname.lastname@example.org>
Eric Miller <email@example.com>
Office of Research
OCLC Online Computer Library Center, Inc.
Scholars and professional information providers are beginning to address the problem of creating descriptions and catalogs that identify electronic resources accessible on the Internet. In many cases, these resources consist of important primary data, theses and dissertations, computer software, prepublication drafts of papers, or electronic versions of rare or difficult-to-access books and manuscripts--in other words, the raw material of scholarly activity. Though they may be included in the well-known Web indexes such as Lycos or Yahoo, they are inadequately described by the automated methods used by these systems to identify Internet resources. As a result, they are lost in noise, effectively inaccessible to all but the most patient searcher. Until automatically generated indexes can provide a clear idea of the logical structure, quality and subject of the work, they will not replace indexes and catalogs that have been created by human experts. There is a growing need for hand-crafted descriptions of resources that are judged to be so important by their creators or sponsors that their discovery can't be left to chance.
In response to this need, information providers are managing collections of electronic resources with the same care given to traditional library materials. The management of electronic texts and related objects engenders unique problems of selection and description; finding solutions to these problems is an active subject of scholarship in library and information science. For example, six projects supported by the National Science Foundation are currently underway that strive to create and describe collections in the digital library of the future. The American Library Association sponsors discussion groups for those interested in creating and maintaining collections of electronic text. A project underway at OCLC that is sponsored by a grant from the U.S. Department of Education enlists the participation of librarians in the selection and description of Internet resources.
In this paper, we address one of the problems associated with the management of networked electronic resources. When information providers create descriptions of an electronic resource, they have two choices. They can use the standard formats for the creation of descriptive records--or metadata, to use the currently popular terminology--that are optimized for the precise characterization of the unique data under their control. For example, if they are describing texts that might be of interest to linguists, they might use the Text Encoding Initiative (TEI; Sperberg-McQueen and Burnard, 1994)--which, among other things, contains guidelines for transcribing conversations and for recording the social context of an utterance. If they are describing maps, they would use the Federal Geographic Data Committee (FGDC 1994) standard, which has guidelines for coordinating geospatial data with maps. If they are describing a resource that is traditionally handled by libraries, they might use the MARC (MARC 1994) standard. An alternative is to register the resource with an Internet directory service such as Yahoo, Excite, or Alta Vista.
These alternatives have tradeoffs. The application of metadata standards that have been developed to serve the needs of particular scholarly communities can produce the best descriptions, but software that manipulates these formats is expensive to develop and not freely available. Moreover, users would be required to have extensive knowledge of these formats if they have a multidisciplinary research topic. For example, a user who is interested in the geographic distribution of a dialect feature might plausibly be expected to search databases encoded in the TEI and FGDC formats, both of which have many pages of documentation and are still evolving. On the other hand, Internet directory software is accessible and easy to use, but it can't be used to create descriptions that are detailed enough to support scholarly inquiry.
For example, consider the resource in Appendix I. It is a description, suitable for encoding in FGDC, for a Web site describing a research project sponsored by the United States Geological Survey that analyzes the distribution and characteristics of polar sea ice in a specified latitude/longitude range. The Web site has pointers to technical reports as well as raw geospatial data. Some of the technical reports that are available in machine-readable form on the Internet are based on conventionally published papers. What happens when someone tries to register this resource with Yahoo? To create a description that has a chance of being found again by the target audience, it would be necessary to supply the publisher, date, geographic coverage, and relationships among the raw data and the technical reports. Because all this information must go in Yahoo's unstructured Comments field, there is no guarantee that the description will be in a standardized form that would enable an automatic process to group similar records.
Once created, the description must be placed in Yahoo's subject hierarchy. Although Yahoo has several categories for geography, the classification scheme is alien to a professional geographer, so it isn't clear where the new record belongs. To find a suitable location, it is necessary to scroll down to the end of hierarchies like science/geography/maps/institutes or regional/regions/arctic to find similar objects, essentially doing an exhaustive search through a subtree that is constantly growing. The user has to repeat much of this process to find the resource. As long as the number of resources in a given subtree is relatively small, the tasks of registering and finding Internet resources may be manageable, but it is an open question as to how well this scheme will scale up.
An improved tool would enable the user to create and access structured records that are now available only in the standard metadata formats such as FGDC, MARC or TEI through a program that is as easy to use as the current generation of Internet directory services. To do this, it is necessary to define a bridge across these standards, acknowledging that it is neither practical nor desirable to unify them under a single model. This involves extracting semantic overlap, especially in fields that identify, classify and point to the location of a resource. Once extracted, a core record would be created that could be mapped or linked to FGDC, MARC or TEI records.
The core record can be used to achieve a limited degree of interoperability among metadata standards that is analogous to the interoperability among computer systems that communicate through standard protocols. Because the most important feature is the semantic mapping among the common fields, we refer to this interchange as "semantic interoperability." For example, semantic mapping would resolve the TEI <authorstatement>, the FGDC author attribute-value pair, and the MARC 245 field to a single author field, despite differences in syntax and internal complexity. The mapping is semantic because it can't be achieved simply through structural manipulation; the metadata standards are different enough that intellectual analysis by human experts is required (Guenther 1995). If the results of this analysis could be recorded in a language that is independent of a particular implementation, we would be one step closer to the definition of a protocol for the interoperability of metadata standards.
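To make the idea concrete, the author-field mapping described above can be sketched as a small crosswalk table. This is only an illustration in Python: the table entries paraphrase the fields named in the text, while the record layout and the helper function are our own assumptions, not part of any of the standards.

```python
# A minimal sketch of the semantic mapping described above. The table
# entries paraphrase the fields named in the text; the layout and the
# helper function are illustrative assumptions, not a real crosswalk.

CROSSWALK = {
    "author": {
        "TEI": "authorstatement",  # TEI <authorstatement>
        "FGDC": "author",          # FGDC author attribute-value pair
        "MARC": "245",             # MARC field cited in the text
    },
}

def to_core_field(standard, field):
    """Resolve a standard-specific field name to its core element."""
    for core_name, variants in CROSSWALK.items():
        if variants.get(standard) == field:
            return core_name
    return None  # no core equivalent: the field is standard-specific

print(to_core_field("TEI", "authorstatement"))  # author
print(to_core_field("MARC", "245"))             # author
```

The point of the sketch is that the mapping itself is a lookup table produced by human intellectual analysis; the software merely applies it.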
A computer system that is designed around the alternative view of metadata sketched above might be in a better position to serve the scholarly community than the current Internet registry services, with only a slight increase in user-apparent complexity. To register the Web page for the research project on polar sea ice described in Appendix I, the geographer would fill out a form that is based on the core record and that asks for author, title, publisher, date, subject, electronic location, etc. Optionally, the geographer would indicate, in the semantic interoperability mapping language, that this record has an FGDC flavor that interprets the subject as an FGDC subject and requires FGDC's field for recording the geospatial data's latitude and longitude. The computer system would use this information to update two cross-linked databases: one of simple core records and one of FGDC records. The user would search for the record through an easy-to-use interface to the database of core records, with the option of linking to the FGDC database for additional detail. With a system like this, it would be possible to enlist the power of established metadata standards to create more precise descriptions of Internet resources than those currently available, while hiding the complexity of these standards from the casual user.
An outline of this method for achieving semantic interoperability among metadata standards was proposed at a meeting of the NSF/NASA/ARPA Digital Library Initiative projects held in Santa Barbara, California, in November 1995. In the rest of this paper, we attempt to make that proposal concrete and demonstrate its use in a prototype system.
We use the Dublin Core Element Set ("Dublin Core") to implement semantic interoperability. The Dublin Core is a set of 13 metadata elements that originated from discussions at the OCLC/NCSA Metadata Workshop (Weibel et al. 1995a) and that are intended to facilitate the discovery and retrieval of electronic texts and similar objects. The elements are listed below:
Subject, Title, Author, Publisher, OtherAgent, Date, ObjectType, Form, Identifier, Relation, Source, Language, and Coverage.
It was argued at the metadata workshop that the meanings of these elements could be understood by users with no training in formal cataloging and could be used to create descriptions of Internet resources that are more detailed than automatically generated indexes. Because the Dublin Core elements are represented in some fashion in most of the metadata standards, they can be used as a common language to access those formats. In our work, the Dublin Core is used to create the core record discussed in the previous section.
However, the Dublin Core proposal lacks two important details. First, the Dublin Core elements are sufficient to create a simple descriptive record, but this record is of limited usefulness unless it can be formally linked or extended to a more detailed record. In terms of the discussion in the previous section, this lack of extensibility means that the Dublin Core standard provides no specification for getting from the simple, unified interface that the user initially encounters back to the heterogeneous databases that ultimately satisfy the information need. Second, the definition of the Dublin Core element set says nothing about how the fields in Dublin Core records should be interpreted. The definitions are general enough to accept data from casual users as well as professional information providers trained in formal cataloging. Without further definition, Dublin Core records can be ambiguously interpreted as MARC, TEI or any idiosyncratic encoding that a user may devise. To disambiguate the records, it is necessary to link the Dublin Core elements to an external scheme that specifies how to interpret the encoding of the field. With an implementation of schemes, it would be possible to determine by an automatic process that, for example, the data in a Subject field are from a controlled vocabulary such as the Library of Congress Subject Headings.
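The notion of a scheme can be sketched as a qualifier attached to each element value. In this hypothetical Python fragment, the record layout and the scheme label "LCSH" are our assumptions; the Dublin Core proposal itself prescribes no particular representation.

```python
# A sketch of scheme-qualified Dublin Core elements. The record layout
# and the scheme labels are illustrative assumptions.

record = {
    "Subject": {"value": "Sea ice", "scheme": "LCSH"},  # controlled vocabulary
    "Title":   {"value": "Modern Average Global Sea-Surface Temperature",
                "scheme": None},                        # free text, no scheme
}

def is_controlled(element):
    # An automatic process can tell whether a field's data come from a
    # controlled vocabulary simply by inspecting its scheme qualifier.
    return element.get("scheme") is not None

print(is_controlled(record["Subject"]))  # True
print(is_controlled(record["Title"]))    # False
```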
Our implementation is an extension and generalization of the Spectrum system. As has been described in detail elsewhere (Vizine-Goetz et al. 1995), the Spectrum system presents an interface to users who wish to register and describe a resource for inclusion in a Web-accessible database of Internet resources. Figure 1 shows that the Spectrum system has three major components: a record-creation subsystem, a record-conversion subsystem and a record-retrieval subsystem. Interacting with a series of HTML forms, the user creates a simple but useful description that is based on the Dublin Core. The record-conversion subsystem converts this record to the MARC format and creates a database of records in a compatible format. At the user's request, the Spectrum system can also convert the input record to the TEI format. The record-retrieval subsystem presents an HTML interface to a database that is accessible from the Internet via WebZ (Weibel et al. 1995b), an HTTPD server that maintains a database session and bridges the gap between the HTTP protocol and the Z39.50 information retrieval protocol.
Figure 1. Spectrum's System Architecture
Spectrum has two design features that make it suitable for our current work. First, all components except the user interface are written using industry-wide standards and software that is available for license or purchase, primarily OCLC's SiteSearch. An immediately apparent result of Spectrum's design is that the database of Internet resources created by the interaction with the user interface contains sophisticated structured records that support the formulation of highly specific queries. For example, a user can request all electronic texts about Shakespeare written in English, French or German but not Portuguese, accessible by FTP, with dates no later than 1994. This degree of specificity is beyond the scope of popular Internet searching tools such as Yahoo, Excite and others (Courtois et al. 1995). Second, the Spectrum system uses OCLC's Document Grammar Builder software to map from the Spectrum input record to the TEI and MARC formats. Because the intellectual mappings among the record types are recorded in the Document Grammar Builder's fourth-generation scripting language, the interoperability that already exists between the Spectrum input record and TEI and MARC records is easily extensible to other metadata standards. The mappings can also be changed as the standards evolve, without changing Spectrum's source code.
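The Shakespeare query above is easy to state against structured records. The following Python fragment is only a sketch: the field names loosely mirror Dublin Core elements, and the sample records are invented for illustration.

```python
# A sketch of the structured query quoted above, run over invented
# sample records; field names loosely mirror the Dublin Core elements.

records = [
    {"subject": "Shakespeare", "language": "English",
     "access": "FTP",  "date": 1993},
    {"subject": "Shakespeare", "language": "Portuguese",
     "access": "FTP",  "date": 1992},   # excluded language
    {"subject": "Shakespeare", "language": "French",
     "access": "HTTP", "date": 1994},   # wrong access method
]

def matches(r):
    return (r["subject"] == "Shakespeare"
            and r["language"] in ("English", "French", "German")
            and r["access"] == "FTP"
            and r["date"] <= 1994)

hits = [r for r in records if matches(r)]
print(len(hits))  # 1
```

A keyword index over the same collection could not express the language exclusion or the date cutoff; that is precisely the specificity that structured records buy.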
However, the current design of Spectrum also has some limitations. The most important problem is that the user is limited to a single input record, shown in Appendix II. The design of the Spectrum data-entry form was guided by the desire to gather information that could be mapped to a minimal record in several metadata standards, using the Dublin Core as a starting point. Though the Spectrum data-entry form may be valuable to some user communities, the geographer who wishes to register the Web page describing the research on polar sea ice discussed above will have problems similar to those encountered in the attempt to register the resource on Yahoo. There is no way to indicate unambiguously in the Spectrum record that the subject is from a classification scheme used by professional geographers. And the Spectrum record gives no place to record the geographic coordinates that precisely identify the distribution of the data. As a result of the fixed input record, Spectrum's ability to achieve semantic interoperability among metadata standards is underutilized.
In our revision of the Spectrum system, we use Spectrum's back-end processes that map among metadata standards and build a Web-accessible database of Internet resources, but we have a more abstract view of the user interface. Figure 2 shows the revised design.
Figure 2. System design for a generalized Spectrum
Instead of the fixed record-entry form in Appendix II, the user interface is built with a simple but extensible metalanguage coded in SGML that generates HTML markup appropriate for entering descriptive records and submitting these records to Spectrum's back-end processes. We have dubbed this language the "Spectrum Cataloging Markup Language," or SCML. SCML is flexible enough to allow user communities to build customized HTML interfaces. Though SCML currently generates only HTML, it could, in principle, be extended to generate scripts in Java and Visual Basic.
Trivially, SCML can be used to generate Spectrum's original record-entry form. For example, the Author field can be created with the code fragment in Figure 3a. After this fragment is processed with a CGI (Common Gateway Interface) script, the result is the HTML form in Figure 3b. If we create a record entering our names in the HTML text field, Spectrum's CGI scripts produce the SGML record in Figure 3c.
Figure 3. SCML, HTML, and SGML code for Spectrum's Author field
<element id = "author"> <definition type = "text"></definition> </element>
<b>Author:</b><input name=NULL size=40></input>
<author>Jean Godby and Eric Miller</author>
The HTML fragment is created by using two tags in SCML: the <element> tag and the <definition> tag. The <element> tag specifies the Dublin Core metadata element. The <definition> tag has several attributes that enable the user interface designer to create a customized form for the element. In this case, we used the value of the "type" attribute to generate an HTML text field.
Other values of this attribute generate the rest of the HTML 1.0 data input formats. In Figure 4, the SCML defines a list box that lets a user choose from a controlled list of Internet Media Types.
Figure 4. SCML for an HTML list box
<definition type = "Select" scheme = "IMT"> <data> <item><val>text/html</val></item> <item><val>text/plain</val></item> <item><val>text/richtext</val></item> </data> </definition>
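The expansion of such a definition into HTML can be sketched as follows. To keep the fragment self-contained, the SCML is modeled as a plain data structure rather than parsed SGML; the function and its output format are our assumptions, not Spectrum's actual code.

```python
# A sketch of expanding an SCML <definition type="Select"> into an HTML
# list box. The SCML is modeled as a plain data structure; the function
# and its output format are illustrative, not Spectrum's actual code.

definition = {
    "type": "Select",
    "scheme": "IMT",
    "items": ["text/html", "text/plain", "text/richtext"],
}

def to_html(element_id, d):
    if d["type"] == "Select":
        options = "".join("<option>%s</option>" % v for v in d["items"])
        return '<select name="%s">%s</select>' % (element_id, options)
    # the "text" type from Figure 3 falls through to a plain text field
    return '<input name="%s" size=40>' % element_id

print(to_html("form", definition))
```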
Other attributes on the <definition> tag make it possible to create customized definitions of the Dublin Core elements and specify a scheme indicating that the default Dublin Core field has additional encoding. For example, the TEI statement for <publisher> has internal structure that would be inadequately represented in a free-text field. Figure 5 shows the SGML definition of a simple TEI publisher field, an HTML form that could be used to elicit this information, and the SCML fragment that generates it. As in Figure 3, this SCML script creates text-entry forms. The value of the "scheme" attribute identifies this as a TEI record.
Figure 5. The TEI <pubstmt> with corresponding SCML and HTML fragments
<pubstmt> <resp> <role></role> <name></name> </resp> </pubstmt>
<b>Publisher:</b><input name=NULL size=40></input> <b>Responsibility:</b><input name=NULL size=40></input> <b>Role:</b><input name=NULL size=40></input> <b>Name:</b><input name=NULL size=40></input>
<element id = "publisher"> <definition name = "responsibility" scheme = "TEI"> <definition name = "role" type = "text"></definition> <definition name = "name" type = "text"></definition> </definition> </element>
After the user has entered a record with the HTML form in Figure 5b, Spectrum's CGI scripts convert the data to SGML like that in Figure 5a, except that the TEI tag <pubstmt> is renamed to the Dublin Core <publisher>.
The Spectrum back-end processes do two things with this record. First, because this is a Dublin Core record, it can be added to a database of Dublin Core records. In this database, the additional TEI tagging is ignored. Second, the fields relevant to a TEI record are extracted and the resulting record is included in a database of TEI records that are cross-linked to the corresponding records in the Dublin Core database. Document Grammar Builder translation scripts do the conversions.
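The two-database update can be sketched as follows. The element list, the field names, and the use of a shared identifier as the cross-link are assumptions about one plausible implementation, not Spectrum's actual translation scripts.

```python
# A sketch of the cross-linked two-database update described above.
# The element list, field names, and shared-identifier link are
# assumptions about one plausible implementation.

DUBLIN_CORE_ELEMENTS = {"title", "author", "publisher", "subject", "date"}

dc_db, tei_db = {}, {}

def register(record_id, record):
    # Dublin Core view: keep only core elements; extra TEI tagging is ignored
    dc_db[record_id] = {k: v for k, v in record.items()
                        if k in DUBLIN_CORE_ELEMENTS}
    # TEI view: keep the full record, cross-linked by the shared record_id
    tei_db[record_id] = dict(record)

register("r1", {"title": "Polar Sea Ice Report",     # invented sample data
                "publisher": "USGS",
                "resp": "compiled by project staff"})  # TEI-only field

print("resp" in dc_db["r1"], "resp" in tei_db["r1"])  # False True
```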
It is also possible to specify extensions to the Dublin Core element set because SCML can recognize elements in the metadata standards that are not part of the Dublin Core. For example, the TEI field <editDecl>, which maintains a revision history of the object being cataloged, could be coded in SCML and processed in a way that is analogous to the example in Figure 5. A Dublin Core record is created, but the <editDecl> field is omitted because it is not a Dublin Core element. However, a record containing this field is added to the TEI database, and the two records are linked.
SCML is a simple but powerful way to implement schemes and extensions for the Dublin Core element set. For example, the geographer who wishes to describe the Web page on polar sea ice in Appendix I can enter a description in the SCML-generated template that contains Dublin Core elements, plus the fields from the FGDC record required for a precise description of the resource. See  for a sample. With the generalized version of Spectrum, we achieve some measure of interoperability among metadata standards while hiding their complexity from casual users. And because the records in the Spectrum system can support descriptions that are deemed adequate by the scholarly communities that generate them, users can expect search results of higher quality than those obtained from the current generation of Internet search services.
Federal Geographic Data Committee, 1994. Content Standards for Digital Geospatial Metadata. Washington, D.C.: Federal Geographic Data Committee.
Martin P. Courtois, William M. Baer, and Marcella Stark, 1995. "Cool Tools for Searching the Web," Online (November/December 1995): 15-32.
Rebecca Guenther, 1995. "Mapping the Dublin Core Metadata Elements to USMARC." Library of Congress. MARBI discussion paper No. 86.
Network Development and MARC Standards Office, ed., 1994. USMARC Format for Bibliographic Data. Washington, D.C: Cataloging Distribution Service, Library of Congress.
C.M. Sperberg-McQueen and Lou Burnard, eds., 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative.
Diane Vizine-Goetz, Jean Godby and Mark Bendig, 1995. "Spectrum: A Web-based Tool for Describing Electronic Resources," Computer Networks and ISDN Systems 27: 985-1001.
Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, 1995a. "The OCLC/NCSA Metadata Workshop Report," http://www.oclc.org:5047/oclc/research/conferences/metadata/dublin_core_report.html
Stuart Weibel, Eric Miller, Jean Godby, and Ralph Levan, 1995b. "An Architecture for Scholarly Publishing on the World Wide Web," Computer Networks and ISDN Systems 28: 239-245.
Keyword: Sea Surface Temperatures
Title: Modern Average Global Sea-Surface Temperature
Author: Schweitzer, Peter N
Author's email address: firstname.lastname@example.org
Publisher: U.S. Geological Survey
Publication date: 1993
Object type: technical report
Object form (IMT): text/html
Electronic location: http://geochange.er.usgs.gov/pub/magsst/magsst.html
Relationship (child of): http://geochange.er.usgs.gov/pub/info/holdings.html
Relationship (sibling of): file://geochange.er.usgs.gov/pub/sea_ice/README.html
Source (book): NOAA Advanced Very High Resolution Radiometer Multichannel Sea Surface Temperature data set produced by the University of Miami/Rosenstiel School of Marine and Atmospheric Science
Originator: Jet Propulsion Laboratory
Keywords: North Atlantic Ocean, South Atlantic Ocean, Indian Ocean, Pacific Ocean, Mediterranean Sea
Spatial: West: Orientation = W; Deg = 180
East: Orientation = E; Deg = 180
North: Orientation = N; Deg = 72
South: Orientation = S; Deg = 66
Temporal: Begin (YYYYMMDD): 19811001
End (YYYYMMDD): 19891231