Last update at http://inet.nttam.com : Mon Aug 7 21:40:15 1995

Abstract -- Experiences with On-line access to Chemical Journals
Application Technology Track
A1: Information Space Environments


Experiences with On-line access to Chemical Journals

Kirstein, Peter (P.Kirstein@cs.ucl.ac.uk)
Montasser-Kohsari, Goli (G.MontasserKohsari@cs.ucl.ac.uk)

Abstract

Several US organisations have collaborated in the CORE project to deliver electronic information from primary publications to end-user chemists. For this, Bellcore have scanned 500,000 pages of post-1987 ACS journals, and have re-processed the typesetting tapes from these journals into SGML format for indexing and searching. They have set up an electronic database of approximately 100,000 articles, representing those 500,000 pages, at the Cornell University Mann Library for access over Local Area Networks by Cornell chemists.

University College London (UCL-CS) has been involved with the CORE project since 1988, relying heavily on the work of Bellcore, and using the ACS data. Its CODA project, supported by the British Library Research and Development Department, provides facilities similar to those of the CORE project, but has concentrated on two additional areas: the use of ODA as a distribution medium, and access over relatively low-bandwidth networks such as the ISDN. This paper discusses the way the database is set up (which involves conversion from an SGML representation into an ODA one), the methods of indexing, the access methods provided, and our user experience. We also discuss the motivation for many of our implementation choices. Because of ACS constraints, the CODA data cannot be made available outside the University of London.

When the CODA project started in 1991, UCL-CS was involved with the ESPRIT PODA projects on the use of ODA, while the CORE project was using no standard language for the representation of the text; hence the use of ODA was a natural choice for the CODA project. Later the ACS textual material became available in SGML form; even so, there are significant advantages in the use of ODA, which are discussed in the paper. For example, ODA is a blind open interchange format for which a number of converters are available, unlike SGML, in which the interchange is dependent on the DTD. Our decision to use the ODA formulation required an SGML-to-ODA converter.
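The dependence on the DTD can be illustrated with a toy sketch. It is not the CODA converter and does not produce real ODA/ODIF; it only shows that a converter needs a table, specific to one DTD, that maps element names onto generic logical roles. The element names (`article`, `heading`, `para`), the role labels, and the functions are all hypothetical, and Python's lenient `html.parser` stands in for a real SGML parser.

```python
from html.parser import HTMLParser

# Hypothetical mapping for ONE DTD: element name -> generic logical role.
# A different DTD would need a different table; that is the point of the sketch.
TAG_TO_ROLE = {
    "article": "composite",
    "heading": "basic",
    "para": "basic",
}


class LogicalTreeBuilder(HTMLParser):
    """Builds a nested dict tree of logical objects from tagged text."""

    def __init__(self):
        super().__init__()
        self.root = {"role": "root", "name": "document", "children": [], "text": ""}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        role = TAG_TO_ROLE.get(tag)
        if role is None:
            # Without DTD knowledge the element cannot be interpreted.
            raise ValueError(f"element <{tag}> is not covered by the mapping")
        node = {"role": role, "name": tag, "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the current object only if the end tag matches it.
        if len(self.stack) > 1 and self.stack[-1]["name"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["text"] += data.strip()


def convert(tagged_text):
    """Return the logical tree for a small tagged fragment."""
    builder = LogicalTreeBuilder()
    builder.feed(tagged_text)
    builder.close()
    return builder.root


tree = convert("<article><heading>Foo</heading><para>Bar</para></article>")
```

An element outside the table makes `convert` raise `ValueError`, which mirrors the need for DTD-specific knowledge on the SGML side of the conversion.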

Having an on-line database of scientific journals offers many advantages over conventional paper-based journals. Searching texts electronically is much easier than searching them manually; far more productive searching can be undertaken using a computer system. In our environment all the journals are indexed so that, despite the size of the database, searches are very fast. Electronic access provides additional advantages:

It is non-exclusive - any number of people can access the same journal simultaneously.
It is distributed; users can be remote from the database.
It can be integrated with the users' facilities, so that it is possible to extract information for other purposes - always subject, of course, to copyright and other constraints.

Our document database can be queried in a convenient manner, allowing users throughout the University to browse search results on-screen using a number of different tools. A portion of the data was provided originally in the same form as in the CORE project; now, the database is supplemented by transforming the whole data into the ODA/ODIF format, and making it available to the University chemists in that form. We are using a large set of the 1987-1994 ACS journals, providing a number of interfaces to access that data, including WAIS (a system based on the ANSI Z39.50 IS&R protocol) and two Bellcore tools: one for showing bit-map page images (Xpixlook) and the other a Hypertext Browser (SuperBooks).

At the start of the CODA project, the most sensible device for storing such a large database was an Optical Juke Box (JB); hence a 90 GB HP magneto-optical JB was acquired, to which a high-speed storage server with some 18 GB of disk space is attached as a front end. A reverse index of all the document text is held on the disk storage. For the whole of ten years of data this index occupies about 4 GB. All document text searching is done from the disk storage; the retrieval of the documents themselves is from the JB, which holds the documents in all forms. To assuage the worries of publishers, we have added various forms of integrity control, authentication and audit trails.
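The split between index searching on fast disk and document retrieval from the juke box can be sketched as follows. This is a minimal illustration under assumed names (`FastIndex`, `SlowStore`, `build`), not the CODA implementation: a small in-memory inverted index stands in for the reverse index on disk, and a dictionary with a fetch counter stands in for the JB.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FastIndex:
    """Stands in for the reverse index held on magnetic disk."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


class SlowStore:
    """Stands in for the juke box; counts how often it is touched."""

    def __init__(self, documents):
        self.documents = dict(documents)
        self.fetches = 0

    def fetch(self, doc_id):
        self.fetches += 1
        return self.documents[doc_id]


def build(documents):
    """Index every document and return (index, store)."""
    index = FastIndex()
    for doc_id, text in documents.items():
        index.add(doc_id, text)
    return index, SlowStore(documents)


docs = {
    "a1": "Synthesis of chiral ligands",
    "a2": "Ligands for palladium catalysis",
    "a3": "Thermal analysis of polymers",
}
index, store = build(docs)
hits = index.search("ligands")  # answered from the index alone
```

In this sketch a search never touches `SlowStore`; only an explicit `store.fetch(doc_id)` for a chosen hit does, which is the property the two-tier arrangement relies on.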