Using META Tag-Embedded Indexing for Fielded Searching of the Internet
Philip COOMBS <firstname.lastname@example.org>
Full-text searching on the Internet has run its course. A new approach adding fielded searching is vital to the effectiveness of information discovery and retrieval in the years ahead. This paper presents the results of one year of operation of a statewide government locator service employing indexing embedded in META tags, common attribute schema, and combined full-text and fielded searching applications. It provides evidence that author-indexed information is practical, viable, and powerful when embedded into the source files available on the Internet. This method has drawn interest and acclaim from governments and industry as it demonstrates the critical role META tags will play in the Internet of the next few years.
Public demand for access to government information has existed
throughout the years. However, with the creation of the Internet
and modern search and retrieval tools, the public has turned up
the pressure for agencies to make more information available.
Washington State created a Public Information Access Policy Task
Force in 1995 to determine the action needed to support public
discovery and retrieval of government information. The 1996 legislature
directed the creation of a pilot project to assist public access
of state and local electronic government information. Washington's
Government Information Locator Service (GILS) project began in
late 1996. The project staff considered conventional cataloging
and index capture techniques, but rejected them in favor of embedded
metadata in Web pages and the use of a common attribute schema.
Why did the project team reject traditional "full-text" indexing schemes?
To the majority of citizens around the world, the Internet presents the opportunity to explore an almost unlimited depth of facts, directions, multimedia presentations, and near-facts -- in short, information. That is the blessing of this powerful tool. But to these same people, the Internet will not yield its treasures without a struggle. The discovery of significant, relevant information requires the patience and precision of a surgical operation. Each day more characters, pixels, and sounds are added to the collective cache called the Internet. Each day, the challenge of separating relevant data from irrelevant data becomes more daunting. What was once an amusing game of homonyms, synonyms, and antonyms now foils our attempts to locate appropriate information resources on the Web. Given the open license authors have to express themselves using all available nouns and adjectives, it is truly remarkable if a searcher's chosen term matches a related concept in a document on the Internet. Yet this is the current strategy for discovering relevant information on the Web.
What tools are presently used by searchers?
The major search engines are categorized as directory, full-text, or abstracting.
"Directory search tools provide subject headings for navigation, usually created by a humans. No text is taken from the page for indexing; rather, the pages are examined and classified into a subject heading hierarchy. Examples include Yahoo and parts of Magellan, Excite and Lycos.
Full-text search tools index every word on every page of the database. Alta Vista and Open Text fall into this category. Full-text tools are not good for general subject searching, which is one of the greatest frustrations for users.
Abstracting search tools take a selected portion of the target site for indexing. These tools use some type of algorithm to select and index frequently used or prominent words on a page. Examples include Excite, Lycos, Magellan, Web Crawler, and Hotbot. They are good for general subject searching to generate clusters of related citations.(1) However derived, the index created by full-text and abstract tools is based upon a machine-aided compilation of searchable terms. They employ a concordance list of terms contained in documents discovered during robot "spidering" of Web sites and pages. Searchers must locate relevant sources by matching search terms to words contained in the concordance files.
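For illustration only, the following sketch (in Python, with invented URLs and page text) shows how a concordance-style index works and why it fails when the searcher's term never appears on a page: every word an author typed becomes a searchable term, and nothing else does.

import re
from collections import defaultdict

# Two hypothetical pages; the URLs and text are invented for this example.
pages = {
    "http://www.wa.gov/roads.html": "Street repair schedules for county roads",
    "http://www.wa.gov/health.html": "Immunization clinics and public health contacts",
}

# Build the concordance: each term maps to the set of pages containing it.
concordance = defaultdict(set)
for url, text in pages.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        concordance[term].add(url)

print(concordance["repair"])    # a hit: the author happened to use this word
print(concordance["pothole"])   # empty: a synonym the author never typed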
How accurate is this process?
Many studies have evaluated resource searching on the Internet, concluding that even under the best conditions, locating the most timely, accurate, and relevant resources is difficult. Most searchers will not use all the tools available, such as Boolean syntax or multiple, related terms in their query. "Rarely or inconsistently used keywords, for example, may turn up only a few hits, while search criteria that are too broadly defined can return cumbersome heaps of hits" (2). This leaves the searcher to "refine" a large result list further by choosing search terms and observing outcomes -- not a pretty process. And search precision is not always guaranteed. In one study, typical of discovery and retrieval analysis, researchers found that at least one quarter of searchers did not appear to retrieve useful citations.
Thankfully, other search methodologies and tools exist. One has been in existence for more than a century -- the card catalog system in libraries. A catalog contains various attributes and values to abstractly describe the contents of a document, media product, work of art, or other physical object. The catalog concept is based on the use of specific fields and a discipline or structure for assigning values to them. More precise searching is possible using multiple indexed fields to pinpoint sources, in a method similar to triangulation in navigation. By selecting complementary or contrasting index terms, a searcher can use the power of Venn subsets to include and exclude clusters of subjects and specific words. Searches using a combination of terms such as title, subject, and author quickly narrow the results and reduce irrelevant citations (3).
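A brief sketch of this "triangulation" follows; the records and field names are invented for the example and are not drawn from any actual catalog. Each field yields its own result set, and intersecting the sets narrows the citations far faster than a single keyword match.

# Hypothetical catalog records with a few indexed fields.
catalog = [
    {"id": 1, "title": "Ferry schedules",        "subject": "transportation", "author": "Dept. of Transportation"},
    {"id": 2, "title": "Road construction news", "subject": "transportation", "author": "Dept. of Transportation"},
    {"id": 3, "title": "Ferry tales anthology",  "subject": "arts",           "author": "Arts Commission"},
]

def matches(field, value):
    """Return the ids of records whose field contains the value (case-insensitive)."""
    return {r["id"] for r in catalog if value.lower() in r[field].lower()}

# "ferry" alone is ambiguous; adding a subject term excludes the unrelated title.
print(matches("title", "ferry"))                                          # {1, 3}
print(matches("title", "ferry") & matches("subject", "transportation"))   # {1}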
Could this concept be applied to the Internet?
Creating a catalog of information on the Internet has become a goal for many organizations and individuals on the planet. Several significant issues challenge our achievement of this objective.
The card catalog works well in the environment of printed material or media because the summary or metadata is separate from the object described, thus ensuring the index remains intact while the object travels, i.e., is checked out of the library. In the world of electronic objects, however, such remoteness becomes problematic. The contents of a referenced object can change through revision or wholesale replacement. The object can be moved from one Internet address to another, rendering a URL citation obsolete. We thought the remedy was to dynamically link an abstract or index record to its referenced source.
The process of creating an abstract of a source of information has evolved into a profession in libraries, complete with a body of knowledge and internationally accepted standards. Consistency and accuracy in cataloging ensure reliable discovery and retrieval. The primary impediment to cataloging the Internet is the training and experience needed to comply with the standards; few webmasters and authors have training in cataloging principles. But accuracy is a relative term.
Could the concept of "good enough" be applied to an Internet index?
Fielded searching is most powerful when there are limits to the range of possible values. This is accomplished using a controlled vocabulary, where options can be reasonably limited to a commonly accepted set of values. While this increases the probability of a match during searching, it likewise limits the cataloger to the values available. Given the scope of topics potentially encountered on the Internet, any controlled vocabulary covering subjects or themes must be fairly comprehensive. Existing authority sets such as the Library of Congress classification system or the Dewey Decimal system present a daunting array of choices, far too many for the untrained cataloger. Indeed, for Internet use and general public searching, a simpler vocabulary is needed.
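As a small illustration (the subject list below is invented, not the actual GILS vocabulary), a controlled vocabulary lets software separate acceptable values from free terms that a cataloger would need to map:

# Hypothetical controlled subject vocabulary.
CONTROLLED_SUBJECTS = {"public safety", "transportation", "health", "education", "environment"}

def validate_subjects(proposed):
    """Split proposed subject values into accepted terms and terms needing review."""
    accepted, rejected = [], []
    for term in proposed:
        (accepted if term.lower() in CONTROLLED_SUBJECTS else rejected).append(term)
    return accepted, rejected

ok, review = validate_subjects(["Transportation", "street maintenance"])
print(ok)       # ['Transportation'] -- matches the controlled list
print(review)   # ['street maintenance'] -- a free term that must be mapped to the list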
A related issue is the challenge of segmenting a seemingly continuous spectrum of topics into a discrete list that can answer the searcher's question. For example, a city is distinct from a county, except when they share certain services such as police or street maintenance. A Web site covering local government street repairs might therefore be cataloged under both jurisdictions. This kind of distinction challenges a cataloger of Internet resources.
If an Internet resource is to be searched and discovered using pattern matching on descriptive words, the choice of the cataloged word or words is critical. But the choice is not limited to words used in the text of the resource: descriptive terms that never appear in the text can still characterize it well. This distinguishes the abstracting process from the full-text indexing process. The most effective index uses specific but commonly used words. Use of cataloger-assigned values rather than a concordance list of words found in a resource allows additional choices, but also, perhaps, an element of subjectivity.
Could non-trained catalogers do an adequate job of choosing keyword and subject terms?
Initially, the GILS Project Team considered using professional catalogers to create the catalog of Washington state and local government Internet Web sites and pages. This quickly became a tedious process with no realistic prospect of adequate scope and depth of coverage; something was needed to accelerate the pace of cataloging. Additionally, the catalogers needed time to become familiar with the content and purpose of each resource, and much of the product of their effort ultimately went unused. The most accurate assignment of index values comes from the author. However, providing summary information prior to publishing was not an established discipline.
Would the originators participate substantially in cataloging their information to satisfy index requirements?
Once a dynamic resource like a Web site is cataloged, the work is not done. Each revision to the resource may justify an update to the related index or abstract record. It is often an effort to get an Internet resource initially cataloged, let alone to update the index on a regular or as-needed basis.
Was there a way to maintain accuracy with minimum effort?
Once a resource has been cataloged, the next issue is how to collect the summary records into a searchable database. Traditional manual capture mechanisms employ forms or CGI scripts to direct the cataloger in the reporting process. The information is then transmitted (or worse, rekeyed) into a file server to manage the sorting and retrieval. This is the present process of U.S. federal government agencies complying with the federal GILS program, and it has spawned multiple databases over scores of Internet-accessible servers. Specially trained GILS coordinators within federal agencies promote the inclusion of "significant" government information and assist program staff in completing the GILS index record. While this approach has built a few thousand quality records, "[federal] GILS implementation has not achieved the vision of a virtual card catalogue of government information nor have the majority of agency GILS implementations matured into a coherent and usable government information locator service" (4). The project team wanted to avoid a similar outcome in Washington state.
Initially, the Washington State GILS Project followed this federal methodology. Within six months, however, it was apparent that no significant volume of resources would be collected using this approach. Competing information sources, such as the major Internet search engines using full-text or abstract index spidering, could easily create more records than the much smaller GILS databases. To acquire sufficient records for depth of coverage, a blend of full-text (concordance) file spidering and specific attribute-value spidering was needed.
During the development of HTML, the standard made a provision
for attribute-value pairs in the <HEAD> portion of the page.
Initially, it was used primarily for browser control and limited
content description. While the variety of uses continued to grow
and evolve (5), the META tag remained a simple, standards-based place to embed descriptive attribute-value pairs directly in the page.
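A minimal sketch of the mechanism appears below. The element names (the "DC." prefixes) and the page content are assumptions made for this example, not the project's actual schema or harvesting software; the point is only that attribute-value pairs placed in the <HEAD> can be read back reliably by a program.

from html.parser import HTMLParser

# Hypothetical page head carrying index values in META tags.
PAGE_HEAD = """
<head>
  <title>Shellfish Safety Information</title>
  <meta name="DC.title"    content="Shellfish Safety Information">
  <meta name="DC.creator"  content="Department of Health">
  <meta name="DC.subject"  content="health; shellfish; food safety">
  <meta name="description" content="Beach closures and safe harvesting advice.">
</head>
"""

class MetaReader(HTMLParser):
    """Collects name/content pairs from META tags in an HTML head."""
    def __init__(self):
        super().__init__()
        self.fields = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.fields[a["name"].lower()] = a["content"]

reader = MetaReader()
reader.feed(PAGE_HEAD)
print(reader.fields["dc.subject"])   # -> "health; shellfish; food safety"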
The project staff asked, "Could this feature work for wholesale cataloging of state and local government Web pages?"
It was not enough to extract metadata about Internet-based resources. Search engines lacked the ability to discover and index non-Internet resources such as printed publications and personal contacts. Further, many Web pages did not provide adequate contact information for the reader to pursue. The more robust attribute set used by the GILS program supported entry of additional useful data and contact information to aid searchers. We concluded that indexing both Internet and non-Internet resources was needed to assure that citizens could both find and retrieve information.
Many server applications have been created to index Web sites
and present a database on the Internet for browser access. Only
a few applications, however, are available that support harvesting of embedded META tag values into a searchable index.
Did such a tool exist?
Were it not for the discovery and application of a commercial server product able to harvest META tag values during spidering (the Netscape software described below), the answer would have been no.
The Washington State GILS Project staff initially populated the
test database with "stand-alone" GILS records.
By July 1997, a few state and local government agencies had volunteered to embed META tag index values in their own Web pages.
Since it would be impractical to ask agencies to "tag" every existing page, the spider also gathered a full-text (concordance) index from untagged pages.
The final challenge was to provide general full-text searching (popularly referred to as "keyword" searching) and more specific attribute searching of the GILS database. With modifications to the Netscape product's searching algorithm, the final process presented both options to the searcher (8).
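A hedged sketch of the combined behavior follows. The records, field names, and matching rules are simplified assumptions for illustration, not the Netscape product's actual algorithm; it only shows how a keyword test over harvested text can be combined with tests on harvested META fields.

# Hypothetical harvested records: full text plus META-derived fields.
records = [
    {"url": "http://www.wa.gov/doh/shellfish.html",
     "text": "beach closures and safe harvesting advice",
     "meta": {"subject": "health", "originator": "Department of Health"}},
    {"url": "http://www.wa.gov/dot/ferries.html",
     "text": "ferry schedules and terminal locations",
     "meta": {"subject": "transportation", "originator": "Department of Transportation"}},
]

def search(keyword=None, **fields):
    """Return URLs matching an optional keyword plus any number of META fields."""
    hits = []
    for r in records:
        if keyword and keyword.lower() not in r["text"].lower():
            continue
        if any(r["meta"].get(f, "").lower() != v.lower() for f, v in fields.items()):
            continue
        hits.append(r["url"])
    return hits

print(search(keyword="closures"))                     # full-text ("keyword") search only
print(search(keyword="closures", subject="health"))   # keyword combined with a fielded term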
To address non-Internet resources, the "stand-alone"
GILS record was promoted. Using a Microsoft Word 6.0 template/macro file distributed on a floppy disk, agency staff created HTML pages that contained both the hidden META tag index values and visible descriptive text for the reader.
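The sketch below suggests how such a stand-alone record might be generated; the field names, layout, and sample values are illustrative assumptions, not the format produced by the agencies' Word template. The index values ride in META tags while a visible description serves the human reader.

# Hypothetical template for a stand-alone record describing a non-Internet resource.
RECORD_TEMPLATE = """<html>
<head>
  <title>{title}</title>
  <meta name="DC.title"   content="{title}">
  <meta name="DC.creator" content="{originator}">
  <meta name="DC.subject" content="{subject}">
</head>
<body>
  <h1>{title}</h1>
  <p>{abstract}</p>
  <p>Contact: {contact}</p>
</body>
</html>"""

page = RECORD_TEMPLATE.format(
    title="County Burn Ban Hotline",
    originator="Example County Fire Marshal",
    subject="public safety",
    abstract="A printed flyer and telephone hotline listing current burn restrictions.",
    contact="(360) 555-0100",
)
print(page)   # an HTML page a spider can index even though the resource itself is not online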
After a year in operation, the project evaluated the success of the META tag indexing approach along two lines: the improved precision of searching Internet pages, and agency participation in cataloging their resources using embedded index values.
At no time during the project did the staff receive complaints from government agency staff about the effort required to apply META tags. Author- or webmaster-generated index values also proved workable: webmaster or author assignment of values to fields was judged "good enough" to support fielded searching.
For uncontrolled fields, the government agency webmaster or author assigned many terms, even some not otherwise included in the text. Particularly important was the choice of "also known as" values, such as in the case of the popular name for a statutory program or law. This factor has great potential for enhancing search and retrieval success for the average person. By contrast, searching full-text or abstract indexing search engines using popular terms not found in the concordance file would fail to locate resources unless the engine employed an extensive, "street-wise" external thesaurus.
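A small illustration of the point, with invented values: the popular name appears only in an author-supplied META field, never in the page text, yet a search on that popular name still succeeds.

# Hypothetical record: the page text never uses the popular name "motor voter".
record = {
    "url": "http://www.wa.gov/example-statute.html",
    "text": "State election law governs voter registration procedures.",
    "meta": {"keywords": "voter registration; motor voter"},
}

query = "motor voter"
in_text = query in record["text"].lower()
in_meta = query in record["meta"]["keywords"].lower()
print(in_text, in_meta)   # False True -- concordance matching misses; the fielded value hits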
For controlled fields, such as subject and government type, the government agency catalogers had to choose between imprecise terms. However, they appeared to choose values that closely matched those that citizens would use in searching. In several cases, they assigned the most comprehensive set of values. For example, on one page (11) a Department of Health webmaster assigned nine different subjects from the most detailed level of the GILS-controlled vocabulary (12) to ensure a good cross-reference and discovery process.
Discovery of relevant pages on the Internet is significantly improved using schema-controlled attribute-value indexes. When used by a searcher, the power of "triangulation" between complementary terms in fields can pinpoint the needed information far better than "keyword" (concordance list) searching.
During tests of the GILS database, this advantage was readily apparent.
Establishing a self-maintained environment for author or webmaster submitted index values requires an organizational commitment, structure, and human effort. The project team addressed the following issues during program development:
The most challenging task of the project was to get the word out. Promotional efforts included the following:
Agency Guide to GILS pamphlets were distributed to explain
the GILS mission and the role agencies could play in creating the locator service.
Flyers were distributed to citizens' organizations explaining how the public could benefit and how they could access the GILS site.
References to the GILS search engine were published in the 40,000 copies of the Citizens Guide to Locating Government Information, which were distributed to government counters, libraries, and newspaper offices statewide.
The enabling legislation directed the State Library Commission
to establish content-related standards for common formats and
agency indexes for state agency-produced information (13).
This was an important step, one that established the validity of the project's indexing approach.
It was essential to build a network of organizations willing to
promote and support the increased benefit of META tag indexing.
The GILS Project staff assisted several agencies and jurisdictions
with organizing and indexing their Web content. Patrons visiting
the GILS search site were presented with an option to send e-mail
to the library for assistance if they could not find the information
they needed. This service is a rare but increasingly important
aspect of large search services. Incentives for agency participation were also important.
The value of using HTML META tags had to be made visible to participating agencies. Recognition is a prime motivator in gaining acceptance of new concepts, and several agencies demonstrated support and commitment early in the project.
Other inducements for volunteer organization participation during development included:
Additional "after-the-fact" analysis applications are under development to check the quality of metadata during spider and import processes.
Library staff members, on an "as needed" basis, do some quality assurance over submitted content.
A smaller Focus Group was formed from the larger stakeholder organization to provide direct guidance to the development effort.
"Hit" statistics are available to agencies to analyze the popularity of their Web pages. Also, a "mailto:" function is provided on every result list page. Many comments are submitted by searchers and are forwarded to the appropriate agencies for action.
The GILS Project was accomplished with minimal financial expense. About $200,000 U.S. per year was spent on all objects of expenditure. The project team employed one full-time project manager, two half-time senior librarians, and one part-time technician providing webmaster, programmer, and systems administrator support. This team ran the project for the 24 months of development (July 1996 through June 1998). The project became a permanent program of the state in July 1998.
Additional volunteer assistance came from state and local government agency staff, private sector professionals, and citizens of the state.
The GILS project is using (as of February 1998) Netscape's Compass Server software (version 3.0) (17).
This is running on a Pentium Pro-based server with 128 MB of RAM and two hot-swappable 4 GB drives (mirrored). The operating system is Microsoft Windows NT, version 4.0. The server is connected to the Internet through an Ethernet connection to the Internet service provider (18).
There will always be naysayers and doomsday prophets for all new ideas. Commitment to explore, err, and triumph was needed for the project to succeed. This included legislative willpower, political influence, and risk-taking. Substantial effort was expended to advise, influence, promote, encourage, and support government leaders involved with such a large endeavor.
As the project phase drew to a close, promotional effort increased. With mechanical and procedural issues fairly resolved, alerting the public to this search tool has become the primary mission. An aggressive promotional campaign is underway. The GILS Web site and service must be advertised just like any other Internet service competing for customers. While it remains a free tool, state funds are expended and must be justified. Success will continue to be measured by popular interest and use of the tool.
On the drawing boards are interactive applications to check META tag content as authors create and submit pages.
GILS uses software originally designed for intranet applications. While there are similarities, searching the Internet presents more challenges. Volumes of resource records in the millions are expected. Spidering software must be very efficient and quick. Graphical interfaces must be flexible as requirements change frequently (e.g., non-English GUIs).
Finally, the structure and rules governing the Internet are constantly
changing. The work that was begun on our site using HTML as a
vehicle for catalog information will yield to the new architecture
of XML and RDF (19). Conversion to the next
standard is expected, but Washington state believes the transition
will be easier because it has invested in structured, embedded metadata.
1. Nicholson, Scott. "Indexing and Abstracting on the World Wide Web: An Examination of Six Web Databases," Information Technology and Libraries (June 1997): 73-81.
2. Joss, Molly, and Wszola, Stanley. "Search Engines that Can," CD-ROM Professional (June 1996): 31-48.
3. Ferl, Terry Ellen, and Millsap, Larry. "The Knuckle-Cracker's Dilemma," Information Technology and Libraries (June 1996): 81-91.
4. Moen, William, and McClure, Charles. An Evaluation of the Federal Government's Implementation of the Government Information Locator Service (GILS): Final Report. U.S. Government Printing Office (Stock Number 022-003-01190-1), 1997. Also available at http://www.unt.edu/slis/research/gilseval/gilsdocs.htm.
9. Su, Louise. "Value of Search Results as a Whole as a Measure of Information Retrieval Performance," ASIS '96: Proceedings of the 59th ASIS Annual Meeting 33 (1996): 226-237.
(View the HTML source code of this page to see the META tag index values embedded in it.)