[INET'98] [ Up ][Prev][Next]

Using META Tag-Embedded Indexing for Fielded Searching of the Internet

Philip COOMBS
Washington State Library


Full-text searching on the Internet has run its course. A new approach adding fielded searching is vital to the effectiveness of information discovery and retrieval in the years ahead. This paper presents the results of one year of operation of a statewide government locator service employing indexing embedded in META tags, common attribute schema, and combined full-text and fielded searching applications. It provides evidence that author-indexed information is practical, viable, and powerful when embedded into the source files available on the Internet. This method has drawn interest and acclaim from governments and industry as it demonstrates the critical role META tags will play in the Internet of the next few years.


Background to Washington State's cataloging program

Public demand for access to government information has existed throughout the years. However, with the creation of the Internet and modern search and retrieval tools, the public has turned up the pressure for agencies to make more information available. Washington State created a Public Information Access Policy Task Force in 1995 to determine the action needed to support public discovery and retrieval of government information. The 1996 legislature directed the creation of a pilot project to assist public access of state and local electronic government information. Washington's Government Information Locator Service (GILS) project began in late 1996. The project staff considered conventional cataloging and index capture techniques, but rejected them in favor of embedded metadata in Web pages and use of META tag-sensitive harvesting robots.

Why did the project team reject traditional "full-text" indexing schemes?

Evaluation of current Web indexing

Derived indexes

To the majority of citizens around the world, the Internet presents the opportunity to explore an almost unlimited depth of facts, directions, multimedia presentations, and near-facts -- in short, information. That is the blessing of this powerful tool. But to these same people, the Internet will not yield its treasures without a struggle. The discovery of significant, relevant information requires the patience and precision of a surgical operation. Each day more characters, pixels, and sounds are added to the collective cache called the Internet. Each day, the challenge of separating relevant data from irrelevant data becomes more daunting. What was once an amusing word game of homonyms, synonyms, and antonyms now foils our attempts to locate appropriate information resources on the Web. Given the open license authors have to express themselves using all available nouns and adjectives, it is truly remarkable when a searcher's chosen term matches a related concept in a document on the Internet. Yet this is the current strategy for discovering relevant information on the Web.

What tools are presently used by searchers?

The major search engines are categorized as directory, full-text, or abstracting.

Directory search tools provide subject headings for navigation, usually created by humans. No text is taken from the page for indexing; rather, the pages are examined and classified into a subject heading hierarchy. Examples include Yahoo and parts of Magellan, Excite, and Lycos.

Full-text search tools index every word on every page of the database. Alta Vista and Open Text fall into this category. Full-text tools are not good for general subject searching, which is one of the greatest frustrations for users.

Abstracting search tools take a selected portion of the target site for indexing. These tools use some type of algorithm to select and index frequently used or prominent words on a page. Examples include Excite, Lycos, Magellan, WebCrawler, and HotBot. They are good for general subject searching to generate clusters of related citations (1).

However derived, the index created by full-text and abstracting tools is based upon a machine-aided compilation of searchable terms. These tools employ a concordance list of terms contained in documents discovered during robot "spidering" of Web sites and pages. Searchers must locate relevant sources by matching search terms to words contained in the concordance files.

How accurate is this process?

Many studies have evaluated resource searching on the Internet, concluding that even under the best conditions, locating the most timely, accurate, and relevant resources is difficult. Most searchers will not use all the tools available, such as Boolean syntax or multiple related terms in their query. "Rarely or inconsistently used keywords, for example, may turn up only a few hits, while search criteria that are too broadly defined can return cumbersome heaps of hits" (2). This leaves the searcher to "refine" a large result list further by choosing search terms and observing outcomes -- not a pretty process. And search precision is not guaranteed. In one study, typical of discovery and retrieval analysis, researchers found that at least one quarter of searchers did not appear to retrieve useful citations.

Attribute-value indexing

Thankfully, other search methodologies and tools exist. One has been in existence for more than a century -- the card catalog system in libraries. A catalog contains various attributes and values that abstractly describe the contents of a document, media product, work of art, or other physical object. The catalog concept is based on the use of specific fields and a discipline or structure for assigning values to them. More precise searching is possible using multiple indexed fields to pinpoint sources, in a method similar to triangulation in navigation. By selecting complementary or contrasting index terms, a searcher can use the power of Venn subsets to include and exclude clusters of subjects and specific words. Searches using a combination of terms such as title, subject, and author quickly narrow the results and reduce irrelevant citations (3).
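The Venn-subset narrowing described above can be sketched in miniature. In this hypothetical example, field names, values, and record IDs are all invented for illustration; the actual GILS schema and software differed:

```python
# Minimal sketch of fielded ("triangulated") searching: each field maps a
# value to the set of record IDs carrying that value, and a multi-field
# query intersects those sets. All names and IDs here are hypothetical.
catalog = {
    "subject": {"public assistance": {1, 3}, "streets": {2}},
    "jurisdiction": {"city": {2, 3}, "state": {1}},
}

def fielded_search(index, **criteria):
    """Return record IDs matching every field=value criterion."""
    results = None
    for field, value in criteria.items():
        hits = index.get(field, {}).get(value, set())
        results = hits if results is None else results & hits
    return results or set()

# Either field alone matches two records; intersecting them pinpoints one.
print(fielded_search(catalog, subject="public assistance", jurisdiction="city"))
# → {3}
```

Each additional field in the query acts like another bearing in a triangulation: the candidate set can only shrink, which is precisely why multi-field queries reduce irrelevant citations.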

Could this concept be applied to the Internet?

Challenges in cataloging the Internet

Creating a catalog of information on the Internet has become a goal for many organizations and individuals on the planet. Several significant issues challenge our achievement of this objective.

Mobility of resources

The card catalog works well in the environment of printed material or media because the summary or metadata is separate from the object described, ensuring the index remains intact while the object travels, i.e., is checked out of the library. In the world of electronic objects, however, such remoteness becomes problematic. The contents of a referenced object can change through revision or wholesale replacement. The object can be moved from one Internet address to another, rendering a URL citation obsolete. We concluded the remedy was to dynamically link an abstract or index record to its referenced source.

Body of knowledge required

The process of creating an abstract of a source of information has evolved into a profession in libraries, complete with a body of knowledge and internationally accepted standards. Consistency and accuracy in cataloging ensure reliable discovery and retrieval. The primary impediment to cataloging the Internet is the training and experience needed to comply with the standards. But accuracy is a relative term. Few webmasters and authors have training in cataloging principles.

Could the concept of "good enough" be applied to an Internet index?

Use of a controlled vocabulary

Fielded searching is most powerful when there are limits to the range of possible values. This is accomplished using a controlled vocabulary, where options can be reasonably limited to a commonly accepted set of values. While this increases the probability of a match during searching, it likewise limits the cataloger to the values available. Given the scope of topics potentially encountered on the Internet, any controlled vocabulary covering subjects or themes must be fairly comprehensive. Existing authority sets such as the Library of Congress classification system or the Dewey Decimal system present a daunting array of choices, far too many for the untrained cataloger. Indeed, for Internet use and general public searching, a simpler vocabulary is needed.

Classifying resources

A related issue is the challenge of segmenting a seemingly continuous spectrum of topics into a discrete list that can answer the searcher's question. For example, a city is uniquely different from a county, except when they share certain services such as police or street maintenance. Therefore, a Web site covering local government street repairs might be cataloged into both jurisdictions. This distinction challenges a cataloger of Internet resources.

Keyword choice

If an Internet resource is to be searched and discovered using pattern matching on descriptive words, the choice of the cataloged word or words is critical. But the choice is not limited to words used in the text of the resource. Many apt descriptive terms that never appear in the text of a resource can still be used to describe it properly. This distinguishes the abstracting process from the full-text indexing process. The most effective index uses specific but commonly used words. Use of cataloger-indexed values rather than a concordance list of words found in a resource allows additional choices, but perhaps also introduces an element of subjectivity.

Could non-trained catalogers do an adequate job of choosing keyword and subject terms?

Contributor participation

Initially, the GILS Project Team considered using professional catalogers to create the catalog of Washington state and local government Internet Web sites and pages. This quickly became a tedious process with no real chance of achieving valid scope and depth of coverage, and the product of the catalogers' effort went largely unused; something was needed to accelerate the pace of cataloging. The catalogers also needed time to become familiar with the content and purpose of each resource. The most accurate assignment of index values comes from the author. However, providing summary information prior to publishing was not an established discipline.

Would the originators participate substantially in cataloging their information to satisfy index requirements?

Maintaining currency

Once a dynamic resource like a Web site is cataloged, the work is not done. Each revision to the resource may justify an update to the related index or abstract record. It is often an effort to get an Internet resource initially cataloged, let alone, to update the index on a regular or as-needed basis.

Was there a way to maintain accuracy with minimum effort?

Collecting significant numbers of catalog records

Once a resource has been cataloged, the next issue is how to collect the summary records into a searchable database. Traditional manual capture mechanisms employ forms or CGI scripts to direct the cataloger in the reporting process. The information is then transmitted (or worse, rekeyed) into a file server to manage the sorting and retrieval. This is the present process of U.S. federal government agencies complying with the federal GILS program, and it has spawned multiple databases over scores of Internet-accessible servers. Specially trained GILS coordinators within federal agencies promote the inclusion of "significant" government information and assist program staff in completing the GILS index record. While this approach has built a few thousand quality records, "[federal] GILS implementation has not achieved the vision of a virtual card catalogue of government information nor have the majority of agency GILS implementations matured into a coherent and usable government information locator service" (4). The project team wanted to avoid a similar outcome in Washington state.

Initially, the Washington State GILS Project followed this federal methodology. However, within six months, it was apparent that no significant volume of resources would be collected using this approach. Other competing information sources, such as the major Internet search engines using full-text or abstract index spidering, could easily create more records than the much smaller GILS databases. To acquire sufficient records for depth of coverage, a blend of full-text (concordance) file spidering and specific attribute-value spidering was needed.

META tags in HTML

During the development of HTML, the standard made provision for attribute-value pairs in the <HEAD> portion of the page. Initially, these were used primarily for browser control and limited content description. While the variety of uses has continued to grow and evolve (5), META tags have been mostly ignored as a vehicle for robust indexing of Web information. Several major search engine tools, such as Alta Vista, do assign a higher relative weight to text found in META tags. But there is little encouragement for webmasters to use META tags for more than simple "keyword" and "description" attributes.
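As a concrete illustration, a META tag-sensitive robot needs only to read attribute-value pairs from the <HEAD>. Below is a minimal sketch using Python's standard html.parser; the attribute names and page content are illustrative, not the exact Washington schema:

```python
# Sketch of META tag harvesting with Python's standard html.parser.
# The page and attribute names below are invented for illustration.
from html.parser import HTMLParser

PAGE = """<html><head>
<title>Street Maintenance</title>
<meta name="keywords" content="streets, potholes, repair">
<meta name="description" content="City street maintenance services">
<meta name="originator" content="Department of Public Works">
</head><body>...</body></html>"""

class MetaHarvester(HTMLParser):
    """Collect name/content pairs from META tags into a field dictionary."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.fields[a["name"]] = a["content"]

harvester = MetaHarvester()
harvester.feed(PAGE)
print(harvester.fields["keywords"])  # → streets, potholes, repair
```

Because the values arrive already labeled by field name, the spider can load them directly into a fielded index with no rekeying, which is the property the project relied upon.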

The project staff asked, "Could this feature work for wholesale cataloging of state and local government Web pages?"

Cataloging non-Internet resources

It was not enough to extract metadata about Internet-based resources. Search engines lacked the ability to discover and index non-Internet resources such as printed publications and personal contacts. Further, many Web pages did not provide adequate contact information for the reader to pursue. The more robust attribute set used by the GILS program supported entry of additional useful data and contact information to aid searchers. We concluded that indexing both Internet and non-Internet resources was needed to assure that citizens could both find and retrieve information.

Indexing software

Many server applications have been created to index Web sites and present a database on the Internet for browser access. Only a few applications, however, support harvesting META tag values for fielded searching, and those are marketed toward intranet use. For the GILS project to succeed, the application used had to harvest META tags from 2,700+ state and local government Web sites on 500+ separate servers.

Did such a tool exist?

The Washington state experience

Were it not for the discovery and application of META tags, Washington state's locator service project would have ended in 1996 for lack of a suitable capture and index vehicle for catalog data. The project became invigorated with the discovery that attribute values embedded in the Web source page addressed many of the Internet cataloging challenges.

Pilot project experience with META tags

The Washington State GILS Project staff initially populated the test database with "stand-alone" META tagged HTML pages. Each record carried the full GILS attribute set in the META tag text strings and replicated the values in the <BODY> in visible text (6). This allowed each page to act as a resource descriptor, similar to the federal GILS record. Netscape's Catalog Server software was purchased, configured, and deployed on an NT-based platform to spider government Web servers, capture the META tag values and import them into an Internet-accessible, browser-searchable database. This eliminated any additional keying or conversion for data capture.

By July 1997, a few state and local government agencies had volunteered to "META tag" major nodes in their Web sites for the project. Over 300 HTML pages were indexed using Washington state's version (7) of the federal GILS attribute set. The Netscape Catalog Server software, though initially designed for an intranet environment, had demonstrated an ability to build an index of Internet resources.

Since it would be impractical to ask agencies to "META tag" all pages on their sites, major nodes were targeted first. This meant that the majority of information on the Internet would not be immediately searchable using the GILS attribute set. This negatively impacted the measure of "recall" of information from agency servers. So, to satisfy the objective of increasing access to the information, full-text review and indexing (concordance) were also applied during META tag spidering. The combined index contained full-text indexing for all server pages in addition to the specific GILS attribute set indexing for the META tagged pages.

The final challenge was to provide general full-text searching (popularly referred to as "keyword" searching) and more specific attribute searching of the GILS database. With modifications to the Netscape product's searching algorithm, the final process presented both options to the searcher (8).
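The combined approach described above can be sketched as a concordance over every page plus optional narrowing by META attributes for the tagged subset. All data and field names in this example are invented for illustration; the Netscape product's actual algorithm is not shown here:

```python
# Sketch of a combined index: full-text (concordance) recall over all
# pages, with fielded precision available for META-tagged pages only.
# Page contents and field names are hypothetical.
pages = {
    1: {"text": "city street repair schedule", "meta": {"jurisdiction": "city"}},
    2: {"text": "county street maintenance report", "meta": {}},  # not tagged
}

# Build the concordance: word -> set of page IDs containing it.
concordance = {}
for pid, page in pages.items():
    for word in page["text"].split():
        concordance.setdefault(word, set()).add(pid)

def search(word, **fields):
    """Full-text match, optionally narrowed by META attribute values."""
    hits = concordance.get(word, set())
    for name, value in fields.items():
        hits = {p for p in hits if pages[p]["meta"].get(name) == value}
    return hits

print(search("street"))                       # → {1, 2}  (full recall)
print(search("street", jurisdiction="city"))  # → {1}     (fielded precision)
```

This mirrors the trade-off the project accepted: untagged pages remain discoverable through keywords, while tagged node pages gain the precision of attribute matching.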

To address non-Internet resources, the "stand-alone" GILS record was promoted. Using a Microsoft Word 6.0 template/macro file distributed on a floppy disk, agency staff created HTML pages that contained both the hidden META tags and a visible array of the field values in the <BODY>. Many such records were written and loaded onto government servers to await the visit of the GILS spider and its capture of the values.
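The effect of such a stand-alone record can be sketched as a small generator that emits the same values twice: once as META tags for the spider, once as visible body text for the human reader. The field names and values here are illustrative, not the actual GILS schema or the Word template's output:

```python
# Sketch of generating a "stand-alone" record page: identical attribute
# values appear as hidden META tags and as visible <BODY> text.
# Field names and values are hypothetical.
def make_record_page(fields):
    """Build an HTML page duplicating each field in <HEAD> and <BODY>."""
    metas = "\n".join(
        f'<meta name="{name}" content="{value}">' for name, value in fields.items()
    )
    rows = "\n".join(
        f"<p><b>{name}:</b> {value}</p>" for name, value in fields.items()
    )
    return f"<html><head>\n{metas}\n</head><body>\n{rows}\n</body></html>"

record = make_record_page({
    "title": "Well Drilling Permits",
    "originator": "Department of Ecology",
})
# The same value is now both spider-harvestable and reader-visible.
print('<meta name="title" content="Well Drilling Permits">' in record)  # → True
```

Duplicating the values is what lets one file serve both audiences: the spider indexes the META tags while a citizen who lands on the page directly still sees the contact and description information.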

Comparison of full-text versus META tagged index performance

After a year in operation, the project evaluated the success of using META tags to carry index data. Some general observations are drawn:

Using the full text search feature

  1. Simple text-matching searches appeal to the public because the average citizen searching for government information will not use advanced search tools such as Boolean syntax and synonym sets.
  2. The most effective searches, as judged by the searcher, are those that return relevant resources. This has been the finding of many studies, such as the series of University of Pittsburgh studies on information retrieval performance (9). While the full-text search often returned many "hits," relevancy depended upon the quality of the choice of search terms and the range of possible meanings of those terms. For example, a search on "welfare" produced more irrelevant hits than "public assistance." The choice of terms needs to match the searcher's knowledge, or even (inaccurate) understanding, of the information they seek.
  3. Full-text searching provided the greatest recall of potential resource records. Initially, the volume of "hits" was acceptable; once the index grew to over 100,000 records, however, the additional pages of hits became unproductive.

Using META tag searching

  1. The number of Web pages carrying the GILS META tag index has grown, but is still less than 10% of the total state and local government pages spidered. While the volume is a concern, the percentage of cataloged records is growing strongly: over 100 new government pages are META tagged weekly.
  2. The majority of META tagged pages are "node" points in a government Web site. This means the precision of fielded searching is applied to connect searchers to branching or launch points, not the final files or documents.
  3. The use of values applied against specific fields produces exact matches. For example, the search of a subject within the jurisdiction of "city" returns the set of records only involving city governments. The only factor of error was in the accuracy of what values were initially assigned during cataloging.
  4. At first, webmasters, public information officers, and program managers were curious or uncertain about the GILS request to add index data to major Web pages on their site. After some promotion and training, it became a non-issue for them. In fact, many webmasters expressed interest in the concept and willingly provided leadership within their agencies to get pages "tagged."
  5. The assignment of values to the attribute fields was performed by agency webmasters or document authors. Errors in judgment or a lack of understanding of the content of a page can generate gross inaccuracies in cataloging. However, while some errors in content or syntax were discovered, the quality reviews performed by librarians validated the overall excellent work accomplished by the government agency "catalogers." The values assigned to the attributes were "close enough" to foster resource discovery by the general public. The greatest issue was that some mandatory fields, such as "keywords" or "description," were sometimes left incomplete.
  6. No abuse of the power of META tags was discovered. Some vendors (10) have expressed concern that commercial webmasters will subvert the value of META tag indexing to improve a site's chance of discovery, perhaps unfairly. That belief influenced some major search engines to "soft-pedal" the importance of using META tags for keywords and descriptions. However, government webmasters did not attempt to manipulate the placement of words in a document (e.g., redundant text, such as white text on a white background) or assign excessively broad subjects to increase the probability of retrieval.
  7. One search engine company, Alta Vista, actively promoted adding META tagged index data during their presentations to state and local government managers and technicians. They assured agencies that the probability of their page moving up to the beginning of an Alta Vista result list of "hits" was significantly improved if the search words were found in a META field. This compatibility of rewards has been a major contributor to the success of META tagging in Washington state.

Significant results

Compliance without coercion

The success of improved precision in searching Internet pages using META tagged values is directly influenced by the percentage of all pages that carry this additional information, but not all pages must be META tagged. Therefore, the author or webmaster's effort for META tagging is limited to a manageable number of pages. The greatest motivator is the improvement in visibility in state and international search engines. Also, file discovery and retrieval within an agency's intranet is improved where they deploy META tag-sensitive spidering software.

Agency participation in cataloging their resources using META tags continued to grow during 1997. From the start of META tag use in April 1997 until August 1997, the collection of META tagged government HTML pages grew to 350 records. In September 1997, the decision was made to add non-META tagged records to the database through full-text spidering. This allowed full recall of resources while providing the precision of fielded searching for the major nodes that were META tagged.

At no time during the project did the staff receive complaints from government agency staff about the effort required to apply META tags or the nature of the process. The only apparent challenge was merging META tagging into the routine of publishing material on the Web. The process of META tagging fit naturally into their ongoing efforts of Web page creation and was not perceived as an additional burden.

Depth of metadata

Author- or webmaster-generated META tag indexing provided greater depth of narrative and description when compared to Machine Readable Cataloging (MARC) standards. More robust keywords and descriptive text were provided for each referenced object. It appeared that once an author or webmaster accepted the process, they attempted to do the best possible job of generously describing the content and contact information.

Webmaster or author assignment of values to fields was judged "good enough" to support fielded searching.

For uncontrolled fields, the government agency webmaster or author assigned many terms, even some not otherwise included in the text. Particularly important was the choice of "also known as" values, such as in the case of the popular name for a statutory program or law. This factor has great potential for enhancing search and retrieval success for the average person. By contrast, searching full-text or abstract indexing search engines using popular terms not found in the concordance file would fail to locate resources unless the engine employed an extensive, "street-wise" external thesaurus.

For controlled fields, such as subject and government type, the government agency catalogers had to choose between imprecise terms. However, they appeared to choose values that closely matched those that citizens would use in searching. In several cases, they assigned the most comprehensive set of values. For example, on one page (11) a Department of Health webmaster assigned nine different subjects from the most detailed level of the GILS-controlled vocabulary (12) to ensure a good cross-reference and discovery process.

Improvement in retrieval

Discovery of relevant pages on the Internet is significantly improved using schema-controlled attribute-value indexes. When used by a searcher, the power of "triangulation" between complementary terms in fields can pinpoint the needed information far better than "keyword" (concordance list) searching.

During tests of META tagged resource discovery by customers, the following results were observed:

  1. All government agency Web pages using META tagged values achieved a high relevancy rating, thus moving higher on the result lists. This increased each page's probability of discovery and retrieval.
  2. The display of the common attributes such as title, description, and URL did not always reveal the contents of the cited resource. When the searcher viewed the full GILS record, including subjects, keywords, originator jurisdiction, and point of contact, a better selection was made of links to pursue.
  3. The assignment of a subject classification in the META tags allowed similar topics to be located. This was considered a desirable feature by many searchers.

Creating a supportive discipline for contributor-submitted indexing

Establishing a self-maintained environment for author or webmaster submitted index values requires an organizational commitment, structure, and human effort. The project team addressed the following issues during program development:


The most challenging task of the project was to get the word out about META tagging and the process for participating in Internet indexing. A large stakeholder group of over 500 members from state and local agencies, law firms, newspapers, universities, and libraries gave advice and feedback during system development.

Agency Guide to GILS pamphlets were distributed to explain the GILS mission and the role agencies could play in creating META tagged sites.

Flyers were distributed to citizens' organizations explaining how the public could benefit and how they could access the GILS site.

References to the GILS search engine were published in 40,000 copies of the Citizens Guide to Locating Government Information, which were distributed to government counters, libraries, and newspaper offices statewide.

Standard setting

The enabling legislation directed the State Library Commission to establish content-related standards for common formats and agency indexes for state agency produced information (13). This was an important step, one that established the validity of META tagging.


It was essential to build a network of organizations willing to promote and support META tagging. Several statewide technology groups also pledged to comply with the indexing standards.

The GILS Project staff assisted several agencies and jurisdictions with organizing and indexing their Web content. Patrons visiting the GILS search site were presented with an option to send e-mail to the library for assistance if they could not find the information they needed. This service is a rare but increasingly important aspect of large search services.

Incentives for META tagging

The value of using HTML META tags becomes apparent when the Web pages are given additional weighting by the major search engines. Government agencies associated the cataloging effort with Web visibility. Rather than consider the cataloging task as compliance with some abstract format standard, agencies considered it "enhancing their pages for improved discovery and retrieval."

Recognition is a prime motivator in gaining acceptance of new concepts. Several agencies demonstrated support and commitment to META tagging early in the project. These agencies were honored in recognition ceremonies that emphasized their dedication to providing government information to the public.

Other inducements for volunteer organization participation during development included:

  1. Creation of a unique icon for Web pages that comply with GILS standards. Display of this icon on their Web pages became prestigious and a symbol of their progressive, public-oriented philosophy.
  2. Potential to search for cataloged documents within their intranet. By deploying a low-cost META tag-sensitive search server, any organization could create an internal document discovery engine. Any document so cataloged could be posted to the Internet without additional index effort.

Quality assurance/feedback

Several tools were developed to assist agencies with creating good metadata and complying with META tag syntax (14)(15)(16).
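A completeness check of the kind these tools perform might be sketched as follows. The mandatory-field list here is an assumption based on the "keywords" and "description" fields named earlier in this paper, and the regular expression is a deliberately simplified stand-in for real HTML parsing:

```python
# Sketch of a META tag completeness checker in the spirit of the QA tools
# described above. The MANDATORY set is an assumption, not the actual
# GILS requirement list.
import re

MANDATORY = {"keywords", "description"}

def check_meta(html):
    """Return the set of mandatory META fields missing from a page."""
    found = re.findall(r'<meta\s+name="([^"]+)"', html, re.IGNORECASE)
    return MANDATORY - {name.lower() for name in found}

page = '<head><meta name="keywords" content="parks"></head>'
print(check_meta(page))  # → {'description'}
```

Run against a spidered page before import, such a check would have flagged the incomplete mandatory fields that the librarians' quality reviews found.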

Additional "after-the-fact" analysis applications are under development to check the quality of metadata during spider and import processes.

Library staff members, on an "as needed" basis, do some quality assurance over submitted content.

A smaller Focus Group was formed from the larger stakeholder organization to provide direct guidance to the development effort.

"Hit" statistics are available to agencies to analyze the popularity of their Web pages. Also, a "mailto:" function is provided on every result list page. Many comments are submitted by searchers and are forwarded to the appropriate agencies for action.

Requirements to maintain a contributor-based indexing process

Human resources

The GILS Project was accomplished with minimal financial expense: about $200,000 U.S. per year in total. The project team employed one full-time project manager, two half-time senior librarians, and one part-time technician providing webmaster, programmer, and systems administrator support. This team ran the project for the 24 months of development (July 1996 until June 1998). The project became a permanent program of the state in July 1998.

Additional volunteer assistance came from state and local government agency staff, private sector professionals, and citizens of the state.


The GILS project is using (as of February 1998) Netscape's Compass Server software (version 3.0) (17).

This is running on a Pentium Pro-based server with 128 MB of RAM and two hot-swappable 4 GB drives (mirrored). The operating system is Microsoft Windows NT, version 4.0. The server is connected to the Internet through an Ethernet connection to the Internet service provider (18).

Political will

There will always be naysayers and doomsday prophets for all new ideas. Commitment to explore, err, and triumph was needed for the project to succeed. This included legislative willpower, political influence, and risk-taking. Substantial effort was expended to advise, influence, promote, encourage, and support government leaders involved with such a large endeavor.

Work ahead

Full promotion

As the project phase drew to a close, promotional effort increased. With mechanical and procedural issues fairly resolved, alerting the public to this search tool has become the primary mission. An aggressive promotional campaign is underway. The GILS Web site and service must be advertised just like any other Internet service competing for customers. While it remains a free tool, state funds are expended and must be justified. Success will continue to be measured by popular interest and use of the tool.

Additional aids to assist contributors and ensure quality of cataloging

On the drawing boards are interactive applications to check META tagged sites for compliance with GILS attribute schema and syntax. These include programs to verify accessibility for disabled citizens and attribute analysis.

Evolution of software

GILS uses software originally designed for intranet applications. While there are similarities, searching the Internet presents more challenges. Volumes of resource records in the millions are expected. Spidering software must be very efficient and quick. Graphical interfaces must be flexible as requirements change frequently (e.g., non-English GUIs).

Finally, the structure and rules governing the Internet are constantly changing. The work that was begun on our site using HTML as a vehicle for catalog information will yield to the new architecture of XML and RDF (19). Conversion to the next standard is expected, but Washington state believes the transition will be easier because it has invested in META tags and created a discipline of indexing government information.


1 Nicholson, Scott, "Indexing and Abstracting on the World Wide Web: An Examination of Six Web Databases," Information Technology and Libraries (June 1997): 73-81.

2 Joss, Molly and Wszola, Stanley, "Search Engines that Can," CD-ROM Professional (June 1996): 31-48.

3 Ferl, Terry Ellen and Millsap, Larry, "The Knuckle-Cracker's Dilemma," Information Technology and Libraries (June 1996): 81-91.

4 Moen, William and McClure, Charles, An Evaluation of the Federal Government's Implementation of the Government Information Locator Service (GILS): Final Report. U.S. Government Printing Office (Stock Number 022-003-01190-1), 1997.





9 Su, Louise, "Value of Search Results as a Whole as a Measure of Information Retrieval Performance," ASIS '96: Proceedings of the 59th ASIS Annual Meeting 33 (1996): 226-237.


11 (View HTML source code to see META tags)








