Using META Tag-Embedded Indexing for Fielded Searching of the Internet

Philip COOMBS <pcoombs@wln.com>
Washington State Library
USA

Abstract

Full-text searching on the Internet has run its course. A new
approach adding fielded searching is vital to the effectiveness
of information discovery and retrieval in the years ahead. This
paper presents the results of one year of operation of a statewide
government locator service employing indexing embedded in META tags,
common attribute schema, and combined full-text and fielded searching
applications. It provides evidence that author-indexed information
is practical, viable, and powerful when embedded into the source
files available on the Internet. This method has drawn interest
and acclaim from governments and industry as it demonstrates the
critical role META tags will play in the Internet of the next few
years.

Contents

Public demand for access to government information has existed
throughout the years. However, with the creation of the Internet
and modern search and retrieval tools, the public has turned up
the pressure for agencies to make more information available.
Washington State created a Public Information Access Policy Task
Force in 1995 to determine the action needed to support public
discovery and retrieval of government information. The 1996 legislature
directed the creation of a pilot project to assist public access
of state and local electronic government information. Washington's
Government Information Locator Service (GILS) project began in
late 1996. The project staff considered conventional cataloging
and index capture techniques, but rejected them in favor of embedded
metadata in Web pages and use of META tag-sensitive harvesting
robots. Why did the project team reject traditional "full-text"
indexing schemes?

To the majority of citizens around the world, the Internet presents
the opportunity to explore an almost unlimited depth of facts,
directions, multimedia presentations, and near-facts -- in short,
information. That is the blessing of this powerful tool. But to
these same people, the Internet will not yield its treasures without
a struggle. The discovery of significant, relevant information
requires the patience and precision of a surgical operation. Each
day more characters, pixels, and sounds are added to the collective
cache called the Internet. Each day, the challenge of separating
relevant data from irrelevant data becomes more daunting. What
was once an amusing game of thinking up homonyms, synonyms, and
antonyms now foils our attempts to locate appropriate information
resources on the Web. Given the open license authors have to express
themselves using all available nouns and adjectives, it is truly
remarkable if a searcher's chosen term matches a related concept
in a document on the Internet. Yet, this is the current strategy
for discovering relevant information on the Web. What tools are
presently used by searchers?

The major search engines are categorized as directory, full-text,
or abstracting. Directory search tools provide subject headings for
navigation, usually created by humans. No text is taken from
the page for indexing; rather, the pages are examined and classified
into a subject heading hierarchy. Examples include Yahoo and parts
of Magellan, Excite and Lycos.

Full-text search tools index every word on every page of
the database. Alta Vista and Open Text fall into this category.
Full-text tools are not good for general subject searching, which
is one of the greatest frustrations for users.

Abstracting search tools take a selected portion of the
target site for indexing. These tools use some type of algorithm
to select and index frequently used or prominent words on a page.
Examples include Excite, Lycos, Magellan, Web Crawler, and Hotbot.
They are good for general subject searching to generate clusters
of related citations (1).

However derived,
the index created by full-text and abstract tools is based upon
a machine-aided compilation of searchable terms. They employ a
concordance list of terms contained in documents discovered during
robot "spidering" of Web sites and pages. Searchers
must locate relevant sources by matching search terms to words
contained in the concordance files. How accurate is this process?

Many studies have evaluated resource searching on the Internet,
concluding that even under the best conditions, locating the most
timely, accurate and relevant resources is difficult. Most searchers
will not use all the tools available, such as Boolean syntax or
multiple, related terms in their query. "Rarely or inconsistently
used keywords, for example, may turn up only a few hits, while
search criteria that are too broadly defined can return cumbersome
heaps of hits" (2). This leaves the searcher
to "refine" a large result list further by choosing
search terms and observing outcomes -- not a pretty process. And
search precision is not always guaranteed. In one study, typical
of discovery and retrieval analysis, researchers found that at
least one quarter of searchers did not appear to retrieve useful
citations.

Thankfully, other search methodologies and tools exist. One has
been in existence for more than a century -- the card catalog system
in libraries. A catalog contains various attributes and values
to abstractly describe the contents of a document, media product,
work of art, or other physical object. The catalog concept is
based on the use of specific fields and a discipline or structure
to assign values to them. More precise searching is possible using
multiple-indexed fields to pinpoint sources in a method similar
to triangulation in navigation. By selecting complementary or
contrasting index terms, a searcher can use the power of Venn
subsets to include and exclude clusters of subjects and specific
words. Searches using a combination of terms such as title, subject,
and author quickly narrow the results and reduce irrelevant citations
(3). Could this concept be applied to the Internet?

Creating a catalog of information on the Internet has become a
goal for many organizations and individuals on the planet. Several
significant issues challenge our achievement of this objective.

The card catalog works well in the environment of printed material
or media because the summary or metadata is separate from the
object described, thus ensuring the index remains intact while
the object travels, i.e., is checked out of the library. In the
world of electronic objects, however, such remoteness becomes
problematic. The contents of a referenced object can change through
revision or wholesale replacement. The location of the object
can be moved from one Internet address to another, rendering a
URL citation obsolete. We thought the remedy was to dynamically
link an abstract or index record to its referenced source.

The process of creating an abstract of a source of information
has evolved into a profession in libraries, complete with a body
of knowledge and internationally accepted standards. Consistency
and accuracy in cataloging ensure reliable discovery and retrieval.
The primary impediment to cataloging the Internet is the training
and experience needed to comply with the standards. But accuracy
is a relative term. Few webmasters and authors have training in
cataloging principles. Could the concept of "good enough" be applied to
an Internet index?

Fielded searching is most powerful when there are limits to the
range of choices of possible values. This is accomplished using
a controlled vocabulary where options can be reasonably limited
to a commonly accepted set of values. While this increases the
probability of a match during searching, it likewise limits the
cataloger to the values available. Given the scope of topics potentially
encountered in the Internet, any controlled vocabulary covering
subjects or themes must be fairly comprehensive. Existing authority
sets such as the Library of Congress classification system or
the Dewey Decimal system present a daunting array of choices,
far too many for the untrained cataloger. Indeed, for Internet
use and general public searching, a simpler vocabulary is needed.

A related issue is the challenge of segmenting a seemingly continuous
spectrum of topics into a discrete list that can answer the searcher's
question. For example, a city is uniquely different from a county,
except when they share certain services such as police or street
maintenance. Therefore, a Web site covering local government street
repairs might be cataloged into both jurisdictions. This distinction
challenges a cataloger of Internet resources.

If an Internet resource is to be searched and discovered using
pattern matching on descriptive words, the choice of the cataloged
word or words is critical. But the choice is not limited to words
used in the text of the resource. Many apt descriptive terms that
do not appear in the text of the resource can be used to
properly describe it. This distinguishes the abstract process
from the full-text indexing process. The most effective index
uses specific but commonly used words. Use of cataloger-indexed
values rather than a concordance list of words found in a resource
allows additional choices, but perhaps also an element of
subjectivity. Could non-trained catalogers do an adequate job of
choosing keyword and subject terms?

Initially, the GILS Project Team considered using professional
catalogers to create the catalog of Washington state and local
government Internet Web sites and pages. This quickly became a
tedious process with no real chance of adequate scope and depth of coverage.
We needed something to accelerate the pace of cataloging. Additionally,
the catalogers needed time to become familiar with the content
and purpose of each resource. The product of their effort was
mostly not used. The most accurate assignment of index values
comes from the author. However, providing summary information
prior to publishing was not an established discipline. Would the originators participate substantially in cataloging
their information to satisfy index requirements?

Once a dynamic resource like a Web site is cataloged, the work
is not done. Each revision to the resource may justify an update
to the related index or abstract record. It is often an effort
to get an Internet resource initially cataloged, let alone to
update the index on a regular or as-needed basis. Was there a way
to maintain accuracy with minimum effort?

Once a resource has been cataloged, the next issue is how to collect
the summary records into a searchable database. Traditional manual
capture mechanisms employ forms or CGI scripts to direct the cataloger
in the reporting process. The information is then transmitted
(or worse, rekeyed) into a file server to manage the sorting and
retrieval. This is the present process of U.S. federal government
agencies complying with the federal GILS program, and it has spawned
multiple databases over scores of Internet accessible servers.
Specially trained GILS coordinators within federal agencies promote
the inclusion of "significant" government information
and assist program staff in completing the GILS index record.
While this approach has built a few thousand quality records,
"[federal] GILS implementation has not achieved the vision
of a virtual card catalogue of government information nor have
the majority of agency GILS implementations matured into a coherent
and usable government information locator service" (4).
The project team wanted to avoid a similar outcome in Washington
state.

Initially, the Washington State GILS Project followed this
federal methodology. However, within six months, it was apparent
that any significant volume of resources would not be collected
using this approach. Other competitive information sources, such
as the major Internet search engines, using full-text or abstract
index spidering, could easily create more records than the much
smaller GILS databases. To acquire sufficient records for depth
of coverage, a blend of full-text (concordance) file spidering
and specific attribute-value spidering was needed.

During the development of HTML, the standard made a provision
for attribute-value pairs in the <HEAD> portion of the page.
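Such attribute-value pairs are expressed as <META NAME="..." CONTENT="..."> elements in the page header, and a harvesting robot can read them back out. The sketch below illustrates the idea; the field names and sample page are invented for illustration and are not the actual Washington GILS attribute set.

```python
from html.parser import HTMLParser

# Hypothetical GILS-style page header; the attribute names are
# illustrative, not the exact Washington GILS schema.
PAGE = """
<html><head>
<title>Street Maintenance</title>
<meta name="subject" content="Transportation">
<meta name="jurisdiction" content="city">
<meta name="description" content="Street repair schedules and contacts">
</head><body>...</body></html>
"""

class MetaHarvester(HTMLParser):
    """Collect NAME/CONTENT pairs from <meta> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.fields[d["name"].lower()] = d["content"]

harvester = MetaHarvester()
harvester.feed(PAGE)
print(harvester.fields["jurisdiction"])   # -> city
```

A spider of the kind described later in this paper would apply the same extraction to each page it visits and import the resulting field-value pairs into a searchable database.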
Initially, it was used primarily for browser control and limited
content description. While the variety of uses continued to grow
and evolve (5), the META tags have been mostly
ignored as a vehicle for robust indexing of Web information. Several
major search engine tools, such as Alta Vista, do assign a higher
relative weight to text found in the META tags. But, there is little
encouragement for Web masters to use META tags for more than simple
"keyword" and "description" attributes. The project staff asked, "Could this feature work for
wholesale cataloging of state and local government Web pages? It was not enough to extract metadata about Internet-based resources.
Search engines lacked the ability to discover and index non-Internet
resources such as printed publications and personal contacts.
Further, many Web pages did not provide adequate contact information
for the reader to pursue. The more robust attribute set used by
the GILS program supported entry of additional useful data and
contact information to aid searchers. We concluded that indexing
both Internet and non-Internet resources was needed to assure
that citizens could both find and retrieve information.

Many server applications have been created to index Web sites
and present a database on the Internet for browser access. Only
a few applications, however, are available that support harvesting
META tag values for fielded searching. And these are marketed towards
intranet use. For the GILS project to succeed, the application
used had to harvest META tags from 2,700+ state and local government
Web sites on 500+ separate servers. Did such a tool exist?

Were it not for the discovery and application of META tags, Washington
state's locator service project would have ended in 1996 for lack
of a suitable capture and index vehicle for catalog data. The
project became invigorated with the discovery that attribute values
embedded in the Web source page addressed many of the Internet
cataloging challenges.

The Washington State GILS Project staff initially populated the
test database with "stand-alone" META tagged HTML pages.
Each record carried the full GILS attribute set in the META tag
text strings and replicated the values in the <BODY> in
visible text (6). This allowed each page to
act as a resource descriptor, similar to the federal GILS record.
Netscape's Catalog Server software was purchased, configured,
and deployed on an NT-based platform to spider government Web
servers, capture the META tag values and import them into an Internet-accessible,
browser-searchable database. This eliminated any additional keying
or conversion for data capture.

By July 1997, a few state and local government agencies had volunteered
to "META tag" major nodes in their Web sites for the
project. Over 300 HTML pages were indexed using Washington state's
version (7) of the federal GILS attribute
set. The Netscape Catalog Server software, though initially designed
for an intranet environment, had demonstrated an ability to build
an index of Internet resources.

Since it would be impractical to ask agencies to "META tag"
all pages on their sites, major nodes were targeted first. This
meant that the majority of information on the Internet would not
be immediately searchable using the GILS attribute set. This negatively
impacted the measure of "recall" of information from
agency servers. So, to satisfy the objective of increasing access
to the information, full-text review and indexing (concordance)
were also applied during META tag spidering. The combined index
contained full-text indexing for all server pages in addition
to the specific GILS attribute set indexing for the META tagged
pages.

The final challenge was to provide general full-text searching
(popularly referred to as "keyword" searching) and more
specific attribute searching of the GILS database. With modifications
to the Netscape product's searching algorithm, the final process
presented both options to the searcher (8).

To address non-Internet resources, the "stand-alone"
GILS record was promoted. Using a Microsoft Word 6.0 template
/ macro file on a floppy disk, agency staff created HTML pages
that contained both the hidden META tags and a visible array of
the field values in the <BODY>. Many such records were written
and loaded in government servers to await the visit and capture
of the values by the GILS spider.

After a year in operation, the project evaluated the success of
using META tags to carry index data. Some general observations
are drawn:

- Simple text-matching searches appeal to the public because
the average citizen searching for government information will
not use advanced search tools such as Boolean syntax and synonym
sets.
- As judged by the searcher, the most effective searches result
in relevant resources. This has been the finding of many
studies, such as the series of University of Pittsburgh studies
on information retrieval performance (9).
While the full-text search often returned many "hits,"
relevancy was dependent upon the quality of the choice of search
terms and the range of possible meaning of the terms. For example,
a search on "welfare" produced more irrelevant hits
than "public assistance." The choice of terms needs
to meet the searcher's knowledge or even (inaccurate) understanding
of the information they seek.
- Full-text searching provided the greatest recall of potential
resource records. Initially, the volume of "hits" was
acceptable. Once the index grew to over 100,000 records, the
additional pages of hits became unproductive.
- The number of Web pages carrying the GILS
META tag index has
grown, but is still less than 10% of the total state and local
government pages spidered. While the volume is a concern, the
percent of total cataloged records is growing strongly. Over 100
new government pages are META tagged weekly.
- The majority of META tagged pages are "node" points
in a government Web site. This means the precision of fielded
searching is applied to connect searchers to branching or launch
points, not the final files or documents.
- The use of values applied against specific fields produces
exact matches. For example, the search of a subject within the
jurisdiction of "city" returns the set of records only
involving city governments. The only source of error was the
accuracy of the values initially assigned during cataloging.
- At first, webmasters, public information officers, and program
managers were curious or uncertain about the GILS request to add
index data to major Web pages on their site. After some promotion
and training, it became a non-issue for them. In fact, many webmasters
expressed interest in the concept and willingly provided leadership
within their agencies to get pages "tagged."
- The assignment of values to the attribute fields was performed
by agency webmasters or document authors. Errors in judgment or
a lack of understanding of the content of a page can generate
gross inaccuracies in cataloging. However, while some errors in
content or syntax were discovered, the quality reviews performed
by librarians validated the high quality of work accomplished by
the government agency "catalogers." The values assigned
to the attributes were "close enough" to foster resource
discovery by the general public. The greatest issue was that
some mandatory fields such as "keywords" or "description"
were not completed.
- No abuse of the power of
META tags was discovered. Some vendors
(10) have expressed concern that commercial
webmasters will subvert the value of META tag indexing to improve
a site's chance of discovery, perhaps unfairly. That belief influenced
some major search engines to "soft-pedal" the importance
of using META tags for keywords and descriptions. However, government
webmasters did not attempt to manipulate the placement of words
in a document (e.g., redundant text, such as white text on a white
background) or assignment of excessively broad subjects to increase
the probability of retrieval.
- One search engine company, Alta Vista, actively promoted adding
META tagged index data during their presentations to state and
local government managers and technicians. They assured agencies
that the probability of their page moving up to the beginning
of an Alta Vista result list of "hits" was significantly
improved if the search words were found in a META field. This
compatibility of rewards has been a major contributor to the success
of META tagging in Washington state.
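The blend of indexing reported in these observations, a concordance (inverted) index over every spidered page plus a fielded index over the META tagged subset, can be sketched as follows. The page names, text, and attribute values are invented for illustration.

```python
# Invented sample collection: every page gets full-text (concordance)
# indexing; only META tagged pages also carry fielded values.
pages = {
    "roads.html":   {"text": "street repair schedules for the county",
                     "fields": {"subject": "Transportation",
                                "jurisdiction": "county"}},
    "permits.html": {"text": "building permit street address forms",
                     "fields": {}},  # not META tagged
}

# Concordance (inverted) index over all pages.
concordance = {}
for url, page in pages.items():
    for word in page["text"].split():
        concordance.setdefault(word, set()).add(url)

# Fielded index over the META tagged subset only.
fielded = {}
for url, page in pages.items():
    for name, value in page["fields"].items():
        fielded.setdefault((name, value.lower()), set()).add(url)

# A keyword search recalls every page containing the word ...
print(sorted(concordance["street"]))   # -> ['permits.html', 'roads.html']
# ... while a fielded search returns only exact attribute matches.
print(sorted(fielded[("jurisdiction", "county")]))   # -> ['roads.html']
```

This mirrors the project's design choice: full-text spidering preserves recall across all pages, while the fielded index adds precision for the META tagged nodes.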
The success of improved precision in searching Internet pages
using META tagged values is directly influenced by the percentage
of all pages that carry this additional information, but not all
pages must be META tagged. Therefore, the author or webmaster's
effort for META tagging is limited to a manageable number of pages.
The greatest motivator is the improvement in visibility in state
and international search engines. Also, file discovery and retrieval
within an agency's intranet is improved where agencies deploy META tag-sensitive
spidering software.

Agency participation in cataloging their resources using META tags
continued to grow during 1997. From the start of the use of META tags
in April 1997 until August 1997, the collection of META tagged government
HTML pages grew to 350 records. In September 1997, the decision
was made to add non-META tagged records to the database through
full-text spidering. It allowed full recall of resources while
providing the precision of fielded searching of the major nodes
that were META tagged.

At no time during the project did the staff receive complaints
from government agency staff about the effort required to apply
META tags or the nature of the process. The only apparent challenge
was merging META tagging into the routine of publishing material
on the Web. The process of META tagging fit naturally into their
ongoing efforts of Web page creation and was not perceived as an
additional burden.

Author- or webmaster-generated META tag indexing provided greater
depth of narrative and description when compared to Machine Readable
Cataloging (MARC) Standards. More robust keywords and descriptive
text were provided for each referenced object. It appeared that
once an author or webmaster accepted the process, they attempted
to do the best possible job of generously describing the content
and contact information.

Webmaster or author assignment of values to fields was judged
"good enough" to support fielded searching.

For uncontrolled fields, the government agency webmaster or author
assigned many terms, even some not otherwise included in the text.
Particularly important was the choice of "also known as"
values, such as in the case of the popular name for a statutory
program or law. This factor has great potential for enhancing
search and retrieval success for the average person. By contrast,
searching full-text or abstract indexing search engines using
popular terms not found in the concordance file would fail to
locate resources unless the engine employed an extensive, "street-wise"
external thesaurus.

For controlled fields, such as subject and government type,
government agency catalogers had to choose between imprecise terms.
However, they appeared to choose values that closely matched those
that citizens would use in searching. In several cases, they assigned
the most comprehensive set of values. For example, on one page
(11) a Department of Health webmaster assigned
nine different subjects from the most detailed level of the GILS-controlled
vocabulary (12) to ensure a good cross-reference
and discovery process.

Discovery of relevant pages on the Internet is significantly improved
using schema-controlled attribute-value indexes. When used by
a searcher, the power of "triangulation" between complementary
terms in fields can pinpoint the needed information far better
than "keyword" (concordance list) searching.

During tests of META tagged resource discovery by customers, the
following results were observed:

- All government agency Web pages using
META tagged values achieved
a high relevancy rating, thus, moving higher on the result lists.
This increased the page's probability of discovery and retrieval. - The display of the common attributes such as title, description,
and URL did not always reveal the contents of the cited resource.
When the searcher viewed the full GILS record, including subjects,
keywords, originator jurisdiction, and point of contact, a better
selection was made of links to pursue.
- The assignment of a subject classification in the
META tags
allowed similar topics to be located. This was considered a desirable
feature by many searchers.
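The "triangulation" effect noted above can be illustrated with simple set intersection over fielded result sets; the record identifiers and field values below are hypothetical.

```python
# Hypothetical harvested index keyed by (field, value); each entry
# maps to the set of record identifiers carrying that value.
by_field = {
    ("subject", "public assistance"):  {"r1", "r2", "r5"},
    ("jurisdiction", "city"):          {"r2", "r3", "r5"},
    ("originator", "dept. of health"): {"r5", "r6"},
}

def triangulate(*criteria):
    """Intersect the result sets for each (field, value) criterion."""
    sets = [by_field.get(c, set()) for c in criteria]
    return set.intersection(*sets) if sets else set()

# Each added field narrows the result list toward the target record.
print(triangulate(("subject", "public assistance")))
print(triangulate(("subject", "public assistance"),
                  ("jurisdiction", "city"),
                  ("originator", "dept. of health")))   # -> {'r5'}
```

Each complementary field behaves like one bearing line in navigational triangulation: any single field is broad, but their intersection pinpoints the record.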
Establishing a self-maintained environment for author or webmaster
submitted index values requires an organizational commitment,
structure, and human effort. The project team addressed the following
issues during program development:

The most challenging task of the project was to get the word out
about META tagging and the process to participate in Internet indexing.
A large stakeholder group of over 500 members from state and local
agencies, law firms, newspapers, universities, and libraries gave
advice and feedback during system development.

Agency Guide to GILS pamphlets were distributed to explain
the GILS mission and the role agencies could play in creating
META tagged sites.

Flyers were distributed to citizens' organizations explaining
how the public could benefit and how they could access the GILS
site.

References to the GILS search engine were published in the 40,000
copies of the Citizens Guide to Locating Government Information,
distributed to government counters, libraries, and newspaper
offices statewide.

The enabling legislation directed the State Library Commission
to establish content-related standards for common formats and
agency indexes for state agency produced information (13).
This was an important step, one that established the validity
of META tagging.

It was essential to build a network of organizations willing to
promote and support the increased benefit of META tagging. Also,
several statewide technology groups pledged to comply with the
indexing standards.

The GILS Project staff assisted several agencies and jurisdictions
with organizing and indexing their Web content. Patrons visiting
the GILS search site were presented with an option to send e-mail
to the library for assistance if they could not find the information
they needed. This service is a rare but increasingly important
aspect of large search services.

Incentives for META tagging

The value of using HTML META tags becomes apparent when the Web
pages are given additional weighting by the major search engines.
Government agencies associated the cataloging effort with Web
visibility. Rather than consider the cataloging task as compliance
with some abstract format standard, agencies considered it "enhancing
their pages for improved discovery and retrieval."

Recognition is a prime motivator in gaining acceptance of new
concepts. Several agencies demonstrated support and commitment
to META tagging early in the project. These agencies were honored
in recognition ceremonies that emphasized their dedication to
providing government information to the public. Other inducements for volunteer organization participation during
development included:

- Creation of a unique icon for Web pages that comply with GILS
standards. Display of this icon on their Web pages became prestigious
and a symbol of their progressive, public-oriented philosophy.
- Potential to search for cataloged documents within their intranet.
By deploying a low-cost
META tag-sensitive search server, any organization
could create an internal document discovery engine. Any document
so cataloged could be posted to the Internet without additional
index effort.
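Because the most common cataloging error observed was mandatory fields left blank, a compliance checker of the general kind described here might start with a simple completeness test. The mandatory field list below is illustrative, not the full Washington GILS schema.

```python
# Illustrative mandatory attributes, following the observation that
# "keywords" and "description" were the fields most often omitted.
MANDATORY = ("title", "keywords", "description")

def missing_fields(record):
    """Return the mandatory attributes absent or blank in a record."""
    return [f for f in MANDATORY if not record.get(f, "").strip()]

record = {"title": "Street Maintenance",   # invented sample record
          "keywords": ""}                  # blank keywords, no description
print(missing_fields(record))              # -> ['keywords', 'description']
```

Run against harvested records during spider import, a check like this flags incomplete catalog entries before they degrade fielded search results.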
Several tools were developed to assist agencies with creating
good metadata and complying with META tag syntax (14)(15)(16).

Additional "after-the-fact" analysis applications are
under development to check the quality of metadata during spider
and import processes.

Library staff members, on an "as needed" basis, do some
quality assurance over submitted content.

A smaller Focus Group was formed from the larger stakeholder organization
to provide direct guidance to the development effort.

"Hit" statistics are available to agencies to analyze
the popularity of their Web pages. Also, a "mailto:"
function is provided on every result list page. Many comments
are submitted by searchers and are forwarded to the appropriate
agencies for action.

The GILS Project was accomplished with minimal financial expense.
About $200,000 U.S. per year covered all project costs. The project
team employed one full time project manager, two half-time senior
librarians, and one part-time technician providing webmaster,
programmer and systems administrator support. This team ran the
project for the 24 months of development (July 1996 until June
1998). The project became a permanent program of the state in
July 1998.

Additional volunteer assistance came from state and local government
agency staff, private sector professionals, and citizens of the
state.

The GILS project is using (as of February 1998) Netscape's Compass
Server software (version 3.0) (17). This is running on a Pentium Pro-based server with 128 MB of RAM
and two hot-swappable 4 GB drives (mirrored). The operating system
is Microsoft NT, version 4.0. The server is connected to the Internet
through an Ethernet connection to the Internet Service Provider
(18).

There will always be naysayers and doomsday prophets for all new
ideas. Commitment to explore, err, and triumph was needed for
the project to succeed. This included legislative willpower, political
influence, and risk-taking. Substantial effort was expended to
advise, influence, promote, encourage, and support government
leaders involved with such a large endeavor.

As the project phase drew to a close, promotional effort increased.
With mechanical and procedural issues fairly resolved, alerting
the public to this search tool has become the primary mission.
An aggressive promotional campaign is underway. The GILS Web site
and service must be advertised just like any other Internet service
competing for customers. While it remains a free tool, state funds
are expended and must be justified. Success will continue to be
measured by popular interest and use of the tool.

On the drawing boards are interactive applications to check META tagged
sites for compliance with GILS attribute schema and syntax. These
include programs to verify accessibility for disabled citizens
and attribute analysis.

GILS uses software originally designed for intranet applications.
While there are similarities, searching the Internet presents
more challenges. Volumes of resource records in the millions are
expected. Spidering software must be very efficient and quick.
Graphical interfaces must be flexible as requirements change frequently
(e.g., non-English GUIs).

Finally, the structure and rules governing the Internet are constantly
changing. The work that was begun on our site using HTML as a
vehicle for catalog information will yield to the new architecture
of XML and RDF (19). Conversion to the next
standard is expected, but Washington state believes the transition
will be easier because it has invested in META tags and created
a discipline of indexing government information.

References

1. Nicholson, Scott, "Indexing and Abstracting on the World Wide Web:
An Examination of Six Web Databases," Information Technology and
Libraries (June 1997): 73-81.
2. Joss, Molly and Wszola, Stanley, "Search Engines that Can,"
CD-ROM Professional (June 1996): 31-48.
3. Ferl, Terry Ellen and Millsap, Larry, "The Knuckle-Cracker's
Dilemma," Information Technology and Libraries (June 1996): 81-91.
4. Moen, William and McClure, Charles, An Evaluation of the Federal
Government's Implementation of the Government Information Locator
Service (GILS): Final Report. U.S. Government Printing Office
(Stock Number 022-003-01190-1) (1997); also
http://www.unt.edu/slis/research/gilseval/gilsdocs.htm
5. http://www.vancouver-webpages.com/META/
6. http://www.wa.gov/wsl/gils/gilstest/
7. http://www.wa.gov/wsl/gils/metadesc.htm
8. http://wagils.wln.com:100/
9. Su, Louise, "Value of search results as a whole as a measure of
information retrieval performance," ASIS '96: Proceedings of the
59th ASIS Annual Meeting 33 (1996): 226-237.
10. http://www.excite.com/Info/listing.html#anchor4877066
11. http://www.doh.wa.gov:80/Publicat/94_PHIP/94phip.htm (view HTML
source code to see META tags)
12. http://www.wa.gov:80/wsl/gils/gilstree.htm
13. http://leginfo.leg.wa.gov/pub/rcw/title_27/chapter_004/rcw_27_04_045
14. http://www.wa.gov/wsl/gils/gilstmpl.txt
15. http://www.wa.gov/wsl/gils/metamakr.htm
16. http://www.wa.gov/wsl/gils/subjbldr.cgi
17. http://home.netscape.com/comprod/server_central/index.html
18. http://www.wln.com/
19. http://w3c1.inria.fr/TR/WD-rdf-syntax/