An End User's View of Mining the Web: Focused and Satisficed Internet Search and Retrieval Strategies

Wallace C. Koehler, Jr. <willing@usit.net>
University of Tennessee
USA

Abstract

More attention has been paid, perhaps, to the development of the Internet, the expansion of the Internet, and the creation of Internet materials than has been paid to advanced searching and retrieval of Internet materials. As the Internet grows more complex, and if it is to serve as both a coherent and competent vehicle for information transfer and retrieval, it is imperative that search and retrieval strategies be developed. The purpose of this paper is to suggest that search techniques for the World Wide Web (WWW), together with commercially available software and online search engines, can be combined to improve search results.

It is critical to understand that there are a variety of search strategies, each designed to satisfy specific information needs and requirements. There are at least two general types: focused and satisficed. Search strategies and search tools will vary as information needs vary. A focused search seeks information meeting specifically defined criteria. A satisficed search seeks information adequate to satisfy an information need.

This paper considers briefly the Web environment in which searches take place. It then discusses the two general search approaches and some of the available options. It concludes with a call for authors, search engine designers, and searchers/retrievers to assist in the rationalization and standardization of the process.

Introduction
The World Wide Web as uncharted territory
Search tools
- Search parameters
- Tag types
Search strategies
Conclusions
References

Introduction

This paper addresses research performed at Information International Associates and at the University of Tennessee to develop a coherent search and retrieval methodology for two general approaches to Web searches. A great deal of attention has been paid to the development of the Internet, the expansion of the Internet, and the creation of Internet materials. If, however, that information and those materials are to be retrieved and used effectively, more attention must also be paid to advanced searching and retrieval of Internet materials.

The Internet, and with it the WWW, daily grows more complex. A brief description of the complexity of the WWW is necessary because that very complexity affects search/retrieval strategies. A number of Web characteristics are also described because those characteristics can be the subjects of search strategies. For example, the structure of URLs (uniform resource locators) can be used as search terms. This paper describes a number of techniques based on WWW characteristics and upon various search engines, metasearch engines, and other techniques we have explored.

It is also important to remember that search and retrieval needs vary and that any search strategy should be developed to satisfy the retrieval needs of the end user. There are at least two general search approaches, and the type of search needed depends upon the type of information sought. I have called these focused and satisficed WWW searches. A focused search is designed to find and retrieve a specific set of documents providing a specific set of information. For example, one might search for the review of specific software by a specific reviewer, at a specific time, in a specific format, in a specific online journal. On the other hand, a satisficed search results in hits that are "good enough." The specific source is of less importance than access to the information. If one is searching for the name of a national capital, for example, it matters little if that datum came from the online Encyclopaedia Britannica, the country's Lonely Planet travel page, the CIA World Factbook, or the home page of an undergraduate college student.

Techniques and strategies are suggested that address the underlying dynamics as well as the idiosyncrasies of the two search approaches. These include the use of search engines and metasearch engines, citation analysis, site mapping, link exploration, and URL shaving.

A number of commercial off-the-shelf software (COTS) and online search products are referenced throughout this paper. This mention does not constitute an endorsement of any such product. Moreover, given Internet product development dynamics, any or all of these products could be in the short term rendered obsolete. They are referenced as examples. Many products function well and perform similar tasks. The choice of which to use is often one of personal preference.

The World Wide Web as uncharted territory

This paper is written from the perspective of the information scientist, the "cyberlibrarian" if you will. Much effort has been expended to bring traditional library tools to understand, to catalogue, and, in the end, to organize the array of information available. Examples include work by the U.S. Library of Congress and by OCLC to develop and/or enhance the management of Internet information. The online directory search resources like Yahoo! are an extension of the concept. NetFirst is an explicit attempt by FirstSearch to employ library standards on the WWW.

This section is introductory. Much that is stated here has been elaborated in far more detail and far more eloquently elsewhere. These considerations have import for search and retrieval strategies; they define the environment in which the search retrieval process takes place. I offer them here because WWW searching is very much searching uncharted waters; it is an excursion into the unknown, too often with very inadequate navigation tools. Many of us also approach Web searching with preconceived notions. Some of those notions should be dispelled. I remind myself daily, for example, that Web pages and print pages are not the same thing: Web pages are print pages, only more so.

Web parallels with print

Is the WWW environment an extension of the "traditional environment?" By traditional environment, I mean paper as well as other local storage and transmission media: film, videotape, audio recordings, CD-ROM, etc.--"print" for short. If so, can the WWW be captured and used in the same way as the traditional world? As usual, the answer is "yes and no." Both print and the WWW transmit information. That information may be profound or perfidious, transitory or eternal, insightful or superficial. But there the similarity ends. Print tends to be "permanent" while the WWW is "ephemeral." Print works may be edited, revised, updated, reprinted, and so on. But the original remains intact alongside the changes.

This coupling of the old and the revised is not so true of the WWW. Once a document is changed or eliminated, the superseded original usually ceases to exist. Certainly, there are proposals to provide for updating and revision of Web documents while maintaining the integrity of the original. These all, by necessity, include archiving in some form. That is certainly theoretically feasible. It is, on its face, a simple proposal. Except that today we must speak in terms of terabytes of data, and tomorrow of googolbytes. Cataloging or creating records of WWW materials is much different from their print counterparts. First, as NetFirst and other Internet catalogs of the Internet demonstrate, citations to the record point not to representations of the record, but to the native document itself [1]. At the same time, the fact that the citation points to the native document compromises the citation. Once the native document is changed, the citation to that document may no longer be valid.

Web dimensions

Is the WWW growing at an exponential rate? Probably not, but its explosive growth creates problems for the searcher. New, as yet unindexed, and sometimes unlinked information is constantly being added by any number of formal and informal publishers. An effective search/retrieval strategy must address this material.

How big is the WWW? How much information is there? There are a variety of often inconsistent estimates using any number of definitions. There are also a number of good reasons to suggest that all estimates understate the actual size of the WWW, however measured. The WWW continues to grow. It was estimated in the last half of 1996 to consist of some 60 million documents [2] on 12 million hosts [3] and 600,000 servers [4], up from 9 million hosts and 250,000 servers at the beginning of the year.

Each discipline needs to capture the data that best describe the needs and requirements of that discipline and the public it serves. For example, the cyberlibrarian would be less interested in the server software supporting a site than in the server domain name of that site. Server software becomes an issue to the cyberlibrarian if that software in some way affects the search and retrieval of information located at the site.

Web currency

Many of the indexed search engines claim primacy of timeliness and of depth of coverage. Most claim to sweep the WWW periodically and frequently to update and renew their indexes and directories. Yet it is not uncommon to encounter "File Not Found" following a search session, suggesting that the renewal process may not be as frequent as desired. This, again, must be addressed and managed in any search/retrieval strategy.

One search engine reports the number of documents indexed, another the number of hosts its robot visits. However good the search engines, we do know that neither the indexed nor the directory search engines are or can ever be truly current. New documents, sites, and domains are constantly being added. Existing documents, sites, and domains are being modified and eliminated. There has yet to be much research on mortality and modification rates for Web "entities." One study [5] suggests that the life of all HTML objects is, on average, 44 days; text objects, 75 days; and images, 107 days. Tentative results of work I have in progress suggest that server-level domains have a mortality rate of about 10 percent per year, but that documents farther down the directory structure on those domains have much shorter half-lives. The farther down one goes, the shorter the half-life.
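The relationship between an annual mortality rate and a half-life can be made concrete with a short calculation. The sketch below assumes a constant rate of loss (exponential decay); that assumption, and the function name, belong to this illustration and are not findings of the work in progress.

```python
import math

def half_life_years(annual_mortality: float) -> float:
    """Years until half of a population survives, assuming a constant
    annual mortality rate (an assumption made for illustration only)."""
    if not 0.0 < annual_mortality < 1.0:
        raise ValueError("annual_mortality must be strictly between 0 and 1")
    survival = 1.0 - annual_mortality
    # Solve survival ** t == 0.5 for t.
    return math.log(0.5) / math.log(survival)

# A 10 percent annual mortality rate implies a half-life of roughly 6.6 years.
print(round(half_life_years(0.10), 1))
```

Under the same assumption, a shorter half-life for documents deeper in the directory structure corresponds to a higher annual mortality rate for those documents.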

Change is much more difficult to capture. Documents may change haphazardly or by design. That change may be subtle and in no way affect the message, or it may be substantial. As cyberlibrarians and end users, we may be less concerned with minute changes, but we must be cognizant of and respond to those substantial changes. Search strategies can be, and have been, developed to address those changes. COTS software is available to help map those changes. As of this writing, while that software is quite sophisticated, it is not sufficiently sophisticated to document or filter the significant content changes from the insignificant. I suspect that by the time this is read, that will have changed.

Search tools

The directory- and index-based search engines and metasearch engines provide an excellent first-level search and retrieval mechanism. However, each has its strengths and weaknesses. The purpose of this paper is not to examine and critique each of the search engines. By and large, the various search engines perform as advertised. If, for example, a search engine is limited to searching HTML title fields, it is not a criticism of the search engine that it cannot retrieve relevant documents when the document author has failed to appropriately use the title field. Similarly, the full-text indexed search engines cannot be faulted for failing to retrieve documents containing variant spellings, accents, and diacritical marks. The searcher must anticipate and employ appropriate terminology. The former problem may be considered "author error"; the latter, "searcher error." It is rather to suggest that search techniques, together with commercially available software, and online search engines can be combined to improve search results.

A combination of tools as well as search techniques is required to improve data retrieval. Other techniques include "URL shaving" (moving up the file structure by "shaving" file and subfile names from the URL) and "link exploration" (the purposeful surfing of links from one document to another in search of desired information).

Search parameters

As Internet users interested in search/retrieval issues, we should understand that it is possible in principle to search on any characteristic that can be differentiated among "knowledge products." That is, searching may be performed on any identifiable element in any document or record of the document. That information may be inherent in the document or added by an indexer. Some of those characteristics are implicit in the document creation process: the words (character strings) used and the order presented, the language the document is written in, the number of words, media type, and so on. Other characteristics may be provided later by indexers. These may be added descriptor and identifier fields. Or it may include preselection. These concepts are developed further below.

Those elements or tags may vary and the ability of any given search engine to search and retrieve on those tags may also vary. The search engines may rise to higher levels of sophistication by permitting searching on combinations of those tags. The more tag variety a search engine supports and the more combinations of tags it allows, the more useful the search engine for narrowing or defining search parameters. In general, the search engines provided by the online commercial database vendors perform more of these tasks than do the Internet search engines. The capabilities of these latter search engines, however, continue to expand.

Tag types

Document type is searchable and so is content. The term "tag" is used broadly here. There are two types of tags that may be attached to or be inherent in a Web document. There are those tags created by the author or publisher of the document and those created by the indexer of the document. Author-defined tags may result from an explicit decision by the author: use of keywords, selection of language, what links to make. Format and other decisions may result in implicit tag creation. Documents may be HTTP or Gopher. They may contain text, graphic, audio, visual, FTP, Telnet or mail links, or any combination thereof.

Indexer-defined tags are created after the document is written and frequently after the document has been posted to the Web. These tags may be human or machine generated. The directories use post hoc human indexers, while the search indexes employ spiders or robots to search and index. The concept of indexer is a broad one. Any post hoc activity that in any way serves to mark a document serves to index that document. These activities include not only traditional indexing functions but the use or pointing to of documents by others. A hypertext link or citation to a document is as much an index term as an index-added abstract or descriptor terms. Bibliometric citation analysis is a time-honored technique, and one supported by several Web search engines.

Finally, "hit statistics," the ultimate post hoc tag, might also be considered indexer tags. Some search engines and many webmasters publish hit statistics for the search engine or the server. These statistics list how often a page, site, or domain is accessed; when they are most frequently accessed; and by whom they are accessed. It is possible to identify "readers" at the IP address level, but most often these data are provided at the regional or national level. Thus, it is possible to identify "popular" Web documents or sites for search and retrieval purposes.

The following illustrates some of the tags that can be provided either by authors or indexers:

Author-Defined

Content: character strings, date, media type (and mix of media), language, author, spamindexing.
Structure: links, levels, domain, size, underlying software (Adobe, GIF, JPEG, etc.), mirrors, format (HTML, Gopher, newsgroups, listservs, etc.).
Access: password, browser preference, firewalls.

Indexer-Defined

Value added: indexer-added fields (descriptor, indicator, abstracts).
Preselection: limited area search engines, directories.
Other user-defined: links to, citations.

Search strategies

There is an extensive literature addressing strategies for searching the commercial online databases [6, 7]. Different vendors, e.g., Dialog, Westlaw, Nexis/Lexis, STN, offer sometimes very different search engines on top of the same or similar database arrays. Almost all database vendors offer access, for example, to ERIC, while others maintain proprietary databases. Each, however, has its own proprietary search engine or engines.

These commercial online database search engines share several attributes that their WWW counterparts do not. Most can subset and those subsets can be saved and iterated. Web engines cannot. Complex Boolean constructs with multiple nestings are supported, but not on the Web. The advanced versions of several of the Web engines permit limiting parameters (for example, date, location, media). The commercial databases offer more. Most of the commercial database engines allow the searcher the option of which field(s) to search: author, date, document type, document title, descriptor, indicator, etc. There are limited field search options supported by some Web search engines.

It must be stressed again that there is a key difference between Web search engines and commercial online search engines. Web engines search indexes or directories as do their commercial counterparts. But, those Web indexes or directories point directly to native, changing documents whose numbers vary from day to day and hour to hour. Access to those native documents, which are at the same time the "record," may be blocked for a variety of reasons. The host may be down. Power may have failed. Infrastructure destroyed [8]. There may have been political or economic interference. The native document may have been changed.

The commercial database engines sit atop records, records that are representations of the native document, but that are not the native documents themselves. The search engines point to the representations, which are stable. Moreover, the representations cite "stable" media. The document representations, or records, in commercial databases are standardized and follow strict guidelines. This is possible but highly improbable on the Web.

An extensive literature has been developed to describe the range of search strategies that the commercial database engines permit. That range is not nearly so large for the Web engines. However, there are approaches and techniques which can be employed by the searcher to improve recall, precision, pertinence, and relevance.

There are underlying assumptions searchers make concerning the quality of the return set in both focused and satisficed searches. The national capital example above can be used to illustrate the difference. Recall that several countries have moved their capitals since the 1950s (Brazil and Belize come to mind) and others are in the process of changing (Germany). New national capitals have proliferated. Several cities have lost that status. In a number of countries (South Africa, the Netherlands, Bolivia) the legislative and administrative capitals are split. In at least one case (Israel), the recognition of the location of the national capital is a matter of international dispute.

A satisficed search on "national capital" and Israel or South Africa may yield contradictory results. Results of the satisficed search will be accepted, however, if the source of the information is accepted as both authoritative and timely. A search on "United States" and "national capital" could yield three different "accurate" answers but with "timeliness" problems: Washington, Philadelphia, and New York. A reply of "Little Rock" might raise questions of authority.

A focused search would by necessity be structured differently. "According to official documents, what did the Palestinians claim Israel's capital to be in April 1995?" Or, "According to leading historians, where was the American capital in 1777?" Authority and timeliness are much more rigidly defined by the searcher or end user for a focused search than for a satisficed search.

All searches have in common the objective to return valuable material while limiting the false hit rate. Satisficed searches are designed to throw as broad a net as possible to return as much as possible about the subject, while limiting the number of false hits. By throwing the broad net, it is inherent in the satisficed search that false hits will be returned. A well crafted focused search should have a shorter list of relevant and pertinent documents and the number of false hits should equally be reduced. The "ultimate" but rarely achieved focused search has but one hit returned--the document sought--with no false hits whatsoever.
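The trade-off between a broad net and a short, accurate list is conventionally expressed as recall and precision. The following is a minimal sketch; the document identifiers and the two return sets are invented for illustration, and the function name is hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one return set.

    precision: share of returned documents that are relevant
    recall:    share of relevant documents that were returned
    """
    returned, relevant = set(returned), set(relevant)
    true_hits = returned & relevant
    precision = len(true_hits) / len(returned) if returned else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: four documents actually satisfy the information need.
relevant = {"d1", "d2", "d3", "d4"}
broad_net = {"d1", "d2", "d3", "d7", "d8", "d9"}   # satisficed-style return set
single_hit = {"d1"}                                # the "ultimate" focused search

print(precision_recall(broad_net, relevant))   # (0.5, 0.75)
print(precision_recall(single_hit, relevant))  # (1.0, 0.25)
```

The broad return set recovers more of the relevant material at the cost of false hits; the single-hit return set contains no false hits but leaves other relevant documents unretrieved.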

Finally, most search engines offer online documentation for their basic and advanced versions. Most of these, however, are inadequate. The following are too often unanswered: Which Boolean operator takes precedence in a search equation? How near is NEAR in a proximity search? How often are new materials added to the index or directory and how often is it weeded? How big is the index or directory and how representative of the WWW as a whole is it? In sum, does the index or directory contain the material sought, and if so, can it be retrieved? An index or directory full of dead documents should be avoided as should those whose search operators function none too well.
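Why operator precedence matters can be shown with a small sketch. The sets of document numbers below are invented, and the example describes no particular search engine; it only illustrates that the same unparenthesized query has two readings with different return sets.

```python
# Invented posting lists: the documents that contain each search term.
a = {1, 2}
b = {2, 3}
c = {3, 4}

# Query as typed by the searcher: a OR b AND c

# Reading 1: AND takes precedence, i.e. a OR (b AND c)
and_first = a | (b & c)

# Reading 2: strict left-to-right evaluation, i.e. (a OR b) AND c
left_to_right = (a | b) & c

print(sorted(and_first))      # [1, 2, 3]
print(sorted(left_to_right))  # [3]
```

Unless the documentation states which reading an engine applies, the searcher cannot know which of the two return sets has been delivered; explicit parentheses, where supported, remove the ambiguity.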

Satisficed searches

The online search engines and online and front-end metasearch engines can be employed in a satisficed search. The searcher should seek to limit, refine, or focus the first search iteration somewhat to avoid searching "everything." In subsequent search iterations, additional search terms can be added or deleted to better define the search pool. Again, it should be stressed that Web searches cannot be set; therefore, each search is a unique event unassociated with those which preceded it.

Search engines

Most online search engines provide good if not excellent satisficed service--the very premise that underlies the search directories. If the search engines can be faulted, it is not only that most return too much, but that they also return too little. If one performs the same search on several different search engines, one is struck by the vastly different return sets offered by those engines. There are, of course, as many reasons for the different return sets as there are search engines: different indexes or directories searched, different search algorithms, different interpretation of Boolean and other operators, and so on.

The search engines may also present the hit sets in different ways. Many seek to offer relevance ranking, but the ranking methods differ. Some return textual material hits only, others include graphics. The Web engines also differ in the amount of the identified return set they report. Some search engines permit access to all hits, others place a cap on the number that the searcher can touch. Thus, one search engine may report that, for example, 10,000 documents meet the search criteria, but only allow the searcher to see and access the first 1,000. For most searchers and most searches, the first 1,000 hits may be more than sufficient to meet most search needs, but not always.

The search indexes tend, however, to overwhelm the user with irrelevant recall or false hits. There are any number of potential sources for false hits. As the number of languages and the number of Web pages written in that variety of languages increase [9], the number of false hits, or noise, will increase. For example, search on the country "Reunion." A search on that term will bring back a multitude of hits for family, school, military unit, and other reunions. One option to eliminate or reduce the noise is to couple the term, in this case "Reunion," with another using the Boolean "AND." That, however, may have the deleterious effect of sharply reducing the return pool by excluding relevant material.
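The effect of coupling terms with "AND" can be sketched with a toy full-text index. The four document texts and all names below are invented for illustration; no real search engine is modeled.

```python
import re

# Invented miniature document pool.
DOCS = {
    "d1": "Smith family reunion photos",
    "d2": "Class of 1986 school reunion",
    "d3": "Reunion island travel guide",
    "d4": "Reunion: French overseas department in the Indian Ocean",
}

def tokens(text):
    """Lowercased alphabetic words found in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search_and(*terms):
    """Return the ids of documents containing every term (Boolean AND)."""
    wanted = {term.lower() for term in terms}
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if wanted <= tokens(text))

print(search_and("reunion"))            # ['d1', 'd2', 'd3', 'd4']
print(search_and("reunion", "island"))  # ['d3']
```

Adding "island" removes the two false hits (d1, d2) but also excludes d4, which concerns the same place without using the second term.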

The limited area search engine (LASE) is one possible response to reducing both false hits and the overall size of the search pool. The LASE is a precoordinate subsetting mechanism designed to reduce the searchable pool, and therefore the search engine index, to only those Web documents determined based on pre-established standards to meet inclusion criteria. It only indexes and searches pre-selected Web documents -- for example, the ANANZI search engine is restricted to material in or about South Africa. There are thousands of LASEs on the Web, and they are used to search for everything from pornographic content to academic literature in the classics.

The growth in the number of Web documents in languages other than English will pose new challenges. I have argued elsewhere that English dominates the Internet because most Internet material emanates from English speaking places. It is also clearly the second language of the Internet [9]. But, English is not the only language of the Internet. These challenges will cause us to address not only searching on accent and diacritical marks, but also on non-roman characters.

Metasearch engines

One solution to the problem of multiple search engines is to use metasearch engines, engines that feed search terms to multiple online search engines. These metasearch engines can be found both on the WWW or as front-end, hard-disk resident COTS.
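The basic mechanism of a metasearch engine, feeding one query to several engines and merging the return sets, can be sketched as follows. The two "engines" are stand-in functions that return fixed lists, and every URL and name is hypothetical; real metasearch products differ in how they merge and rank.

```python
from itertools import zip_longest

def metasearch(query, engines):
    """Send one query to each engine and interleave the ranked results,
    keeping only the first occurrence of each URL."""
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    # Walk the lists rank by rank; shorter lists are padded with None.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Stand-in engines (hypothetical): each ignores the query and
# returns a fixed ranked list.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]

def engine_b(query):
    return ["http://shared.example/x", "http://b.example/2",
            "http://b.example/3"]

print(metasearch("national capital", [engine_a, engine_b]))
```

The document reported by both engines appears once in the merged list, which is one way the different return sets of the individual engines can be reconciled.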

I often monitor sites or repeat searches. As a result, the online metasearch engines are not particularly useful for me, since they cannot support search refinement or modification and because they can only feed the "basic" versions of their listed search engines. These metasearch engines also cannot be programmed to run during off-peak hours, nor can they be programmed to repeat searches at specified times. They, like their front-end cousins, tend also not to support Boolean, temporal, or proximity operators. Set building is not possible. But for one-time-only satisficed searches, the online metasearch engines are more than satisfactory.

A number of front-end metasearch engines have been introduced recently, among them Surfbot, WebCompass, WebFerret, and WebSeeker. These can be used to improve relevance, manage recall, as well as maintain a currency check. These and other software packages can also be used to develop local archives or document pools for offline searching and evaluation.

For complex and continuing searches, these packages can be combined to provide the researcher with practical data management schemes. WebCompass 2.0, for example, has a "blocking" feature. Once a URL is deleted from a return set, it remains blocked. One can, for example, search a subject, select from the return set those documents that may be relevant, store those documents in HTML format for later search and evaluation, delete the return set, and repeat the search using the same (or modified) terms later. Periodically repeating the search then uncovers new material that was not previously retrieved.
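The logic of blocking across repeated searches can be sketched in a few lines. This is an illustration of the idea only: the class, its methods, and the URLs are hypothetical and do not represent the WebCompass interface.

```python
class BlockingSearch:
    """Repeatable search that hides URLs the searcher has already deleted."""

    def __init__(self, search_fn):
        self.search_fn = search_fn   # any callable: query -> list of URLs
        self.blocked = set()

    def run(self, query):
        """Run the search and drop every blocked URL from the return set."""
        return [url for url in self.search_fn(query)
                if url not in self.blocked]

    def delete(self, urls):
        """Delete URLs from the return set; they remain blocked afterward."""
        self.blocked.update(urls)

# Hypothetical "Web" that gains a document between two search sessions.
web = ["http://example.org/a", "http://example.org/b"]
session = BlockingSearch(lambda query: list(web))

first = session.run("topic")       # both documents are returned
session.delete(first)              # evaluated and stored locally, then deleted
web.append("http://example.org/c") # a new document appears
second = session.run("topic")      # only the new document is returned

print(first)
print(second)
```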

Focused searches

If the metasearch engines are the vehicle for satisficed searching, the advanced versions of the search engines are appropriate for focused searches. Focused searches are designed to limit the sphere or arena searched by the search engine. By using these limiting techniques, the creating of search sets can be approximated. Searches can be made more focused using the advanced versions of several of the full-text index-based search engines. These include AltaVista Advanced, HotBot Expert, and Open Text Power.

It is neither appropriate nor necessary to provide instructions in the specific use of these or other search engines here. Suffice it to say that all three support Boolean, proximity, and temporal operators. Each of the three supports searching on domain name fragments. They can also discriminate, to varying degrees, among media types. The ability to discriminate among domain names may contribute most to each of these search engines' ability to set. Quasi-set building can be approached in HotBot, for example, by setting the geographic option to top-level functional and geographic (ISO 3166) domains and to second-level and lower domain names. Open Text provides a drop-down menu to accomplish the same end. AltaVista uses the "URL:<domain name fragment>" syntax. Thus, searches can be limited to specific domains, ranging from the top-level domain through the server level.

A negative variation can be used ("Boon't" from Boolean Not) in AltaVista and HotBot. Boon't allows the searcher to specify a domain not to be included in the search. Care must be taken with Boon't searching, since the search engine will exclude documents containing the domain name fragment anywhere.
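The caution about Boon't searching can be illustrated by comparing two ways of excluding a domain fragment from a list of URLs. The sketch assumes, for illustration, that the fragment is matched against the URL string; the function names and URLs are hypothetical and do not reproduce any engine's actual matching rules.

```python
from urllib.parse import urlsplit

def in_domain(url, fragment):
    """True if the URL's host equals the fragment or ends with '.' + fragment."""
    host = (urlsplit(url).hostname or "").lower()
    fragment = fragment.lower().lstrip(".")
    return host == fragment or host.endswith("." + fragment)

def exclude_anywhere(urls, fragment):
    """Exclude every URL containing the fragment anywhere (the pitfall)."""
    return [url for url in urls if fragment.lower() not in url.lower()]

def exclude_by_host(urls, fragment):
    """Exclude only URLs whose host falls within the named domain."""
    return [url for url in urls if not in_domain(url, fragment)]

urls = [
    "http://www.example.com/report.html",
    "http://www.example.org/archive/about.com.html",
    "http://www.example.pe/gobierno.html",
]

print(exclude_anywhere(urls, ".com"))  # also drops the .org document
print(exclude_by_host(urls, ".com"))   # drops only the .com host
```

The first function discards the document on the .org server because ".com" happens to occur in its file name; the same host test, used positively, gives the domain limiting described above.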

Because the full-text index search engines build indexes of keywords as they appear in the search documents, searches can be tailored by using appropriate spellings, accent marks, diacritical marks, alphabets, symbols, and other language constructs. A search on Peru and Perú will yield the same results in a directory, but very different ones in a full-text index. Thus, if I were searching for official Peruvian government documents, I might set the geographic limiter in the search engine to ".pe," the ISO 3166 top-level domain name for Peru, and search on Perú and gobierno, as well as other appropriate search terms. For Uruguay, because they follow a second-level standard, I could further subset my search by entering ".gub.uy" as a functional and geographic limiter.
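The difference between accent-sensitive and accent-insensitive matching can be sketched with Python's standard unicodedata module. The two document texts are invented, and the assignment of exact matching to a full-text index and folded matching to a directory is a simplification made for this illustration.

```python
import unicodedata

PERU_ACCENTED = "Per\u00fa"   # "Peru" written with u-acute

# Invented documents.
DOCS = {
    "doc1": "Gobierno del " + PERU_ACCENTED + ": documentos oficiales",
    "doc2": "Travel tips for Peru and Bolivia",
}

def fold_accents(text):
    """Strip combining marks so accented and plain letters compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def exact_search(term):
    """Match the character string as written (accent-sensitive)."""
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if term.lower() in text.lower())

def folded_search(term):
    """Match after removing accents from both the term and the text."""
    needle = fold_accents(term).lower()
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if needle in fold_accents(text).lower())

print(exact_search(PERU_ACCENTED))   # ['doc1']
print(exact_search("Peru"))          # ['doc2']
print(folded_search(PERU_ACCENTED))  # ['doc1', 'doc2']
```

Exact matching lets the accented spelling act as a language filter, which is what makes it useful when seeking Spanish-language government documents.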

Other search techniques

There are several other techniques that are used to search the Internet. These are important tools because no matter how effective the search engines are or become, there will always be Web material not yet accessed or indexed by those search engines. These techniques can be used to improve the recall of that unindexed data.

Some of the techniques are relatively new and are based on new COTS. Others are "time tested" and ubiquitous. Neither link exploration nor URL shaving is a new technique. All WWW searchers have employed both of these at one time or another. But both are useful techniques for search and retrieval strategies. These, together with citation analysis and site mapping, permit the searcher to find material not yet included in the search directories or indexes. This has the effect of increasing the potential overall recall of relevant materials.

Citation analysis

Citation analysis is a bibliometric technique used to map the impact of one intellectual product on another. In principle, if a work is cited by another work, the first is perceived to have had an influence on the other. The use of citation analysis in the print world is well established. In general, more recent documents cite older ones, at least in the print world. In the cyberworld, older documents can cite newer ones with the addition of a hypertext link. That may confuse somewhat the apparent order of influence and the evolution of ideas.

Citation analysis can be used by the searcher in two ways. The first is the traditional cycle of influence. The second application is to use citation analysis to find more documents "like" the one at hand. This is based on the assumption that document authors will point to like documents and cite those other documents as justification for their findings and assumptions. At least two search engines, WebCrawler and AltaVista, provide mechanisms for what WebCrawler terms "backward searching." The two search engines, once the appropriate syntax is used, will find other documents that provide "out-links" to the specified document.
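Backward searching amounts to inverting a table of out-links so that one can ask which documents point to a given document. The following is a minimal sketch over an invented link table; it does not reproduce the WebCrawler or AltaVista syntax, and all names are hypothetical.

```python
from collections import defaultdict

def invert_links(out_links):
    """Turn {source: [targets]} into {target: {sources}}."""
    in_links = defaultdict(set)
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].add(source)
    return in_links

# Invented link table: each page and the pages it links to.
out_links = {
    "pageA": ["target", "pageB"],
    "pageB": ["target"],
    "pageC": ["pageA"],
}

citing = invert_links(out_links)
print(sorted(citing["target"]))  # ['pageA', 'pageB']
```

The pages found this way are candidates for documents "like" the one at hand, on the assumption stated above that authors point to like documents.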

Link exploration

Link exploration is a more purposeful, more focused use of Internet surfing. It is a variant of citation analysis. The assumption is made that following links from a document can lead to additional like documents or other valuable information. The searcher follows links from the target document to other related documents of interest and may move through any number of Web pages. The technique differs from surfing in that the process is explicitly nonrandom. Only those links that, for whatever reason, appear to offer the potential for useful information are followed. Should the vein be in vain, the searcher can return to the original page and begin the process again.
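The nonrandom character of link exploration can be sketched as a traversal that follows only those links judged promising. The link table, the anchor texts, and the test for "promising" below are all invented for illustration; in practice that judgment is the searcher's own.

```python
from collections import deque

def explore(start, links, looks_promising, max_depth=2):
    """Follow only promising links outward from a starting page.

    links maps a page to a list of (anchor_text, target_page) pairs.
    Returns the pages visited, in the order they were reached.
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the chosen depth
        for anchor_text, target in links.get(page, []):
            if target not in seen and looks_promising(anchor_text):
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited

# Invented pages and links.
links = {
    "home": [("Peru documents", "peru"), ("Sports scores", "sports")],
    "peru": [("Peru ministries", "ministries")],
    "sports": [("Peru football", "football")],
}

path = explore("home", links, lambda text: "peru" in text.lower())
print(path)  # ['home', 'peru', 'ministries']
```

The "football" page is never reached because the link leading to it sits on a page the searcher chose not to visit; this is the sense in which the process differs from surfing.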

This process and URL shaving are more important in the search process than may first appear. They can lead, and have led, to the discovery of Web pages of value to the researcher that are not yet indexed by the search engines. As a general rule, I find the link exploration process to be most fruitful after extensive use of the search and metasearch engines and after I am reasonably well acquainted with the body of literature captured. Once I am thus sensitized, the exploration process tends to yield better results.

URL shaving

URL shaving is a straightforward and precise process. The typical URL takes the form "http://aaa.bbb.ccc/xxx/yyy/zzz.htm". The aaa.bbb.ccc portion of the example URL is the domain name at the server level. The /xxx/yyy/zzz.htm portion is the directory/file structure of the Web document on that server. By shaving off slices between the single slashes, the searcher can move up the directory structure, ultimately to the server home page. By shaving up the structure, it is possible to explore all documents on the server as well as to link trek from those documents.
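The shaving steps can be generated mechanically. The following is a minimal sketch using the example URL above; the function name `shave_url` is hypothetical, and the sketch assumes the server answers at each directory level, which is not always the case in practice.

```python
from urllib.parse import urlsplit, urlunsplit

def shave_url(url):
    """Return the URL followed by each shaved ancestor, ending at the home page."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    shaved = [url]
    while segments:
        segments.pop()  # shave one slice off the right-hand end
        path = "/" + "/".join(segments)
        if segments:
            path += "/"  # directory levels end in a slash
        shaved.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return shaved

for candidate in shave_url("http://aaa.bbb.ccc/xxx/yyy/zzz.htm"):
    print(candidate)
```

Each URL in the list is a candidate to visit in a browser, working from the original document up toward the server home page.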

Domain shaving is a variant of URL shaving and requires the use of search engines. The domain name can be shaved from the server name upward, level by level, to the top-level domain. In those search engines that support the technique, these fragments can be searched to identify other subdomains of potential interest to the searcher.
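The fragments produced by domain shaving can be listed in the same way. This is a minimal sketch with a hypothetical function name, `shave_domain`; the syntax for submitting each fragment to a search engine differs from engine to engine and is deliberately not shown.

```python
def shave_domain(hostname):
    """Return the host name followed by each shorter fragment, ending at the top level."""
    labels = hostname.lower().strip(".").split(".")
    # Drop one label at a time from the left-hand (server) end.
    return [".".join(labels[i:]) for i in range(len(labels))]

print(shave_domain("aaa.bbb.ccc"))
```

Each fragment would then be used as a domain or host search term in an engine that supports such searches.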

Site mapping

Site mapping is similar to URL shaving, in that the structure of the Web site is explored. Rather than moving up the file structure by shaving pieces from the URL in a browser, a map or diagram of the site is created using a software application. Several COTS products are available. They provide a variety of map styles and depths of detail. Selection of mapping software should be based on specific searcher requirements.

The simplest and least expensive choice provides a list of all documents found under the target or "propositus" document. Propositus comes from the language of genealogy and means the center or target person, the one to whom and from whom the genealogical chart is drawn. This list usually includes the page title, size, and URL. Mosaic 2.0 will generate a list on the fly.

A second, more complex map provides the tree structure and page type as well as a list. Frequently, the last page modification date is given. CyberPilot and ClearWeb perform these functions.
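The relationship between a flat list of documents and a tree-structured map can be sketched as follows. The example assumes the list of URLs has already been gathered and that the tree simply mirrors the directory structure; the function names and URLs are hypothetical, and the sketch does not represent how CyberPilot or ClearWeb build their maps.

```python
from urllib.parse import urlsplit

def build_site_tree(urls):
    """Nest the path segments of each URL into a dictionary tree."""
    tree = {}
    for url in urls:
        node = tree
        for segment in [s for s in urlsplit(url).path.split("/") if s]:
            node = node.setdefault(segment, {})
    return tree

def render_tree(tree, indent=0):
    """Return the tree as indented lines, one directory or file per line."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render_tree(tree[name], indent + 1))
    return lines

urls = [
    "http://aaa.bbb.ccc/xxx/yyy/zzz.htm",
    "http://aaa.bbb.ccc/xxx/index.htm",
]
print("\n".join(render_tree(build_site_tree(urls))))
```

A fuller map would attach the page type and last modification date to each entry, as the products described above do.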

A third, still more complex presentation offers lists, tree structures, and ring maps of Web documents. WebAnalyzer and FlashSite offer ring maps.

Many of the more sophisticated products will download and store Web documents locally. They also act as URL maintainers, in that they will automatically revisit Web pages and report on changes, if any, to those pages since the previous update.

A detailed site map can be an invaluable tool to the Web researcher. The map provides the researcher with a diagram as well as a list of document contents, size, location, last update, media type, and other pertinent information. The researcher is provided with a context, and may select Web documents based on specific criteria. For example, the researcher may have a need for graphic-rich resources or for extensive textual material. The site map can provide useful guidance in the selection process.

Conclusions

The Internet continues to grow and to grow more complex. Most attention has "traditionally" been focused on its engineering rather than on search and retrieval issues. I suggest that as the Internet grows more complex, it is necessary to turn our attention to enhancing systems to retrieve that information as well as to maintaining excellence in the technical sector.

The number of tools available to search and retrieve information from the Internet, and particularly the WWW, continues to increase. Each new generation of tools also increases in sophistication and flexibility. No matter how sophisticated those tools become, they are no better than the strategies employed to use them and the structural integrity of the material searched.

Two generic types of search strategies were suggested: focused and satisficed. The search tool selected depends in large part on the strategy to be followed. In general, the index and directory search engines provide excellent satisficed service. The quality of that service can be improved using the metasearch engines.

Focused searches require a more focused approach. The full-text advanced versions of several online search engines support quasi-set building. Each of the search engines offers different functions, and searchers should vary search engines as search needs vary.

URL shaving, link exploration, and site mapping are suggested as additional search approaches. By combining these various tools, the outcome of searches can be improved.

A final note. Index search engines rely on numerous tags to create their search indexes. I would urge authors and publishers to incorporate as many standardized tags as feasible to facilitate searches. I would also urge search engine developers to expand the number of tags their search engines will search. Searchers should test the limits of available search engines and metasearch engines and share new insights with the Internet public.

References

[1] Koehler, Wallace, and Danielle Mincey, "FirstSearch and NetFirst Web and Dialup Access: Plus ça change, plus c'est la même chose?" Searcher, 4, 6 (June 1996), pp. 24-28.

[2] HotBot provides an estimate of the number of documents it accesses at http://www.hotbot.com.

[3] Network Wizards, Host Distribution by Top-Level Domain Name, http://www.nw.com/zone/WWW/dist-bynum.html.

[4] The Netcraft Web Server Survey, http://www.netcraft.co.uk/survey/.

[5] Chankhunthod, Anawat, Peter B. Danzig, Chuck Neerdaels, Michael F. Schwartz, Kurt J. Worrell, Object Lifetimes, in A Hierarchical Internet Object Cache, 1995. http://excalibur.usc.edu/cache-html/cache.html.

[6] Dalrymple, P., and N. Roderer, "Database Access Systems," Annual Review of Information Science and Technology, 29, 1994, pp. 137-178.

[7] Tenopir, Carol, "The Same Databases on Different Online Systems," Library Journal, 116 (May), 1991, pp. 59-60.

[8] Swank, Kris, Susan Lubbe, and Lesley Heaney, "Introducing NIT to an Historically Disadvantaged Institution in South Africa." Proceedings of the 9th Information Conference, New Information Technology. Pretoria, 11-14 November 1996. Ching-chih Chen, ed., West Newton, Massachusetts: MicroUse Information, 1996, pp. 283-292.

[9] Koehler, Wallace, "A Descriptive Analysis of Web Document Demographics: A First Look at Language, Domain Names, and Taxonomy in Latin America." Proceedings of the 9th Information Conference, New Information Technology. Pretoria, 11-14 November 1996. Ching-chih Chen, ed., West Newton, Massachusetts: MicroUse Information, 1996, pp. 159-170.
