
The Effect of Centroid Size on Precision and Recall of Web Search Engines

Pedro GONÇALVES <pfg@di.ufpe.br>
Jacques ROBIN <jr@di.ufpe.br>
Thiago SANTOS <tlvls@di.ufpe.br>
Oscar MIRANDA <ogm@di.ufpe.br>
Silvio MEIRA <srlm@di.ufpe.br>
Universidade Federal de Pernambuco
Brazil

Abstract

We present a controlled experiment that measures the effect, on retrieval effectiveness and efficiency, of varying the maximum number of keywords used to represent documents in Web search engines. We define a general methodology to quantitatively evaluate the impact of a given design option on the retrieval effectiveness-efficiency trade-off and apply it to the particular case of centroid size. The resulting data suggests that, in the general case, the most appropriate centroid size is 100.

Introduction

The exponential growth of the Web makes the development of powerful, user-friendly tools to cope with information overload more crucial every day. The most widely used of such tools, and perhaps the only ones to have made inroads into the everyday life of the general public, are Web search engines.

While these engines may vary significantly in the details of the Information Retrieval (IR) techniques they use, the great majority of them nonetheless share the same basic principles. One such principle is the use of keyword sets to both

  1. Represent the content of each document (i.e., Web page) at indexing time; and
  2. Represent the content of each information need at querying time.

The engine then returns a given document for a given query if their respective keyword sets match. Another pervasive principle is the ranking of retrieved documents by decreasing (assumed) relevance. Ranking allows the user to quickly skim search results, instead of exhaustively checking them. It thus improves the engine's practical precision, i.e., the proportion of relevant documents among the highest ranked retrieved ones.

The two principles just mentioned are independent of both

  1. The particular IR techniques used to select the representative keywords (e.g., Term Frequency-Inverse Document Frequency [Paijmans 94]); and
  2. The particular Natural Language Processing (NLP) techniques (if any) used to abstract the selected keywords into more conceptual representations (e.g., morphological reduction to root words and query expansion by thesaurus look-up).

Since keyword-based, ranked approaches are so pervasive in search engines, an almost universal design issue is to define the size of the keyword set, or centroid, used to represent documents in the index base.

At one end of the size spectrum, the centroid can be limited to a few words extracted from the document (e.g., the content words in the document title). At the other end, it may consist of all the document's content words. Nouns, adjectives, and verbs are called content words because they alone convey almost all the information in a text. They contrast with function words such as articles, pronouns, prepositions, conjunctions, and auxiliaries, whose presence is merely required by syntax. Since these function words contribute virtually no information, IR systems typically include all of them in a stop-list of nondiscriminating words to exclude from centroids.
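
To make the distinction concrete, the following Python fragment (a minimal illustration, not the code of any particular engine) filters a document's tokens through a stop-list to obtain its content-word candidates; the stop-list shown is a deliberately tiny placeholder.

    import re
    from collections import Counter

    # Deliberately tiny illustrative stop-list of function words; a real engine
    # would use a much longer, language-specific list.
    STOP_LIST = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "are"}

    def content_words(text):
        """Return the content-word tokens of a document, excluding stop-list words."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [t for t in tokens if t not in STOP_LIST]

    # The two ends of the centroid-size spectrum:
    #   - title-only centroid: content_words(title)
    #   - full centroid: all distinct content words of the page
    page = "The career of the Brazilian driver is described in the following pages."
    print(Counter(content_words(page)).most_common())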

Intuitively, the larger the centroid, the more queries it will match, and thus the higher the engine's recall, i.e., the proportion of relevant documents it manages to retrieve. On the other hand, the larger the centroid, the larger the database needed to index the same number of documents and the more expensive each retrieval. Choosing a centroid size is thus a trade-off between effectiveness and efficiency.

In this paper, we present empirical data on which to base this trade-off. This data was obtained from a controlled experiment measuring the effect of centroid size on

  1. Retrieval effectiveness, taking into account both precision and recall; and
  2. Computational cost, taking space into account.

We first present a general methodology to quantify the contribution of a given design option to the retrieval effectiveness and computational cost of IR in very large document collections. We then apply this methodology to the particular case of centroid size and discuss the results obtained. In other papers [Gonçalves-Robin 98a], [Gonçalves-Robin 98b], we reuse the same methodology to quantify the trade-offs involved in other search engine design options, in particular the incorporation of various shallow NLP techniques.

The present experiment provides insights for both

  1. Search engine developers trying to make the most of the computational resources available within their budget; and
  2. Computer scientists creating the novel Web-based information system designs needed to prevent the emergence of computational bottlenecks on the Internet.

General evaluation methodology

Our general methodology to quantify the effect on Web search of a given IR system design option consists of the following steps:

  1. Run an indexing robot on a subset of the Web, representative of pages frequently visited by average Web surfers, to build a control index base;
  2. Derive from this control index base a set of test index bases, one for each option being compared;
  3. Have a set of Web surfing subjects write a set of information needs (in natural language) for which they will search the Web;
  4. Have these subjects write up to Nq queries for each of their information needs;
  5. Run the search engine with each different index base, keeping statistics on the size of each base and the run time of each search;
  6. From those statistics, compute the average space and time requirements of each search set;
  7. Keep the Mr top-ranked results of each search;
  8. Have the subjects partition these results into relevant pages and irrelevant pages, for each information need;
  9. From the cardinalities of these partitions, compute the precision, recall, and F-measure of each search, using the formulas at the top of Fig. 1;
  10. Compute the averages of these three parameters over each search set by instantiating the formula at the bottom of Fig. 1 with parameter = precision, recall, and F-measure, respectively.

Set definitions:

T = set of search types, including the control search and the various test searches
I = set of all information needs
Q(i) = set of all queries for information need i
retrieved(t,i,q) = set of all retrieved documents for search type t, information need i, and query q
relevant(t,i,q) = subset of retrieved(t,i,q) marked as relevant by the evaluating subject

Parameter definitions:

Fig. 1. Defining formulas of information retrieval parameters used by evaluation methodology
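
The defining formulas themselves appear only in the original figure. Restated from the textual definitions above and from the relative-recall and cut-off discussions below, and assuming equal weighting of precision and recall in the F-measure and a straightforward average over all (information need, query) pairs, they amount to the following LaTeX sketch:

    % Per-search parameters for search type t, information need i, and query q
    % (a reconstruction from the textual definitions, not a facsimile of Fig. 1).
    \[
      \mathit{precision}(t,i,q) = \frac{|\mathit{relevant}(t,i,q)|}{|\mathit{retrieved}(t,i,q)|}
      \qquad
      \mathit{recall}(t,i,q) = \frac{|\mathit{relevant}(t,i,q)|}{\bigl|\bigcup_{t' \in T} \mathit{relevant}(t',i,q)\bigr|}
    \]
    \[
      F(t,i,q) = \frac{2 \cdot \mathit{precision}(t,i,q) \cdot \mathit{recall}(t,i,q)}
                      {\mathit{precision}(t,i,q) + \mathit{recall}(t,i,q)}
    \]
    % Average of a parameter over the search set of type t
    % (instantiated with parameter = precision, recall, or F-measure):
    \[
      \overline{\mathit{parameter}}(t) =
        \frac{1}{\sum_{i \in I} |Q(i)|}
        \sum_{i \in I} \sum_{q \in Q(i)} \mathit{parameter}(t,i,q)
    \]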

The formulas of Fig. 1 are customized to the particular problem of searching very large document collections. They differ from classic IR formulas in two main ways: relative recall and retrieval cut-off.

Relative recall

The size of very large document collections precludes relying on repeated manual inspection of the entire collection, once for each information need used in the experiment. This makes documents not retrieved by any query de facto inaccessible for evaluation purposes. It thus precludes measuring absolute recall, which involves counting the relevant but not retrieved documents. However, for purely comparative studies such as ours, the relative form of recall defined in Fig. 1 can be used instead, as suggested by [Tague-Sutcliffe 92]. It measures the proportion of relevant documents retrieved with one given design option among all relevant documents retrieved with any design option.
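
To make the pooling concrete, here is a minimal Python sketch of relative recall, under the assumption that relevance judgments are pooled per (information need, query) pair across all compared design options; the function and variable names are illustrative, not Bright's API.

    def relative_recall(relevant_by_type, search_type):
        """Relative recall of one design option (search_type) for a single
        (information need, query) pair. relevant_by_type maps each compared
        design option to the set of retrieved documents judged relevant."""
        # Pool together every document judged relevant for *any* compared option.
        pool = set().union(*relevant_by_type.values())
        if not pool:
            return 0.0
        return len(relevant_by_type[search_type]) / len(pool)

    # Example: three centroid sizes compared on one query.
    judged = {20: {"urlA"}, 60: {"urlA", "urlB"}, 100: {"urlA", "urlB", "urlC"}}
    print(relative_recall(judged, 20))   # 0.33...
    print(relative_recall(judged, 100))  # 1.0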

Retrieval cut-off

Using relative recall limits the evaluation to the set of automatically retrieved documents. However, an IR system with decent recall running on a very large document collection may still return too many documents for exhaustive repeated manual examination. This precludes measuring absolute precision.

However, for IR systems that rank retrieved documents, limiting the documents considered in the evaluation to the Mr best matches of each query yields a good approximation of precision. The precision formula of Fig. 1 assumes the use of such a cut-off.

F-measure

The F-measure is a popular combination of precision and recall into a single parameter. It is used in major evaluation conferences (including the Message Understanding Conferences [MUC]) and has many properties, discussed in detail in [Van Rijsbergen 79], that make it an attractive measure of overall IR effectiveness. These properties include

  1. The fact that it goes up with precision and/or recall only when the two remain close together, thus giving low scores to design options that trivially obtain high precision by sacrificing recall or vice versa; and
  2. The fact that, in its original definition, using absolute recall and precision and giving them equal weight, the F-measure corresponds to the size of the intersection between the relevant document set and the retrieved document set, normalized by the cumulated size of both, i.e., F = 2 |relevant ∩ retrieved| / (|relevant| + |retrieved|).

Since an ideal IR system retrieves only relevant documents and all of them, this intersection is precisely what any IR system attempts to maximize.

We opted to keep the equal-weight definition of the F-measure, with relative recall and approximate precision, because while recall values are boosted by the relative definition, precision values are likewise boosted by the use of the retrieval cut-off.

Centroid experiment

We used the general evaluation methodology just presented to measure the effect of centroid size on IR effectiveness and space requirement. Effectiveness was measured in terms of precision, recall, and F-measure, while space requirement was measured as the total number of entries in the index base.

We ran the experiment using Bright (BRazilian Internet Guide in HyperText, http://www.bright.org.br), our Java/JDBC software integrating a Web indexing robot, a search engine, and an evaluation environment to carry out IR measurements. The current version of Bright has the following characteristics:

  1. It uses term frequency to pick the words of the centroid;
  2. Its index base consists of triplets containing (1) a keyword K member of the centroid of a document D, (2) the number of occurrences of K in D, and (3) the URL of D;
  3. It allows queries in the form of individual keywords joined by either disjunctions or conjunctions; and
  4. It uses the number of query keyword occurrences to rank retrieved documents (see the illustrative sketch after this list).
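
The following Python sketch illustrates this scheme: a term-frequency centroid of at most max_centroid keywords per page, an index base of (keyword, occurrences, URL) triplets, and conjunctive queries ranked by the number of query keyword occurrences. It is an illustration only, under those stated assumptions; Bright itself is a Java/JDBC system and its actual code is not reproduced here.

    from collections import Counter

    def build_index(pages, max_centroid=100, stop_list=frozenset()):
        """Index base as a list of (keyword, occurrences, url) triplets; the
        centroid of each page is its max_centroid most frequent content words."""
        index = []
        for url, text in pages.items():
            counts = Counter(w for w in text.lower().split() if w not in stop_list)
            for word, occurrences in counts.most_common(max_centroid):
                index.append((word, occurrences, url))
        return index

    def search_and(index, keywords):
        """Conjunctive query: return URLs whose centroids contain every keyword,
        ranked by the total number of query keyword occurrences."""
        per_url = {}
        for word, occurrences, url in index:
            if word in keywords:
                per_url.setdefault(url, {})[word] = occurrences
        hits = {url: sum(occ.values()) for url, occ in per_url.items()
                if len(occ) == len(keywords)}
        return sorted(hits, key=hits.get, reverse=True)

    pages = {"u1": "brazilian drivers brazilian racing", "u2": "brazilian cooking recipes"}
    print(search_and(build_index(pages), {"brazilian", "drivers"}))  # ['u1']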

The document collection for the experiment consisted of 87,000 Web pages sampled from 4,600 Brazilian Internet servers. It contained 3.9 million occurrences of 253,054 content words, with an average of 44 different content words per page. We set Nq to 10 and Mr to 20 and had 5 subjects write a total of 267 queries for 37 information needs. An example information need used in the experiment (translated from Portuguese), together with its corresponding queries, is given in Fig. 2. The result evaluation required checking 1,984 pages for relevance.

Starting from the observation that less than 0.03% of the collection's pages contained more than 100 different content words, we ran the queries for centroid sizes of 100, 80, 60, 40, and 20.

Title: Brazilian race drivers
NL Description: Document describing the career or providing recent news about several present or past Brazilian race drivers.
Queries
& Brazilian Formula 1 drivers
& Brazilian F1 drivers
& Brazilian auto racing
& Brazilian motorsport
& Brazilian Indy drivers
& Brazilian CART drivers
& Senna Piquet Fittipaldi
& Brazilian drivers GP
Fig. 2. An example information need, with its associated query set, used in the experiment.

Results

The results of the centroid experiment are summarized in Tables 1 and 2. Table 1 gives the average relative recall, precision, F-measure, and space requirement (in millions of entries) measured for each centroid size. Table 2 gives, for each parameter, the percentage of change resulting from each centroid size increment.

Table 1. Absolute averages.

Centroid size       20     40     60     80     100
Recall              7.5%   11%    14%    16%    18%
Precision           26%    30%    32%    34%    35%
F-measure           9.7%   14%    16%    18%    20%
Space (M entries)   1.43   2.49   3.31   3.85   3.90

Table 2. Change percentages per size increment.

Centroid size    20-40   40-60   60-80   80-100
Recall           +48%    +24%    +19%    +12%
Precision        +17%    +5%     +8%     +1.8%
F-measure        +40%    +18%    +15%    +9%
Space            +74%    +33%    +16%    +1.5%

The first important result of the experiment is a consequence of the statistic, mentioned in the previous section, that 99.97% of the sampled Web pages contained at most 100 different content words. This indicates that there is no point in considering a centroid larger than 100 words when designing a keyword-based search engine.

The second important result is that the data empirically confirms the intuition that increasing centroid size brings a sizable increase in average recall without hindering average precision. On the contrary, it also brings a simultaneous increase in average precision (albeit smaller than that of recall). Therefore, augmenting centroid size up to 100 words can only improve the average effectiveness of a search engine.

Comparing the change percentages in average F-measure and space requirements at the bottom of Table 2 brings further insights. They suggest that average performance gains follow a law of diminishing returns in terms of centroid size. However, because space requirements simultaneously follow a law of diminishing overhead, there are no clear empirical reasons to stop anywhere between centroid sizes of 20 and 100. In particular, the fact that increasing centroid size from 80 to 100 comes virtually for free and yet allows a 12% gain in average recall suggests that centroid size 80 be excluded from consideration.

Overall, reducing centroid size from 100 down to 20 saves 63% of the index space, but at the cost of losing 51% of average retrieval performance. This clearly shows that varying centroid size does not allow one to escape the space requirement problem of Web search engines, and that more powerful techniques, such as NLP or distributed index bases [Gonçalves-Meira-Salgado 97], should be considered.
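
These figures follow from Table 1, taking the average F-measure as the measure of overall retrieval performance and space in millions of index entries (the rounding is approximate):

    \[
      \frac{3.90 - 1.43}{3.90} \approx 0.63 \ \text{(space saved)}
      \qquad
      \frac{0.20 - 0.097}{0.20} \approx 0.51 \ \text{(average F-measure lost)}
    \]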

Conclusion and future work

The contribution of this paper is twofold:

  1. A general evaluation methodology to quantify the contribution of a given design option to the retrieval effectiveness and computational cost of IR systems in very large document collections; and
  2. Empirical data derived by applying this methodology to the issue of varying centroid size to trade off retrieval performance against index base space requirements in keyword-based Web search engines.

The data suggests that, unless space resources are really scarce, a centroid of 100 is the best choice.

In future work, we intend to reuse the same evaluation methodology to measure the contributions of various IR and NLP techniques to retrieval effectiveness, as well as their associated computational costs.

References

[Gonçalves-Meira-Salgado 97] Gonçalves, P.F., Meira, S.L.R. and Salgado, A. C. A Distributed Mobile-Code-Based Architecture for Information Indexing, Searching and Retrieval in the World-Wide Web. Proceedings of the 7th Annual Conference of the Internet Society (INET'97), Kuala Lumpur, Malaysia, June 1997. www.di.ufpe.br/~bright/inet97.html.

[Gonçalves-Robin 98a] Gonçalves, P.F. and Robin J. Measuring the Contribution of Inflectional Morphology to Web Search Engine Precision and Recall. In preparation.

[Gonçalves-Robin 98b] Gonçalves, P.F. and Robin, J. Measuring the Contribution of Thesaurus-Based Query Expansion to Web Search Engine Precision and Recall. In preparation.

[MUC] The Message Understanding Conference Scoring Software User's Manual. www.dcs.shef.ac.uk/~martins/MUC_scorer3.2/manual.html.

[Paijmans 94] Paijmans, J.J. Relative weights of words in documents. Proceedings of the STINFON Conference, pp. 195-308, Noordman, L.G.M. and De Vroomen, W.A.M. (Eds.).

[Tague-Sutcliffe 92] Tague-Sutcliffe, J. The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28, 467-490, Elsevier, Oxford, UK.

[Van Rijsbergen 79] Van Rijsbergen, K. Information Retrieval, (2nd Ed.) Butterworths, London. www.dcs.gla.ac.uk/Keith/Preface.html.
