The Effect of Centroid Size on Precision and Recall of Web Search Engines
Pedro GONÇALVES <email@example.com>
We present a controlled experiment that measures the effect, on retrieval effectiveness and efficiency, of varying the maximum keyword number used to represent documents in Web search engines. We define a general methodology to quantitatively evaluate the impact of a given design option on the retrieval effectiveness-efficiency trade-offs and apply it to this particular case of centroid size. The resulting data suggests that in the general case, the most appropriate centroid size is 100.
The exponential growth of the Web makes more crucial every day the development of powerful, user-friendly tools to cope with information overload. The most widely used of such tools, and perhaps the only ones to have made inroads in the everyday life of the general public, are Web search engines.
While these engines may significantly vary in the details of the Information Retrieval (IR) techniques they use, the great majority of them nonetheless share the same basic principles. One such principle is the use of keyword sets to both
The engine then returns a given document for a given query if their respective keyword sets match. Another pervasive principle is the ranking of retrieved documents by decreasing (assumed) relevance. Ranking allows the user to quickly skim search results, instead of exhaustively checking them. It thus improves the engine's practical precision, i.e., the proportion of relevant documents among the highest ranked retrieved ones.
The two principles just mentioned are independent of both
Since keyword-based, ranked approaches are so pervasive in search engines, an almost universal design issue is to define the size of the keyword set, or centroid, used to represent documents in the index base.
At one end of the size spectrum, the centroid can be limited to a few words extracted from the document (e.g., the content words in the document title). At the other end, it may consist of all the document's content words. Nouns, adjectives, and verbs are called content words because they alone convey almost all the information in a text. They are opposed to function words such as articles, pronouns, prepositions, conjunctions, and auxiliaries, whose presence is merely required by syntax. Since these function words contribute virtually no information, IR systems typically include all of them in a stop-list of nondiscriminant words to exclude from centroids.
One knows intuitively that the larger the centroid, the more queries it will match, and thus the higher the engine's recall, i.e., the proportion of relevant documents it manages to retrieve. On the other end, the larger the centroid, the larger the database necessary to index the same number of documents and the more expensive the retrieval. Choosing a centroid size is thus trading off effectiveness for efficiency.
In this paper, we present empirical data on which to base this trade-off. This data was obtained from a controlled experiment measuring the effect of centroid size on
We first present a general methodology to quantify the contribution of a given design option to the retrieval effectiveness and computational cost of IR in very large document collections. We then apply this methodology to the particular case of centroid size and discuss the results obtained. In other papers [Gonçalves-Robin 98a], [Gonçalves-Robin 98b], we are reusing the same methodology to quantify the trade-offs involved with other search engine design options, in particular the incorporation of various shallow NLP techniques.
The present experiment provides insights for both
Our general methodology to quantify the effect on Web search of a given IR system design option consists of the following steps:
T = set of search types, including the control search and the
various test searches
The formulas of Fig.1 are customized to the particular problem of searching very large document collections. They differ from classic IR formulas in two main ways: relative recall and retrieval cut-off.
The size of very large document collections precludes relying on the repeated manual inspection of the entire collection, once for each information need used in the experiment. This turns document not retrieved by any query de facto inaccessible for evaluation purposes. It thus precludes measuring absolute recall, which involves counting the relevant but not retrieved documents. However, for purely comparative studies such as ours, the relative form of recall defined in Fig. 1 can be used instead as suggested by [Tague-Sutcliffe 92]. It measures the proportion of relevant documents retrieved with one given design option among all relevant documents retrieved with any design option.
Using relative recall limits the evaluation to the set of automatically retrieved documents. However, an IR system with decent recall running on a very large document collection may still return too many documents for exhaustive repeated manual examination. This precludes measuring absolute precision.
However, for IR using document ranking, cutting off retrieved documents considered for the evaluation to the Mr best matches of each query yields a good approximation of precision. The precision formula of Fig. 1 assumes the use of such cut-off.
The F-measure is a popular combination of precision and recall into a single parameter. It is used in major evaluation conferences (including the Message Understanding Conferences [MUC]) and has many properties, discussed in detail in [Van Rijsbergen 79], that make it an attractive measure of overall IR effectiveness. These properties include
Since an ideal IR system retrieves only relevant documents and all of them, this intersection is precisely what any IR system attempts to maximize.
We opted to keep equal weight definition for F-measure with relative recall and approximate precision, because while recall values get boosted by its relative definition, precision values also get boosted by the use of retrieval cut-off.
We used the general evaluation methodology just presented to measure the effect of centroid size on IR effectiveness and space requirement. Effectiveness was measured in terms of precision, recall, and F-measure, while space requirement was measured as the total number of entries in the index base.
We ran the experiment using Bright (BRazilian Internet Guide in HyperText, http://www.bright.org.br), our Java/JDBC software integrating a Web indexing robot, a search engine, and an evaluation environment to carry out IR measurements. The current version of Bright has the following characteristics:
The document collection for the experiment consisted of 87,000 Web pages sampled from 4,600 Brazilian Internet servers. It contained 3.9 million occurrences of 253.054 content words, with an average 44 different content words per page. We set Nq to 10 and Mr to 20 and had 5 subjects write a total of 267 queries for 37 information needs. An example of information need (translated from Portuguese) used in the experiment, together with its corresponding queries, is given in Fig. 2. The result evaluation required checking 1,984 pages for relevance.
Starting from the observation that less than 0.03% of the collection
pages contained more than 100 different content words, we ran
the queries for centroid sizes 100, 80, 60, 40 and 20.
Title: Brazilian race drivers
The results of the centroid experiment are summarized in Tables 1 and 2. Table 1 gives the average relative recall, precision, F-measure, and space requirement (in millions of entries) measured for each centroid size. Table 2 gives, for each parameter, the percentage of change resulting from each centroid size increment.
Table 1. Absolute averages.
Table 2. Change percentages per size increment.
The first important result of the experiment is a consequence of the statistic, mentioned in the previous section, that 99.97% of the sampled Web pages contained fewer than 100 content words. This indicates that there is no point in considering a centroid larger than 100 when designing a keyword-based search engine.
The second important result is that the data empirically confirms the intuition that increasing centroid size brings a sizable increase in average recall without hindering average precision. To the contrary it also brings a simultaneous increase in average precision (albeit lower than that of recall). Therefore, augmenting centroid size up to 100 words can only improve the average effectiveness of a search engine.
Comparing the change percentages in average F-measure and space requirements at the bottom of Table 2 brings further insights. They suggest that average performance gains follow a law of diminishing returns in terms of centroid size. However, because space requirements simultaneously follow a law of diminishing overhead, there are no clear empirical reasons to stop anywhere between centroid sizes of 20 and 100. In particular, the fact that increasing centroid size from 80 to 100 comes virtually for free and yet allows a 12% gain in average recall suggests that centroid size 80 be excluded from consideration.
Overall, reducing centroid size from 100 down to 20 allows one to save 63% space, but at the cost of losing 51% of average retrieval performance. This clearly shows that varying centroid size does not allow one to escape the space requirement problem of Web search engines and that more powerful techniques such as NLP or distributed index base [Gonçalves-Meira-Salgado 97] should be considered.
The contribution of this paper is twofold:
The data suggests that, unless space resources are really scarce, a centroid of 100 is the best choice.
In future work, we intend to reuse the same evaluation methodology to measure the contributions of various IR and NLP techniques to retrieval effectiveness, as well as their associated computational costs.
[Gonçalves-Meira-Salgado 97] Gonçalves, P.F., Meira, S.L.R. and Salgado, A. C. A Distributed Mobile-Code-Based Architecture for Information Indexing, Searching and Retrieval in the World-Wide Web. Proceedings of the 7th Annual Conference of the Internet Society (INET'97), Kuala Lumpur, Malaysia, June 1997. www.di.ufpe.br/~bright/inet97.html.
[Gonçalves-Robin 98a] Gonçalves, P.F. and Robin J. Measuring the Contribution of Inflectional Morphology to Web Search Engine Precision and Recall. In preparation.
[Gonçalves-Robin 98b] Gonçalves, P.F. and Robin, J. Measuring the Contribution of Thesaurus-Based Query Expansion to Web Search Engine Precision and Recall. In preparation.
[MUC] The Message Understanding Conference Scoring Software User's Manual.www.dcs.shef.ac.uk/~martins/MUC_scorer3.2/manual.html.
[Paijmans 94] Paijmans, J.J. Relative weights of words in documents. Proceedings of the STINFON Conference, pp-195-308, Noordman, L.G.M and De Vroomen, W.A.M. Eds.
[Tague-Sutcliffe 92] Tague-Sutcliffe, J. The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28, 467-490, Elsevier, Oxford, UK.
[Van Rijsbergen 79] Van Rijsbergen, K. Information Retrieval, (2nd Ed.) Butterworths, London. www.dcs.gla.ac.uk/Keith/Preface.html.