Internet Information Retrieval: The Further Development of Meta-Search Engine Technology
Wolfgang SANDER-BEUERMANN <firstname.lastname@example.org>
This paper first describes the state of the art of meta-search technology. It defines criteria for evaluating such applications and investigates existing meta-search engines. Secondly, it outlines the approaches undertaken at Hannover University to solve some problems of Internet information retrieval. We have been running high-traffic meta-search engines for nearly two years (http://mesa.rrzn.uni-hannover.de/ and http://meta.rrzn.uni-hannover.de), and we describe our experiences and the developments we have made to gain a higher degree of completeness and quality of Internet information retrieval.
One of the main ideas we are following is the combination of the overwhelming mass of Internet data with manually reviewed information sources and our own ranking algorithms. Combining these leads toward high-quality results based on an Internet search that is as complete as possible.
The Internet is the richest source of information humanity has ever developed. Finding a particular piece of information on the Internet, however, is still a major problem -- retrieval of high-quality information is even more difficult.
The difficulties of Internet information retrieval consist of two main challenges:
The "quality" of the results is judged by the user only: if the user is satisfied with the results of the search, we call them "good quality." That means that the designer of any search engine interface is forced to understand the way the users think.
Like any database retrieval, the Internet search should be complete. The word "complete," however, in an Internet sense is difficult to define -- we can never really search "the whole Internet." What we can do is search parts of the Internet nearly completely: We will mainly consider the World Wide Web (WWW) as the information source. If we confine ourselves to that part only, one might ask the following question.
1.2 Could one search engine solve the problem?
To answer that question, we must have an estimate of the total amount of data on the WWW, and the maximum amount one single search engine can index. We start with an estimate of the total amount of data of the World Wide Web.
According to http://www.nw.com/zone/WWW-9707/firstnames.html, there were 754,716 Internet hosts with a name starting with www in July 1997. So we are on the safe side in assuming that there are at least approximately 750,000 WWW servers on the Internet. If we now make a rough guess of the average amount of data on a Web server, we obtain the order of magnitude of the complete Web data. We analyzed a large number of servers and found approximately 10 MByte of data per server. So there might have been about 7,500 GByte of Web data in July 1997. Extrapolation to January 1998 (when this paper was written) leads us to an order of magnitude of about 10,000 GByte of Web data. We consider this a conservative estimate, because the number of servers is definitely higher than the number of hosts whose name starts with www, and the average amount of data on a Web server is probably higher than our guess of 10 MByte.
On the other hand, we have to consider how much data can be indexed by the most advanced search engine technology. If we take Altavista (http://www.altavista.digital.com/) as an example of such technology, we might draw the following conclusions:
According to Altavista's own statements, their search engine is indexing 100 million Web pages. The indexable part of a Web page which we consider nowadays is of course just the textual part. So we must have an estimate of the average text content of a Web page. Analyzing our proxy-caches at the University of Hannover, we found that the average Web page contains about 4 KByte of text. So Altavista is indexing something in the order of 400 GByte. Estimating the portion of the total Web text data then demands knowledge about the ratio of text to non-text (binary) data of the average Web page. That ratio is probably the most difficult part to guess. Analyzing our Hannover University proxy-caches shows that text makes up between 10% and 50% of the data of a Web page.
With all these estimates in mind, we can calculate the indexed portion from:
(number_of_pages_indexed * text_data_per_page) / (total_web_data * text_to_binary_relation)
Executing this calculation with the above values results in Altavista indexing between 8% and 40% of the total amount of text in the Web. Although the above calculations are only rough estimates, this fits together well if we compare the result with those from querying different search engines.
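The arithmetic behind the 8% to 40% range can be retraced with a few lines of Python. This is only a sketch of the estimate above: the input figures are the ones quoted in the text, the function and parameter names are our own, and decimal units are assumed, which is sufficient for an order-of-magnitude estimate.

```python
# Rough estimate of the share of Web text indexed by one search engine.
# All input figures are the estimates quoted in the text; names are illustrative.

KBYTE_PER_GBYTE = 1_000_000  # decimal units (assumption), fine for orders of magnitude


def indexed_portion(pages_indexed, text_kbyte_per_page, total_web_gbyte, text_share):
    """Return the fraction of all Web text that is covered by the index."""
    indexed_text_gbyte = pages_indexed * text_kbyte_per_page / KBYTE_PER_GBYTE
    total_text_gbyte = total_web_gbyte * text_share
    return indexed_text_gbyte / total_text_gbyte


# 100 million pages, 4 KByte of text per page, 10,000 GByte of Web data in total
low = indexed_portion(100_000_000, 4, 10_000, 0.50)   # text share at its upper bound
high = indexed_portion(100_000_000, 4, 10_000, 0.10)  # text share at its lower bound
print(f"indexed portion: {low:.0%} to {high:.0%}")
```

The width of the resulting range comes entirely from the uncertainty of the text share, which is the least reliable of the four inputs.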
Meta-search engine efficiency
We can calculate a meta-search engine efficiency by comparing the results found by the meta-service to those found by that search engine which delivers most results (the "best" search engine). If we define:
an efficiency might be evaluated by:
eff = (allHits - duplicates)/bestHits or
eff = 1 + (allHits - bestHits)/bestHits - duplicates/bestHits
If we look at the duplicates in detail, we encounter at least two difficulties: first, we have to set up a good algorithm for duplicate recognition, and second, duplicates might be produced by the search engines themselves (because their duplicate recognition algorithm might not be as good as that of the meta-service). If we look at the rate of duplicates in practice, however, we find that it is in the order of just 10 to 30 percent. For a first estimate we are just interested in orders of magnitude and factors, so we might neglect the duplicates.
No matter how we calculate the efficiency in detail -- realistic searches lead to values in the range of two to five, meaning that a meta-search engine will deliver two to five times more results than the best single search engine.
Therefore the question "Could one search engine solve the problem?" can nowadays be answered by a clear no. If even an automated search engine like Altavista is not able to index more than 40% of the Web, it is all but certain that manually maintained databases like Yahoo (http://www.yahoo.com/) will never be able to cover significant parts. Even the problem of keeping the data up to date is not solvable: investigation of our proxy-cache data shows that within half a year about half of the Web addresses are outdated.
If one search engine cannot solve the problem of Internet information retrieval, we obviously have to query several engines. If we do this in an automated way, we call the resulting system a meta-search engine.
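A small sketch may make the two forms of the efficiency formula concrete. It assumes the following reading of the symbols: allHits is the total number of hits the meta-service collects from all engines, bestHits is the number of hits of the engine that delivers most, and duplicates is the number of hits that occur more than once. The hit counts in the example are invented for illustration.

```python
def efficiency(all_hits, best_hits, duplicates=0):
    """eff = (allHits - duplicates) / bestHits"""
    if best_hits <= 0:
        raise ValueError("best_hits must be positive")
    return (all_hits - duplicates) / best_hits


def efficiency_expanded(all_hits, best_hits, duplicates=0):
    """eff = 1 + (allHits - bestHits)/bestHits - duplicates/bestHits"""
    if best_hits <= 0:
        raise ValueError("best_hits must be positive")
    return 1 + (all_hits - best_hits) / best_hits - duplicates / best_hits


# Hypothetical query: three engines return 40, 30 and 30 hits, 20 of them duplicates.
hits_per_engine = [40, 30, 30]
all_hits = sum(hits_per_engine)
best_hits = max(hits_per_engine)

print(efficiency(all_hits, best_hits, duplicates=20))
print(efficiency_expanded(all_hits, best_hits, duplicates=20))
print(efficiency(all_hits, best_hits))  # duplicates neglected, as in a first estimate
```

The expanded form shows the gain over the best engine (the second term) and the loss through duplicates (the third term) separately; both forms give the same value.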
Before we discuss meta-search engines we should have a common understanding of the terminology used.
As the last example shows, these definitions are not yet in common use. Often the catalogues (or directories) like Yahoo are called search engines, too. Although it is misleading, the usage of that terminology is already widespread, so we will not make an effort to reverse this. To make things even more confusing, there are search services which use a search engine combined with a directory, like Lycos (http://www.lycos.com/).
We will not go further into this discussion; we will now focus on meta-search engines only.
First we will look at the client-based meta-search engines. These suffer from two shortcomings:
The last-mile problem addresses the fact that the "last mile" of the Internet connection from the provider to the user is the part with the lowest bandwidth. On the other hand, every meta-search creates high downstream dataflows from the search engines. About 50% or more of this data is just thrown away by the meta-search postprocessing (due to removing multiple hits from different engines, and due to removing "useless information," like advertisements and other data not related to the search itself).
The update problem results from the fact that the search engine maintainers tend to change their output format rather often. With every change of that format, the postprocessing software of the meta-search engine needs to be updated. From our experience, this happens at least once a month. So an update has to be made every month. Because this is impractical for the end-user, we feel that client-based meta-search engines will play no major role in Internet information retrieval, and we will not consider them here anymore.
From the user's point of view, the server-based meta-search engine looks just like any other search engine. Before we list the existing meta-search engines, we will discuss some criteria to distinguish and rank these.
We can now list the existing meta-search engines and evaluate them by these seven criteria:
Only those services which have a "yes" in each column fulfill the criteria of being a real meta-search engine. As we can see, at the moment (Jan. 1998) there are just two of those: Highway61 and our MetaGer.
If we examine the boolean operators of each meta-search engine precisely, we can see that many of them do not perform a consistent AND; they sometimes switch to OR (to be precise, some of the underlying search engines do that if they cannot find anything matching the AND search, and the meta-service does not filter that out). This happens without any warning or notification to the user. Such behavior might be acceptable for a search engine, to give the user at least some results. But it is unacceptable for a meta-search engine, because these results might then be mixed up with true AND hits. Even Highway61 shows this inadequate behavior, so that presently just one meta-search engine remains which fulfills all criteria in a strict manner. We will describe MetaGer (http://meta.rrzn.uni-hannover.de/) in the following section.
The idea of building our own meta-search engine was born while we were having lunch one day at the CeBIT fair in 1996. We had the first prototype of our engine running some months later. It gathered results from AltaVista, Infoseek, Lycos, Yahoo and others. When we were ready to present our engine to the outside world, we learned that Erik Selberg and Oren Etzioni of the Computer Science Department at the University of Washington had already launched the MetaCrawler, a similar device, several months earlier. We felt that there was no need to offer a similar service twice. However, at the same time, the idea of Internet searching became more and more a topic of interest to people in Germany. As a consequence, we concentrated our efforts on providing a meta-search for German services, which were not served by the MetaCrawler.
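One conceivable way for a meta-service to keep OR fallbacks apart from true AND hits is to check each hit against the search words before merging. The following sketch is only an illustration of that idea and not a description of MetaGer's filter: the field names are assumptions, and because a meta-service usually sees only title, description and URL, a hit whose words appear solely in the page body cannot be confirmed this way.

```python
def matches_all_terms(hit, terms):
    """True if every search word occurs in the title, description or URL of the hit."""
    text = " ".join(
        (hit.get("title", ""), hit.get("description", ""), hit.get("url", ""))
    ).lower()
    return all(term.lower() in text for term in terms)


def split_and_hits(hits, terms):
    """Separate hits that visibly satisfy the AND query from those that do not."""
    confirmed, unconfirmed = [], []
    for hit in hits:
        (confirmed if matches_all_terms(hit, terms) else unconfirmed).append(hit)
    return confirmed, unconfirmed


# Invented example data
hits = [
    {"title": "Meta search engines", "description": "", "url": "http://a.example/"},
    {"title": "Search tips", "description": "", "url": "http://b.example/"},
]
confirmed, unconfirmed = split_and_hits(hits, ["meta", "search"])
print(len(confirmed), len(unconfirmed))
```

Unconfirmed hits need not be discarded; they could be listed separately or marked, so that the user is at least notified.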
Another problem came to our attention in spring 1997: finding people's e-mail addresses seemed to be a common problem. We solved that by taking the framework of MetaGer and implementing MESA, the MetaEmail SearchAgent for international meta-search.
The contact with our users right from the beginning was very important to us: We tried to learn what their needs were, and we tried to incorporate their ideas into our search engine. A link checker was implemented, and we added the option for an international search by querying Highway61. After several months of usage we carefully analyzed our users' interests and demands and responded by inventing the so-called "QuickTips" (see 4.1).
The software primarily runs on a Unix machine (ReliantUNIX Version 5.43) sponsored by Siemens-Nixdorf, an RM600 with two R4400 CPUs, 512 MB RAM and 100 GB of disk space on a 34-Mbps ATM network interface. When implementing the software, we tried to use the programming language best suited for each task: we are using C, Perl, awk, sed, Tcl/Tk and Bourne shell. At the time of writing this paper, our RM600 machine runs at its maximum capacity, and we are presently implementing a load distribution, which automatically transfers user queries during high-load periods to background Unix machines (SunUltra, Solaris 2.5.1).
The principle of a meta-search engine can be described by seven steps:
Some of the most common problems of the above steps are described in the following section.
One of the main technical problems of running a meta-search engine is that the search engine maintainers change their output formats. If the format of that data changes, the postprocessing software has to be adapted. When we realized that this happens quite often, we developed an administration tool just for this purpose. Additionally, we require the postprocessing software to be robust: even if the format of the results has changed -- and that may happen at any time -- the meta-search output must still be well formatted.
Converting the query into the correct syntax for every underlying search service reveals another problem. Each search service uses a different query language and, even more important, each service offers different options. If a meta-search engine wants to be transparent (i.e., does a true search engine hiding), it can only offer options that each of the underlying services offers (e.g., not every service gives us the ability to perform a string search). Furthermore, it is sometimes difficult to collect the HTML form parameters that are necessary to get the results (e.g., hidden parameters with undetermined values). But even these efforts are sometimes not successful, because the service expects the HTTP request for the results to carry exactly the information that a browser such as Netscape Navigator sends. The tool webtee (http://www-cache.dfn.de/Cache/Software_webtee.html) is well suited for analyzing such situations.
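The robustness requirement can be illustrated with a parser that tries an engine-specific pattern first and falls back to generic link extraction when that pattern no longer matches. This is a minimal sketch under our own assumptions and not the postprocessing code of MetaGer; the result markup and the pattern are invented.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, anchor text) pairs from arbitrary HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_results(html, pattern):
    """Return (hits, mode); mode tells whether the engine-specific pattern still works."""
    hits = [(m.group("url"), m.group("title")) for m in pattern.finditer(html)]
    if hits:
        return hits, "engine-specific"
    collector = LinkCollector()
    collector.feed(html)
    fallback = [(url, text) for url, text in collector.links if url.startswith("http")]
    return fallback, "fallback"


# Hypothetical result format of one engine, before and after a format change
PATTERN = re.compile(r'<li><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></li>')
old_page = '<li><a href="http://example.org/">Example</a></li>'
new_page = '<div class="r"><a href="http://example.org/"><b>Example</b></a></div>'

print(parse_results(old_page, PATTERN))
print(parse_results(new_page, PATTERN))
```

The returned mode could be logged, so that a maintainer learns that a service has changed its format while the users still receive well-formed output.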
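The query conversion can be pictured as a small table-driven translation step. In the sketch below everything service-specific is hypothetical: the service names, URLs, parameter names and the two AND notations are invented to show the principle and do not describe any real search service.

```python
from urllib.parse import urlencode

# Hypothetical service descriptions: every name, URL and parameter here is invented.
SERVICES = {
    "engine_a": {
        "url": "http://engine-a.example/search",
        "param": "q",
        "and_style": "plus",           # AND is written as +word1 +word2
        "extra": {"hits": "20"},
    },
    "engine_b": {
        "url": "http://engine-b.example/find",
        "param": "query",
        "and_style": "keyword",        # AND is written as word1 AND word2
        "extra": {"mode": "all", "session": ""},  # e.g. hidden form parameters
    },
}


def build_query(words, and_style):
    """Express an AND search over the given words in the notation of one service."""
    if and_style == "plus":
        return " ".join("+" + word for word in words)
    if and_style == "keyword":
        return " AND ".join(words)
    raise ValueError(f"unknown AND style: {and_style}")


def build_request_url(service_name, words):
    """Build the complete request URL for one service, including fixed parameters."""
    service = SERVICES[service_name]
    params = {service["param"]: build_query(words, service["and_style"])}
    params.update(service["extra"])
    return service["url"] + "?" + urlencode(params)


for name in SERVICES:
    print(build_request_url(name, ["meta", "search"]))
```

Keeping the service descriptions in data rather than in code means that a changed parameter name only requires an edit of the table.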
Waiting for the results from the underlying services is another topic worth looking at. How can we keep a user waiting? For MetaGer, we specified a default maximum search time of 40 seconds. During this time we regularly give out information on how much time remains and why the user has to wait. Users are more willing to wait if they know how long and why.
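Querying the services in parallel under a global time limit, while telling the user how long the wait will last, might look like the following sketch. It is an illustration in Python and not the implementation used for MetaGer; the names `fetchers`, `tick` and `report` are our own, and each fetcher stands for a function that queries one service and returns a list of hits.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def meta_search(fetchers, query, max_seconds=40.0, tick=1.0, report=print):
    """Query all services in parallel and stop collecting at the deadline.

    fetchers maps a service name to a function taking the query string.
    Services that fail or miss the deadline simply contribute no hits.
    """
    deadline = time.monotonic() + max_seconds
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    try:
        pending = {pool.submit(fetch, query): name for name, fetch in fetchers.items()}
        while pending:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            done, _ = wait(list(pending), timeout=min(tick, remaining),
                           return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                try:
                    results[name] = future.result()
                except Exception:
                    results[name] = []  # a failing service must not break the search
            if pending:
                left = max(0.0, deadline - time.monotonic())
                waiting_for = ", ".join(sorted(pending.values()))
                report(f"about {left:.0f}s left, still waiting for: {waiting_for}")
    finally:
        pool.shutdown(wait=False)  # do not block on services that are still running
    return results
```

The default of 40 seconds mirrors the maximum search time mentioned above; the `report` callback is where the remaining time and the reason for waiting would be written to the result page.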
The most serious problem for any meta-search engine, however, is an economic problem: All results presented are drawn from the resources of the search engine maintainers. These companies earn money by renting space for advertising on the search engine webpages. The meta-search engines cut these ads out, and give the pure information to the user. So it is understandable that a search engine maintainer might not be too pleased with being queried by a meta-search engine.
After launching MetaGer, we found that the reaction was the opposite: maintainers asked us to add their service to MetaGer. This might be because the German search services were pretty new at that time and expected some advertising effect from appearing in a service run by an independent university organization. A few services (e.g., the e-mail search service Four11 and the German catalog web.de) solve the advertisement problem in their own way (which is good for them, not for us): they do not include the original Internet addresses in the result pages but offer a link to another address which will reveal the correct address. This of course makes it impossible for a meta-search engine to combine these results but enables the service maintainer to show an advertisement under any circumstances.
The counting procedures for Web sites are another problem. Some of the services queried by MetaGer are using the counting service of the German IVW, which is a member of IFABC (International Federation of Audit Bureaux of Circulations). IVW is an independent organization which provides measured numbers of PageImpressions (PageViews) and Visits. It gives the customer who places an advertisement on a webserver a certain guarantee that it will be seen by the measured number of clients. The IVW counting relies on the download of a small image. So we agreed to download
This procedure now gives the benefits to both sides: The advertisement on the search engines is seen by all users of the meta-search engine too, and this view is counted by the measurement of IVW. We feel that this is a well balanced compromise.
Any search engine will be outdated, if not continuously improved and developed, just as the Internet as a whole is continuously developing.
One of the mainstream ideas we are following is the combination of the overwhelming mass of Internet data and manually reviewed information sources. We decided to rely on two sources of such information:
If someone does a query at our meta-search engine, we check these two sources. We decided to incorporate the DNS after analysis of our logfiles: a lot of inexperienced users are searching for terms which can easily be found in the DNS. This holds especially for the queries concerning companies: most companies have a webserver named www.Company.com etc. For queries of two or more words we look for combinations of the search words, like www.word1-word2.com etc. To increase the speed of response, these DNS lookups run in parallel with the meta-search. From our users' feedback we can conclude that about 75% of them are really happy with the so-called QuickTip search. If a DNS lookup leads to a useless entry (e.g., someone has reserved a name without using it), we can exclude such entries by means of a manually maintained stoplist file.
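The DNS part of the QuickTips can be sketched as follows: build candidate host names from the search words, drop those on the stoplist and keep those that resolve. The sketch rests on several assumptions of ours: the list of top-level domains, the concatenated form as a second word combination, and the function names are illustrative, and the lookups run sequentially here rather than in parallel with the meta-search.

```python
import socket

TLDS = ("com", "de")  # assumption: the text only says "www.Company.com etc."


def candidate_hosts(words, tlds=TLDS):
    """Build host names such as www.word1-word2.com from the search words."""
    cleaned = [w.lower() for w in words if w.isascii() and w.isalnum()]
    if not cleaned:
        return []
    stems = ["-".join(cleaned)]
    if len(cleaned) > 1:
        stems.append("".join(cleaned))  # assumption: concatenation as a further combination
    return [f"www.{stem}.{tld}" for stem in stems for tld in tlds]


def resolves(host):
    """True if the host name can be resolved in the DNS."""
    try:
        socket.gethostbyname(host)
        return True
    except (OSError, UnicodeError):
        return False


def quick_tips(words, stoplist=frozenset(), resolver=resolves):
    """Return candidate hosts that resolve and are not on the manual stoplist."""
    return [host for host in candidate_hosts(words)
            if host not in stoplist and resolver(host)]


print(candidate_hosts(["meta", "search"]))
```

Passing the resolver as a parameter keeps the candidate generation testable without network access; a call such as `quick_tips(["siemens"])` would perform real DNS lookups.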
Our own local database however relies on a different strategy. We know that we do not have the manpower to maintain a catalog like Yahoo. So we decided to put only those entries into our database which have been searched for with a certain frequency. On the other hand we found from our logfiles that even frequently searched words have a very low share (about 0.5%). What we can do, however, is to react to current events. These events show up as queries in our service, like the landing of Pathfinder on Mars or a heavy snowfall in Germany. When we realize we have queries related to such phenomena, we put entries into our database which lead to corresponding webpages.
The QuickTips are mainly a help for the inexperienced user. On the other hand, we have users who make really sophisticated queries. After checking our log files, we estimate that the portion of such users is just in the range of a few percent. Even so, these "power users" are our multipliers: If they spread knowledge of our service being "good," then their word counts and brings us many new users. For the experienced user we proceed as described in the following section.
Experienced users often do searches in their fields of specialty only. They know the terminology in these fields -- much better than we do. Therefore, it seems adequate to give these users the means to build their own special purpose services, dedicated solely to them and their working-group. For example, let us consider a working-group which does research on VRML and related techniques. This group has much experience in doing Internet searches. Their problem with information retrieval is that they are usually overwhelmed with the data found, and that they now have the cumbersome job of finding those pieces of information which they are really interested in. If this group had a special purpose search engine that looked for VRML and related topics only, they would have a valuable tool for their work.
This is exactly the project we are working on: we are in the process of building a tool which automatically generates special purpose search engines following input keywords given by the users. We are aware that such a capability may result in really heavy network load. Therefore, this tool will never be open to the general public. Every single user of this technique has to have a validation from us, and we will be very cautious with such validations. The project is sponsored by the Verein zur Förderung eines Deutschen Forschungsnetzes e.V. - DFN-Verein within the DFN-Expo Project. When we call a meta-search engine a second-order engine (the normal search engines are first-order engines in that terminology), we might call this type of search engine a third-order engine. We therefore named it "Level3."
Some new search engines are now based on the download of a Java applet, like
The first engine is supposed to be an international meta-search engine, but we found that service to be out of order most of the time. So we stopped considering it for our work. The latter is a German service, querying German sources only. That service started with a pure Java interface. But after realizing that many users do not have a Java capable browser, or that they have switched it off, the Java applet is now offered only as an option. Does the use of Java offer any advantages for the search engines?
We answer this question with a clear no. The idea of downloading a Java applet is similar to the idea of the client-based meta-search engines: the local system should carry the workload. But this does not help here: the search engines have to do most of their work by extracting data from their database, which must be done on the server. The meta-search engines have to do most of their work by extracting data from the search engines over the net, and the Java applet security model forbids the client applet to connect to any server other than the one it originated from. Only the postprocessing part (duplicate filtering, ranking etc.) could be done by the client. We would, in fact, avoid the manual update problem (discussed in 2.2 for client-based meta-searchers) because every download of the Java applet would load the latest version. But first, the postprocessing is the smallest part of the total workload; second, we have the same last-mile problem which we experienced with the client-based meta-searchers; and third, the download of long Java applets is time-consuming.
Although these two topics seem to be pretty far apart at first glance, they are the most important ones from the user's point of view.
During the operation of our meta-search engines we learned that the design of user interfaces is a continuous process which never ends. Both sides (the user and the maintainer) learn over time. A year ago, about 75% of our users queried with a single search word. This is often not suitable, because a single word cannot sufficiently describe the problem in many cases. Now just 50% ask with a single word; the other half uses two or more words to describe their search.
We have also found that about 95% of the users do not change any of the default options. We did not expect this to happen. Therefore, in the beginning, we created numerous options, allowing the user to choose his optimum environment. When we saw the users were not doing this, we had to change our defaults so that they fit most of the queries. An experienced person can often tell what the user wants to know just by looking at the user's query. A really good user interface should react to the user's question in natural language. An optimum interface should lead "by itself" (i.e., by creating a dialog with the user) to the desired results. Presently we are in the process of negotiating an offer to incorporate such software.
The quality of the results we deliver is so far achieved by two means. First, there are the QuickTips described above, which are under our direct control. Second, we do not rely on the ranking of the underlying search engines alone but combine it with our own ranking. Our ranking is based on word counts within the title, URL and description of the hits. Our ranking algorithms combine these numbers, and we present the results in the order of their ranking numbers within five categories, marked by different colors. Especially the use of colors to distinguish the quality (the more red, the "hotter"/better the quality) was well accepted by our users.
All this leads us to the statement: Only the combination of the two factors completeness and quality will result in the Internet information the user is really searching for.
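A ranking based on word counts in title, URL and description, mapped onto five categories, could be sketched as below. The weights, the mapping of scores to categories and the field names are assumptions chosen for illustration; the text does not specify how the actual ranking algorithms combine the counts, and the combination with the rankings of the underlying engines is left out.

```python
# Assumed weights: a word in the title counts more than one in the description.
WEIGHTS = {"title": 3, "url": 2, "description": 1}


def score(hit, terms):
    """Weighted number of occurrences of the search words in title, URL and description."""
    total = 0
    for field, weight in WEIGHTS.items():
        text = hit.get(field, "").lower()
        total += weight * sum(text.count(term.lower()) for term in terms)
    return total


def categorize(hits, terms, categories=5):
    """Sort hits by score and assign category 1 (best, "hottest") to `categories` (weakest)."""
    scored = sorted(((score(hit, terms), hit) for hit in hits),
                    key=lambda pair: pair[0], reverse=True)
    top = scored[0][0] if scored and scored[0][0] > 0 else 1
    ranked = []
    for value, hit in scored:
        step = min(categories - 1, int(value / top * categories))
        ranked.append({"category": categories - step, "score": value, "hit": hit})
    return ranked


# Invented example data
hits = [
    {"title": "Cooking", "url": "http://example.org/food", "description": "meta"},
    {"title": "Meta search engines", "url": "http://example.org/meta",
     "description": "about meta search"},
    {"title": "Nothing", "url": "http://example.org/x", "description": ""},
]
for entry in categorize(hits, ["meta", "search"]):
    print(entry["category"], entry["score"], entry["hit"]["title"])
```

Each category would then be rendered in its own color, with category 1 shown in the "hottest" red.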
The authors would like to thank: