Estimating Web Properties by Using Search Engines and Random Crawlers

Nobuko KISHI <kishi@tsuda.ac.jp>
Tsuda College
Japan

Takahiro OHMORI <ohmori@rsch.tuis.ac.jp>
Seiji SASAZUKA <sasazuka@rsch.tuis.ac.jp>
Tokyo University of Information Science
Japan

Akiko KONDO <m99kondo@tsuda.ac.jp>
Tsuda College
Japan

Masahiro MIZUTANI <mizutani@rsch.tuis.ac.jp>
Tokyo University of Information Science
Japan

Takahide OGAWA <ogawa@tsuda.ac.jp>
Tsuda College
Japan

Abstract

The rapid growth of the Web has made it impossible to learn various properties of the entire Web directly. Thus, we need to use statistical methods to estimate the properties of the Web. Lawrence and Giles proposed two different methods for estimating the number of Web pages [1,2]. The first method used search engines as a random sampling method. The second method used random sampling of Internet Protocol (IP) addresses and Web servers.

We have applied these two methods to Japanese Web pages in the JP domain to see if they can be applied to a subset of the Web. The first method gave an estimate of 88 million pages as a lower bound on the size of the Japanese indexable Web; the second method gave an estimate of 17 million pages. These results mean that the two methods do not measure exactly the same set of Web pages. They also suggest that Japanese search engines cover only a part of the Japanese indexable Web, and that the Web pages in the JP domain are less connected than the Web as a whole.


1. Introduction

The rapid growth of the Web has made it impossible to learn various properties of the entire Web directly. Thus we need to use statistical methods to estimate the properties of the Web, such as its size (the number of Web pages), its amount of data (the number of bytes in Web pages), the number of links, the coverage of search engines, the proportion of various data types, and so on.

Lawrence and Giles [1,2] proposed two different methods for estimating the number of Web pages. The first method used search engines as a random sampling method. It gave an estimate of 320 million pages as a lower bound on the size of the indexable Web in December 1997 [1]. The second method used random sampling of IP addresses and Web servers and gave an estimate of 800 million pages as the size of the Web in February 1999 [2].

The purpose of this study is to see if these two methods can be applied to a subset of the Web. As the diversity of Web users grows, we need new approaches to measure the properties of subsets of the Web written in various languages and hosted in many countries. We have applied the two methods to Japanese Web pages in the JP domain. The first method gave an estimate of 88 million pages as a lower bound on the size of the Japanese indexable Web; the second method gave about 17 million pages.

These results suggest the following: the two methods do not measure exactly the same set of Web pages; Japanese search engines cover only a part of the Japanese indexable Web; and Web pages in the JP domain are less connected than the Web as a whole.

In section 2, we explain the first method proposed by Lawrence and Giles, which uses search engines' coverage to estimate the size of the indexable Web. We then describe how we adapted this method for Japanese Web pages in the JP domain. After giving an estimate of the number of Japanese indexable Web pages, we describe several statistics, such as the distribution of Japanese Web pages inside and outside of the JP domain. In section 3, we explain the second method, also proposed by Lawrence and Giles, which uses randomly selected IP addresses and gives estimates of the numbers of Web servers and Web pages. In section 4, we discuss possible reasons for the difference between the two estimates.

2. Estimates by search engine coverage

2.1. Method outline of estimates by search engine coverage

Lawrence's first method estimates the number of Web pages that contain text indexable by search engines. These pages need to be publicly accessible and not restricted by authentication or robot exclusion rules.

Let U denote the set of indexable Web pages, and let the number of indexable Web pages be the size of U, |U|. Let A be the set of Web pages that a search engine SA has collected, and B the set of Web pages that a search engine SB has collected. Let Pr(A) and Pr(B) be the probabilities that a page is collected by SA and SB, respectively, and |A| and |B| the numbers of Web pages collected by SA and SB. Then we can define the probability Pr(A) as follows.

Pr(A) = |A| / |U|

|U| = |A| / Pr(A)

If A and B are independent, then Pr(A) can be obtained from Pr(A ∩ B), the probability of the intersection of A and B, i.e., the probability that a page is collected by both SA and SB, as follows.

Pr(A)Pr(B) = Pr(A ∩ B)

Pr(A) = |A ∩ B| / |B|

Thus, |U|, the number of indexable Web pages, can be calculated as follows, if we know |A|, |B|, and |A ∩ B|, the size of the overlap between A and B (figure 1).

|U| = (|B| / |A ∩ B|) |A|


Figure 1: Computing overall size from overlap size


Figure 2: Estimating overlap size by random sampling

However, it is not feasible to know |A ∩ B| unless both search engines make all of their collected Web pages available to the public. Instead, we obtain a random sample A' from A and B' from B by submitting the same query set to the search services SA and SB, and then approximate Pr(A) as follows.

Pr(A) ≈ |A' ∩ B'| / |B'|

Then the size of U can be estimated from |B'|, |A' ∩ B'|, and |A| as follows.

|U| ≈ (|B'| / |A' ∩ B'|) |A|

The size of U can also be estimated by

|U| ≈ (|A'| / |A' ∩ B'|) |B|

We obtain an estimate of |U| by averaging the results of the above two equations.
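As an illustration, the averaging of the two estimators can be sketched in Python. This is not the code used in the experiments; the index sizes and URL sets in the usage comment are hypothetical.

```python
def estimate_web_size(a_size, b_size, sample_a, sample_b):
    """Estimate |U| by averaging the two overlap-based estimators.

    a_size, b_size: reported index sizes |A| and |B| of the two engines.
    sample_a, sample_b: sets of URLs (A' and B') returned by the two
    engines for the same query set.
    """
    overlap = sample_a & sample_b  # A' ∩ B'
    if not overlap:
        raise ValueError("no overlap between the samples; cannot estimate")
    # |U| ≈ (|B'| / |A' ∩ B'|) |A|  and, symmetrically,  (|A'| / |A' ∩ B'|) |B|
    est_from_a = len(sample_b) / len(overlap) * a_size
    est_from_b = len(sample_a) / len(overlap) * b_size
    return (est_from_a + est_from_b) / 2
```

For example, with hypothetical index sizes of 35 and 30 million pages and two 40-URL samples sharing 20 URLs, both estimators scale their index size by a factor of two, and the average is 65 million.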

Note that this method's accuracy depends on the following assumptions: the pages collected by SA and SB must be independent, and the samples A' and B' obtained from the query results must be representative random samples of A and B.

2.2. Lawrence's experiment with search engine coverage

Lawrence and Giles [1] analyzed the search results of six major search engines, using queries obtained from the Web access logs of NEC researchers. They chose the 302 queries that met the following conditions:

Lawrence and Giles retrieved all the documents in the search results and checked for the presence of the query terms. They then computed the size of the overlap between the results of the two largest search engines, AltaVista and HotBot, and obtained an estimate of 320 million pages with a 95 percent confidence interval of 34 million.

2.3. Our experiment with search engine coverage

There are several difficulties in duplicating Lawrence's approach to estimate the number of Web pages that contain Japanese text. One difficulty is that we need Japanese terms as queries to obtain Web pages that contain Japanese text. Another is that search engines based in the United States do not cover as many Japanese Web pages as the engines based in Japan. We used Japanese query terms selected from a keyword index of newspaper articles published in the Mainichi Shinbun between 1997 and 1998.

We used four major search engines in Japan, among them Goo, Lycos Japan, and Infoseek Japan.

We found 597 query terms that met the conditions described in section 2.2. We then retrieved all the documents in the search results and checked for the presence of the query terms from 27 to 29 December 1999. Table 1 shows the number of pages found and the overlap sizes among the four services. To estimate the number of Japanese indexable Web pages, we need the numbers of Web pages indexed by the search engines. We used the following numbers: 35 million pages for Goo and 30 million pages for Lycos Japan, which were publicly known at the time of our experiment [3,9]. By computing the size of the overlap between the results of these two largest search engines, we obtained an estimate of 88 million pages with a 95 percent confidence interval of 1 million.

Table 1: Estimated size of the Japanese Web by analysis of the overlap between pairs of search engines

  SA     SB        Pr(A)  Pr(B)  Indexable Web (millions of pages)
  Lycos  Infoseek  0.46   0.37   56
  Goo    Infoseek  0.37   0.34   74
  Goo    Lycos     0.35   0.40   88

This estimate, 88 million pages, is larger than other estimates currently known in Japan. A white paper published by the Ministry of Posts and Telecommunications, Japan, lists an estimate of 29 million pages in 1999 [4]. The index of the largest search engine in Japan is about 40 percent of our estimated size of the Japanese Web. Figure 3 shows the relationship between our estimate and other statistics.


Figure 3: Estimated size of Japanese Web pages

2.4. Additional statistics

Figure 4 shows the relative coverage of search engines in Japan, and the ratio between the pages in and outside of the JP domain. It shows that less than 10 percent of Japanese Web pages are outside of the JP domain.


Figure 4: Relative coverage of search engines in Japan

Figure 5 shows the percentages of invalid URLs and invalid pages in the search results. Invalid URLs are those for which we could not retrieve the corresponding pages. Invalid pages are those we retrieved, but that did not contain the query term used in the search request. This figure suggests that two search engines, Infoseek Japan and Goo, crawl more often than the other two search engines.


Figure 5: Invalid URLs and invalid pages in the search results

3. Estimates by random IP address sampling

3.1. Method outline of estimates by random IP address sampling

There are 256^4, about 4 billion, possible IPv4 addresses. By obtaining a random sample of IP addresses and testing for a Web server at a standard port, we can estimate the number of Web servers at the standard port. Furthermore, we can estimate the number of Web pages if we know the distribution of the number of Web pages among Web servers.
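A minimal sketch of such a probe in Python follows. It is illustrative only: a real survey would also issue an HTTP request and classify the response to exclude non-HTTP services, and would need to be run responsibly against sampled hosts.

```python
import random
import socket

def random_ipv4(rng=random):
    """Draw one address uniformly from the 256**4 possible IPv4 addresses."""
    return ".".join(str(rng.randrange(256)) for _ in range(4))

def has_web_server(addr, timeout=2.0):
    """Test for a Web server at the standard port (80) by attempting a
    TCP connection to the sampled address."""
    try:
        with socket.create_connection((addr, 80), timeout=timeout):
            return True
    except OSError:
        return False
```

The fraction of sampled addresses for which has_web_server returns true, multiplied by the size of the address space, gives the Web-server estimate.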

There are several reasons why this method's estimate differs from the estimate in the previous section.

3.2. Lawrence's experiment with random IP sampling

Lawrence and Giles chose 3.6 million IP addresses at random and tested for a Web server at a standard port. They found a Web server at one in every 269 addresses and estimated the total number of Web servers at 16 million. After excluding Web servers with empty contents, they estimated the total number of public Web servers at 2.8 million. They then observed the number of indexable Web pages on 2,500 Web servers chosen from those found in the above samples. They found the mean number of Web pages per server to be 289 and produced an estimate of 800 million Web pages.

3.3. Our experiment with random IP sampling

Of the 256^4 possible IPv4 addresses, about 28 million are currently managed by JPNIC [7]. From these IP addresses, we chose 28,000 at random. We tested for a Web server at a standard port and got 335 responses, of which 175 were successful. Examining these 175 Web servers, we found that 85 servers hold indexable Web pages. Scaling up, we obtained 85,000 as an estimate of the total number of public Web servers in the IP address space managed by JPNIC. We then retrieved the indexable Web pages from these 85 servers, found the mean number of Web pages per server to be about 200, and obtained an estimate of 17 million Web pages in the IP address space managed by JPNIC.
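The scaling step is simple proportion arithmetic, sketched below. The 28-million figure for the JPNIC-managed address space is taken here as the value consistent with the 28,000-address sample and the 85,000-server estimate.

```python
def scale_up(address_space, sampled, servers_found, mean_pages):
    """Scale sample counts up to the whole address space.

    Multiplying by (address_space / sampled) keeps the arithmetic exact
    when the sample is a round fraction of the space.
    """
    servers = servers_found * (address_space / sampled)
    pages = servers * mean_pages
    return servers, pages

# 28,000 addresses sampled, 85 public servers found, ~200 pages per server:
# scale_up(28_000_000, 28_000, 85, 200) -> (85000.0, 17000000.0)
```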

Although the estimate for the number of Web pages, 17 million, seems too small, we believe the estimate for the number of Web servers in the IP address space managed by JPNIC, 85,000, is reasonable. The Netcraft Web Server Survey reports 70,851 Web servers in the JP domain in May 1999 [5]. The WWW-in-JP Server Survey by Hitachi Seibu Software, Ltd., reports 78,015 Web servers with names of the form www.*.*.jp in November 1999 [6]. We are aware that not all IP addresses allocated by JPNIC are in the JP domain, and that some names in the JP domain are assigned to IP addresses not managed by JPNIC. However, by studying the IP addresses of the host names found in the search results of the previous section, we found that less than 10 percent of the names in the IP address space managed by JPNIC are outside the JP domain, and that less than 5 percent of the names in the JP domain are outside the IP address space managed by JPNIC. Thus, we believe that the number of Web servers in the JP domain is close to the number of Web servers in the IP address space managed by JPNIC.

4. Discussion

In the experiments by Lawrence and Giles, the value estimated by the first method, 320 million, is smaller than the value estimated by the second, 800 million. This result seemingly suggests that the first method might only give an estimate of a subset of the Web pages that the second method can estimate. After all, the first method is based on a sampling of Web pages that can be retrieved by English query terms, while the second method samples Web pages regardless of the relevance of their contents. However, when the first method was applied to the JP domain, it gave an estimate of 88 million, which is much larger than the 17 million estimated by the second method. This difference suggests that the two methods do not measure the same set of Web pages.

One reason for the difference between Lawrence's results and ours might be that Web pages in the JP domain are less linked to each other than the pages in the entire Web observed by Lawrence and Giles. More precisely, pages in the JP domain are more likely to be isolated from the root document of a Web server. An example illustrates this situation. Assume a user, X, has placed his Web pages at a provider, Y. The user X's home page, http://www.Y.ne.jp/~X/, is usually not linked from the root document http://www.Y.ne.jp/, either directly or indirectly. If the user X registers his home page with some directory services, these pages will eventually be crawled by various search engines' crawlers and will become searchable.
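Reachability from the root document can be checked with a breadth-first search over a link graph. The sketch below uses a small hypothetical in-memory graph (mirroring the example above) in place of a real crawler.

```python
from collections import deque

def reachable_from(root, links):
    """Breadth-first search over a link graph {url: [linked urls]},
    returning every URL reachable from the root document."""
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# User X's pages link to each other but are not linked from the root:
site = {
    "http://www.Y.ne.jp/": ["http://www.Y.ne.jp/about.html"],
    "http://www.Y.ne.jp/~X/": ["http://www.Y.ne.jp/~X/diary.html"],
}
# "http://www.Y.ne.jp/~X/" does not appear in
# reachable_from("http://www.Y.ne.jp/", site)
```

A crawler that starts from the root document therefore never finds X's pages, even though they are publicly accessible.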

So far, we have observed that about one-third of the URLs collected in the experiment in section 2.3 cannot be reached by following the links from the root document. To understand the cause of the difference between Lawrence's experiments and ours, we still need to collect more data on back links, i.e., which pages have links to a given page. Eventually, we need models for describing the connectivity of the Web, such as the small-world networks proposed by Watts and Strogatz [8].

5. Conclusions

We have applied Lawrence's two methods to estimate the number of Japanese Web pages in the JP domain. The first method gave an estimate of 88 million pages as a lower bound on the size of the Japanese indexable Web. This method showed that the number of Japanese Web pages is much larger than existing statistics indicate and that Japanese search engines cover only a part of the Japanese Web. The second method gave an estimate of about 17 million pages, suggesting that Web pages in the JP domain are less connected than pages in the overall Web, although we need more research to explain the difference between the two methods.

References

  1. Lawrence, S. and Giles, C.L.: Searching the World Wide Web, Science 280, pp. 98-100 (1998)
  2. Lawrence, S. and Giles, C.L.: Accessibility of information on the Web, Nature 400, pp. 107-109 (1999)
  3. NTT-ME Information Xing, Inc.: News Release, October 5, 1999. http://www.goo.ne.jp/help/n_991005.html
  4. Ministry of Posts and Telecommunications, Japan: Tsuushin Hakusho 1999 (White Paper on Telecommunications). http://www.mpt.go.jp/policyreports/japanese/papers/99wp/99wp-0-index.html
  5. Netcraft: The Netcraft Web Server Survey, May 1999, Japan. http://www.netcraft.co.uk/survey/Reports/9905/bydomain/jp/
  6. Hitachi Seibu Software, Ltd.: The WWW-in-JP Server Survey. http://www.hitachi-ns.co.jp/pub/w3survey/latest/
  7. Japan Network Information Center: IP Addresses, October 31, 1999. http://www.nic.ad.jp/jp/regist/dns/doc/jp-addr-block.html
  8. Watts, D.J. and Strogatz, S.H.: Collective dynamics of "small-world" networks, Nature 393, pp. 440-442 (1998)
  9. Lycos Japan: Press Release, May 17, 1999. http://www.lycos.co.jp/help/info/press06.html