This study compares the statistical patterns of size and connectivity of the global domains (as in ".com" and ".uk") to the geographical distribution of the global population. As the development of Web sites represents the cutting edge of the new global economy, their sizes and contents are likely to reflect the distribution of population and the urban geography of the real world. There is widespread evidence that population and other socio-economic activities at different scales are distributed according to the rank-size rule and that such scaling distributions are associated with systems that have matured or grown to a steady state where their growth rates do not depend upon scale. In this paper, we advance the hypothesis that the growth of Web pages in different domains is not yet stable. This is supported by our analysis that shows that the most mature domains with the most pages follow near rank-size relations but that countries that are much less advanced in their development and use of Internet technologies show size relations which, although scaling, do not conform to rank-size. Our speculation is that as the Web develops, all domains will ultimately follow the same power laws as these technologies mature and adoption becomes more uniform. As yet, we are unable to support our hypothesis with temporal data; but the structure in the cross-sectional data we have collected is consistent with a system that is rapidly changing and has not yet reached its steady state.
Keywords: hyperlink, population, power law, rank-size rule, Web site size.
Rapid deployment of information technologies and the exponential growth of the World Wide Web are beginning to generate a new geography within the wider structure of cyberspace (Batty 1993, Ludwig 1996). Various attempts at measuring and interpreting the structure, size and connectivity of this space have been made but its growth and evolution generate a constant need for new measurements and interpretations (Abraham 1996, Bray 1996, Pirolli et al. 1998, Pitkow 1998, Adamic 1999).
In general, as Web sites clearly form an integral part of social and economic development, their sizes and contents are likely to reflect the distribution of population and the urban geography of the real world (Gorman 1998, Mitchell 1999). Recently, it has been predicted that, despite its apparent arbitrariness, the sizes of Web sites and hyperlinks between them follow known distributions of growth phenomena such as those observed for cities and regions (Albert et al. 1999, Faloutsos et al. 1999, Huberman and Adamic 1999).
We begin by reviewing these recent investigations, and we then extend this to Web sites that are distributed geographically in real space. Through a comparative study of the sizes of global domains and national populations, we argue that the sizes and frequencies of Web sites follow those well-known scaling distributions first catalogued for a variety of different social phenomena by Zipf (1949), and subsequently widely applied to city size, income, word frequency, and firm size distributions.
Although the Internet has only become significant during the past 10 years, it has already attracted a number of researchers who have conducted various investigations and surveys of its distribution and size. A vast amount of statistical resources and numerous theoretical contributions to interpreting the growth of the Internet in general and the Web in particular exist; and among the many studies conducted thus far, four approaches to Web analysis can be identified. We will list these by way of setting the context to our work.
The most obvious yet vital method for grasping the overall impression of the Internet is to collect its statistical information. A number of institutes have attempted to capture the state of the Web through a survey on the number of various Web sites, active servers, users of the Internet and the growth rate of each of these (Gray 1995, Bray 1996, Coffman and Odlyzko 1998, MIDS 1999, OCLC 1999, ISC 2000). However, due to the exponential growth rate of the Internet and its increasingly complex structure, most of these figures inevitably consist of estimated values, or the rough indicators of its scale (ISC 2000).
Most of the services provided by the Internet such as the World Wide Web are of metaphorical content and have no physical entity. Various cartographic and geo-information techniques are being applied to visualize this virtual domain from a variety of perspectives. Some focus on the pattern displayed by search queries (Carriere and Kazman 1999), while others depict the topological connectivity of hyperlinks (Shiode and Dodge 1999). Visualization, if properly applied, can provide persuasive, intuitively comprehensible outputs. However, such approaches are usually self-conclusive and often limit the possibility of further exploration of content.
In contrast to the statistical approach, data mining typically focuses on a single local spot or on a particular point of interest and carries out in-depth analysis to comprehend the exact impacts and effects at lower levels. Examples include local traffic distance measurement (Murnion and Healey 1999) and IP address distribution at the district level of a country (Shiode and Dodge 1998). The only limitation is that while such methods can be applied to a local or specific aspect of the Web, it is practically impossible to maintain the level of detail if the entire Web needs to be searched as we invariably wish.
This final approach aims to understand the Internet by constructing a model of its structure. In particular, there is an extensive collection of studies on its connectivity and topological structure (Abraham 1996, Kleinberg 1997, Wheeler and O'Kelly 1999). Among these studies is the application of a social network concept that reflects the "small world" assumption (Watts and Strogatz 1998). The underlying idea is that for a variety of global network phenomena, all objects or people are connected to one another within a chain of six acquaintances, which is popularly known as the "six degrees of separation." Albert et al. (1999) have applied this concept to measure the degree of connectivity of the Web, predicting that Web pages are separated by an average of "19 clicks." This connectivity measurement is closely linked to the idea of power laws describing networks where "the probability of finding documents with a large number of links is significant, as the network connectivity is dominated by highly connected Web pages" (Albert et al. 1999).
Based on this last approach, we will conduct a rank-size analysis of the global domain based on countries and Web page hyperlinks within and between them. We will then compare these distributions with conventional social and economic indices of the real world; namely, national population and real GDP. First, however, we will explain the basis of the power laws we will use, noting their relationship to rapidly growing systems such as the Web that we seek to model.
Distributions in nature and economy which are composed of a large number of common events and a small number of rarer events often manifest a form of regularity in which the relationship of any event to any other in the distribution scales in a simple way. In essence, such distributions appear to arise through growth processes which may not favor the common or rare events and which involve random additions to the set of events or objects. Typically, the size of an event $P(x)$ scales with some property of the event $x$ in the form $P(x) = K x^{-\alpha}$, where $K$ is a constant and $\alpha$ some parameter of the distribution. Such distributions are scaling in that the size of the event is proportional to the size of the property; that is, if the property grows by $\lambda$, then the size scales as $P(\lambda x) = K (\lambda x)^{-\alpha} = \lambda^{-\alpha} P(x)$. From this it is clear that $P(\lambda x) / P(x) = \lambda^{-\alpha}$, which has a particularly simple form when $\alpha = 1$. These relationships can be formulated either in simple frequency form or in cumulative frequency form, usually as a rank-size type relationship, which is preferred in this case, when the focus is on the rarer or larger events that dominate the distribution.
The best known of these scaling laws is the rank-size rule, which was first popularized by Zipf (1949) for cities, word frequencies, and income distributions. Zipf's Law, as it is called, has the general form $P(r) = K r^{-q}$, where $P(r)$ is the size of the event (in the case of cities, the population), $r$ is its rank in descending order of size so that $P(r) > P(r+1)$, $q$ is some parameter of the distribution, and $K$ is a scaling constant. Sometimes the relation is presented as $P(r)\, r^{q} = K$ for any $r$, which implies some form of steady state consistent with the growth process. The relevance of such simple scaling to city-size distributions has been known for over 100 years. In 1913, Auerbach (quoted in Carroll 1982) proposed that the exponent $q$ was 1, while in 1925 Lotka (again in Carroll 1982) suggested that $q = 0.93$. Zipf (1949) and many others since then (see Krugman 1996) have confirmed this "iron law" of city sizes. The usual way of fitting such distributions to data (which we follow here) is to perform a linear regression of $\log P(r)$ on $\log r$, where the parameters $\log K$ and $-q$ are the intercept and slope, respectively, of the line $\log P(r) = \log K - q \log r$.
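The regression just described can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for the results reported here: the function name `fit_rank_size` and the synthetic input are assumptions, and natural logarithms are assumed throughout.

```python
import math

def fit_rank_size(sizes):
    """Fit log P(r) = log K - q log r by ordinary least squares.

    Returns (log_K, q, r2). The name and signature are hypothetical;
    natural logarithms are an assumption.
    """
    ranked = sorted(sizes, reverse=True)           # rank 1 = largest size
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(p) for p in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                              # slope of the line is -q
    intercept = mean_y - slope * mean_x            # intercept is log K
    r2 = (sxy * sxy) / (sxx * syy)
    return intercept, -slope, r2

# Synthetic data that follow the pure rule P(r) = 1000 / r exactly.
sizes = [1000.0 / r for r in range(1, 51)]
log_K, q, r2 = fit_rank_size(sizes)
```

For data that obey the pure rank-size rule, the fitted `q` should be close to 1 and `log_K` close to the logarithm of the largest size; real data would deviate from both.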
There is considerable debate as to whether the systems and their size distributions modeled with power laws of this form are best represented by such log-linear relations (Okabe 1977). In fact, the Yule and log-normal distributions generated by various growth models and even stretched exponential, parabolic fractal and related forms might be preferable for distributions with fat, heavy or long tails (Okabe 1987, Laherrera and Sornette 1998). Here, however, we will develop the rank-size model largely because it represents a first attack on the problem of measuring the size of the Web, and there are good stochastic models that are consistent with the kinds of distributions that we observe. In particular, Simon (1957) has developed a growth model based on three assumptions that appear to fit many natural and social systems. First, new events or objects are created at a regular but random rate and of the smallest size. Second, the growth rate of all existing events is essentially random; and third, the rate is independent of the size of objects, but with average actual growth proportional to size. As the number of events grows, their distribution converges to the steady state $P(r) = K r^{-q}$, in which the exponent $q$ depends on the average growth rate of events, a rate that converges to zero in the steady state. This is a very useful interpretation: when the growth rate is near to 1, it implies a high value of linear correlation and hence indicates that the system is in its immature early stages, akin to that associated, for example, with the Web. As we will show below, our null hypothesis is that the system is already in the steady state with $q = 1$, but that deviations from this (which we will see in the rank-size plots) will indicate how far different domains (countries) in the system are from the steady state.
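A growth process of this kind can be simulated directly. The sketch below uses one common discrete formulation of a Simon-type process, which may differ in detail from the variant discussed above: at each step a new object of size 1 is created with probability `birth_prob`, and otherwise an existing object grows by one unit with probability proportional to its current size. The function name, the parameter names, and the fixed seed are assumptions made for illustration.

```python
import random

def simulate_simon(steps, birth_prob, seed=0):
    """Simulate a Simon-type growth process and return sizes in rank order.

    Hypothetical helper: one textbook formulation, not necessarily the
    exact model of Simon (1957) as applied in this paper.
    """
    rng = random.Random(seed)
    sizes = [1]      # start with a single object of the smallest size
    owners = [0]     # one entry per unit of size, naming the owning object
    for _ in range(steps):
        if rng.random() < birth_prob:
            # A new object enters at the smallest size.
            sizes.append(1)
            owners.append(len(sizes) - 1)
        else:
            # Picking a unit uniformly at random selects an object with
            # probability proportional to its size.
            chosen = owners[rng.randrange(len(owners))]
            sizes[chosen] += 1
            owners.append(chosen)
    return sorted(sizes, reverse=True)

ranked_sizes = simulate_simon(steps=5000, birth_prob=0.1, seed=1)
```

Plotting the logarithm of `ranked_sizes` against the logarithm of rank gives a rough picture of how the upper ranks approach a straight line while the lower ranks fall away from it.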
We are also aware of several other models that might be as appropriate as the rank-size. Simon's (1957) model is indeed equivalent to those that generate the Yule and log-normal distributions where the short tail of the distribution does not accord with the rank-size relation. In fact, most applications of scaling laws to these kinds of distribution "conveniently forget" the short tail, fitting the model to the long tail, on the assumption that the size of events has to pass a certain threshold before the maturity of rank-size takes effect. It should be noted that there is considerable debate about the theoretical validity of the Simon model (Okabe 1977). Our interpretation suggests that the Simon model is compatible with explaining the short tail as well, although we will only briefly explore this point in this introductory paper.
As far as we are aware, the strict rank-size rule has not been applied to the distribution of Web pages and their hyperlinks for different country domains. However, Albert et al. (1999) use pure scaling to measure the frequency distributions of the numbers of in-degrees and out-degrees of links from Web sites, with implied values of and q = 1.1 respectively for the associated rank-size relations. Faloutsos et al. (1999) have examined out-degrees from a couple of Internet domains at three points in time in 1997-1998, and show that the equivalent q exponent varies from 0.81 to 0.82 to 0.74 for the rank-size rule and from 1.15 to 1.16 to 1.20 for the same data fitted in its simple frequency form. However, because these contributions stress connectivity, both works are almost entirely associated with hyperlinks found between a subset of the Web that is, at one level, comparable to the air route network in the real world, as opposed to the Web sites being the equivalent of city sizes.
In fact we would argue that the fundamental concept of the power law performs at its best when ranking a non-directional, agglomerative or accumulative set of events (or objects) that are spatially dispersed over a certain area. This accords with the developments of scaling laws in physics as well as in biology. Moreover, in order to comprehend the Web in a geographical context, it is essential to compare various distribution patterns associated with the size of the Web with those of the real world. In this light, we will measure the size of domains at the global level as well as hyperlinks observed within and between them. We will then compare them with the distribution patterns of national population and GDP. This not only contextualizes Web size with a real geography, but also helps further to ground the earlier results obtained by the Albert and Faloutsos groups.
For this analysis, we obtained data for population, GDP, Web site size and hyperlinks, the full listing of which is given in Appendix A. Using the AltaVista search engine, we obtained the total number of Web pages registered under 180 global domains that represent a nation, region or a large set of organizations of similar character (e.g., "mil" as in the U.S. Military). At the same time, we obtained the number of hyperlinks within and between these domains. Real GDP in billions of $US for 1998 (at 1990 values) and total population for 1994 were taken from IMF World Outlook (IMF 1999) and the GIS package MapInfo Professional, respectively. Although we initially obtained a data set for 180 global domains, we immediately excluded some of the data, conducting all our analysis with 150 data points for the following reasons:
The domain size ranged from the super-scales of "com (commercial)," 48,284,554 pages, and "net (network)," 7,467,435 pages, down to small country domains such as "cg (Congo)," 109; and "tp (East Timor)," 106. Figure 1 presents a histogram of the domain sizes where over 25 percent of them fall within the intervals from 5,000 to 10,000 pages.
Figure 1. The Distribution of Domain Size
The number of links between the 180 global domains was also investigated. We used script commands for generating multiple queries, $n^2$ separate queries for $n$ sites, and counted the number of hyperlinks between each sub-domain by applying the syntax "+url: <sub-domain1>.uk +link: <sub-domain2>.uk" (Dodge 1998). Within the 16,111 possible combinations, we observed a total of 76,735,152 links, of which 16.1 percent (12,318,346 links) were found between "com" and "net." Whether the database of the AltaVista search engine actually reflects an unbiased sample of Web sites remains an open question. Nevertheless, it is considered to be one of the most comprehensive indices of Web pages publicly available (Sullivan 1999), containing over 150 million Web pages (as of 1 February 1999). Thus, we assume that the AltaVista data reflect the actual state of the Web and can be relied upon.
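The query-generation step can be illustrated as follows. This is only a sketch of how $n^2$ query strings might be assembled from the syntax quoted above; the original scripts are not reproduced in this paper, the function name `build_link_queries` and the sample sub-domain names are assumptions, and no search engine is contacted.

```python
from itertools import product

def build_link_queries(subdomains, suffix=".uk"):
    """Build one query string per ordered pair of sub-domains.

    Hypothetical helper that only formats strings following the pattern
    "+url: <sub-domain1>.uk +link: <sub-domain2>.uk" quoted in the text.
    """
    return {
        (first, second): f"+url: {first}{suffix} +link: {second}{suffix}"
        for first, second in product(subdomains, repeat=2)
    }

# Three illustrative sub-domain names (assumed) give 3 * 3 = 9 queries.
queries = build_link_queries(["ac", "co", "gov"])
```

Because the pairs are ordered and include each sub-domain paired with itself, the number of queries grows as the square of the number of sub-domains, which is why the counts become large for 180 domains.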
Correlations between Web size and the total number of links assigned to domains regardless of direction (that is, both incoming and outgoing links) are shown in Appendix B, together with those based on population and GDP. It is not surprising to find an $r^2$ for Web size and hyperlinks of 97 percent, but this simply confirms consistency in the average number of links per page. The overall average was 3.92, much lower than the 7 obtained by Albert et al. (1999). This may be partly explained by the differences in the methods of data collection. Albert's group counted the number of pages at some specific sites such as those of their own research institutes as well as the White House whereas our data, while globally obtained, depends on a commercial search engine.
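A squared correlation of this kind can be computed as sketched below. The sketch assumes that $r^2$ is the squared Pearson correlation of the untransformed values, which is not stated explicitly here; the function name `r_squared` is hypothetical. The sample input is the domain size and total link count of the eight countries listed in Appendix A, so the result is not comparable with the 97 percent reported for the full set of domains.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# Domain size (pages) and total links for the eight Appendix A rows.
domain_size = [871, 632, 2107, 1325365, 1053, 423, 3501, 5969]
total_links = [179789, 124039, 196640, 1952830, 273152, 431931, 105806, 89262]

sample_r2 = r_squared(domain_size, total_links)
print(round(sample_r2, 2))
```

With so few rows, a single large domain dominates the result, which illustrates why a high $r^2$ on raw values mainly reflects the largest domains.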
We have ranked the Web site, demographic, and economic data in descending order. These are measured, respectively, by the number of Web sites for each domain, number of incoming links into each domain (in-degrees), number of outgoing links (out-degrees), total links associated with each domain (in-degrees and out-degrees and inter-domain links), real GDP in billions of dollars US, and national population. In Figure 2, we present a complete graphical analysis of these data, plotting the distributions on logarithmic scales, visually associating various data, and computing idealized and actual rank-size relations.
(a) Rank Size of GDP (billions US$) and Web Site Size
(b) Rank Size of Population and Web Site Sizes
(c) Rank Size of Population and Web Site Size (same as (b) but with a bent trendline for the ideal rank size distribution)
(d) Rank size of Web site size and hyperlinks
(e) Rank size of the number of in-coming, out-going and total of hyperlinks
(f) Rank size of population, GDP, Web site size and hyperlinks
Figure 2: Rank-Size Data, and Power Law Relationships Governing Web Size
None of the distributions follow the classic linear rank-size form, for all distributions are concave to the origin. The largest sizes do appear to conform to simple power laws but the smaller sizes would be radically over-estimated using these power laws. It is immediately clear from this analysis that the distributions of population and GDP are much closer over their larger size range to rank-size than any of the Web data. The rank-size is classic for the population of the largest 100 or so countries (out of 150) with GDP the same for over half (75). We consider that the smaller than expected (from the rank-size rule, that is) sizes of countries in these data are probably as much due to unusual boundaries as to higher growth rates amongst these groups. In contrast, only the first 20 or so domains accord with rank-size when Web page size is examined. This is a classic demonstration of a system undergoing very rapid growth amongst most of its objects with an implication that as one examines successively lower and lower ranks, growth rates would rise inexorably. Of course we have nothing other than Simon's (1957) model to convince us of this, but in terms of more mature systems such as population, the notion is consistent with the data and with our intuition.
Examining the number of links is more problematic. The total and outgoing links conform strongly to rank-size, at least for the largest 100 domains measured by these linkages, but incoming links are the least like rank-size of any data in our analysis. Again, there is a plausible explanation that outgoing links constitute most of the links in Web pages to date (and maybe forever), and these tend to reflect our perceptions of size while incoming links reflect our ability to link with others. These distributions are quite different and asymmetric in that we tend to know more than proportionately about bigger places than the smaller. This too should change as systems mature. The rank-size relations fitted to these six distributions are shown in the table where we list the intercept, the slope, the squared correlation, and the ratio of the top ranked site's predicted size $P'(1)$ (from the rank-size rule) to its observed value $P(1)$:
| Distribution | Intercept $\log K$ | Slope $-q$ | Correlation $r^2$ | $P'(1)/P(1)$ |
|---|---|---|---|---|
| No. Web Pages | 21.22 | 2.91 | 0.90 | 35.84 |
These results are statistically rather good but in terms of their actual fit, the evidence of primacy in the top-ranked sites for Web data and for GDP, and the substantial deviations in the short tail for the Web data particularly, reveal that rank-size is only a theoretical ideal which might be attained in the steady state when all domains have been subjected to growth for a long period. To illustrate these points more clearly, we have computed idealized rank-size distributions for each set of data based on $P''(r) = P(1)\, r^{-1}$, where $P''(r)$ is the idealized (pure) value at rank $r$ and $P(1)$ is the largest observed value in the set. This equation generates a straight line on the log-log plots and shows how near or far the actual distribution in question is from the steady state. These, in fact, indicate that the largest sizes do conform well in all cases to rank-size with the shorter tails departing substantially in terms of the slope. For the total Web pages at each site, we have computed two regimes based on the pure rank-size: the first based on the above equation, the second based on $P'''(r) = P''(27)\, r^{-4.25}$ for $r > 27$, which better mirrors the data in the lower ranges.
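The idealized distribution is simple to generate and compare against observed values. The following sketch assumes natural logarithms for the deviation measure; the function names and the small observed sample are illustrative assumptions, while the value 48,284,554 is the size of the "com" domain reported earlier.

```python
import math

def ideal_rank_size(p1, n):
    """Idealized sizes P(1) / r for ranks r = 1..n (pure rank-size, q = 1)."""
    return [p1 / r for r in range(1, n + 1)]

def log_deviation(observed):
    """Log of observed over idealized size at each rank.

    Hypothetical measure: 0 means the observed size lies on the ideal
    line, and negative values mean it falls below the line.
    """
    ranked = sorted(observed, reverse=True)
    ideal = ideal_rank_size(ranked[0], len(ranked))
    return [math.log(obs / idl) for obs, idl in zip(ranked, ideal)]

# Ideal sizes for the first three ranks, anchored on the "com" domain.
ideal_top3 = ideal_rank_size(48284554, 3)

# Illustrative (assumed) observed sizes; the third falls below the ideal.
deviations = log_deviation([100, 50, 10])
```

By construction the deviation at rank 1 is zero, so the measure describes only how quickly the lower ranks fall away from the idealized line.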
Finally we have broken each data set into two ranges by eye and have fitted rank-size relations to each (sample image shown in Figure 3). These are shown below.
Figure 3. Application of a bent line on the log-log plots.
| Distribution | Slope $-q_1$ for upper ranks | Correlation $r^2$ for upper ranks | Slope $-q_2$ for lower ranks | Correlation $r^2$ for lower ranks | $w_2 q_2 / w_1 q_1$ |
|---|---|---|---|---|---|
| No. Web Pages | 0.88 | 0.97 | 4.25 | 0.98 | 31.05 |
The final column shows the weighted ratio between the upper ranks and lower ranks, where $w_1$ and $w_2$ are the weights of the data counted into the upper and lower ranks, respectively. These results suggest that there is substantial change still to work itself out within the World Wide Web as the lower ranked sites gradually grow towards the more mature sites at the upper levels of the range, as is already the case with the distribution patterns of population and GDP to some extent. None of this explores how sites change their rank during this process, which is yet another matter for future research.
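A two-range fit of this kind might be sketched as follows. The break between the ranges was chosen by eye in our analysis, so the sketch simply takes it as an input; the function names, the synthetic data, and the use of natural logarithms are assumptions, and the weighted ratio is not computed because the weights are not defined precisely enough here to reproduce.

```python
import math

def ols_slope_r2(points):
    """Least-squares slope and squared correlation for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def fit_two_regimes(sizes, break_rank):
    """Fit separate rank-size lines to ranks <= break_rank and > break_rank.

    Returns (q1, r2_upper, q2, r2_lower). Hypothetical helper; each range
    needs at least two ranks with differing sizes.
    """
    ranked = sorted(sizes, reverse=True)
    points = [(math.log(r), math.log(p)) for r, p in enumerate(ranked, start=1)]
    slope1, r2_upper = ols_slope_r2(points[:break_rank])
    slope2, r2_lower = ols_slope_r2(points[break_rank:])
    return -slope1, r2_upper, -slope2, r2_lower

# Synthetic data: exponent 1 up to rank 10, exponent 4 beyond it.
upper = [1000.0 / r for r in range(1, 11)]
lower = [100.0 * (r / 10.0) ** -4 for r in range(11, 41)]
q1, r2_upper, q2, r2_lower = fit_two_regimes(upper + lower, break_rank=10)
```

On the synthetic data the two fitted exponents recover the values used to generate each range; with observed data the result depends on where the break is placed.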
Our analysis of the size distribution of global domains and its comparison with the real geography of economic and demographic distributions is the first step in a wider exploration of the shape and structure of cyberspace which promises to enrich our understanding of the information society. The correlation that we found between the size of the Web and population was low, although that between the Web and GDP was much higher with an $r^2$ over 70 percent, confirming our general intuition that the economic development of a domain is all the more important in explaining its size. We anticipate that in time, as the global information society matures, the size of the Web will come to reflect the population size of nations much more than it does at present, although by then there may be other specialist Web-like resources that will depend more on the economy than on indicators of demographic size.
Moreover, as the overall rank-size patterns of the Web, its links, and GDP are quite similar, it is perhaps reasonable to conclude that the distribution of Web domains and their links broadly reflects existing economic activity patterns, albeit with differences between the distribution patterns of population and Web services. We also expect that Web-based services are carried out at locations remote from places at which these services are initially registered, and we would expect such differences to be reflected in the flows of information between domains, that is, the trade in information between countries. Although our link data contain this, we have not yet been able to explore the patterns contained therein in ways that would confirm this speculation.
The power law relations that we have examined all display the tendency for the number of small events (Web sizes, links, populations, and GDP of small countries) to be less than what the rank-size rule predicts, but with a Simon-type model (1957) this can easily be explained by the smaller domains not yet having reached maturity. We did not go as far as to compute growth rates or exponents for every level of rank, but we did illustrate the plausibility of the hypothesis that the largest domains approximate the rank-size rule while the smaller domains are growing towards this steady state. The differences in power law that we computed between these two sets are consistent with this notion. In future work, we will explore these ideas further, but to do this we will require much better data at more than one point in time. This analysis based on a single time-point essentially forms a first step in an interpretation of how Web space is developing. There are many other issues and possibilities that need to be addressed. As well as implementing a time-series analysis, we need to clarify definitions of domains in spatial as well as sectoral terms, and we need to consider suitable spatial and temporal aggregations which affect our analysis.
A major problem is still the definition of the U.S. domain. Super-national level domains such as "com" and "org" require careful estimation as to the extent of their contribution by U.S. firms and those based in other countries. Some of these large domains were omitted in this study, but their inclusion would significantly alter the value of Web size assigned to the U.S. domain, which in turn would cause significant changes to the distributions. However, it is our belief that the pattern of rank-size would not be markedly altered by such changes, and an essential next step is to see how robust this kind of analysis is to changes in time. Only then will we be in a position to make some tentative predictions as to the future form of cyberspace.
We are grateful to Martin Dodge (1998) who originally collected the data on Web size and hyperlinks from AltaVista (1999).
Sources: AltaVista (1998), IMF World Outlook (1999), MapInfo (1999).
| No. | Country | Domain | Population | GDP (billions US$) | Domain Size (pages) | Incoming Links | Outgoing Links | Total No. of Links |
|---|---|---|---|---|---|---|---|---|
| 4 | Antigua and Barbuda | ag | 64794 | 0.409 | 871 | 742 | 179422 | 179789 |
| 19 | Bosnia and Herzegovina | ba | 3707000 | 4.465 | 632 | 1001 | 123422 | 124039 |
| 55 | Holy See (Vatican) | va | 1000 | 0.021 | 2107 | 1209 | 195522 | 196640 |
| 71 | Korea, Republic of | kr | 43663405 | 500.410 | 1325365 | 1828271 | 898557 | 1952830 |
| 102 | Papua New Guinea | pg | 3727250 | 9.733 | 1053 | 1114 | 272317 | 273152 |
| 113 | Sao Tome and Principe | st | 117504 | 0.136 | 423 | 768 | 431341 | 431931 |
| 133 | Trinidad and Tobago | tt | 1227443 | 12.342 | 3501 | 4292 | 102605 | 105806 |
| 139 | United Arab Emirates | ae | 862000 | 42.901 | 5969 | 4805 | 86040 | 89262 |
Sources: AltaVista (1998), IMF World Outlook (1999).
Correlation between population and the Web size ($R^2 = 0.24$).
Correlation between population and hyperlinks ($R^2 = 0.09$).
Correlation between GDP and the Web size ($R^2 = 0.74$).
Correlation between GDP and hyperlinks ($R^2 = 0.70$).
Correlation between population and GDP ($R^2 = 0.82$).
Correlation between the Web size and hyperlinks ($R^2 = 0.97$).