Unveiling the Origin of Web Surfing Regularities

Jiming Liu, Ph.D.

Shiwu Zhang

Department of Computer Science, Hong Kong Baptist University

Kowloon Tong, Hong Kong

Jiming@comp.hkbu.edu.hk

The World Wide Web has evolved into a dynamic, distributed, heterogeneous, and complex network that is hard to control [1][2]. For many people, whether Web developers or researchers concerned with the dynamics of complex systems such as the Internet, human communities, and ecologies, it has become imperative to understand and interpret, rather than merely observe, the strong regularities that emerge from the 'messy' universe of the World Wide Web. So far, there have been only a few efforts to describe Web surfing regularities [2][3][4], and the underlying causes and interrelated elements of the observed regularities remain unknown.

We have synthesized and validated a model of user surfing behavior that takes into account Web topology, information distribution, and user interest profiles, and we use it to simulate user surfing behavior and explore the origin of regularities in World Wide Web surfing. In our experiments, we have discovered that it is the unique distribution of user interest that leads to the regularities in user surfing behavior, i.e., a power law distribution of user surfing depth. The Web topology influences only the shape parameters of the distribution, without changing its nature. We have also discovered that the power law of link click frequency is largely due to purposeful surfing behavior. Our work shows that the regularities in the Web are interrelated and are not artifacts of a particular surfing process. The results are summarized in Figures 1 and 2.

The unique distribution of user interest creates the regularities in user surfing behavior, i.e., a power law distribution of user surfing depth: In our simulation studies, we have constructed a virtual Web space that reflects the key topological properties observed in the World Wide Web. We assume that the Web space covers a number of different domains, or topics. The information contained in a node is represented by an information vector, which is initialized according to a statistical distribution. Each item in the information vector corresponds to the quantity of information on one topic. If a node belongs to topic i, a normally distributed random number is added to the ith item of its information vector to strengthen its relevance to that topic. A link between two nodes reflects the similarity between them; we assert a link if the similarity/distance between the two nodes exceeds a predefined threshold, termed the accessibility (r). Each user has his/her own interests and motivations, which are represented in an interest vector. A user's interest vector defines a profile of the user's interests in different topics and is initialized according to a statistical distribution. We assume that the user begins surfing from a Web site that contains links pointing to different topics. The user's interest vector is updated based on the information he/she has retrieved. The support, S, that drives the user to surf is computed according to S(t+1) = S(t) + DM + DR, where DM denotes the user motivation lost to latency, which is defined by a log-normal distribution [5], and DR denotes the reward to the user, which is proportional to the information that the user has received. If S ∉ [Smin, Smax], the user stops surfing (i.e., fatigued when S ≤ Smin or satisfied when S ≥ Smax). Smin and Smax are computed according to the user's initial interests.
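The support update and stopping rule can be illustrated with a minimal sketch for a single user. It is not the simulation used in the experiments and rests on simplifying assumptions: Smin and Smax are fixed constants rather than values derived from the user's initial interests, the information received at each step is replaced by a uniform random stand-in, and the function name and all parameter values (s0, mu, sigma, reward_scale, max_steps) are hypothetical.

```python
import random


def surf_once(rng, s0=1.0, s_min=0.0, s_max=3.0, mu=-1.0, sigma=0.5,
              reward_scale=0.4, max_steps=10_000):
    """Simulate one surfing session and return (depth, outcome).

    All default values are illustrative assumptions, not values from the text.
    """
    s = s0
    for depth in range(1, max_steps + 1):
        # DM: motivation lost to latency, drawn from a log-normal
        # distribution and applied as a loss (negative contribution).
        dm = -rng.lognormvariate(mu, sigma)
        # DR: reward proportional to the information received; the
        # information itself is a uniform random stand-in here.
        dr = reward_scale * rng.random()
        # S(t+1) = S(t) + DM + DR
        s = s + dm + dr
        if s <= s_min:
            return depth, "fatigued"
        if s >= s_max:
            return depth, "satisfied"
    # Safety cutoff so the sketch always terminates.
    return max_steps, "cutoff"


if __name__ == "__main__":
    rng = random.Random(42)
    sessions = [surf_once(rng) for _ in range(1000)]
    mean_depth = sum(d for d, _ in sessions) / len(sessions)
    print("mean surfing depth:", mean_depth)
```

The returned depth is the number of pages visited before the support leaves [Smin, Smax]; collecting it over many simulated users gives the surfing depth distribution discussed in the text.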

In our first experiment (Experiment 1), we adopted a power law for both the user interest distribution and the information distribution over the Web structure. The simulation results indicated that the distribution of user page-surfing steps follows a power law, while the distribution of user domain-surfing steps follows an exponential distribution. These findings are consistent with the empirically observed data sets (see Figure 1). Both our experiment and the real data sets also revealed that the distribution of link click frequency, i.e., the number of times users traverse a link, follows a power law. To our knowledge, this regularity has not been reported before. To explore whether these regularities are influenced by the information distribution over the Web structure, we conducted a second experiment (Experiment 2), in which we adopted a normal distribution for the information distribution while keeping the other parameters unchanged. The results show that the regularities persist, with only small changes in the shape parameters. This indicates that the information distribution does not influence the order in user surfing behavior. However, when we adopted a normal distribution for the user interest distribution (Experiment 3), the tail of the user surfing depth distribution became exponential. This result confirms that the power law of the user surfing depth distribution is determined by the unique distribution of user interest.
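One simple way to tell a power law from an exponential distribution is to fit a straight line to the depth histogram on log-log axes and again on semi-log axes: a power law is linear in the former, an exponential in the latter. The sketch below uses ordinary least squares for this, as an illustration only. The function name fit_line and its log_x parameter are hypothetical, and a plain regression on a raw histogram is a rough diagnostic rather than a rigorous statistical test.

```python
import math
from collections import Counter


def fit_line(depths, log_x=True):
    """Least-squares fit of log(relative frequency) against depth.

    log_x=True  regresses on log(depth): linear for a power law.
    log_x=False regresses on depth:      linear for an exponential.
    Returns (slope, r_squared).
    """
    counts = Counter(d for d in depths if d >= 1)
    if len(counts) < 2:
        raise ValueError("need at least two distinct depths")
    total = sum(counts.values())
    keys = sorted(counts)
    xs = [math.log(d) if log_x else float(d) for d in keys]
    ys = [math.log(counts[d] / total) for d in keys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared


if __name__ == "__main__":
    # Synthetic depths whose counts are proportional to depth ** -2.
    depths = []
    for d in range(1, 7):
        depths.extend([d] * (3600 // (d * d)))
    print("log-log fit: ", fit_line(depths, log_x=True))
    print("semi-log fit:", fit_line(depths, log_x=False))
```

Comparing the two r_squared values indicates which form describes the data better; the log-log slope plays the role of the regression slopes reported for Figure 1.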

The user's content-prediction ability determines the power law distribution of link click frequency: In our experiments, we have also studied three categories of users, distinguished by their interests and familiarity with the Web: random users, who have no obvious intention in Web surfing; rational users, who have goals to achieve but are not familiar with the Web structure; and recurrent users, who have specific intents and are very familiar with the Web structure. The ability to predict the content of next-level nodes becomes stronger as we move from random to recurrent users (see Figure 2). Simulations with the three user categories showed that the regularities of user surfing depth on pages and domains remain the same, whereas the power law of the link click frequency distribution disappears as we move from recurrent users to random users. This result shows that the order in link click frequency comes from the user's content-prediction ability, that is, whether or not a user can determine his/her next step according to his/her own interests and the names of hyperlinks.
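The difference between the three user categories can be sketched as three link-selection rules of increasing content-prediction ability. The rules below are assumptions made for illustration, since the text does not specify them: a random user picks a link uniformly, a rational user picks a link with probability proportional to his/her interest in the topic the link appears to point to, and a recurrent user always picks the link with the highest interest. The function name choose_link and its arguments are hypothetical.

```python
import random


def choose_link(category, interest, link_topics, rng):
    """Return the index of the link a user follows next.

    interest    -- list of non-negative interest weights, one per topic
    link_topics -- topic index that each outgoing link appears to point to
    The three selection rules are illustrative assumptions.
    """
    if not link_topics:
        raise ValueError("node has no outgoing links")
    indices = range(len(link_topics))
    weights = [interest[t] for t in link_topics]
    if category == "random":
        # No content prediction: every link is equally likely.
        return rng.randrange(len(link_topics))
    if category == "rational":
        # Partial prediction: links are weighted by interest in their topic.
        if sum(weights) <= 0:
            return rng.randrange(len(link_topics))
        return rng.choices(indices, weights=weights, k=1)[0]
    if category == "recurrent":
        # Full prediction: always follow the best-matching link.
        return max(indices, key=weights.__getitem__)
    raise ValueError(f"unknown user category: {category!r}")


if __name__ == "__main__":
    rng = random.Random(7)
    interest = [0.1, 0.7, 0.2]
    link_topics = [0, 1, 2, 1]
    for category in ("random", "rational", "recurrent"):
        clicks = [0] * len(link_topics)
        for _ in range(1000):
            clicks[choose_link(category, interest, link_topics, rng)] += 1
        print(category, clicks)
```

Counting how often each link is chosen over many users gives the link click frequency; under these assumed rules, the clicks concentrate on fewer links as the selection moves from random to recurrent.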

In the experiments, we have found that, to obtain the best performance in Web surfing (measured by the product of the average user surfing depth and the user satisfaction rate), the accessibility (r) of the Web structure should be between 0.7 and 0.8 (where 0 < r < 2). This corresponds to an average of 9 to 17 links per Web page, which is consistent with the number we observed in the real NASA data set, 13.
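The performance measure can be written out as a short sketch. The session representation (a list of depth and satisfied pairs) and the function names are hypothetical, and the numbers in the usage example are made up for illustration; they are not experimental results.

```python
def surfing_performance(sessions):
    """Average surfing depth multiplied by the satisfaction rate.

    sessions -- list of (depth, satisfied) pairs, one per simulated user
    """
    if not sessions:
        raise ValueError("no sessions given")
    average_depth = sum(depth for depth, _ in sessions) / len(sessions)
    satisfaction_rate = sum(1 for _, ok in sessions if ok) / len(sessions)
    return average_depth * satisfaction_rate


def best_accessibility(sessions_by_r):
    """Return the accessibility value r whose sessions score highest."""
    return max(sessions_by_r,
               key=lambda r: surfing_performance(sessions_by_r[r]))


if __name__ == "__main__":
    # Made-up sessions for two hypothetical values of r.
    sessions_by_r = {
        0.5: [(3, False), (5, True), (4, False), (4, False)],
        0.75: [(4, True), (2, False), (6, True), (8, False)],
    }
    for r, sessions in sessions_by_r.items():
        print(r, surfing_performance(sessions))
    print("best r:", best_accessibility(sessions_by_r))
```

Multiplying the two factors penalizes settings in which users surf deeply but rarely end satisfied, as well as settings in which users are satisfied only after very short sessions.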

1.    R. Albert, H. Jeong, and A.-L. Barabasi. Diameter of the World-Wide Web. Nature, 401:130-131, September 9, 1999.

2.    B. A. Huberman and L. A. Adamic. Growth dynamics of the World-Wide Web. Nature, 401:131, September 9, 1999.

3.    B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong Regularities in World Wide Web Surfing. Science, 280:95-97, April 3, 1998.

4.    A.-L. Barabasi and R. Albert. Emergence of Scaling in Random Networks. Science, 286:509-512, October 15, 1999.

5.    D. Helbing, B. A. Huberman, and S. M. Maurer. Optimizing traffic in virtual and real space. In: D. Helbing, H. J. Herrmann, M. Schreckenberg, and D. E. Wolf (eds.) Traffic and Granular Flow '99: Social, Traffic, and Granular Dynamics (Springer, Berlin), in print, 2000.

Figure 1 (Large) Distribution of user surfing depth obtained from the experiments: Red, green, and blue dots denote Experiments 1, 2, and 3, respectively. The red and green lines are obtained by linear regression, with slopes of -1.65 and -1.9, respectively. The user surfing depth distributions in Experiments 1 and 2 follow a power law, while that in Experiment 3 is exponential. (Small, top) Empirical lab data from the Systems Realization Laboratory, Georgia Institute of Technology, and (small, bottom) empirical NASA data: distribution of user surfing depth. Both empirical data sets follow a power law.

Figure 2 Distribution of link click frequency, separated by user category. The power law becomes more pronounced as we move from random users to rational users to recurrent users.