What the Internet Is Telling Us About Itself

Tim O'Reilly <tim@ora.com>
O'Reilly & Associates, Inc.
USA

Abstract

As a subject for market research, the Internet is unique. Unlike other areas, where trends can be derived only by statistical sampling or potentially inaccurate anecdotal information, the infrastructure of the network can actually be measured. We can use advanced network examination techniques to track technology, performance, and trends. You might say that it's possible to get the Internet to tell us about itself, rather than asking its users to tell us about the Internet.

Introduction

As a subject for market research, the Internet is unique. Unlike other areas, where trends can be derived only by statistical sampling or potentially inaccurate anecdotal information, the infrastructure of the network can actually be measured. We can use advanced network examination techniques to track technology, performance, and trends. You might say that it's possible to get the Internet to tell us about itself, rather than asking its users to tell us about the Internet. Not surprisingly, network examination reveals a picture very different from the prognostications of analysts who are, at bottom, only guessing where things are headed.

The purpose of this paper is to outline some of the many areas that are accessible to network examination, as well as to introduce some key concepts.

Web crawlers, robots, and spiders

The basic techniques used for network examination have a great deal in common with the techniques used by Internet search engines such as AltaVista, Infoseek, or Lycos. A program generally known as a robot goes out over the network, examines the publicly accessible contents of a site, and returns with what it has found. In the case of the typical search engine, what is searched is content, and what is constructed is a collection of pointers indexed by various keywords. This index can be searched far more quickly than the millions of pages from hundreds of thousands of sites that were sifted to make it up. However, searching for content is only one application of robot technology. It is also possible to search for such things as the penetration of particular technologies, patterns of relationships between sites, performance across the network, and certain types of aggregate user information.
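
To make the idea concrete, here is a minimal robot sketched in Python: it fetches a page, pulls out the links, and follows them breadth-first up to an arbitrary page limit. Real crawlers also honor robots.txt, throttle their requests, and index what they retrieve; the starting URL below is only a placeholder.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collect the href targets of <a> tags on a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(start_url, max_pages=10):
        """Visit up to max_pages pages reachable from start_url, breadth-first."""
        seen = {start_url}
        queue = deque([start_url])
        visited = []
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            try:
                with urlopen(url, timeout=10) as response:
                    body = response.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # unreachable or malformed sites are simply skipped
            visited.append(url)
            extractor = LinkExtractor()
            extractor.feed(body)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return visited


    if __name__ == "__main__":
        # the starting URL is a placeholder; substitute a site of interest
        for page in crawl("http://www.example.com/"):
            print(page)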

Server market share

The Netcraft server survey is perhaps the best-known example of a site devoted to tracking technology penetration on the Net. On a monthly basis, Netcraft sends out a robot to query every visible Web site on the Net with a simple question: what server implementation are you running? The resulting market share figures are tracked eagerly by anyone involved in the Web server market.
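
The mechanics of such a survey are straightforward, whatever Netcraft's actual procedure may be: every Web server volunteers its identity in the Server header of its HTTP responses, so a HEAD request is enough to collect it. A sketch in Python, with a placeholder host list:

    from collections import Counter
    from urllib.request import Request, urlopen

    HOSTS = ["www.example.com", "www.example.org"]  # placeholder host list


    def server_banner(host):
        """Return the Server header a host reports, or None if unreachable."""
        request = Request("http://%s/" % host, method="HEAD")
        try:
            with urlopen(request, timeout=10) as response:
                return response.headers.get("Server")
        except OSError:
            return None


    if __name__ == "__main__":
        tally = Counter(server_banner(host) or "unreachable" for host in HOSTS)
        total = sum(tally.values())
        for server, count in tally.most_common():
            print("%-40s %5.1f%%" % (server, 100.0 * count / total))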

Netcraft's figures show that whatever the advertising or company hype might argue, the top three commercial server vendors are Netscape, Microsoft, and O'Reilly, and that the most widely used and fastest-growing server is actually not a commercial offering at all but the Apache freeware server. As of November 1996, Apache held 41 percent of the Web server market. Netscape was in second place with about 13 percent, the freeware NCSA server was in third with 12.63 percent, Microsoft was in fourth with 9.5 percent, and O'Reilly's WebSite (http://website.ora.com) was in fifth with 3.77 percent of the servers detected.

Netcraft's figures also show the relative market share of Unix and NT in the Web server market. Unix remains by far the dominant Web server platform, but NT actually has a dominant share when only commercial servers are considered. When freeware servers are disregarded and Netcraft's figures are recast for commercial servers only, Netscape holds about 40 percent market share, Microsoft about 30 percent, and O'Reilly about 12 percent, with the remaining 18 percent divided among several dozen other companies. Netscape reports that NT makes up about half of its server sales. The Microsoft and O'Reilly products run only on NT or Windows 95.

A great deal of Web server activity still appears not to be under the aegis of MIS departments, which generally frown on using unsupported software; targeted customer interviews support this inference.

Penetration of secure commerce capabilities

Using similar techniques, in December 1996, Netcraft and O'Reilly began a baseline study of the use of secure Web technology in the form of Netscape's Secure Sockets Layer (SSL) protocol and of authentication through certificates from companies like Verisign. Over 650,000 sites replied to our queries. Of these, approximately 65,000 were running SSL-enabled servers--but only about 3,000 of those also had a valid, matching certificate authenticating their site.
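
The kind of check involved can be illustrated as follows (this is a sketch of the principle, not the study's actual procedure): attempt an SSL handshake on port 443 and see whether the certificate presented both validates and matches the host name. The host names below are placeholders.

    import socket
    import ssl

    HOSTS = ["www.example.com", "www.example.org"]  # placeholder host list


    def secure_status(host):
        """Classify a host: no SSL, SSL with a bad certificate, or SSL with a valid one."""
        context = ssl.create_default_context()  # verifies the chain and the host name
        try:
            with socket.create_connection((host, 443), timeout=10) as raw:
                with context.wrap_socket(raw, server_hostname=host):
                    return "ssl + valid certificate"
        except ssl.SSLCertVerificationError:
            return "ssl, bad or mismatched certificate"
        except (ssl.SSLError, OSError):
            return "no ssl"


    if __name__ == "__main__":
        for host in HOSTS:
            print(host, "->", secure_status(host))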

The count of SSL-enabled servers generally matched the numbers that are popularly cited. However, the relatively tiny number of certificates indicates either that there is a much lower degree of commerce readiness than is generally believed or that certification is just not that important to many of the sites offering secure commercial services. The study also tracks such factors as the rate of certificate uptake and the market share of certification authorities.

The point here is not to review our specific findings, however, but to point out the kind of data that is available via network examination.

Gathering such data is not a trivial undertaking, and analyzing it--establishing baselines and teasing out patterns--requires us to think in fresh ways about how to interpret the clues embedded in the Net. The approach is without doubt extremely useful: it can at times contradict and at other times complement traditional market research, suggesting directions for further examination by more familiar methods.

Java vs. ActiveX vs. Perl

We can also analyze, on an ongoing basis, the uptake of various technologies such as Java, JavaScript, and ActiveX, or Microsoft's and Netscape's competing extensions to HTML. Why take Microsoft's or Netscape's word for it when we can search through the millions of pages on the Web and actually determine the incidence of pages using each of these technologies? While we have not yet done this study, we believe that the search would show that CGI programs, mostly written in Perl, remain by far the dominant method for "activating" Web sites.
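
Such a census reduces to pattern matching over the pages a robot brings back. The sketch below counts pages that appear to contain a Java applet, an ActiveX control, JavaScript, or a form posting to a CGI program; the patterns are rough approximations and the sample pages are invented.

    import re
    from collections import Counter

    PATTERNS = {
        "Java applet": re.compile(r"<applet\b", re.IGNORECASE),
        "ActiveX control": re.compile(r"<object[^>]*\bclassid\s*=", re.IGNORECASE),
        "JavaScript": re.compile(r"<script\b", re.IGNORECASE),
        "CGI form": re.compile(r"<form[^>]*\baction[^>]*cgi", re.IGNORECASE),
    }


    def census(pages):
        """pages: iterable of (url, html_text) pairs. Returns per-technology page counts."""
        counts = Counter()
        for _url, html in pages:
            for label, pattern in PATTERNS.items():
                if pattern.search(html):
                    counts[label] += 1
        return counts


    if __name__ == "__main__":
        sample = [  # hypothetical pages standing in for a robot's haul
            ("http://a.example/", "<html><applet code='Clock.class'></applet></html>"),
            ("http://b.example/", "<form action='/cgi-bin/guestbook.pl'></form>"),
        ]
        for label, count in census(sample).most_common():
            print("%-16s %d pages" % (label, count))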

What will happen to the hype masters of the computer industry when the emperor stands naked, and the robots searching the Web haven't been trained to avert their eyes?

Browser market share

Because browsers don't sit out there on the network answering queries in the same way that servers do, browser market share must be established in a more roundabout way. Each time a browser contacts a server, it announces its type and version number, which is logged, along with the domain name or Internet Protocol (IP) number of the machine on which the browser resides.
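
Assuming logs in the common NCSA combined format, where the requesting host is the first field and the user-agent string is the last quoted field, tallying browser families takes only a few lines. The classification rules and the sample log line below are purely illustrative.

    import re
    from collections import Counter

    # host ident user [date] "request" status bytes "referer" "user-agent"
    COMBINED = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')


    def browser_family(user_agent):
        """Map a raw user-agent string onto a coarse browser family."""
        if "MSIE" in user_agent:
            return "Microsoft Internet Explorer"
        if user_agent.startswith("Mozilla/"):
            return "Netscape Navigator"  # roughly; other browsers also claim Mozilla
        return "other"


    def tally(log_lines):
        counts = Counter()
        for line in log_lines:
            match = COMBINED.match(line)
            if match:
                _host, agent = match.groups()
                counts[browser_family(agent)] += 1
        return counts


    if __name__ == "__main__":
        sample = [  # a made-up combined-format log line
            '10.0.0.1 - - [01/Nov/1996:12:00:00 -0500] "GET / HTTP/1.0" 200 2326 '
            '"http://referrer.example/" "Mozilla/3.0 (Win95; I)"',
        ]
        for family, count in tally(sample).most_common():
            print(family, count)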

The most advanced sites commonly use this data to automatically adapt the site to the user. C|net, for example, decides whether to advertise Macintosh or Windows products by determining the platform on which the user's browser is running.

Sites such as Interse have analyzed their own logs and published the results: as of November 1996, Netscape still held the lead, with 60 percent of those who visited the site using its browser. Microsoft held 31 percent market share, and 9 percent went to miscellaneous other vendors. The lag in technology uptake can also be observed. Even though Netscape 3.0 was released in March, it didn't pass the previous version in rate of use until September, and a large number of people are still using the older browser.

While such single-source data is subject to manipulation by knowledgeable (and unscrupulous) vendors and may not be representative, it would be fairly easy to establish a short, anonymous list of sites whose logs would be used to track this data, much as the New York Times bestseller list is derived from sales data of a relatively small (and secret) list of bookstores.

Log file analysis

In general, log file analysis is something that most sites do privately, using everything from home-grown programs to powerful commercial log analyzers like O'Reilly's Statisphere (see http://statisphere.ora.com).

Log file analysis can give a good deal of obvious information about a site and its visitors.

Other less obvious but nonetheless significant facts can also be gleaned. For example, a site might find that it attracts a large number of visitors using older browsers and that all the work designers are putting into the latest technology is wasted because those customers can't see it. A Web site can optimize the effort spent on it by learning immediately what its customers actually read, what links they follow, and whether they are actually drawn in by that special offer on the first page. Such capability is heaven for direct marketers: offers can be tested and revised in days instead of weeks or months, and effort can be focused where it does the most good.

When a site's data is integrated with that of other sites, it can be mined for additional significance. How does the site's traffic compare with that of others in the same market segment? Do peak traffic times say anything about the consumer vs. business-to-business mix of visitors? Do a large number of visitors come from a given company or industry?
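
Questions like these, and those in the previous paragraph, reduce to simple aggregation over the same access logs. As a sketch--again assuming the NCSA combined format and invented sample lines--the following counts requests per page and per hour of day, the raw material for asking what visitors actually read and when they tend to visit.

    import re
    from collections import Counter

    # pull the hour of day and the requested path out of a combined-format line
    LOG_LINE = re.compile(
        r'^\S+ \S+ \S+ \[\d+/\w+/\d+:(\d+):\d+:\d+ [^\]]+\] "(?:GET|POST) (\S+)'
    )


    def page_and_hour_counts(log_lines):
        """Return request counts per page and per hour of day."""
        pages, hours = Counter(), Counter()
        for line in log_lines:
            match = LOG_LINE.match(line)
            if match:
                hours[int(match.group(1))] += 1
                pages[match.group(2)] += 1
        return pages, hours


    if __name__ == "__main__":
        sample = [  # invented combined-format log lines
            '10.0.0.1 - - [01/Nov/1996:09:15:00 -0500] "GET /products/index.html HTTP/1.0" 200 5120',
            '10.0.0.2 - - [01/Nov/1996:21:40:00 -0500] "GET /index.html HTTP/1.0" 200 2048',
        ]
        pages, hours = page_and_hour_counts(sample)
        print("Most requested pages:", pages.most_common(5))
        print("Requests by hour of day:", sorted(hours.items()))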

Note that the data is not necessarily logged in terms that are ideally suited for analysis. For example, a Web server logs raw "hits," but what is really important are user visits. Thomas Novak and Donna Hoffman of Vanderbilt University define a visit as "a series of consecutive Web page requests from a visitor to a Web site." (Their paper, "New Metrics for New Media: Toward the Development of Web Measurement Standards," is required reading for anyone interested in the field of Web measurement.)

Novak and Hoffman go on to say that "Hits have been widely criticized as a measure of Web traffic. While the definitions of hits are quite consistent, the weakness of hits as a valid measure of traffic to a Web site is quite evident. Since hits includes all units of content (images, text, sound files, Java applets) sent by a Web server when a particular uniform resource locater (URL) is accessed, hits are inherently noncomparable across Web sites. Other than ignorance of the meaninglessness of hits, the only reason we feel a Web site would report numbers of hits is that this is typically a large and very impressive sounding number."

The translation of hits to visits illustrates the kind of data translation that is required to get meaningful information out of the raw data. First, the site must filter out the various units of content on each page to establish the number of "page views," and then infer a "visit" by grouping consecutive page views from the same visitor within a given period (say, 30 minutes).
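
A sketch of that translation, assuming the log has already been reduced to (visitor, timestamp, path) records and using a 30-minute inactivity timeout to separate visits; the sample data is invented:

    from datetime import datetime, timedelta

    PAGE_SUFFIXES = (".html", ".htm", "/")  # crude "is this a page, not an image?" test
    VISIT_TIMEOUT = timedelta(minutes=30)


    def count_visits(hits):
        """hits: iterable of (visitor, datetime, path) tuples, sorted by time."""
        page_views = 0
        visits = 0
        last_seen = {}  # visitor -> time of that visitor's last page view
        for visitor, when, path in hits:
            if not path.endswith(PAGE_SUFFIXES):
                continue  # an embedded object (image, applet), not a page view
            page_views += 1
            previous = last_seen.get(visitor)
            if previous is None or when - previous > VISIT_TIMEOUT:
                visits += 1  # more than 30 idle minutes: a new visit begins
            last_seen[visitor] = when
        return page_views, visits


    if __name__ == "__main__":
        def t(hhmm):
            hour, minute = map(int, hhmm.split(":"))
            return datetime(1996, 11, 1, hour, minute)

        sample = [  # invented traffic for a single morning
            ("host-a", t("09:00"), "/index.html"),
            ("host-a", t("09:00"), "/images/logo.gif"),  # a hit, but not a page view
            ("host-a", t("09:05"), "/products/"),
            ("host-a", t("10:00"), "/index.html"),       # a second visit by host-a
            ("host-b", t("09:10"), "/index.html"),
        ]
        views, visits = count_visits(sample)
        print("hits:", len(sample), "page views:", views, "visits:", visits)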

Internet pioneer Dale Dougherty of Songline Studios (http://www.songline.com) points out in addition that the Web server shouldn't be considered in isolation. It's essential to synchronize Web logs with e-mail logs, transactions, and other data to measure the total interactivity of the user with the site.

Of course, the browser-server dialogue opens up opportunities for additional data gathering via tokens called cookies, which the server can instruct the browser to maintain on its behalf. Using cookies, sites can track in greater detail the repeat behavior of visitors.
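
Mechanically, a cookie is nothing more than a Set-Cookie response header that the browser echoes back on subsequent requests. The minimal sketch below, written against Python's standard WSGI machinery purely for illustration, issues a visitor identifier on the first request and recognizes the visitor when the identifier comes back:

    import uuid
    from http.cookies import SimpleCookie
    from wsgiref.simple_server import make_server


    def application(environ, start_response):
        """On the first request, issue a visitor id; afterwards, recognize it."""
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        headers = [("Content-Type", "text/plain")]
        if "visitor_id" in cookies:
            body = "Welcome back, visitor %s\n" % cookies["visitor_id"].value
        else:
            visitor_id = uuid.uuid4().hex  # a fresh identifier for a new visitor
            headers.append(("Set-Cookie", "visitor_id=%s; Path=/" % visitor_id))
            body = "Hello, new visitor %s\n" % visitor_id
        start_response("200 OK", headers)
        return [body.encode("utf-8")]


    if __name__ == "__main__":
        with make_server("localhost", 8000, application) as httpd:
            httpd.serve_forever()  # visit http://localhost:8000/ twice to see the effect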

Collaborative filtering

Even more advanced techniques such as collaborative filtering can be used both to create a more effective user experience and to gather unique market intelligence about a customer base. Demonstration sites such as Firefly or The Movie Critic (http://www.moviecritic.com, created by O'Reilly affiliate Songline Studios) have applied this technology to movie ratings. A customer rates a small number of movies; the system uses powerful statistical analysis techniques to group that customer with like-minded customers and make recommendations.
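
The following is a bare-bones sketch of the general idea--not Firefly's or The Movie Critic's actual algorithm. Each customer is treated as a vector of ratings; similarity is measured over the movies two customers have both rated; and titles that similar customers liked, but the target customer has not yet seen, are recommended. The ratings are invented.

    from math import sqrt

    RATINGS = {  # customer -> {movie: rating on a 1-5 scale}; all invented
        "alice": {"Casablanca": 5, "Vertigo": 4, "Star Wars": 2},
        "bob": {"Casablanca": 5, "Vertigo": 5, "The Third Man": 4},
        "carol": {"Star Wars": 5, "Aliens": 4, "Vertigo": 1},
    }


    def similarity(a, b):
        """Cosine similarity of two customers over the movies both have rated."""
        shared = set(a) & set(b)
        if not shared:
            return 0.0
        dot = sum(a[m] * b[m] for m in shared)
        norm_a = sqrt(sum(a[m] ** 2 for m in shared))
        norm_b = sqrt(sum(b[m] ** 2 for m in shared))
        return dot / (norm_a * norm_b)


    def recommend(target, ratings, top_n=3):
        """Score unseen movies by the similarity-weighted ratings of other customers."""
        scores, weights = {}, {}
        for other, their_ratings in ratings.items():
            if other == target:
                continue
            sim = similarity(ratings[target], their_ratings)
            if sim <= 0:
                continue
            for movie, rating in their_ratings.items():
                if movie not in ratings[target]:
                    scores[movie] = scores.get(movie, 0.0) + sim * rating
                    weights[movie] = weights.get(movie, 0.0) + sim
        ranked = sorted(((scores[m] / weights[m], m) for m in scores), reverse=True)
        return ranked[:top_n]


    if __name__ == "__main__":
        for score, movie in recommend("alice", RATINGS):
            print("%-14s predicted rating %.1f" % (movie, score))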

While other areas are not as simple to manage as movie preferences, which are almost entirely taste based, collaborative filtering can be combined with objective reference criteria to produce recommendations for other areas such as books, music, or even mutual funds.

In addition to making recommendations or linking customers with like-minded people, collaborative filtering can be used to tackle such thorny problems as community standards. Rather than simply asking a consumer whether he or she enjoyed a movie, collaborative filtering can be used to collect more detailed information, such as whether the level of violence or sexuality was offensive. Because standards of offensiveness vary from person to person, the real trick is for the customer to get recommendations from people with similar views.

More important in the present context, collaborative filtering provides an incredibly rich source of customer data to mine--data that beggars any previous source in its ability to fine-tune marketing and customer satisfaction efforts. Small wonder that any company concerned with electronic commerce is beating down the doors of collaborative filtering vendors.

Of course, from the user point of view, the collection of marketing data is perhaps an undesirable feature of Internet technology. Users thrive on the ability of such sites to adapt themselves and their data to user preferences, to make the computer "smarter," so to speak, but they have concerns about how else data about them might be used.

Organizations like the Electronic Frontier Foundation have examined the legal and social ramifications of new technologies. The EFF's E-Trust project, for example, is working to establish standards for how customer data will be used and shared.

What about the Intranet?

Firewalls, proxies, and other mechanisms designed to hide systems or networks from one another make it harder to interrogate the network completely. Nonetheless, the same techniques that can be used to query the external network can be fruitfully applied to the Intranet. In addition, there are some fascinating possibilities for network examination within a single corporate site.

For example, consider the traditional organizational chart. Now analyze your e-mail logs to find out the real organization of your company--who talks to whom and how often.
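
A sketch of that analysis, assuming (sender, recipient) pairs have already been extracted from the mail logs and using invented addresses: count the messages flowing over each link, and the heaviest links trace the informal organization.

    from collections import Counter

    MESSAGES = [  # (sender, recipient) pairs, as might be extracted from mail logs
        ("ann@example.com", "bob@example.com"),
        ("bob@example.com", "ann@example.com"),
        ("ann@example.com", "carlos@example.com"),
        ("bob@example.com", "ann@example.com"),
    ]


    def strongest_links(messages, top_n=10):
        """Count messages per unordered pair of correspondents."""
        links = Counter(frozenset(pair) for pair in messages if pair[0] != pair[1])
        return links.most_common(top_n)


    if __name__ == "__main__":
        for pair, count in strongest_links(MESSAGES):
            print(" <-> ".join(sorted(pair)), ":", count, "messages")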

So much attention has been focused on the possibilities of commerce via external Web sites that many companies have missed the enormous opportunity for process improvement within the company or between a company and its customers. The first step to improving a process is understanding it; as with debugging a computer program, the art is finding out the difference between what you're really doing and what you think you're doing.

Network performance

Much has been made in the press of the possibility of "Internet collapse"--overload of the network, leading to widespread outages. Yet this is not something we need to speculate about. Will the Internet collapse? View the data, watch the trends, and decide for yourself.

For example, Matrix Information and Directory Services (MIDS) publishes a regular "Internet weather report" detailing overall network performance. While the information on the MIDS site is unnecessarily impenetrable to the casual observer, it does demonstrate how much data is available. Look for network performance measurement to become a regular feature of the newspaper, along with the performance of the Dow Jones Industrials, interest rates, and inflation.

A given site can easily do some simple performance analysis on its own, tracking average round-trip times to various well-known sites (or customer sites) and developing personalized metrics for performance in its local corner of the Net.
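
A homemade version of such a metric might simply time how long it takes to open a TCP connection to a handful of reference sites--used here as a stand-in for ping, which usually requires special privileges--and average the results. The host list below is arbitrary.

    import socket
    import time

    HOSTS = ["www.example.com", "www.example.org"]  # replace with sites that matter to you
    SAMPLES = 3


    def connect_time(host, port=80, timeout=5.0):
        """Seconds taken to establish a TCP connection, or None on failure."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return None


    if __name__ == "__main__":
        for host in HOSTS:
            times = [t for t in (connect_time(host) for _ in range(SAMPLES)) if t is not None]
            if times:
                average_ms = 1000 * sum(times) / len(times)
                print("%-24s average %6.1f ms over %d samples" % (host, average_ms, len(times)))
            else:
                print("%-24s unreachable" % host)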

In conclusion

I have only begun to hint at the possibilities before us. The trend towards interconnectedness continues unabated. It is easy to foresee a time when not only every computer network but also electronic devices that are not now connected will be accessible to one another.

When not only computers but also home electronics, telephones, and devices not yet invented depend, for their very usefulness, on being connected to each other, the network will inevitably contain more information about itself, about us, and about how we use it than we have yet begun to fathom.

This world we are creating truly justifies Huxley's overused phrase, Brave New World. The potential for abuse is enormous, but so is the potential for creating networks and devices that are increasingly responsive to our needs. The network of the future will increasingly tell us things about itself so that we can make better use of it.

References

  1. Netcraft server survey, http://www.netcraft.co.uk/survey.
  2. Perl Language home page, http://www.perl.com.
  3. Interse Web Trends Page, http://www.interse.com/webtrends.
  4. Novak, Thomas, and Donna Hoffman, "New Metrics for New Media: Toward the Development of Web Measurement Standards," http://www2000.ogsm.vanderbilt.edu/novak/web.standards/webstand.html.
  5. Electronic Frontier Foundation, http://www.eff.org.
  6. E-Trust, http://www.etrust.org.
  7. Matrix Information and Directory Services, Internet Weather Report, http://www.mids.org/weather/.