Evaluating Web Filters: A Practical Approach

Eyas S. Al-HAJERY <alhajery@kacst.edu.sa>
Badr Al BADR <badr@kacst.edu.sa>
King Abdulaziz City for Science and Technology
Saudi Arabia

Abstract

The Internet is becoming a significant source of information of all types for people of all ages. This has made Internet censorship a major and controversial issue. While many people believe that the use of content filtering products is against free speech, others, especially parents and librarians, are concerned about the negative effects of Internet pornography on minors. Many libraries and schools are mandating filtered access to the Internet. In this paper, we evaluate the performance of six major Internet content filtering products: SmartFilter, WebSense, CyberPatrol, SurfWatch, N2H2, and I-Gear. The performance evaluation is based on a log of several tens of thousands of uniform resource locators (URLs) collected from an Internet service provider (ISP) that provides unfiltered Internet access to several thousand customers. Several performance measures are investigated to compare the products, including the blocking rate of pornographic material and the false alarm rate (the blocking rate of nonpornographic material). Furthermore, the evaluation method we propose has the added advantage of being practical: it avoids manually prelabeling most of the test set.

Introduction

One of the biggest complaints people have about the Internet concerns the proliferation of pornography. To guard minors and conservative communities from pornography, many products have appeared on the market with the goal of filtering Internet access and hence restricting access to pornographic sites. Another important application for Internet filtering is resource management, where an organization wishes to ensure that its Internet connection is used for legitimate business activities during office hours, so nonbusiness sites are blocked.

The problem can be defined as follows: Assuming that an organization has its own definition of what is suitable and what is not, the organization must find the filter that best satisfies its needs by allowing access to the maximum number of sites it deems suitable while at the same time blocking the maximum number of sites it deems unsuitable.

This work attempts to measure the effectiveness of several commercial filters in blocking access to pornography. Filter effectiveness is measured as minimizing the blocking of suitable sites and maximizing the blocking of unsuitable sites. It is important to note that this work focuses on filtering Web-based traffic (which is in fact the vast majority of Internet traffic).

It is believed that the results of this work would benefit organizations (e.g., schools, public libraries) wanting to deploy pornography filtering software. The contribution of this work is not only in assessing filter effectiveness but also in outlining a practical procedure by which to test filtering software. The procedure can be applied to new filters, for different filtering objectives, or with different suitability standards in mind.

Usually, to assess a detection task, the decisions of the detector are compared to the true identities of the objects to be detected. In our case, the objects to be detected are URLs of pornographic Web objects, which means that a set of Web pages is needed in which each page is prelabeled as pornographic or not. A major contribution of this work is evaluating the performance of the filters without manually labeling every URL in the test set: voting among the filters, with unanimous decisions taken as absolute labels, left only the disputed URLs for manual review. The savings in the labeling step were on the order of 67%, as we had to manually label only one third of the URLs.

The work relies on the principles of statistical decision theory in evaluating the filters. The major components of this work are the following:

  1. Modeling user requests by finding a set of URLs that mimics the requests of the target user community. This constitutes the test set on which each filter is tested.
  2. Automatically running all URLs through each filter and checking whether or not the filter blocks the URL.
  3. Labeling each URL in the test set as pornographic or not, based on the unanimous decisions of all filters, and when the filters disagree then by manual checking.
  4. Statistically analyzing the results for each filter after counting the number of misdetections (pornographic sites that were not blocked) and the number of false alarms (nonpornographic sites that were blocked), and assessing each filter's overall performance.

Overview of filtering methods

Filtering software blocks content in two primary ways: blocking by URL and blocking by the content of retrieved pages.

  1. URL-based blocking: Using this method, the filtering software employs a "black list" of unwanted URLs. The list is normally classified into different categories (e.g., sex, drugs, cults, gambling), and users are given the ability to choose the categories they want to block. Most address-based blocking software also provides the capability to augment the black list with additional URLs the user wishes to block, and users can likewise exempt URLs from the black list. The list should be updated periodically to include new URLs and remove inactive ones. (A minimal sketch of both blocking methods follows this list.)

    As an alternative approach to the black list, some filtering software uses a "white list." The user is permitted to access only URLs that are included in the white list. This is intended mainly for school students or closed communities.

  2. Content-based blocking: The filtering software analyzes the content of retrieved pages to check for unwanted patterns. The simplest method of this type is word blocking: the filter blocks retrieved content if it encounters a word that matches its list of banned words. More sophisticated software employs artificial intelligence algorithms to analyze the retrieved content.
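
As a minimal illustration of the two approaches, the following Python sketch combines a black-list lookup with naive word-based blocking. The category structure, list contents, and matching rules are simplified assumptions for illustration, not a reproduction of any vendor's implementation.

    from urllib.parse import urlparse

    # Assumed black-list structure: category -> set of blocked hosts.
    BLACK_LIST = {"sex": {"example-porn.com"}, "gambling": {"example-casino.com"}}
    WHITE_LIST = set()        # hosts the user has exempted from blocking
    BANNED_WORDS = {"xxx"}    # naive word list for content-based blocking

    def blocked_by_url(url, categories=("sex",)):
        """URL-based blocking: match the host against the chosen categories."""
        host = urlparse(url).hostname or ""
        if host in WHITE_LIST:
            return False      # user exemptions override the black list
        return any(host in BLACK_LIST.get(c, ()) for c in categories)

    def blocked_by_content(page_text):
        """Content-based blocking: block if any banned word appears."""
        return any(word in BANNED_WORDS for word in page_text.lower().split())

The word-based function illustrates exactly the weakness discussed next: adding "breast" to BANNED_WORDS would also block pages about breast cancer.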

Content-based blocking is widely criticized for its ineffectiveness. A block on the word "breast," for example, might block pages about breast cancer. Moreover, sites in languages other than those the filter understands are hard to detect, and newer image-based content filters are only beginning to emerge and have yet to gain widespread acceptance. Address-based blocking is preferred since it is less prone to such errors; however, it is more expensive because of the overhead incurred in frequently updating the black list.

Filtering software can also be classified by its location within the network into two classes: client based and server based. In client-based filtering software, the filter resides on the client machine and interacts with the browsers installed there to filter pages as the user surfs the Internet. Because it is installed on the client machine, it is essentially voluntary: the user can choose to uninstall it.

In server-based filters, on the other hand, the filter is installed on a server within a network. It is managed by the network administrator; therefore, filtering can be forced upon all network clients. These filters are widely used in corporations and large organizations. The filter can be a plug-in to a known Web proxy (e.g., Netscape, Microsoft, or Apache) or a stand-alone proxy.

Products and settings

In this paper, the performance of six filtering products was evaluated. All selected filters use the black list technique. Furthermore, all filters except N2H2 are server based. At the time of the experiment, N2H2 Inc. did not distribute its software but provided filtering solutions for ISPs.

Table 1 shows the filtering products, vendor names, and product versions.

Table 1. Filtering products evaluated

Filter Name | Vendor Name | Version
SmartFilter | Secure Computing Corp. | SmartFilter for Netscape Proxy
SurfWatch | SurfWatch Software Inc. | Professional Edition
WebSense | WebSense Inc. | 3.01
I-Gear | Symantec Corp. | I-Gear for Solaris
CyberPatrol | The Learning Company | 2.10
N2H2 | N2H2 Inc. | N2H2 for ISPs

Experiments

Test data set

The first step in testing the filters was to construct a sufficiently large and representative test set of Web pages, or URLs, that adequately mimics the target user population. The target user population in this case is assumed to be casual home users accessing the Internet through dial-up connections to a public ISP. To that end, the test set was chosen to be a large set of 54,681 page requests (URLs) from actual users, recorded in the proxy log of an ISP over a 24-hour period in the summer of 1998.

At that particular ISP, Internet access was provided through a proxy that cached frequently requested pages; the proxy did not, however, block access to any sites. As a side effect of using the proxy, a log was automatically produced that specified, for each user request, the destination URL (the address of the requested page) along with other detailed information. The URLs were collected from the proxy log and used after removing all source IP addresses (the IP of the requester of each URL) and other unnecessary data. (It should be noted that a URL addresses a single Web object, not a whole page, so a page with multiple objects such as images contributes multiple URLs to the test set.)
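
The extraction step can be sketched as follows. The sketch assumes a common-log-format layout in which the requested URL is the seventh whitespace-separated field; the exact layout of the Netscape proxy log may differ.

    def extract_urls(log_path):
        """Extract destination URLs from a proxy access log.

        Assumes lines of the form:
        client - - [date] "GET http://host/path HTTP/1.0" 200 1234
        so the URL is the seventh whitespace-separated field. All other
        fields, including the source IP address, are discarded."""
        urls = []
        with open(log_path, encoding="latin-1") as log:
            for line in log:
                fields = line.split()
                if len(fields) > 6:
                    urls.append(fields[6])
        return urls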

Two data sets were prepared: (1) the original set, with all URLs, and (2) the distinct set, obtained by removing all duplicates (URLs requested more than once during the collection period) and keeping each URL only once. The size of the distinct set was 40,100, meaning that more than 25% of the log was duplicated. Note that the distinct set is a proper subset of the original set, so no extra effort is needed to perform the experiment; only the analysis stage is affected. Error statistics on the original set are more indicative of the user population, because a misdetection or false alarm on a URL requested multiple times is reflected that many times in the final error rate. In contrast, the distinct set lists each URL only once, so each error is counted only once.
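
Deriving the distinct set from the original set is then a simple deduplication, as in this continuation of the extraction sketch above (the log file name is hypothetical):

    urls = extract_urls("proxy_access.log")      # hypothetical file name
    distinct = list(dict.fromkeys(urls))         # keeps first-occurrence order
    dup_rate = 1 - len(distinct) / len(urls)
    print(f"{len(urls)} requests, {len(distinct)} distinct "
          f"({dup_rate:.0%} of the log duplicated)")
    # With the paper's figures: 54,681 requests, 40,100 distinct (about 27%).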

The original set

First we will describe the experiments performed on the original set and analyze the results. In the next sections, we will address the distinct set.

Recording filter decisions

In this step all URLs in the data set were run through each of the filters to determine each filter's particular decision about each URL.

The test machine was a Sun Ultra 10 running Solaris 2.6, with Netscape Proxy Server version 3.5 installed. The experiment was performed between December 1998 and January 1999.

We can describe the steps of the experiment as follows:

  1. Trial versions of I-Gear, WebSense, SmartFilter, SurfWatch, and CyberPatrol were installed on the test machine. All except I-Gear are plug-ins to the Netscape proxy server. Since N2H2 is a service rather than distributed software, we were not able to obtain a copy; instead, a dedicated server was set up for our experiment at N2H2 Inc.
  2. All filters were configured to block only sex-related categories.
  3. A script was written to run the whole data set through each filtering product to determine the set of blocked URLs within the test set and the remaining URLs (i.e., the set of retrieved URLs); a sketch of such a script follows this list.
  4. For each URL, an indication of whether it was blocked or retrieved by each particular filter was entered into a database.
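
The following is a minimal sketch of such a script. It requests each URL through the filtering proxy and records the decision in a small database. The proxy address, the database schema, and the block-detection rule (any non-200 response or connection failure counts as blocked) are simplifying assumptions; real filters often return a branded block page with status 200, so in practice the response body must be inspected as well.

    import sqlite3
    import requests

    def run_filter(filter_name, urls, proxy="http://filter-proxy:8080",
                   db_path="decisions.db"):
        """Send every URL through the filtering proxy and record whether
        it was blocked or retrieved. Returns {url: True if blocked}."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS decisions "
                    "(url TEXT, filter TEXT, blocked INTEGER)")
        decisions = {}
        for url in urls:
            try:
                resp = requests.get(url, proxies={"http": proxy}, timeout=10)
                blocked = resp.status_code != 200
            except requests.RequestException:
                blocked = True  # assumption: treat failures as blocked
            decisions[url] = blocked
            con.execute("INSERT INTO decisions VALUES (?, ?, ?)",
                        (url, filter_name, int(blocked)))
        con.commit()
        con.close()
        return decisions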

The result of running all the filters on the data set is summarized in Table 2, which shows the number and percentage of the 54,681 tested URLs that were blocked by each filtering product. As can be seen from the table, the filters agree to a certain extent on the number of URLs that are blocked from among the total test set.

Table 2. Number of URLs blocked by each filter

Filtering Product | Number of URLs Blocked | % of Total
SmartFilter | 22,642 | 41%
SurfWatch | 24,917 | 46%
WebSense | 22,901 | 42%
I-Gear | 18,171 | 33%
CyberPatrol | 23,578 | 43%
N2H2 | 23,161 | 42%

As can be seen from the table, the number of blocked URLs falls within a small interval for all filtering products except I-Gear.

Labeling the data set

The next step was to use the filter decisions as a basis for labeling each URL as pornographic or not. The method used here was conceptually simple but saved a lot of effort practically. The method was to trust the unanimous decisions reached by the filters. So, all URLs with unanimous "block" decisions were considered to be pornographic, while all URLs with unanimous "retrieve" decisions were considered to be nonpornographic.

The remaining URLs, on which the filters disagreed, were manually checked and labeled as pornographic or not based on the usual U.S. cultural standards for pornography.
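
The labeling rule reduces to a few lines of code. In the sketch below, decisions holds each filter's recorded block/retrieve decisions, and manual_check stands in for the human review applied to the URLs on which the filters disagreed.

    def label_urls(urls, decisions, manual_check):
        """Label URLs from per-filter decisions.

        decisions maps each filter name to a dict {url: True if blocked};
        manual_check(url) is a placeholder for the human review."""
        labels = {}
        for url in urls:
            votes = [blocked[url] for blocked in decisions.values()]
            if all(votes):
                labels[url] = "pornographic"       # unanimously blocked
            elif not any(votes):
                labels[url] = "nonpornographic"    # unanimously retrieved
            else:
                labels[url] = manual_check(url)    # disagreement: manual label
        return labels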

Table 3 summarizes the labels. The first two rows show the number of URLs unanimously blocked and retrieved, respectively; the third row shows the URLs with different decisions, which were manually labeled. The final two rows show the summary of labels of all URLs in the database, where the pornographic row includes sites unanimously agreed to by the filters plus the URLs deemed pornographic from the manual check from among the URLs with different filter decisions.

Table 3. Summary of URL labels for the original set

Label | Number of URLs | % of Total
Unanimously blocked | 12,072 | 22%
Unanimously retrieved | 24,027 | 44%
Different decisions | 18,582 | 34%
Pornographic | 23,955 | 44%
Nonpornographic | 30,726 | 56%

From the table we see that two thirds of the URLs were labeled unanimously by the filters, while only one third had to be labeled manually; the filters agreed on most of the URLs.

Error analysis

The last step in this evaluation is to analyze the error rates of each filter. In any detection task, two types of error are possible. In our case they are (1) the filter not blocking a pornographic URL, called a misdetection error, and (2) the filter blocking a nonpornographic URL, called a false alarm error. The misdetection rate for each filter is the conditional probability that a URL is not blocked by the filter given that it is labeled pornographic. The false alarm rate is the conditional probability that a URL is blocked by the filter given that it is labeled nonpornographic.

P(misdetection) = P(URL is pornographic and not blocked by the filter) / P(URL is pornographic)

P(false alarm) = P(URL is nonpornographic and blocked by the filter) / P(URL is nonpornographic)

P(URL is pornographic) is estimated as the number of pornographic URLs divided by the total number of URLs in the set. P(URL is pornographic and not blocked by the filter) is estimated as the number of pornographic URLs not blocked by the filter divided by the same total. The other probabilities are estimated similarly.

In calculating the overall error rate of each filter, we give equal weight to the two types of errors; that is, the error rate is the simple average of the misdetection and false alarm rates. Table 4 shows the errors of each filter.
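
With the labels and per-filter decisions in hand, the rates reduce to simple ratios, as in the sketch below (variable and function names are ours):

    def error_rates(urls, blocked, labels):
        """Compute misdetection, false alarm, and equal-weight error rates.

        blocked: dict {url: True if this filter blocked it};
        labels:  dict {url: "pornographic" or "nonpornographic"}."""
        porn = [u for u in urls if labels[u] == "pornographic"]
        nonporn = [u for u in urls if labels[u] == "nonpornographic"]
        misdetection = sum(not blocked[u] for u in porn) / len(porn)
        false_alarm = sum(blocked[u] for u in nonporn) / len(nonporn)
        return misdetection, false_alarm, (misdetection + false_alarm) / 2

For example, with the original set's 23,955 pornographic URLs, SmartFilter's 15% misdetection rate corresponds to roughly 3,600 pornographic URLs that were not blocked.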

Table 4. Error analysis of filters

Filtering Product | Misdetection | False Alarm | Error Rate
SmartFilter | 15% | 7% | 11%
SurfWatch | 12% | 7% | 10%
WebSense | 17% | 9% | 13%
I-Gear | 36% | 10% | 23%
CyberPatrol | 16% | 7% | 11%
N2H2 | 14% | 7% | 11%

As can be seen from the table, all products have error rates that are close to one another, except one. The values range from 10% to 13% for the top five filters. In this experiment, SurfWatch turned out to be the filter with the lowest error rate, but it was closely trailed by SmartFilter, CyberPatrol, and N2H2. Figure 1 shows the results graphically.


Figure 1. Error rates for the original set

The distinct set

Similar processing and analysis was done on the distinct set. It is important to note that the most resource-consuming tasks in the experiment (running the filters on the URLs and manually labeling URLs) did not need to be repeated for this data set.

Recording filter decisions

The filter decisions were taken from the runs on the original data set. The results for the distinct set are summarized in Table 5, which shows the number and percentage of the 40,100 distinct URLs blocked by each filtering product. As can be seen from the table, the filters agree to a certain extent on the number of URLs that are blocked from among the total test set.

Table 5. Number of URLs blocked by each filter for the distinct data set

Filtering Product | Number of URLs Blocked | % of Total
SmartFilter | 17,629 | 44%
SurfWatch | 18,836 | 47%
WebSense | 18,441 | 46%
I-Gear | 13,483 | 34%
CyberPatrol | 17,849 | 45%
N2H2 | 18,354 | 46%

As can be seen from the table, the number of blocked URLs falls within a small interval for all filtering products except I-Gear.

Labeling the data set

Table 6 summarizes the labels. The first two rows show the number of URLs unanimously blocked and retrieved, respectively; the third row shows the URLs with different decisions, which were manually labeled. The final two rows summarize the labels of all URLs in the database.

Table 6. Summary of URL labels for the distinct data set

Label | Number of URLs | % of Total
Unanimously blocked | 9,757 | 24%
Unanimously retrieved | 17,611 | 44%
Different decisions | 12,732 | 32%
Pornographic | 18,451 | 46%
Nonpornographic | 21,649 | 54%

The ratios here are similar to those of the original data set in Table 3.

Error analysis

Table 7 shows the errors of each filter for the distinct data set.

Table 7. Error analysis of filters for the distinct data set

Filtering Product | Misdetection | False Alarm | Error Rate
SmartFilter | 13% | 4% | 8%
SurfWatch | 12% | 11% | 11%
WebSense | 12% | 7% | 10%
I-Gear | 36% | 7% | 21%
CyberPatrol | 15% | 9% | 12%
N2H2 | 11% | 1% | 6%

As can be seen from the table, the false alarm rates are in general lower than those for the original set. Here N2H2 takes the lead with the fewest false alarm errors on distinct URLs. Figure 2 shows the results graphically.


Figure 2. Error rates for the distinct set

Conclusions

We presented a methodology to study and compare the performance of Web-based filtering products that use the black list approach. This methodology has the advantage of eliminating about two thirds of the time-consuming work required to prelabel all the URLs. It is based on taking the unanimous decisions of the filters as absolute labels, meaning that any URL blocked by all filters is considered pornographic. The risk in this case is that all six filters agree in error, which we assume to be a remote possibility (the obvious case is when a URL now points to different content than it did when the filter producer evaluated it). The labeling effort could be reduced further by not insisting on unanimous decisions but accepting a majority vote instead; this, however, could increase the labeling errors.

Another contribution of the work is in formalizing a performance metric for evaluating filters. The metric is based on estimating the rates of the two types of errors: (1) the filter not blocking a pornographic URL (a misdetection error) and (2) the filter blocking a nonpornographic URL (a false alarm error), each estimated as a conditional probability given the URL's label. In our experiments we gave equal weight to both types of errors, but it is easy to give more weight to the misdetection rate, for instance, thereby favoring filters that err on the side of blocking nonpornographic sites.
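
As a sketch of that generalization, a single weight parameter suffices; w = 0.5 reproduces the equal-weight rate used in our experiments:

    def weighted_error(misdetection, false_alarm, w=0.5):
        """Overall error with weight w on misdetection (0 <= w <= 1).
        Choosing w > 0.5 penalizes missed pornographic URLs more heavily,
        favoring filters that err on the side of blocking."""
        return w * misdetection + (1 - w) * false_alarm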

As to the results of the experiment, we can see that most of the well-known filtering products agree on blocking most of the sites. Two thirds of the URLs in the test set received unanimous decisions, either to block or to retrieve. Furthermore, the ratio of blocked URLs to the whole set was close for all filters (between 41% and 46%, except for one filter). For the error rates, we can see that in general the misdetection rates are higher than the false alarm rates. This is to be expected, given the sheer number of sites on the Internet and the negative media consequences of erroneously blocking a nonpornographic site. The error rates, too, fall within a small interval (10% to 13%), except for one filter. When comparing the error rates of the original set with those of the distinct set, one finds that the distinct set is associated with a smaller error rate for some of the filters. This can be explained by noting that the URLs a filter misclassifies may be requested multiple times, so each repetition counts as an additional error in the original set.

In summary, the major filtering products are more or less close in their error rates, hovering around 10%. For parents, libraries, and schools, this means a lot: roughly every tenth URL request will be handled incorrectly. Other methods should therefore be investigated to augment the filters, such as image-based filters or better content-based filters.

Acknowledgments

The authors would like to thank their colleagues who assisted them in performing the experiment, particularly Rayed Al-Fayez, Muhammad Al-Korbi, and Waleed Al-Oriny.
