Evaluating Web Filters: A Practical Approach

Eyas S. Al-HAJERY <alhajery@kacst.edu.sa>
Badr Al BADR <badr@kacst.edu.sa>
King Abdulaziz City for Science and Technology
Saudi Arabia

Abstract

The Internet is becoming a significant source of information of all types for people of all ages. This has made Internet censorship a major and controversial issue. While many people believe that the use of content filtering products is against free speech, others, especially parents and librarians, are concerned about the negative effects of Internet pornography on minors. Many libraries and schools are mandating filtered access to the Internet. In this paper, we evaluate the performance of six major Internet content filtering products: SmartFilter, WebSense, CyberPatrol, SurfWatch, N2H2, and I-Gear. The performance evaluation is based on a log of several tens of thousands of uniform resource locators (URLs) collected from an Internet service provider (ISP) that provides unfiltered Internet access to several thousand customers. Several performance measures are investigated to compare the products, including the blocking rate of pornographic material and the false alarm rate (the blocking rate of nonpornographic material). Furthermore, the evaluation method we propose has the added advantage of being practical: it avoids manually prelabeling most of the test set.

Introduction

One of the biggest complaints people have about the Internet concerns the proliferation of pornography. To guard minors and conservative communities from pornography, many products have appeared on the market with the goal of filtering Internet access and hence restricting access to pornographic sites. Another important application for Internet filtering is resource management, where an organization wishes to ensure that its Internet connection is used for legitimate business activities during office hours, so nonbusiness sites are blocked.

The problem can be defined as follows: Assuming that an organization has its own definition of what is suitable and what is not, the organization must find the filter that best satisfies its needs by allowing access to the maximum number of sites it deems suitable while at the same time blocking the maximum number of sites it deems unsuitable.

This work attempts to measure the effectiveness of several commercial filters in blocking access to pornography. Filter effectiveness is measured as minimizing the blocking of suitable sites and maximizing the blocking of unsuitable sites. It is important to note that this work focuses on filtering Web-based traffic (which is in fact the vast majority of Internet traffic).

It is believed that the results of this work would benefit organizations (e.g., schools, public libraries) wanting to deploy pornography filtering software. The contribution of this work is not only in assessing filter effectiveness but also in outlining a practical procedure by which to test filtering software. The procedure can be applied to new filters, for different filtering objectives, or with different suitability standards in mind.

Usually, to assess a detection task, the decisions of the detector are compared to the true identities of the objects to be detected. In our case, the objects to be detected are URLs of pornographic Web objects, which means that a set of Web pages is needed in which each page is prelabeled as pornographic or not. A major contribution of this work is evaluating the performance of the filters without manually labeling every URL in the test set: voting among the filters, with unanimous decisions taken as absolute labels, left only the disputed URLs for manual review. The savings in the labeling step were on the order of 67%, as we had to manually label only one third of the URLs.

The work relies on the principles of statistical decision theory in evaluating the filters. The major components of this work are the following:

  1. Modeling user requests by finding a set of URLs that mimics the requests of the target user community. This constitutes the test set on which each filter is tested.
  2. Automatically running all URLs through each filter and checking whether or not the filter blocks the URL.
  3. Labeling each URL in the test set as pornographic or not, based on the unanimous decisions of all filters, and when the filters disagree then by manual checking.
  4. Statistically analyzing the results for each filter after counting the number of misdetections (pornographic sites that were not blocked) and the number of false alarms (nonpornographic sites that were blocked), and assessing each filter's overall performance.

Overview of filtering methods

Filtering software blocks content in two primary ways: blocking by URL and blocking by the content of retrieved pages.

  1. URL-based blocking: Using this method, the filtering software employs a "black list" of unwanted URLs. The list is normally classified into different categories (e.g., sex, drugs, cults, gambling), and users are given the ability to choose the categories they want to block. Most address-based blocking software also provides the capability to augment the black list with additional URLs the user wishes to block, and users can likewise exempt URLs from the black list. The list should be updated periodically to include new URLs and remove inactive ones. (A minimal sketch of both blocking methods follows this list.)

    As an alternative approach to the black list, some filtering software uses a "white list." The user is permitted to access only URLs that are included in the white list. This is intended mainly for school students or closed communities.

  2. Content-based blocking: The filtering software analyzes the content of retrieved pages to check for unwanted patterns. The simplest method of this type is word blocking: the filter blocks retrieved content if it encounters a word that matches its list of banned words. More sophisticated software employs artificial intelligence algorithms to analyze the retrieved content.
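
As a minimal illustration of the two approaches, the following Python sketch combines a black-list lookup with naive word-based blocking. The category structure, list contents, and matching rules are simplified assumptions for illustration, not a reproduction of any vendor's implementation.

    from urllib.parse import urlparse

    # Assumed black-list structure: category -> set of blocked hosts.
    BLACK_LIST = {"sex": {"example-porn.com"}, "gambling": {"example-casino.com"}}
    WHITE_LIST = set()        # hosts the user has exempted from blocking
    BANNED_WORDS = {"xxx"}    # naive word list for content-based blocking

    def blocked_by_url(url, categories=("sex",)):
        """URL-based blocking: match the host against the chosen categories."""
        host = urlparse(url).hostname or ""
        if host in WHITE_LIST:
            return False      # user exemptions override the black list
        return any(host in BLACK_LIST.get(c, ()) for c in categories)

    def blocked_by_content(page_text):
        """Content-based blocking: block if any banned word appears."""
        return any(word in BANNED_WORDS for word in page_text.lower().split())

The word-based function illustrates exactly the weakness discussed next: adding "breast" to BANNED_WORDS would also block pages about breast cancer.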

Content-based blocking is widely criticized for its ineffectiveness. A block on the word "breast," for example, might block pages about breast cancer. Moreover, sites in languages other than those the filter understands are hard to detect, and newer image-based content filters are only beginning to emerge and have yet to gain widespread acceptance. Address-based blocking is preferred since it is less prone to such errors; however, it is more expensive because of the overhead incurred in frequently updating the black list.

Filtering software can also be classified by its location within the network into two classes: client based and server based. In client-based filtering software, the filter resides on the client machine and interacts with the browsers installed there to filter pages as the user surfs the Internet. Because it is installed on the client machine, it is essentially voluntary: the user can choose to uninstall it.

In server-based filters, on the other hand, the filter is installed on a server within a network. It is managed by the network administrator; therefore, filtering can be forced upon all network clients. These filters are widely used in corporations and large organizations. The filter can be a plug-in to a known Web proxy (e.g., Netscape, Microsoft, or Apache) or a stand-alone proxy.

Products and settings

In this paper, the performance of six filtering products was evaluated. All selected filters use the black list technique. Furthermore, all filters except N2H2 are server based. At the time of the experiment, N2H2 Inc. did not distribute its software but provided filtering solutions for ISPs.

Table 1 shows the filtering products, vendor names, and product versions.

Table 1. Filtering products evaluated

Filter Name | Vendor Name | Version
SmartFilter | Secure Computing Corp. | SmartFilter for Netscape Proxy
SurfWatch | SurfWatch Software Inc. | Professional Edition
WebSense | WebSense Inc. | 3.01
I-Gear | Symantec Corp. | I-Gear for Solaris
CyberPatrol | The Learning Company | 2.10
N2H2 | N2H2 Inc. | N2H2 for ISPs

Experiments

Test data set

The first step in testing the filters was to construct a sufficiently large and representative test set of Web pages, or URLs, that adequately mimics the target user population. The target user population in this case is assumed to be casual home users accessing the Internet through dial-up connections to a public ISP. To that end, the test set was chosen to be a large set of 54,681 page requests (URLs) from actual users, recorded in the proxy log of an ISP over a 24-hour period in the summer of 1998.

At that particular ISP, Internet access was provided through a proxy that cached frequently requested pages; the proxy did not, however, block access to any sites. As a side effect of using the proxy, a log was automatically produced that specified, for each user request, the destination URL (the address of the requested page) along with other detailed information. The URLs were collected from the proxy log and used after removing all source IP addresses (the IP of the requester of each URL) and other unnecessary data. (It should be noted that a URL addresses a single Web object, not a whole page, so a page with multiple objects such as images contributes multiple URLs to the test set.)
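
The extraction step can be sketched as follows. The sketch assumes a common-log-format layout in which the requested URL is the seventh whitespace-separated field; the exact layout of the Netscape proxy log may differ.

    def extract_urls(log_path):
        """Extract destination URLs from a proxy access log.

        Assumes lines of the form:
        client - - [date] "GET http://host/path HTTP/1.0" 200 1234
        so the URL is the seventh whitespace-separated field. All other
        fields, including the source IP address, are discarded."""
        urls = []
        with open(log_path, encoding="latin-1") as log:
            for line in log:
                fields = line.split()
                if len(fields) > 6:
                    urls.append(fields[6])
        return urls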

Two data sets were prepared: (1) the original set, with all URLs, and (2) the distinct set, obtained by removing all duplicates (URLs requested more than once during the collection period) and keeping each URL only once. The size of the distinct set was 40,100, meaning that more than 25% of the log was duplicated. Note that the distinct set is a proper subset of the original set, so no extra effort is needed to perform the experiment; only the analysis stage is affected. Error statistics on the original set are more indicative of the user population, because a misdetection or false alarm on a URL requested multiple times is reflected that many times in the final error rate. In contrast, the distinct set lists each URL only once, so each error is counted only once.
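
Deriving the distinct set from the original set is then a simple deduplication, as in this continuation of the extraction sketch above (the log file name is hypothetical):

    urls = extract_urls("proxy_access.log")      # hypothetical file name
    distinct = list(dict.fromkeys(urls))         # keeps first-occurrence order
    dup_rate = 1 - len(distinct) / len(urls)
    print(f"{len(urls)} requests, {len(distinct)} distinct "
          f"({dup_rate:.0%} of the log duplicated)")
    # With the paper's figures: 54,681 requests, 40,100 distinct (about 27%).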

The original set

First we will describe the experiments performed on the original set and analyze the results. In the next sections, we will address the distinct set.

Recording filter decisions

In this step all URLs in the data set were run through each of the filters to determine each filter's particular decision about each URL.

The test machine was a Sun Ultra 10 running Solaris 2.6, with Netscape Proxy Server version 3.5 installed. The experiment was performed between December 1998 and January 1999.

We can describe the steps of the experiment as follows:

  1. Trial versions of I-Gear, WebSense, SmartFilter, SurfWatch, and CyberPatrol were installed on the test machine. All except I-Gear are plug-ins to the Netscape proxy server. Since N2H2 is a service rather than distributed software, we were not able to obtain a copy; instead, a dedicated server was set up for our experiment at N2H2 Inc.
  2. All filters were configured to block only sex-related categories.
  3. A script was written to run the whole data set through each filtering product to determine the set of blocked URLs within the test set and the remaining URLs (i.e., the set of retrieved URLs); a sketch of such a script follows this list.
  4. For each URL, an indication of whether it was blocked or retrieved by each particular filter was entered into a database.
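
The following is a minimal sketch of such a script. It requests each URL through the filtering proxy and records the decision in a small database. The proxy address, the database schema, and the block-detection rule (any non-200 response or connection failure counts as blocked) are simplifying assumptions; real filters often return a branded block page with status 200, so in practice the response body must be inspected as well.

    import sqlite3
    import requests

    def run_filter(filter_name, urls, proxy="http://filter-proxy:8080",
                   db_path="decisions.db"):
        """Send every URL through the filtering proxy and record whether
        it was blocked or retrieved. Returns {url: True if blocked}."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS decisions "
                    "(url TEXT, filter TEXT, blocked INTEGER)")
        decisions = {}
        for url in urls:
            try:
                resp = requests.get(url, proxies={"http": proxy}, timeout=10)
                blocked = resp.status_code != 200
            except requests.RequestException:
                blocked = True  # assumption: treat failures as blocked
            decisions[url] = blocked
            con.execute("INSERT INTO decisions VALUES (?, ?, ?)",
                        (url, filter_name, int(blocked)))
        con.commit()
        con.close()
        return decisions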

The result of running all the filters on the data set is summarized in Table 2, which shows the number and percentage of the 54,681 tested URLs that were blocked by each filtering product. As can be seen from the table, the filters agree to a certain extent on the number of URLs that are blocked from among the total test set.

Table 2. Number of URLs blocked by each filter

Filtering Product | Number of URLs Blocked | % of Total
SmartFilter | 22,642 | 41%
SurfWatch | 24,917 | 46%
WebSense | 22,901 | 42%
I-Gear | 18,171 | 33%
CyberPatrol | 23,578 | 43%
N2H2 | 23,161 | 42%

As can be seen from the table, the number of blocked URLs falls within a small interval for all filtering products except I-Gear.

Labeling the data set

The next step was to use the filter decisions as a basis for labeling each URL as pornographic or not. The method used here was conceptually simple but saved a lot of effort practically. The method was to trust the unanimous decisions reached by the filters. So, all URLs with unanimous "block" decisions were considered to be pornographic, while all URLs with unanimous "retrieve" decisions were considered to be nonpornographic.

The remaining URLs, on which the filters disagreed, were manually checked and labeled as pornographic or not based on the usual U.S. cultural standards for pornography.
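
The labeling rule reduces to a few lines of code. In the sketch below, decisions holds each filter's recorded block/retrieve decisions, and manual_check stands in for the human review applied to the URLs on which the filters disagreed.

    def label_urls(urls, decisions, manual_check):
        """Label URLs from per-filter decisions.

        decisions maps each filter name to a dict {url: True if blocked};
        manual_check(url) is a placeholder for the human review."""
        labels = {}
        for url in urls:
            votes = [blocked[url] for blocked in decisions.values()]
            if all(votes):
                labels[url] = "pornographic"       # unanimously blocked
            elif not any(votes):
                labels[url] = "nonpornographic"    # unanimously retrieved
            else:
                labels[url] = manual_check(url)    # disagreement: manual label
        return labels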

Table 3 summarizes the labels. The first two rows show the number of URLs unanimously blocked and retrieved, respectively; the third row shows the URLs with different decisions, which were manually labeled. The final two rows show the summary of labels of all URLs in the database, where the pornographic row includes sites unanimously agreed to by the filters plus the URLs deemed pornographic from the manual check from among the URLs with different filter decisions.

Table 3. Summary of URL labels for the original set

Label | Number of URLs | % of Total
Unanimously blocked | 12,072 | 22%
Unanimously retrieved | 24,027 | 44%
Different decisions | 18,582 | 34%
Pornographic | 23,955 | 44%
Nonpornographic | 30,726 | 56%

From the table we see that two thirds of the URLs were labeled unanimously by the filters, while only one third had to be labeled manually; the filters agreed on most of the URLs.

Error analysis

The last step in this evaluation is to analyze the error rates of each filter. In any detection task, two types of error are possible. In our case they are (1) the filter not blocking a pornographic URL, called a misdetection error, and (2) the filter blocking a nonpornographic URL, called a false alarm error. The misdetection rate for each filter is the conditional probability that a URL is not blocked by the filter given that it is labeled pornographic. The false alarm rate is the conditional probability that a URL is blocked by the filter given that it is labeled nonpornographic.

P(misdetection) = P(URL is pornographic and not blocked by the filter) / P(URL is pornographic)

P(false alarm) = P(URL is nonpornographic and blocked by the filter) / P(URL is nonpornographic)

P(URL is pornographic) is estimated as the number of pornographic URLs divided by the total number of URLs in the set. P(URL is pornographic and not blocked by the filter) is estimated as the number of pornographic URLs not blocked by the filter divided by the same total. The other probabilities are estimated similarly.

In calculating the overall error rate of each filter, we give equal weight to the two types of errors; that is, the error rate is the simple average of the misdetection and false alarm rates. Table 4 shows the errors of each filter.
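
With the labels and per-filter decisions in hand, the rates reduce to simple ratios, as in the sketch below (variable and function names are ours):

    def error_rates(urls, blocked, labels):
        """Compute misdetection, false alarm, and equal-weight error rates.

        blocked: dict {url: True if this filter blocked it};
        labels:  dict {url: "pornographic" or "nonpornographic"}."""
        porn = [u for u in urls if labels[u] == "pornographic"]
        nonporn = [u for u in urls if labels[u] == "nonpornographic"]
        misdetection = sum(not blocked[u] for u in porn) / len(porn)
        false_alarm = sum(blocked[u] for u in nonporn) / len(nonporn)
        return misdetection, false_alarm, (misdetection + false_alarm) / 2

For example, with the original set's 23,955 pornographic URLs, SmartFilter's 15% misdetection rate corresponds to roughly 3,600 pornographic URLs that were not blocked.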

Table 4. Error analysis of filters

Filtering Product | Misdetection | False Alarm | Error Rate
SmartFilter | 15% | 7% | 11%
SurfWatch | 12% | 7% | 10%
WebSense | 17% | 9% | 13%
I-Gear | 36% | 10% | 23%
CyberPatrol | 16% | 7% | 11%
N2H2 | 14% | 7% | 11%

As can be seen from the table, all products have error rates that are close to one another, except one. The values range from 10% to 13% for the top five filters. In this experiment, SurfWatch turned out to be the filter with the lowest error rate, but it was closely trailed by SmartFilter, CyberPatrol, and N2H2. Figure 1 shows the results graphically.


Figure 1. Error rates for the original set

The distinct set

Similar processing and analysis was done on the distinct set. It is important to note that the most resource-consuming tasks in the experiment (running the filters on the URLs and manually labeling URLs) did not need to be repeated for this data set.

Recording filter decisions

The filter decisions were taken from the runs on the original data set. The results for the distinct set are summarized in Table 5, which shows the number and percentage of the 40,100 distinct URLs blocked by each filtering product. As can be seen from the table, the filters agree to a certain extent on the number of URLs that are blocked from among the total test set.

Table 5. Number of URLs blocked by each filter for the distinct data set

Filtering Product | Number of URLs Blocked | % of Total
SmartFilter | 17,629 | 44%
SurfWatch | 18,836 | 47%
WebSense | 18,441 | 46%
I-Gear | 13,483 | 34%
CyberPatrol | 17,849 | 45%
N2H2 | 18,354 | 46%

As can be seen from the table, the number of blocked URLs falls within a small interval for all filtering products except I-Gear.

Labeling the data set

Table 6 summarizes the labels. The first two rows show the number of URLs unanimously blocked and retrieved, respectively; the third row shows the URLs with different decisions, which were manually labeled. The final two rows summarize the labels of all URLs in the database.

Table 6. Summary of URL labels for the distinct data set

Label | Number of URLs | % of Total
Unanimously blocked | 9,757 | 24%
Unanimously retrieved | 17,611 | 44%
Different decisions | 12,732 | 32%
Pornographic | 18,451 | 46%
Nonpornographic | 21,649 | 54%

The ratios here are similar to those of the original data set in Table 3.

Error analysis

Table 7 shows the errors of each filter for the distinct data set.

Table 7. Error analysis of filters for the distinct data set

Filtering Product | Misdetection | False Alarm | Error Rate
SmartFilter | 13% | 4% | 8%
SurfWatch | 12% | 11% | 11%
WebSense | 12% | 7% | 10%
I-Gear | 36% | 7% | 21%
CyberPatrol | 15% | 9% | 12%
N2H2 | 11% | 1% | 6%

As can be seen from the table, the false alarm rates are in general lower than those for the original set. Here N2H2 takes the lead with the fewest false alarm errors on distinct URLs. Figure 2 shows the results graphically.


Figure 2. Error rates for the distinct set

Conclusions

We presented a methodology to study and compare the performance of Web-based filtering products that use the black list approach. This methodology has the advantage of eliminating about two thirds of the time-consuming work required to prelabel all the URLs. It is based on taking the unanimous decisions of the filters as absolute labels, meaning that any URL blocked by all filters is considered pornographic. The risk in this case is that all six filters agree in error, which we assume to be a remote possibility (the obvious case is when a URL now points to different content than it did when the filter producer evaluated it). The labeling effort could be reduced further by not insisting on unanimous decisions but accepting a majority vote instead; this, however, could increase the labeling errors.

Another contribution of the work is in formalizing a performance metric for evaluating filters. The metric is based on estimating the rates of the two types of errors: (1) the filter not blocking a pornographic URL (a misdetection error) and (2) the filter blocking a nonpornographic URL (a false alarm error), each estimated as a conditional probability given the URL's label. In our experiments we gave equal weight to both types of errors, but it is easy to give more weight to the misdetection rate, for instance, thereby favoring filters that err on the side of blocking nonpornographic sites.
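
As a sketch of that generalization, a single weight parameter suffices; w = 0.5 reproduces the equal-weight rate used in our experiments:

    def weighted_error(misdetection, false_alarm, w=0.5):
        """Overall error with weight w on misdetection (0 <= w <= 1).
        Choosing w > 0.5 penalizes missed pornographic URLs more heavily,
        favoring filters that err on the side of blocking."""
        return w * misdetection + (1 - w) * false_alarm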

As to the results of the experiment, we can see that most of the well-known filtering products agree on blocking most of the sites. Two thirds of the URLs in the test set received unanimous decisions, either to block or to retrieve. Furthermore, the ratio of blocked URLs to the whole set was close for all filters (between 41% and 46%, except for one filter). For the error rates, we can see that in general the misdetection rates are higher than the false alarm rates. This is to be expected, given the sheer number of sites on the Internet and the negative media consequences of erroneously blocking a nonpornographic site. The error rates, too, fall within a small interval (10% to 13%), except for one filter. When comparing the error rates of the original set with those of the distinct set, one finds that the distinct set is associated with a smaller error rate for some of the filters. This can be explained by noting that the URLs a filter misclassifies may be requested multiple times, so each repetition counts as an additional error in the original set.

In summary, the major filtering products are more or less close in their error rates, hovering around 10%. For parents, libraries, and schools, this means a lot: roughly every tenth URL request will be handled incorrectly. Other methods should therefore be investigated to augment the filters, such as image-based filters or better content-based filters.

Acknowledgments

The authors would like to thank their colleagues who assisted them in performing the experiment, particularly Rayed Al-Fayez, Muhammad Al-Korbi, and Waleed Al-Oriny.
