Wayne B. Salamonsen <firstname.lastname@example.org>
Roland Yeo <email@example.com>
National University of Singapore
This paper proposes a new method for performing content selection on the Internet using a proxy server and the PICS infrastructure. It begins by discussing and evaluating the methods currently used to perform content selection on the Internet, including the application of access rules on proxy-routed requests and packet filtering of restricted IP addresses on routers and client PCs. It then proposes and evaluates a new method in which a proxy server uses PICS labels, recently developed by the World Wide Web Consortium (W3C) for classifying Internet information, to perform content selection on a large scale, and compares this method with the existing ones. Finally, it presents some of the details involved in implementing this system.
Keywords: content selection, PICS, Platform for Internet Content Selection, proxy, filtering, access control rules, packet filtering, restriction, Internet, classification, label, rating, service, cache.
The Internet has undergone explosive growth over the last few years as a result of the amount of information made available on it. It has become the largest worldwide source of information available today, containing information on virtually everything and anything imaginable. Yet having such a volume of information available with such ease of access raises the problem of suitability. No one wants to ban their child or student from accessing information via the Internet. However, one does not want a 10-year-old child accessing pornography or violent imagery. Similarly, many companies wish to grant employees Internet access for work-related purposes but do not want them to make use of it recreationally. Many similar situations exist, making some form of large-scale content management a necessity.
There are two methods currently used to enforce content selection on a large scale: the application of access control rules on proxy routed requests, and packet filtering of restricted IP addresses on routers and client PCs. The application of access control rules involves the routing of all Internet URL requests through a proxy, thus creating a situation in which access control rules can be applied to a request. This is performed by maintaining a list of restricted URLs and corresponding access rules that dictate situations under which access should be denied to the corresponding URL. Packet filtering of restricted IP addresses is similar and involves maintaining a list of URLs to which access is to be denied. Requests to these URLs have their associated IP packets dropped either at the router or client-PC-level, preventing the URL from being displayed to the user.
Recently, a new series of methods has emerged based on the Platform for Internet Content Selection (PICS) infrastructure. PICS has been formulated by the World Wide Web Consortium (W3C) and allows for the classification of URLs through the use of associated PICS labels. Each label associated with a URL classifies that URL according to the ratings specified in the label format (or ratings system). This paper proposes a new method for large-scale content selection using a PICS-aware proxy system. Internet requests can be redirected through a proxy. For each request, the proxy can fetch a corresponding PICS label and compare its ratings against the corresponding restriction criteria specified for the person making the request. If, upon comparison, any of the ratings contained within the label are not suitable, access to the URL in question can be denied.
The rest of this paper explains this newly suggested system in more detail and compares it to existing methods for content selection. Section 2 looks in more detail at the methods currently used to enforce content selection, and Section 3 introduces the proposed PICS proxy system in more detail and compares it to those methods. Section 4 provides details regarding implementation of the proposed system and Section 5 summarizes the paper.
The general mechanisms for large-scale content control are the application of access control rules on proxy-routed requests and packet filtering of restricted IP addresses on routers and client PCs. These methods have been adopted as national-level controls by countries such as Singapore, China, and others with nationally controlled Internet service providers (ISPs). Both methods are similar in that they manually keep track of a list of questionable URLs and act upon the existence of a user-requested URL within this list.
The application of access-control rules on proxy-routed requests involves the redirection of all Internet requests through a compliant proxy server. Users are only granted Internet access via this proxy, ensuring that all relevant Internet requests are subject to the chosen content selection rules. Each URL request directed through this proxy is checked against the corresponding list of questionable URLs. If the requested URL is not present within the list, the request is allowed to continue uninterrupted; however, if it is present in the list, it is subjected to the access-control rules associated with it. These rules might specify total restriction to the page for all users, or may contain a subset of users to whom the restriction is to be applied.
Packet filtering of restricted IP addresses on routers and client PCs is similar in many ways to the application of access-control rules on proxy-routed requests. It too uses a list of restricted URLs against which each user URL request is compared. However, Internet requests are no longer required to go through a proxy. In this case, the checking is performed either at the router level or the client-PC level. In addition, access rules are not applied; if the requested URL appears on the list, access to the URL is disallowed by dropping the relevant packets.
Both methods are relatively simple and quick to implement, with the list of restricted URLs usually implemented as a hash table. All that is required to check for access restrictions on a given URL is a simple table lookup. If no access-control rules are to be applied, then it is a simple matter to restrict access by dropping packets or restricting access via the proxy. If access-control rules are to be applied, this is done before any restriction occurs. However, this is an additional and time-consuming process.
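The filter-list check described above can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical in-memory table (`FILTER_LIST`) that maps each restricted URL either to `None` (restricted for all users) or to the subset of users the restriction applies to; the URLs, user names, and function name are invented for the example.

```python
# Hypothetical filter list: restricted URL -> users the rule applies to.
# None means the restriction applies to every user.
FILTER_LIST = {
    "http://restricted.example/": None,
    "http://games.example/": {"alice", "bob"},
}


def is_denied(url, user):
    """Return True if the filter list denies `user` access to `url`."""
    if url not in FILTER_LIST:          # single hash-table lookup
        return False                    # unlisted URLs pass through
    restricted_users = FILTER_LIST[url]
    # Apply the access-control rule attached to the listed URL.
    return restricted_users is None or user in restricted_users
```

A packet-filtering deployment corresponds to the case where every entry is `None`: presence on the list is enough to drop the request.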
Basing a system on a filter list provides simplicity and speed, but leads to a very inflexible method. If at some point one changes the criteria for access restriction, the entire list needs to be reevaluated to determine whether each URL on the list should still be restricted under the new criteria. As the list gets longer, this reevaluation becomes an increasingly inefficient and time-consuming operation. Another shortcoming of this method is that restrictions are based on the existence of one list on the local machine. In many cases it is impossible to track down and manually list all sites to which access should be restricted, since there are simply too many URLs in existence. Usually only the common, well-known sites to be restricted are placed on the list. One solution to this problem would be the sharing of lists constructed by independent individuals or commercial parties, such as SurfWatch Software, Inc. However, this is frequently impractical because these lists may have been constructed on the basis of different restriction criteria. Also, there is no clear idea as to how many such lists should be examined before deciding that the URL in question is in fact an unrestricted URL to which access should be granted. Given these limitations, filter-list-based methods such as the application of access-control rules on proxy-routed requests and packet filtering of restricted IP addresses are not very effective. They will never provide more than an inflexible means of limiting access to a few well-known sites.
W3C's recently developed PICS allows for content selection to be performed on the Internet. It works by associating labels with URLs, with each label containing a set of ratings used to rate the corresponding URL according to a scheme in a corresponding rating system [2, 3]. Labels can be generated by the author of a URL or by a third party and can reside in the header of the URL, within a label database, or within a bureau accessible via the Internet. This means that multiple labels can exist to describe a particular URL according to the same or different rating systems. In this way a user can make use of labels from multiple trusted sources to decide whether or not to restrict access to a particular page. Initially, PICS was designed for the purpose of allowing parents to restrict the information available to their children. A number of products such as Microsoft Internet Explorer V3.0 and Cyber Patrol exist that make use of PICS to restrict individual access to URLs at a client-PC level. However, building PICS compliance into a proxy server will allow content selection based on the PICS infrastructure to be performed on a wider scale than that of the individual, for example, at a workplace, intranet, or national level.
The proposed PICS-aware proxy server requires all Internet URL requests upon which content selection is to be imposed to be redirected through the PICS-aware proxy. For each URL request, a corresponding PICS label can be obtained providing a rating of that URL according to a selected rating system. This label can be retrieved from a number of places and in a number of ways. For example, it may be taken directly from the header of the URL. However, a label contained within the header of a URL has been created by the author of the URL, who may not be considered a reliable rating source. A label will usually be taken from a local label database containing labels created according to a locally endorsed rating system or from a trusted third-party label bureau residing anywhere on the Internet. Once a trusted label is obtained, the ratings contained within that label can be compared against the locally stored restriction criteria. These criteria specify the level a rating must reach for access to a URL to be restricted. If any one (or a specified combination) of these criteria is met, access to the page is denied and the user receives an appropriate message; if none of the restriction criteria is met, access is granted to the URL and the user is presented with the appropriate document.
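The comparison of label ratings against restriction criteria can be illustrated as follows. This is a sketch under stated assumptions: ratings and criteria are represented as plain dictionaries keyed by rating category, a criterion is met when the rating is at or above its threshold, and the category names (`n`, `s`, `v`, `l`) and levels are invented for the example. It implements the "any one criterion" policy; a "specified combination" policy would replace `any` with the administrator's chosen rule.

```python
def is_restricted(ratings, criteria):
    """Return True if any rating reaches its restriction threshold.

    ratings  -- category -> level taken from a trusted PICS label
    criteria -- category -> lowest level at which access is restricted
    """
    return any(
        category in ratings and ratings[category] >= threshold
        for category, threshold in criteria.items()
    )


# Hypothetical ratings from a label and locally stored criteria.
page_ratings = {"n": 0, "s": 0, "v": 3, "l": 1}
site_criteria = {"v": 2, "s": 1}
```

Changing the policy only requires editing `site_criteria`; the labels themselves are untouched.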
Using PICS to implement content selection through a proxy provides an extremely flexible and functional method for large-scale content selection. As PICS ratings are compared against a series of locally chosen and stored access restrictions, it becomes a very simple task to alter the criteria under which a URL is to be restricted. All that is required is that the stored restriction criteria be altered. No review of labels or URLs is required, unlike filter-list-based methods. In addition, labels need not come from one source or follow one particular rating scheme. This provides enormous flexibility in selecting a label for a URL. Multiple-label sources can be selected in a hierarchical order, so that if a label cannot be found at one location, the next location in the list will be checked. The administrator may also decide whether to allow or restrict access to URLs for which no label can be found. Labels written according to any ratings system can be used as long as the appropriate ratings specification is located on the local machine. In this way labels from any source that are deemed acceptable or trustworthy can be used. In addition, if the labels used are made accessible to users of the proxy (through a locally set-up label bureau), users such as parents can implement further restrictions based on the same rating system (or a different one) using a PICS-aware browser.
There are two drawbacks to using PICS labels: performance and a lack of labels. However, our initial indications are that neither of these is really a problem. Speed and performance depend largely on where a label request is sent. If local databases are referenced, the speed is comparable to that of methods that apply access control rules on proxy-routed requests, because a label fetch and retrieval is equivalent to a filter-list lookup and control rule retrieval. Naturally, if a third-party label bureau is accessed across the Internet, the speed will be dominated by that of the label fetch, which depends on the network connection speed. In addition, our proposed implementation would keep a cache of recently accessed labels in memory to speed up subsequent accesses (see Implementation section).
A lack of PICS labels is the other drawback. However, this is not really a problem when compared to the other methods used for large-scale content selection. Once an appropriate label ratings system has been chosen or developed, the creation of a URL label requires very little effort; in fact, no more effort than is required by other methods to add a new entry to the list of restricted URLs. In addition to having locally created labels, there is the ability to use labels created by trusted third parties. Many sites are now labeling their own sites using rating systems provided by the Recreational Software Advisory Council (RSAC) and others. A notable example of this is Playboy magazine. A number of companies based around content selection, such as Net Shepard, are also setting up their own labeling systems and bureaus to provide labels to users and administrators. These bureaus will provide a large number of labels, following established rating systems.
The implementation of the proposed PICS-aware proxy server consists of two parts: a standard proxy server module and a PICS label handling module. The proxy server PICS module will be an interface through which a standard proxy server such as Harvest or CERN can make function calls. On each URL request, the proxy server will simultaneously fetch the required URL document and also make a call to the PICS module. This request will include the URL for which access is being requested. The PICS module will return a Boolean result with "True" corresponding to "restrict the URL" and "False" corresponding to "do not restrict the URL." If the return value is True, the proxy server will not serve the corresponding URL document to the user but instead will provide a document informing the user that the requested URL is restricted and therefore not available. If the return value is False, the proxy will proceed and serve the requested document to the user.
The PICS module is responsible for finding a corresponding PICS label, parsing this label to determine the rating system being used and the ratings contained within, and comparing these ratings against the locally stored restriction criteria. If the required combination of restriction criteria is met, the URL is to be restricted and the module returns True; otherwise the URL is not to be restricted and it returns False.
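The interaction between the two modules might look like the sketch below. It is not taken from Harvest or the CERN server: `handle_request`, `fetch_document`, and `pics_restricts` are hypothetical names, with the last two passed in as callables standing for the proxy's document fetch and the PICS module's Boolean check. A thread pool is used here only as one simple way to run the two operations at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(url, fetch_document, pics_restricts):
    """Fetch the document and consult the PICS module concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        document_future = pool.submit(fetch_document, url)
        verdict_future = pool.submit(pics_restricts, url)
        restricted = verdict_future.result()   # True means "restrict the URL"
        document = document_future.result()
    if restricted:
        # Serve a notice in place of the requested document.
        return "The requested URL {} is restricted and not available.".format(url)
    return document
```

For example, `handle_request(url, fetch, lambda u: False)` returns whatever `fetch(url)` returns, while a PICS check that returns True yields the restriction notice.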
The PICS module must contain a hierarchical list of locations from which to obtain a label. These places can include a local label bureau situated on the local machine, third-party label bureaus on the Internet, or the source of the requested URL. Each of these locations is visited in turn with a request for the corresponding label until a suitable label is found. If no acceptable label is found, the decision comes down to whether unlabelled URLs are to be restricted or not. This is a decision made by the administrator, who sets a corresponding flag within the program. For performance purposes, a localized cache of recently accessed labels will be kept in memory. This cache will take the form of a hash table indexed by URL and represents the first location in the label source hierarchy to be checked. An illustration of this implementation is found in Figure 1.
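The label lookup order described above can be sketched as follows, assuming each label source is modelled as a function that takes a URL and returns a label or `None`. The names `find_label` and `decide`, the `restrict_unlabelled` flag, and the stand-in sources are all hypothetical; a real source would query a label bureau or inspect the document header.

```python
def find_label(url, sources, cache):
    """Return the first label found for `url`, or None if no source has one."""
    if url in cache:                 # the cache is the first location checked
        return cache[url]
    for fetch in sources:            # remaining locations, in priority order
        label = fetch(url)
        if label is not None:
            cache[url] = label       # remember it for subsequent requests
            return label
    return None


def decide(url, sources, cache, judge_label, restrict_unlabelled):
    """Return True to restrict `url`, applying the administrator's flag
    when no acceptable label can be found."""
    label = find_label(url, sources, cache)
    if label is None:
        return restrict_unlabelled
    return judge_label(label)


# Stand-in sources backed by dictionaries (dict.get returns None on a miss).
local_bureau = {"http://a.example/": "local-label"}.get
third_party_bureau = {
    "http://a.example/": "third-party-label",
    "http://b.example/": "third-party-label",
}.get
SOURCES = [local_bureau, third_party_bureau]
```

Here `judge_label` stands for the parse-and-compare step that follows once a label has been obtained.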
Once a label has been obtained, it is parsed to obtain its ratings depending on the rating system it corresponds to. This uses a rather complex label parser that could be written specifically for the system; however, several parsers already exist that can be taken advantage of. These parsers are publicly available and have been written by various third parties, including the World Wide Web Consortium. The result of parsing the label is a list of ratings that rate the corresponding URL. These are then compared to the locally stored restriction criteria. These criteria must have been previously defined by the administrator or some authorized body and define what constitutes a restricted document. They are indexed according to the rating system they correspond to because it is very important that the criteria used match the ratings system used to create the label currently being reviewed.
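To make the parsing step concrete, the sketch below extracts the rating-service URL and the ratings from a single, simple PICS-1.1 label and then selects the criteria indexed under that service. It is deliberately much smaller than a real parser: it assumes one label per string, ignores all label options, and handles only numeric single-value ratings. The service URL, category names, and `CRITERIA_BY_SERVICE` table are invented for the example.

```python
import re


def parse_label(label):
    """Return (service_url, ratings) from a simple PICS-1.1 label string."""
    service = re.search(r'\(PICS-1\.1\s+"([^"]+)"', label)
    ratings = re.search(r'\b(?:r|ratings)\s*\(([^()]*)\)', label)
    if service is None or ratings is None:
        raise ValueError("not a recognizable PICS-1.1 label")
    tokens = ratings.group(1).split()
    if len(tokens) % 2 != 0:
        raise ValueError("ratings must be category/value pairs")
    values = {tokens[i]: float(tokens[i + 1]) for i in range(0, len(tokens), 2)}
    return service.group(1), values


# Hypothetical label and criteria table, indexed by rating service.
LABEL = '(PICS-1.1 "http://ratings.example/v1.html" l r (n 0 s 0 v 3 l 1))'
CRITERIA_BY_SERVICE = {"http://ratings.example/v1.html": {"v": 2, "s": 1}}

service_url, label_ratings = parse_label(LABEL)
criteria = CRITERIA_BY_SERVICE.get(service_url)  # None if the system is unknown
```

Looking the criteria up by service URL keeps ratings from one rating system from being judged against thresholds written for another.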
The result of the comparison between the parsed ratings from the PICS label and the corresponding restriction criteria determines whether or not access to the URL is to be restricted. If the comparison results in the fulfillment of any of the restriction criteria, access to the page is deemed to be restricted. In this case, the PICS module returns a True value to the calling proxy module and the user is denied access to the URL. If none of the restriction criteria is met, the PICS module returns a False value to the calling proxy module and the user is granted access to the URL.
The only form of content selection performed on a large scale until now has been done using simple filter lists. All URL requests are checked against the filter list and access is restricted if they are present on the list or according to a set of corresponding access-control rules. The filter list mechanism upon which these methods are based is a very inflexible and limited mechanism for performing content selection, allowing no easy method for changing restriction conditions and no method for accessing additional sources of restriction information. This paper presents a proposed alternative proxy-based content selection mechanism which makes use of the PICS infrastructure newly developed by the W3C. This system will provide a far more flexible, manageable, and functional mechanism for imposing Internet content selection, with no significant reduction in speed when compared to existing filter list mechanisms.