Multilevel Automatic Categorization for Web Pages
Jiyun LEE <email@example.com>
Dongwook SHIN <firstname.lastname@example.org>
Due to the exponential growth of the Internet and the production of enormous amounts of Internet documents as a result, manual categorization has passed beyond the limit of human possibility. Traditionally, two classification methods, such as text categorization and clustering, have been employed for automated and effective document processing.
We propose a multilevel categorization method combined with clustering, with text categorization to be used for basic categorization of Internet documents, but clustering to be used for negative examples for the prediction of new categories.
In addition to the strategic combination of categorization and clustering, we also propose to incorporate a weighting scheme for HTML markups which, we believe, will enhance the precision of categorization results.
By combining categorization with the clustering method, the quality of automatic categorization process employed by catalog services such as the Yahoo Web search system will not only be enhanced, but also be economical and efficient.
Although a considerable amount of research has been done on text categorization, much of it focuses on a text set with similar topics and requires a pre-labeled training set for categorization . The number of topics covered by currently available Internet documents are as immense as the number of the documents themselves, and substantial manual labor would be required to make a pre-labeled training set for the documents available to the Internet users.
Consequently, it would be extremely time-consuming, ineffective, and almost impossible to produce immediate categorization response with simple manual method during an Internet document collection process. In addition to the great variety and numbers of categories required to categorize the documents, the categories form a complex hierarchical structure that do not adhere to a flat category structure.
This paper proposes to utilize a multi-level categorization method structure using a combined clustering method to effectively classify the Internet documents and also to pay a special attention to the characteristics of the documents being categorized.
The Internet documents not only include text, but also contain markups, which represent attributes of the words and structure of the documents. In order to detect the most important word in a page, we used markup information associated with the words as well as the term frequency and inverted document frequency. For example, based on an assumption that the words inside the < TITLE > tags or those in bold face, large fonts, and colored fonts are more important than other words, the significance of the words were determined.
If determination of all the possible and correct categories for a Web page collection is not an easy task for a person, and the number of categories required for category service are enormous, then how about millions of Web pages for thousands of categories? The categorization method that we developed combined clustering extracts with some possible categories of Web pages in addition to pre-defined categories. This method can take on a significant role in the construction and management of a Web catalog service.
Text categorization and clustering methods have traditionally been used to classify texts. Text categorization is the automated assignment of natural language texts to predefined categories based on their content. It can be imagined as filing documents into folders that are pre-labeled according to their contents. This type of method, which requires pre-labeled information, is called a supervised method.
On the contrary, the clustering method is an automated grouping of texts, and this method can be pictured as gathering similar documents as groups prior to filing them into file folders. It does not require any type of pre-labeled information, making it an unsupervised method. The supervised method is known to be unsuitable for Internet document categorization because producing a pre-labeled Internet document set with already-known categories requires a strenuous effort.
In order to achieve an automatic categorization process, training or learning algorithms is necessary to allow classifiers to require a set of pre-labeled documents for each category. However, there is a case of the same algorithm in both supervised and unsupervised modes for categorization; therefore, it can be used without being part of a pre-labeled document set. Also, the automatic categorization process is computationally very expensive, especially in the unsupervised mode . When the method described above is used, a real-time update of newly produced Internet documents or category assignments during the document collection of a Web robot cannot be expected.
Clustering also demands such heavy processing and storage requirements that much of the early study on the cluster analysis for information retrieval was limited to small data sets, often only a few hundred items. The hierarchical agglomerative clustering method especially could be displayed as a dendrogram. It is useful to represent the paths that the retrieval process may follow . Although the clustering process requires expensive computation costs, it does not require any pre-definition as categorization does.
Another weakness of the clustering method is that it does not allow overlapping documents. Each document has to be included in only one cluster, which makes multiple category assignment for a document virtually impossible.
We have found through several research projects that the clustering method is not appropriate for rapid classification of large document sets . Internet document classification requires an instantaneous online processing and classification time must be conserved. Consequently, it is necessary to assign appropriate categories to groups of documents while information on the Internet documents is being gathered through online processing.
Although some text categorization methods satisfies the criteria of rapid classification, online processing and multiple-category assignment, it requires an enormous amount of pre-definition and pre-defined categories for its process. For certain cases, such as domains that deal with fixed topics, the text categorization method is appropriate since no additional pre-defined category needs to be created. It is impossible, however, to assign new categories for each of Internet documents pouring into the Web, despite their meaningful topics.
To optimize the categorization process of the Internet documents and to overcome the weaknesses and to enhance the advantages of the two methods, we combined the text categorization and clustering methods. The main processing was performed with the text categorization method to achieve a rapid online sorting of Internet documents into pre-defined categories.
For the categories that need to be newly created, the clustering method of negative examples was used. Negative examples, which are produced as a result of text categorization process, were accumulated, and when the accumulation of the negative examples reached a certain level, they were grouped into clusters using the clustering method. The clustering process was performed with a batch-job, and it did not affect the text categorization process.
The number of documents for the clustering process can be determined because the amount of time consumed for clustering process can be estimated. This is an important issue because, with this information, we can control the load on the hardware system. If the hardware system has the capability to handle a heavy load, a clustering process of a large number of documents would be performed in reasonable time. However, it is also important to remember that many aspects of the clustering processing results vary according to the number of documents that go through the clustering process. And the probability of many new categories being created is increased in proportion to the number of negative examples; therefore large negative example set is preferred.
After clustering, the result clusters were selected according to their size from the produced clusters. The size of the clusters was not an absolute criterion to detect meaningful clusters for prediction of new categories. But we used the cluster size as a criterion, because it was the easiest way and it did not require any cluster analysis.
We assumed that selected clusters contain new topics that are valuable as new categories. The centroids of each selected cluster became a template that played a role of a classifier, and it made new category prediction possible. Finally, categorization was processed using templates, including new templates that resulted from clustering process.
The topic domains of Web pages are very various and complex. Sometimes one Web page contains several topics at the same time. The relationships among categories are not simple and some categories could have multiple parents. Some Web pages can have from zero to many proper categories.
Therefore, tree structure is not enough to represent relationships among categories. A lattice structure is more proper, but we do not need the whole condition for lattice structure. We now suggest a hybrid structure between the tree and lattice. The new structure is the same as the tree structure except that it allows multiple parents for each node [Fig 1].
[fig 1] Category structure
We named the nodes in the category structure as "templates." Each category was represented as this "template structure" that plays a role of classifier of the categorization algorithm and points out its parent and child templates. The classifier was a type of wordlist containing weight information on the words in the wordlist.
Internal and external templates had different structures. The internal templates had a classifier and pointers to its parent and child templates. Only the internal templates had its negative example group[fig2]. External templates, the leaf node templates, just had a classifier and pointer to its parent templates. The leaf node templates represented the lowest categories [fig3]. The multi-level categorization using this structure and prediction processes using negative examples will be explained in the following sections.
[fig 2] Template structure for internal nodes
[fig 3] Template structure for leaf nodes
On the World Wide Web, there are Web pages with new topics that are not defined as categories, and they are being produced every day. It is tough work to predict all categories required for Web page collection in advance. We used negative examples to predict new coming categories.
Negative examples are the instances that are not included in any pre-defined categories. It is frequently the case that for a fixed category, there will be many more negative examples than positive examples . We clustered negative examples into some groups to use to predict new categories. We assumed that negative examples have a lot of possibilities to contain new topics because they were not assigned to any pre-defined categories. After clustering of negative examples, some clusters were selected by the cluster size. The cluster with enough members of Web pages were selected. From the selected clusters, new categories were extracted. In next section, multi-level categorization, the new category extraction method and how to determine the level of new extracted categories in the hierarchical category structure are explained.
The main idea is to divide the categories into an internal category group and an external category group. As explained in the structure section, the category relationship was represented as a extended structure of tree and each node of the extended tree was represented as a template structure. The categorization used all the templates, regardless of their level, containing a classifier. Web pages became positive examples when assigned into categories of leaf node templates. It was possible to assign multiple categories for a Web page, but if a Web page had categories in external templates, then it could not have any categories in internal templates. Multi-level categorization was done using this rule.
The process of categorization was repeated for Web pages; if they were not positive examples, then they were stored into the appropriate negative example group. Each category in internal templates had its negative example set. The Web pages became negative examples if they are only assigned into internal categories.
[fig 4] An example of category structure
After the clustering of each negative example set, meaningful clusters were extracted from them. The size of the clusters was used as a criterion to select meaningful clusters. Then, new templates for new categories were made using centroids of selected clusters as classifiers and their parent templates were the templates containing that negative example.
For example, in the [fig 4], there are four negative example sets, and a negative example of the "hockey" category means that it is related to hockey, but it is not relevant to any sub- categories of hockey, such as "field hockey" or "ice hockey." If a meaningful cluster from the negative example of the hockey category is "A," then the parent of "A" is the hockey category. "A" inherits all the parent links of hockey.
The advantage of using negative example groups is that it is possible to keep the hierarchical structure of the relationships between the categories because every new category inherits the parent links from the negative example group. Using negative example clustering, new categories can be predicted under the hierarchical relationships.
[fig 5] Block diagram for categorization process
To make automated processes, each Web page have to be represented as a form to measure the similarity between Web pages or Web pages and classifiers. We adopted feature vector to represent the Web page. IR systems often represent documents as feature vectors.
Di is a vector consisting of weights of j words that occur in the document; it contains j features and wj is the weight of jth word. We regard the words in the document as the feature of that document. We did not select all words of the Web pages for the feature vector. Non-informative words were removed to reduce the computation redundancy and improve the effectiveness . Words were stemmed by Porter's stemming algorithm and removed by "stopword-list" at first stage. After calculating weights of remaindered words using [Eq. 1], we sorted them by their weight. We used only the top 20 ranked words; that is, the maximum feature vector size is 20, because small size feature vector can reduce the computation cost. In addition, it is known that 10-15 features are enough for the optimal feature set size for word-based indexing .
Generally the weight value uses TFIDF value, which is given from Term Frequency and Inverted Document Frequency . We added tag information to the TFIDF weight. Usually people create the title of their Web pages using important and representative words. The font or face of the words could be presented differently using tags in HTML, if they are important. We focused on the title tag in this paper, but other tags can be used. The new weighting scheme gave more weights to the words within the title tags than the words that occurred in other places in the Web page. That means that words in the title are more important than the words in any other place. [Eq. 1] is the equation for this new weighting scheme.
c1 + c2 = 1
0 <= c1, c2 <= 1
0 <= tag_weight <= 1
The data set for the experiment consisted of 10,000 Web pages that were randomly selected from the Web pages gathered by a Web robot. The experimental subject field was determined to be sports. The sports links of famous search sites, such as Yahoo and Infoseek, were chosen as start points to gather sports-related Web pages. The [fig 4] was used as the first of the pre-defined categories.
To measure the similarity between classifiers and Web pages, cosine measurement was used . The feature vector of each Web page was extracted using the method described in the "Web page representation and weighting scheme" section. In this experiment, c1 was 0.3 and c2 was 0.7. Categories were assigned by the similarity and the threshold for the assignment was 0.6. All the categories were assigned if the similarity was greater than the threshold.
After categorization of 10,000 Web pages, 170 positive examples were produced. We clustered each negative example group using two clustering method. If the size of the group was larger than 500, the single-link method was used; otherwise, Ward's method was used. Because the single-link method is very simple, it can be implemented relatively efficiently, so we used it for the larger group. The cluster centroid consisted of the most common words in the cluster. Ward's method was used for a relatively small group . The criteria for produced cluster selection was the size of the cluster -- we selected clusters only if the size was larger than 10. The description below is about the clustering results, and the number within the parenthesis is the number of negative examples of the category.
virtual root (3835)--no clustering
We found that some selected clusters from 'sports' were not suitable for the new categories under the sports category. For example, it is better that the new category from the "<tiger, woods, golf, master, PGA, player>" cluster is located under the "golf" category.
hockey : <roller, hockey, skate, sports, club>
In [Table 2], the first column is the category name and second column is the centroid of a cluster from its negative examples. The categories for hockey or skating do not have complex nested sub-categories under themselves because their subject range is relatively narrower than sports and they are located in the lower level in the structure tree [Fig 4].
We found that our mechanism was especially suitable for the prediction of new sub-categories of the lower level category because it is hard to predict the direct sub-categories among the clusters produced from the higher level category like sports as explained in Tiger Woods and golf example. Centroids in the [Table 2] were chosen for the new sub-categories for hockey and skating. The categorization process was repeated including the new categories.
We used templates as classifiers, which contain word lists representing the topics of the category. Classifiers are made by learning or training algorithms through the pre-labeled documents. The classifiers could be better using a large pre-labeled document set. We gave the queries to a search engine, and then we regarded the result pages as pre-labeled documents. Finally the classifiers made by learning algorithms were modified manually to confirm the precision of classifiers. Because the results from the search engine are not complete, they cannot be trusted in comparison with the pre-labeled set made by hand. We believe that confirming the classifiers manually takes less labor than making the pre-labeled set. It would be convenient to support an easy interface to correct the classifiers made from the search engine results using learning algorithms to the future.
We will try to use other tags for the weighting scheme (not only title tag) and apply this weighting scheme, including tag information, to other tagged documents such as SGML or XML.
This paper does not suggest new algorithms for text categorization or clustering, but does suggest a way to combine categorization with clustering. Categorization combined with clustering is applied more effectively when all the necessary categories are not known in advance.
The clustering method is usually performed when there is certain amount of documents existing initially. As explained, the clustering method requires a certain number of documents, and does not allow documents to overlap with various clusters. In actual situations, however, there are many cases in which documents included a variety of topics and belonged to many groups.
Also, the categorization method is preferable to the clustering method when it is desirable to sort documents while collecting through the use of robots, or when it is desirable to know the categories of the documents being collected. However, it is important to pre-define all necessary categories in a case like this.
If there are many documents that are not related to any of pre-defined categories, they would remain unassigned. When numerous documents with new topics are collected, those topics would need to be defined in order to categorize them.
This paper discussed the ways to overcome the disadvantages of the categorization and clustering methods by combining the effectiveness of them and categorizing the documents collected by the robots.
The clustering method is only applied to the negative examples, which are the results of the categorization process, and detects the new categories produced from clustering the negative examples. After which, the categorization is repeated, this time including the newly produced categories from clustering the negative examples. Repetition of these processes can automatically generate the categories deemed necessary by newly produced Web pages and through repetition, the category structure will grow. This mechanism is very useful in predicting new sub-categories of lower level categories.
A possible problem that can occur through the proposed categorization method is choosing which cluster from the results of initial clustering process to use to generate a new topic. In this paper, we proposed to choose a larger cluster to generate a new category. However, it is not an easy task to determine the selection standard, the threshold of cluster size. Determination of the threshold affects the number of categories that are newly added. If the range of the determined threshold is small, categories for specific and detailed topics would be generated or insufficient topics would be generated.
On the contrary, if the range of the determined threshold is large, there is a danger of ignoring the categories that should have been considered as valid and noteworthy.