Do You Think Content on the Internet Is Easy to Understand?

Kazunori FUJIMOTO
NTT Communication Science Laboratories

Kazumitsu MATSUZAWA
NTT Service Integration Laboratories

Toshiro KITA
NTT Communication Science Laboratories


In the area of artificial intelligence, research has begun on constructing computer-understandable knowledge bases whose content mirrors that of the Internet. Such research may help construct intelligent systems that summarize the varied and current information on the Internet and explain it to Internet users. Accordingly, we are conducting research on such intelligent systems, called DSIU systems, which provide Decision Support for Internet Users. DSIU systems, instead of humans, handle information on the Internet and generate advice involving explanations of that information. This paper describes the idea of DSIU systems and our approach to constructing them, focusing on novel technologies for automatically generating explanations and automatically acquiring the knowledge those explanations require.

Keywords: Artificial Intelligence, Decision Support Systems, Automated Reasoning, Knowledge Acquisition


1. Introduction

Internet usage has become widespread, enabling people to immediately acquire information from all over the world. Although a great deal of effort has been put into developing the Internet, making judgments based on its information is still difficult. The amount of acquired information may be large because it is gathered from an enormous body of texts on the Internet. Moreover, the information may be unfamiliar because content on the Internet is frequently updated. These characteristics prevent people from quickly using the information to make judgments.

In the area of Artificial Intelligence, research has begun on constructing computer-understandable knowledge bases whose content mirrors that of the Internet [6,10,22]. This research may help provide intelligent systems that summarize the varied and current information on the Internet and explain it to Internet users. Accordingly, we are conducting research on such intelligent systems, called DSIU systems, which provide Decision Support for Internet Users [11]. DSIU systems, instead of humans, handle information on the Internet and generate advice involving explanations of that information.

As an example of the DSIU process, consider a user who wants to buy an electric product, e.g., a digital camera. First, the user inputs his or her preferences for digital cameras, e.g., excellent image quality, good portability, and so on. The DSIU system then recommends some digital cameras and provides logical explanations for the recommendations, e.g., that CCD pixel size is important for image quality and that a particular camera has a 1.41-megapixel CCD, which is a comparatively superior specification. Even if the technical term "CCD" is unfamiliar to the user initially, its role and importance become clear as such explanations are provided. As a result, the user can make a decision based on his or her own thinking and understanding.

There has been much research on recommending items to satisfy user preferences. Some information-filtering frameworks recommend items similar to those a user previously preferred, or items preferred by a group of members whose preferences are similar to the user's [1,15]. However, these frameworks are based mainly on the similarity of items or members, so they are unable to provide detailed explanations beyond the basic one: "because they are similar to each other." Some intelligent agent frameworks make recommendations by finding recommendable items on Web pages [7,23]. However, many of these frameworks use fixed criteria to choose items, e.g., cheaper items are better, and do not accept users' varied preferences. From the viewpoint of providing advice for users' judgments, providing logical explanations and accepting users' varied preferences are both important capabilities.

Our research on DSIU aims at constructing an intelligent system that evaluates various items on the Internet from the viewpoint of Internet users and provides valuable explanations to those users. This paper describes research on DSIU as a key technology for popularizing the Internet not only for business purposes but also for personal purposes in the first decade of the 21st century. Evaluation and explanation are well-suited to automated reasoning techniques in Artificial Intelligence. Section 2 describes an automated reasoning framework that generates valuable explanations from statements on the Internet. For automated reasoning to generate valuable explanations, varied and current knowledge has to be acquired. Section 3 describes two core techniques for automatically acquiring knowledge from statements on the Internet and gives some experimental results. Finally, Section 4 describes the importance of our approach to DSIU in terms of its contributions to society.

2. Approach to automated generation of explanations

To automatically generate valuable explanations of information on the Internet, computer-understandable knowledge for the explanations is required. Because the explanations concern information on the Internet, the knowledge has to cover varied and up-to-date content. From this viewpoint, preparing the knowledge by hand is not feasible: maintaining a knowledge base that is adequately large and up-to-date would be enormously costly. Thus, the problem "How can we acquire knowledge that corresponds to information on the Internet?" is one of the most important in constructing DSIU systems.

To cope with this problem, we exploit the fact that information on the Internet is itself available for acquiring knowledge. For example, the Internet provides various articles on various types of a product, e.g., advertisements and performance reviews. Such an article may contain a statement describing a relationship between specifications, e.g., "the lens l of this camera improves image quality." By extracting such relationships, we may acquire a dependency between lens l and image quality, which can be used in explanations of digital cameras. In addition, such statements on the Internet cover varied content and are frequently updated. Our approach is to automatically construct knowledge from statements on the Internet and to keep the knowledge used for explanations consistent with the information on the Internet.

There are various kinds of knowledge that we have to acquire for generating valuable explanations. From the viewpoint of acquiring knowledge from statements on the Internet, we classified knowledge into two kinds: objective and subjective. For digital cameras, names and specifications, e.g., the names of digital cameras on the market and the CCD size of each camera, are classified as objective knowledge because they are facts and are not based on human subjectivity. In contrast, evaluations of specifications and of particular cameras, e.g., that a 1.41-megapixel CCD is an excellent specification or that a digital camera takes beautiful photos, are classified as subjective knowledge because such arguments are based on human subjectivity, as signaled by the subjective expressions "excellent" and "beautiful."

Various research efforts have already addressed acquiring objective knowledge from statements on the Internet. Ariadne [22] provides tools to improve the efficiency of manual knowledge acquisition from Web pages. Ontobroker [10] presented knowledge acquisition using annotations put manually into Web pages. The Web-KB Project [6] showed that it is possible to automatically learn extraction rules for objective knowledge, such as the name of a home page's owner, by using machine learning techniques. Various wrappers [29], which extract facts from structured texts, and Information Extraction techniques [27] are also available for acquiring objective knowledge. Efforts have also been made to provide Internet information in semi-structured form using languages such as XML (extensible markup language) [3].

On the other hand, only a few attempts have so far been made to acquire subjective knowledge from statements on the Internet, mainly because of the difficulties in handling human subjectivity. Subjective concepts such as "excellent" and "beautiful" may not be shared by all persons; some people think an image is beautiful while others do not. Moreover, subjective knowledge is often described incompletely, i.e., the context of the knowledge is often omitted from the descriptions. For example, a beautiful photo is produced only when all specifications work well together, although a writer may pick out only some of the specifications to describe. Thus, subjective knowledge is not always certain, and this uncertainty makes its acquisition and handling difficult.

In order to represent and handle such uncertainty, we have developed a new framework [12] based on probability theory. Probability theory provides a theoretical foundation for handling uncertainty and makes many analytical techniques available. The framework represents statements on the Internet as constraints on the subjective probabilities of the humans who wrote the statements. The term "subjective probabilities" here denotes personal probabilities that reflect human knowledge, as distinguished from physical probabilities. For example, the sentence "Specification s of this camera improves image quality." may be interpreted as the writer's argument that adopting s makes it more likely that the camera has high image quality. This argument may be formalized as a constraint on the writer's subjective probabilities:

Pr(Q=high | Spec = s) > Pr(Q=high)

where Q and Spec denote discrete random variables representing image quality and specifications, respectively. This transformation enables various conventional reasoning methods [9] to handle the uncertainty in subjective knowledge, because they can operate on the transformed probabilities instead of managing the statements directly.
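As a minimal sketch (in Python) of this representation, the statement above can be encoded as the constraint Pr(Q=high | Spec=s) > Pr(Q=high) and checked against a candidate probability table. All names and the toy distribution below are illustrative, not from the paper.

```python
# Represent "spec s improves image quality" as the constraint
# Pr(Q=high | Spec=s) > Pr(Q=high), then test whether a candidate
# joint probability table is consistent with it.

def satisfies_improvement_constraint(joint, spec):
    """joint maps (quality, spec) pairs to probabilities; returns True if
    Pr(Q='high' | Spec=spec) > Pr(Q='high')."""
    p_spec = sum(p for (q, s), p in joint.items() if s == spec)
    p_high = sum(p for (q, s), p in joint.items() if q == "high")
    p_high_given_spec = sum(
        p for (q, s), p in joint.items() if q == "high" and s == spec
    ) / p_spec
    return p_high_given_spec > p_high

# A toy joint distribution over (image quality, lens spec):
joint = {
    ("high", "lens_l"): 0.30, ("low", "lens_l"): 0.10,
    ("high", "other"):  0.20, ("low", "other"):  0.40,
}
print(satisfies_improvement_constraint(joint, "lens_l"))  # True: 0.75 > 0.50
```

Here the constraint holds because Pr(Q=high | Spec=lens_l) = 0.30/0.40 = 0.75 exceeds the marginal Pr(Q=high) = 0.50.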

Explanations for Internet users are generated by using exact probabilities derived from the constraints on subjective probabilities acquired from statements on the Internet [12]. For example, the variance of Pr(Q = high | CCD size) over the variety of CCD sizes tends to be larger than the variance of Pr(Q = high | battery size) over the variety of battery sizes, because CCD pixel size changes image quality while battery size seldom does. From this, we can acquire a piece of knowledge stating that CCD pixel size is more important for image quality than battery size. Likewise, within Pr(Q = high | CCD size), the probability for a larger pixel-size CCD tends to be higher than that for a smaller one, because a larger pixel-size CCD is more likely to produce high image quality. From this, we can acquire a piece of knowledge stating that a 1.41-megapixel CCD is a superior specification to a 0.81-megapixel CCD. By combining these types of knowledge with objective information about digital cameras, we can generate explanations such as: CCD pixel size is important for image quality, and a camera with a 1.41-megapixel CCD has a comparatively superior specification.
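The two kinds of derived knowledge described above can be sketched in Python. The conditional probabilities below are made-up numbers for illustration only.

```python
# (1) Importance of an attribute: larger variance of Pr(Q=high | value)
#     across the attribute's values means the attribute matters more.
# (2) Ranking of values: a larger Pr(Q=high | value) means a superior spec.
from statistics import pvariance

p_high_given_ccd = {"0.81MP": 0.3, "1.41MP": 0.7, "2.11MP": 0.9}
p_high_given_battery = {"small": 0.55, "medium": 0.60, "large": 0.58}

var_ccd = pvariance(p_high_given_ccd.values())
var_battery = pvariance(p_high_given_battery.values())
print(var_ccd > var_battery)  # True: CCD pixel size is more important

best = max(p_high_given_ccd, key=p_high_given_ccd.get)
print(best)  # "2.11MP": the superior specification in this toy table
```

Combining the two results with objective camera data yields explanations of the form shown in the text (importance of the attribute, plus a ranking of its values).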

Thus, we can generate valuable explanations by combining objective knowledge with subjective knowledge acquired from statements on the Internet. In this section, we particularly focus on representing and handling subjective knowledge and describe our framework for generating explanations. In this framework, various techniques of text analysis are required to acquire knowledge from statements on the Internet. We describe our approach to such text analyses in Section 3.

3. Approach to automated knowledge acquisition

In order to acquire subjective knowledge from statements on the Internet and to generate explanations for Internet users, various text analyses based on natural language processing are required. Such analyses can be difficult mainly because of the variety of natural language expressions [31]. In this section, we particularly focus on two core knowledge acquisition techniques -- text understanding and semantic word-matching -- and describe our approach to improving the accuracy of knowledge acquisition.

3.1 Text understanding techniques

Much research has been done on acquiring subjective knowledge, with respect to its suitable representations [16,26] and to its relationships with verbal expressions [2,8]. What seems to be lacking, however, is research on the relationships between subjective knowledge and the varied information on the Internet. We aim to extract subjective knowledge from information on the Internet. Therefore, our research raises two important new issues:

  1. which information on the Internet can be made available for subjective knowledge, and
  2. which forms of subjective knowledge are suitable for representing information on the Internet.

Pearl has studied the probabilistic interpretation of meanings in logical sentences [28]. Goldszmidt extended the idea to take account of linguistic quantifiers, e.g., "believable" and "unlikely," in each sentence [14]. To apply such ideas to information on the Internet, kinds of meaning beyond verbal meaning should be considered. For example, the location of each sentence in a text and the font size of each sentence may carry useful information for constructing subjective knowledge. As a concrete example, consider an article containing the two sentences "spec s1 improves image quality" and "spec s2 improves image quality." If these sentences are presented identically, no difference between spec s1 and s2 can be acquired. However, they are quite different when the font size of the former is larger than that of the latter. In this case, we can guess that the writer wanted to emphasize the former sentence over the latter, because a larger font size is used for it. This interpretation gives us knowledge indicating that spec s1 is much more effective in improving image quality than spec s2. This knowledge can be formalized with a parameter C determined by the ratio of the font sizes as

Pr(Q=high | Spec = s1) > Pr(Q=high | Spec = s2) + C

where C is a positive real number less than 1. Thus, in order to acquire subjective knowledge from statements on the Internet, we take into account not only verbal meanings in each sentence but also meanings in text structures like locations, font sizes, and so on.
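A minimal sketch of deriving C from the font-size ratio follows. The particular mapping from the ratio to C is a hypothetical choice for illustration; the text only requires that C be a positive real number less than 1 that grows with the degree of emphasis.

```python
# Derive the constraint parameter C from the ratio of the font sizes of
# two sentences, following the idea above.

def font_size_margin(font_size_1, font_size_2):
    """Return C in [0, 1) that grows with how much larger font 1 is."""
    ratio = font_size_1 / font_size_2   # > 1 when sentence 1 is emphasized
    return max(0.0, 1.0 - 1.0 / ratio)  # 0 when equal, approaches 1 as ratio grows

# "spec s1 improves image quality" in 24pt, s2's sentence in 12pt:
c = font_size_margin(24, 12)
# Resulting constraint: Pr(Q=high | Spec=s1) > Pr(Q=high | Spec=s2) + c
print(round(c, 2))  # 0.5
```

When the two font sizes are equal, C is 0 and the constraint degenerates to a plain inequality, which matches the "independently described" case in the text.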

Based on this idea, we have developed new acquisition methods [19,20] that focus on the customs humans commonly follow in describing information. We introduce the basic idea of the method [19] below. In advertisements, writers usually describe significant features at the beginning in order to catch readers' attention. Based on this custom, we hypothesized that we could identify the rank of a particular specification from the location of its description in an advertisement. To examine the validity of this hypothesis, we randomly selected 60 digital cameras and gathered advertisements for them from the home pages provided by their makers. Then, focusing on CCD pixel size, we divided the cameras into three categories: less than 0.8 million pixels, between 0.8 and 1.3 million, and above 1.3 million. We then examined the locations of the CCD descriptions in the advertisements. Table 1 shows the results of this examination.

Table 1: Location of the CCD description in each advertisement

CCD pixel size        Number of advertisements   Avg. line of first CCD mention
under 0.8 million               10                          21.6
0.8 -- 1.3 million              18                          15.2
over 1.3 million                32                           4.0

In Table 1, the second column shows the number of advertisements gathered for each camera category. The third column shows the average line number on which the CCD is first mentioned in an advertisement. The lines mentioning the CCD were detected automatically as lines that included the word "CCD" or one of its synonyms. The table shows that in advertisements for cameras with larger-pixel CCDs, the line numbers tend to be smaller; that is, the first description of the CCD appears relatively earlier in the advertisement. We have implemented an acquisition method based on this descriptive custom and have confirmed that it is highly accurate when applied to ranking several specifications of digital cameras [19]. We have also developed an acquisition method that acquires dependencies between attributes, e.g., that sharpness of an image depends on CCD pixel size but not on the battery, by using the spatial configuration of words and sentences [20].
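The automatic detection step used for Table 1 can be sketched as follows. The synonym list and the sample advertisement are illustrative, not the ones used in the experiment.

```python
# Find the 1-based line number of the first line of an advertisement that
# mentions "CCD" or a synonym; this is the quantity averaged in Table 1.

CCD_TERMS = ("ccd", "image sensor", "pixel")

def first_mention_line(ad_text):
    """Return the line number of the first CCD mention, or None if absent."""
    for i, line in enumerate(ad_text.splitlines(), start=1):
        lowered = line.lower()
        if any(term in lowered for term in CCD_TERMS):
            return i
    return None

ad = """New compact camera for everyone.
Sharp photos with a 1.41-megapixel CCD.
Long battery life and light body."""
print(first_mention_line(ad))  # 2
```

Averaging this line number over the advertisements in each camera category gives the third column of Table 1.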

It is difficult to acquire correct knowledge from statements on the Internet using only linguistic information. Therefore, to improve the accuracy of knowledge acquisition, it is important to consider other types of information, such as the customs humans commonly follow in describing information. We consider that this approach will help conventional natural language analyses acquire valuable information from statements on the Internet.

3.2 Semantic word-matching techniques

Semantic word-matching is a process that connects words to other words in order to acquire knowledge efficiently from text containing daily-used words. For example, consider a user who pays attention to the image quality of digital cameras and a statement about digital cameras on the Internet: "A 5-megapixel CCD takes a sharp photo." In this example, the word "photo" in the statement is similar to the word "image" used by the user. By connecting "photo" to "image," we can acquire from the statement knowledge such as "the CCD specification is important for the image quality the user is interested in." As this example shows, the surface forms of words in statements on the Internet do not always match those of the words users employ. Therefore, semantic word-matching connects words not by their surface form but by their meaning.

There are various machine-readable data sources for constructing semantic word-matching databases, and many methods have been proposed to construct such databases automatically from each kind of source. Typical approaches include using thesauri to measure lexical cohesion [24], using the morphemes in dictionary definitions to construct a meaning vector for each word [21,18], and using corpora to cluster words by their lexical co-occurrences [17,4]. Although these approaches are useful, each has its own limitations on the accuracy of similarity or association judgments, stemming from the principle underlying it. For example, in a thesaurus, an ambiguous word, which has a single written form but several meanings, is assigned to several different categories. Therefore, thesaurus-based semantic word-matching may judge words with different meanings to be similar, so the precision of the judgments decreases. On the other hand, in meaning vectors acquired from dictionaries, the senses of an ambiguous word are unified into a single vector during natural language processing. Therefore, dictionary-based semantic word-matching may judge words with similar meanings to be different, so the recall of the judgments decreases. As just mentioned, each technique has its own inherent limitations.

Consequently, combining the superior characteristics of each technique is a reasonable approach to improving the accuracy of semantic word-matching. Thus, we are investigating the effect of combining various techniques [13,30] on accuracy. Below, we introduce some experimental results for the combination of thesaurus-based and dictionary-based methods.

Thesaurus-based method

A thesaurus usually consists of

  1. semantic categories constructed as a tree-structure, and
  2. daily-used words assigned to certain categories.

The distance between two words on the tree structure reflects the semantic similarity between them. Therefore, by defining a distance between words on the tree structure, we can use a thesaurus for semantic word-matching: two words match when they are assigned close together in the tree [30]. We used the Kadokawa thesaurus [25] to examine the accuracy of thesaurus-based semantic word-matching.
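A minimal sketch of thesaurus-based matching follows: words match when the path distance between them in the category tree is below a threshold. The tiny tree and word assignments are invented for illustration; they are not the Kadokawa thesaurus.

```python
# Thesaurus-based matching: compute the path distance between two words in a
# category tree (edges up to the lowest common ancestor and down again), and
# match words whose distance is at most a threshold.

PARENT = {"bright": "light", "sharp": "light", "light": "quality",
          "cheap": "price", "inexpensive": "price", "price": "quality"}

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def tree_distance(a, b):
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)   # lowest common ancestor
    return pa.index(common) + pb.index(common)

def thesaurus_match(a, b, threshold=2):
    return tree_distance(a, b) <= threshold

print(thesaurus_match("cheap", "inexpensive"))  # True  (distance 2)
print(thesaurus_match("cheap", "sharp"))        # False (distance 4)
```

The threshold controls the precision/recall trade-off discussed above: a larger threshold matches more pairs but admits more spurious ones.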

Dictionary-based method

From machine-readable dictionaries, our research group has constructed a concept base that measures semantic similarity as a real number between 0 and 1 [18]. This concept base contains about 40,000 Japanese daily-used words, represented as vectors in a 3,000-dimensional semantic space. The semantic similarity between words is measured as the cosine of the angle spanned by the corresponding word vectors. This similarity can be used for semantic word-matching by testing whether it exceeds a threshold [30].
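The dictionary-based matching step can be sketched as follows. The 4-dimensional toy vectors stand in for the 3,000-dimensional concept base and are invented for illustration.

```python
# Dictionary-based matching: each word is a vector in a semantic space;
# similarity is the cosine of the angle between the vectors, and two words
# match when the cosine exceeds a threshold.
import math

VECTORS = {
    "photo":   (0.9, 0.1, 0.0, 0.2),
    "image":   (0.8, 0.2, 0.1, 0.1),
    "battery": (0.0, 0.1, 0.9, 0.3),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def dictionary_match(a, b, threshold=0.8):
    return cosine(VECTORS[a], VECTORS[b]) >= threshold

print(dictionary_match("photo", "image"))    # True: near-parallel vectors
print(dictionary_match("photo", "battery"))  # False
```

Because every sense of an ambiguous word is folded into one vector, near-synonyms in a less frequent sense can fall below the threshold, which is the recall limitation noted above.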


We prepared two sets of Japanese words for the experiments. One consisted of 982 words (nouns, adjectives, and adverbs used in 72 Japanese press releases on digital cameras randomly gathered from the Internet). The other consisted of 10 words frequently used by users to describe features or preferences concerning digital cameras. Consequently, there were 9,820 pairs of words to compare, and we marked 54 pairs whose words were similar to each other, e.g., (bright, sharp), (cheap, inexpensive). Regarding these 54 pairs as the correct answer set, we calculated the F-measures [5] of the individual methods. The F-measure, a standard evaluation metric in the field of information retrieval, is a weighted harmonic mean of precision and recall, and a higher value means higher performance.
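The evaluation metric can be sketched as follows, with a weight parameter beta (the results below are plotted against 1/B). The counts in the example are illustrative, not the experimental results.

```python
# F-measure with weight beta: (1 + beta^2) * P * R / (beta^2 * P + R),
# where P is precision and R is recall. beta < 1 (i.e., 1/B > 1) weights
# precision more heavily; beta > 1 weights recall more heavily.

def f_measure(num_correct, num_returned, num_relevant, beta=1.0):
    precision = num_correct / num_returned
    recall = num_correct / num_relevant
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Suppose a matcher returns 60 pairs, 40 of which are among the 54 correct:
print(round(f_measure(40, 60, 54, beta=1.0), 3))  # balanced F-measure
print(round(f_measure(40, 60, 54, beta=0.5), 3))  # precision-weighted
```

Sweeping beta (equivalently 1/B) and plotting the resulting F-measures for each method produces curves of the kind shown in Figure 1.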

Figure 1 shows the results. In this figure, the vertical axis represents the F-measure achieved by each method, and the horizontal axis represents 1 / B, a parameter of the F-measure expressing the relative importance given to precision over recall. The dictionary-based method worked better than the thesaurus-based method when 1 / B was greater than 1, while the thesaurus-based method worked better when 1 / B was less than 1. This means that the dictionary-based method is better when the reliability of the acquired knowledge is important, while the thesaurus-based method is better when the amount of the acquired knowledge is important. By combining these complementary properties, we can realize semantic word-matching that maintains high accuracy under different user requirements.

Figure 1: Evaluations of thesaurus-based and dictionary-based methods.

In future work, we will introduce corpus-based methods and examine the accuracy improvement from combining them with the thesaurus-based and dictionary-based methods. Corpus-based methods offer a different route to improving accuracy, so the combination is expected to yield more accurate semantic word-matching. We will then evaluate our approach to semantic word-matching from the viewpoint of its contribution to generating correct explanations for Internet users.

4. Conclusion

The Internet contains varied and up-to-date information on commodities, persons, places, and so on. To use this information for human judgment, it is important to realize a framework that provides explanations of its content to Internet users. The research done so far has been insufficient for providing explanations that also consider individual preferences. In this paper, we described DSIU systems, which evaluate various items on the Internet from the viewpoint of individual preferences and provide valuable explanations. We believe that the capabilities of DSIU systems will help popularize the Internet not only for business purposes but also for personal purposes.

In order to construct a DSIU system, the knowledge base in the system has to be maintained as a broad and up-to-date body of knowledge corresponding to information on the Internet. We focused on the fact that such knowledge exists on the Internet itself and presented our techniques for (1) automating the generation of explanations and (2) automating the acquisition of the knowledge for those explanations from statements on the Internet. We have confirmed that these techniques are highly accurate when applied to statements about digital cameras on the Internet. Although we have so far applied these techniques only to electrical products, the basic idea of the framework described in this paper should be applicable to wider domains. As future work, we will apply the idea of DSIU systems to various domains, e.g., persons and places, and launch DSIU services on the Internet.


Acknowledgments

The authors would like to thank Tsuneaki Kato of NTT Communication Science Laboratories for his advice. The authors also thank Hiroshi Sato and Hideto Kazawa, NTT Communication Science Laboratories, for providing statistical data on the Internet.


References

  1. Marko Balabanovic and Yoav Shoham. Content-based, collaborative recommendation. Communications of the ACM, 40(3):66-72, 1997.
  2. Ruth Beyth-Marom. How probable is probable? a numerical translation of verbal probability expressions. Journal of Forecasting, 1:257-269, 1982.
  3. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation, 1998.
  4. P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18:467-479, 1992.
  5. N. Chinchor. Muc-4 evaluation metrics. In Proceedings of MUC-4, pages 22-29, 1992.
  6. Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, and Sean Slattery. Learning to extract symbolic knowledge from the World Wide Web. In AAAI-98, pages 509-516, 1998.
  7. Robert B. Doorenbos, Oren Etzioni, and Daniel S. Weld. A scalable comparison-shopping agent for the world wide web. In Proceedings of the First International Conference on Autonomous Agents, pages 39-48, 1997.
  8. Marek Druzdzel. Verbal uncertainty expressions: Literature review. CMU-EPP-1990-03-02, Carnegie Mellon University, 1989.
  9. Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: Combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 141-148, Montreal, Quebec, Canada, 1995.
  10. Dieter Fensel, Stefan Decker, Michael Erdmann, and Rudi Studer. Ontobroker: The very high idea. In Proceedings of the Eleventh International Flairs Conference (FLAIRS-98), pages 131-135, 1998.
  11. Kazunori Fujimoto and Kazumitsu Matsuzawa. Intelligent systems using web-pages as knowledge base for statistical decision making. New Generation Computing, 17(4):349-358, 1999.
  12. Kazunori Fujimoto, Kazumitsu Matsuzawa, and Hideto Kazawa. An elicitation principle of subjective probabilities from statements on the internet. In Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES-99), pages 459-463, 1999.
  13. Kazunori Fujimoto and Hiroshi Sato. Semantic word-matching for knowledge acquisition from text containing daily-used words: A multiagent-based approach. Submitted to the Eighth Workshop on Multiagents and Cooperative Computing (MACC-99), 1999.
  14. Moises Goldszmidt and Judea Pearl. Qualitative probabilities for default reasoning, belief revision, and causal modeling. Artificial Intelligence, 84(1):57-112, 1996.
  15. Nathaniel Good, J. Ben Schafer, Joseph A. Konstan, Al Borchers, Badrul Sarwar, Jon Herlocker, and John Riedl. Combining collaborative filtering with personal agents for better recommendations. In AAAI-99, pages 439-446, 1999.
  16. David Heckerman and Holly Jimison. A bayesian perspective on confidence. In L. N. Kanal, T. S. Levitt, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence 3, pages 149-160. Elsevier Science Publisher, 1989.
  17. D. Hindle. Noun classification from predicate-argument structures. In Proceedings of ACL, pages 268-275, 1990.
  18. Kaname Kasahara, Kazumitsu Matsuzawa, Tsutomu Ishikawa, and Tsukasa Kawaoka. Viewpoint-based measurement of semantic similarity between words. In Lecture Notes in Statistics 112: Learning from Data, pages 433-442. Springer-Verlag, 1996.
  19. Hideto Kazawa and Kazunori Fujimoto. Grading properties according to the locations of their descriptions in press releases. In Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW-99), pages 30-43, 1999.
  20. Hideto Kazawa, Kazunori Fujimoto, and Kazumitsu Matsuzawa. Attribute dependency acquisition from formatted text. In Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems (KES-99), pages 464-468, 1999.
  21. T. Kitagawa and Y. Kiyoki. A mathematical model of meaning and its application to multidatabase systems. In Proceedings of IEEE International Workshop on RIDE-IMS, pages 130-135, 1993.
  22. Craig A. Knoblock and Steven Minton. The ariadne approach to web-based information integration. IEEE Intelligent Systems, 13(5):17-20, 1998.
  23. Bruce Krulwich. The bargainfinder agent: Comparison price shopping on the internet. In Joseph Williams, editor, Bots and Other Internet Beasties. SAMS.NET, 1996.
  24. J. Morris and G. Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17:21-48, 1991.
  25. S. Ono and M. Hamanishi, editors. Ruigo Kokugo Jiten (in Japanese). Kadokawa Shoten Publishing Co., Ltd, 1985.
  26. Gerhard Paass. Second order probabilities for uncertain and conflicting evidence. In P. P. Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence 6, pages 447-456. Elsevier Science Publisher, 1991.
  27. Maria Teresa Pazienza, editor. Information Extraction. Springer-Verlag, 1997.
  28. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
  29. Arnaud Sahuguet and Fabien Azavant. Wysiwyg Web Wrapper Factory (W4F). 1999.
  30. Hiroshi Sato and Kazunori Fujimoto. Semantic word-matching for knowledge acquisition from text containing daily-used words. To appear in The First International Conference on Advances in Intelligent Systems: Theory and Applications (AISTA-2000), 2000.
  31. T. Wetter and R. Nüse. Use of natural language for knowledge acquisition: Strategies to cope with semantic and pragmatic variation. IBM Journal of Research and Development, 36(3):435-468, 1992.