A Query Transformer in a Digital Library Environment

Sang-goo Lee <sglee@mars.snu.ac.kr>
Seoul National University
Korea

Sang Ho Lee <shlee@computing.soongsil.ac.kr>
Soong Sil University
Korea

Abstract

There are numerous search engines in operation that help users to locate documents in the Internet. Although these aids are generally considered to be useful, current practice of the search engines not only often returns too many references, most of which turn out to be irrelevant, but also takes little account of the implicit search characteristics of each user in the search process. In a digital-library environment where user profiles and accounts are maintained, it is possible to maintain user-specific information to assist user searches. This paper proposes the concept of a query transformer that refines the user query so that it more closely represents the user's intention. The user profile and other search-related knowledge is represented in first-order logic clauses, and semantic query optimization techniques are used in transforming the query. Details of the transformation process as well as the architecture of our search mechanism is presented.

Keywords: Internet databases, semantic query optimization, digital library, World Wide Web, search engine.

Contents

1 Introduction

The Web represents a large, heterogeneous, distributed collection of documents connected by hyperlinks. One feature of the Web is that it is totally decentralized and very dynamic in nature. Unlike a traditional library that is a finite and a centrally controlled resource, the Web does not have a single complete catalogue of its resources. Web users are confronted with the gigantic sea of information and very often have difficulty locating what they want. Web users are easily "lost in space." The same problem is applicable to digital library applications [2], [8], [12] in which a huge amount of information is digitized and stored in computer.

To cope with such problems, search engines that automatically locate and index resources in the digital-library environment have been developed. There are over a dozen different search engines currently in existence. Examples include Yahoo, Infoseek, WebCrawler [13], Galaxy, AltaVista, Lycos, Open Text, and RBSE [6]. A typical search engine navigates the Web for new and updated documents, builds/updates its index, and returns a list of documents for a query, which is essentially a list of keywords.

However, current search services, though useful, are far from ideal. First, they often return a huge amount of references, most of which are irrelevant, forcing the user to retry their queries several times until they find appropriate responses. The process of requerying is quite tedious. A study [14] shows that up to 75 percent of the references returned can be removed if the user supplies a more expressive query. Second, it is hard to reflect the unique search characteristics (such as interests, tendencies) of each user efficiently in the search process. A user often has unique search preferences that could be used to focus the search more accurately to the relevant areas, thus dramatically reducing the number of references returned. For instance, when a computer scientist poses a keyword "thread" to a search engine, it is likely that the intention is to denote lightweight threads in operating systems, not the kind of thread used in sewing.

Customizing queries with more descriptive phrases or words certainly prunes the search space significantly. However, almost none of the search engines widely used address this issue. One of the most important reasons is that in a Web environment where the user group is literally the whole world, it is impossible to keep user-specific information in any catalogue. However, in digital library environments, especially in those where the subjects and users are fixed to a relatively small set, it is often feasible to keep track of individual user profiles. The issue is how effectively and efficiently such information (or knowledge) can be used in assisting the user in the search process.

In the artificial intelligence field over the last few years, development of intelligent Web search modules called "intelligent agents" is extensively studied [1], [3], [5], [7], [10]. An example of an agent would be one that helps the user select articles from a continuous stream of news. On the other hand, the idea that utilizes various semantic knowledge as a means to improve query processing has been around in the database community for quite a while. Semantic knowledge is the set of integrity constraints that a database in a valid state must satisfy. The paradigm is easily adapted to searching in a digital library environment where the semantic knowledge is information about each user and how certain information can be used to transform the query.

In this paper, we propose an architecture of a search system that transparently refines a user's search and customizes the result set for the user. We present the details of the query transformer component of the architecture. The transformer takes a user's query and generates a modified query expression that is more informative, thereby directing the search engine to return mostly relevant documents. The transformed query can also help the search engine to probe a significantly reduced search space. Our strategy is to employ logic-based semantic query optimization, which has created considerable research interest in the database area. We use search-related and user-related semantic knowledge that is represented in first-order logic. In Section 2, we present the architecture of the search mechanism. Section 3 is devoted to a discussion of detailed techniques of the transformer, which is the core component of the search system. Closing remarks are found in Section 4.

2 The search system

2.1 Architecture

Figure 2.1 shows the global view of our architecture. The user's query is submitted to the search system whose transformer reformulates the query into a semantically richer one. The reformulated query is passed to the appropriate information servers. An information server may be a specific digital library, an FTP server, a Web server, or any of the Web indexing services such as Lycos and Yahoo. This implies that the transformer must "speak the languages" of the information servers. The information servers will then respond to the query with a list of documents. Details of the transformer are covered in Section 3.

The presenter filters and ranks the results and presents a tailored form of the result set to the user. The presenter shares the knowledge base with the transformer. The knowledge base includes user-specific information, domain knowledge, information server information, and relevant information about the current query that is inferred by the transformer. The internals of the presenter are still in the works and beyond the scope of this paper.

The search system thus acts as an agent that works on behalf of the user to refine the search. However, we refrain from calling this system an agent since we do not wish to address the issues of agency at this stage.

Figure 2.1: The search system

2.2 Queries

Each of the search engines has its own interface and query language with which users express what they want to find. A user query expressed in the query language is transformed into a formal representation that the search engine can understand. The most common form of query language is a fill-in-the-form type of interface where the user supplies search strings for each field (or attribute) specified. Generically, the search string is essentially a Boolean expression of <field, value> pair, or equivalently an atomic predicate field(value), which is true if and only if attribute field has value value. The atomic predicates are often connected by logical operators (AND, OR, and NOT) to form a more complex query. For example, an expression "subject('database conference') AND area(Asia) AND (year(1997) OR year(1998))" may be interpreted as find all database conferences that will be held in Asia in 1997 or in 1998. The point here is that whatever format a particular query language takes, it can be mapped into a Boolean expression whose atoms are in the form of field(value), without loss of generality.

Definition 2.1 A query is defined recursively as follows:

  1. field(value) is a query.
  1. If Q1 and Q2 are both queries, (Q1 AND Q2), (Q1 OR Q2), and NOT(Q1) are queries.

3 The query transformer (QT)

3.1 Basics

As discussed in the previous section, the role of the transformer is to reformulate the query into a better query. It takes a query expression and transforms it into another expression that not only describes the target more precisely but also reflects the user's search characteristics. The objectives of the query transformer are twofold: the first is to make the search effective by either reducing the search space or providing more descriptive search conditions. The second is to reflect the user's particular taste to search conditions.

The query transformer consists of two distinct processes, as shown in Figure 3.1. In the inference stage, the user query is taken through the rule base and transformed into an intermediate query expression. The query that the user supplies is actually more than what is explicitly typed in. For example, the user may specify only Q1 = (subject(database) AND year(1996)) but the system can immediately augment this query with simple data about the user such as

Q1' = subject(database) AND year(1996) AND username(shlee) AND user_site(Korea) AND system_type(unix).

If the system keeps a profile for each of its users, it can proceed and augment those facts;

Q1" = Q1' AND interests('computer science') AND hobby(golf) AND (occupation(student) OR occupation(professor)).

The last part of the query shows that the system may have indefinite knowledge about certain user facts; in this case, it is not certain whether the user is a student or a professor.

The inference process proceeds with no specific target information server in mind, therefore may generate expressions that some information servers do not understand or cannot utilize. For example, the inference stage may result in augmenting an expression with a billing field that turns out to be useless for a specific information server. Utility of an augmented field is completely dependent on the functionality of target information server by nature.

The translation stage takes care of such discrepancy. The intermediate query expression is filtered so that subexpressions that are not contributory to the particular information server are eliminated. Every single field and value pair generated should be understandable and useful to the search engine. In selecting the target information servers, the translation stage makes use of a catalogue on the indexes and problem domains that each search engine deals with.

Figure 3.1: Query transformer

Separation of the inference process and the translation process allows the inference module to be independent of any particular information server. The inference process is also independent of any user interface client owing to the fact that queries are modeled as Boolean expressions on <field, value> pairs.

3.2 The knowledge base

The knowledge base consists of three parts: a set of facts about the users, a rule base describing general inference rules, and a set of directed acyclic graphs (DAG) representing the hierarchy of concepts.

The set of facts about the users is information pertaining to search habits, general interests, access history, etc. The facts are expressed in the above defined query form. The set of fields that a particular system maintains can vary from one system to another. We do not intend to address the issue of which fields are most important but wish to present a framework for representing and utilizing user information represented as field values.

The rule base is a set of Horn clauses describing individual situations for query transformation. A rule is in the following form;

H <- C1, ..., Ck

where Ci is either any field(value) predicate. Following are two example rules.

(1) billing_type(free) <- occupation(student)

(2) doc_type(tech_report) <- occupation(student), user_site(academic)

Rule 1 represents the knowledge that, for students, free resources should be searched, while Rule 2 tells us that for student users logging on from academic sites, technical reports are the recommended document type.

The concept hierarchy represents the relationship between words used as arguments in the rule base. In theory, the concept hierarchy is simply a set of rules of the following form:

fieldname(super_term) <- fieldname(subterm) (3)

However, it is easier to understand the model both as DAGs where nodes represent terms (concepts) and as directed arcs representing subset relationships. An example of this model is shown in Figure 3.2. In the hierarchy, a parent term is a broader term (BT) than a child term, which is the narrower term (NT) of the parent. All rules defined in a parent term are applicable to its child terms. For example, Rule 2 is valid for college and university as well as for academic. Currently, we only allow strict subset relationships (subset hierarchy).


Figure 3.2: Concept (subset) hierarchies

A single DAG can be shared among all fields. But in practice, because different fields use different set of terms as arguments, it is better to allow multiple hierarchies and have each field utilize the most appropriate one. For example, in the above example, it may be beneficial to merge the two trees of occupation and user site and have the fields share the merged tree, whereas there is little to be gained by merging the tree for doc_type with the other two.

3.3 Methodology

The idea of rewriting a user's query in order to guide the search process is not a new one. In fact, semantic query optimization [4], [9], [11] is exactly the process of rewriting the user query into a "semantically equivalent" one which would yield the same answer, but more efficiently. There are, however, important differences between the proposed query transformer and SQO. First, for QT, the set of answers for the transformed query is not intended to be the same as the one for the original user query. In contrast, the objective of SQO is to find the same set of answers more efficiently. Second, the predicates in an SQO paradigm represent relations, whereas a predicate in our setup represents a single fact or concept. Third, the answer to a query in an SQO paradigm are instantiations of free variables occurring in the query predicates. The answer to a query in our situation is the set of documents that are related to the concepts described in the query.

Despite these differences, there are a number of key similarities that allow us to apply SQO techniques to our situation. First, first-order logic can well represent the queries and knowledge base of both cases. This enables us to use the same inference mechanism used for SQO to deduce results that are readily applicable for the QT. The set of rules in our knowledge base plays the role of integrity constraints (IC) in SQO. The simple format rules that make up the concept hierarchy (Rule 3 in Section 3.2) work in the same way as the intentional database (IDB) rules. There is, however, no direct counterpart in our knowledge base to the extensional database (EDB).

The SQO methodology proposed in [11] is flexible in that it allows queries to have nested ANDs and ORs. This is important for our purpose since ORs can be introduced in various parts of our inference process. Furthermore, because the process separates the gathering of relevant information stage from the actual query reformulation stage, it fits nicely into our two-part transformation process; inference and translation. For these and other reasons, we build on the work of [11].

We present an example that describes how the inference process works. Let us consider the following query along with Rules 1 and 2 above.

Q: subject(database)

Suppose that from the user profile we know of two facts about the user. Since these are true facts, we can add them to the query without altering its (intended) meaning.

Q': subject(database) ^ occupation(student) ^ user_site(snu)

Using the inference methods of [11], we are able to deduce from Rule 1 that billing should be free and augment this information to the query.

Q": subject(database) ^ occupation(student) ^ user_site(snu) ^ billing_type(free)

We can represent this query as a query tree [11] as follows:

Rule 2 is not yet relevant to Q" since we do not have a predicate user_site(academic) in the query tree. So the next step is query tree expansion using the concept hierarchy. The node (in the query tree) that is being expanded, say node n, is replaced by the subtree extracted from the corresponding DAG. The subtree is extracted as follows:

  1. create an AND node s and make it the root of the subtree, and
  2. all the nodes in the path(es) from the root of the DAG to node n' that corresponds to node n becomes the children of s.

The query tree above is expanded on node user_site(snu) using the DAG in Figure 2.

Since occupation(student) and user_site(academic) are both present in the query tree, Rule 2 becomes relevant and we may add doc_type(tech_report) to the query tree. Although the process (and reasoning) becomes more complicated in the presence of ORs in the query tree, it can be handled effectively [11].

The transformation process is similar to SQO differing mostly in the expansion step. Here we are able to exploit ancestors of the target node (term) since we are dealing with strict subset relationships rather than definitions, as in IDB rules. It can be shown that the above described process is sound.

The transformed query tree is then passed to the translator. It is the translator's role to filter out useless portions of the query tree and tailor the query to each particular information server. For example, if an information server required SIMS [10] query form, a translation of the above query tree would be

(retrieve (?title)
(:and (documents ?docs)
(keyword ?docs database)
(type ?docs tech_rep)
(charge ?docs 0)
(title ?docs ?title) ) )

4 Closing remarks

We have proposed an architecture for a query transformer that utilizes user specific knowledge. The search system lies between the end user and information sources, and is able to work transparently. One strong point of our query transformer is that it is easily applicable to a wide range of environments including the Web, digital libraries, and Usenet.

An adapted version of an SQO method is developed for the query transformer. The concept hierarchy differs from that of [10] in that we focus much on meta-level terms as opposed to subject-level (or content-level) terms. Subject-level concept hierarchies can only be addressed in future work since the complexity and size would be orders of magnitude that are higher than the meta-level ones. For this reason, specialized concept hierarchies or thesauri are being built for specific subject domains. One potential benefit of our architecture is that we can add a few rules into the knowledge base so it will be able to choose the appropriate subject domain for each user.

The decision regarding the amount and type of user-specific knowledge is still being studied. The query transformer may be required to have a learning capability that dynamically adjusts its list of important fields. User search characteristics are subject to change over time and the query transformer should be able to update its knowledge base accordingly. Thus, dynamic adaptation or learning in different dimensions definitely deserves further study.

References

[1] ACM, Special Issue on Intelligent Agents, Communications of ACM 37(7), 1994.

[2] ACM, Special Issue on Digital Library, Communications of ACM 38(4), 1995.

[3] D. E. Atkins, et al., Toward Inquiry-Based Education through Interacting Software Agents, IEEE Computer 29(5): 69-76, 1996.

[4] U. Chakravarthy, et al., Foundations of Semantic Query Optimization for Deductive Databases, Foundations of Deductive Databases and Logic Programming, (J. Minker, ed.), 1986.

[5] D. Dreilinger, Integrating Heterogeneous WWW Search Engines, 1995.

[URL: ftp://132.239.54.5/savvy/report.ps.gz]

[6] D. Eichmann, The RBSE Spider: Balancing Effective Search Against Web Load, Proc. of the First International WWW Conference, 1994.

[7] O. Etzioni and D. Weld, A Softbot-Based Interface to the Internet, Communications of the ACM 37(7): 72-76, 1994.

[8] S. P. Harter, What Is a Digital Library? Definitions, Content, and Issues, Proc. of the International Conference on Digital Libraries and Information Services for the 21st Century: 8-17, 1996.

[9] J. King, Query Optimization by Semantic Reasoning, Ph.D. dissertation, Department of CS, Stanford University, 1981.

[10] C. A. Knoblock, Y. Arens, and C.-N. Hsu, Cooperating Agents for Information Retrieval, Proc. of the Second International Conference on Cooperative Information Systems, 1994.

[11] S.-G. Lee et al., Semantic Query Reformulation in Deductive Databases, Proc. of the 7th IEEE Data Engineering Conference, Kobe, Japan, 1991.

[12] P. J. Nurnberg et al., Digital Libraries: Issues and Architecture, Proc. of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.

[13] B. Pinkerton, Finding What People Want: Experiences with the WebCrawler, Proc. of the Second World Wide Web Conference, 1993.

[14] E. Selberg and O. Etzioni, Multi-Service Search and Comparison Using the MetaCrawler, Proc. of the 4th International World Wide Web Conference, 1995.

Acknowledgments

This work is supported in part by Electronics and Telecommunications Research Institute (ETRI) of Korea as part of the 21c Database Initiative Project.