Indexing Links of Pages to Search for "Hearts" Through the World Wide Web

Ryunosuke OHSHIMA <ryu@jaist.ac.jp>
Shinsuke MIWA <s-miwa@jaist.ac.jp>
Yoichi SHINODA <shinoda@jaist.ac.jp>
Japan Advanced Institute of Science and Technology
Japan

Abstract

Users of Web search engines wish to search not for pages but for "hearts," groups of pages that concern the same topic. Traditional robot search engines lack the ability to manage hearts. Traditional directory search engines lack the ability to manage the ever-changing Web. To solve these problems and realize users' wishes, in this paper we present a new method of indexing Web pages. We combine our techniques with traditional text search techniques to produce better search results: not pages but hearts. By adding a link structure index to keyword-based full-text search, we remove redundant pages, leaving users with more suitable pages from which to choose.

Our Namazu-Hige prototype system groups out-of-order or redundant pages into hearts. It ranks hearts rather than pages, so noisy results rank lower. It also reduces the time users spend examining results by offering them "the heart of the heart" (the most representative page of the heart). We explain Namazu-Hige and compare it with traditional search engines, using some typical Web pages as examples.

Introduction

Although the World Wide Web is the most popular aspect of the Internet, users of the Web still get lost among its pages. The Web itself provides no index of its pages, so users employ text search systems called search engines to retrieve information. Search engines help users but are far from perfect. When keywords are entered, many useless pages appear while useful ones do not, and the suitable pages remain buried among the unsuitable ones. If Web search engines could manage Web links effectively, they could manage hearts.

In the next section, we analyze Web links. Then we propose a method for finding hearts that uses the structure information of Web links; the method for finding "the heart of the heart" is also shown. Next, we explain how to combine our method with full-text search engines and describe our implementation of the prototype system Namazu-Hige, built to test our theory. Finally, we discuss our results and conclude the paper.

Analysis of Web links

Web page creators (who are also Web users) are constantly linking their pages to other pages. Link structures of pages are important clues to finding hearts.

Web links have many functions: to suggest routes for users, to show users details, to provide tables of contents or indexes, to guide users to a word's definition, and so on. Web links differ in the strength with which they connect pages. Route-suggesting links, for instance, connect pages more strongly than word-glossary links. One of our goals is to discover each link's strength and role. We calculate connectivity among pages based on links' strengths and roles to find hearts.

Geometric properties of Web links

First, let's consider the geometric properties of Web links. We consider n-distant strongly connected components to be important. An n-distant strongly connected component is a group of pages in which any page can reach any other page in the group and come back to itself within a distance of n. In other words, the pages of the group lie on cycles of links no longer than n.


Figure 1: Page map of Japan Advanced Institute of Science and Technology (JAIST)

For example, Figure 1 shows a map of the neighborhood pages of research laboratories in the School of Information Science, Japan Advanced Institute of Science and Technology (JAIST). Strongly connected components are outlined with colored lines: groups outlined in red are 2-distant strongly connected components, and groups outlined in magenta are 3-distant. There are also more distant groups, such as the group outlined in pink. We assume that links forming n-distant strongly connected components are strong links and that the connectivity among those pages is strong.

The number of links on a page is also an important factor in estimating link strength. In Figure 1, the group of Shinoda Laboratory pages and the group of School of Information Science pages belong to the same 2-distant strongly connected component. However, if users chose links at random, some would easily leave the group of School of Information Science pages because those pages have many outgoing links. The Shinoda Laboratory pages, by contrast, have few outgoing links, so users tend to stay within the group. Thus the connectivity of the Shinoda Laboratory group is stronger than that of the School of Information Science group.

To discover pages' roles, we use the direction and strength of the links on each page. If one page has more outgoing links than another, the first page is more general or abstract. If one page has more incoming links than another, the first page is more particular or concrete. Some pages of strongly connected components have both characteristics. In Figure 1, because the School of Information Science page is an index page, it has many outgoing links to the laboratories, publications, and so forth. The Shinoda Laboratory page has both characteristics: it is a particular page within JAIST and a general page for the other Shinoda Laboratory pages.

To make use of this information in a new Web search engine, we introduce two parameters, page distance and page influence. We derive these parameters from links and other information present in pages to reveal the pages' characters. Page distance indicates the number of links needed to move from one page to another. Page influence indicates the probability of reaching one page from another via links. We use page distance and page influence as concise representations of the link structures of pages.

Page distance

We define page distance as the minimum number of links to reach one page from another.

The definition of page distance (PD) from Page "a" to Page "b" is:

    If "a" and "b" are the same page
        PD(a, b) = 0
    else if "a" has no links
        PD(a, b) = infinity
    else if a link of "a" to "b" exists
        PD(a, b) = 1
    else
        PD(a, b) = min(PD(x, b) + 1) for all x (x is an outgoing link of "a")
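
This recursive definition can be evaluated as a breadth-first search over outgoing links. Below is a minimal sketch in Ruby (the language of our prototype's Hige part); the hash "outlinks," mapping each page to the pages it links to, is an assumed input for the example, not part of the prototype's actual interface.

    # Minimal sketch: page distance PD(a, b) as a breadth-first search.
    # `outlinks` is an assumed Hash from each page to the array of pages
    # it links to; Float::INFINITY represents "no path".
    def page_distance(outlinks, a, b)
      return 0 if a == b
      dist = { a => 0 }
      queue = [a]
      until queue.empty?
        page = queue.shift
        (outlinks[page] || []).each do |x|
          next if dist.key?(x)
          dist[x] = dist[page] + 1
          return dist[x] if x == b
          queue << x
        end
      end
      Float::INFINITY
    end

For example, page_distance(outlinks, "a.html", "b.html") returns 1 when Page "a" links directly to Page "b."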

Page influence

We define page influence as the probability of reaching one page from another page.

The definition of page influence (PI) from Page "a" to Page "b" is:

    If "a" and "b" are the same page
        PI(a, b) = 1
    else if PD(a, b) == infinity
        PI(a, b) = 0
    else
        PI(a, b) = (sum(PI(x, b)) / #a) for all x (x is an outgoing link of "a")
    #a is the number of outgoing links in page "a"
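
Because the Web contains cycles, this recursion does not terminate as written; in practice it must be truncated. The following Ruby sketch (ours, with the same assumed outlinks hash as above) evaluates the definition up to a cutoff depth, mirroring the prototype, which computes these values only up to some distance:

    # Minimal sketch: page influence PI(a, b), truncated at `depth` links
    # because the plain recursion would loop forever on cyclic links.
    def page_influence(outlinks, a, b, depth = 5)
      return 1.0 if a == b
      return 0.0 if depth.zero?
      links = outlinks[a] || []
      return 0.0 if links.empty?
      links.sum { |x| page_influence(outlinks, x, b, depth - 1) } / links.length
    end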

Reverse page influence

Reverse page influence is defined in the same way as page influence but uses incoming links instead of outgoing links.

N-distant strongly connected pages

From the definition of page distance, it follows that:

    Pages "a" and "b" are n-distant strongly connected if and only if
        PD(a, b) + PD(b, a) <= n
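
Using the page_distance sketch shown earlier, this definition translates directly into code:

    # Direct reading of the definition, reusing page_distance from above.
    def n_distant_strongly_connected?(outlinks, a, b, n)
      page_distance(outlinks, a, b) + page_distance(outlinks, b, a) <= n
    end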

Lexical properties of Web links

We have discussed the geometric properties of Web links. Here, we explain the lexical properties of Web links.

Occurrence of same words

If linked pages share some of the same words, and those words appear frequently, the links among those pages are strengthened. For example, the fact that the Shinoda Laboratory pages all include the word "Shinoda" is an indication that it is correct to group them.

Special words attached to links

Some pages attach special words such as "back" or "top" to links. These special words are also clues to how pages are classified. The special words "next" or "prev" strengthen the connectivity of pages. "Top" and "index" suggest that a page guides users to more general pages.
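
As an illustration, a classifier along these lines might weight links by the special words in their anchor text. The word lists and weights in the following Ruby sketch are our own assumptions for the example, not values taken from the prototype:

    # Illustrative sketch: weighting a link by the special words in its
    # anchor text. Word lists and weights are assumptions for this example.
    STRENGTHEN = %w[next prev].freeze        # words that bind pages together
    GENERALIZE = %w[back top index].freeze   # words pointing to broader pages

    def link_weight(anchor_text)
      words = anchor_text.downcase.split
      return 2.0 if words.any? { |w| STRENGTHEN.include?(w) }
      return 0.5 if words.any? { |w| GENERALIZE.include?(w) }
      1.0
    end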

How to find hearts

In the previous section, we analyzed Web links. In this section, we propose our method for grouping Web pages given a set of keywords, using page distance and each page's keyword frequency score. First we find the n-distant strongly connected pages of each page, using a breadth-first search (a special case of Dijkstra's algorithm for unweighted graphs). Below, we show an outline of the algorithm that finds all n-distant strongly connected pages of Page "p."

queue = [[p]]   # a queue of paths, each starting at Page "p"
result = []
dist = 1
while (dist <= n) {
    newqueue = []
    for all path in queue {
        for all x (x is an outgoing link of the path's last page) {
            if x == p {
                push all pages of path in result
            } else {
                push (path + [x]) in newqueue
            }
        }
    }
    queue = newqueue
    dist += 1
}
return result

Overview of the algorithm to find all n-distant strongly connected pages of page "p"

N-distant strongly connected pages are treated as candidates for hearts. Each group sums up its member pages' scores. The longer the distance n, the harder it is for users to recognize the group, so the group's score is discounted according to n:

    Score(G) = sum(score(g)) * WR ** n

WR stands for "walk rate," a value in the range [0, 1] that models users' average probability of following links while examining the group. The score indicates how plausible the group is as a heart; high-scoring groups are deemed to be hearts.
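
As a Ruby sketch, with "scores" an assumed hash from page to keyword score and a WR value chosen only for illustration:

    # Minimal sketch of the group score Score(G) = sum(score(g)) * WR ** n.
    WR = 0.5  # walk rate; an assumed value, the paper does not fix one

    def group_score(group_pages, scores, n)
      group_pages.sum { |g| scores[g] } * WR**n
    end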

How to find "the heart of the heart"

The previous section explained how to discover hearts. Next, we discuss how to find "the heart of the heart." Each page has a spread score, calculated from reverse page influence and page distance. We use a breadth-first search algorithm to calculate the n-distant reverse page influence of Page "p."

for all pages x {
    result[x] = 0
}
queue = [([p], 1.0)]
dist = 1
while (dist <= n) {
    newqueue = []
    for all (path, infl) in queue {
        #l is the number of incoming links of the path's last page
        for all x (x is an incoming link of the path's last page) {
            continue if (path includes x)
            result[x] += (infl / #l)
            push (path + [x], infl / #l) in newqueue
        }
    }
    queue = newqueue
    dist += 1
}
return result

Overview of the algorithm to calculate the n-distant reverse page influence of Page "p"

The propagated scores of all pages in one "heart" are summed. The new score of Page "p" is calculated from this formula:

    newscore(p) = sum(score(x) * RPI(x, p) * WR ** PD(p, x)) for all pages x in the heart

The page with the highest newscore is "the heart of the heart."
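
As a Ruby sketch, assuming precomputed lookup tables "rpi" and "pd" (hashes keyed by page pairs) for reverse page influence and page distance, and the WR constant from the group-score sketch:

    # Minimal sketch of newscore(p); `heart` is the array of pages in the
    # group, `rpi` and `pd` are assumed precomputed lookup tables.
    def new_score(p, heart, scores, rpi, pd)
      heart.sum { |x| scores[x] * rpi[[x, p]] * WR**pd[[p, x]] }
    end

"The heart of the heart" is then heart.max_by { |p| new_score(p, heart, scores, rpi, pd) }.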

Namazu-Hige Prototype System


Figure 2: Overview of Namazu-Hige

To assess our method, we have implemented a prototype system named Namazu-Hige. (In Japanese, "Namazu" means "catfish" and "Hige" means "whiskers.") Figure 2 shows an overview of Namazu-Hige. The left column (the Namazu part) is the full-text search engine; the right column (the Hige part) is the grouping system that uses link structure to find hearts.

Namazu

Our full-text search engine is based on Namazu, an open-source full-text search engine written in C and Perl; a full description is available at [4]. Namazu has two parts: "mknmz," which builds an index, and "namazu," which searches it. Mknmz builds an index from raw Hypertext Markup Language (HTML) files: it parses the raw page data lexically, counts the occurrences of words, and weights them by context and HTML tags. Namazu then searches that index with keywords and computes a score for each file. If multiple keywords are given, Namazu combines scores with the TF-IDF method to determine the weight of each keyword. Finally, Namazu sorts the files by score and returns the high-ranked pages as results.
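
Namazu's exact weighting is documented at [4]; as a rough Ruby illustration of the TF-IDF idea it builds on (our own sketch, not Namazu's code):

    # Rough TF-IDF illustration (not Namazu's actual scoring code).
    # `doc` and the members of `docs` are arrays of words.
    def tf_idf(term, doc, docs)
      tf  = doc.count(term).to_f / doc.length
      df  = docs.count { |d| d.include?(term) }
      idf = Math.log(1 + docs.length.to_f / (1 + df))
      tf * idf
    end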

Hige part

The Hige part, the main part of our prototype system, is written in Ruby [5]. It consists of two components: the hyperlink indexer and the grouping engine.

Hyperlink indexer

The hyperlink indexer extracts anchors and collects link information from HTML files. It also normalizes their uniform resource identifiers (URIs) into absolute URIs and builds an index of them. The index records each page's incoming and outgoing links; page distance and page influence are also precomputed up to some distance.
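
A minimal sketch of the anchor extraction and URI resolution step, using Ruby's standard URI library; a real indexer would use a full HTML parser rather than this simplified regular expression:

    require 'uri'

    # Minimal sketch: extract anchors from HTML and resolve them against
    # the page's own URI. The regular expression is a simplification.
    def extract_links(page_uri, html)
      html.scan(/<a\s[^>]*href\s*=\s*"([^"]+)"/i).flatten.map do |href|
        begin
          URI.join(page_uri, href).to_s
        rescue URI::Error
          nil
        end
      end.compact
    end

Inverting the resulting map of outgoing links then yields each page's incoming links.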

Grouping engine

The grouping engine receives the pages' lexical scores from the Namazu part and reads the link index for those pages. It finds hearts, sorts them by score, discovers "the heart of the heart" of each heart, and returns the result.

Demonstration


Figure 3: Search result of Namazu-Hige


Figure 4: Search result of Namazu

Figure 3 shows a search result from Namazu-Hige: a search of the JAIST site with the keywords "information science" and "research." Figure 4 shows the result of the same search using Namazu for comparison.

You can try Namazu-Hige on the Web to search JAIST. You can also try it on your own PC with your own data set; all sources needed for the trial are available.

Discussion

Precision and recall

Web search engine users tend to examine only the top 20 to 100 search results. Our goal is to increase the precision of these top 20 to 100 results. To do so, we cut all pages of a heart except "the heart of the heart," because the pages of one heart carry similar (and often redundant) information.

Ubiquity of hearts

We examined JAIST's site, which contained 6,856 HTML files on 25 January 2000. The table below shows the results of the analysis. We found 759 hearts at distance 5, made up of 2,317 HTML files. This means that 33.8% of the HTML files belonged to hearts and that the average heart contained about three pages.

Table 1: Hearts of JAIST

    Distance   Hearts   Pages Belonging to Hearts   Other Pages
    2          835      2,222                       4,634
    3          872      2,297                       4,559
    4          872      2,314                       4,542
    5          759      2,317                       4,539

Future work

We are now planning to provide a Namazu-Hige package; the source code of Namazu-Hige will be available on our site [6] soon. If you want it immediately, please send an e-mail to ryu@jaist.ac.jp. We will also make a quantitative assessment of the Namazu-Hige system to improve our analysis of Web links, and we will look for more efficient algorithms for dealing with hearts.

Conclusion

Our methodology can extend to many other Web technologies, shifting their base unit from pages to "hearts." We expect to apply it to fields other than information retrieval, such as information summarization, the addition of new structural information to the Web, and the recomposition of the Web. In the future, our methodology may help the Web evolve and improve.

References

  1. Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference (WWW7), 1998.
  2. Michael Chen, Marti Hearst, Jason Hong, and James Lin. Cha-Cha: A System for Organizing Intranet Search Results. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS), October 1999.
  3. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the Web. Scientific American, June 1999.
  4. Namazu. http://openlab.ring.gr.jp/namazu/.
  5. Ruby. http://www.ruby-lang.org/.
  6. Namazu-Hige. http://shinoda-www.jaist.ac.jp/Projects/hige/.