Hiroaki KASHIMA < kashima@csce.kyushu-u.ac.jp
>
Yoshiki ISHIDA < yoshiki@cc.kyushu-u.ac.jp >
Zengo FURUKAWA < zengo@ec.kyushu-u.ac.jp >
Kazuo USHIJIMA < ushijima@csce.kyushu-u.ac.jp
>
The following characteristics of the data cause the difficulty to find out the information. (1) The number of data is very large, (2) the data are widely dispersed on the Internet, (3) the mass of data is increasing rapidly, and (4) the data always change in structure and contents.
In this paper, a set of such data is represented as the word ``super-dispersed information.'' It is difficult to find required information on super-dispersed information with existing information retrieval systems.
This paper introduces ``March'' which searches filenames on many anonymous FTPs using the IP Multicast. Multicast is a mechanism suitable to search super-dispersed information on the Internet. Filenames on anonymous FTPs are always changing and are dispersed widely. March consists of march-servers on anonymous FTPs and march-client on user machines. March is superior to archie systems in less traffic on the Internet and in shorter time for getting filenames. However, the current IP multicast restricts the March usability on user interfaces for specifying area of searching filenames.
2 Existing Information Searching Systems
3 Searching Information with Multicast
4 File Searching System "MARCH"
Finding objective information becomes difficult in the Internet. In the Internet, many kinds of information are widely spread and many systems are currently developed for providing information. These systems provide many data on computers and you can access the data through the Internet. However, finding objective data from many kinds of information provided by many kinds of information providing systems is difficult. In some cases you should search some kinds of information to retrieve objective data.
Existing information searching systems cannot search some kinds of information. They are designed for searching static data. The change of static data should be little or is managed completely. On the other hand, information providing systems, such as World-Wide Web[1], make information explosions on the Internet. These exploding data also change their structures and contents dynamically. The searching systems on static data cannot deal with volumes and transitions of exploding data.
We name such kinds of information ``super-dispersed information,''
which is un-search-able or may become un-search-able because of the
overflow of searching systems.
Figure 1 shows the super-dispersed information.
Super-dispersed information has some characteristics.
Figure 1: Super-displersed information
Existing information searching systems are not suitable for searching super-dispersed information. When you are going to search information from super-dispersed information, you should use an appropriate system to find information. Existing information searching systems have a common problem.
Systems that are usually called ``information retrieval systems'' are classified into three types: information providing systems, information searching systems and information searching-and-providing systems. Moreover, they are usually classified in another way into three types: centralized systems, tree structured distributed systems and net structured distributed systems. Table 1 shows the classification of information retrieval systems.
Table 1: Classification of Information Retrieval Systems
Centralized systems are not suitable to search super-dispersed information because the collection of super-dispersed information must overflow the server. Tree structured distributed systems are not also suitable to search super-dispersed information because these systems use the tree structured based on the relation of data but super-dispersed information has no precise relation. Net structured distributed systems are suitable in case the net can reflect the relations of super-dispersed information quickly enough to search.
Existing information searching-and-providing systems --- such as whois, wais, X.500 and domain name system --- are centralized or tree structured systems and they are not suitable to search super-dispersed information.
Existing information providing systems --- such as World-Wide Web, Gopher and anonymous FTP --- are centralized or net structured systems and data in these systems are weakly related or independent. Currently these systems provide super-dispersed information but these systems don't have the search method themselves.
Existing information searching systems --- such as archie, Veronica and robot --- are centralized or net structured system. Archie and Veronica are centralized systems and and they must overflow with data. Robot uses the net of the World-Wide Web but it transfers data at a stretch to the local host and searches on the local. Therefore the robot has heavy load on networks and World-Wide Web servers.
Super-dispersed information is provided by many information providing systems and existing information searching systems have several problems.
Multicast 2 is a communication way suitable for searching super-dispersed information. Multicast is a communication protocol to send packets to members of a multicast group at the same time. You will send a query to the certain multicast group and you will receive search results from many searching servers which are members of the group.
On the Internet virtual multicast backbone ``MBone[3]'' is running and many broadcasting experiments are done on MBone. MBone is constructed with LANs which have one or some multicast speaking hosts and tunnels which connect these LANs via non-multicast-speaking routers. This is based on Steve Deering's proposal in the latter of 1980's, which has following characteristics:
We propose a prototype here to search super-dispersed information efficiently. Figure 2 shows this prototype.
Figure 2: Searching super-dispersed information
With this method you will have following merits.
The multicast group of current IP Multicast is not suitable for searching information. On the current IP Multicast you cannot use the multicast group to choose a specific area to which the packet reaches.
We have made a plan to select the multicast group with only the type of information and should select the search area with the TTL variable in the packet. The IP address of multicast group for searching information should be pre-defined, or you cannot select the group corresponding to the objective information. The TTL value should be decided when you begin searching. Each query packet will reach servers in the area which the TTL variable indicates.
There are too many files to search on many anonymous FTP servers on the Internet. They are a kind of super-dispersed information. You can use ``Archie[4]'' as a system to search files on many anonymous FTPs, but it has several problems. To solve these problems we have developed a new file searching system ``March'', which adopts the method we describe above.
The anonymous FTP is an information providing system and the system itself has no method to search files on each server.
The number of files on anonymous FTPs is quite large and increasing rapidly. This is because an anonymous FTP server can provide many files with a large volume of disks. The cost of disks becomes cheaper and cheaper, then anonymous FTP server can provide more and more files.
Moreover, an administrator of an anonymous FTP server need not contact with administrators of other anonymous FTP servers. Therefore, anyone can start a new anonymous FTP server easily. For these reasons, files on anonymous FTPs are a kind of super-dispersed information.
``Archie'' is a system for searching files on many anonymous FTPs. On the current Internet you can get the location of files that you want with this system.
The archie system is a centralized system. This causes a serious problem. The archie server must have filenames of many files on many anonymous FTPs, but the number of files on anonymous FTPs must become too many to manage and search on a single server because they are a kind of super-dispersed information. The archie system will overflow.
Searching files on anonymous FTPs with the above method is effective because files on anonymous FTPs are a kind of super-dispersed information. According to the above method, march servers have features as follows: each server corresponds to an anonymous FTP; each server searches filenames on the anonymous FTP; each server belongs to the march multicast group; servers exchange nothing each other; you can send a query packet to the march multicast group and receive search results directly from servers.
As described above, you select the march multicast group and you describe the searching area as the TTL of the query packet. It is useful to send query packets in order of the TTL from smaller to bigger because many of files are copies from the other anonymous FTPs. As there may exist many copies for the objective file, you can find it within a smaller TTL, and you can avoid making a load on servers over the specified TTL. At the same time, you can get the closest or mostly closest file to avoid making a load on networks. Figures 3 and 4 show difference between archie and march.
Figure 3: Search with Archie
Figure 4: Search with March
The system is implemented in client-server model as shown in Figure 4. Each server joins the march multicast group and a client throws the query packet to the group. We use the IP address ``224.224.2.2'' for the march multicast group.
The server is implemented on the SPARC SunOS4.1 with ipmulti-patch and is constructed with three part: the receiver, the searcher, and the database builder. Figure 5 shows the function of each part.
Figure 5: Construction of the March Server
The receiver runs as a daemon process and joins to the march multicast group to receive queries from clients. Each query is passed to the searcher. The receiver is written in C to use the functions of existing multicast applications.
The receiver makes the searcher start, which searches the filenames that match the query from the receiver. The filenames are compiled and stored in the local database built by the database builder. The searcher is written in Perl that is powerful at string processing.
The database builder makes periodically the local database from the ls-lR file on the anonymous FTP corresponding to the march server.
The client is also implemented on the SPARC SunOS4.1 with ipmulti-patch and is written in C to use the functions of existing multicast applications. This sends a query to the multicast group with the appropriate TTL value. The query contains the query string, the search type and the display type. The client knows the searching servers dynamically with the ACK from servers. The client regards the receipt of search results from all senders of ACK as the end of searching.
When an objective file is not found, you can re-send the query with the larger TTL to search the larger area. Each server should discard the same query from the same client. Then the server should discard the same queries from the same host. We decide that the client must send the same queries from the same port and servers should discard queries from the same port of the same host.
You can specify the search string at the beginning of the searching. If you don't, servers recognize the query as the request of ACK and you can get the server list.
You should also specify the search type, which indicates the type of the search string: ``SUBSTR'' as the substring, ``REGEXP'' as the regular expression and ``SHELL'' as the filename expansion method of the bourne shell.
You should also specify the display type, which is the request to servers for items and format to send. You should use ``INFO'' to request information of each file, or should use ``URL'' to request the URL of each file.
We experimented in MBone-JP in Feb., 1995. Seven organizations: Kyushu University, Kyushu Institute of Technology, Nagasaki University, Osaka University, The University of Electro Communications, Chiba University and Kyushu Area Regional Research Network started a march server of anonymous FTP on each organization.
The march system uses the virtual network MBone to transfer the query packet. However, MBone is mainly constructed for audio and visual conference. This causes a problem.
On the current MBone, you can find a file on an anonymous FTP connected with thick networks in preference to a file on an anonymous FTP connected with thin networks. The current MBone constructed to let packets flow in the thick networks in preference makes a query packet reach march servers connected with thick networks.
However, when you are on hosts connected with thin networks you must use large TTL to search out of the thin networks and then you cannot select proper searching areas except very large ones.
The result of our experiment shows this problem. We can search an objective file closer to us in Kyushu area. However, we can find the closest objective file out of Kyushu area with only searching almost all of servers in Japan.
This is caused by the current MBone. The current MBone is a virtual tunnel network and does not reflect the topology of the Internet. Figure 6 shows the current JP-MBone.
Figure 6: JP-MBone connection as of Feb. 2, 1995.
With Archie, you will get the list of files that the server has and you cannot find the closest objective file. On the other hand, you can find the closest objective file with March. Table 2 shows the search result from Kyushu University with the search term ``kterm'' and the search type REGEXP.
Table 2: The number of found files with "kterm"
We found total 51 files at the experiment, where 27 files were without overlaps. Many files were found when TTL=127 and TTL=191 but the number of newly found files was less on TTL=191 than that on TTL=127. 20 files were found on TTL=191 but the number of the new files were only 6 and other 14 files were already found in the smaller searching area. We will be able to find the objective file with searching in the grade of organization or region if almost all of anonymous FTPs are search-able with March.
The searching time on a server depends on the size of the local database which depends on the size of the file ls-lR on the anonymous FTP. Figure 7 shows the relation between the size of ls-lR and the searching time. We measured the searching time with the search string ``term'' and the search type ``SUBSTR'', and timed on the HelioStation1030 (SPARC 2 clone) with the 80MB memory.
Figure 7: ls-lR file size and searching time
The searching time is in proportion to the size of ls-lR but
there are two lines that have different slants. The upper line is
expressed by
t = 1.98 * 10^{-3} * x + 1.52 * 10^{-3}
and the lower one is expressed by
t = 7.35 * 10^{-5} * x + 6.15 * 10^{-2}.
We are investigating the reason why the
relation between the searching time and the size of ls-lR is
approximated with two lines.
The increase of the number of files on anonymous FTPs is due to both the increase of the number of files on an anonymous FTP and the increase of the number of anonymous FTPs.
The increase of the number of files on an anonymous FTP should not make longer the search time of the local database. The increase of the number of files on an anonymous FTP will enlarge the local database. However, the search time of the local database will not be longer since computers will become more powerful.
The increase of the number of anonymous FTPs should not make the search time longer, too. The number of march servers will increase as the number of anonymous FTPs because each march server stands with the corresponding anonymous FTP. The increase of the number of march servers will bring following results.
In comparison, the database will be too large to have on the archie server because the database becomes larger in proportion to the increase of total number of files on anonymous FTPs.
The march system can correspond to the increase of the number of files with which the archie system cannot deal.
The World-Wide Web and the Gopher are the other typical super-dispersed information providing systems. We consider the searching of these kinds of information with the method.
To search information on the World-Wide Web, the robot system reflects the relation of data on the World-Wide Web better. However, the robot system makes a load on many servers by getting many data one after another. In addition, this system makes a load on the network by transferring whole data to the machine that the system works for searching. Super-dispersed information should be searched with a dispersed system of searching information.
To search information on the Gopher, the Gopher net is useful for searching like the robot, but a system to search on each Gopher server is required. Veronica system is to search Gopher space, but this system will overflow like archie because this system is centralized, too.
There is another problem. There are no copies on both systems but IP Multicast can lose packets. Therefore you may not find what you want.
In addition, you should be careful not to insert security holes when you implement a searching system with the method. You should use the method only for searching information.
We have proposed a searching method of super-dispersed information using the IP Multicast. Super-dispersed information is some kinds of information that have difficulty to search and that may overflow information searching servers. Existing information searching systems are not suitable to search super-dispersed information.
This method has the following merits: you can know the location of information closest to you; you can avoid to make a load on many searching servers that do not have the types of information you want; each server's database is small and the server's load is light.
We have developed the file searching system ``march.'' This system uses the IP Multicast to send a query to many servers at once. Servers' load is light because each server has the filenames on only an anonymous FTP. A user can search files in several areas using the mechanism of the IP Multicast.
Our future work is as follows:
The authors would like to thank researchers of WIDE Project and JAIN
Consortium for discussion, and staffs of Nagasaki University, Kyushu
Institute of Technology, Osaka University, Nara Institute of Science
and Technology, The University of Electro-Communications, Chiba
University, and KARRN Associations for experiments and discussions.
Acknowledgment
References
Author Information
Yoshiki Ishida
Zengo FURUKAWA
Kazuo USHIJIMA
Return to the Table of Contents