Building a Working Directory Service with Whois++

Patrik Falstrom
Tele2, Sweden

Some statements:

Global searches

Current technologies such as the Domain Name System (DNS) and X.500 use a given, well-known path to an object in the directory hierarchy when searching for that object. The same methodology is used in a normal hierarchical file system. If one does not know the path, one is forced to do a global search on all nodes below the one where the search is initiated. What is needed is an index-based system in which a node, instead of announcing only its "local name" at its place in the hierarchy, announces an index over the data that it and its children have stored.

When doing a global search, it is not permitted to miss any positive hits. To keep the index as small as possible, however, false positives (referrals to nodes that turn out to hold no match) might have to be accepted. The important thing is to cut off the parts of the tree where no hits can be found as early as possible.
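The pruning idea can be illustrated with a minimal Python sketch. It is not the Whois++ or CIP algorithm: the Node class, the word-level index, and the server names are assumptions made for illustration, and the index here is exact, so it never produces the false positives a smaller real index might.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical directory server holding records and child servers."""
    name: str
    records: list[dict[str, str]] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)

    def index(self) -> set[str]:
        """Words found in this node's own records and in all its children.

        Recomputed on every call to keep the sketch short; a real server
        would cache the index and refresh it when data changes.
        """
        tokens = {
            word.lower()
            for record in self.records
            for value in record.values()
            for word in value.split()
        }
        for child in self.children:
            tokens |= child.index()
        return tokens


def search(node: Node, term: str, visited: list[str]) -> list[dict[str, str]]:
    """Return matching records, descending only where the index has the term."""
    term = term.lower()
    if term not in node.index():
        return []  # prune: nothing below this node can match
    visited.append(node.name)
    hits = [
        record for record in node.records
        if any(term in value.lower().split() for value in record.values())
    ]
    for child in node.children:
        hits.extend(search(child, term, visited))
    return hits


umea = Node("umea", records=[{"name": "Anna Berg", "city": "Umea"}])
lund = Node("lund", records=[{"name": "Olle Ek", "city": "Lund"}])
root = Node("root", children=[umea, lund])

visited: list[str] = []
hits = search(root, "berg", visited)
```

In this example the search visits only the root and the "umea" server; the "lund" subtree is cut off because its index does not contain the search term.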

Global searches also have to work on attribute names that are unknown to the indexing node; that is, the index itself has to be self-contained, carrying all the information the indexing server needs.

Work is under way in the FIND working group of the Internet Engineering Task Force (IETF) to define the Common Indexing Protocol (CIP), which is one candidate for a truly general indexing protocol that can be used for a number of different directory services. Currently CIP is used only in the Whois++ protocol, but experiments combining CIP with the Lightweight Directory Access Protocol (LDAP) are being performed at Umeå University in Sweden.

Data quality

When publishing information in a central system like Yahoo or Lycos, there is always a risk that the data just "sits there" and is not updated as it should be. To raise the accuracy, it is better to let the data stay as close as possible to its authoritative source, and to export only an index that helps other servers give referrals to that source. That way, all fetches are made at the authoritative source, which helps keep the quality of the service high.

Schema management

When setting up a local directory service server at a site, it is very common for the site to need some extra attributes in a given schema. That can be handled by defining a new object type, but propagating that schema definition and making sure that clients understand it can be a very hard task. It is better if the data sent from the directory service server is in textual format, so that the client can choose to display the raw data. This is the method chosen in the Multipurpose Internet Mail Extensions (MIME) standardization, so why not use that method in directory services as well?

Also, the cooperation between servers in the index hierarchy should not stop working just because some new attributes are created on a few leaf servers. Any new attributes defined have to be searchable on a global level (see Global Searches above).
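As a rough illustration of why a textual format helps, the following Python sketch shows a client that shows friendly labels for the attributes it knows and falls back to the raw attribute name for anything else. The "attribute: value" line layout, the KNOWN_LABELS table, and the Parking-Slot attribute are assumptions made for this example, not the exact Whois++ response syntax.

```python
# Labels this hypothetical client knows how to present nicely.
KNOWN_LABELS = {"name": "Name", "email": "E-mail"}


def parse_record(text: str) -> list[tuple[str, str]]:
    """Split 'attribute: value' lines into (attribute, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        attribute, value = line.split(":", 1)
        pairs.append((attribute.strip().lower(), value.strip()))
    return pairs


def render(pairs: list[tuple[str, str]]) -> list[str]:
    """Use a friendly label when one is known, else show the raw attribute."""
    return [
        f"{KNOWN_LABELS.get(attribute, attribute)}: {value}"
        for attribute, value in pairs
    ]


raw = "Name: Anna Berg\nEmail: anna@example.org\nParking-Slot: B12\n"
rendered = render(parse_record(raw))
```

The site-specific attribute is still displayed even though the client has never seen its schema, so a leaf server can add attributes without breaking existing clients.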

Dynamic mesh

The indexing hierarchy must be very dynamic. The architecture of the directory service cannot be such that one directory service server can participate in only one mesh at a time, because most organizations need both an organizational and a geographical hierarchy. Take a multinational company as an example. It wants one hierarchy internally in the company (organizational), but at the same time each office in each country wants to participate in the local hierarchy of that country. When a search starts at the top node of the company, it is done worldwide, but only within the company. When a search starts at the top node of a country, it is limited to that geographical area, including the parts of the company that happen to be there.

Some of these problems can be solved with "pointers" or "aliases." Unfortunately they do not work very well, and they address only some of the problems that arise when one server wants to participate in more than one hierarchy. For example, most implementations of "aliases" are one-way only, like soft links in Unix. The target node keeps no reference count for the aliases that point to it, so there is no way of knowing which objects refer to that particular node.

The only solution must be that one server can have more than one "true" parent.
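A small Python sketch of such a mesh follows, under the assumption that each server simply keeps lists of its parents and children. The Server class and the server names are hypothetical and do not represent how Whois++ stores this information.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A hypothetical directory server that may have several true parents."""
    name: str
    parents: list["Server"] = field(default_factory=list)
    children: list["Server"] = field(default_factory=list)

    def add_child(self, child: "Server") -> None:
        """Link both directions so the child knows every parent it has."""
        self.children.append(child)
        child.parents.append(self)


def reachable(top: Server) -> set[str]:
    """Names of all servers a search started at `top` could reach."""
    seen: set[str] = set()
    stack = [top]
    while stack:
        server = stack.pop()
        if server.name in seen:
            continue  # already reached through another parent
        seen.add(server.name)
        stack.extend(server.children)
    return seen


company = Server("acme")            # organizational top node
sweden = Server("se")               # geographical top node
stockholm = Server("acme-stockholm")
boston = Server("acme-boston")
university = Server("umu")

company.add_child(stockholm)
company.add_child(boston)
sweden.add_child(stockholm)         # second true parent for the same server
sweden.add_child(university)

company_scope = reachable(company)
sweden_scope = reachable(sweden)
```

Unlike a one-way alias, the Stockholm server here knows both of its parents, and a search reaches it from either top node while staying within that hierarchy's scope.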

Minimize directory services

When starting to use a directory service, one normally talks only about white-pages services. But very soon ideas arise for yellow-pages, certificate, document information, contact information, ranking, and other services as well. When all of these ideas settle down, one starts to have nightmares about all of the databases one has to keep track of.

Because it can be hard enough to make sure that the data in all of these databases is up to date, one should be able to store all of this information in the same directory service. In the directory service, it should be possible to store records of different types. It might also be that one server keeps track of the certificate of an organization, another its address, and a third a ranking of the company.

It should be possible to start a search on a global level for information about a company and get back all three of these blobs of information, perhaps from three different sources. Global searches across several schemas are therefore a necessity.

Internationalization

Internationalization is a very difficult problem. Even the word itself has been misused a lot in the last couple of years. What Whois++ tries to solve is the ability to use as many different characters as possible (by using Unicode as the base character set) and the ability to have alternative attribute values for one attribute depending on what language the client asks for.

First, a record can use any of the characters in the Unicode character set. The client can still query in any of the character sets supported by the server, but the query is translated into a fully decomposed Unicode string before the actual search is done.
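The translation step can be sketched in Python with the standard unicodedata module. This assumes that "fully decomposed" corresponds to Unicode canonical decomposition (NFD); the function name to_search_form is hypothetical.

```python
import unicodedata


def to_search_form(query: bytes, charset: str) -> str:
    """Decode a client query and decompose it (NFD) before matching."""
    return unicodedata.normalize("NFD", query.decode(charset))


# The same word sent in two different client character sets.
latin1_query = to_search_form(b"Ume\xe5", "latin-1")
utf8_query = to_search_form("Ume\u00e5".encode("utf-8"), "utf-8")
```

Both queries end up as the same five-code-point string (the letters "Umea" followed by a combining ring above), so the search does not depend on which character set the client used.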

Second, a record can have one attribute in several languages. If the client asks for French, the French attribute values are given back (or the default ones if no French values were defined). This makes it very easy for a company or person that publishes information to publish records in several languages.
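The fallback rule can be expressed as a short Python sketch. The nested-dictionary record layout, the "default" key, and the localized function are assumptions made for illustration, not the Whois++ representation.

```python
# Hypothetical record: each attribute maps language tags to values,
# with "default" used when the requested language is missing.
Record = dict[str, dict[str, str]]


def localized(record: Record, language: str) -> dict[str, str]:
    """Pick the requested language per attribute, else the default value."""
    return {
        attribute: values.get(language, values["default"])
        for attribute, values in record.items()
    }


record: Record = {
    "title": {"default": "Sales office", "fr": "Bureau de vente"},
    "phone": {"default": "+46 90 000 00 00"},
}

french = localized(record, "fr")
german = localized(record, "de")
```

The fallback is applied per attribute, so a publisher only needs to translate the attributes where the language matters.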