Frank Kappe (fkappe@iicm.tu-graz.ac.at)
This paper presents a scalable architecture for automatic maintenance of referential integrity in large (thousands of servers) distributed information systems. It is assumed that a link database at every server maintains local consistency and keeps track of incoming and outgoing links to other servers (the server's surface).
A central feature of the proposed architecture is the p-flood algorithm, which is a scalable, robust, prioritizable, probabilistic server-server protocol for efficient distribution of update information to a large collection of servers.
3 An Architecture for Referential Integrity
The net effect of this is that a certain percentage of references are invalid. We may expect this percentage to rise as time goes by, since more and more documents become outdated and are eventually removed, services are shut down or moved to different servers, URLs get re-used, etc. Obviously, it would be desirable to have some support for automatically removing dangling references when a resource is deleted, or at least for informing the maintainers of the referring resources.
Let us assume that we maintain a link database at every server that keeps track of all the links involving this server, i.e. links that emanate from and/or point to a document residing on this server. Storing the links outside of the documents in a link database (as is done in the Intermedia [8] and Hyper-G [2][12] systems, and as was recently proposed for the WWW [16]) will not only give us an efficient solution for the dangling link problem (as we will see); it also enables more advanced user interfaces for navigation in the information space, such as local maps and location feedback [1], and allows links between arbitrary document types.
Interestingly, the more primitive Gopher [13] system does maintain referential integrity in the local case. When a document (which is an ordinary file on the server's file system) is deleted (or moved or modified), the menu item that refers to it (which is a directory entry) is updated as well. This is automatically taken care of by the underlying operating system (unless one tries really hard to break it, e.g. by using UNIX symbolic links). References to remote servers remain unprotected, however.
While both Gopher and WWW scale quite well with respect to the number of servers, documents, and links, there is a scalability problem with respect to the number of users. When a large number of users for some reason decide to access the same document at the same time, the affected server and the network region around it become overloaded. This phenomenon (Jakob Nielsen calls it a "flash crowd" [15] after a 1973 science fiction story by Larry Niven) was observed during the 1994 Winter Olympics in Lillehammer, where the Norwegian Oslonett provided the latest results and event photographs over the Web and drowned in information requests. Similar but smaller flash crowds appear when a new service is announced on the NCSA "What's New" page or in relevant newsgroups.
This problem may be alleviated by the use of cache servers, which keep local copies of recently requested information and serve subsequent requests for the same information from the local copy instead of fetching it again from the originating site. This strategy does not work, however, in two cases:
In the Hyper-G system a database engine is employed to maintain meta-information about documents as well as their relationships to each other (this includes, but is not restricted to, links). Since the links are stored in this database and not in the documents themselves, and since modifications of documents or their relationships are only possible via the Hyper-G server, referential integrity can easily be maintained for local documents. The link database makes links bidirectional, i.e. one can find the source from the destination (as well as vice versa). In order to keep this useful property when a link spans physical server boundaries, both servers store the link information as well as replicas of the remote document's meta-information. This means that all updates related to the documents and the link in question have to be performed on both servers in order to keep the web consistent, thus requiring an update protocol between servers.
However, the multi-server transaction has scalability problems in certain situations. When for some reason many servers (say, 1000) decide to refer to a specific document (e.g., by pointing a link to it or by replicating it), all of them have to be informed and acknowledge the update before it can be performed. This not only increases network traffic and slows things down considerably, but it also requires that all servers involved be up and running, or the transaction cannot be completed. As the number of participating servers increases (and given the unreliability of the Internet), the probability that all of them are reachable approaches zero. This means that it becomes practically impossible ever to modify a heavily-referenced object.
This scheme is well-suited to applications where objects are read frequently and updates happen only infrequently. The Sun Yellow Pages (YP) service (nowadays known as NIS) is an example of a master/slave system.
The central master server also makes it easy to resolve conflicts between update requests and maintain consistency. The obvious disadvantage is that the master server has to be up and running in order to perform updates. Otherwise, this scheme scales very well (provided that we have a good way of propagating updates from master to slaves).
This requires that a server knows about all the other servers that also have copies (let us call this set of servers the server-set). In a perfect world, all the copies would be identical, but because of network failures and for performance reasons it may not be possible or desirable to immediately notify all servers of an update. We may instead adopt a looser form of consistency (weak consistency), in which all copies eventually converge to the same value at some time interval after the updates have stopped.
However, one still wants to be sure that all read requests are based on up-to-date copies and all updates are performed on the latest version. The trick which ensures this is majority consensus: updates are written to a (random) majority of the server-set (more than 50%). Before every read or write operation, the server that is in charge of performing the request contacts some other servers of the server-set and requests the object's version number (or modification time) to identify the current version. When a majority has answered, at least one of them has the current version, because any two majorities have at least one member in common.
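To make the cost of this scheme concrete, here is a minimal Python sketch of a majority-consensus read. The helper fetch_version (returning an object's version number, or None if the server is unreachable) and the server_set argument are names assumed here for illustration, not part of any actual system described in this paper.

    import random

    def majority_read_version(server_set, fetch_version, object_id):
        """Contact servers until more than 50% have answered; the highest
        version number seen is then guaranteed to be the current one,
        since any two majorities share at least one member."""
        majority = len(server_set) // 2 + 1
        versions = []
        for server in random.sample(server_set, len(server_set)):
            v = fetch_version(server, object_id)  # None if unreachable
            if v is not None:
                versions.append(v)
            if len(versions) >= majority:
                return max(versions)
        raise RuntimeError("could not reach a majority of the server-set")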
The advantage of this algorithm is its robustness: There is no single point of failure and it works even in the face of failure of almost 50% of the server-set. The downside of it is again scalability: The server-set for any object must be known to all members of the server-set, and more than 50% of the set has to be contacted before every write and even read operation. If the set contains, say, 1000 servers, we have to get a response from 501 of them!
This requirement may be relaxed for read operations if we are willing to accept weak consistency. Still, it is mandatory for write operations to ensure that no conflicting updates can occur.
Harvest heavily relies on replication to achieve good performance. The indexes created by the Gatherers are periodically replicated to the Brokers. Since the indexes tend to be large, this has to be done efficiently.
Harvest uses a technique called flooding for this purpose. Rather than having a Gatherer send its indexes to all Brokers, they are sent to only k of them (e.g., k = 2). It is then the responsibility of the k chosen nodes to distribute the indexes to another k each, and so on. While of course the total number of indexes that have to be transferred remains the same, flooding has the nice property of distributing the network and server load over the whole network.
The flood algorithm used by Harvest is called flood-d [7]. Flood-d tries to minimize the network cost and propagation time of the flood by computing a "cheap", k-connected logical update topology based on bandwidth measurements of the underlying physical network. An important requirement was that this topology should not need manual configuration (like, for example, the Network News [10]), but should be computed and updated automatically. Finding a good approximation of the optimal topology is computationally expensive, however (finding the optimum is even NP-complete), especially when the replication group becomes very large. The authors [7] therefore suggest using a hierarchical scheme of smaller replication groups. However, it is left open how this hierarchy can be found and updated automatically.
Figure 1 illustrates this situation. The hyperweb is partitioned by server boundaries (the servers are labeled A, B, and C in the figure). Links which span server boundaries are shown as thicker edges. We will call these links surface links, and documents connected to other servers by such links shall be called surface documents (the others are called core links and core documents, respectively). Although not apparent from Figure 1, a server's surface will typically be small compared to its core.
In order to keep the useful property of bidirectional links, the link information of surface links must be stored in both affected servers. For increased performance, the servers also keep replicas of the other surface document's meta-information. In Figure 1, server A stores document 1 plus a replica of document 2's meta-info and the link between them, while server B stores document 2 plus replicas of documents 1's and 3's meta-info and the links from 1 to 2 and from 2 to 3.
In this setup, documents on different servers are interconnected as tightly as the documents on a single server. The bidirectional links enable more advanced navigation techniques (the link map shown in Figure 1 can actually be computed and shown to the user), but also simplify maintenance of the hyperweb: when I choose to remove document 2, the system can inform me that this will affect document 1 on server A and document 3 on server C (among others on server B). I may either use this information to manually modify the affected documents and links, or let the system ensure automatically that at least the links from 1 to 2 and from 2 to 3 are removed as well.
The problem which remains is how to inform the other servers that document 2 has been removed. As already mentioned, an earlier implementation of Hyper-G used the knowledge about what documents are affected to directly engage the other servers in a multi-server transaction, in order to remove document 2 and all links to and from it. As was also discussed earlier, this approach has problems when many servers must participate in the transaction (because many links point to the document).
Therefore, we decided to adopt a weak consistency approach, whereby we accept that the hyperweb may be inconsistent for a certain period of time, but is guaranteed to converge to a consistent state eventually. Of course, we would like to keep the duration of the inconsistency as short as possible.
Like in the master/slave model, updates may only take place at a well-defined server. Unlike the master/slave model, this server is not the same for all operations, but depends on the document or link being modified (or removed or inserted): for documents, it is the server which holds the document; for links, it is the server which holds the document where the link emanates (in our example, server B would be responsible for updates of document 2, while the link from 1 to 2 would be updated by server A). This reduces the problem of overload of the master, while eliminating the problem of conflicting updates (they are handled one after the other). One disadvantage remains: the master server must be available at update time. However, since for security reasons users wishing to update document 2 must have write permission for document 2 (this is checked by server B, which holds document 2), this fact is inevitable and we have to live with it anyway.
Updates of core documents or core links require no further action (integrity is maintained by the local link database). However, other servers need to be notified of updates happening at a server's surface (i.e. updates of surface documents or surface links). We chose to use a flood algorithm similar to the one employed by Harvest to propagate updates from the master to the slaves (i.e. all other servers), because of its scalability (the traffic generated does not depend on the number of references to the object in question), because it does not require that the recipients are available at update time, and because it can be used for other purposes as well (like distributing server addresses and statistics, and maintaining the consistency of replicas and caches).
4 The p-flood Algorithm

The p-flood algorithm is a probabilistic algorithm which fulfills the above requirements. Figure 2 illustrates its behavior. The servers are arranged in a circle (for example by sorting them according to their Internet address; see section 6.1 for a discussion of how this can be done in a better way). Every server knows all other servers (updates of the server list will of course be transported by the algorithm itself).
Servers accumulate update messages, which are generated either by the server itself (as a result of modification of a surface document or surface link) or are received from other servers, in their update list. Once in a while (every few minutes) the update list is sent to p other servers (p >= 1). We will call this time period a step of the algorithm. For p=1, updates are sent only to the immediate successor; otherwise they are also sent to p-1 other servers chosen at random. If p is fractional, the last of these messages is sent only with probability equal to the fractional part of p. For example, p=1.3 means that one message is sent to the successor, and another is sent with probability 0.3 to a random server; p=3.2 means that the list is sent to the successor and to two other random servers, plus to one further random server with probability 0.2.
Figure 2 shows one step of the p-flood algorithm with p=1.5. Note that at every step the operations described above are performed by all servers in parallel, i.e. within the step time period every server performs one step (the clocks of the servers do not have to be synchronized). We may observe that at every step p*n update lists are sent (n being the number of servers). The higher the value of p, the shorter the time it takes to reach all servers, but the higher the amount of traffic generated (since the same message may be received more than once by some servers). The algorithm in principle allows the assignment of different values of p to individual messages, so we may call p the priority of the message.
After a message has successfully been transmitted to a server's immediate successor, it is removed from the sending server's update list and not sent again to any server in future steps. This ensures that messages are removed after they have been received by all servers and keeps the update lists relatively short. Messages are time-stamped using a per-server sequence number, so that duplicates can be discarded and messages can be processed in the correct order by the receiver.
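The following Python sketch summarizes one step as described above. It is a simplified illustration, not the actual Hyper-G implementation: the Message and Server types and the send callback (assumed to return True once the receiver has acknowledged receipt, e.g. over TCP) are hypothetical names introduced here.

    import random
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Message:
        origin: int  # server that generated the update
        seq: int     # per-server sequence number (the "time-stamp")

    @dataclass
    class Server:
        name: int
        update_list: list = field(default_factory=list)
        last_seq: dict = field(default_factory=dict)  # origin -> highest seq

    def receive(server, messages):
        """Discard duplicates (and stale messages) by sequence number."""
        for msg in messages:
            if msg.seq > server.last_seq.get(msg.origin, -1):
                server.last_seq[msg.origin] = msg.seq
                server.update_list.append(msg)  # forwarded in later steps

    def p_flood_step(i, servers, p, send):
        """One step of p-flood as performed by servers[i]."""
        me, successor = servers[i], servers[(i + 1) % len(servers)]
        # floor(p)-1 random targets, plus one more with probability frac(p)
        extra = int(p) - 1 + (random.random() < p - int(p))
        others = [s for s in servers if s is not me and s is not successor]
        for target in random.sample(others, min(extra, len(others))):
            send(me, target, list(me.update_list))  # best effort
        # Messages leave the update list only once the successor has
        # acknowledged them; if it is down, they simply wait.
        if send(me, successor, list(me.update_list)):
            me.update_list.clear()

Note how p controls both the fan-out per step and, through the random targets, the robustness discussed next.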
What happens when a server is down or unreachable? Since a message must not be discarded from the update list until it has successfully been sent to the successor (we assume that a reliable transport protocol like TCP is used and that receipt is acknowledged), the message will effectively wait there until the successor comes up again. Almost immediately after that, the accumulated update messages will be sent. In a way, every server is responsible for delivering messages to its successor. The penalty is that when a server is down for a long period of time, its predecessor's update list grows.
Setting the priority p=1 (send messages only to the successor) will effectively block update messages in case of an unreachable server and is therefore not feasible. Higher values of p not only speed up the propagation of the messages significantly, but also contribute to the robustness of the algorithm. In the example of Figure 2, a crash of server B would not prevent update messages from server A from being propagated to the other servers.
A few extensions of p-flood are necessary for real use. They are described later in section 6, in order not to burden the reader with additional complexity at this point.
5 Simulation Results

This section presents data gathered by running extensive simulations of p-flood. We will first concentrate on "perfect world" simulations (i.e. all servers are reachable) and then look at the effect of network and server faults.

5.1 The Behavior of p-flood in the Perfect World

In this first set of experiments we want to find out how weak exactly our weak consistency is, i.e. how long it takes to arrive at a consistent state after updates have stopped, how this time depends on the number of servers and the priority factor p, and how much traffic is generated over time.

Figure 3 gives us a feeling of how p-flood performs. It is assumed that m update messages have been generated at the n different servers before the simulation starts, and we watch their propagation to the other servers, in particular how long it takes until they arrive there. It turns out that it does not matter whether all m updates are made on a single server or whether they are distributed randomly over the n servers, but random placement gives smoother curves, so I have chosen this method for producing the graphs.
The top graph shows how the update information is propagated to the 1000 servers, using different values of p. A higher value of p gives faster propagation, e.g. at p=2 and n=1000, 50% of the servers are reached after about 4 steps, 99% after 7 steps, and the last one is typically updated after 10-13 steps. The price for faster propagation is a higher load on the servers and networks: the middle graph shows the average size of the update list held at each server, and the bottom graph shows the traffic in messages that is sent at each step.
Since every message has to be sent to every server at least once, every algorithm that delivers every message to every server will need to transmit at least m*n messages, so we will call this number the optimum traffic. Under perfect-world conditions, the total traffic sent by p-flood is p*n*m messages, or p*optimum. The point is that the flood algorithm distributes this traffic nicely over time and over the whole network, as opposed to the trivial solution where every server simply sends all its updates to all other servers (which requires only optimum messages to be sent). The lower the value of p, the more network-friendly the update.
Clearly, there is a tradeoff between fast propagation and peak network load. Figure 3 suggests that a good setting of p is somewhere between 1 and 2.
Figure 4 demonstrates the remarkable scalability of p-flood with respect to the number of servers. The time to reach 50% and 99% of the servers is plotted against the number of servers. (The time to update 100% of the servers is of course always an integer value and subject to a great deal of randomness. The 99% value can be interpolated and is less affected by randomness, so I decided to use this value to get smoother curves in the graphs. The 100% value is about 3-6 steps higher.) The logarithmic performance of p-flood is clearly visible, meaning that p-flood is well-suited for use in the context of very large server groups.
Figure 5 plots the propagation delay (again, the time to reach 50% and 99% of the servers) versus the priority p for a constant number of servers (n=1000).
5.2 Network and Server Failure

Since one of the major requirements for our flood algorithm was robustness with respect to network and server failures, we will now take a closer look at this issue.

First, let me make the distinction between soft or transient errors and hard or persistent errors. Soft errors are temporary network errors due to routing problems or network overload. This type of error occurs rather frequently in the Internet, but usually only for a short period of time. Hard errors last longer (e.g., as a result of a hardware failure) but fortunately happen less frequently.
Figure 6 shows the effect of soft errors on the propagation delay and the traffic generated. The propagation delay (i.e. the time to reach 50%/99% of the servers) increases only slowly with increased soft error rate. A soft error rate of 10% means that in every step a randomly chosen 10% of the update propagations fail (10% of the servers are unreachable). In the next step, another random set of 10% fails, but it is unlikely that the two sets are identical.
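In terms of the step sketch from section 4, this error model amounts to wrapping the transport so that each delivery attempt fails independently with the given probability. This is again only an illustration; soft_error_send and the receive helper it wraps are hypothetical names.

    import random

    def soft_error_send(receive, error_rate=0.10):
        """Return a send callback where every delivery attempt fails
        independently with probability error_rate, so a different
        random set of propagations fails in every step."""
        def send(src, dst, messages):
            if random.random() < error_rate:
                return False  # transient failure; the sender retries later
            receive(dst, messages)
            return True
        return send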
The bottom graph of Figure 6 shows how traffic increases with increased soft error rate. The set of messages that is sent increases, but the number of acknowledged messages (i.e. those that are sent successfully) remains constant. Since p-flood itself detects duplicate messages and messages that arrive out of order, we could in principle use an unreliable protocol like UDP to transmit messages and acknowledgments. However, UDP is very much subject to soft errors (e.g., dropped packets). Using TCP, the transport protocol repairs a large number of soft errors itself. When a server is temporarily unreachable, this will usually already have been detected while opening the connection, and the messages will never be sent in this case. This means that for TCP the "messages sent" graph is not significant, i.e. the number of messages actually sent is constant with respect to soft errors.
Hard errors are usually described by the two variables Mean Time Before Failure (MTBF) and Mean Time To Repair (MTTR) (Figure 7). The uptime, i.e. the fraction of time the server is up, is then defined as

  uptime = MTBF / (MTBF + MTTR)
In our simulations, we will measure MTBF and MTTR in units of steps. For MTTR = 1 we have soft (transient) errors; larger values of MTTR mean that the same server is down for a longer period of time. It is expected that server uptime will be well over 90% (this is already a bad value; it means that a server will be unreachable for 2.4 hours per day). In the beginning, MTTR/(MTBF+MTTR) of the servers are marked as down, with the time they remain down chosen randomly between 0 and MTTR. The others are assigned a time they remain up between 0 and MTBF. It is assumed that the servers which are down also carry update messages (they could have been accumulated before the servers went down; the servers could be running but unreachable from the outside). During the simulation, servers that come up remain up for MTBF steps, those that go down remain down for MTTR steps.
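A minimal sketch of this initialization and up/down cycle, with names chosen here purely for illustration:

    import random

    def init_states(n, mtbf, mttr):
        """Start as described above: a fraction MTTR/(MTBF+MTTR) of the
        servers begins down, with remaining down-time drawn uniformly
        from [0, MTTR]; the rest get an up-time drawn from [0, MTBF]."""
        states = []
        for _ in range(n):
            if random.random() < mttr / (mtbf + mttr):
                states.append(["down", random.uniform(0, mttr)])
            else:
                states.append(["up", random.uniform(0, mtbf)])
        return states

    def tick(state, mtbf, mttr):
        """Advance one step; thereafter servers alternate between
        MTBF steps up and MTTR steps down."""
        phase, remaining = state
        if remaining > 1:
            state[1] = remaining - 1
        else:
            state[:] = ["down", mttr] if phase == "up" else ["up", mtbf]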
In Figure 8, uptime is held constant at 90%, and MTTR varies between 1 and 100. The top graph shows the effect on the propagation delay. Because of the probabilistic nature of p-flood there is almost no effect on the remaining servers, which is why the time to reach 50% of the servers remains almost constant. Of course, there is an impact on the time to reach 99% of the servers (because only 90% are available), which grows linearly with MTTR. The number of messages sent also grows linearly. This time the number of messages which are sent and acknowledged (i.e. the ones that would be sent when TCP is used) also increases, but only slowly.
5.3 Traffic Estimates

Let us now try to estimate the (additional) amount of network traffic that would be caused by applying the described architecture to a distributed information system. In order to do so, we assume n = 1000 servers and m = 10,000 update messages generated per day.

Then, the total number of messages sent per day is p*optimum, with optimum = n*m = 1E7 messages (every message shall be delivered to every server). A message is a few bytes long (say, 10). At p=1.5, we would generate network-wide traffic of 1.5E8 bytes (150 MB) per day, or 4.5 GB per month.
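A quick back-of-the-envelope check of these numbers in Python (the values are the ones assumed above):

    n = 1_000        # servers
    m = 10_000       # update messages generated per day
    p = 1.5          # priority
    msg_bytes = 10   # assumed size of one message

    optimum = n * m                        # 1e7 messages per day
    total_msgs = p * optimum               # 1.5e7 messages per day
    mb_per_day = total_msgs * msg_bytes / 1e6
    print(mb_per_day)              # 150.0 (MB per day)
    print(mb_per_day * 30 / 1e3)   # 4.5   (GB per month)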
On the other hand, the NSFnet transmitted about 22,462 GB throughout November 1994 [14] (unfortunately, correct newer values are not available). If we assume that 25% of the whole (non-local) Internet traffic passes through the NSFnet (i.e. the whole traffic is about 90,000 GB/month), this means that the update messages of our information system would cause an additional 0.005% of network traffic; in other words, the effect is negligible.
Another nice property of p-flood is that the traffic is distributed evenly over the whole set of servers, which means that the amount of traffic seen by a single server does not increase when the number of servers is increased (as has been shown before, the total Internet traffic generated does depend on the number of servers). It only depends on the number of updates of surface objects. Of course, it is possible to operate servers in an isolated fashion, in order to avoid being bothered by any of this traffic at all.
Since every update message has to be propagated to every server, the nature of p-flood can be compared to the USENET news service. However, the messages are much smaller than news articles. The numbers given are for perfect-world performance; consult Figures 6 and 8 to see the effect of soft and hard errors on network traffic.
6 Extensions of p-flood

The description of p-flood in section 4 was a bit simplistic for the purpose of understanding the simulation results. However, the following details have to be addressed when actually implementing p-flood:
6.1 Arranging the Servers
A potential weakness of p-flood is its random usage of logical Internet connections, without knowledge of the underlying physical network. There is no preference of fast links over slow ones, as in flood-d. On the other hand, random selection of flood paths propagates the updates faster than the cost-based selection [7], which tends to use the same links again and again.
However, p-flood chooses its propagation paths in both non-probabilistic (the immediate successor) and probabilistic (among the other servers) ways. The amount of randomness is controlled by the p parameter. Since for reasonable values of p (see the simulations in section 5) most traffic runs over the static circle of servers (see Figure 2), clever arrangement of servers in this circle can vastly reduce network cost and delay, without giving away the advantages of fast propagation and robustness by random choice of some of the flood paths.
Computing the optimal circle using actual bandwidth measurements would be difficult, since it would require gathering a fully connected matrix of bandwidth between servers. Furthermore, the measurements would have to be repeated quite frequently, because global network utilization changes with the time of the day. Hand-configuring is not considered an option. Therefore, we choose a more pragmatic, heuristic approach:
Servers are sorted according to their reversed fully-qualified domain name. Server i then is the successor of server i-1, and the first server is the successor of the last one. Sorting by reversed domain name (i.e. by last character first) results in all servers in, for example, Belgium (domain .be) being neighbors in the circle, followed by the servers in Germany (domain .de), and so forth. Within Germany, the servers located in, e.g., Berlin will be neighbors (domain -berlin.de). Since in most cases local connectivity is cheaper and faster than international connections, this simple trick will result in better use of the available bandwidth (unfortunately, host names in the US domains (.edu, .com, .gov, .mil, .net) in general do not give any hints on the host's geographical location, with the exception of the new .us domain). No computations (other than sorting) and no measurements are necessary.
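The sort key is simply the reversed hostname string; a minimal sketch (the example hostnames, apart from the author's own server, are made up):

    def circle_order(hostnames):
        """Sort servers by reversed fully-qualified domain name, so that
        hosts in the same country/city end up as neighbors in the circle."""
        return sorted(hostnames, key=lambda h: h[::-1])

    hosts = ["info.iicm.tu-graz.ac.at", "www.uni-berlin.de",
             "ftp.tu-berlin.de", "server.ugent.be"]
    print(circle_order(hosts))
    # ['server.ugent.be', 'www.uni-berlin.de', 'ftp.tu-berlin.de',
    #  'info.iicm.tu-graz.ac.at']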
6.2 Updating the Server List

During propagation of server list updates (addition and removal of servers, moving a server to a different host) it is important that a server uses its old server list for flooding, until the message has been acknowledged by its successor. Simple modifications of server attributes (e.g., description, Internet address, e-mail of administrator) do not require such precautions.
6.3 Recovery after Catastrophic Events
When operating a large number of servers it may and will happen (not too often, hopefully) that catastrophic events occur which result in loss of information (for example a head crash on the server's disk). In such cases, operation needs to be resumed from a backup copy of the information base. If the backup copy is i days old, then the restarted server has lost all its updates of the last i days. This is inevitable, though.
However, other servers may also have a now obsolete picture of our server's surface. For example, somebody may have recently (less than i days ago) created a new document in our server, with a link pointing to (or from) another document on another server. The document has now disappeared and of course this link also has to be destroyed in order to keep a consistent state. In other words, the other servers also have to roll back to the situation i days ago. In such a situation, our server may flood a special message that contains its whole surface (documents and links), thus requesting all other servers to check this picture against their view of our server, and adjust their information about our server accordingly.
Server B can detect this inconsistency when the update message from server A eventually arrives, since it requests creation of a link to a non-existing object. It may now flood a "document removed" message for this non-existing object, as if it had been on the surface.
Alternatively, we may choose to live with such (rare) inconsistencies for a while, and have all servers periodically flood their whole surface, like after a catastrophic event (section 6.3). This would serve as a fall-back mechanism that deals with all kinds of inconsistencies and errors, including unforeseeable hardware and software errors in the update server. Since these messages may be rather long, they should be sent infrequently and with low priority. The exact time and priority will have to be determined when we have a feeling of how often such problems occur.
The architecture described and the p-flood algorithm are now being implemented in the Hyper-G system, and integration into next-generation Web servers is planned.
Dr. Frank Kappe is the director of the Hypermedia Group at the Institute for Information Processing and Computer Supported New Media (IICM) at Graz University of Technology, Austria. As such, he is coordinating the software development of the Hyper-G project. He is also affiliated with the Institute for Hypermedia Systems of JOANNEUM RESEARCH (an Austrian non-profit research organization).
His research interests include distributed information systems, networking, hypermedia and multimedia applications.
More info can be found at http://info.iicm.tu-graz.ac.at/CTU-5060.Pers.fkappe.