The net effect is that a certain percentage of references are invalid. We may expect this percentage to rise over time, as more and more documents become outdated and are eventually removed, services are shut down or moved to different servers, URLs are re-used, and so on. Clearly, it would be desirable to have support for automatically removing dangling references to deleted resources, or at least for informing the maintainers of the affected resources.
This paper presents a scalable architecture for the automatic maintenance of referential integrity in large (thousands of servers) distributed information systems. A central feature of the proposed architecture is the p-flood algorithm: a scalable, robust, prioritizable, probabilistic server-to-server protocol for the efficient distribution of update information to a large collection of servers.
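To illustrate the flooding idea, the following is a minimal simulation sketch of generic probabilistic flooding: each server that already holds an update forwards it to a few randomly chosen peers per cycle. The peer-selection rule, the parameter name `p`, and the function name are illustrative assumptions; the actual p-flood protocol's rules are not spelled out in this excerpt.

```python
import random

def flood_rounds(n_servers, p, seed=0):
    """Simulate one update spreading by probabilistic flooding.

    Each round, every server that already holds the update forwards it
    to p randomly chosen servers. Returns the number of rounds until
    all n_servers hold the update. (Illustrative sketch only; the real
    p-flood peer-selection rules differ in detail.)
    """
    rng = random.Random(seed)
    informed = {0}          # the update originates at server 0
    rounds = 0
    while len(informed) < n_servers:
        newly = set()
        for _ in informed:
            # each informed server picks p random targets this round
            newly.update(rng.randrange(n_servers) for _ in range(p))
        informed |= newly
        rounds += 1
    return rounds
```

Because the set of informed servers grows multiplicatively each round, the number of rounds needed grows only logarithmically with the number of servers, which is consistent with the simulation results summarized below.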
Extensive simulations of p-flood show that propagation delay grows only logarithmically with the number of servers, that spurious network errors have almost no effect on propagation delay and traffic, and that the algorithm is remarkably robust against persistent server faults.
The total additional traffic generated by p-flood is negligible compared to current Internet traffic; for example, the messages generated in distributing 10,000 object modifications per day to 1,000 servers account for about 0.005% of current Internet traffic (estimated from NSFnet statistics). Another attractive property of flooding is that the traffic is spread evenly across all servers: the traffic per server depends only on the number of updates, not on the number of servers.
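The per-server claim can be checked with back-of-envelope arithmetic: each server forwards each update to a fixed number of peers per cycle, so its daily send volume is proportional to the number of updates alone. The fan-out value and the message size below are assumed figures for illustration, not numbers taken from the paper.

```python
def per_server_traffic(updates_per_day, fanout_p, bytes_per_message=100):
    """Back-of-envelope estimate of flooding traffic per server.

    Each server forwards each update to fanout_p peers, so its daily
    send volume is updates_per_day * fanout_p messages. Note that the
    total number of servers does not appear anywhere in the formula.
    bytes_per_message (100 here) is an assumed illustrative figure.
    """
    messages = updates_per_day * fanout_p
    return messages, messages * bytes_per_message

# 10,000 updates/day with an assumed fan-out of 2: the per-server load
# is the same whether the system has 1,000 servers or 100,000.
msgs, nbytes = per_server_traffic(10_000, 2)
```

Total system-wide traffic does of course scale with the number of servers (every server must eventually receive every update), but that total is split evenly, which is exactly the property claimed above.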
The p-flood algorithm is currently being implemented in the Hyper-G system, but could in principle also be implemented as an add-on for existing WWW and Gopher servers.