Creating a Scalable Architecture for Internet Measurement
Andrew ADAMS <firstname.lastname@example.org>
Vern PAXSON <email@example.com>
Historically, the Internet has been woefully under-measured and under-instrumented. The problem is only getting worse with the network's ever-increasing size. In this paper we describe an architecture for facilitating a "measurement infrastructure" for the Internet, in which a collection of measurement "probes" cooperatively measures the properties of Internet paths and clouds by exchanging test traffic among themselves. The key emphasis of the architecture, which forms the underpinnings of the National Internet Measurement Infrastructure (NIMI) project, is on tackling problems related to scale. Consequently, the architecture emphasizes decentralized control of measurements; strong authentication and security; mechanisms for both that maintain tight administrative control over who can perform what measurements using which probes; delegation of some forms of measurement as a site's measurement policy permits; and simple configuration and maintenance of probes. While the architecture is general in the sense of not being tied to any particular measurement tools, it currently supports "TReno," "poip," and "traceroute" measurements.
Historically, the Internet has been woefully under-measured and under-instrumented. The problem is only getting worse with the network's ever-increasing size. The Internet's rapid growth and change have led to significant concerns for its continuing health and stability. In addition, new, more demanding applications are pushing the capabilities of the current Internet, while at the same time increased load and demand have resulted in an apparent degradation of the overall quality of the Internet. Consequently, the need to monitor the health of the global Internet has now become crucial.
The NIMI project (National Internet Measurement Infrastructure) is being undertaken to facilitate the development of a large-scale "measurement infrastructure" for the Internet. One of the principal roles for such an infrastructure is to measure "vital signs" of the network, such as throughput, latency, and packet loss rates.
The Internet desperately needs mechanisms to quickly identify and diagnose faults in network performance. There is a wealth of different types of problems that can affect network performance. For example, routing changes can increase delay or decrease available bandwidth; route flapping can cause packet reordering; and overloaded links or deficient router queue implementations can cause packet loss due to congestion. The NIMI project attempts to identify and diagnose faults through a series of active performance measurements [Pa96a] -- ongoing measurements between NIMI probes strategically positioned throughout the network.
A number of researchers are working on developing "beacon" or "probe" platforms that attempt to measure such vital signs for small groups of sites -- IPMA [La97]; Surveyor [A97]; and Felix [Hu97]. The key difference between these and the NIMI project is that the foremost goal of NIMI is to devise an infrastructure capable of instrumenting a very large network.
NIMI uses multiple synchronized NIMI daemons (nimids) as end-points for a set of measurement tools. An IP packet generator collects data along an Internet path between two nimids. The benefits of monitoring the global network are obvious; moreover, a focus on end-to-end measurements means sites using NIMI probes can "monitor" entire paths through the Internet, not just those portions of the infrastructure they directly control.
The focus of the NIMI project is to develop techniques for performing Internet measurement that are capable of scaling to full-Internet size. In order to reach global scale, we need to work in many dimensions. First, the measurements performed must be "high yield" -- they must extract a large amount of information from a very small network test. Second, tests must be coordinated to ensure low redundancy -- heavily aggregated links must be measured much less often than once for each path traversing such links. Finally, techniques for hierarchical management of the tests must be created which still ensure the overall security and safety of the resulting system. It is particularly important that the resulting system not have a single "root node" or central authority for managing the measurement. Like everything else in the Internet, the measurement controls and data must be automatically and dynamically distributed to ensure robustness in the face of both technical failures and political realities.
The IP Performance Metrics (IPPM) working group of the Internet Engineering Task Force (IETF) is in the process of defining a number of metrics that can be used to quantify performance along an Internet path. We expect that a key component of the work done by NIMI probes will be measurements of these standardized metrics for Internet performance. Among the metrics being standardized are Connectivity [MP97], Packet Loss [AK97b], Delay [AK97a], and Bulk Throughput [Ma97]. We are currently running implementations of each of these metrics on the NIMI probes. Once in place, NIMI probes will be able to measure baseline results for these metrics over various paths through the Internet.
Because Internet performance problems can be very specific to a particular path and because it can be very difficult to infer from single-link measurements the total performance that will be obtained from the end-to-end concatenation of many such links, it is very desirable to run end-to-end measurements (as opposed to measurements localized to a specific provider). End-to-end measurements also have the advantage of not requiring cooperation from all providers along a path. However, for fault diagnosis, additional NIMI probes positioned along the path will allow problems to be readily localized. In addition, widespread deployment of NIMI probes will allow for performance measurements of entire clouds. Cloud performance metrics are not yet well understood, but some important concepts are included in the IPPM framework draft [PAMM98].
Section 2 of this paper describes in detail the components of the NIMI architecture. Section 3 presents some early measurement data among a small testbed of four NIMI probes. Following this, we discuss some lessons learned and future plans for NIMI development.
The NIMI architecture is patterned after Paxson's Network Probe Daemon (NPD), which was used to perform the first large-scale measurements of end-to-end Internet behavior [Pa97, Pa96b]. The NPD experiment included 37 participating sites and measured more than 1,000 distinct Internet paths. The probe daemons listened for measurement requests using a custom NPD protocol. The requests would first be authenticated, an important step to prevent misuse of the NPD, using a challenge/response based on a shared private key and an MD5 cryptographic checksum. Once authenticated, the request would be served and a reply returned.
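A challenge/response exchange of this general kind can be sketched as follows. This is only an illustration under stated assumptions, not the NPD protocol itself: the function names, the 16-byte random challenge, and the way the key and challenge are concatenated before hashing are all hypothetical. MD5 appears only because the text names it; it is no longer considered adequate for new designs.

```python
import hashlib
import hmac
import os


def make_challenge() -> bytes:
    """Daemon side: produce a fresh random challenge (assumed 16 bytes)."""
    return os.urandom(16)


def compute_response(shared_key: bytes, challenge: bytes) -> str:
    """Requester side: prove knowledge of the shared key without sending it.

    Hashing key + challenge is an assumed construction for illustration.
    """
    return hashlib.md5(shared_key + challenge).hexdigest()


def verify_response(shared_key: bytes, challenge: bytes, response: str) -> bool:
    """Daemon side: recompute the checksum and compare in constant time."""
    expected = compute_response(shared_key, challenge)
    return hmac.compare_digest(expected, response)
```

Because the challenge is fresh for every request, a response captured from one exchange should not be accepted for a later one.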
Based on the experiences from the NPD project, the NIMI probe was designed with scalability in mind. First, the code should be lightweight and portable, so that NIMI can be run on different platforms and eventually migrated directly to hardware. Second, the probe must be dynamic, in the sense that new measurement suites can be added and existing suites tuned seamlessly. Third, the probe must be able to self-configure and function with a minimal amount of human intervention. Finally, the NIMI probe must be secure, so that privacy can be maintained when a measurement, or the area being measured, is sensitive.
To satisfy the first two conditions, the NIMI probe was segmented into three main sections: a communication daemon; a job scheduler; and a set of programs necessary to run the measurement suites. The communication and scheduling tasks are combined within the NIMI daemon nimid, while the measurement suites are external modules. This allows new measurement suites to be added without modification to nimid.
An important aspect of the architecture is that it completely separates the tasks of making measurements vs. requesting measurements vs. analyzing results vs. configuring probes. So far, we have only discussed the first of these, making measurements, and this is the sole role of NIMI probes -- they are "lean and mean" measurement devices. The task of requesting that a particular probe perform a particular measurement is done completely separately, as is the later task of analyzing the results of any measurement and the ongoing task of configuring and updating probes. All of these functions can be performed by separate computers (though the NIMI software package can be configured to support various subsets on a single computer).
In keeping with this separation of tasks, in addition to the measurement daemon, the architecture includes two management entities. The first, the Configuration Point of Contact (CPOC), allows a site to configure a collection of NIMI daemons in a single place. The second, the Measurement Point of Contact (MPOC), allows a coordinated set of measurements to be configured at a single location. Communications between the nimid and management entities (MPOC and CPOC) are shown in Figure 1. It is important to separate the MPOC and CPOC functions in order to allow a site to delegate partial control of its NIMI daemons to measurements being coordinated by one or more external MPOCs, while still maintaining ultimate control locally.
In order for a NIMI probe to function either as an independent entity obtaining one-way measurement information or as a part of an extensive mesh involved in generating point-to-point performance, the NIMI probe must first be given a set of policy rules and a list of measurement tools. To accomplish this, nimids communicate with each other by exchanging messages using TCP. A message consists of a "command" followed by any data necessary for the nimid to fulfill that command.
The nimid protocol includes a compact message set, shown in the table above. The BOOT_ME, ADD_ACL, and DEL_ACL messages are used in communications between a nimid and the CPOC to configure the control policy for the nimid. The HELLO, SCHEDULE_TEST, DEL_TEST, and TEST_COMPLETE messages are used in communications between the nimid and the MPOC to schedule and control measurement instances. The GET_DATA message is used by the MPOC, or by other authorized agents, to retrieve the results of specific measurement suites. Finally, the UPGRADE_SUITE and FORCE_UPGRADE messages are used between the CPOC and nimid: with UPGRADE_SUITE, a nimid requests a particular version of a measurement suite's software from the CPOC, and with FORCE_UPGRADE, the CPOC forces a nimid to download a new version.
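The message set can be summarized in a small lookup table. The sketch below is illustrative only: it assumes a decrypted message is a whitespace-separated text line whose first word is the command, which the text does not state, and the names `COMMAND_PEER` and `parse_message` are hypothetical.

```python
from typing import List, Tuple

# Which management entity each command is exchanged with, per the text.
COMMAND_PEER = {
    "BOOT_ME": "CPOC",
    "ADD_ACL": "CPOC",
    "DEL_ACL": "CPOC",
    "UPGRADE_SUITE": "CPOC",
    "FORCE_UPGRADE": "CPOC",
    "HELLO": "MPOC",
    "SCHEDULE_TEST": "MPOC",
    "DEL_TEST": "MPOC",
    "TEST_COMPLETE": "MPOC",
    "GET_DATA": "MPOC or other authorized agent",
}


def parse_message(line: str) -> Tuple[str, List[str]]:
    """Split a message into its command and the data that follows it.

    Assumes an already-decrypted, whitespace-separated text line.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty message")
    command, data = parts[0], parts[1:]
    if command not in COMMAND_PEER:
        raise ValueError("unknown command: " + command)
    return command, data
```

Rejecting unknown commands before any further processing keeps the message set closed, which fits the goal of a compact protocol.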
As mentioned earlier, the rules are passed, in the form of messages, to the NIMI probe by its Configuration Point of Contact (CPOC). A CPOC acts as the distributor of administrative guidelines for any NIMI probes that it controls (has entries for). These messages, which become entries in an Access Control List (ACL) table, consist of the public key identifier of that NIMI probe (KEY) and the particular measurement tool(s) or suite that the remote NIMI probe may request to run (SUITE). The message, now a tuple within the ACL table, authorizes the NIMI probe to communicate with a remote host concerning requests to participate in the task or measurement suite, provided that the remote host also sends a signature for the message that matches the public key.
Note that the authentication architecture is entirely based upon the possession of private keys corresponding to public/private key pairs. That is, anyone possessing a private key gains immediate access, via the ACL table, to all of the measurement activities associated with the key pair. Thus, administrative control of a site's measurement policy is entirely managed by determining to whom the site gives copies of the corresponding private keys.
The MPOC coordinates measurement suite tasks by sending messages. These messages include the measurement suite (SUITE), timing arguments (TIME), and arguments to be passed to the measurement suite scripts (ARGS). The name of the SUITE listed in the message corresponds to a script or executable that the NIMI probe can run. Upon completion of the task, the nimid will send a message containing the SUITE, TIME, and location of the data produced by the task (LOCATION) to the MPOC. The MPOC can then send a message to the participating nimid requesting the "data" generated from the measurement suite. Although this is not implemented yet, two refinements are easy to envision: the MPOC might not receive the data at all if the CPOC deems that data to be sensitive, and hosts other than the MPOC might be able to request the data. Such hosts could obtain the data if they have a copy of a private key that the CPOC previously instructed the NIMI probe to recognize as granting access to that particular type of measurement result (the CPOC does this by sending a corresponding ACL table entry).
To ensure that the MPOC can request a NIMI probe to participate in any measurement suite, the owner of the MPOC must manually contact the owner of the CPOC for the participating NIMI probe and request that an ACL entry be added for each different measurement suite that the MPOC will request of a participating NIMI probe. After the manual addition of the ACL entries on the CPOC, the CPOC would then be told (via a HUP) to contact the NIMI probe to update its ACL table. At this point (and only at this point), manual configuration is a feature and not a bug: it is precisely the fact that human intervention is required to alter the measurement policy used by a site's set of NIMI probes that provides a solid, human-based mechanism for ensuring that policy changes are always scrutinized (but, once accepted, they are automatically propagated).
A major component of the NIMI architecture is the expectation that a single NIMI probe will be used by multiple groups running distinct measurement suites. In order to support this, nimid must include a secure method for delegating authority for running certain measurements to other organizations. Three components make up the security infrastructure in NIMI. First, an ACL table specifies who is allowed to request each type of measurement. Second, secure authentication is used to compare measurement requests to entries in the ACL table. Finally, encryption is used to guard against sniffing of the NIMI protocol and of the possibly sensitive NIMI measurement results.
Host identification is accomplished through the use of public key/private key technology. Each NIMI probe randomly generates a unique private key. Similarly, private keys are used as part of the credentials used by MPOCs to authenticate measurement requests. NIMI probes use these credentials to check against ACL table entries for the particular measurement suite requested to determine if a measurement request is authorized.
The ACL table consists of one entry for each measurement authorized to a given key identifier. ACL entries can allow full authority to schedule a given measurement or limited authority to limit the rate and/or scope for a given type of measurement. A private key may correspond to a single MPOC or (in a more complicated measurement scenario) may be shared among several cooperating MPOCs.
With public/private key pairs in place, all communications are encrypted as well. This ensures that it is impossible for bystanders to monitor the NIMI protocol (for example, to know when tests might be scheduled) or capture results of tests when they are being retrieved for analysis.
NIMI authentication and encryption are implemented using the RSA reference library. Currently, the distribution of the public keys is done manually. Upon startup, a nimid sends a "BOOT_ME" message to its Configuration Point of Contact and in return receives a list of credentials that it uses to build its ACL table. Each credential contains the identifier for the public key of an MPOC and a measurement suite that the nimid has been authorized (by the CPOC) to participate with. Each time a nimid receives a message, it checks its ACL table to verify that the request is authorized prior to carrying out the instructions in the message.
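The ACL check described here amounts to a membership test on (key identifier, suite) pairs. The following is a minimal sketch under that reading; the class and method names are hypothetical, key identifiers are shown as placeholder strings, and signature verification and the rate or scope limits mentioned earlier are omitted.

```python
class AclTable:
    """Holds (key identifier, suite) pairs received from the CPOC."""

    def __init__(self):
        self._entries = set()

    def add(self, key_id: str, suite: str) -> None:
        """Record an authorization, as an ADD_ACL message would."""
        self._entries.add((key_id, suite))

    def delete(self, key_id: str, suite: str) -> None:
        """Withdraw an authorization, as a DEL_ACL message would."""
        self._entries.discard((key_id, suite))

    def is_authorized(self, key_id: str, suite: str) -> bool:
        """True only if this key may request this particular suite."""
        return (key_id, suite) in self._entries
```

A probe would consult `is_authorized` on every incoming request and act on the message only when it returns true.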
The second part of nimid, the scheduler, is responsible for acting upon information which the communication daemon receives from an MPOC. The scheduler queues each measurement task for execution at a specific time. At the appointed time, the scheduler executes the measurement task. Upon completion of the measurement task, the scheduler logs the event and will instruct the communication daemon to notify the MPOC of the task's completion.
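One simple way to queue tasks for execution at specific times is a priority queue ordered by start time. The sketch below uses a binary heap. It is an assumption about how such a scheduler could be organized, not a description of the nimid code, and the names `TaskQueue`, `add`, and `pop_due` are hypothetical.

```python
import heapq
import itertools


class TaskQueue:
    """Measurement tasks ordered by the time at which they should run."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so tasks with equal times keep insertion order.
        self._counter = itertools.count()

    def add(self, run_at: int, task: str) -> None:
        """Queue a task to run at the given time (seconds since the epoch)."""
        heapq.heappush(self._heap, (run_at, next(self._counter), task))

    def pop_due(self, now: int) -> list:
        """Remove and return every task whose start time has been reached."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, task = heapq.heappop(self._heap)
            due.append(task)
        return due
```

A scheduler loop would call `pop_due` with the current time, run each returned task, and then log the outcome and notify the communication daemon.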
The tools necessary for implementing the requested measurements are external to the nimid and need only exist on the proper location on the file system for nimid to use them. This enables the NIMI probe to use the most advanced versions of the networking measurement tools available without requiring a software upgrade to the daemon. To add any measurement tool simply requires the addition of the scripts/binaries on the participating NIMI probe and the addition of the measurement suite to the "data" field in any messages the MPOC and CPOC send to the participating NIMI probe.
To illustrate this, an MPOC sends the following message:
SCHEDULE_TEST http://nimi.psc.edu/ runtreno 98:02:14:06:00:00 nimi.fnal.gov
The communication daemon in nimid receives the message and then processes it as follows: it checks the header for the encryption type and public key needed for decryption; decrypts the message; reads, compares, and authenticates the COMMAND (SCHEDULE_TEST) and SUITE (runtreno) elements with the values in its ACL table; logs the message to disk; and informs the scheduler of the new task via an IPC call. After receiving the IPC message, the scheduler daemon reads the data from the communication daemon, converts the GMT to epoch time, and stores the task in memory, sorted by epoch time. When the epoch time for the task is reached, the scheduler parses the message for SUITE, TIME, and ARGS (the third, fourth, and fifth elements) and constructs a command line consisting of (for this example):
runtreno 98:02:14:06:00:00 nimi.fnal.gov
It then forks and execs the command line. The script "runtreno" is responsible for executing the appropriate measurement tool and writing the data to a predetermined area. After checking the return value of the child (the forked measurement tool), the scheduler logs the event and informs the communication daemon. The communication daemon notifies the MPOC that the task has been completed and gives it the location of the data. The MPOC can then send a message to request the data from the NIMI probe or even to inform an operator (human) that the data is ready, in which case the operator could fetch the data. Future versions of the NIMI probe will have data dispersal and acquisition built in.
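The conversion from the example message to an epoch time and a command line can be sketched as follows. This assumes the timestamp has the form YY:MM:DD:HH:MM:SS in GMT, which is inferred from the example rather than stated, and it treats the second field as opaque because the text does not spell out its role. The function names are hypothetical.

```python
import calendar
import time


def gmt_to_epoch(stamp: str) -> int:
    """Convert an assumed YY:MM:DD:HH:MM:SS GMT timestamp to epoch seconds."""
    return calendar.timegm(time.strptime(stamp, "%y:%m:%d:%H:%M:%S"))


def build_command(message: str):
    """Return (epoch start time, argv) for a SCHEDULE_TEST message.

    SUITE, TIME, and ARGS are the third, fourth, and later elements.
    """
    fields = message.split()
    if len(fields) < 4 or fields[0] != "SCHEDULE_TEST":
        raise ValueError("not a well-formed SCHEDULE_TEST message")
    suite, stamp, args = fields[2], fields[3], fields[4:]
    return gmt_to_epoch(stamp), [suite, stamp] + args


run_at, argv = build_command(
    "SCHEDULE_TEST http://nimi.psc.edu/ runtreno 98:02:14:06:00:00 nimi.fnal.gov"
)
```

Here `argv` holds the three words of the command line shown above. At time `run_at`, the scheduler could hand `argv` to a process-launching call such as `subprocess.run` in place of an explicit fork and exec.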
Currently, the NIMI probe uses Traceroute [Ja89], TReno [MM96] and Poip; however, the plug-in design of nimid allows any tool to be included as part of a measurement suite. Other groups are using different tools for measurements of bulk transfer, connectivity, delay, and packet loss -- all of which could be used by nimid. Some other tools that could be used by nimid include path bandwidth characteristics (à la pathchar [Ja97] or bprobe/cprobe [CC96]) and multicast performance metrics.
TReno (Traceroute Reno) [MM96] is a tool designed to measure bulk transfer throughput. The IETF IPPM working group is defining a metric for bulk throughput based on the TReno metric [Ma97]. TReno operates by sending a stream of UDP packets, to which the destination returns either ICMP TTL expired or ICMP port unreachable messages.
Poip (Poisson Ping) is a tool designed to measure one-way delay and packet loss characteristics of a path. It includes a considerable number of measurement integrity checks and is written using a generic "wire time" library that we hope will prove useful for other measurement tools that need to obtain accurate network timing information via a generic packet filter interface. The IETF IPPM working group is drafting a suite of metrics for connectivity [MP97], delay [AK97a], and packet loss [AK97b]. Poip is capable of measuring all of these metrics.
The first step in implementing a NIMI probe, to be used in either private or public performance measurements, is to acquire a platform that can support the software. Currently, we are using inexpensive Pentiums with a moderately sized (2GB) hard disk drive and a network interface card to allow access to the Internet. The only stipulation is that the peripherals must be supported under a BSD Unix O/S. A "sample configuration" can be found at http://www.ncne.net/pub/nimi/pc-specs.html.
Next, the software package (ftp://ftp.ncne.net/pub/tools/nimi.tgz) must be installed. The "INSTALL" document covers the few steps necessary to implement a NIMI probe on the Internet. After compiling the sources, a private and public key must be generated using tools supplied in the NIMI distribution. The CPOC must be chosen, and public keys for the CPOC and nimid must be exchanged. A short configuration file which specifies the name of the nimid (actually, the identifier for its public key) and the identifier and IP address for the CPOC must be created. The NIMI probe can then be started.
The administrator of the CPOC must prepare an ACL table for the new nimid which delegates authority for running specific measurements to one or more MPOCs. Similarly, the administrators for the MPOCs must prepare suitable measurement suites for the new nimid. In the future, we expect to provide GUI tools to make these management functions easier.
Upon startup, nimid will contact the CPOC and request its ACL entries. Once the nimid starts to build its ACL table, it contacts the MPOC for any tasks (measurement suites) that match those permitted by its ACL table. If the contacted MPOC has tasks for the NIMI probe, the MPOC will then send the participating NIMI probe task messages, which will be scheduled and subsequently executed by the NIMI probe.
The nimid program, in its current form, includes all three components: the NIMI probe, MPOC, and CPOC. Based on command line options and data in its file system, it can be run in any (or all) of these capacities. Future releases of the NIMI package will include separate CPOC and MPOC programs which are more fully functioned and a leaner, meaner nimid which can be more widely deployed.
In our initial testing, NIMI probes have been implemented on four Unix platforms at LBL, Fermilab, and PSC. The following table shows connectivity of the four probes.
The path between Fermilab and LBNL uses only ESnet. The ESnet sites choose one of several connections between ESnet and the vBNS to reach psc1. There is a high-speed local path between psc1 and psc2. Finally, psc2 uses a commodity Internet connection to MCInet to reach the ESnet sites.
Measurements among the four probes included traceroutes run in a full mesh, TReno runs from each probe to all other probes at random times during each hour, and a full mesh of continuous poip tests sending packets at random intervals with an average spacing of five seconds between packets.
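Random send times with a given average spacing can be generated by drawing the gaps between packets from an exponential distribution, which yields a Poisson process. The text says only that the intervals are random with a five-second average; the exponential choice is an assumption suggested by the name "Poisson Ping," and the function name is hypothetical.

```python
import random


def poisson_send_times(mean_spacing: float, duration: float, seed=None):
    """Return send times in [0, duration) with exponential gaps.

    mean_spacing is the average gap in seconds (five in the tests above).
    """
    rng = random.Random(seed)
    rate = 1.0 / mean_spacing  # expovariate takes the rate, not the mean
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# One hour of send times averaging one packet every five seconds.
schedule = poisson_send_times(5.0, 3600.0, seed=1)
```

Exponential gaps are memoryless, so the probing does not lock in step with periodic behavior in the network.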
We now turn to phenomenological analysis of some of the measurements made to date. We emphasize that this analysis is superficial. We present it here to convey the flavor of the possibilities of more detailed analysis.
The traceroute tests show that the routes over the high performance networks (the vBNS and ESnet) appear quite stable, though from [Pa96b] we know that to definitively establish their stability requires finer-grained measurements than per-hour. Also, traceroutes to and from the probe psc2 show some variation in routes through the provider's border routers to the commodity Internet.
The TReno and poip tests turned up several interesting performance anomalies on the paths tested. As might be expected, several of the paths showed standard diurnal variations in bandwidth availability. The higher speed paths generally had low packet loss and high bandwidth availability. The commodity paths, shown in Figure 3, had low bandwidth availability during daylight/evening hours and higher bandwidth available in the late night/early morning hours. (In Figure 3, the ticks on the X axis are at 00:00 GMT on that particular day. The bandwidth drops drastically at roughly 11:00 AM EST and ramps back up at about midnight EST). We have not yet developed an explanation for the striking symmetry of the available bandwidth along the two directions of the FNAL/PSC2 path.
The high performance backbones do significantly better. Figure 4 shows the TReno results between the ESnet sites FNAL and LBNL. Here, we can see some of the power of active probing techniques. It appears that there was a two week period where performance problems on ESnet resulted in a significant drop in available bandwidth; or, perhaps, the network was configured differently during that time, such that its raw capacity changed (this more naturally explains why the change was observed bidirectionally). On February 18, however, the problem was cleared and the bandwidth returned. This graph also demonstrates a limitation of end-to-end techniques: it is impossible to determine from this graph whether the problem lay within the ESnet backbone or was local to the campus network of LBNL or FNAL (but see below).
Figure 5 shows the TReno results from psc1 to the ESnet sites. Here, we see good performance between psc1 and fnal, but lower performance between psc1 and bip. We also see a significant improvement between psc1 and bip on February 18, as above. The performance between psc1 and fnal also drops noticeably at the same point in time, which would fit with some sort of reallocation of capacity within ESnet. In particular, with this additional data, we can conclude that the problem between FNAL and LBNL in Figure 4 did not lie within FNAL. This is still somewhat unsatisfying; certainly the addition of a few more sites on ESnet and the vBNS would greatly aid in drawing stronger conclusions about network performance within ESnet and the vBNS.
The poip data provides detailed information on packet loss and other aspects of the network. This data was analyzed using natalie, a tool to pair up data from poip senders and poip receivers and to report interesting events. Several paths saw brief changes in TTL of packets received, reflecting short-lived routing changes. One path saw a long-lasting reduction of TTL by one hop, reflecting a longer-lived change.
As might be suspected from the performance problems seen in the TReno data, the poip data showed significant packet loss. We aggregated the raw poip data to generate packet loss data across a portion of the time interval of our experiments. Since TReno also reports packet loss percentages, we included both TReno and poip results on the same plot. Figure 6 shows the packet loss between psc2 and bip as reported by both TReno and poip for a short portion of the testing interval. There is quite good correlation between the packet loss data reported by TReno and poip. On the one hand, this gives us confidence in the measurement tools. On the other hand, [Pa97] found that loss rates experienced by packets sent at rates that adapt to the presence of congestion (as does TReno) have a considerably different distribution than those sent at nonadaptive rates (as does poip). Consequently, the close correlation merits further investigation. (The times of high packet loss also correlate well with the times of low performance as reported by TReno, as we would expect.)
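Aggregating raw sender and receiver records into loss rates over time can be sketched as follows. The record format is an assumption made for illustration (sequence numbers mapped to send times on the sender side, and a set of received sequence numbers on the receiver side); it is not the actual poip or natalie format, and the function name is hypothetical.

```python
from collections import defaultdict


def loss_by_interval(sent, received, bin_seconds):
    """Return {bin start time: fraction of packets lost in that bin}.

    sent:     dict mapping sequence number -> send time in seconds
    received: set of sequence numbers seen by the receiver
    """
    totals = defaultdict(int)
    lost = defaultdict(int)
    for seq, send_time in sent.items():
        bin_start = int(send_time // bin_seconds) * bin_seconds
        totals[bin_start] += 1
        if seq not in received:
            lost[bin_start] += 1
    return {b: lost[b] / totals[b] for b in sorted(totals)}
```

Binning by send time rather than arrival time means a lost packet, which has no arrival time, is still counted in the interval in which it was sent.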
Beyond the packet loss information, poip contains a wealth of additional data which we are developing tools to analyze and report. Eventually, this will include detailed information on delays, outages, and bottleneck bandwidth characteristics of Internet paths.
The NIMI project is proceeding in several stages. The first stage, Mark I NIMI, is the prototype version used to develop the communications protocol and architecture of the NIMI system. The Mark I NIMI is mostly written in "Perl." This version is now essentially complete and has accomplished its goal of prototyping the system and, in particular, the security and messaging protocols for NIMI. The experience of developing Mark I NIMI has also shown us the road to many improvements on the system, which we plan to include in Mark II NIMI.
The major change from NIMI Mark I to NIMI Mark II will be to recode all of the Perl components into C. We expect this to result in major performance improvements for the communications. The Mark II version of NIMI will include a full implementation of the NIMI messaging protocol, as well as additional security options to address the tedious US encryption export requirements.
From a management and operational standpoint, several improvements are clearly needed. Manual distribution of measurement modules is time consuming and easy to get wrong. Mark II NIMI will include the UPGRADE facility which will allow for quick distribution of improved measurement modules to running NIMI probes. Improvements will also be made in the area of error messages and logging, enabling error messages to be accumulated (as appropriate) by the MPOC and CPOC platforms, rather than being stored locally to each NIMI probe. Finally, several features for wildcarding and aggregating control information will be added which will both simplify operation of the CPOC and MPOC platforms and improve the efficiency of the NIMI messaging protocol.
The next major improvement, to be delivered with a later version of Mark II NIMI, is the addition of a simplified interface to schedule measurement suites. Under Mark I NIMI, the messages that schedule measurement suites are manually entered on the MPOC. For large suites of measurements, we have written custom programs to generate the schedule of tests using brute force, but this clearly does not scale since we need mechanisms for concisely representing measurement durations and intervals. An ideal tool for this would be a GUI client that would allow you to choose the measurement suite, arguments, and timing information (i.e., start intervals) and then automatically pass this information to the MPOC. We hope to progress on Mark II NIMI fairly quickly, since most of the work is relatively well understood and straightforward at this point.
Mark III NIMI will pioneer more research-oriented developments for the system. To this point, all measurement suites are human controlled. Choices of which measurements to run and on what schedule to run them are determined by the MPOC operators. In order to scale to really large sets of NIMI probes, MPOCs must be able to learn topology information and choose collections of end-to-end measurements which make sense together for a given set of NIMI probes. NIMI probes must be able to aggregate measurement requests from multiple MPOCs in order to gain economies of scale in this dimension as well. Data analysis engines will also need to have real smarts; they must be able to find and aggregate the data from a collection of NIMI platforms in order to answer specific questions about the health of the network. Mark III NIMI will pioneer these areas of further automation and attempt to buy several orders of magnitude more scaling of the NIMI measurement platform.
This work has been supported through funding by National Science Foundation Grant No. NCR-9711091 and by the Director, Office of Energy Research, Office of Computational and Technology Research, Mathematical, Information, and Computational Sciences Division of the United States Department of Energy under Contract No. DE-AC03-76SF00098.
The authors are grateful to David Martin for providing the NIMI probe at Fermilab and to Craig Leres for assistance in configuration of the NIMI probe at LBL. Valuable input for the NIMI project has come from many sources, including David Martin, Les Cottrell, Guy Almes, Van Jacobson, and Craig Labovitz.