Abhaya S. INDURUWA <firstname.lastname@example.org>
Peter F. LININGTON <email@example.com>
John B. SLATER <firstname.lastname@example.org>
University of Kent at Canterbury
JANET, the U.K. academic and research network, has, over almost three decades, grown from a small research network to a major national resource with a vastly expanded user base that now expects an appropriate level of service. Performance targets for the network are set in negotiation between the supplier and representatives of the funding bodies; they are documented in a set of service level agreements. Quality of service measurements are needed to monitor adherence to these agreements, to predict future behavior, and to assist in the formulation of strategy. This paper describes the measurement framework and gives examples of its use to resolve problems and guide policy.
The Joint Academic NETwork (JANET) in the United Kingdom has developed over nearly three decades from the Coloured Book protocols to the Internet Protocol (IP), from primarily leading edge to primarily production, and from primarily expert and research driven to mass, multiservice use in an expanded, mainly university based, community. The organization tasked with managing the delivery of the network was originally part of the community, but a number of pressures in staffing and technical expertise led to the formation in 1993 of a separate company, the U.K. Educational and Research Network Association (UKERNA). Management of the interface with the client sites and government as funding bodies was left in the control of the Joint Information Systems Committee (JISC), a body with representatives from the four constituent national parts of the U.K. HE sector and the research bodies.
In late 1993, the funding body advertised for an organization to develop service level agreements (SLAs) for the service to be provided by UKERNA. These would be renegotiated and reviewed on an annual basis. They would be regularly monitored and audited from time to time on a sampling basis. Most monitoring information was to be provided by UKERNA but there should be some independent monitoring and the ability to form an independent view on network performance. A further requirement was the ability to give technical advice to the funding bodies to enable efficient development of the network, marrying funding to technology. JANET was to continue to develop some innovative services and so the successful bidder was required to have appropriate up-to-date strategic knowledge. The University of Kent was awarded the contract and established the Technical Advisory Unit (TAU).
The first task was to devise the SLAs and reporting data and formats by the start of the contract on 1 April 1994. Thereafter each year has seen an expansion of the network in number of nodes and capacity and an improvement in service levels. A value for money exercise took place in 1996/7. SLAs have been substantially extended, especially in the areas of customer support and liaison. This has been essential because the number of sites has grown to include many without significant internal technical expertise. The current user population is more than a million and the number of primary sites is 220.
With ever-increasing use of the services without a corresponding funding stream, there has been a need to establish priorities and deploy alternative funding mechanisms including charging. Quality-of-service issues have played an increasing role in the evaluation of options. It has been essential to assure government that public funds are being deployed against a clear need, backed by sound and reliable trend data. The importance of adequate performance information for the Internet has been highlighted by many groups, including network providers, users, and funding bodies.
Accordingly, a simple but powerful measurement system has been developed. Trends have been analyzed over long time periods. Nevertheless, the system can also be used to help identify and isolate immediate problems, to counteract inaccurate rumor, and to address some "what if?" queries. This paper describes the system and gives a number of examples of its use. These measured results are backed up by periodic user surveys to check the user perception of service quality.
The monitoring activity consists of a series of independent experiments, operated from a set of measurement machines distributed around the United Kingdom. There are six of these machines, which are all Unix workstations and which run independently, but are coordinated by a harness that manages the experiments. The machines are placed to give good geographical coverage, but are on sites that are sufficiently well connected to the core of the network to avoid distortion of the measurements by local bottlenecks.
The machines are all managed from the University of Kent, and both system updates and changes to the experimental program and schedule are automatically distributed from the control site using the standard rdist utility.
The data collection functions are more complex. If there are network problems, the remote monitoring machines may be cut off from the controlling site for protracted periods. The harness that runs the experiments is responsible for collecting their results and combining them into a local staging area. The results for the full set of experiments are then retrieved nightly when possible. A checking process at the central machine confirms the completeness of data for each experimental trial and, if there are no gaps, merges them into a consolidated archive. If part of the data has not yet been returned, archiving is delayed, possibly for a number of days. If, when a machine finally transfers its backlog, there is still data missing, the archive is completed by adding failure records. As far as possible, this process is automatic; it is illustrated in figure 1.
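The consolidation logic just described can be sketched as follows. This is an illustrative reconstruction, not the actual harness code: the record format and the names `staging_records`, `expected_trials`, and `max_delay_days` are assumptions.

```python
def consolidate(staging_records, expected_trials, days_waited, max_delay_days):
    """Merge one experiment's trial records into the archive.

    staging_records: dict mapping trial id -> result record retrieved
    from the remote machines so far.  expected_trials: the full set of
    trial ids scheduled.  Returns the consolidated list, or None if
    archiving should be delayed to wait for missing data.
    """
    missing = [t for t in expected_trials if t not in staging_records]
    if missing and days_waited < max_delay_days:
        return None  # part of the data not yet returned: delay archiving
    archive = []
    for trial in sorted(expected_trials):
        if trial in staging_records:
            archive.append(staging_records[trial])
        else:
            # the machine's backlog has arrived but data is still missing,
            # so complete the archive with an explicit failure record
            archive.append({"trial": trial, "status": "failure"})
    return archive
```

The key property is that the archive for a trial is only ever written once, and is always complete: either every record arrived, or the gaps are marked explicitly as failures.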
The definition of a new experiment thus only requires the creation of a script giving details specific to the particular measurement task. Data handling and archiving are performed in a common way by the harness, and use of a common style for results records simplifies general aspects of analysis and summarization.
Measurements are taken from the six sites of performance and accessibility of other sites, key network components, and central services. Measurement frequencies are low so that the process does not generate a significant network load. The service level applies to the service delivered at the point of entry to the site. However, this point is generally inaccessible for routine measurements. Accordingly, the measurement machine is placed close to the main entry router, both logically and physically, and any service level test that is passed with this overhead must, inter alia, also be acceptable overall. The measurement machine is usually in the same room as the main site-entry router and is monitored for Kent by the local computing service. There have been few machine problems.
A typical experiment runs every 15 minutes, 24 hours a day. The raw results from each of the six sites are retained in the archive to allow specific sites to be excluded from processing if a problem in the immediate vicinity of one of the measurement points is suspected, but most analysis is based on a unified view produced by merging all the available measurements for a given time. Each experiment thus typically produces 576 data points in a 24-hour period. The volume of data varies from experiment to experiment but totals about 20 MB per day. The daily data are compressed before being stored in an archive, which currently stands at about 3.5 GB.
The majority of the measurements taken sample availability and cross-network delay. Internet Control Message Protocol (ICMP) echo probes are used, but rather than the standard ping utility, a modified program has been produced to reflect the criteria in the network service level agreements. These state that an access route is only to be considered unavailable if six successive packets, spaced at ten-second intervals, are all lost, and so this is the behavior used for measurement.
Measurement using ICMP probes immediately raises questions about possible bias. One issue involves the symmetry of routes for request and response, as discussed in , but the number of end points involved here makes the installation of equipment at all points of interest uneconomic, and, since the network is known to be substantially symmetric, symmetry is assumed. Another issue is the extent to which there are route initialization effects at the start of a group of measurements. This question can be answered by looking at the distribution of the number of failures before a success is reported. The measurement tool reports both the round trip time of a success and the number of preliminary failures that go before it. Three kinds of failure can be expected: failures during route initialization; random failures from, for example, congestion; and runs of failures as a result of observing the ends of longer periods of unavailability. The first would increase the observed rate of success after one or two failures, the second would lead to an exponential distribution, and the third to a constant distribution.
The observed distribution for about four million trials in the first quarter of 1998 is shown in figure 2.
These figures should be compared with a probability of complete unavailability in this period of 0.012 per trial. A least squares fit to these data gives a probability of 0.008/trial of initial loss, 0.002/packet of random loss, and 1.7×10⁻⁵/sec of resumption of service.
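A fit of this kind can be reproduced by regressing the observed counts against the three expected components: a term concentrated at the first failure (route initialization), an exponentially decaying term (independent random loss), and a constant term (the ends of longer outages). The sketch below uses synthetic illustrative counts rather than the 1998 data, and fixes the per-packet loss ratio of the exponential component for simplicity.

```python
import numpy as np

# trials observing exactly k preliminary failures before a success,
# for k = 1..5 (illustrative numbers, not the measured distribution)
k = np.arange(1, 6)
counts = np.array([35600.0, 1100.0, 650.0, 605.0, 600.5])

p_random = 0.1  # assumed per-packet random loss ratio

# design matrix: one column per failure mechanism
X = np.column_stack([
    (k == 1).astype(float),    # route-initialization failures
    p_random ** k,             # independent random losses (exponential)
    np.ones_like(k, float),    # ends of longer unavailability periods
])

# ordinary least squares: coefficients give the weight of each mechanism
coeffs, *_ = np.linalg.lstsq(X, counts, rcond=None)
```

In the real analysis the loss ratio would itself be estimated (a nonlinear fit); the linear version shown is enough to separate the three mechanisms once that ratio is known.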
On this basis, the probability of incorrect assessment of availability from one test site is less than 1 in 10⁴, and in any case, the results from all six sites are combined. The loss of some packets while routes are being established can thus be seen to lead to a small background level of retrial in the tests, but much below the figures for congested networks reported below.
The primary purpose of the measurement system is to provide regular data to check that the service level agreements are being met. However, the data can also reveal otherwise undetected operational problems. As one example, routine review of the round trip delay profiles for individual sites showed that, at one particular site, the time of day variation was the opposite of that normally expected; delays were short during the day, but became very long in the early hours of the morning. Further investigation showed a problem with the network equipment supporting that link, such that the last packet fragment sent was not flushed into the network. This had little effect at busy times, because there was generally following traffic before long, but at night the longer idle periods gave rise to longer gaps. Once the problem had been identified, it was corrected rapidly.
To detect such incidents, average latency is calculated by site from time to time. The sites with particularly long latencies are examined and, in case of doubt, the network operators are asked to investigate further. The distribution of latencies observed for the network as a whole highlights the different technologies involved (figure 3).
The secondary peak in this curve arises from the connection of some small sites to regional concentration points at somewhat lower speed than used in the core of the network.
The U.K. academic network is quite large, and its users are sufficiently dispersed for a variety of beliefs about the level of service achieved and the causes of problems observed to become common currency. In particular, some sections of the user community may feel that they are not receiving their fair share of service or investment. Impartial monitoring can help to dispel such beliefs.
For example, there has in the past been a belief that certain technologies or suppliers are inherently less reliable than others, or that institutions with a teaching focus receive less support than those with a research focus, or that there are widespread networking problems where, in fact, the problem is local to a particular site. Comparative analysis of the measured performance helped to demonstrate that there was rather little difference in reliability among a variety of wide area technologies. To compare provision at different institutions, average access latencies were calculated for all the sites in the U.K. academic community, and the comparison shown in figure 4 was generated.
This figure shows the logarithm of the latency for each site, plotted against that site's rank when sorted by increasing latency. This presentation would show, as a change in gradient, the boundary between different parts of the population if the service they received differed significantly.
Examining the figure, it is clear that there is a difference between ranking numbers above and below about 240. However, when the individual sites are identified, this point corresponds precisely with a break between sites which are connected to the network in their own right and those associated secondary sites that are connected via one of the primary sites using local means, often via locally funded lower performance links. The analysis shows that the primary recipients of the network service are fairly served.
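The ranking presentation used in figures 4 and 12 can be generated along the following lines. This is an illustrative sketch (the plotting calls are omitted in favor of returning the plotted series); the idea is simply that a sub-population receiving a different level of service shows up as a change of gradient.

```python
import math

def rank_series(latencies_ms):
    """Sort per-site latencies into increasing order and return
    (rank, log10 latency) pairs.  Plotted, a uniform population gives
    a smooth curve; a break in gradient marks the boundary of a
    sub-population with a significantly different level of service."""
    ordered = sorted(latencies_ms)
    return [(rank + 1, math.log10(ms)) for rank, ms in enumerate(ordered)]
```

The same function applies unchanged to traffic volumes in place of latencies, which is how the fairness comparison for transatlantic usage is produced.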
Another important motivation for the measurement program is to support the planning for network expansion. Analysis of current usage helps to predict future growth and identify changes in the way users exploit the network. In recent years, a major focus for this activity has been the links across the North Atlantic, which have been felt to be the main causes for concern. For large parts of most years, these links have been operating at or near their capacity, at least for the west to east flow. This makes it difficult to use traffic levels as an indicator of demand; when new capacity is provided, it generally fills within a matter of days. Figure 5 shows the variation of traffic with time of day in late January 1999, with a nominal installed transatlantic bandwidth of 90Mbps.
The other information available is from the regular measurement of round trip time and packet retry rate (which is closely related to packet loss rate, as long as one is operating in a regime where any initial effects can be neglected). The next two figures (6 and 7)
give examples of these measurements. They both show a distinct saw-tooth profile, which matches the upgrade history of the set of links (shown in figure 8). However, the delay graph (figure 6) tends to a constant upper bound, since the buffering available in the routers feeding the critical links is limited, and so packets will not be delayed for more than a time given by the buffer size divided by the bandwidth (isolated periods with much longer delays result from cable faults with fallback to satellite paths). The retry graph (figure 7) has a more pronounced triangular shape since the packet loss increases with excess user demand, even when saturation is reached.
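The upper bound on delay follows from simple arithmetic: a packet cannot queue for longer than the buffer drains. The figures below are illustrative assumptions, not the actual router configuration or link capacity.

```python
# Worked example of the queueing-delay bound: buffer size / bandwidth.
# Both numbers are assumed for illustration only.
buffer_bytes = 1_000_000        # assumed buffer at the router feeding the link
link_bps = 45_000_000           # assumed link capacity (45 Mbps)

max_queue_delay_s = buffer_bytes * 8 / link_bps   # about 0.18 s
```

Whatever the real values, the bound is fixed by hardware, which is why the delay graph flattens at saturation while the retry graph keeps rising with excess demand.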
Retry rate, therefore, gives the best indicator available to us of loading in an overloaded network. However, the average retry rate hides significant time of day variation. For planning purposes, a better indicator than simply the average is needed to take more account of peak loading. This is provided by calculating the fraction of time during which given congestion levels are exceeded. To do this, the average retry rate for each hour is calculated, and the number of hours in each week in which given thresholds are passed is determined. We take a retry rate of one-sixth of a retry per packet as a reference, since this corresponds roughly to significant degradation of user service for activities such as Web browsing. (Outside holiday periods, Web access has accounted for a steady 65% of west to east traffic across the Atlantic for the past three years . This number is strikingly similar to the results quoted in  for the U.S. backbone.) The graph of the fraction of hours in the week during which this threshold is passed is given in figure 9.
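The congestion indicator just described can be sketched as follows. The input format, timestamped per-trial retry rates keyed by week and hour, is an assumption made for illustration.

```python
from collections import defaultdict

# reference threshold: one-sixth of a retry per packet, roughly where
# user service such as Web browsing degrades significantly
THRESHOLD = 1.0 / 6.0

def congested_hours_per_week(samples):
    """samples: iterable of (week, hour, retry_rate) trial results.
    Averages the retry rate over each hour, then counts, per week,
    the hours whose average exceeds the reference threshold."""
    sums = defaultdict(lambda: [0.0, 0])
    for week, hour, rate in samples:
        cell = sums[(week, hour)]
        cell[0] += rate
        cell[1] += 1
    hours = defaultdict(int)
    for (week, _hour), (total, n) in sums.items():
        if total / n > THRESHOLD:
            hours[week] += 1
    return dict(hours)
```

Averaging per hour before thresholding is what lets the metric reflect peak loading rather than being washed out by quiet nights.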
A compact representation of network growth can be extracted from this graph and the known upgrade history. Following each increase in capacity, congestion drops and then rises again as demand increases. The point where, for a given capacity, congestion first passes eight hours a day (the dotted line in the graph) is a significant milestone for traffic growth as it corresponds to a full working day, and these milestones can be plotted to illustrate the time at which the capacity was fully stretched. This is done, on a logarithmic scale, in figure 10.
Up until the start of 1997 this graph showed a steady exponential growth, with demand doubling every seven months. At that time, there was little evidence as to when demand would drop below the exponential curve, and the predictions were an important factor in establishing national policy for the resourcing of further growth.
At the same time, similar analysis of traffic on our links to the remainder of Europe showed a doubling of demand every 14 months, so there was a clear case for giving priority to additional transatlantic bandwidth. This policy was followed, and there has been a steady investment in new capacity.
On the basis of this level of growth, even given that the curve is no longer the same exponential, but has dropped back to a doubling every 12 months, it was clear that central funding could not keep pace with growing demand. There was a need for either some form of control or some other source of revenue. Usage-related charging for transatlantic traffic was introduced in the summer of 1998, although at less than the full economic cost.
Some insight into the way users respond to consistent network congestion can be gained from figure 11, which shows the results in a different way.
This displays the packet retry rate for traffic across the North Atlantic as a false-color density function of time of day from midnight to midnight and calendar time with green representing no congestion and dark red heavy congestion. In the earlier parts of this strip, a regular sequence of behavior can be seen. The working day is congested throughout, but the loading in the evening and night varies. Congestion persists progressively later into the night until an upgrade occurs. The congested period then falls back to the working day with quiet nights, and the cycle repeats.
After mid-1997, however, the regular pattern breaks down; while there is still some correlation with the known upgrades, University terms are much more prominent and congestion comes and goes at other times; in the vacations, it correlates both with the U.K. and the U.S. working day. Thus the transatlantic links are no longer always the dominant cause of congestion, and planned further significant improvement of transatlantic facilities needs to be matched at the U.S. end to give the same visible user benefit. This has fueled a recent reawakening of U.K. interest in improvement of internal U.S. bandwidth.
The long history of periodic congestion because of lack of resources on international routes, and the much wider recent user base for networking facilities, makes this a particularly sensitive area with respect to fair allocation. The ranking technique outlined above can also be used here.
Figure 12 shows the plot of the logarithm of traffic against the rank of the site that is its destination. Again, the consistent gradient indicates that there is a uniform population over a usage range of almost a thousand.
At first there was healthy skepticism about the need to monitor performance against agreed service levels. This has been steadily overcome. The use of the data to identify problems and report them to the network provider, to identify existing bottlenecks, and to model behavior and hence predict future bottlenecks has assisted the funding bodies in all these respects.
In a climate that demands increasing justification for the expenditure of public funds, the funding bodies have been able to point to the high quality of service from their supplier. They have been able to justify their capacity planning assumptions in an environment in which provision "in advance of need" is unacceptable. Finally, they have been able to justify to the user population that it is fairly and even-handedly treated. Methodology and detail will continue to be refined, but the harness will continue to be deployed.
The work described in this paper is funded by the U.K. Joint Information Systems Committee. It would not have been possible without the support and cooperation of UKERNA and the Universities of Aberdeen, Bangor, Kent, Plymouth, Salford, and Warwick, who house our measurement machines. Special thanks go to Vern Paxson of LBL for providing our main U.S. test target.