PingER: Internet Performance Monitoring.

Warren MATTHEWS and Les COTTRELL
Stanford Linear Accelerator Center,
Operated for the U.S. Department of Energy by Stanford University
2575 Sand Hill Road, Menlo Park, California 94025, U.S.A.

Objectives

This poster details the importance of monitoring the modern Internet, and in particular the experiences of the PingER project using data gathered from sites connected to the commercial Internet. How the networks connect together is of particular interest, and examples of how connectivity affects packet loss and response time are discussed.

Internet Performance Monitoring

The rapid growth and development of the Internet in recent years often leaves even the most knowledgeable user wondering what is going on.
The countless servers and clients on the Internet, and the applications that run on them, communicate by passing packets of information between one another. The modern Internet can be a perilous place for these packets. If the queues in the routers that forward the packets around and between the networks are full, the packets can be delayed or lost, and communication between the applications suffers, sometimes making the applications unusable. However, some parts of the Internet do not suffer such dangers, and packets are rarely lost or delayed beyond the physical limit imposed by the speed of light. This raises two questions: what is the best route for a packet, and how can resources be allocated to optimize the performance of the applications we rely on so much in research and commerce on today's Internet?

Internet Performance Monitoring points to the solution. By understanding the issues and discovering where the bottlenecks are, the managers and engineers who develop the networks can make informed decisions about what needs to be done.

The PingER Project

The [quantities related to performance and reliability] must be useful to users and providers in understanding the performance they experience or provide (RFC 2330).
PingER (Ping End-to-end Reporting) is the name given to the Internet monitoring project begun at the Stanford Linear Accelerator Center (SLAC). It is now deployed throughout the Department of Energy funded labs and High Energy Particle and Nuclear Physics facilities and collaborating universities around the world. The main component of the project is the ping program. Ping is a simple network tool that sends a chosen number of packets of a chosen size to a remote computer and records the minimum, average and maximum time taken for the packets to return to the originating host, as well as the number of packets that did not return, in other words the packet loss. Ping sends ICMP (Internet Control Message Protocol) echo requests; ICMP is part of the TCP/IP protocol suite. Ping comes installed on most platforms, so there is no need to install other applications.
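The quantities ping reports for one sample can be sketched as follows. This is a minimal illustration, not the PingER code: the function name `summarize_ping`, the dictionary keys, and the convention that `None` marks a packet that never returned are all assumptions made for the example.

```python
from typing import Optional, Sequence


def summarize_ping(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one ping sample.

    Each entry is a round trip time in milliseconds, or None for a
    packet that did not return (hypothetical convention).
    """
    sent = len(rtts_ms)
    returned = [r for r in rtts_ms if r is not None]
    lost = sent - len(returned)
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "min_ms": min(returned) if returned else None,
        "avg_ms": sum(returned) / len(returned) if returned else None,
        "max_ms": max(returned) if returned else None,
    }


# Four packets sent, one lost (values are invented for illustration).
print(summarize_ping([70.0, 72.0, None, 74.0]))
```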

Architecture

This image shows the architecture of the PingER project. Each monitoring site pings each of the remote sites every half hour and summaries of the results of each sample are written to a file. Each day the archive site retrieves the data and stores it in a database. Analysis is done on the data by each monitoring site to show short term reports in the form of a webpage table with the packet loss and response time of the latest data. Each site provides the raw data and a simple configurable graphing tool. Furthermore, a collective analysis is done on all the data at the analysis site, providing detailed hourly, daily and monthly reports.
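The step from half-hourly sample summaries to a daily report might look like the sketch below. The record layout (timestamp, remote host, packets sent, packets lost, average round trip time) and the name `daily_report` are hypothetical; the actual PingER file and database formats are not described here.

```python
from collections import defaultdict
from datetime import datetime


def daily_report(records):
    """Aggregate per-sample summaries into per-day, per-remote figures.

    records: iterable of (iso_timestamp, remote, sent, lost, avg_rtt_ms),
    where avg_rtt_ms may be None if no packet returned (assumed layout).
    """
    totals = defaultdict(
        lambda: {"sent": 0, "lost": 0, "rtt_sum": 0.0, "returned": 0}
    )
    for ts, remote, sent, lost, avg_rtt in records:
        day = datetime.fromisoformat(ts).date()
        t = totals[(day, remote)]
        t["sent"] += sent
        t["lost"] += lost
        returned = sent - lost
        if avg_rtt is not None and returned > 0:
            # Weight each sample's average by the packets that returned.
            t["rtt_sum"] += avg_rtt * returned
            t["returned"] += returned
    report = {}
    for key, t in totals.items():
        report[key] = {
            "loss_pct": 100.0 * t["lost"] / t["sent"] if t["sent"] else 0.0,
            "avg_rtt_ms": t["rtt_sum"] / t["returned"] if t["returned"] else None,
        }
    return report
```

Weighting the response time by returned packets is one possible design choice; a plain mean of the sample averages would also be reasonable if every sample sends the same number of packets.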

The Internet has traditionally been a best effort service, with providers passing on each other's traffic in a co-operative attempt to get packets to the correct destination. Users are now demanding more guarantees and Quality of Service (QoS) with Service Level Agreements (SLAs). However, in order for some packets to arrive faster, others must arrive slower, and some providers are deliberately hindering the progress of ping packets across their network in order to improve the progress of other types of packets. Also, certain types of security attacks can be launched using ping, so some networks give ping packets very low priority or block them altogether. There have been a few incidents where ping can no longer be used for monitoring (we expect this to increase over time), but the PingER framework can be extended to employ other tools, such as web/HTTP download time, to monitor network connectivity. Other network monitoring projects such as Surveyor and NIMI do not suffer similar problems because they do not rely on ping for their measurements. The PingER project group is actively working with the other groups to benefit from this.

The XIWT and IPERF

The Cross-Industry Working Team (XIWT) is a membership organization consisting of a diverse group of communications, computer systems, information and service providers who have joined together to develop a common technical vision for the National Information Infrastructure (NII).

As part of its work, the XIWT formed an Internet Performance Working Group (IPERF) with the explicit goal of helping to improve Internet performance. The IPERF consists of seven corporations and three other organizations from within the XIWT membership. Along with SLAC, there are ten monitoring sites and sixteen remote sites (the ten monitoring sites and six others).

The IPERF recently published a white paper entitled "Customer View of Internet Service Performance: Measurement Methodology and Metrics". See the website for links to the document.

Internetworking

The TCP congestion avoidance mechanisms ... are not sufficient to provide good service in all circumstances (RFC 2309).
The Internet has grown rapidly since it was harnessed for commercial use. In recent years email, for example, has become as essential to the smooth running of business as it is to the Internet's academic creators. So when the packets that carry the information between the clients and the servers on the Internet are passed from router to router or between networks, it is essential that this happens without serious delay. However, each network controls its own routers and has very little influence on the routers in other networks. A network entirely under the control of a single operations and maintenance organization is called an autonomous system (AS) precisely for this reason.

The metrics used in the studies of the data gathered for the PingER project are end-to-end: they deal with packet loss and delay between two end node machines. Very little attention is paid to the routers that connect them, because it is often very difficult to know what is happening inside someone else's network or AS, and it is only the effect on performance that informs us of a change. The situation is complicated by route changes. Either deliberately or accidentally, a router or a network may change the path it uses to forward packets to the next router or network. Furthermore, the route may alternate frequently between two or more paths; this is called flapping. Changes in performance can often be associated with route changes or flaps.
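Route changes and flapping can be spotted by comparing successive path observations, for example from repeated traceroute runs. The sketch below is a simple heuristic under stated assumptions, not a PingER tool: each path is a tuple of hop names, and `min_changes=3` is an arbitrary threshold chosen for illustration.

```python
def route_changes(paths):
    """Return the indices at which the path differs from the previous one."""
    return [i for i in range(1, len(paths)) if paths[i] != paths[i - 1]]


def is_flapping(paths, min_changes=3):
    """Heuristic: several changes, and at least one path is returned to.

    paths: list of tuples of hop names, oldest observation first.
    """
    changes = route_changes(paths)
    if len(changes) < min_changes:
        return False
    # The starting path plus the path in effect after each change.
    sequence = [paths[0]] + [paths[i] for i in changes]
    return len(set(sequence)) < len(sequence)


# Hypothetical hop names, invented for the example.
a = ("r1", "r2", "r4")
b = ("r1", "r3", "r4")
print(route_changes([a, a, b, a, b, b]))
print(is_flapping([a, a, b, a, b, b]))
```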

Network Snapshot

Network Snapshot

This image shows routes between seven of the XIWT/IPERF monitoring sites, based on results from the traceroute program. Each hexagon represents a network, so packets between two end sites may have to cross up to seven separate networks and up to fifteen routers in order to communicate.

Unlike the well designed routes within a given network, routing across the commercial Internet can take on a very complex topology. This may provide ample opportunity to delay packets!

Packet Loss

Packet loss between end nodes is critically important to the performance of any network application.
By far the main cause of packet loss is queueing at routers: packets must queue to be processed, and if the queue is full the packet is discarded. This is often an issue at the Internet Exchange Points, where packets are passed from one network to another, because they are often very busy.
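The queueing behaviour described above can be illustrated with a toy simulation of a fixed-size queue that discards arrivals when it is full. The function name `drop_tail`, the discrete time steps, and all the numbers are assumptions for the example; real routers are considerably more complicated.

```python
def drop_tail(arrivals, capacity, service_rate):
    """Simulate a fixed-size router queue.

    arrivals: packets arriving in each time step.
    capacity: maximum packets the queue can hold.
    service_rate: packets forwarded per time step.
    Returns (packets dropped, loss percentage).
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        space = capacity - queued
        accepted = min(arriving, space)
        dropped += arriving - accepted  # No room: the packet is discarded.
        queued += accepted
        queued -= min(queued, service_rate)
    total = sum(arrivals)
    return dropped, (100.0 * dropped / total if total else 0.0)


# A lightly loaded queue loses nothing; a burst overflows it.
print(drop_tail([2, 2, 2, 2], capacity=4, service_rate=2))
print(drop_tail([5, 5, 1, 1], capacity=4, service_rate=2))
```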

Packet Loss

This graph shows the percentage of packets lost on each day from mid-April to the last week of May 1998 for a connection between a lab on the Energy Sciences Network (ESnet) and a university. The sudden drop in packet loss (and the corresponding improvement in performance) around May 1 is due to the university changing its connection from a provider that was connected to ESnet via a congested Internet Exchange Point to a high performance research network with a much better connection to ESnet.

Packet Loss

This graph shows the percentage of packets lost during each day of January 1999 for two separate XIWT pairs. One is consistently good: packet loss rises above zero on only four occasions, and each time only one packet was lost from one sample. Network applications are expected to work well over this link. The other link is the opposite extreme, with packet loss varying wildly. The pattern is consistent with a congested link, where packet loss is high during the working week and lower at weekends. One of the differences between these pairs is that one link passes through a highly congested Internet Exchange Point and the other does not. There are other factors, but the hazards of using congested Internet Exchange Points are clear.
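One way to check for the working-week pattern is to compare the mean daily loss on weekdays with the mean at weekends. The sketch below assumes the daily figures are held in a dictionary keyed by date; the function name and the loss values are invented for illustration and are not PingER measurements.

```python
from datetime import date


def weekday_weekend_loss(daily_loss):
    """Return (mean weekday loss, mean weekend loss) in percent.

    daily_loss: dict mapping datetime.date to packet loss percentage.
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    # date.weekday() is 0 for Monday through 6 for Sunday.
    weekday = [v for d, v in daily_loss.items() if d.weekday() < 5]
    weekend = [v for d, v in daily_loss.items() if d.weekday() >= 5]
    return mean(weekday), mean(weekend)


# Invented values: Saturday 2 and Sunday 3 January 1999, then Monday and Tuesday.
losses = {
    date(1999, 1, 2): 2.0,
    date(1999, 1, 3): 4.0,
    date(1999, 1, 4): 10.0,
    date(1999, 1, 5): 20.0,
}
print(weekday_weekend_loss(losses))
```

A weekday mean well above the weekend mean is consistent with congestion driven by working-hours traffic, although it does not by itself identify where the congestion occurs.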

Response Time

Ideally, traffic should traverse the Internet at the speed of light in glass. However, because of delay in transmission, it very rarely does.
Unlike packet loss where it is possible to reduce losses to zero, it is never possible to reduce the round trip time (RTT) to less than the time taken for light to travel the distance along the fiber. In addition to this minimum imposed by the laws of physics, there may also be a delay caused by queueing in routers.
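The physical lower bound on round trip time can be estimated from the fiber length. The sketch below assumes a refractive index of about 1.5 for glass, so that light travels at roughly two thirds of its vacuum speed; the function name and the 1000 km example are illustrative, and real fiber paths are usually longer than the straight-line distance between sites.

```python
C_KM_PER_S = 299_792.458  # Speed of light in vacuum, in km/s.


def min_rtt_ms(fiber_km, refractive_index=1.5):
    """Lower bound on round trip time over a fiber of the given length.

    refractive_index=1.5 is an assumed approximate value for glass.
    """
    speed_km_per_s = C_KM_PER_S / refractive_index
    return 2.0 * fiber_km / speed_km_per_s * 1000.0


# Roughly 10 ms for a 1000 km fiber path, before any queueing delay.
print(round(min_rtt_ms(1000.0), 2))
```

Any measured RTT above this figure is then attributable to extra path length, queueing in routers, or processing at the end hosts.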

Response Time

This graph shows the average ping round trip time for packets sent from one XIWT host to another and back for each day of January 1999. The path between the two sites crosses several third-party networks and a congested Internet Exchange Point. If the end networks which contain the hosts require better connectivity, they can negotiate a mutually beneficial arrangement to connect directly; this is called peering.

Response Time

This graph shows a dramatic change in round trip time. The time taken rose more than fourfold, from around 70ms to over 300ms, for about one day. It is quite likely this was due to a change in route: either deliberately or accidentally, the routers sent the packets on a much longer journey. This will certainly have made interactive applications perform much more slowly, and data throughput would also have been significantly reduced.
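A jump of this kind can be flagged automatically by comparing each measurement with a typical value for the series. The sketch below uses the median as the baseline and an arbitrary factor of 3; both choices, the function name, and the sample values are assumptions for illustration rather than the method used in the PingER analysis.

```python
from statistics import median


def rtt_jumps(rtt_ms, factor=3.0):
    """Return indices of measurements exceeding factor times the median."""
    if not rtt_ms:
        return []
    baseline = median(rtt_ms)
    return [i for i, rtt in enumerate(rtt_ms) if rtt > factor * baseline]


# Invented series with one excursion from about 70 ms to over 300 ms.
print(rtt_jumps([70.0, 72.0, 69.0, 310.0, 71.0, 70.0]))
```

The median is used because a short excursion barely moves it, whereas a mean would be pulled upwards by the very values being looked for.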

Further Information

To learn more about the PingER project and related monitoring activity, or for information on internetworking in general, please visit
http://www-iepm.slac.stanford.edu

For more information regarding the XIWT, please visit

http://www.xiwt.org

Presented at INET99, San Jose, California, June 1999. Also available in PostScript versions. Enquiries to warrenm@slac.stanford.edu
