In this paper we present the Netlab Quality Management Framework (NQMF), a set of tools especially designed to support controlled experimental analysis of voice-over IP (Internet Protocol) implementations. NQMF implements an integrated methodology to generate, in a controlled way, packet loss and delay sequences which effectively reproduce the fluctuating conditions of the Internet, and especially the high-frequency phenomena that mainly influence speech intelligibility. Our generator is based on a large set of measured sample traces, recorded in a variety of conditions over the network, and takes into special account the so-called 'coffee break' peaks. We also discuss a validation procedure that we carried out to account for the improvement that our generator has over simpler ones already proposed in the literature.
Real-time multimedia services over the Internet have become feasible thanks to the availability of new programming languages and integration environments, efficient data compression schemes and the provision of high-speed networks. However, a critical element for such services is still the sometimes poor and always unpredictable quality of the service delivered to the final user. Therefore the problem of assessing and establishing quality thresholds for real-time, interactive, multimedia communications over the Internet has been extensively addressed by the network community. Research activity has mainly focused on quality of service (QoS) issues at the network level, and has led to both the definition of Internet Protocol (IP) performance metrics and the proposal of new frameworks to differentiate the treatment of IP flows, e.g., the Integrated and Differentiated Services frameworks. Only recently has the problem of evaluating QoS from the user's point of view been tackled. So far, only a few, largely unsuccessful, attempts have been made in this direction.
In this paper we discuss a set of issues and results connected to the transmission of voice-over IP (VoIP). These activities relate to a project started in 1997 by NetLab, a Multimedia Network Laboratory of the Italian National Research Council, with the aim of identifying, designing and implementing a framework to perform subjective tests of quality in a controlled laboratory environment. A major result of the project was the design and prototype implementation of the Netlab Quality Management Framework (NQMF), a flexible measurement and application-testing framework with a user-friendly interface. The framework modules are available for most hardware and operating system platforms, and are therefore easily deployable on a large number of hosts over the Internet. Moreover, they introduce low overhead and have a low management cost. Since developing new measurement tools was not a goal of the project, we relied, whenever possible, on already available tools, which we only modified and extended to meet our needs.
One of the main goals of the project was to attain the ability to generate in the laboratory, in a controlled way, the packet delay and loss patterns experienced over the Internet. This finally allows extensive testing of multimedia network applications over a wide range of situations, and therefore the definition of thresholds at the network level that correspond to the appropriate level of QoS. This is opposed to the usual practice of relying only on measured traces, used to feed the applications during experimental testing, with the obvious disadvantage of not being able to cover an adequate range of network conditions, most notably the extreme ones.
A major challenge we faced in setting up the framework has been to correctly reproduce the fluctuating network conditions notably experienced on the Internet. As a consequence, we decided to concentrate first on the development of practices and tools to investigate the high-frequency phenomena, i.e., those whose dynamics are significantly faster than those of the human voice. These have a considerable impact on the quality of voice perception and human interaction. It is indeed well known that voice services are sensitive to the regular delivery of packets and can be disturbed by abrupt variations in delay and by concentrations of losses.
This has led us to the definition of new statistical metrics, which we introduce according to the international recommendations, to quantitatively characterize these phenomena and to form a parameterization basis for our laboratory generation environment. Moreover, these metrics may become meaningful service-level indicators at the network level, to be used in a VoIP QoS assessment context.
We report in this paper some major results of our activity, together with some remarks and conclusions that could, in our opinion, be useful to other researchers working in this field. More precisely, we discuss:
Measurements have mainly been performed according to the suggestions of the Internet Engineering Task Force (IETF) Internet Protocol Performance Metrics (IPPM) Working Group, even if, due to the particularities of the investigated context, the metrics and methods used to collect samples are not always consistent with its recommendations. Nevertheless, we always took the IPPM terminology and definitions as a reference, to allow a better and common understanding of our results.
We have investigated the network behavior in terms of packet delays and losses. Packet delays are the sum of a constant component and a dynamic component. Since, as pointed out in the introduction, we are mostly interested in studying the high-frequency phenomena that have a relevant impact on the quality of voice perception, we do not need to measure very slow delay fluctuations accurately. Therefore we can avoid measuring the constant component of the delay, and consequently we do not have to deal with clock synchronization problems. Moreover, we favored the flexibility of the measurement system, so as to be able to analyze a larger variety of situations. Accordingly, avoiding the installation of hardware synchronization instruments, such as GPS cards, and the modification of the operating system (OS) kernels turned out to be an important element in extending the number of remote hosts we were able to involve in our measurements. In any case, we attenuated as much as possible, as we shall discuss in the next section, the effect introduced by the skew between the sender and receiver clocks.
The measurement methodology is mostly application (i.e., VoIP) oriented. Isochronous test flows, which better model continuous VoIP packet flows, have been preferred to random sampling. All measurements are performed at the application level. As a consequence, the results take into account all the components of the delay, including those introduced by the host operating systems and hardware equipment. Measures were taken both on end-to-end delays and on round-trip times. In what follows, for a given packet pi, we use the symbols etei, rtti and di to refer, respectively, to its end-to-end delay, round-trip time and generic delay.
As for the technical details, in the isochronous test flows the User Datagram Protocol (UDP) has been adopted instead of the Transmission Control Protocol (TCP), since it does not interfere with the emission rate. The packet size is fixed at 32 bytes, the minimum UDP packet size, so as not to substantially influence the network performance. The emission rate is 50 packets per second, the maximum rate tolerable by the hardware/software platforms involved in the experimentation. Such a rate is nevertheless sufficient to reveal the high-frequency phenomena of interest.
Almost all measured traces showed, upon visual inspection, evidence of some high-frequency phenomena: peaks with a peculiar saw-toothed shape, which apparently cannot be explained just by an increase in traffic. Nor can this simply be a case of poor network dimensioning, since it is implausible that the traffic generated in just a few milliseconds would require hundreds of milliseconds to be absorbed. A reasonable explanation for this phenomenon, already suggested in the literature, is that routers periodically take a leave, for instance to update their routing tables. This is why these peaks will be dubbed, and referred to in the rest of the paper as, "coffee break peaks."
Figure 1: A typical delay sample
A coffee break peak starts with a packet pi such that |di - di+1| >= alpha * its, where its is the inverse of the emission rate and alpha is a constant greater than 1, and ends with the first packet pk, k > i+1, such that |di - dk| < its. Less formally, any such peak begins with a sudden increase in the delay and ends when the delay goes back to a value close to that prior to the beginning of the peak. Again, directly from visual inspection it is also evident that the delay and loss distributions within the coffee break peaks are substantially different from those of the remaining part of the trace. Figure 1 shows one of our measured traces, in which we may notice a number of coffee break peaks. The graph associates to each packet, whose sequential number appears on the x-axis, its end-to-end delay. Lost packets have been assigned, by convention, a null delay. Figure 2 shows a single coffee break peak in detail.
Figure 2: A typical coffee break peak
These observations readily convinced us that the only way to effectively characterize the delay and loss behavior of the network was to analyze and model the 'normal' region and the coffee break peaks separately. This approach was taken in the design of our experimental apparatus, as we discuss in more detail in the next section. We then collected separate measures for the whole trace and for the peak region alone, and were able to validate our approach by showing a better distribution fit when the segmentation strategy is adopted, as we will see in Section 4.
We discuss in this section the measurement and post processing framework which has been designed and implemented with two main purposes:
Figure 3 shows the architecture of the measurement and post-processing framework in the configuration for the measurement of end-to-end delays. An isochronous flow of IP packets is transmitted from a sender to a receiver through the network. The receiver records the measured end-to-end delays. The produced traces are the input to the post-processing modules, whose output is studied in depth to characterize the nature of the investigated phenomena.
Figure 3: Measurement and statistical framework
Measures are performed by using the NetDyn-1.0 tool, whose source code is available at the ftp site of the University of Maryland. NetDyn-1.0 effectively deals with isochronous flows of UDP packets and automates the measurement and post-processing functionality. In addition, it can easily be extended with ad hoc statistical tools. NetDyn-1.0 comes with some "awk" scripts that make it possible to evaluate the minimum, maximum and average values of the measured delays and their auto-correlation, and to count lost packets.
These basic capabilities have been extended with a set of post-processing modules, some of which we briefly describe in this section.
This module attenuates, as much as possible, the effect introduced by the skew between the sender and receiver clocks. Such a skew reveals itself as a drift component in the measured traces. Traces are filtered by a linear regression method: given any trace, the module computes its regression line y = mx + q and subtracts the mx component from the trace itself. The parameters m and q are computed by an algorithm of high numerical stability based on the Householder transformation.
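The filtering step can be sketched as follows. The closed-form least-squares formulas below are a simplification of the Householder-based solver actually used by the module, and the function name is ours:

```python
def remove_skew(delays):
    """Remove the linear drift introduced by sender/receiver clock skew:
    fit the regression line y = m*x + q over (packet index, delay) and
    subtract the m*x component, keeping only the fluctuations.
    Sketch of the skew-removal module; the paper's module uses a
    Householder-based least-squares solver instead of these formulas."""
    n = len(delays)
    xs = range(n)
    mean_x = (n - 1) / 2.0
    mean_y = sum(delays) / n
    # ordinary least-squares slope of delay vs. packet index
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, delays))
    den = sum((x - mean_x) ** 2 for x in xs)
    m = num / den
    return [y - m * x for x, y in zip(xs, delays)]
```

Subtracting only the mx component (and not q) preserves the absolute delay level of the trace, as in the module's description.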
Given any trace and any positive value alpha, the module extracts the coffee break peaks from the trace, producing two new traces. As already stated in the previous section, a coffee break peak starts with a packet pi such that |etei - etei+1| >= alpha * its and ends with the first packet pk, k > i+1, such that |etei - etek| < its. Furthermore, the module computes:
Finally the module produces a trace of all the etei's such that pi is the first packet of a coffee break peak.
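A minimal sketch of the peak-extraction logic, under our reading of the definition above. The boundary packet pk is assigned to the normal region (a choice the paper does not spell out), and the handling of lost packets is omitted:

```python
def extract_peaks(delays, alpha, its):
    """Split a delay trace into 'normal' and 'peak' regions following the
    coffee-break-peak definition: a peak starts at a packet p_i whose
    successor's delay jumps by at least alpha*its, and ends at the first
    later packet whose delay is back within its of d_i.  Sketch only:
    assigning the closing packet p_k to the normal region is our choice."""
    in_peak = [False] * len(delays)
    i = 0
    while i < len(delays) - 1:
        if abs(delays[i] - delays[i + 1]) >= alpha * its:
            k = i + 2
            # the peak lasts until the delay returns close to d_i
            while k < len(delays) and abs(delays[i] - delays[k]) >= its:
                k += 1
            for j in range(i + 1, k):
                in_peak[j] = True
            i = k
        else:
            i += 1
    normal = [d for d, p in zip(delays, in_peak) if not p]
    peaks = [d for d, p in zip(delays, in_peak) if p]
    return normal, peaks
```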
Given any trace, the module estimates the unconditional and the conditional loss probabilities, respectively defined as Ulp = L/T and Clp = CL/L, where L is the number of lost packets, T the number of packets in the trace, and CL the number of pairs of adjacent lost packets.
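The two estimates can be computed in a few lines; marking lost packets as None in the trace is our convention, not the module's:

```python
def loss_probabilities(trace):
    """Ulp = L/T and Clp = CL/L, where L is the number of lost packets,
    T the trace length and CL the number of adjacent lost-packet pairs.
    A lost packet is represented as None (illustrative convention)."""
    T = len(trace)
    L = sum(1 for d in trace if d is None)
    CL = sum(1 for a, b in zip(trace, trace[1:]) if a is None and b is None)
    ulp = L / T if T else 0.0
    clp = CL / L if L else 0.0
    return ulp, clp
```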
Given any trace and any positive integer N, the module partitions [etemin, etemax], the delay variability interval, into N equally sized sub-intervals I1, I2, ..., IN and produces a histogram in which the number of packets whose delay falls in Ij is associated with the midpoint of that sub-interval.
Given any trace, the module verifies how well the corresponding delay distribution can be approximated by a gamma distribution. For any given trace, it computes the empirical i/100-quantile qei of the distribution of the trace and the i/100-quantile qti of the gamma distribution with scale parameter equal to 1 and shape parameter equal to s, that is, the distribution with density

f(x) = x^(s-1) * e^(-x) / Gamma(s), for x > 0,

with s = sigma^2, where mu and sigma are, respectively, the expected value and the standard deviation of the empirical distribution.
If the empirical distribution is well approximated by a translated gamma distribution, then the sequence of points (qei, qti) can be approximated by the straight line y = mx + c. Such a test is implemented by the module by computing the linear regression line of the qei's with respect to the qti's and that of the qti's with respect to the qei's. If the slopes of the two lines are almost equal, then the two sets of quantiles can be assumed to be linearly correlated.
The module implements a second approach to verifying how well the delay distribution can be approximated by a gamma distribution, by computing the determination coefficient

R^2 = 1 - sum_i (qti - (c1*qei + c0))^2 / sum_i (qti - qtavg)^2,

where y = c1*x + c0 is the linear regression line of the qti's with respect to the qei's and qtavg is the average of the qti's. The closer R^2 is to 1, the better the fit. The module returns the values of s, m and c for the best-fitting gamma distribution.
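The quantile-quantile test can be sketched as follows. The Monte Carlo reference quantiles, the shape choice s equal to the trace variance, and the function name are our assumptions; the actual module may compute these quantities differently:

```python
import random
import statistics

def gamma_qq_fit(delays, mc_samples=20000, seed=7):
    """Sketch of a gammafit-style module: compare the empirical quantiles
    of a delay trace with those of a Gamma(s, 1) reference distribution,
    fit the regression line qe = m*qt + c and report the determination
    coefficient R^2.  The reference quantiles are obtained here by Monte
    Carlo sampling (our simplification)."""
    s = statistics.variance(delays)            # shape of the reference gamma
    rng = random.Random(seed)
    ref = sorted(rng.gammavariate(s, 1.0) for _ in range(mc_samples))
    xs = sorted(delays)
    qe = [xs[i * len(xs) // 100] for i in range(1, 100)]   # empirical quantiles
    qt = [ref[i * len(ref) // 100] for i in range(1, 100)]  # gamma quantiles
    mean_t, mean_e = statistics.mean(qt), statistics.mean(qe)
    m = sum((a - mean_t) * (b - mean_e) for a, b in zip(qt, qe)) \
        / sum((a - mean_t) ** 2 for a in qt)
    c = mean_e - m * mean_t
    ss_res = sum((b - (m * a + c)) ** 2 for a, b in zip(qt, qe))
    ss_tot = sum((b - mean_e) ** 2 for b in qe)
    r2 = 1.0 - ss_res / ss_tot                 # determination coefficient
    return s, m, c, r2
```

For data actually drawn from a gamma distribution of scale 1, the fitted line should have slope close to 1 and R^2 close to 1.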
Given any trace, the module produces two plots: one for the set of points (etei - etei-1, etei+1 - etei) and one for the set of points (|etei - etei-1|, |etei+1 - etei|).
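Computing the points of the first plot is straightforward; the sketch below pairs, for each packet, the delay variation from its predecessor with the delay variation to its successor (the function name is ours):

```python
def phase_plot_points(delays):
    """Points of the phase plot: for each packet p_i (excluding the first
    and last), the pair (d_i - d_{i-1}, d_{i+1} - d_i) of consecutive
    IP delay variations (ipdv).  Sketch of the phase-plot module."""
    return [(delays[i] - delays[i - 1], delays[i + 1] - delays[i])
            for i in range(1, len(delays) - 1)]
```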
In this section we discuss the initial results of our measurement and post-processing methodology. These initial tests were aimed to:
First, we carried out two different types of preliminary measurements:
For such a purpose we first selected a set of heterogeneous measurement hosts on which to run the NetDyn processes:
We then considered the paths connecting these hosts and classified them as follows:
In Table 1, we present a set of 15 selected measurements that represent quite well a large variety of situations, being heterogeneous both in terms of paths and of measurement machines. The table describes the main characteristics of the selected samples. We shall refer, in the rest of the section, to the post-processing and analysis of these sets of measurements.
Table 1: The selected set of measurements
|Path|Measure ID|Source loc.|Dest. loc.|Time of day|Num. of packets|Sampling time (ms)|Type|
First, the delay samples have been analyzed by the peak extractor. As an example, for a specific trace, Figures 4 to 6 display, respectively, the whole original trace, the normal portion (without the peaks) of the same trace, and the peak portion alone. From a simple visual inspection of the traces generated by the peak extractor we may clearly see that:
Figure 4: The original sample
Figure 5: The "normal" portion of the trace (without peaks)
Figure 6: The peak portion alone
The main problem in tuning the peak extractor was how to classify a packet loss close to a transition edge. Such a loss could be considered as belonging to the peak portion, to the normal portion, or to neither of them.
We finally assumed the following criteria:
The attribution of losses close to the normal-to-peak edge is error-prone. In order to avoid incorrect assumptions, we measured the probability of losing the last packet before the edge. This probability turned out to be very low (on the order of 10^-4).
This leads to the following interesting remark: peaks are not announced by an increasing packet loss probability. This can be easily explained: just before a coffee break there is no special reason for critical congestion in the network, and therefore routers generally have enough buffer space.
Table 2: Peak probability and peak-to-normal transition probability
|Path|Sample ID|Peak probability (%)|Peak-to-normal transition prob.|Avg. peak duration (ms)|
According to these remarks and to the IPPM recommendations, we propose the definition of three new statistical metrics: the Peak Probability, the Peak-to-Normal Transition Probability and the Average Peak Duration.
Note that the last two metrics are directly dependent on each other and are therefore equivalent: the average peak duration is, roughly, the sampling time divided by the peak-to-normal transition probability.
Table 2 contains the values of these metrics for the selected set of measurements. It is clear that the peak phenomena are, in some cases, negligible; for example, this is true for samples 4, 5 and 14, which have fewer than two peaks each. We classified the samples on the basis of the Peak Probability (see Table 3):
Table 3: Sample classification
|Sample class|PeakProb range|Sample IDs|
|Negligible peaks|< 0.1%|4, 5, 14|
|Low peak density|0.1% - 1%|1, 6, 9, 15|
|Medium peak density|1% - 10%|2, 7, 8, 12, 13|
|High peak density|> 10%|3, 10, 11|
The AveragePeakDuration proved to depend mostly on the path type. For instance, in our measurements, all samples referring to Cross Country paths have peaks with an average duration of 60-90 ms, while the one referring to the Country path has an average duration of 240 ms. This leads to the following interesting remark: the AveragePeakDuration is an invariant characteristic of a path. We suppose that its value strongly depends on the complexity (number of routers, number of hops) of the network the path traverses.
Figure 7 shows an example, for a specific trace, of the result of the phase-plot analysis. Each (x, y) point in the phase plot represents a pair of IP delay variations (ipdv) referred to a given packet pi. In particular, the x-coordinate represents the ipdv between packets pi and pi-1 in the injected test flow; the y-coordinate represents the ipdv between packets pi and pi+1.
Figure 7: Example of Phase Plot Analysis
From the picture, we may clearly make the following remarks:
The phase plot analysis leads to the main interesting remark: packet misordering is a rare occurrence.
Loss analysis has been performed by analyzing separately the whole sample, the peak portion of the sample alone, and the normal portion of the sample, without peaks.
For each portion of the sample, the unconditional loss probability (Ulp) and the conditional loss probability (Clp) have been calculated. Table 4 summarizes the values of Clp and Ulp for all the samples.
Table 4: Loss Results Analysis
|Path ID|Sample ID|UlpT (%)|UlpN (%)|UlpP (%)|ClpT (%)|ClpN (%)|ClpP (%)|
As summarized in Table 5, the approach of analyzing losses separately in the peak and normal regions is especially interesting when the peak density, according to our classification, is between low and medium. Overall, although the peak segmentation is based only on packet delays, we found in all our samples a strong statistical dependence between packet delay and packet loss: the unconditional loss probability in the peak period is much greater than in the normal period.
Table 5: Sample classification
|Sample class|PeakProb range|Analysis|
|Negligible peaks|< 0.1%|Separate analysis useless, though still applicable|
|Low peak density|0.1% - 1%|Ulp in the peak period is always much greater than in the normal period|
|Medium peak density|1% - 10%|Ulp in the peak period is always much greater than in the normal period|
|High peak density|> 10%|Peaks dominate; further investigation needed|
At this stage of the analysis, we are not able to make any clear statement about the differences in conditional loss probability between normal and peak periods. Further analysis is needed.
The results of the delay analysis strongly support our approach of dividing the sample into normal and peak portions. As a matter of fact, we found that we can fit a gamma distribution much better on the normal portion alone than on the whole sample. This is clearly displayed in Table 6, where, for each sample, we reported the following parameters:
Table 6: Gammafit results
RT and RN are regression coefficients and give a measure of how well the best-fitting gamma distribution fits the whole sample and the normal portion of the same sample, respectively.
The detailed analysis leads to the remark that the normal portion of the packet delay sample fits a gamma distribution much better than the whole sample does.
In this section, we discuss the main characteristics of the stochastic model we propose for loss and delay generation. The model has been especially designed to effectively reproduce the high-frequency phenomena of interest. We synthesized a four-state extended Markov model, in which two pairs of states represent the normal and peak conditions. In each pair, the first state represents a condition of packet loss and the second a condition of no loss.
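The four-state structure can be sketched as follows. All parameter names, transition probabilities and per-region delay laws below are illustrative assumptions, not the paper's fitted values:

```python
import random

def make_generator(p_np, p_pn, ulp_n, clp_n, ulp_p, clp_p, seed=0):
    """Sketch of a four-state Markov loss/delay generator in the spirit
    of the paper's model.  State = (region, lost): region is 'normal' or
    'peak', lost says whether the current packet is dropped.  p_np/p_pn
    are the normal->peak and peak->normal transition probabilities;
    ulp_*/clp_* are the per-region unconditional/conditional loss
    probabilities.  Delay laws are illustrative, not fitted."""
    rng = random.Random(seed)

    def gen(n):
        region, lost = "normal", False
        out = []
        for _ in range(n):
            # region transition (normal <-> peak)
            if region == "normal" and rng.random() < p_np:
                region = "peak"
            elif region == "peak" and rng.random() < p_pn:
                region = "normal"
            # loss: conditional prob. if the previous packet was lost
            ulp, clp = (ulp_n, clp_n) if region == "normal" else (ulp_p, clp_p)
            lost = rng.random() < (clp if lost else ulp)
            if lost:
                out.append((region, None))  # lost packet: no delay value
            else:
                base = 40.0 if region == "normal" else 200.0  # ms, illustrative
                out.append((region, base + rng.gammavariate(2.0, 5.0)))
        return out

    return gen
```

Usage: `gen = make_generator(0.01, 0.2, 0.005, 0.3, 0.1, 0.5)` then `trace = gen(1000)` yields a sequence of (region, delay) pairs with loss bursts and delay peaks clustered in the peak region.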
The parameters of the model are directly connected to the main metrics we have defined and measured on our samples, as discussed in Sections 3 and 4. The model generates samples, i.e., sequences of delays and losses, which are stochastically equivalent to the measured samples with the same values of the parameters. The generated samples have the following characteristics:
We validated the model in several ways. First, we compared measured and generated samples with the same values of the parameters by visual inspection and by investigation of first-order statistical properties. Visual inspection displayed an evident similarity between measured and generated samples. Investigation of the first-order statistical properties, such as expected values and standard deviations, loss probabilities, moments of the delay distributions, and the stochastic dependence between delays and losses, gave good results as well. This strongly suggests that the delay-loss sequences generated by the model are stochastically equivalent to those in measured samples with the same values of the parameters.
We then performed a second kind of validation, this time at the application level, to verify that the effects at this level of a measured loss-delay sequence and of an 'equivalent' one generated by our model are close enough. This is even more interesting, since it tells us that using generated sequences to test and tune applications is equivalent to using measured traces. For this purpose we defined an ad hoc methodology, based on the observation that receivers always use a playout delay algorithm to smooth the delay jitter. The effectiveness of such algorithms is strongly influenced by clusters of losses and clusters of relevant delay variations. Given a measured sample S and an integer number of packets N, we proceed as follows:
To account for the improvement that our generator offers over simpler ones already proposed in the literature, we also compared, with the same validation methodology, the measured sample with samples generated by state-of-the-art modeling approaches based on autocorrelated gamma distribution functions. Figures 8 and 9 display the result of this validation procedure on a specific measured sample for N=5 (corresponding to a typical talkspurt length) and a delay threshold set to the 90th percentile. It is clear from the pictures that the trace produced by our generator fits the measured trace considerably better. As a matter of fact, the trace produced by the conventional autocorrelated gamma generator tends to distribute both delays and losses more evenly, i.e., smaller delays and fewer losses spread over a larger number of intervals. This means that, if such a trace were used to test VoIP applications, it would result in an optimistic evaluation compared to the effects of the measured trace. The results of this validation procedure on our extended set of measured samples have indeed been very encouraging.
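The per-interval comparison underlying Figures 8 and 9 can be sketched as follows; the grouping into N-packet intervals and the two per-interval counts are our reading of the (partially elided) methodology, with illustrative names:

```python
def interval_stats(trace, n, delay_threshold):
    """Per-interval statistics for the application-level validation
    sketch: split the trace into consecutive groups of n packets and,
    for each group, count the lost packets (None) and the packets whose
    delay exceeds the given threshold (e.g., the 90th percentile).
    Our reading of the paper's methodology, not its exact procedure."""
    stats = []
    for start in range(0, len(trace), n):
        group = trace[start:start + n]
        losses = sum(1 for d in group if d is None)
        late = sum(1 for d in group if d is not None and d > delay_threshold)
        stats.append((losses, late))
    return stats
```

Comparing the histograms of these per-interval counts for a measured trace and for a generated one then reveals whether the generator reproduces the clustering of losses and large delays.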
Figure 8: Model validation (delays)
Figure 9: Model validation (losses)
The generator has been implemented in C++ and integrated into a high-performance tool for the generation of delays and losses over real multimedia and multicast streams. The integrated generation tool runs on a high-performance Digital Unix workstation and has already been used in experiments aimed at measuring, in a controlled environment, the effect of losses and delays on voice packet streams at the user-level QoS.
In this paper we have presented the Netlab Quality Management Framework, a set of tools especially designed to support controlled experimental analysis of VoIP implementations. NQMF implements an integrated methodology to generate, in a controlled way, packet loss and delay sequences that effectively reproduce the fluctuating conditions of the Internet, and especially the high-frequency phenomena that mainly influence speech intelligibility. The loss and delay generator embedded in our framework is based on a large set of measured sample traces, recorded in a variety of conditions over the network, and takes into special account the so-called 'coffee break' peaks. To effectively characterize this region of the traces, we introduced new metrics that were taken as a parameterization basis for the generator. A validation procedure has also been carried out to account for the improvement that our generator offers over simpler ones already proposed in the literature.
We are currently extending our work in two main directions. First, we are working toward a better integration of our tools, to automate the measurement, analysis and generation process. Second, we are pursuing a more ambitious goal by planning and performing a set of controlled experiments to formally assess the relationship between network-level QoS, based on objective loss and delay metrics, and user-level QoS, based on subjective metrics.
Jean-Chrysostome Bolot, Hugues Crépin. "Analysis and control of audio packet loss over packet switched networks." Proceedings of NOSSDAV '95, pp. 163-174, Durham, NH.
 J.-C. Bolot. "End-to-End delay and loss behaviour in the Internet." Proceedings ACM Sigcomm '93, pp. 289-298, San Francisco, CA, Sept. 1993.
C. Demichelis, P. Chimento. "Instantaneous Packet Delay Variation Metric for IPPM," Internet Draft draft-ietf-ippm-ipdv-03.txt, June 1999.
 Sue B. Moon, Jim Kurose, Paul Skelly, Don Towsley. "Correlation of packet delay and loss in the Internet." UMASS CMPSCI Technical Report #98-11.
M. Draoli, P. Filosi, C. Gaibisso, M. Lancia, A. Laureti Palma. "A framework for the subjective assessment of audio communication on IP switched networks: definition and validation." ICCT'98: Proceedings of the International Conference on Communication Technology, Beijing, China, October 1998.
 S. Kalidindi, M.J. Zekauskas "Surveyor: An Infrastructure for Internet Performance Measurement," INET'99.
 Julie Pointek, Forrest Shull, Roseanne Tesoriero, Ashok Agrawala. "NetDyn revisited: a replicated study of network dynamics." Computer Networks and ISDN Systems 29 (1997) pp. 831-840.
 Ashok K. Agrawala and Dheeraj Sanghi. "Network Dynamics: an experimental study of the Internet." Proceedings of IEEE INFOCOM '92, Florence, Italy, May 1992.
 D. Sanghi, A. K. Agrawala, O. Gudmundsson, B. N. Jain. "Experimental Assessment of end-to end behaviour on Internet." University of Maryland, Computer Science Department, College Park, MD, 1992, Technical Report CS-TR 2909.
 Sally Floyd, Van Jacobson. "The synchronization of periodic routing messages." IEEE/ACM Transaction on Networking, April 1994. Extended Version in SIGCOMM 1993, September 1993, pp. 33-44.
V. Paxson, G. Almes, J. Mahdavi, M. Mathis. "Framework for IP Performance Metrics," RFC 2330, May 1998.
John A. Zinky, Fredric M. White. "Visualizing packet traces." Proceedings of ACM SIGCOMM '92, Baltimore, MD, August 1992.
A. Mukherjee. "On the dynamics and significance of low-frequency components of the Internet load." TR CIS-92-83, University of Pennsylvania, December 1992.
 A. Watson, M. A. Sasse. "Evaluating audio and video quality in low cost Multimedia Conferencing systems." To appear in "Interacting with Computers."