Adaptive Loss Concealment for Internet Telephony Applications
Henning SANNECK <email@example.com>
Today's Internet is increasingly used not only for e-mail, ftp and the World Wide Web, but also for interactive audio and video services (MBone). However, the Internet as a datagram network offers only a "best effort" service, which can lead to excessive packet losses under congestion. Internet measurements have shown that the overall probability of losing a single packet is high, but drops significantly for the loss of several consecutive packets.
In this paper we consider this Internet loss characteristic together with the long-term correlation within a speech signal to mitigate the impact of packet losses. This is accomplished by an adaptive choice of the packetization interval of the voice stream at the sender. When a packet is lost, the receiver can use adjacent signal segments to conceal the loss from the user, because a high similarity can be assumed due to the adaptive packetization at the sender. The subjective quality of the proposed scheme as well as its applicability within the current Internet environment (high loss rates, common audio tools, standard speech codecs) are discussed.
Packet-switched networks are increasingly used for audio and video transmission besides "classical" services like electronic mail. However, datagram-oriented networks typically offer only a "best effort" service, which does not make any commitment about a required minimum bit-rate or a maximum allowed delay. Consequently, when the network gets congested, real-time packets may arrive too late at the receiver or may be dropped due to buffer overflow at routers or bit errors (wireless networks). In the case of the transmission of telephone-quality audio for conferencing applications, which we will further explore in this paper, packet loss causes signal dropouts which are very annoying for the listener. To tackle the loss problem, different techniques have been proposed, which can be divided as follows:
In this section we want to briefly describe these different methods (especially with regard to the Internet environment) and finally introduce our approach.
Real bandwidth adaptation, i.e., varying the coder output bit-rate according to (RTCP) loss reports by receivers, is currently not feasible for speech transmission, as no standardized scalable audio codec is available. Such codecs (e.g., wavelet codecs) are under development, but have not yet found wide deployment. Additionally, when considering the use of a scalable codec (i.e., one where the quality-to-bit-rate relation is continuous), one must realize that its bandwidth range is usually one order of magnitude lower than that of the video coders in use today. Thus the overhead for this scheme (RTCP control traffic) does not seem to justify the possible gain in available network bandwidth.
When using the constant (low) bit-rate codecs available in the current MBone tools (vat, rat, FreePhone, NeVoT), no output bit-rate adaptation in response to temporary congestion is possible. However, it has been proposed to switch between the available codecs (PCM, ADPCM, LPC, etc.) for noncontinuous bit-rate adaptation. We argue that this is problematic due to the nonlinear (or even noncontinuous) relation between the bandwidth and the subjective quality of the codecs: e.g., GSM (13 kbit/s) sounds subjectively better than G.723 (ADPCM, 32 kbit/s) and different from (though not necessarily inferior to) A-law PCM (64 kbit/s). Additionally, considering the service model, switching codecs takes the choice of codec/subjective quality away from the user, and one could argue for always using the codec with the best quality-to-bit-rate relation (assuming sufficient computing power).
Recently, much work has been devoted to the Internet Integrated Services (IIS) model and its resource reservation setup protocol, RSVP. Assessing the effectiveness of reservations for speech traffic, we have identified two major drawbacks.
First, the IIS mechanisms have to be deployed in every IP router along the path from the source to the sink. Then, for each flow, state has to be installed in every participating node. Considering numerous low bandwidth voice flows, this results in a high per-flow state overhead to bandwidth ratio. Because the properties of voice flows (constant bit-rate, loss sensitive) are known in advance and could, for example, be identified by the RTP payload type, the IIS traffic characterization objects (Sender and Receiver TSpec) are largely redundant.
Secondly, a mismatch between the properties of the currently existing Internet service classes and the requirements of telephone-quality speech traffic can be observed: the IIS Guaranteed service is intended for nonadaptive flows which need a strict delay bound, while the Controlled Load service strictly requires the loss rate to be near 0%. However, subjective testing has shown that, to a certain extent, delay tolerance can be traded against loss tolerance (i.e., that applications can repair isolated losses). Additionally, all typical voice applications can adapt fairly well to changing delay (jitter).
These methods "piggyback" redundant information from earlier packets on the current packet to be sent. Two different methods have been proposed:
The overhead of these schemes is relatively high with respect to the additional data to be transmitted (to accommodate losses, the bit-rate has to be increased first in proportion to the number of consecutive losses to be repaired). Yet, the scheme is useful for reconstructing small bursts of lost packets, as well as for larger packet sizes (when concealment can't be applied), and (for source-coded redundancy) all existing codecs in tools can be used.
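A minimal sketch of source-coded redundancy along these lines (the function name, the dictionary-based packet layout and the `encode_redundant` stand-in are our own illustration, not the cited scheme's actual format):

```python
def piggyback_stream(payloads, encode_redundant):
    """Attach a redundant (lower bit-rate) copy of the previous payload to
    each packet, so that an isolated loss of packet i-1 can be repaired
    from packet i. `encode_redundant` is a placeholder for, e.g., a
    low bit-rate LPC coder applied to the previous segment."""
    packets = []
    prev = None
    for payload in payloads:
        packets.append({
            "primary": payload,  # the current, fully coded segment
            # redundant re-encoding of the previous segment (None for the first packet)
            "redundant": encode_redundant(prev) if prev is not None else None,
        })
        prev = payload
    return packets
```

To repair k consecutive losses, each packet would have to carry redundant copies of the k previous segments, which is where the proportional bit-rate increase mentioned above comes from.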
A simple method to increase the audibility of a loss-distorted signal is interleaving, i.e., sending parts of the same signal segment in different packets, thus spreading the impact of loss over a longer time period. The following schemes exist:
Interleaving always needs resequencing at the receiver and thus introduces higher latency, as the interleaving degree I is also the number of packets needed to regenerate the entire signal segment.
A speech signal can be (roughly) partitioned into voiced and unvoiced regions. Voiced signal segments show high periodicity (the pitch period). When packetizing, the contents of consecutive packets therefore resemble each other. Concealment algorithms try to exploit this by processing the signal segments around the gap caused by a lost packet and then filling the gap appropriately. This can be done, for example, simply by repeating a signal segment of pitch-period length throughout the missing packet ("Pitch Waveform Replication", PWR), possibly supported by (per-subband) LPC analysis/synthesis.
Usual concealment schemes are receiver-only, i.e., they do not introduce additional processing and data overhead at the transmitter and are well suited for heterogeneous multicast environments. This means that transmitters may use different audio tools than the receivers, and receivers can mitigate packet loss according to their specific quality requirements.
However, the applicability is limited to isolated losses of small- to medium-sized packets (the quasi-stationary property of the signal can be assumed with a high probability only for speech segments smaller than 40ms). To conceal with a high output speech quality, a high number of successfully received packets are necessary after the gap, resulting in additional playout delay. As the (fixed) packetization interval is unrelated to the "importance" of the packet content and changes in the speech signal, some parts of the signal cannot be concealed properly due to the unrecoverable loss of entire phonemes.
In this work, we facilitate concealment by processing the undistorted signal at the sender, resulting in adaptive packetization. A very small amount of redundancy is added, not for reconstruction, but to support a possible concealment operation. Thus it is possible to exploit the long-term correlation properties of speech not only for coding, but also for loss recovery. We therefore propose Adaptive Packetization and Concealment (AP/C) to enhance applications' loss resiliency and discuss its applicability in the Internet/MBone environment.
The part of the sender algorithm interfacing with the audio device copies PCM samples from the audio input device to its input buffer. It then computes p(c), the position of the maximum of the auto-correlation function over an input segment of at least 2 pmax samples (pmax being the correlation window size, c the "chunk" number; the evaluation of the auto-correlation function starts at pmin, which constitutes a lower bound on possible chunk/packet sizes). The input buffer pointer is then advanced by p(c) samples (thus delimiting a "chunk"), c is incremented, and, if necessary, new audio samples are fetched from the audio device.
If no periodicity is found in the signal (i.e., the content of the "chunk" is unvoiced speech or noise), p(c) is close to pmax (figure 1). Thus, by applying a fixed bound pu (the minimal length of a chunk classified as "unvoiced") to p(c) and p(c-1), as well as a bound on the first derivative of p(c), speech transitions can be detected. The detection routine may run in parallel and can be combined with silence detection.
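The chunk-size estimation can be sketched as follows (a simplified illustration: the function name, return convention and the normalized-correlation form are our own assumptions; the parameter values are those quoted later in the paper):

```python
import numpy as np

def next_chunk_size(buf, pmin=30, pmax=160, pu=120):
    """Estimate the size of the next chunk from a buffer of PCM samples.

    Returns (p, voiced): the lag p in [pmin, pmax] that maximizes the
    normalized auto-correlation over the first 2*pmax samples of the
    buffer, and a flag that is False when p exceeds pu (the chunk is
    then classified as unvoiced)."""
    seg = np.asarray(buf[:2 * pmax], dtype=np.float64)
    best_lag, best_corr = pmax, -np.inf
    for lag in range(pmin, pmax + 1):
        a, b = seg[:pmax], seg[lag:lag + pmax]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        corr = np.dot(a, b) / denom if denom > 0 else 0.0
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_lag, best_lag <= pu
```

For a strongly periodic (voiced) input the returned lag is the pitch period; for noise-like input the maximum tends to lie near pmax, which is what the pu bound exploits.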
To alleviate the incurred header overhead, which would be prohibitive for IP if every chunk were sent in its own packet, two consecutive chunks are combined into one packet (see figures 1 and 2; s(n): time-domain signal, n: sample number).
If a voiced/unvoiced (vu) transition has been detected, the "transition chunk" is partitioned into two parts ca and cb (8a/b in figure 1), with p(ca) set to p(c-1) and p(cb) = p(c) - p(ca) (p(c) being the original chunk size). Note that if c mod 2 = 0, the chunk c-1 (no. 7 in figure 1) is sent as a packet containing just one chunk.
When an unvoiced/voiced (uv) transition has taken place, backward correlation of the current chunk with the previous one (no. 3 in figure 2) is tested, as the previous chunk may already contain voiced data (due to the forward auto-correlation calculation). If so, the previous chunk is again partitioned, with p(cb-1) = pbackward(c-1) and p(ca-1) = p(c-1) - p(cb-1) (pbackward being the result of the backward correlation). Note that this procedure can be performed only if c mod 2 = 0; otherwise the previous chunk has already been sent in a packet. A solution to this problem would be to always retain two unvoiced chunks and check whether the third contains a transition; however, the gain in speech quality when concealing would not justify the additional delay incurred.
With the above algorithm, "more important" (voiced) speech is sent in smaller packets, and thus the resulting loss impact/distortion is less significant than with fixed-size packets of the same average length, even without concealment (assuming that the network's loss probability is independent of the packet size and the mean number of packets sent remains the same). To enable concealment at the receiver, it is necessary to transmit the intra-packet boundary between the two chunks (i.e., p(c) of the first chunk in the packet) as additional information in the packet itself and in the following packet.
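The assembly of chunks into packets carrying the boundary fields can be sketched as follows (the dictionary-based layout is purely illustrative, and single-chunk transition packets are omitted for brevity):

```python
def assemble_packets(chunks):
    """Pair consecutive chunks into packets. Each packet carries the
    intra-packet boundary (the size of its first chunk) and, redundantly,
    the boundary of the previous packet, so that a receiver can recover
    the chunk sizes of a lost packet from its successor."""
    packets = []
    prev_boundary = 0  # boundary of the previously sent packet
    for i in range(0, len(chunks) - 1, 2):
        first, second = chunks[i], chunks[i + 1]
        packets.append({
            "boundary": len(first),          # split point inside this packet
            "prev_boundary": prev_boundary,  # redundant copy for concealment
            "payload": first + second,
        })
        prev_boundary = len(first)
    return packets
```

When a packet is lost, its total length (from the surrounding sequence numbers and sizes) together with the `prev_boundary` field of the next packet yields both lost chunk sizes.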
With our scheme, the packet size is now adaptive to the measured pitch period. Frequency distributions of packet sizes (weighted by the packet size itself to show the contribution to the entire test signal) for four different speakers in Fig. 3 show that the parameter settings can accommodate a range of pitches, as their overall shapes are similar to each other (parameters were: pmin=30 samples (start offset point of the auto-correlation); pu=120; pmax=160; note that pmin <= l <= 2 pmax). The most common packets contain two voiced chunks (vv packets), as distributions are centered around a value that is twice the mean pitch period (i.e., the mean of voiced chunks).
When detecting a lost packet (by keeping track of RTP sequence numbers), the receiver can assume that the chunks of the lost packet resemble the adjacent chunks, because of the pre-processing at the sender. To avoid discontinuities in the concealed signal, the adjacent chunks are copied and resampled (using a linear interpolator) to exactly fit the lost chunk sizes, which are given by the packet length and the transmitted intra-packet boundaries. No time-scale adjustment is necessary because the chunk sizes are small. Because the sizes of the lost and the adjacent chunk most probably differ only slightly (and thus so do the respective spectra), no significant audible impact of the operation can be observed. Fig. 4 shows the concealment operation in the time domain.
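The resampling step can be sketched with a linear interpolator (a sketch; the function name is our own):

```python
import numpy as np

def conceal_chunk(adjacent, target_len):
    """Resample an adjacent chunk with a linear interpolator so that it
    exactly fits the length of the lost chunk (the lost chunk sizes are
    known from the packet length and the transmitted intra-packet
    boundaries)."""
    src = np.asarray(adjacent, dtype=np.float64)
    # map the target sample positions onto the source chunk's time axis
    x_new = np.linspace(0.0, len(src) - 1, num=target_len)
    return np.interp(x_new, np.arange(len(src)), src)
```

Since the lost and adjacent chunk sizes differ only slightly, the expansion/compression ratio stays close to 1 and the spectral shift introduced by the resampling remains small.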
Transitions in the signal might lead to extreme expansion/compression operations, because the length of an unvoiced chunk of a transition packet (denoted v|u or u|v) will usually be significantly smaller than in u|u packets (two unvoiced chunks). This is due to the chunk partitioning described in section "Adaptive packetization of speech transitions".
Table 1 lists the possible cases. va, ua are (the relevant) voiced/unvoiced available chunks, and vL, uL are (the relevant) voiced/unvoiced lost chunks. A u (u|v) packet is a packet where the second chunk contains an unvoiced/voiced transition that was not recognized by the sender algorithm. To avoid high compression, adjacent samples of the relevant length are taken and inserted in the gap. An audible discontinuity which might occur can be avoided by overlap-adding the concealment chunk with the adjacent ones. High expansions can be avoided by repeating a chunk until the necessary length is achieved and then again overlap-adding it.
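The overlap-add used to avoid audible discontinuities can be sketched as a linear cross-fade between the concealment chunk and an adjacent segment (the ramp length is an assumed parameter):

```python
import numpy as np

def crossfade(a, b, overlap):
    """Join segments a and b, cross-fading `overlap` samples: the tail of
    a fades out linearly while the head of b fades in, avoiding an abrupt
    discontinuity at the junction."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    ramp = np.linspace(0.0, 1.0, overlap)
    mixed = (1.0 - ramp) * a[-overlap:] + ramp * b[:overlap]
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```

The same cross-fade applies at both ends of an inserted concealment chunk, and to the junctions created when a chunk is repeated to cover a longer gap.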
Two properties of modern, frame-based speech coders do not allow a straightforward application of AP/C:
The first problem can only be alleviated by either trading higher loss resiliency against a higher bit-rate (using a nonadaptive codec: PCM) or, as a compromise, using a hybrid (waveform/parametric) codec, where the impact of a packet loss on subsequently decoded speech is less severe (as, e.g., for the G.729 codec).
The second issue should be tackled in the long term by a close integration of coding and packetization, as well as decoding and concealment (+FEC) functions. However, to allow operation together with existing codecs, we evaluate a simple fragmentation scheme.
Fig. 5 shows the packetization, when speech boundaries found by the AP algorithm are used to associate frames of length F to the actual packets sent over the network. As AP packets overlap the frame boundaries, a significant amount of redundant data, as well as additional alignment information (si), needs to be transmitted (yet redundant data can be used in a possible concealment operation: e.g., by overlap-adding it to the replacement signal). To allow analysis, we assume a constant AP packet size of l=kF+n, k and n being positive integers.
The fragmentation data "overhead" associated with packet i can then be written as follows:
For a sequence of N packets, this results in
With F mod n = 0 (0 < n < F), we have Of = N(F-n). Assuming n << F, Of' = Of / (2 pv N) gives an indication of the relative fragmentation overhead which can be expected for different speakers/ranges of packet sizes (pv being the mean pitch period). Table 2 compares that value to measurements. The fragmentation scheme results in a bit-rate increase: e.g., for the G.729 codec, from 8 kbit/s to 12-14.4 kbit/s. Whether this increase is justified by the increased speech quality, or whether, for example, the built-in concealment of G.729 should be used instead, should be evaluated in a separate subjective test for each codec.
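Under the stated assumptions the estimate reduces to Of' = Of / (2 pv N) = (F - n) / (2 pv), which can be computed directly (the example parameter values below are hypothetical, not taken from Table 2):

```python
def relative_fragmentation_overhead(F, n, p_v):
    """Relative fragmentation overhead O_f' = (F - n) / (2 * p_v),
    following O_f = N (F - n) and O_f' = O_f / (2 p_v N) from the text
    (valid when F mod n == 0 and n << F; all quantities in samples)."""
    assert F % n == 0 and 0 < n < F
    return (F - n) / (2.0 * p_v)

# hypothetical example: frame length F = 80 samples (10 ms at 8 kHz),
# residual n = 16 samples, mean pitch period p_v = 60 samples
overhead = relative_fragmentation_overhead(80, 16, 60)
```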
To evaluate the properties and performance of AP/C, a subjective test was carried out. The test signals were the four signals (with different speakers) of approximately 10 seconds each that were also used for the objective analysis (PCM, 16 bit linear, sampled at 8 kHz). The new technique was compared with silence substitution (i.e., adaptive packetization without concealment) and the simple receiver-based concealment algorithm "Pitch Waveform Replication" (PWR), which is the only one able to operate under very high loss rates (isolated losses). For PWR we used the same algorithm and a fixed packet size of 160 samples.
Thirteen nonexpert listeners evaluated the overall quality of 40 test conditions (4 speakers x [3 algorithms x 3 loss rates + original]) on a five-category scale (Mean Opinion Score). Tests took place in a quiet room with the subjects using headphones.
The same packet loss pattern was applied to all input signals for one speaker (note that the sample loss pattern differs, as PWR works on fixed packet sizes only). To allow complete concealment and thus a relative evaluation of the algorithms, only isolated losses were introduced. We therefore used a drop function which satisfies the condition P(i | i-1) = 0 (where P(i | i-1) is the conditional probability of packet i being lost given that packet i-1 has been lost) and at the same time approximates a uniformly distributed loss behavior with a given sample loss rate.
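A drop function with this property can be sketched as a two-state process: a packet can only be dropped if its predecessor arrived, and dropping survivors with probability r/(1-r) yields an overall packet loss rate of r for r < 0.5. (This is our own simplification; the drop function used in the paper additionally matches a given *sample* loss rate over variable-size packets.)

```python
import random

def isolated_loss_pattern(num_packets, target_rate, seed=0):
    """Generate a loss pattern with P(loss_i | loss_{i-1}) = 0.

    A packet may be dropped only if the previous packet was kept; kept
    packets are followed by a drop with probability r/(1-r), which makes
    the stationary loss rate equal to target_rate (requires r < 0.5)."""
    assert 0.0 <= target_rate < 0.5
    rng = random.Random(seed)
    p_drop = target_rate / (1.0 - target_rate)
    lost, prev_lost = [], False
    for _ in range(num_packets):
        drop = (not prev_lost) and rng.random() < p_drop
        lost.append(drop)
        prev_lost = drop
    return lost
```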
Before testing started, an "Anchoring" procedure took place, where the quality range (Original = 5, "Worst Case" signal = 1) was introduced. For this test we used the unconcealed 50% loss signal (with AP) as the "Worst Case" signal.
Figures 6-8 show the mean MOS values for the three algorithms (Silence Substitution, Pitch Waveform Replication, and AP/C). Figure 9 gives the respective standard deviations of the MOS. As loss values we give the actual sample loss rate instead of the packet loss rate, as we deal with variable size packets. The pitch frequency axis refers to the measured mean of voiced chunks.
It can be seen that for all speakers, AP/C leads to a significant enhancement in speech quality compared to the "silence substitution" case, which is maintained also for higher loss rates. However, for speakers with high pitch frequencies, the relative performance (the vertical distance between the surfaces if put in one graph) decreases. A reason for this is the chosen start offset point pmin (= 30 samples) of the auto-correlation computation, which constitutes a lower bound on the chunk/packet size to avoid excessive packet header overhead, but also limits the accuracy of the periodicity measurement (note the small distance between the peak of the packet size distribution and the lower bound in Fig. 3 for the highest-pitched speaker: "female high").
The PWR algorithm performs well for loss rates of about 20%; however, speech quality drops significantly for higher loss rates, as the specific distortions introduced by that algorithm become increasingly audible.
Subjective tests have been performed with PCM samples; this carries the implicit assumption that the speech immediately after the gap is decoded properly (see section "Support for frame-based codecs").
Objective measurements are clearly inappropriate for PWR, as it does not aim at a mathematical approximation of the missing signal segments. AP/C is not a reconstruction scheme either; however, the adaptive packetization and subsequent resampling should perform better in terms of mathematical correctness. The calculated overall SNR values for PWR (for the examples presented in this paper) are always below those for the distorted signal. The SNR values for AP/C are always above those for the distorted signal and at least 4 dB higher than for PWR. This confirms our conjecture; nevertheless, conclusions about speech quality should be based only on the subjective test results.
Table 3 gives the packet header overhead for different speakers, based on the sum of the actually measured packet sizes. For a low average pitch period, the overhead is comparable to a typical parameter setting in IP networks (160 bytes (= 20 ms) of G.711 PCM audio in an IP/UDP/RTP packet [20+8+12 bytes of header], resulting in 20% packet header overhead). It increases, however, with an increasing mean pitch period. Even for higher-pitched voices, the additional packet header overhead stays below 10%, which is comparable to adding a very low bit-rate additional source coding to reconstruct isolated losses.
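The baseline figure can be reproduced directly (header sizes as quoted in the text; the payload size is the G.711 reference setting):

```python
def header_overhead(payload_bytes, header_bytes=20 + 8 + 12):
    """Packet header overhead as a fraction of the total packet size,
    for an IP/UDP/RTP header of 20+8+12 = 40 bytes."""
    return header_bytes / (payload_bytes + header_bytes)

# 160 bytes (20 ms) of G.711 PCM payload -> 40/200 = 20% header overhead
baseline = header_overhead(160)
```

With AP/C the mean payload size shrinks toward twice the mean pitch period, which is why the overhead grows for higher-pitched speakers.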
To support a possible concealment operation, it is necessary to transmit the intra-packet boundary between two chunks as additional information in the packet itself and in the following packet. This amounts to two octets of "redundancy" for every packet, which could, for instance, be transmitted using the proposed redundant encoding scheme.
When the frame length F is significantly smaller than the mean packet size (section "Support for frame-based codecs"), support for frame-based codecs can be assured with a reasonable amount of additional data.
The maximum additional delay introduced in the current implementation consists of the buffering delay at the sender (dS,max), the additional playout delay at the receiver (dR,max), and the computation time for the concealment operation (dC,max).
The computational complexity is low at the sender and very low at the receiver, as only simple operations (auto-correlation, sample-rate conversion) have to be performed (thus dC,max << dS,max + dR,max). This makes the scheme well suited for multicast environments with low-end receivers.
Backwards compatibility with existing audio tools is ensured, as most tools can properly receive variable-length PCM packets (and then mix them into their output buffer); however, delay-adaptation algorithms might need to be modified.
<proposed RTP encaps>
A technique for the concealment of lost speech packets has been presented. The core idea of preprocessing a speech signal at the sender to support possible concealment operations at the receiver has proven to be successful. It results in an inherent adaptation of the network to the speech signal, as predefined portions of the signal ("chunks" assembled to packets) are dropped under congestion.
The subjective quality, when using AP/C in conjunction with existing frame-based codecs, needs to be evaluated in further subjective tests. However, a more efficient scheme integrating the coder and appropriate packetization should be devised. We also plan to test more sophisticated speech classification/processing algorithms, yet always taking into account the compromise of quality and computational complexity.
From the perspective of the network, the presented application-level scheme could be complemented by influencing loss patterns at congested routers (queue management), thus also supporting more fairness between flows by avoiding bursty losses within one flow.
We are grateful to members of the GloNe (Global Networking) research group at GMD Fokus for discussions and participation in the subjective test.
This work was funded in part by the BMBF (German Ministry of Education and Research) and the DFN (German Research Network) and in part by the EEC within the ACTS project AC012 MULTICUBE.