Robust Audio Streaming Over IP

Andrew G. DAVIS <andrew.g.davis@bt.com>
Rory S. TURNBULL <rory.turnbull@bt.com>
Matthew D. WALKER <matthew.d.walker@bt.com>
BT
United Kingdom

Abstract

The effects of packet loss on the transmission of streamed audio over Internet protocol (IP) can be mitigated by the inclusion of redundancy in the transmitted stream. Layered coding schemes, where output audio properties can be built up by the addition of successive sub-streams, are ideally suited to this application, with the redundancy consisting of one or more sub-streams of the primary layer bit-stream. This paper introduces the design of a layered audio codec based around MPEG-2 layer-2 and describes a scheme for its use in robust audio streaming over IP. A subjective evaluation of the scheme is presented for music under simulated packet-loss conditions.

1. Introduction
2. MP2S layered codec
3. MP2S for robust audio streaming
4. Subjective testing
- 4.1 Methodology
- 4.2 Results
  - 4.2.1 Analysis of results
  - 4.2.2 Interpretation of results
5. Conclusions
6. References

1. Introduction

1.1 Streamed audio over IP

Many Internet users have used some form of streamed real-time service, maybe in the form of Internet telephony, radio, or even television [1,2,3]. Such services aim to keep a continuous signal, typically but not necessarily audio or video, available at the receiver in real time for uninterrupted output. Variations in packet arrival time are handled by suitable build-out buffering [4] within the decoder, at the cost of introducing delay to the streams. Late packets that cannot be handled by build-out must be treated as lost, and some form of packet-loss recovery process will generally be initiated. For the delivery of audio signals, the effects of packet loss will depend on the length of the audio segment lost, the nature of the audio signal, and the packet-loss recovery process.

1.2 Packet loss recovery

Packet loss recovery processes may be classed as redundant or nonredundant. Redundant methods require the transmission of extra information that may be used when packets are lost. If no packets are lost then this information would not be used and would therefore be redundant. An example of such a scheme is the transmission of packets containing one new encoded audio frame together with a repeat of the previous audio frame [5,6,7]. The repeated information is generally coded to a lower bit rate than the primary stream. Fallback is a term given to the packet-loss recovery process that is initiated when this redundant information is used.

Nonredundant techniques require the transmission of no extra information, but instead rely on processing within the decoder, such as repeating previous waveforms [8] or parameters, to reduce the effects of packet loss.

1.3 Compression for streamed audio

Compression of an audio stream offers the advantages of not only reducing general network loading but also increasing the coverage of the service over lower bandwidth (integrated services digital network [ISDN] or public switched telephone network [PSTN] modem) connections. Using compression can also make the increased bit-rate requirements of transmitting redundant information more acceptable, offering streams of lower gross bit rate but higher robustness. Schemes can then be tailored to specific applications, using different rates and qualities of audio compression within each level of redundancy [6]. However, in the use of such techniques, careful consideration must be given to the transitions between codec types within the decoder. If the primary and fallback codecs are of different types, then how should inter-frame memory for the fallback decoders be handled? The options include using memoryless fallback codecs, appending state information onto appropriate packets, maintaining memory by running fallback decoders every frame, and using a memory reset prior to decoding. Each option has benefits and limitations.

The use of memoryless codecs is restrictive, especially so if low-rate voice coding is required, and appending state information might incur significant overhead. Running the primary and fallback decoders together increases processing requirements, which may or may not be an issue depending on processor and codec types and the loading from concurrent tasks. Reset and decode, where memory states are reset prior to a new fallback decode sequence, will work but is not ideal; the amount of performance degradation will depend on the signal and codec type. However, layered coding schemes can supply both primary and fallback streams without any of the limitations outlined earlier.

1.4 Layered coding

A coding scheme can be considered to be layered if the encoded bit-stream can be split into a number of lower bit-rate sub-streams, combinations of which can be decoded to give valid output signals. Output signal properties, such as bandwidth and signal-to-noise ratio, can typically be scaled according to the combination of sub-streams selected. Decoding can be performed on complete layers, which are built from the primary sub-stream upward.

If the chosen codec type for a robust streaming application is layered, then layers can be used to form fallback streams of lower rates. Only one encoder is then required to produce all the streams, and in many designs memory continuity in the decoder can be preserved.

1.5 Robust audio streaming

The aim of the work presented here was to design a scheme for the robust streaming of audio over the Internet. Work was focused on utilizing the properties of layered audio coding to provide primary and fallback streams for redundant packet-loss recovery.

First, the design of a layered audio coding algorithm (MP2S) [9] is introduced. Next, the structure of a robust audio streaming tool is described. The tool offers redundant packet-loss recovery, with primary and fallback streams supplied by the layered MP2S encoder. A novel sub-stream repeat technique is described, which renders the transitions from primary to fallback streams less noticeable. Finally, an assessment of the performance of the audio streaming tool is presented for simulated packet-loss conditions, with consideration given to the benefits of the sub-stream repeat technique.

2. MP2S layered codec

The MP2S layered codec [9] is based around the MPEG-2 layer-2 audio coding standard [10]. MP2S encodes 16 kHz sampled audio signals within 72 ms frames (1152 samples) and uses psychoacoustic masking theory to allocate bits to 32 evenly spaced frequency sub-bands. Layered designs of sub-band sample quantization and bit allocation have been employed to produce four sub-streams, which can be combined as shown in table 1 to produce four layers.

**Table 1. Layers supported by MP2S**
Layer	Sub-streams	Total Bit Rate	Audio Bandwidth
1	1	8 kbit/s	1.25 kHz
2	1 & 2	16 kbit/s	2.5 kHz
3	1, 2 & 3	32 kbit/s	5 kHz
4	1, 2, 3 & 4	64 kbit/s	7.5 kHz

The layers increment both the audio bandwidth and signal-to-noise ratio. Layers 3 and 4 have sufficient audio bandwidth to be used as primary streams, while layers 1 and 2 are aimed predominantly at offering fallback streams. All lower sub-streams must be present at the decoder for any given layer to be decoded.
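As a worked check on table 1, the payload size of one encoded frame follows from the frame duration and the cumulative layer bit rate. The sketch below is illustrative only: the names are ours rather than part of MP2S, the per-sub-stream rates are inferred from the differences between adjacent layer rates in table 1, and any packet overhead is ignored.

```python
# Illustrative arithmetic for table 1; names are hypothetical, not MP2S identifiers.
SAMPLE_RATE_HZ = 16_000
SAMPLES_PER_FRAME = 1152
LAYER_RATE_KBITS = {1: 8, 2: 16, 3: 32, 4: 64}  # cumulative bit rate per layer


def frame_duration_ms() -> float:
    """Duration of one frame in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ


def layer_frame_bytes(layer: int) -> int:
    """Bytes needed to carry one frame of the given layer."""
    bits = LAYER_RATE_KBITS[layer] * frame_duration_ms()  # kbit/s * ms = bits
    return int(bits // 8)


def sub_stream_rate_kbits(sub_stream: int) -> int:
    """Rate contributed by one sub-stream (assumed: difference of layer rates)."""
    lower = LAYER_RATE_KBITS.get(sub_stream - 1, 0)
    return LAYER_RATE_KBITS[sub_stream] - lower
```

Under these assumptions a layer 4 frame occupies 576 bytes and a layer 2 frame 144 bytes, so adding a layer 2 fallback to a layer 4 primary stream raises the payload by a quarter.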

The MP2S decoder operates in such a way that, for any given layer, memory is inherently updated in the decoder for all lower sub-streams. Therefore, using MP2S for primary and fallback streams will avoid all memory continuity problems in the decoder.

3. MP2S for robust audio streaming

The MP2S codec forms the core of an audio streaming tool in which the effects of packet loss on various audio streams can be simulated. The tool allows various packet-loss recovery processes, including redundancy, to be evaluated. For redundancy, packets can be constructed with primary and fallback streams selected from various combinations of MP2S layers. Packet losses can be injected manually, through a prestored profile, or via a mathematical model.

3.1 Encoder

The tool requires the user to specify a primary layer from the four supported by the MP2S codec. The primary layer defines the audio quality for zero packet-loss conditions, and it is intended that layer 3 or layer 4 be used for this purpose. The user then has the option of specifying a set of fallback layers, with bit rates equal to or less than that of the primary stream. In the event of a single lost packet, the first fallback layer will be used in place of the primary stream. If further fallback layers are specified, then these will be used when consecutive packets are lost.

3.2 Packet construction

The MP2S encoder will compress an input audio stream and generate sub-streams up to the specified primary layer number. The encoded layers can then be constructed by combining the relevant sub-streams. Encoded layers can then be assembled into packets to produce the desired primary and fallback configuration.

The MP2S encoder compresses the input audio stream on a frame-by-frame basis, where each input frame consists of 1152 samples. This means that all the output sub-streams and layers appear in discrete blocks, or encoded frames. The tool allows the encoded-frames-per-packet ratio to be defined, but consideration here is given only to the case of one encoded frame per packet.

One packet of encoded data is constructed for each encoded audio frame, using a combination of current and delayed frames. Figure 1 illustrates packet construction for a scheme with a layer 4 primary stream and layer 3, layer 2, and layer 1 fallback streams. It is evident from figure 1 that the primary stream is delayed by three frames, as a result of applying three fallback layers.

Figure 1. Packet construction using MP2S for redundant packet-loss recovery.

Once the appropriate layers have been assembled, a 1-byte sequence number is appended to complete the packet. The sequence number is incremented for each successive packet and cycled through the 1-byte unsigned integer range of 0 to 255. The decoder uses the received sequence number value to organize build-out buffering and initiate packet-loss recovery.
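The packet assembly described in this section can be sketched as follows. This is a minimal illustration rather than the tool's actual implementation: the function name, the in-memory frame representation, and the field order (primary first, then fallbacks, then the sequence byte) are assumptions, and no length fields are included because each layer is assumed to have a fixed frame size.

```python
from typing import Dict, Sequence

# Hypothetical representation: layer number -> encoded bytes for one audio frame.
EncodedFrame = Dict[int, bytes]


def build_packet(frames: Sequence[EncodedFrame], newest: int, primary_layer: int,
                 fallback_layers: Sequence[int], seq: int) -> bytes:
    """Assemble one packet whose most recent content is frame index `newest`.

    The primary layer is taken from the oldest frame in the packet, so it is
    delayed by one frame per fallback layer; each fallback layer is taken from
    a progressively newer frame.
    """
    depth = len(fallback_layers)
    oldest = newest - depth
    if oldest < 0:
        raise ValueError("not enough encoded frames buffered yet")
    payload = frames[oldest][primary_layer]
    for offset, layer in enumerate(fallback_layers, start=1):
        payload += frames[oldest + offset][layer]
    # The 1-byte sequence number wraps through 0..255.
    return payload + bytes([seq % 256])
```

With a layer 4 primary and fallback layers 3, 2, and 1, as in the example of section 3.3.2, the packet built when frame 3 has just been encoded carries layer 4 of frame 0, layer 3 of frame 1, layer 2 of frame 2, and layer 1 of frame 3.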

3.3 Packet-loss recovery

The decoder build-out process aims to maintain valid data in the real-time play-out buffer. Build-out introduces delay so that some degree of variation in the packet arrival time can be handled. If after build-out the play-out buffer occupancy cannot be maintained, then a packet-loss recovery process is initiated. This can be caused by either a lost packet or one that arrives too late to be inserted into the real-time output.
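One way to picture the build-out decision is as a per-packet play-out deadline. The sketch below is an assumption about how such a check might look, not a description of the tool: it supposes that the play-out clock starts when the first packet arrives, that one 72 ms frame is consumed per packet, and that a packet arriving after its deadline is treated exactly like a lost one.

```python
from typing import List, Optional, Sequence

FRAME_MS = 72  # MP2S frame duration: 1152 samples at 16 kHz


def usable_packets(arrival_ms: Sequence[Optional[float]],
                   buildout_ms: float) -> List[bool]:
    """Flag each packet as usable (True) or effectively lost (False).

    arrival_ms[i] is the arrival time of packet i, or None if it never arrived.
    Assumes packet 0 arrives and that its arrival starts the play-out clock.
    """
    first = arrival_ms[0]
    if first is None:
        raise ValueError("this sketch assumes the first packet arrives")
    flags = []
    for index, arrival in enumerate(arrival_ms):
        # Frame `index` must be available by this time to be played out.
        deadline = first + buildout_ms + index * FRAME_MS
        flags.append(arrival is not None and arrival <= deadline)
    return flags
```

A larger `buildout_ms` tolerates more arrival-time variation but adds the same amount of end-to-end delay.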

The robust audio streaming tool includes three different packet-loss recovery processes:

Parameter repeat
Fallback
Fallback with sub-stream repeat

3.3.1 Parameter repeat

Here, the most recent valid frame may be passed through the decoder again. This effectively repeats the previously decoded frame, but with an additional contribution from interframe decoder memory. For many slowly varying "near-stationary" signals, a parameter repeat can render a single lost frame nearly inaudible. However, a parameter repeat performed on more quickly varying signals or for two or more successive lost frames can introduce annoying artifacts in the output audio signal.

3.3.2 Fallback

Unlike a parameter repeat, which requires no extra information to be transmitted, fallback requires redundancy to be added to the transmitted stream. For the example in figure 1, each received packet contains encoded versions of four consecutive audio frames. If the packet has been received within the build-out period, then the primary encoded frame will be decoded and the resulting audio buffered for output. The three accompanying fallback encoded frames are then added to a fallback buffer for possible future use. If the next packet is lost then the first fallback frame, layer 3 in this case, from the buffer is decoded. Similarly, if the next two packets are also missing, then the layer 2 and layer 1 encoded frames from the fallback buffer are also decoded. In this way, three consecutive lost packets would be handled by the decoding of layer 3, layer 2, and then layer 1 frames.
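The fallback selection just described can be simulated without any audio data. The following sketch is illustrative and uses names of our own choosing; it assumes that packet t carries the primary frame for play-out slot t and that its ith fallback frame covers slot t + i, and it returns `None` where neither primary nor fallback data is available (where a parameter repeat might be applied instead).

```python
from typing import List, Optional, Sequence


def played_layers(received: Sequence[bool], primary_layer: int,
                  fallback_layers: Sequence[int]) -> List[Optional[int]]:
    """Return the layer decoded for each play-out slot, or None if nothing is available."""
    out: List[Optional[int]] = []
    for slot, ok in enumerate(received):
        choice: Optional[int] = primary_layer if ok else None
        if choice is None:
            # Search earlier packets, nearest first, for a fallback covering this slot.
            for age, layer in enumerate(fallback_layers, start=1):
                if slot - age >= 0 and received[slot - age]:
                    choice = layer
                    break
        out.append(choice)
    return out
```

For a layer 4 primary with fallback layers 3, 2, and 1, a run of three lost packets is covered by layers 3, 2, and 1 in turn, and a fourth consecutive loss leaves the slot uncovered.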

The delay offset between the primary and fallback streams and the quantity of redundancy applied can be adjusted so that a stream becomes more robust to the packet-loss characteristics of a specific network [11].

3.3.3 Sub-stream repeat

The use of the layered MP2S codec to support primary and fallback streams results in similar distortion characteristics for both streams. This is in contrast to the distortion that might occur when switching between an MPEG-type primary stream and a code excited linear prediction (CELP)-type fallback stream. However, while the decoded frequency bands for the MP2S layers may have similar noise characteristics, the reduction in encoded bandwidth when switching to a lower layer can produce annoying effects. This bandwidth switching distortion can be greatly reduced, and for many signals made imperceptible, by combining a sub-stream repeat with the fallback stream.

A sub-stream repeat reuses the higher frequency information from the last received primary encoded frame to enhance a reduced bandwidth fallback frame. Figure 2 shows a scheme using a primary MP2S layer 4 stream and redundant layer 2 stream. Here, fallback introduces a transition from a frame of 64 kbit/s 7.5 kHz bandwidth audio to 16 kbit/s 2.5 kHz bandwidth audio. Applying a sub-stream repeat will cause sub-streams 3 and 4 (2.5-7.5 kHz) to be repeated from the previous packet and combined with the fallback layer 2 stream (0-2.5 kHz). This will generate an audio output with full 7.5 kHz bandwidth when decoded.
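At the bit-stream level, a sub-stream repeat amounts to filling in the missing upper sub-streams of the fallback frame from the last primary frame. A minimal sketch, assuming each encoded frame is held as a mapping from sub-stream index to bytes (a representation chosen here for illustration, not taken from the MP2S codec):

```python
from typing import Dict

# Hypothetical representation: sub-stream index (1..4) -> encoded bytes.
SubStreams = Dict[int, bytes]


def apply_sub_stream_repeat(fallback: SubStreams,
                            last_primary: SubStreams) -> SubStreams:
    """Combine a reduced-bandwidth fallback frame with repeated upper sub-streams."""
    combined = dict(fallback)
    for index, data in last_primary.items():
        # Keep the fallback's own sub-streams; borrow only the ones it lacks.
        combined.setdefault(index, data)
    return combined
```

For a layer 2 fallback frame, sub-streams 1 and 2 come from the fallback data for the lost frame, while sub-streams 3 and 4 are repeated from the previous layer 4 frame.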

Figure 2. Packet-loss recovery using redundancy and sub-stream repeat.

4. Subjective testing

To evaluate the performance of MP2S as a codec for robust audio transmission, a series of subjective tests was carried out using samples generated by the audio streaming tool. The results were analyzed to quantify any benefit gained through the use of redundancy and sub-stream repeats under simulated packet-loss conditions.

4.1 Methodology

The following packet-loss recovery processes were applied to an MP2S layer 4 stream:

Parameter Repeat (PR) [ref. 3.3.1] -- A nonredundant recovery method that involves repeating the previous layer 4 frame.
Redundancy (R) [ref. 3.3.2] -- Recovery relies on an additional MP2S layer 2 stream.
Redundancy and Sub-stream Repeat (R/SR) [ref. 3.3.3] -- As redundancy, but with the repetition of sub-streams 3 and 4 from the previous layer 4 frame.

Nineteen subjects were exposed to numerous pairs of short (8 s) music samples. The first sample in a pair had been encoded using MP2S layer 4. The second sample in a pair had been subjected to packet loss with a rate of either 10 or 20 percent. In these tests, only one level of redundancy was applied, so the packet-loss profile was restricted to contain single packet losses. This is not necessarily representative of real network behavior [12]; however, the level or offset of the redundancy could easily be modified to cope with a specific packet-loss model [11] at the expense of additional delay and/or bandwidth.
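The paper does not state how the single-loss profiles were generated. One assumed way to produce a profile with a given loss rate and no two adjacent losses is sketched below; the function name and the seeded random generator are ours.

```python
import random
from typing import List


def isolated_loss_profile(num_packets: int, loss_rate: float,
                          seed: int = 0) -> List[bool]:
    """Return a list that is True where a packet is lost; no two losses are adjacent."""
    num_lost = round(num_packets * loss_rate)
    slots = num_packets - num_lost + 1
    if num_lost > slots:
        raise ValueError("loss rate too high for isolated losses")
    rng = random.Random(seed)
    # Pick distinct slots, then shift the ith pick by i so that neighbours
    # end up at least two positions apart.
    picks = sorted(rng.sample(range(slots), num_lost))
    lost = [False] * num_packets
    for i, pick in enumerate(picks):
        lost[pick + i] = True
    return lost
```

At one 72 ms frame per packet, an 8 s sample spans roughly 111 packets, so a 10 percent rate corresponds to about 11 isolated losses per sample.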

Listeners were asked to grade the quality of the second sample when compared with the first in accordance with the degradation category ratings:

5: Degradation is inaudible
4: Degradation is audible but not annoying
3: Degradation is slightly annoying
2: Degradation is annoying
1: Degradation is very annoying

Listeners were given a variety of both classical (Rachmaninov) and pop music (Lightning Seeds) extracts to evaluate. The test sequence was randomized and a number of reference conditions were included, where the packet-loss rate was effectively zero percent. Also, every test started with a number of preliminaries, which were ignored when the results were analyzed. These preliminaries were inserted to allow the subjects to become accustomed to the 1 to 5 scale being used.

4.2 Results

The results obtained from each category were averaged to produce a degradation mean opinion score (DMOS) value. Table 2 shows these results along with the standard deviations (SDs) and 95 percent confidence intervals (CIs) for each of the individual categories. The graphs in figures 3 to 5 show these results with the DMOS values plotted against packet-loss rate.
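The per-category statistics can be computed along the following lines. This is a sketch under stated assumptions rather than the authors' procedure: it uses the sample standard deviation and a normal-approximation interval, reporting the half-width 1.96 · SD / √n, and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def dmos_summary(ratings: Sequence[int]) -> Tuple[float, float, float]:
    """Return (mean, sample SD, 95% CI half-width) for one test category."""
    mean = float(statistics.mean(ratings))
    sd = float(statistics.stdev(ratings))
    # Normal approximation; an assumption, since the paper does not give the formula.
    ci_half_width = 1.96 * sd / math.sqrt(len(ratings))
    return mean, sd, ci_half_width
```

Under this formula, an SD of 0.73 with a half-width of 0.19 would correspond to somewhere between about 54 and 60 ratings per category, although the number of ratings per category is not stated in the text.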

**Table 2. Subjective test results for MP2S under packet loss, organized by category**
Music Type	Recovery Method	0% loss: Mean	0% loss: SD	0% loss: 95% CI	10% loss: Mean	10% loss: SD	10% loss: 95% CI	20% loss: Mean	20% loss: SD	20% loss: 95% CI
Pop	PR	4.93	0.26	0.07	2.60	0.73	0.19	1.74	0.77	0.20
Pop	R	4.93	0.26	0.07	2.07	0.73	0.19	1.39	0.53	0.14
Pop	R/SR	4.93	0.26	0.07	3.63	0.62	0.16	3.21	0.70	0.18
Classical	PR	4.93	0.26	0.07	3.35	0.74	0.19	2.54	0.80	0.21
Classical	R	4.93	0.26	0.07	1.98	0.77	0.20	1.61	0.62	0.16
Classical	R/SR	4.93	0.26	0.07	4.77	0.42	0.11	4.46	0.63	0.16
Combined	PR	4.93	0.26	0.05	2.97	0.83	0.15	2.14	0.88	0.16
Combined	R	4.93	0.26	0.05	2.03	0.75	0.14	1.50	0.58	0.11
Combined	R/SR	4.93	0.26	0.05	4.20	0.78	0.14	3.83	0.91	0.17

Figure 3. Results of MP2S subjective tests with pop music.

Figure 4. Results of MP2S subjective tests with classical music.

Figure 5. Average of all MP2S subjective tests.

4.2.1 Analysis of results

The complete set of results was analyzed using linear regression analysis, which gives a measure of the relationships between the response variable Y (DMOS scores) and the p regressor variables X₁, X₂, ..., X_p (test conditions). For each y_i, there are values x_i = (x_i1, x_i2, ..., x_ip). The model takes the general form

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

where β = (β₁, β₂, ..., β_p) is a vector of p unknowns and ε_i is an error term associated with the ith subject. The aim of the analysis is to compute the β values such that the residual sum of squares (RSS) is minimized, where RSS is defined as

$$\mathrm{RSS} = \sum_{i} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2$$

This was done using a generalized linear modeling software package.

A model was built using the forward search technique, starting with the null model and testing, at the 1 percent level, the introduction of each new regressor with its F value,

$$F = \frac{(\mathrm{RSS}_a - \mathrm{RSS}_b) / (\mathrm{df}_a - \mathrm{df}_b)}{\mathrm{RSS}_b / \mathrm{df}_b}$$

where model b is more complex than model a, and df_a is the degrees of freedom for model a (df_b likewise for model b).
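A single forward-search step can be illustrated with the standard F statistic for nested least-squares models, F = ((RSS_a − RSS_b) / (df_a − df_b)) / (RSS_b / df_b), where df denotes residual degrees of freedom. The sketch below compares a null model (mean only) with a model that adds one regressor; the function names and the ratings used in the usage note are hypothetical and are not taken from the experiment.

```python
from typing import Sequence


def rss_null(y: Sequence[float]) -> float:
    """RSS of the null model, which predicts the mean for every observation."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def rss_one_regressor(x: Sequence[float], y: Sequence[float]) -> float:
    """RSS of the least-squares fit y = intercept + slope * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def f_value(rss_a: float, df_a: int, rss_b: float, df_b: int) -> float:
    """F statistic for adding regressors to model a to obtain model b."""
    return ((rss_a - rss_b) / (df_a - df_b)) / (rss_b / df_b)
```

For example, with made-up ratings `[5, 5, 4, 3, 2, 3]` at loss rates `[0, 0, 10, 10, 20, 20]`, the null model has n − 1 = 5 residual degrees of freedom and the one-regressor model has n − 2 = 4. The resulting F value would then be compared with the 1 percent critical value of the F distribution, which this sketch does not compute.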

This model was used, along with the 95 percent CIs, to formulate the following statistically valid statements:

The best recovery method is redundancy and sub-stream repeat (R/SR), followed by parameter repeat (PR), followed by redundancy (R).
Pop music results in subjectively poorer scores than classical music, but redundancy performs better for pop music than for classical music.
Increased packet loss results in subjectively poorer scores.
Redundancy and sub-stream repeat (R/SR) performs better than the other methods and is more robust as packet loss increases.

Note that this analysis demonstrates that the trends suggested in figures 3, 4, and 5 are statistically significant and not merely random variation in the DMOS scores.

4.2.2 Interpretation of results

The results confirm that redundancy (R) alone does not conceal packet loss as well as other techniques. This most probably results from the momentary drop in audio bandwidth when the fallback layer is used. However, by using redundancy and sub-stream repeat (R/SR), the bandwidth is maintained and the overall robustness of the stream can be increased dramatically.

The results also show evidence that as the packet-loss rate increases, the performance gain from using redundancy and sub-stream repeat (R/SR) over the other methods is increased.

5. Conclusions

This paper has introduced some of the problems associated with the real-time streaming of audio over the Internet, in particular, the problem of concealing the effects of packet loss. We have adopted the idea of applying compression and redundancy to provide a robust stream, and have presented a layered codec that overcomes some of the technical problems currently associated with the application of redundancy.

An audio test tool was used in an assessment of the packet-loss robustness of the layered codec when used for streaming with redundancy. A novel sub-stream repeat technique was introduced as a method for increasing this robustness. The statistical evaluation of the results clearly showed the benefits of combining the sub-stream repeat and redundant streaming techniques, resulting in a significant performance gain over decoder-based parameter repeat techniques.

6. References

[1] Vocaltec Communications, Internet Phone, http://www.vocaltec.com/

[2] RealNetworks. RealAudio and RealVideo, http://www.real.com/

[3] Xing Technology Corporation. Streamworks, http://www.xingtech.com/

[4] B. Dempsey and Y. Zhang, Destination Buffering for Low-Bandwidth Audio Transmissions using Redundancy-Based Error Control, Proceedings of LCN, 21st Annual Conference on Local Computer Networks, October 1996, pp 345-54.

[5] J. Bolot and A. Garcia, Control Mechanisms for Packet Audio in the Internet, Proceedings of Institute of Electrical and Electronic Engineers (IEEE) INFOCOM '96, Conference on Computer Communications, March 1996, pp 232-9.

[6] V. Hardman, M. Sasse, M. Handley, and A. Watson. Reliable audio for use over the Internet, Proceedings of INET'95, June 1995, pp 27-30.

[7] V. Hardman. Robust Audio Tool (RAT) Project, http://www-mice.cs.ucl.ac.uk/multimedia/software/rat/

[8] O. Wasem, D. Goodman, C. Dvorak, and H. Page, The Effect of Waveform Substitution on the Quality of PCM Packet Communications, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 3, March 1988, pp 342-348.

[9] A. Davis and R. Turnbull, A Layered Audio Codec Operating at 8, 16, 32 and 64 kbit/s, Unpublished, September 1998.

[10] International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) MPEG, International Standard IS-13818-3, "Information Technology - Generic Coding of Moving Pictures and Associated Audio, Part 3: Audio"

[11] I. Kouvelas, O. Hodson, V. Hardman, and J. Crowcroft. Redundancy Control in Real-Time Internet Audio Conferencing, AVSPN'97 - International Workshop on Audio-Visual Services over Packet Networks, September 1997.

[12] J. Bolot, H. Crepin, and A. Garcia, Analysis of Audio Packet Loss in the Internet, Proceedings of NOSSDAV '95, Fifth International Workshop on Network and Operating System Support for Digital Audio and Video, April 1995, pp 163-74.
