Scalable and Reliable Multicast File Transfer Architecture

Manolo SOLA <sola@jet.es>
Waseda University
Japan

Masataka OHTA <mohta@necom830.hpcl.titech.ac.jp>
Tokyo Institute of Technology
Japan

Yoichi MURAOKA <muraoka@muraoka.info.waseda.ac.jp>
Waseda University
Japan

Toshinori MAENO <tmaeno@hpcl.titech.ac.jp>
Tokyo Institute of Technology
Japan

Abstract

In this paper we propose an end-to-end, point-to-multipoint, scalable, reliable, and transmission control protocol (TCP)-friendly file transfer architecture.

The host acting as the source encodes a file with a weak forward error correction code, packetizes it, and transmits all packets over several multicast groups at different transmission rates. As long as there are members for a multicast group, the host periodically retransmits all packets through that multicast group, each time in a different pseudo-random order.

A host acting as a receiver starts the reception process by joining the multicast group with the lowest transmission rate. If the host does not experience any packet loss during a period of time, the host leaves the current multicast group and joins the one with next higher transmission rate. If the host detects a packet loss, it leaves the current multicast group and joins the one with next lower transmission rate. If no packets are received in a period of time, the host interprets that connectivity has been lost and leaves the current multicast group. A host leaves the multicast group and finishes the reception process when it has enough packets to decode the original file.

The architecture scales to a huge number of receivers without modification to multicast routers, and, except for usual multicast control packets, needs no feedback between receivers and the source, nor among receivers.

In the paper we further rationalize our architecture and compare it with other reliable multicast proposals (SRM, DTRM, RMTP, LBRRM, TMTP, and RMDP).

1. Introduction
2. Brief description of the architecture
3. Rationale of the architecture
4. Detailed description of the architecture
5. Criteria for evaluating reliable multicast transport and application protocols
6. Comparison with other proposals
7. Conclusions
8. References

1. Introduction

Most of the traffic in today's Internet is transmission control protocol (TCP) traffic. Some links' measures within the MCI and vBNS (very high speed backbone network service) backbones show that TCP accounts for more than 95 percent of the total number of bytes transmitted over those links [Claffy98]. That means that reliable transmission of data between two systems constitutes the main service for which the Internet is being used.

An unreliable transport protocol such as user datagram protocol (UDP) could also be used for error-free transmission between two hosts given that reliability is included at both hosts' application layers. An advantage of TCP is the TCP's behavior of slow start, congestion avoidance, fast retransmission, and fast recovery, which allow TCP connections to efficiently share and adapt to the available bandwidth in the Internet. As TCP traffic adapts itself to the load of the Internet and UDP does not, UDP traffic unregulated by the application layer is considered harmful in the sense that it does not avoid or respond to traffic congestion, always forcing TCP traffic to back off.

Multicast is an alternative to unicast when a host wants to simultaneously transmit the same packets to more than one host. At the receiver side, however, as current multicast protocols are designed to work over UDP, the reception of all the packets by all hosts is not assured. As multicast is based on UDP, unregulated multicast is also considered harmful for TCP traffic.

Another difference between TCP and UDP is the scope of its application. TCP is used when reliability is a must, and UDP when real-time is the main objective. Reliability and real-time are basically opposite concepts. Real-time implies a stringent delay imposed on reception. If the information does not arrive before a given time, that information becomes useless and the receiver considers it lost and cannot wait for it anymore. In that sense, real-time applications cannot tolerate more than a given delay and, because of this fact, can and must tolerate some possibility of packet loss. On the other hand, reliability implies a strict information loss constrain imposed on reception. If some information is lost during transmission, the receiver must wait for a new retransmission, which may also be lost and would have to be retransmitted once again. In that sense, reliable applications cannot tolerate more than a given packet loss, and because of this fact, can and must tolerate any delay.

Thus, it is our explicit nongoal to have reliable real-time multicast.

TCP's reliability and flow control is provided only by hosts, not by routers. This approach follows the so-called "end-to-end argument." According to the Internet Architecture Board's view on the architectural principles of the Internet [RFC1958],

The basic argument is that, as a first principle, certain required end-to-end functions can only be performed correctly by the end-systems themselves. A specific case is that any network, however carefully designed, will be subject to failures of transmission at some statistically determined rate. The best way to cope with this is to accept it, and give responsibility for the integrity of communication to the end systems.

TCP provides reliability and adaptability to network conditions for point-to-point transmissions, and the purpose of this paper is to propose an equivalent architecture for point-to-multipoint transmissions. The four requirements that we impose to the architecture are (i) reliability, (ii) scalability, (iii) TCP-friendly adaptability, and (iv), following the "end-to-end argument," that TCP-friendliness be provided only by hosts, not by routers.

2. Brief description of the architecture

The host acting as the source encodes a file with a weak forward error correction (FEC) code, packetizes it, and transmits all packets over several multicast groups at different transmission rates. As long as there are members for a multicast group, the host periodically retransmits all packets through that multicast group, each time in a different pseudo-random order.

The architecture scales to a huge amount of receivers without modification to multicast routers, and, except for usual multicast control packets, needs no feedback between receivers and the source nor among receivers.

3. Rationale of the architecture

Let us briefly discuss how the proposed architecture verifies the four requirements that we imposed at the end of section 1.

3.1. Reliability

Packets sent by the source can be dropped by the network. Because of the scalability and the end-to-end requirements, there cannot be any feedback from receivers, so reliability has to be provided by the source retransmitting all packets into a multicast group while the multicast group has receivers. A receiver that has not enough packets to decode the original file should be expecting to receive those packets through one of the available multicast groups. The use of a FEC code by the source makes the number of retransmissions smaller.

3.2. Scalability

A huge amount of receivers may simultaneously inform the source that they want to receive the file, and the source may have to simultaneously transmit the same packets to a huge number of receivers. The architecture scales to this situation by using a multicast protocol to send packets from the source to receivers, and by not allowing receivers to send any feedback to the source or to other receivers. The architecture assumes that the multicast protocol can scalably detect and inform the source about the presence of at least one receiver for a multicast group. With CBT [RFC2189] or PIM-SM [RFC2362], it is necessary that the source be collocated with a Core/RP.

3.3. TCP friendliness

The source and all receivers should coordinate among themselves in order to behave in a TCP-friendly manner. As there is no feedback concerning rate information being merged on routers from receivers to the source, and receivers must be able to reduce or increase the speed at which the source is sending packets, the source must provide several multicast groups at different transmission rates.

3.4. End-to-end argument

Reliability and TCP-friendliness control is introduced at the application layer and only in hosts. The architecture assumes that the underlying multicast protocol scales to a huge number of receivers.

4. Detailed description of the architecture

4.1. Packet format

First, the source divides the file to transmit into segments and generates additional segments of FEC code. Each segment is sent in one packet.

As segments are sent in pseudo-random order, a segment number is appended to each segment to make possible the correct reordering of segments at the receiver. The source must generate the UDP checksum. The source also appends a transmission sequence number to each segment. Two packets sent consecutively in a multicast group must have consecutive transmission sequence numbers. Receivers will detect packet loss when two packets arriving consecutively in a multicast group do not have consecutive transmission sequence numbers.

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |         UDP Checksum          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | Transmission sequence number  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |        Segment number         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     File's version number     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |        Segment content        |
    |      (file or FEC bytes)      |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1. Format of a packet for the end-to-end, point-to-multipoint, scalable, reliable, and TCP-friendly file transfer architecture.

4.2. Forward error correction code

The FEC should be weak enough just to cover packet losses caused by noise in lines, transient instability in routing protocols, and similar situations but never caused by continuous congestion. A stronger FEC will make the reception process more tolerant to higher packet losses, and will discourage receivers to move to a lower rate multicast group, making them not behave in a TCP-friendly way.

In the Internet, packet loss not caused by congestion is very small (much less than 1% [RFC2001]); therefore the loss of a packet is more likely to signal congestion somewhere in the network between the source and the receiver. A 1 percent packet-loss recovery strategy can be inexpensively implemented using a parity code. For every 99 packets of original data, 1 packet containing a parity check for the 99 packets also has to be sent. Note that FEC does not require detection of the modified data or the location of the lost packet because the packets contain sequence numbers and are protected by link/IPv4/transport checksum. If these 100 packets are consecutively sent, burst packet drops in the network caused by, for example, routing instability may produce more than 1 of the 100 packets to be dropped. In order to minimize the effects of burst packet drops, packets are sent in pseudo-random order which is also changed every time the file is retransmitted.

4.3. Number of multicast groups and their rates

The source decides how many multicast groups, let's say N, will be used. All packets are sent in pseudo-random order through multicast group i (i =0,1,...,N -1), with Internet protocol (IP) multicast address G_i, using the transmission rate R_i ( R_i < R_(i+1) ) as long as there are members in multicast group i. A different pseudo-random order is used for every retransmission of the file in order to destroy the correlation of burst packet losses between each retransmission. For the announcement of relevant information like IP multicast addresses, fastest and slowest transmission rates, and RPs or Cores' location, the approach proposed in session announcement protocol (SAP) [Maher98], static multicast [Ohta98], or simple multicast [Perlman98] can be used.

The maximum transmission rate is determined by the source, considering its own capabilities and access network.

In order to fairly share traffic with TCP, the multicast group to be joined at a given time by a receiver should be a multicast group with the same rate that will be used by a TCP connection between the receiver and the source at that moment. However, the bandwidth cannot be measured by having a TCP connection.

We try to emulate that behavior without using TCP and without any feedback sent from receivers to the source. We consider a wide range of rates in order to satisfy the different needs of individual receivers. The set of rates we propose is given by

R_i = R * 2^i bps, i =0,1,...,N-1

We take as our round-trip time (RTT) estimate the initial assumption of 6 seconds for the maximum RTT in the 4.4BSD-Lite TCP implementation (retransmission timeout for the initial SYN) [Richard], and make the source to send to G_0 only one packet every 6 seconds. For example, a maximum transmission unit of 512 bytes will imply a value for R equal to 683 bps.

4.4. Receiver's behavior

First, a receiver sets the Connected-channel (Cchn) to 0 , the Slow-start-threshold (Ssthresh) to N, and joins the multicast group G_Cchn.

A receiver must restart the Connection-lost timer each time a packet arrives or a new group is joined. We propose to use a value of 60 seconds for the Connection-lost timer. This value is big enough to allow a receiver to join a multicast group and receive the first packet.

A receiver must start the Upgrade timer when it receives the first data packet from the multicast group currently joined. If the receiver is in the Slow-start-phase, we propose a value for the Upgrade timer of 12 seconds. This value ensures that a receiver in the Slow-start-phase will change only from G_0 to G_1 after the reception of two packets with consecutive transmission sequence numbers. If the receiver is in the Congestion-phase, the Upgrade timer is set to 12*2^Ssthresh seconds. This value is used to emulate the behavior of TCP in the congestion avoidance phase and not to delay by an exponential factor the change to a higher rate multicast group.

If Cchn is lower than Ssthresh, the receiver is in the Slow-start-phase. If Cchn is not lower than Ssthresh, the receiver is in the Congestion-phase.

If, while in multicast group G_Cchn ( Cchn <N-1 ), the Upgrade timer times out, the receiver must leave the multicast group G_Cchn, make Cchn = Cchn +1, and join the multicast group G_Cchn.

If, while in multicast group G_Cchn ( Cchn 0 ), a transmission sequence number discontinuity is detected, the receiver must leave the multicast group G_Cchn, make Ssthresh = Cchn-1, Cchn = Cchn -1, and join multicast group G_Cchn. While in G_0, the Upgrade timer is restarted when a transmission sequence number discontinuity is detected, and if there is more than 10 percent packet loss or the Connection-lost timer expires nine consecutive times, the receiver must give up the reception process.

If the Connection-lost timer times out and Cchn 0, the receiver must leave multicast group G_Cchn, make Ssthresh = Cchn-1, Cchn = 0, and join multicast group Cchn.

When a receiver has received enough segments to decode the original file, the file transfer is successfully completed.

If a user decides to abort the process of reception of the file, the segments already received may be cached for the next time the user decides to receive the same file. The source should provide, such as by means of a file's version number included in the coding scheme, some way for receivers to detect if the segments that were stored in a previous time belong to the same file that is going to be received at the present time.

5. Criteria for evaluating reliable multicast transport and application protocols

The Internet Engineering Task Force (IETF) has described in RFC2357 [RFC2357] the procedures and criteria for evaluating reliable multicast transport and application protocols. This informational request for comments (RFC) published in June 1998 recognizes the strong application demand for reliable multicast, pointing to reliable bulk data transfer multicast as one of their most critical requirements.

RFC2357 requires reliable multicast proposals to explain in detail the protocol behavior with respect to scalability, congestion control, error recovery, robustness, security, and privacy.

According to the IETF, two aspects of reliable multicast make standardization particularly challenging. First, the meaning of reliability varies in the context of different applications. Second, if special care is not taken, reliable multicast protocols can cause a particular threat to the operation of today's global Internet.

According to RFC2357, different multicast applications have widely different requirements for reliability. Some examples follow [RFC2357]:

Some applications require that message delivery obey a total ordering while others do not.
Some applications have many or all the members sending data while others have only one data source.
Some applications have replicated data, such as in an n-redundant file store, so that several members are capable of transmitting a data item, while for others all data originate at a single source.
Some applications are restricted to small fixed-membership multicast groups, while other applications need to scale dynamically to thousands or tens of thousands of members (or possibly more).
Some applications have stringent delay requirements, while others do not.
Some applications such as file-transfer are high-bandwidth, while other applications such as interactive collaboration tools are more likely to be bursty but use low bandwidth overall.
Some applications will sometimes trade off less than complete reliability for more timely delivery.

But, as we have discussed in section 1, complete reliability means possibly infinite delay. Once we accept that we have to abandon the delay requirement we have,

packet ordering is unimportant and requirement 1 goes away.
the data can be gathered into a single source and requirement 3 goes away.
requirement 5 is bogus.
burstiness can be smoothed by buffering and requirement 6 goes away.
requirements 3 and 4 should be satisfied by the underlying unreliable multicast protocol, and our architecture does not give additional restrictions on it.
requirement 7 is satisfied by resource reservation and has nothing to do with reliable multicast.

6. Comparison with other proposals

A general design issue regarding Internet protocols is that all designs must scale readily to very many nodes per site and to many millions of sites [RFC1958]. The architecture we have proposed satisfies that general issue.

However, in subrate multiplexing (SRM) [Floyd97], when a host detects a loss, it must send a repair request to the source at some time in the future to request the retransmission of that packet. This feedback between receivers and the source makes the protocol not to scale to a huge number of receivers. The same problem appears also in deterministic timeouts for reliable multicast (DTRM) [Grossglauser97], reliable multicast transport protocol (RMTP) [Paul97], log-based receiver-reliable multicast (LBRRM) [Holbrook95], and tree-based multicast transport protocol (TMTP) [Yavatkar95]. An alternative solution is to multicast requests and repairs with a limited scope by means of the TTL field in the IP header. However, this method for local recovery does not work if the underlying multicast algorithm is based on a unidirectional multicast tree, as is the case of PIM-SM, and what is worse, as the amount of improvement is only of a small constant factor, the proposals do not scale to a huge number of receivers. The same problem appears also in RMTP [Paul97] and TMTP [Yavatkar95].

We believe that in an environment combining a network packet-loss rate of, say, 1 percent, a huge number of receivers simultaneously expecting to complete the reception of a file, new receivers joining at any moment, and other leaving the reception process, there is a high probability that every packet will not arrive to some receiver, so all packets will have to be retransmitted. This makes it unnecessary to have feedback, even as much as the number of data packets, from receivers to the sender, which does not scale.

In DTRM [Grossglauser97] receivers must also send negative acknowledgments (NACKs) to the source, so this proposal does not scale to a huge number of receivers, either.

RMTP [Paul97] is based on a hierarchical structure in which receivers are grouped into local regions or domains and in each domain a special receiver called a designated receiver (DR) is responsible for sending acknowledgments periodically to the sender, for processing acknowledgements from receivers in its domain, and for retransmitting lost packets to the corresponding receivers. Although only the DRs send their acknowledgments to the sender, this only reduces the amount of feedback from receivers to the sender by a constant factor so it does not prevent acknowledgement implosion and does not scale to a huge amount of receivers. The same problems appear in LBRRM [Holbrook95].

RMDP [Rizzo97] can be classified as a hybrid FEC-NACK protocol, and still some feedback is needed from receivers. Vicisano [Vicisano98] further develops this last approach by introducing TCP-like congestion control and removing any feedback. However, the underlying multicast data transfer architecture is such that for a decrease in network packet-loss probability, higher transmission rates are used but for sending more redundancy and not original file data at higher speeds. However, the proper behavior should be the opposite; that is, for a decrease in the packet-loss probability, higher data transmission speeds but with less or at most equal FEC redundancy should be used. RMDP uses a strong FEC and packets are transmitted in a well-known order. In other words, it is a not-so-unreliable real-time multicast protocol.

We have not introduced any synchronization mechanism among receivers, as some proposals do [Vicisano98]. Reliable multicast traffic will be competing with a dynamic and unknown pattern of TCP connections, so even if there is a common bandwidth bottleneck for some group of receivers, at any moment TCP could cause congestion at any router below the bottleneck and break synchronization. Also, as there can be a huge amount of simultaneous receivers expecting to complete the reception of the file, any receiver can join or leave the reception process at any time making the synchronization effort useless. If we assume that the maximum reception rate below a bottleneck is X bps, then without synchronization the maximum assured reception rate for any receiver would be X/2 bps. The next best figure is X bps, but it cannot be assured even when using a synchronization mechanism because some situations, such as TCP traffic causing congestion in any link below the bottleneck, may force some receivers to back off. This will produce packet loss at the bottleneck and will also make all receivers at X bps back off to resynchronize. However, without any synchronization, receivers at X/2 would never back off if their links were not affected.

To provide reliability and congestion avoidance in multicast, some proposals slightly modify the multicast routing protocols at routers, adding some new functions that will be used only for this particular service [Grossglauser97]. Our approach is to use multicast routers as they are. Changing the behavior of routers takes time and is complex and costly, and that change would be justified only if it helps the general and global development of the Internet.

7. Conclusions

In this paper we have proposed an end-to-end, point-to-multipoint, scalable, reliable, and TCP-friendly file transfer architecture. We provide scalability using an underlying scalable multicast protocol and not allowing feedback among receivers from receivers to the sender. Reliability is obtained by the retransmission of all packets through all multicast groups with receivers, and TCP friendliness is achieved by the response of receivers to packet loss and congestion.

We plan to validate and further discuss the different parameters through simulation, and to submit the detailed protocol description as an Internet draft for further discussion inside the IETF.

8. References

[ Claffy98 ] "The Nature of the Beast: Recent Traffic Measurements from an Internet Backbone," K. Claffy, G. Miller, K. Thompson, INET'98, Geneva, Switzerland.

[ Floyd97 ] "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing," S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang. Proceedings of ACM SIGCOMM '95, August 1995.

[ Grossglauser97 ] "Optimal Deterministic Time-outs for Reliable Scalable Multicast," M. Grossglauser, IEEE Journal on Selected Areas in Communication, vol. 15, pp. 442-433, April 1997.

[ Holbrook95 ] "Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation," H.W. Holbrook, S.K. Singhal, D.R. Cheriton. Proceedings of ACM SIGCOMM'95, pp. 328-341, October 1995.

[ Maher98 ] "Session Announcement Protocol: Version 2," M.P. Maher, C. Perkins, Internet-Draft MMUSIC WG, work in progress.

[ Ohta98 ] "Static Multicast," M. Ohta, J. Crowcroft, Internet-Draft, work in progress.

[ Perlman98 ] "Simple Multicast: A Design for Simple, Low-Overhead Multicast," R. Perlman, C-Y Lee, A. Ballardie, J. Crowcroft, Z. Wang, T. Maufer, Internet-Draft, work in progress.

[ Paul97 ] "Reliable Multicast Transport Protocol (RMTP)," S. Paul, K.K. Sabnani, J.C. Lin, S. Bhattacharyya, IEEE Journal on Selected Areas in Communications, vol. 15, no. 3, April 1997, pp. 407-421.

[ RFC1598 ] "Architectural Principles of the Internet," B. Carpenter, Editor, June 1996.

[ RFC2001 ] "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms," W. Stevens.

[ RFC2189 ] "Core Based Trees (CBT version 2) Multicast Routing -- Protocol Specification --," A. Ballardie, RFC 2189, September 1997.

[ RFC2357 ] "IETF Criteria for Evaluating Reliable Multicast Transport and Application Protocols," A. Mankin, A. Romanow, S. Bradner, V. Paxson.

[ RFC2362 ] "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification," D. Estrin et al., RFC 2362, June 1998.

[ Richard ] "TCP/IP Illustrated, Volume 1/2/3," W. Richard Stevens, Addison-Wesley, ISBN 0-201-63346-9/0-201-63354-X/0-201-63495-3.

[ Rizzo97 ] "A Reliable Multicast data Distribution Protocol based on software FEC techniques (RMDP)," L. Rizzo, L. Vicisano, Proc. of the Fourth IEEE HPCS'97 Workshop, Chalkidiki, Greece, June 1997.

[ Vicisano97 ] "One to Many Reliable Bulk-Data Transfer in the Mbone,'' L. Vicisano, J. Crowcroft, in Proc. of the Third International Workshop on High Performance Protocol Architectures ( HIPPARCH '97 ), Up.zip ala, Sweden, 12-13 June 1997.

[ Vicisano98 ] "TCP-like Congestion Control for Layered Multicast Data Transfer," L. Vicisano, L. Rizzo, J. Crowcroft, INFOCOM 1998, March 1998.

[ Yavatkar95 ] "A Reliable Dissemination Protocol for Interactive Collaborative Applications," R. Yavatkar, J. Griffioen, M. Sudan, in Proceedings of ACM Multimedia '95, pp.333-344, 1995.

Scalable and Reliable Multicast File Transfer Architecture

Abstract

Contents