Scalability of Internet Multicast Protocols
Manolo SOLA <firstname.lastname@example.org>
Current multicast proposals do not scale to provide multicast over the whole Internet. In this paper we propose a multicast protocol that scales well in this environment. This protocol is based on PIM-SM or CBT, and DNS. PIM and CBT are independent of unicast routing hierarchy, and DNS gives a scalable and hierarchical mechanism to find RPs or Cores with the recent extension of dynamic updating and tight security. We also propose a scalable method for multicast address allocation based on DNS.
The multicast model for the Internet is dynamic and location independent. That means that members of a multicast group may join or leave the multicast group dynamically and that multicast is not tied to any place in the Internet. Members must explicitly join a multicast group to receive packets from that group, while send-only members do not need to join multicast groups in order to send packets to the group. These are general assumptions about what multicast must be in the Internet. It is also strongly desired that, if multicast happens to be local and all senders and receivers are confined within a certain geographical area, all data and control packets for the multicast group do not go beyond that area.
The main problem is that current multicast proposals do not follow this model. Intra-domain multicast routing protocols are location dependent because they are designed to run only locally inside domains, and in the case of the inter-domain proposal, a multicast group is tied to a root domain. Despite this, they do not scale to the whole Internet.
Our approach is to analyze the necessary complexity of the multicast model for the Internet and develop a multicast protocol that achieves that complexity. The protocol we propose is based on CBT [Ballardie97] or PIM [Deering97, Estrin97], solving their scalability problems through DNS [Mockapetris87]. With our approach it is not necessary to introduce scopes such as organization local scopes. This property makes it possible to reuse the global database of DNS, the one that is used to uniquely allocate unicast addresses, to allocate multicast addresses also.
Although the paper is described assuming IPv4, the results are also valid for IPv6, and sometimes they are even better because IPv6's multicast address space is huge and as large as that of unicast.
Unicast routing tables can be aggregated by allocating aggregated unicast addresses. This does not mean that multicast routing tables can be aggregated by allocating aggregated multicast addresses.
One of the most important problems related to the Internet protocol was the size of the routing tables; they became too large. If unicast IP addresses are assigned to hosts without any rule, the only way to know which is the next-hop towards a given destination is to explicitly have the full unicast IP address of the destination together with the address of the next-hop in the routing table of all hosts in the Internet. This does not scale.
In order to reduce the size, it is necessary to provide some sort of aggregation for routes. Aggregation means to have a set of 2n continuous unicast IP addresses sharing the same next-hop, so only one entry is needed in the routing table instead of 2n. The entry is formed by 3 items: the prefix of the unicast IP destination address; the prefix length, that is, 32-n; and the next-hop towards the set of destination addresses.
This scheme implies that the way of assigning IP addresses to hosts must be hierarchical. A block of consecutive IP addresses is assigned to each organization so with only one entry in the routing table all the hosts inside the organization can be reached from outside. Organizations at the same level of hierarchy also receive a consecutive block of addresses so aggregation is also possible at the next level, and size of routing tables remains under control.
As the routes advertised outside domains are different from the ones advertised inside domains, different protocols in charge of distributing routing metric information are used to distribute routes inside (i.e., RIP, OSPF) and outside (i.e., BGP-4), although the nature of the information being distributed, the metric, the way of building routing tables, and the forwarding method remain the same.
With multicast everything is different.
A multicast protocol distributes information on how to reach members of a multicast group. This information is organized at each router in some kind of multicast routing table, and based on that information, multicast forwarding is performed. In the case of multicast, the route followed by packets from the source to the destination may not be on the smallest metric path.
Because there can be more than one member in a multicast group, maybe more than one interface will have to be used to forward the packet. The multicast forwarding process has to decide what interfaces the packet must be sent through. In order to do that, the multicast protocol must know about the location of members of that multicast group.
If a host desires to receive packets from a multicast group, it must inform the multicast protocol entity running on that host. It is a matter of the multicast protocol to take the necessary actions and communicate with the rest of the multicast protocol entities running on other hosts in order to finally receive packets for that multicast group.
Multicast forwarding is done based on the multicast IP address. The final objective is to deliver the packet to all the receivers of the multicast group associated to the multicast IP address. But, as receivers can be located anywhere in the Internet, there is no other alternative than have one entry per multicast IP address in the multicast routing table; that is, it is not possible to aggregate multicast IP addresses. This is an intrinsic scalability problem of multicast that cannot be avoided (Fig.1).
In Fig. 1, a continuous block of multicast IP addresses has been assigned to one domain. Although in this simple example packets from sources can be aggregated to the domain owning the set of multicast IP addresses (aggregated multicast addresses can be allocated to a domain), the routing table at router R3 needs to store different forwarding state for each multicast group (multicast routing table cannot be aggregated).
However, there is some work in progress [Estrin97b] in order to dynamically and randomly distribute aggregatable blocks of multicast addresses. Allocation of multicast addresses is a fundamental problem in multicast. In Section 5 we propose a method to distribute multicast addresses with DNS that scales well to the whole Internet.
The important point is that multicast routers must maintain an amount of state proportional to the number of multicast groups running on it. So a reasonable design goal is to have a multicast protocol which scales in that way. However, existing protocols scale worse, which causes other kind of problems.
There are several proposals for multicast protocols currently under discussion and development: DVMRP [Pusateri97], MOSPF [Moy94], CBT, BGMP [Thaler97], and PIM. Some scalability figures for these protocols and DNS are shown in Table 1.
A problem for some intra-domain multicast protocols is that O(G) is too large. A domain typically consists of a great number of organizations, each of which operates many multicast groups within or between organizations. This makes G large even if g is small. As a workaround to the problem, introduction of scopes, such as organization local scope, has been suggested. In this approach, with extra configuration and administration work, it is possible to reduce the effective value of G visible outside organizations to the number of multicast groups between organizations. However, it is just a workaround. There may be a large amount of inter-organization, but still local, multicast groups, especially with mobile hosts.
Organization or domain local multicast addresses make it impossible to use a global database, such as the in-addr.arpa. domain of DNS, to map from multicast addresses to multicast names.
On the other hand, with a multicast protocol having O(g) complexity, it is automatically assured that local multicasts do not affect the scalability at routers or links at a distance.
PIM-SM and CBT have a bootstrapping problem related to how to find the RP or core for a multicast group. The current proposal randomly chose the RP from multiple candidates within a domain. The candidates may be shared by multiple groups. That is, the location of the RP or core cannot be controlled group by group unless C is O(G), which is an unacceptable scalability figure. If location of RP or core is randomly determined within a domain, the strongly desired property of multicast is lost; that is, local multicast may have to use a core or RP at a distance, which generates nonlocal data or control traffic and worsens scalability, as happens with the introduction of source-specific joins to shortcut the RP.
These are the scalability problems of existing intra-domain multicast protocols.
The Grand Unified Multicast (GUM) protocol, or, as it has been called since October 1997, the Border Gateway Multicast Protocol (BGMP) [Thaler97], is a protocol intended to solve the scalability problems of multicast. Its introduction would be natural if we need different multicast protocols for different unicast routing protocols, as with MOSPF for OSPF, but as exemplified by PIM and CBT, multicast protocols can be independent of unicast protocols.
As we have already seen, some intra-domain multicast protocols need a hierarchy of organization local scope. So the introduction of inter-domain multicast protocols to add at least another level of hierarchy can only be rationalized if they can solve the scalability problems.
However, the introduction of GUM/BGMP cannot only reduce the scalability down to O(g) but also makes the scalability problem even worse. Moreover, the unifying approach of GUM makes it possible to use the intersection, not union, of properties of unified components.
GUM/BGMP is basically a complex version of CBT allowing also source-specific unidirectional trees and designed for inter-domain multicast routing. In BGMP some routing inefficiencies due to tunneling can arise when it is combined with other intra-domain multicast routing protocols. If a domain loses connectivity with the root domain for a group, according to the current specification of BGMP, all state related to that group will be deleted and multicast traffic for that group will not be able to be routed among connected domains with senders and registered receivers.
As happens with CBT and PIM, BGMP has only one root domain per multicast group, making the protocol location dependent.
When GUM/BGMP is combined with DVMRP or MOSPF, a packet for a multicast group will be sent to a neighboring domain through all the links between them. This happens because AS boundary routers join groups as wildcard receivers within MOSPF to pull out packets to other neighboring domains, and also because when doing the RPF test at each router of the domain receiving the packets, any AS boundary router can appear as the next hop towards the source. This can be considered a waste of bandwidth in inter-domain links.
Also, some problems appeared in CBT version 2 [Ballardie97] when a CBT domain was a transit domain. As a CBT component interpreted a source-specific join as if it were a join for all multicast group's sources, if that router was not already on the core-tree for that multicast group, then it sent a CBT join message for all sources towards the core for G. As a consequence, this action pulled all data for the group G down to the CBT domain even when no members existed.
Although this problem has been corrected recently in CBT version 3 [Ballardie98], there still are some inefficiencies when a CBT is a transient domain. In that case, the path followed for the multicast traffic goes from the ingress Border Router (BR) for a source to the Core, and then from the Core to the set of BRs that joined that source. However, the shortest path for the transient domain is the one that goes directly from the ingress BR for a source to the set of BRs that joined that source. With CBT, this last path is only possible if the Core for the source's multicast group is placed at the ingress BR for that source. But because there can be more than 2 BRs, and sources for a multicast group can be distributed in a random way throughout the Internet, the same problem is likely to appear for other multicast group's sources and also for other sources of other multicast groups associated to that core.
Another problem comes from the fact that the place and capabilities of the RP/Core are decided by local administrators of each domain, and they don't know the performance requirements of each multicast group. In the next section we show a way to let the group administrator have control over the location and capabilities of the RP/Core.
With respect to PIM, all domain border routers join all RPs with (*,*,RP) state. That means that they will receive all packets from all groups' sources internal to the PIM domain and will inject them appropriately into neighboring domains. If there is more than one link between two domains, that packet may enter the receiving domain by two different border routers, and they will send it to the domain's RP for the group. After receiving the first of these packets, the RP will mark the sending border router as the one allowed to send packets for that source and will send register stop messages to the rest of the border routers.
As a consequence of prunes in the sending domain, only one of the available links between the two domains will finally be used for that group's multicast. PIM with GUM/BGMP presents a bottleneck problem as a consequence of using only one inter-domain link among many. This problem becomes more serious when QoS is needed.
In order to avoid such kind of interoperational problems, it can be argued that, instead of creating a new protocol for the inter-domain case which, as a matter of fact, is a complex version of an existing intra-domain multicast protocol, it is better to build a unique multicast protocol or to adapt an existing one so it could be run both intra- and inter-domains, all over the whole Internet.
Taking PIM or CBT as the common multicast protocol for the whole Internet, we propose a method for finding the location of RPs or Cores with DNS. Furthermore, we also propose a scheme to assign multicast addresses using DNS. Both methods scale well to the whole Internet.
Today's scalability issues of PIM and CBT are related to BSR messages and Candidate RP-advertisements. To scalably advertise the locations of RPs or Cores, we certainly need some hierarchical indexing method. The method should have a certain degree of scalability, security, and dynamic updatability.
By introducing two new RR types, RVP and CORE, DNS can advertise RPs or Cores and multicast addresses as shown in Fig. 2. We give an example only for the RVP resource record; the CORE resource record has the same functionality.
In Fig. 2, the administrator of the bbc.com. domain declares that there is one RP for a multicast group provided by the BBC. The name of the multicast group is bbc.com. and the multicast address that has been assigned to this multicast group is 18.104.22.168. Applications knowing the name of the multicast group will look up the RP and the multicast address through DNS.
This kind of relationship between the owner and the user of the multicast group address is something similar to MX and mail servers. The mail server for a domain (that is, the mail server declared with the MX resource record) can be placed in a different subnet or organization with respect to the name server for the domain.
It is necessary to reverse lookup addresses or to map from multicast addresses to multicast names, for example, to verify that 22.214.171.124 is actually assigned to bbc.com.
Allocation of addresses in general has a very complex political problem. We may be able to say that the problem in the unicast case is now solved. But it is a problem of administration, not of the protocol. DHCP may assign an address to a host but the assignment is meaningful only when an administrator of the DHCP server has been delegated a block of valid unicast addresses. In the case of multicast, we have to develop a policy for multicast address allocation.
Our proposal is to reuse the policy, mechanism, hierarchy, and established delegation of unicast addresses allocation for the policy, mechanism, hierarchy, and delegation of multicast addresses allocation. That is, we use in-addr.arpa. domain and PTR RR (Fig. 3).
In this example, the first byte, 225, indicates that this is a multicast address for rather static DNS-based allocation. An administrator having been delegated the unicast addresses 126.96.36.199 to 188.8.131.52 is automatically delegated the multicast address 184.108.40.206. It should be noted that just as the administrator can delegate 64 unicast addresses further to someone else, he also can delegate the multicast address to anyone else. An ISP owning 4 class C blocks can simultaneously support, through DHCP, 1024 users and 4 simultaneous multicast address requests from users. Of course, it is also possible to reserve larger, up to 16 times more, address space for DNS allocation even with small IPv4 multicast address space. With IPv6 the multicast address space is as large as unicast address space so the assignment scheme does not present any problem.
As DNS continues to scale for unicast address lookup of all the hosts in the Internet, it is assured to have enough scalability for multicast lookup too. As DNS is UDP-based, simple query-and-reply protocol, it is likely to be able to support all the queries of all the people on the earth, including those not yet connected to the Internet.
Using DNS, we can now control the location of the core or RP near to the expected locations of senders. It is the responsibility of the administrator of the multicast group to locate the core or the RP efficiently taking into account the properties of the group. For example, in Fig. 4, the average number of hops from sender S to the receivers through the RP is 6.4 if the RP is located at A, but it is 4 if the RP is located at S. If it is located at C, it is 8.8. So if an administrator thinks that the sender of a multicast group is likely to be located at around S, S should be designated as the RP. If all the hosts are equally likely to be senders, A should be designated as the RP.
However, a single core or RP has an obvious problem of fault tolerance. Moreover, the distribution pattern of senders may not be predicted very precisely.
In PIM-SM only one RP is allowed and the approach taken to reduce the average number of hops among sources and receivers is to change from the shared tree to source-specific trees. This method worsens the scalability of the protocol (Table 1).
Thus, it is desirable to have multiple cores or RPs.
Note that it does not improve the fault tolerance or the efficiency to have a large number of cores or RPs. For the fault tolerance of DNS, having at least 2 name servers is a requirement and 5 is, perhaps, more than enough. So it is for PIM. For the sake of efficiency, if in Fig. 4 we designate B, C, D, and E as RPs for the same multicast group, the average number of hops from S to receivers will be 4.8, if S chose D to send data to the multicast group. This number is close enough to the minimum average of 4. However, even if we designate 13 RPs (A to M), the average number of hops only goes down to 4.4. In general, there is a law of diminishing returns, so having too much RPs is not meaningful.
That is why we propose Multicenter PIM/CBT for better fault tolerance and efficiency assuming multiple but small number of RPs. A CBT specification allowing more than one core was proposed for CBT version 1, but in version 2 this feature was deleted possibly because of loops. This problem has been addressed in OCBT [Shields]. For the rest of the paper, we will concentrate in the case of PIM-SM with no source-specific trees; that is, packets from sources to the RP are sent using unicast encapsulation.
The small numbers of RPs can be designated very easily through DNS as is shown in Fig. 5.
A problem is how senders and receivers choose one RP. We assume that they choose the nearest one with regard to the unicast metric. As now it is possible to have more than one RP, the following situation may appear: a join message for a multicast group directed to a RP can arrive at a router that already has state for that multicast group but for a different RP. As the router can only use one RP for the multicast group, we solve the tie by selecting the nearest RP to the router. Then, the remaining problem is the way RPs communicate with each other.
This can be accomplished in several ways. In this paper we address two approaches. We will explain them using PIM and the example commented upon so far:
In this paper we have proposed a multicast protocol to be applied to the whole Internet. This protocol is based on PIM-SM with several RPs per multicast group, RPs' lookup based on DNS, and an Inter-RP Protocol for forwarding packets among RPs. Instead of PIM, other intra-multicast protocols, as for example CBT, could also be used. We believe that our approach scales well to the whole Internet. We don't think that a solution based on different intra- and inter-domain multicast protocols, as in the case of combining PIM-SM, CBT, and other intra-domain protocols with GUM/BGMP, will work well over the Internet.
We are grateful to Prof. Jon Crowcroft for his advice on current multicast activities. This work was supported in part by the Japanese Ministry of Education, Science, Sports and Culture (MONBUSHO) and by the Research for the Future Program of the Japan Society for the Promotion of Science under the Project "Integrated Network Architecture for Advanced Multimedia Application Systems" (JSPS-RFTF97R16301).
[Ballardie97] A. Ballardie, "Core Based Trees (CBT version 2) Multicast Routing -- Protocol Specification --," RFC 2189, September 1997.
[Ballardie98] A. Ballardie, "Core Based Trees (CBT version 3) Multicast Routing -- Protocol Specification --," Work in Progress, March 1998.
[Deering97] S. Deering et al. "Protocol Independent Multicast Version 2, Dense Mode Specification," Work in Progress, May 1997.
[Eastlake97] D. Eastlake, C. Kaufman, "Domain Name System Security Extensions," RFC 2065, January 1997.
[Estrin97] D. Estrin et al. "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification," Work in Progress, September 1997.
[Estrin97b] D. Estrin, M. Handley, and D. Thaler, "Multicast-Address-Set advertisement and Claim mechanism," Work in Progress, November 1997.
[Mockapetris87] P. Mockapetris, "Domain Names: Concepts and Facilities," STD 13, RFC 1034 (and RFC 1035), November 1987.
[Moy94] J. Moy, "Multicast Extensions to OSPF," RFC 1584, March 1994.
[Pusateri97] T. Pusateri, "Distance Vector Multicast Routing Protocol," Work in Progress, March 1998.
[Shields] C. Shields, J. J. Garcia-Luna-Aceves, "The Ordered Core Based Tree Protocol," in Proc. of IEEE Infocom97, Kobe, Japan, April 1997.
[Thaler97] D. Thaler, D. Estrin, and D. Meyer, "Border Gateway Multicast Protocol (BGMP): Protocol Specification," Work in Progress, March 1998.
[Vixie97] P. Vixie, S. Thomson, Y. Rekhter, J. Bound, "Dynamic Updates in the Domain Name System (DNS UPDATE)," RFC 2136, April 1997.