Masataka Ohta <firstname.lastname@example.org>
Tokyo Institute of Technology
Lower layer forwarding is a useful technique for routers to map the L3 next hop to a lower layer label in advance, which allows data to be forwarded by looking only at the lower layer labels, skipping L3 and higher layer header processing. As a result, the performance of routers can be improved by a small but constant factor. Although the technique is most useful for packets that need complex higher layer processing, such as those in an IPv4 RSVP data stream, care must be taken when applying it to packets with less higher layer processing. "Step" is an efficient and scaleable way to use the lower layer forwarding technique for best-effort traffic.
Lower layer forwarding is a useful technique for routers to map, in advance, L3 labels (such as IP [Internet Protocol] addresses) and higher layer labels (such as UDP port numbers) to lower layer labels (such as ATM VPI/VCI or Ethernet MAC addresses), which allows data to be forwarded by looking only at the lower layer labels, skipping L3 and higher layer header processing.
In this paper, such routers are called switching routers. Lower layer refers to layer 1 (physical layer) or 2 (datalink layer), and higher layer refers to layer 3 (internetworking layer), 4 (transport layer), or above. For clarity of argument and terminology, this paper assumes IP (Internet Protocol) to be the only layer 3 protocol, but the argument is equally applicable to any layer 3 technology.
Lower layer forwarding is expected to reduce the load of data forwarding for routers by a small but constant factor. However, the mapping between the lower and the higher layer is not free. Thus, when designing a switching router, the obvious question is when and how to map the lower layer and higher layer labels. If this mapping requires a large effort, any improvement may be negated.
The other requirement for switching router architecture is that processing of lower layer labels must be considerably easier and faster than that of higher layer labels. As such, the first proposal [CSR] concentrated on the easiest case, where the request for special treatment is explicitly made by the end user through RSVP or ST2. In such a case, the higher layer header processing generally involves TCP/UDP port numbers and is complex. Moreover, the end user expects a long-lived data stream that is worth the special advance setup of the mapping between lower and higher layer labels. Finally, an ATM cell has an L1 label, the VPI/VCI (note that the VPI/VCI of an ATM cell varies from physical link to physical link as the cell travels over L2 ATM switches, which is why it is an L1, not L2, label), which has a constant length of 28 (or 24) bits and is easy to forward with small, quick dedicated hardware readily available on the market.
On the other hand, lower layer forwarding of best-effort traffic is the most difficult, because there is no external information that lets us know when to map. In any case, forwarding based on the L3 label is less time consuming than forwarding with the lower layer label. Several proposals have been made [IPS], [TAG] that have scaleability or performance problems when seen from this analysis. Hence, this paper proposes and analyzes the properties of a "step," a simple and straightforward way of best-effort traffic forwarding through switching routers.
On the Internet, routers are usually connected to the next hop routers through physical links. The idea of step is to have virtual links that directly connect routers to the second, third, and n-th next hop routers, where n is a small constant integer. Such a virtual link constitutes a step, and a router reachable by a single step is a next-step router. The intermediate physical links are assumed to support lower layer labels that distinguish the virtual link of a step, and the intermediate routers of a step perform lower layer forwarding of data in the virtual link. Thus, higher layer header processing is performed only once in n hops to decide which next step should be used to forward the packet, and consequently the forwarding performance will be improved. Each router runs an appropriate protocol, discussed in the section Routing Protocol for Stepping, to know over which virtual link the packets should be forwarded.
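As a sketch of the idea (all names and table layouts are hypothetical, not taken from the paper), a step-capable router might consult a small lower layer label table first and perform a full L3 lookup only at step boundaries:

```python
# Hypothetical sketch of next-step forwarding: intermediate routers of a
# step forward on the lower layer label alone; a full L3 lookup happens
# only once per step, at the router that terminates the virtual link.

def forward(packet, label_table, l3_table):
    """Return the (out_interface, out_label) chosen for this packet."""
    label = packet.get("label")
    if label is not None and label in label_table:
        # Fast path: lower layer forwarding, no L3/L4 header processing.
        return label_table[label]
    # Slow path: a full L3 lookup selects the next *step* (an n-hop
    # virtual link to an n-th next hop router), not merely the next hop.
    return l3_table[packet["dst"]]

# Toy tables: label 7 is an in-transit step; destination "10.0.0.0/8"
# begins a new step on interface "if1" with outgoing label 42.
label_table = {7: ("if2", 9)}
l3_table = {"10.0.0.0/8": ("if1", 42)}

print(forward({"label": 7}, label_table, l3_table))  # ('if2', 9)
print(forward({"label": None, "dst": "10.0.0.0/8"},
              label_table, l3_table))                # ('if1', 42)
```

The fast path touches only a fixed-size label, which is the entire point: the cost of the slow path is amortized over the n hops of the step.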
First of all, although several architectures, such as a brouter (bridge router) or VLAN router [VLAN], exist that combine bridge and router capabilities, a switching router simply forwards data beyond the L2 boundary with an L3 or higher layer routing algorithm.
The concepts of brouters, VLANs, and switching routers are orthogonal. As such, it is possible to construct, say, a switching brouter. The first proposal of switching routers appears in [CSR]. Its primary purpose was to show that all the capabilities of cell-relaying ATM, low head-of-line delay, hardware switching, QoS guarantees, and so on, can be retained over IP routers; in other words, all the properties of the CATENET-based Internet can be retained over ATM switches.
At the time the proposal was written, it was believed that the merits of ATM, if any, were lost when end hosts did not belong to the same subnet and there were intermediate routers. Other work [CSR] shows that, just as legacy ATM switches are Q.2931-signaled data link layer switches, it is possible to have a CSR (cell-switching router), an RSVP- or ST2-signaled routing switch. So while it is true that the merits of ATM, if any, are lost when the intermediate routers between ATM subnets are packet routers, CSRs can forward data cell by cell between ATM subnets just as legacy ATM bridging switches do. As methods, such as NHRP, to have shortcuts over large ATM clouds do not scale [SUN], [CSR] is the only way known to the author to scaleably use IP with a large-scale ATM WAN (wide area network).
In [CSR], best-effort packets are relayed packet by packet. Lower layer forwarding of best-effort traffic is not considered, partly because it makes the architecture unnecessarily complex and partly because the author thinks that on the future Internet, where resource reservation will be widely deployed, the amount of best-effort traffic will be less than resource-reserved traffic.
Similar proposals on ATM have appeared in [NEWMAN] and [PARULKAR]. More recently, other proposals of switching routers have appeared [IPS, TAG]. Surprisingly, they concentrate mainly on handling of best-effort traffic and only briefly refer to switching router work with RSVP.
Obviously, it is important to increase the forwarding performance of best-effort traffic by a factor of, say, 2, if we can do it today. But it is bewildering that the technique has gathered so much interest, even leading to the formation of the MPLS working group in the Internet Engineering Task Force (IETF). Although clever use of lower layer forwarding can improve the performance of routers and has its own usefulness, it is not sufficient to affect the Internet's fundamental scaleability. Indeed, unless the entire architecture of lower layer forwarding is carefully designed, it may not scale well at all.
[IPS] tries to automate detection of a flow, persistent traffic to a certain destination, by monitoring IP addresses and TCP/UDP port numbers and assigning a dedicated chain of lower layer forwarding devices to each flow. Although the approach may detect some persistent traffic (85% or more of all traffic is expected to constitute a flow, according to the authors of [IPS]), it does not scale well. That is, on a high-speed backbone link, where forwarding performance is needed most, the number of flows is large, and a large number of mappings must be established over such a link. Note that, with best-effort traffic, an unlimited number of flows may exist. Moreover, considering the rate-adaptive behavior of TCP, as the load on the network increases, the speed of each flow is lowered and the number of concurrent flows increases.
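To make the scaling concern concrete, here is a hedged sketch of [IPS]-style flow detection (the threshold, flow-tuple format, and table limit are all assumptions, not values from [IPS]); note the per-flow state it must keep:

```python
# Sketch of threshold-based flow detection: promote a
# (src, dst, sport, dport) tuple to lower layer forwarding once it has
# been seen `threshold` times.  Every candidate flow costs state, and
# the mapping table (e.g. VCs per ATM interface) is finite.

from collections import defaultdict

class FlowDetector:
    def __init__(self, threshold=10, max_mappings=1024):
        self.counts = defaultdict(int)   # per-flow packet counters
        self.mapped = set()              # flows with a lower layer chain
        self.threshold = threshold
        self.max_mappings = max_mappings

    def packet(self, flow):
        """Count a packet; return True if the flow is (now) switched."""
        if flow in self.mapped:
            return True                  # lower layer fast path
        self.counts[flow] += 1
        if (self.counts[flow] >= self.threshold
                and len(self.mapped) < self.max_mappings):
            self.mapped.add(flow)        # set up a lower layer chain
            return True
        return False                     # hop-by-hop L3 forwarding

d = FlowDetector(threshold=3)
f = ("10.0.0.1", "10.0.0.2", 1234, 80)
print([d.packet(f) for _ in range(4)])   # [False, False, True, True]
```

When the table fills, new flows stay on the slow path, which is exactly the positive-feedback problem discussed below: the fraction of switched traffic drops just as load rises.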
Of course, we can always give up lower layer forwarding and fall back on normal but less efficient hop-by-hop forwarding. However, this entails positive feedback: performance degrades exactly at the time of congestion, just when it is most needed. Thus, we really have to support a large number of flows at once. Furthermore, it is often the case that the size of the mapping table is limited, which means that less of the total traffic can be forwarded efficiently. For example, typical ATM switches today support only 1,024 to 4,096 VCs on each interface. Yet another problem is that the mapping is dynamic and needs the exchange of control packets with periodic refresh.
On the other hand, the [TAG] proposal tries to assign lower layer labels based on routing table entries. Thus, the number of lower layer labels is bounded by the number of routing table entries. When some routing table entry is segregated along the data forwarding path, the lower layer label is no longer useful and L3 labels must be examined for further forwarding. The problem of [TAG] is that the number of top-level routing table entries on the Internet today is about 10^5 and is still increasing.
Route aggregation, of course, is a way to reduce the number of routing table entries. But as the aggregation point is also the segregation point, the more aggressively we aggregate, the more L3 processing becomes necessary. For example, while [TAG] suggests decoupling the interior and the exterior routing, doing so makes the domain border router the point of a lot of route segregation, resulting in severe load concentration. There is also no easy way to estimate how often packets are forwarded without L3 label processing.
[TAG] also contains the misunderstanding that lower layer forwarding can make hierarchical routing easy. [TAG] says:
Tag switching allows the decoupling of interior and exterior routing, so that only tag switches at the border of a domain would be required to maintain routing information provided by exterior routing, while all other switches within the domain would just maintain routing information provided by the domain's interior routing (which is usually significantly smaller than the exterior routing information). This, in turn, reduces the routing load on non-border switches, and shortens routing convergence time.
This is true if a domain has only a single border router or a single default border router, so that interior routers have to maintain only a small interior routing table. But it has nothing to do with lower layer forwarding.
Regardless of whether lower layer forwarding is performed, if an interior router must select the proper border router from multiple candidates, the interior router must have rich information about exterior routing.
[TAG] also proposes stacked tags:

To support this functionality, tag switching allows a packet to carry not one but a set of tags, organized as a stack. A tag switch could either swap the tag at the top of the stack, or pop the stack, or swap the tag and push one or more tags into the stack.
But although some interior routers may push both the intradomain lower layer label for the appropriate border router and the interdomain lower layer label for the destination domain, this does not reduce the L3 processing cost at the border router, the segregation point of the destination domain. Attempts within the source domain to cache the routing information of the destination domains only grow the cache toward the size of a flat, unaggregated routing table and reduce scaleability, and the cache cannot be invalidated even though the routing tables in the destination domains are updated.
Thus, despite the added complexity of stacked lower layer labels and the accompanying increase in processing time, which may cancel all the benefit of lower layer forwarding (note that an IPv4 address is only 4 bytes long), the amount of L3 label processing is reduced only by half.
So, the problems common to [IPS] and [TAG] are that a large number of lower layer labels must coexist on a given link and that the lengths of the lower layer forwarding chains vary, making the gain uneven and hard to predict.

The first problem derives from the existence of many lower layer forwarding chains on the Internet, many of which coexist on a given link. The obvious way to suppress the number of labels is to limit the length of the chain.
In other words, we shouldn't "jump."
Note that this limitation is not applicable to flows with explicit QoS, because it is expected that such flows will reserve a certain amount of bandwidth, which will be a more severe restriction than the number of lower layer labels. For example, over a link of 156 megabits per second (Mbps), at most 2,437 64-Kbps (kilobits per second) reservations can be made. The second problem can be avoided if almost all the chains have the same number of hops. Thus, the step architecture mentioned in Section Brief Overview of "Step" is derived.
Given a constant upper limit n of the hop count of a single step, the scaleability of the step approach is almost obvious. That is, next step forwarding is a purely local operation within the n-hop neighbor of the router and is unaffected by the size of the network. So step forwarding works equally well in both small and large networks. But the efficiency of the step approach depends on n. The larger n is, the more hops are skipped through. The proper value of n depends on several factors and is discussed in the next section.
Suppose that forwarding with lower layer labels is l times faster than that of L3 ones. Then the performance improvement of lower layer forwarding of best effort traffic is, at most, l. Note that l is a small constant, preferably larger than 1 and unlikely to be larger than 10.
It should be noted that over a packetized lower layer, l is unlikely to be smaller than 1. But over ATM, l may actually be smaller than 1 when a packet consists of many, say 1,000, cells, unless cell VPI/VCI processing is 1,000 times faster than packet label processing. If that is the case, the best choice is not to use ATM. Note that the performance bottleneck is at the VPI/VCI mapping stage usually performed at the incoming interface of an ATM switch and is unaffected by parallel switch pipelining of typical ATM switches in the latter stages.
With an n-hop step, (1-1/n) of the traffic is forwarded with lower layer labels and the rest (1/n) with higher layer labels, as higher layer processing is performed only once every n hops. The performance improvement of lower layer forwarding compared with higher layer forwarding alone is:

	1 / (1/n + (1 - 1/n)/l) = nl / (n + l - 1)
For example, if l=3, making n 5 improves forwarding performance by about 2.1. That is, although it is certainly good to make n as large as possible, the return is diminishing. Roughly speaking, it gains little to make n much larger than l. The limiting factor of n is the number of locally available lower layer labels. If each router has d neighbor routers, d*(d-1)^(n-1) channels are necessary to have n-hop steps (n>=2). For example, summing over step lengths 2 through n, to support up to 5-hop steps with 4 neighbor routers, 480 labels are necessary; up to 6-hop steps, 1,452 labels; and up to 7-hop steps, 4,368 labels.
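These figures can be checked with a short computation; the improvement expression below assumes 1/n of the hops pay full L3 cost and the remaining hops pay 1/l of it, and the label totals sum d*(d-1)^(n-1) over step lengths 2 through n:

```python
# Working the numbers from the text: the step improvement factor
# nl/(n+l-1) and the lower layer label budget for n-hop steps.

def improvement(n, l):
    """Speedup vs pure L3 forwarding: 1/n of hops at full L3 cost,
    the remaining (1-1/n) at 1/l of that cost."""
    return 1.0 / (1.0 / n + (1.0 - 1.0 / n) / l)

def labels_up_to(n, d):
    """Lower layer labels needed to support all step lengths 2..n,
    with d neighbor routers: sum of d*(d-1)^(k-1)."""
    return sum(d * (d - 1) ** (k - 1) for k in range(2, n + 1))

print(round(improvement(5, 3), 2))   # 2.14 -- "about 2.1" in the text
print(labels_up_to(5, 4))            # 480
print(labels_up_to(6, 4))            # 1452
print(labels_up_to(7, 4))            # 4368
```

The diminishing return is visible directly: as n grows with l fixed, nl/(n+l-1) approaches l, so pushing n far beyond l buys almost nothing while the label budget grows geometrically.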
Even though it may be possible to develop a complex routing protocol for n-th next hop determination, it is important to use existing mechanisms as much as possible. The simplest way to provide routing for a step is to run the same routing protocol over both the virtual and the physical links, though this simplification increases the amount of routing processing and may make the value of n much smaller than the number of lower layer forwarding labels could support. The metric of a virtual link should equal the sum of the unit physical links that make up the virtual link, and the routing protocol should be modified to prefer the longest link when there are multiple links with equal metric.
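A minimal sketch of such a modified route computation (the toy graph format and the tie-breaking encoding are assumptions of this sketch): Dijkstra over physical and virtual links, where equal-metric candidates are resolved in favor of the path that begins with the longest link:

```python
# Dijkstra variant for stepping: virtual links carry a metric equal to
# the sum of their unit physical links, and among equal-metric paths the
# one starting with the longest (most hops spanned) link is preferred.

import heapq
from itertools import count

def next_step(graph, src, dst):
    """graph[u] = list of (v, metric, span); span is the number of
    physical hops the link covers (1 for a physical link, n for an
    n-hop step).  Returns the first link (neighbor, span) on a best
    path from src to dst."""
    tie = count()                          # unique tiebreaker for heap
    heap = [(0, 0, next(tie), src, None)]  # (cost, -first_span, ...)
    done = set()
    while heap:
        cost, neg_span, _, node, first = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            return first
        for v, metric, span in graph.get(node, []):
            f = first if first is not None else (v, span)
            ns = neg_span if first is not None else -span
            heapq.heappush(heap, (cost + metric, ns, next(tie), v, f))
    return None

# Toy topology: physical chain A-B-C-D (metric 1 each) plus a 3-hop
# virtual link A->D with metric 3; the virtual link must win the tie.
g = {"A": [("B", 1, 1), ("D", 3, 3)],
     "B": [("C", 1, 1)],
     "C": [("D", 1, 1)]}
print(next_step(g, "A", "D"))   # ('D', 3): the 3-hop step is chosen
```

Sorting by (metric, -span) is one way to realize "prefer the longest link among equal metrics" without otherwise disturbing the shortest-path computation.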
To establish an n-hop virtual link over the intermediate routers, some signaling mechanism is necessary. Such a mechanism depends heavily on the hop-local link property, such as how the lower layer labels are implemented. Thus, a lot of effort is required to develop a new signaling protocol for every link type. As the signaling is over IP routers, what we need is an IP-based signaling mechanism. Hence, it's better to reuse an existing mechanism, RSVP, for the signaling of virtual links.
As the virtual link is best effort, all we need is a specification of a "best effort" service class. A possible interpretation is that the absence of a service class in RSVP messages means the "best effort" class. Just as a physical interface for a physical link has a certain amount of best-effort output buffer, dedicated to the interface or shared with other interfaces, to absorb short-term traffic variation, virtual links with "best effort" service also need dedicated or shared buffering. Formally speaking, it is also necessary to have a "wild card" filter spec to indicate that the destination address and ports in the signaled lower layer channel may be anything. But because the channel is at the lower layer and no IP headers are checked within the channel, this specification is a formality. Given the currently standardized "guaranteed" and "controlled load" service classes, additional support for a "best effort" service class is easy.
One way of multicasting with step is to use an n-hop step only when there is no branch point of multicast tree within the next n hops. Then, at the branch point, IP header processing may be performed. But too much IP header processing may be introduced if the multicast tree has a lot of branches.
The other way of multicasting with step is to have n-hop steps for all the possible local branching patterns of the multicast tree. But this approach is not efficient, because the number of next "steps" grows double exponentially in n, so n can't be very large: with each router having D neighbor routers, every router within a step may independently forward toward any subset of its D-1 downstream neighbors, and the number of next "steps" for each physical interface explodes even for small n.
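One plausible way to see the double-exponential growth (the recurrence below is an assumed illustration, not a formula from the text): with k hops remaining, each of the D-1 downstream neighbors is either skipped or handed any (possibly empty) (k-1)-hop sub-pattern, giving Q(k) = (1 + Q(k-1))^(D-1):

```python
# Counting local multicast branching patterns under the assumption that
# each router independently forwards toward any subset of its D-1
# downstream neighbors.  Q(0) = 1 (only the empty continuation).

def patterns(n, D):
    q = 1
    for _ in range(n):
        q = (1 + q) ** (D - 1)
    return q

for n in (1, 2, 3):
    print(n, patterns(n, 4))
# 1 8
# 2 729
# 3 389017000  -- already ~4*10^8 at three hops
```

Whatever the exact counting convention, the growth pattern is the point: enumerating next "steps" per branching pattern is hopeless for multicast, which motivates the per-group forwarding trees discussed next.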
Thus, multicasting with "step" may not be so efficient. Instead, an obvious and efficient way of multicasting is to have a lower layer forwarding tree for individual multicast groups and multicast senders. But, as discussed in the history section, the number of VCs and the amount of state on routers are increased. However, this is not a problem of lower layer forwarding but of multicast in general. Regardless of any attempt to reduce the number of VCs or the amount of L2 state, multicast forwarding requires state information at L3 for each multicast group and multicast sender, and the amounts of state required at L2 and L3 are comparable. Thus, there is no point in trying to reduce the amount of L2 state. In other words, best-effort multicasting consumes the limited resource of the L3 forwarding table and thus effectively requires resource reservation.
Because step forwarding skips L3 label processing in intermediate routers, there is a security issue if firewalls are skipped. If there is not a lot of traffic over the firewall, disallowing any virtual link establishment is the simplest solution. The other possible solution is to have n firewalls cascaded to catch inappropriate traffic on at least one of the firewalls. An intermediate solution is to have shorter "steps" near the firewalls, that is, to have m (<n) firewalls cascaded, each disallowing virtual link establishment longer than m.
It has been shown that next-step forwarding is a scaleable method for lower layer forwarding of best-effort traffic.
The author is grateful for the valuable comments and suggestions of Professor Jon Crowcroft, which improved the quality of the paper.