[INET'99]

Deterministic IP Table Lookup at Wire Speed

Andrey BELENKIY <abelenki@notes.cc.bellcore.com>
Bellcore / New Jersey Institute of Technology, USA

Necdet UZUN <uzun@oak.njit.edu>
New Jersey Institute of Technology, USA

Abstract

Internet address lookup is one of the limiting factors that keep the speed of Internet Protocol (IP) routers down. Traditional algorithms used for address lookup, such as the Patricia trie, the Berkeley Software Distribution (BSD) tree, and content-addressable memory lookup, do not meet ever-increasing speed demands and do not scale well. Migration to IPv6, with its 128-bit addresses, will only make the problem worse and more apparent.

In this paper, we introduce a novel approach to both unicast and multicast address lookup. Whereas most recently proposed forwarding algorithms concentrate on the average and best cases and have unpredictable performance in the worst case, our algorithm guarantees a definite worst case that complies with "wire-speed routing," meaning that lookup can be performed while the IP packet is transmitted from the input interface to the output interface of a router, without buffering for the purpose of route lookup. Unlike many other algorithms, in which insertion and deletion of entries are overlooked, ours also has deterministic insertion and deletion times, which allows timely updates of the routing table.


1. Introduction

1.1. Background

Digital search techniques

One of the most common approaches to address table lookup is the Patricia trie, a special case of a digital search tree. The idea behind this data structure is essentially that of a binary search tree: depending on the value of a particular bit in the search string, the search takes the left or the right subtree. With some modifications, the order of complexity may be slightly reduced [10], yet none of those modifications produce a significant improvement over O(n), where n is the number of bits in the search string: 32 for an Internet Protocol version 4 (IPv4) address and 128 for an IPv6 address.
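A minimal sketch of such a bit-at-a-time digital search tree is given below. The node layout and the route values are illustrative assumptions, not the structure of any particular implementation cited in the text:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node of a binary digital search tree: each lookup step
 * tests one bit of the 32-bit search string, so a lookup costs O(32). */
struct bt_node {
    struct bt_node *child[2];   /* child[0] taken on bit 0, child[1] on bit 1 */
    int port;                   /* route (output port), -1 if none */
};

/* Walk the tree from the most significant bit, remembering the last
 * valid route seen (longest-prefix semantics). */
static int bt_lookup(const struct bt_node *root, uint32_t da)
{
    int best = -1;
    for (int bit = 31; root != NULL; bit--) {
        if (root->port >= 0)
            best = root->port;
        if (bit < 0)
            break;
        root = root->child[(da >> bit) & 1u];
    }
    return best;
}
```

Because every step consumes one bit, a lookup can cost up to 32 (or, for IPv6, 128) node visits, which is exactly the O(n) behavior criticized above.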

Content-addressable memory-based searches

Content-addressable memory (CAM) is a perfect solution for a really small search space. These special-purpose chips are fast and reliable [5,6,9]. However, this solution is very expensive and not at all scalable, which is why even IPv4 routers use different approaches. Because an IPv6 address is 128 bits wide, it will further reduce the number of entries per CAM chip. Clearly, this solution is not suitable for the purposes of IPv6 address lookup.

Hashing

The idea behind hashing is to compress a long search string to a shorter one, for example, a 32-bit IP destination address may be compressed to an 8- to 10-bit string, thus making the table small enough to search very fast. There must be some kind of software process running in the background that performs the compression, insertion, and removal of the entries of the table. The obvious problem is that the number of the entries that the table can accommodate is rather small. Also, once this number exceeds a certain point, the hashing has to go to its second and third choices, which degrade the performance.
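As an illustration only (the compression function and its multiplier are our own assumptions, not prescribed by the text), a 32-bit DA can be folded into a 10-bit table index like this:

```c
#include <stdint.h>

/* Illustrative compression of a 32-bit destination address into a
 * 10-bit table index via a multiply-shift hash; the constant is an
 * arbitrary choice for the sketch. */
static unsigned hash10(uint32_t da)
{
    return (uint32_t)(da * 2654435761u) >> 22;   /* keep the top 10 bits */
}
```

The difficulty described above is precisely that distinct addresses may collide in the 10-bit index, forcing the scheme to fall back to its second and third choices.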

Accurately implemented hashing can give excellent results in most of the cases; however, none of the hashing algorithms can guarantee such performance in the worst case [4].

Frequent insertion and removal of entries with hashing can force recalculation of the hash function, which degrades the overall performance of the algorithm [10].

Direct memory lookups

Algorithms based on direct memory lookups are fast and can guarantee the worst case [3]. Memory today is relatively cheap, and with some intelligent hierarchy, very high speeds can be achieved with moderate memory requirements. However, these algorithms are only a temporary solution, because direct lookup will not be feasible for IPv6 addresses. Insertion and removal can be very expensive, especially in the case of a short prefix: such an operation may require thousands of memory cycles.

1.2. Problem definition

Forwarding is the decision, which a router has to make for every packet, of which interface a given packet should be forwarded to, based on its destination address (DA) field.

An important characteristic of IP forwarding algorithms is that they have to be able to forward packets based on the longest matching prefix. The forwarding database contains the route information for every valid prefix. Valid prefixes are usually shorter than the full IP destination address. If there is no exact match for the DA being looked up in the route table, the route for the longest prefix contained in the DA must be used. To illustrate the concept of the longest matching prefix, assume that the forwarding database contains four entries: "0010," "10," "101," and "10110." Suppose also that we are looking up the DA "101000." In this case, the routing information associated with the entry "101" will be used. "10" is also a valid prefix of "101000"; however, "101" is longer. This issue is addressed in detail in Section 2.2.
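The example above can be checked with a naive linear scan over the table. The table contents are taken from the text; the function itself is only an illustrative baseline, not the paper's algorithm:

```c
#include <stddef.h>
#include <string.h>

/* Naive longest-matching-prefix scan over a small table of prefixes
 * written as bit strings, mirroring the example in the text. */
static const char *lpm(const char *table[], size_t n, const char *da)
{
    const char *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(table[i]);
        if (strncmp(da, table[i], len) == 0 && len > best_len) {
            best = table[i];
            best_len = len;
        }
    }
    return best;   /* NULL if no prefix of da is in the table */
}
```

A production router cannot afford this O(table size) scan per packet, which is what motivates the trie-based scheme of Section 2.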

The constraints of a table lookup algorithm are

The rest of this paper is organized as follows: Section 2 describes the algorithm itself. Specifically, Section 2.1 describes the basic algorithm; Section 2.2 describes the enhancement that allows the trie to handle prefixes of variable length; Section 2.3 describes another enhancement that allows the trie to handle multicast; and Section 2.4 mentions several ways to improve the performance of the algorithm. In Section 3, the worst-case performance analysis is carried out. In Section 4, some implementation details are discussed; unfortunately, a complete implementation description is beyond the scope of this paper. The conclusion is presented in Section 5.

2. Algorithm description

It is important to keep in mind that the algorithm described here is designed to be easily realized in hardware. Prototype entities have already been written in VHDL and simulated [2]. The algorithm does not contain any complex computational functions, only data transfers and bit-wise logical operations. Even though this paper is oriented toward IP routing, the algorithm may be used for other purposes, such as label-to-flow mapping, address filtering, and so forth.

2.1. Basic algorithm

The algorithm uses the concepts of segment and offset. (Refer to Figures 2 and 3 for easy follow-up.) Physical memory is divided into segments, which can be uniquely addressed. For the rest of this paper, the terms segment and node are used interchangeably. Each of those segments has an identical number of words, and each word within any given segment can be identified by an offset. The destination address (DA) is what is called the search string. The task of the algorithm is to find the best possible route associated with a search string in the most efficient way. In this case, the desired information is the port address of the router.

The search string, in this case the DA, is divided into equal substrings of K bits each. Each of them will become an offset during the search. For this basic version of the algorithm, every node will contain 2^K words in order to accommodate every possible combination of a K-bit substring. As shown in Figure 1, each word contains the following fields: a forward pointer (FP) to the node of the next level, a valid flag (V), a stop flag (S), the route information (R), and a child counter (CC).


Figure 1: Representation of the node

Figure 1 shows the schematics of the nodes in conventional memory representation, with words aligned horizontally on the left. For convenience, the nodes are sometimes shown rotated by 90°, as in the right side of Figure 1. Also, addresses may be omitted, although the leftmost vertical word always has an offset of 0. Some figures may not show some fields at all. Those figures illustrate the concepts rather than detailed examples. The node presented here assumes only unicast traffic. It is not practical to store multicast patterns in every node. Handling of multicast traffic is discussed in Section 2.3.

The following procedure details a route search.

Address at level n = FP at level n - 1 & Substring n (1)

In equation 1, "&" signifies concatenation, and the "Forward Pointer at level n - 1" for the root level is 0.

In other words, the incoming DA is split into several substrings. Each of these substrings becomes an offset in the appropriate node. The address of the first node is always fixed at 0. By concatenating 0 with the first substring of the DA, the address of the first visited entry in memory is obtained. The address of the node of the next level is stored in the FP field, so the address of a single entry can be obtained by concatenating the content of the FP from the previous level with the second substring. Again, an FP is obtained by reading this memory location, and concatenating it with the third substring gives the address of the word belonging to the node of the second level. (The root node is thought of as level 0.) When the search reads a word in memory, it checks all the flags. If V is "1," then R is stored in an internal register, so that at any point the last valid R is available. The process continues until S = "1" is encountered. At this point, the search knows that it has to stop, and the R field at this location is used to forward the packet if V = "1." If V = "0," the last valid R stored in the internal register is used as the forwarding address.

It is essential that the DA is divided into substrings of the same length. Variable-length substrings are also possible; however, keeping them constant makes memory management very simple. Insertion and deletion of nodes is discussed later. The concept of "search" is illustrated by the following example.

Figure 2 is a sample search trie. A search string of six bits is used in this trie. Figure 3 is an actual physical representation of the trie. In other words, it shows the physical memory, which contains the trie. The trie in Figure 2 has three levels. Level 0 consists of only one node, the root node. Level 1 is an intermediate level. Level 2 is a leaf level. The shaded words in Figure 2 contain a valid R field. For simplicity, Figure 3 shows only the valid FP fields. To illustrate the search, other fields are left blank because they are not essential.


Figure 2: Schematics of the sample trie

Assume that the entry "001101 → a" was added to the trie. Sometime later, the router receives a packet with DA "001101." This address is divided into three 2-bit substrings. The algorithm concatenates segment "0000," for the root, with the offset, which is the first 2-bit substring of the address, "00." Then it performs a read operation at memory address "000000" (see Figure 3). Examining the flags of this entry, the search saves the R field in the internal register, because V = "1," and proceeds with the search to the lower levels of the trie, because S = "0." The FP field of the word at this location contains "0001." This segment address is concatenated with the second 2-bit substring to create the address "000111." Memory is read at this address, and an FP of "1001" is obtained. Finally, it is concatenated with the last 2-bit substring, "01," to get a memory address of "100101." When memory is read at this address, the algorithm encounters S = "1" and V = "1." It looks at the R field to find the value "a" stored there. Then it outputs "a" as the result of this search.

Assume that now a packet with the DA "001111" comes into the router. The search procedure is identical to the one described in the previous paragraph except for the last offset. Instead of looking at the address "100101," this time the search looks at the address "100111." After this entry is examined, the flags V = "0" and S = "1" are obtained. This combination means that the search has to stop now and that no valid R was obtained on the last level. The value of R from the memory address "000000," which was stored in the internal register, is output as the result, ending the search procedure.


Figure 3: Trie representation in physical memory

This simple example illustrates the search procedure. We can see that the number of memory cycles required to perform the search operation is proportional to the number of substrings into which the address is divided. However, making the number of required memory cycles small results in large substrings and substantial memory waste. Complete analysis of this trade-off is carried out in Section 3.
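The search procedure illustrated above can be sketched in C. The field widths, segment numbers, and route identifiers below are illustrative assumptions; the memory image in the accompanying test mirrors the "001101 → a" example of Figures 2 and 3, with "a" represented as the integer 10 and the root's fallback route as 5:

```c
#include <stdint.h>

/* One word of trie memory, following the fields used in the text:
 * fp = forward pointer (segment of the next-level node),
 * v  = valid-route flag, s = stop flag, r = route (port). */
struct word {
    uint16_t fp;
    uint8_t  v, s;
    int      r;
};

#define K      2                /* substring width in bits */
#define NWORDS (1u << K)        /* 2^K words per node */

/* Search: concatenate the current segment with each K-bit substring
 * of the destination address, remembering the last valid route. */
static int trie_search(const struct word mem[], uint32_t da, int nsub)
{
    uint32_t seg = 0;           /* the root node is segment 0 */
    int last_valid = -1;
    for (int i = nsub - 1; i >= 0; i--) {
        uint32_t off = (da >> (i * K)) & (NWORDS - 1);
        const struct word *w = &mem[seg * NWORDS + off];
        if (w->v)
            last_valid = w->r;  /* keep the last valid R available */
        if (w->s)
            return w->v ? w->r : last_valid;
        seg = w->fp;            /* descend to the next level */
    }
    return last_valid;
}
```

The loop performs exactly one memory read per level, which is why the search cost is proportional to the number of substrings into which the address is divided.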

To look at insertion and removal of nodes, the issue of memory management should be addressed. Before any routes are learned, the memory used for this trie is empty. As the router learns new routes, it builds the trie, and as routes become obsolete, the router has to remove some nodes. Initially, the whole memory is divided into segments the size of a node. The first word of each segment, the word with an offset of 0, contains the address of the next node. The available memory is thus organized in a linked list. For the rest of this paper, this list is called the idle linked list (ILL), because it contains only empty nodes, which do not carry any routing information. To easily identify the beginning and the end of this list, two registers are used: the head pointer (HP) contains the address of the first location on the ILL, and the tail pointer (TP) contains the address of the last location on the ILL. Functionally, the ILL can be thought of as a queue. Nodes that are removed from the search trie are appended at the tail, and when new nodes need to be added to the search trie, this queue is served with a node from its head.
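A compact model of the ILL as a queue threaded through word 0 of each free segment follows. The segment count is an arbitrary illustrative value:

```c
#include <stdint.h>

#define NSEG 8                  /* illustrative memory of 8 segments */

/* The idle linked list (ILL) threads empty segments together through
 * word 0 of each segment; hp and tp model the head and tail registers. */
struct ill {
    uint16_t next[NSEG];        /* stand-in for word 0 of each segment */
    uint16_t hp, tp;
};

static void ill_init(struct ill *q)
{
    for (uint16_t s = 0; s + 1 < NSEG; s++)
        q->next[s] = s + 1;     /* chain every segment into the list */
    q->hp = 0;
    q->tp = NSEG - 1;
}

/* Serve a node from the head when the trie needs a new segment. */
static uint16_t ill_pop(struct ill *q)
{
    uint16_t seg = q->hp;
    q->hp = q->next[seg];
    return seg;
}

/* Append a segment removed from the trie at the tail. */
static void ill_push(struct ill *q, uint16_t seg)
{
    q->next[q->tp] = seg;
    q->tp = seg;
}
```

Both operations touch a constant number of words, which is part of what keeps insertion and deletion deterministic.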

Insertion, or updating, of an entry in the trie can have two cases. (Note: every example described here assumes the original trie depicted in Figure 2, not reflecting changes made to it by a previous modification.) The easier case is the one in which no new nodes need to be added. In Figure 2, if the entry "001110 → b" is to be added, the algorithm performs a search for this entry until it encounters S = "1" on its path. Starting with the next level, new nodes would need to be added; in this case, however, the algorithm encounters S = "1" only on the last level. All it needs to do is change V to "1," insert "b" into the R field at memory location "100110," and then increment the CC field of the node at location "100100" from 2 to 3. Here is another example of the addition of an entry that does not require introducing a new node into the trie. Consider adding an entry "01 → c." The algorithm encounters S = "1" on the root level of the trie; however, this is as far as it needs to go. The only modification introduced to the trie is changing V to "1" and inserting "c" in the R field at memory location "000001." The CC field of the node is then incremented from 2 to 3.

If the entry "100101 → d" needs to be added to the trie, the sequence of steps taken by the algorithm becomes more involved. Examining location "000010" in the root node, the algorithm encounters S = "1." Because it needs to proceed further down the trie, but there are no nodes along the desired path, the node addressed by the HP is removed from the ILL and added to the trie. The FP at location "000010" gets the segment of the current HP, "100000," which is "1000." The HP gets the value of the FP of the location where it currently points, concatenated with a 0 offset; thus, it assumes the value "101000." Notice that the HP loses its original value, but the node is not lost, because it is now pointed to by the FP of location "000010." Insertion of the new node changes the S bit at location "000010" from "1" to "0," and as a consequence, the CC of the root, at location "000000," is incremented by 1 and becomes 3. Having obtained a node, the insertion operation proceeds to search it, but again encounters S = "1," at location "100001." It needs to go further down the trie, and to do this, it must get another node from the ILL, in a manner similar to what was just described: the FP of location "100001" gets the segment value of the current HP, namely "1010," and then the HP assumes the value of the FP of the word where it currently points, concatenated with a 0 offset, thus assuming the value "001100." This operation triggers a change of the S flag at location "100001" from "1" to "0" and an increment of the CC count of this node, at location "100000," from 0 to 1. The insertion operation proceeds to the newly added node. At this point, the only changes to be made are to change V to "1," to put "d" in the R field of location "001101," and to reflect this in the CC field of the node by incrementing it from 0 to 1. Note that the insertions of "b" and "c" are special cases of the general procedure illustrated by the insertion of "d."

Removal of an entry is the exact opposite of addition. The entry can be removed from the search trie, yet the node containing this entry may or may not be removed. For example, removing "001101 → a" is accomplished by making V = "0" and decrementing the CC of this node, at location "100100," to 1. The node "1001" may not be deleted, because it also contains a valid entry at offset "00." If that entry is also removed, then the node "1001" will have no valid entries, and because it is a leaf node, it will not have any children; these are the conditions for removing a node and appending it to the ILL. The deletion process then has to decrement the CC of node "0001," and because that CC is still greater than 0 after it has been decremented, the removal process stops there. To perform deletion correctly, a deletion stack needs to be used. Basically, all the addresses that the search traverses are written to this stack. After the search, the addresses are popped one by one, and the nodes containing them are examined. The removal process stops when it encounters the first node with CC ≠ 0, which needs to be kept.
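The pop-and-check loop over the deletion stack can be sketched as follows; cc[] stands in for the CC fields of the traversed nodes, and the segment numbers in the test are illustrative:

```c
#include <stdint.h>

/* Deletion-stack sketch: the search pushes every traversed segment,
 * and removal pops leaf-to-root, decrementing CC at each step and
 * stopping at the first node whose CC is still nonzero. Returns the
 * number of nodes that would be freed back to the ILL. */
static int delete_path(uint8_t cc[], const uint16_t stack[], int depth)
{
    int freed = 0;
    for (int i = depth - 1; i >= 0; i--) {
        uint16_t seg = stack[i];
        if (--cc[seg] > 0)
            break;              /* this node must be kept */
        freed++;                /* this node goes back to the ILL */
    }
    return freed;
}
```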

Note that the techniques described so far are very similar to the ones described in references [1], [11], and [7]. The rest of the paper presents unique techniques to handle sub-prefixes, whose lengths are not multiples of K bits, and multicast.

2.2. Sub-prefix enhancement

A serious drawback of the scheme described in Section 2.1 is its inability to handle prefixes whose lengths are not multiples of K. These are called sub-prefixes for the rest of the paper. In the example of Figure 2, K = 2, which means that if we want to insert a prefix of length 3 -- for example, "101 → g" -- we would have to insert two entries into the trie: "1010 → g" and "1011 → g." The search would handle this situation well. The problem arises in the following situation: a longer prefix is inserted into the trie and is then deleted from it. Projecting this onto our example, "1010 → h" is inserted into the trie, and shortly after that, it is removed. Clearly, the information stored in the corresponding memory location is invalidated; however, there is no way to restore the information about the "101" prefix, so it will be lost. (Refer to [2] for a detailed explanation of this scheme.) In this section, an enhancement enabling the trie to handle a sub-prefix of any size is introduced.


Figure 4: Variable prefix enhancement

The best way to explain this enhancement is by example. (Refer to Figure 4 for the following discussion.) The first necessary observation about sub-prefixes is that their number is always two less than the number of offsets (or words) in the node, 2^K. In the case where K = 4, the number of words in the node is 16, and the possible sub-prefixes are "0," "1," "00," "01," "10," "11," "000," "001," "010," "011," "100," "101," "110," and "111." There are only 14 of them, which allows them to be distributed among the 16 words. It is desirable to distribute the sub-prefixes in some orderly fashion and, if possible, with a high correlation to full prefixes. To determine the offset of a sub-prefix, the following rule applies: a 3-bit sub-prefix & "1," a 2-bit sub-prefix & "10," and a 1-bit sub-prefix & "100," where "&" stands for concatenation, constitute the offset of the word in the node where the route for the sub-prefix is to be found. The unused locations of the CC field can be used for this purpose. This field is renamed the route for sub-prefix (RS). This modification remedies the waste of memory caused by the fact that only one CC field per node carried useful information. A sub-prefix is now inserted into the appropriate RS field of the last node encountered by the insertion operation, the node at which the number of remaining prefix bits is not a multiple of K. In Figure 4, the sub-prefix "0110 010 → k" was inserted into the trie. In that instance, the following procedure was used. At the root level, there were enough bits for a full offset; on level 1, however, only 3 out of 4 bits were present. This means that a sub-prefix insertion was performed. Thus, applying the rule stated above for a 3-bit sub-prefix, it was concatenated with "1" to obtain "0101," so the RS field of that entry gets "k." Two new fields are added to the node for sub-prefix support: the full prefix flag (F), which indicates that the R field holds routing information for a full prefix rather than for a sub-prefix, and the valid sub-prefix pattern (VSP), in which each bit indicates that a sub-prefix of the corresponding length covers this word.
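The offset rule just stated reduces to a single shift-and-OR; the values in the accompanying test follow the K = 4 examples in the text ("010" & "1" = "0101," "11" & "10" = "1110," "1" & "100" = "1100"):

```c
#include <stdint.h>

/* Offset rule for sub-prefixes: append a single "1" followed by
 * zeros to pad the sub-prefix out to a full K bits, i.e. for K = 4:
 * a 3-bit sub-prefix & "1", a 2-bit one & "10", a 1-bit one & "100". */
static unsigned subprefix_offset(unsigned prefix, int len, int k)
{
    /* prefix holds `len` bits; append "1", then (k - len - 1) zeros */
    return (prefix << (k - len)) | (1u << (k - len - 1));
}
```

The rule leaves exactly two words per node with no sub-prefix mapped to them, matching the count of 2^K - 2 possible sub-prefixes.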

The search procedure for the enhanced trie remains the same. Some modifications to insertion and deletion of the entries have to be introduced. Those modifications are discussed next.

Insertion of a sub-prefix is carried out the same way as in the basic algorithm discussed in Section 2.1, until the last node is reached. For this last node, there are not enough bits to perform a full-prefix insertion; therefore, a sub-prefix insertion is done. First, the offset of the entry is determined by the rule given above, by concatenating the appropriate bits to the sub-prefix. The routing information is written to the RS field of the determined location. Then, depending on the length of the sub-prefix and its value, the insertion procedure affects several other entries of this node. Namely, it is necessary to expand the sub-prefix into all possible full prefixes. For example, the sub-prefix "01" is expanded into "0100," "0101," "0110," and "0111." Notice that after expansion, the offsets are sequential; in hardware, this may be implemented simply by concatenating the values of a counter to the sub-prefix. Also, because this sub-prefix has a length of 2 bits, the second bit of the VSP fields is affected. The insertion procedure must go to every relevant offset and perform a logical OR operation on the VSP field and a binary number in which all bits are "0" except for the one corresponding to the length of the sub-prefix. In the case of sub-prefix "01," the VSP in every affected word is OR'ed with "010" (second bit asserted). Then the insertion procedure has to decide whether it needs to replace the R field. If V = "0," then the routing information associated with the inserted sub-prefix is stored in R, the V flag is asserted to "1," and CC is updated depending on the S flag of this offset: if S was "0," then this word has already been accounted for and CC remains unchanged; if S is "1," then the corresponding word has not been accounted for yet, and CC is incremented. If V of the given offset was "1," then the R field is updated only if the F flag is "0" and no bits for longer sub-prefixes are asserted in the VSP.
In other words, insertion of a sub-prefix only changes routing information if it replaces the routing information of a shorter sub-prefix or of a sub-prefix of the same length; routing information of a full prefix, or of longer sub-prefixes, is not replaced. In the case of "01," R receives the information associated with the sub-prefix only when V = "0," or when V = "1," F = "0," and the VSP bit for the 3-bit sub-prefix is not asserted (e.g., VSP = "100" or "010").

The following example illustrates this process in detail. The entry "0110100011 → r" is to be inserted into the trie of Figure 4. The steps taken on level 0 and on level 1 are identical to the ones described for the basic algorithm in Section 2.1. A detailed procedure for the steps taking place on level 2 is provided next. Having obtained the pointer S2, the insertion process looks at the last substring of the search string, "11." It realizes that a sub-prefix insertion has to be performed on this node. Because "11" is a 2-bit sub-prefix, it is concatenated with "10." The RS field at offset "1110" of the level 2 node gets the value "r." Then all the potentially affected entries, "1100," "1101," "1110," and "1111," have to be considered. The VSP at offset "1100" is OR'ed with "010" and becomes "110." Next, the flags of this word are examined. First, it is observed that the word contains a valid R, since V = "1." Next, the fact that the F flag is "0" tells the insertion process that the entry contains routing information for a sub-prefix. Now it has to determine, by looking at the VSP, whether the current information in R is for a sub-prefix longer than "11." The bit of the VSP associated with the 3-bit sub-prefix is "0," so all the conditions for updating R have been met. R at location "S2 1100" gets the value "r." The insertion moves on to offset "1101." There it changes the VSP to "110," and following the identical line of reasoning as for the previous word, "r" is inserted in the R field. At offset "1110," the VSP pattern takes the value "111." Then the flags of this entry are examined, and it is observed that the R field contains routing information for a full prefix; R remains unchanged. At offset "1111," the VSP takes the value "111." Examining the flags, the insertion procedure learns that valid routing information for a sub-prefix is stored in the R field. The VSP shows that this routing information is associated with the 3-bit sub-prefix, so it is not altered.
This concludes the insertion of "0110100011 → r." Notice that CC was not changed during this insertion, because before the insertion occurred, all the words encountered had the V flag set to "1."

Deletion of a sub-prefix works in the following way. The sub-prefix is expanded to its full prefixes, and all of the entries addressed by these prefixes are examined. First, the VSP pattern is changed: the bit corresponding to the length of the sub-prefix is deasserted by performing a logical AND operation with the one's complement of the binary number that has all bits "0" except for the one corresponding to that length. To clarify, if we remove the sub-prefix "10," then all the VSP fields associated with the full prefixes obtained by expanding "10" are AND'ed with the one's complement of "010," which is "101." In simple terms, the deletion operation makes the second bit of every such VSP "0." Having done that, the deletion operation examines the F flag, and if it is "1," it moves on to the next word. If it is "0," however, the deletion operation has to look at the VSP once again. If any of the bits corresponding to longer sub-prefixes are asserted, the deletion procedure moves on to the next word. If any of the bits corresponding to shorter sub-prefixes are asserted, it has to determine the longest such sub-prefix and replace R with the RS field of the corresponding entry. If no bits of the VSP are "1," the deletion operation has to invalidate R by changing V to "0" and update CC, depending on the S flag of this entry. After that, the deletion moves on to the next word.
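The expansion and VSP bookkeeping can be sketched as below. The bit ordering (the 1-bit sub-prefix in the high-order position, so that for K = 4 a 2-bit sub-prefix maps to "010" and a 3-bit one to "001") is inferred from the worked examples in the text and should be treated as an assumption:

```c
#include <stdint.h>

/* Expand a sub-prefix of `len` bits (len < k) into its 2^(k-len)
 * sequential full-prefix offsets and OR the VSP bit for that length
 * into each affected word, as the insertion procedure does. */
static void vsp_mark(uint8_t vsp[], unsigned prefix, int len, int k)
{
    unsigned first = prefix << (k - len);   /* expanded offsets are sequential */
    unsigned count = 1u << (k - len);
    uint8_t bit = (uint8_t)(1u << (k - 1 - len)); /* "010" for a 2-bit sub-prefix */
    for (unsigned i = 0; i < count; i++)
        vsp[first + i] |= bit;
}

/* Deletion clears the same bit with the one's complement mask. */
static void vsp_unmark(uint8_t vsp[], unsigned prefix, int len, int k)
{
    unsigned first = prefix << (k - len);
    unsigned count = 1u << (k - len);
    uint8_t mask = (uint8_t)~(1u << (k - 1 - len));
    for (unsigned i = 0; i < count; i++)
        vsp[first + i] &= mask;
}
```

Because each word's VSP records every sub-prefix length that covers it, deletion can always fall back to the next-longest surviving sub-prefix.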

The following example illustrates the deletion operation. The sub-prefix entry "01101000111" is to be deleted from the trie of Figure 4. The steps taking place on level 2 of the trie are of interest; the steps on levels 0 and 1 are the same as described in Section 2.1. Expansion of "111" into "1110" and "1111" occurs. The deletion procedure looks at the word with offset "1110." Its VSP becomes "100." Looking at the V flag, which is "1," and the F flag, which is "1," it is understood that the R field of this word contains a full prefix, which is not to be deleted. The deletion procedure moves on to the next word. In the next word, the VSP is also made "100." After examining the V and F flags, which have the values "1" and "0" respectively, the deletion process determines that a sub-prefix is stored in R; looking at the VSP again, it sees that no bits for longer sub-prefixes are asserted (in this case, there are no longer sub-prefixes at all). It finds that a bit for a shorter sub-prefix is asserted, namely the bit corresponding to the 1-bit sub-prefix. The 1-bit sub-prefix of "1111" is "1." The offset "1100" is determined by concatenating "1" with "100," as mentioned in the beginning of this section. Therefore, R at offset "1111" has to be replaced by the value of the RS field at offset "1100": at offset "1111," R gets the value "m." Notice that CC has not changed, because no entries have been invalidated. Given that the node at level 2 is not to be deleted, because its CC is not 0, this concludes the deletion of "01101000111."

2.3. Multicast enhancement


Figure 5: Representation of multicast node

Multicast has become a significant part of IP traffic, and the amount of multicast traffic is expected to continue growing for IPv4. IPv6 has a wide range of multicast addresses for different purposes. Therefore, the issue of speedy multicast forwarding should be addressed.

In IPv4, multicast traffic has the designated address class D, in which the first four bits are "1110." In IPv6, for a packet to be multicast, it has to have "FF" in hex, or "11111111" in binary, as the first byte of the DA [8]. In Section 3, it is suggested that K be chosen as 4 for IPv4 and 8 for IPv6. Such a choice of node size allows the following: one of the FPs at the root level points to the multicast trie. The algorithm follows the link to the multicast trie, where somewhat different procedures are executed.

The difference is the following. Instead of storing values in the R and RS fields, pointers to multicast pattern values are stored in a combined R & RS field. This field should be wide enough to address a sufficiently large number of multicast patterns. Refer to Figure 5. The width of the unicast word is 40 bits. If we consider a 64 × 64 switch, the multicast pattern occupies two memory words. Multicast patterns are located in a different part of memory. When a multicast entry is to be inserted into the trie, the unicast procedure is executed, except that when the R or RS field needs to be read or written, the R & RS field points to the place where the actual values are stored. In Figure 5, two multicast nodes are shown. The node with the segment X0011 is a multicast node; if the search needs its R value, it uses the R & RS field as a pointer to the memory location that keeps the actual values of both R and RS, and it follows this link. In the case of R, the search reads as many words as the multicast pattern takes, starting at offset 0. In the case of RS, the search starts from the middle of the segment holding the multicast patterns for R and RS, because it is aware of the multicast pattern size; in this case, it starts reading at offset "10."

Memory management is also complicated by this enhancement. Instead of one ILL, there are two: one holding nodes and another holding multicast patterns. The problem with this scheme is that while the nodes' ILL may run out very fast, the multicast patterns' ILL can still have a lot of memory, and there is no way to transfer this resource. The problem can be resolved by more complex schemes of memory management. In Figure 5, a single ILL could be used, because the nodes and the multicast patterns each occupy four words; however, this is coincidental, and in general two ILLs have to be used.

2.4. Performance enhancement

Several performance enhancements can be applied to the trie. First, it can be observed that there are no prefixes shorter than 8 bits. Therefore, for IPv4, it may be beneficial to combine levels 0 and 1 and keep the combined level in a small, fast cache. This enhancement saves one clock cycle. Other enhancements may also be applicable to the trie.

3. Performance analysis

As mentioned in the previous subsections, the appropriate values of K for IPv4 and IPv6 are 4 and 8, respectively. In this section, the worst- and best-case timing and memory requirements are presented. The worst case for timing is defined as the time necessary to operate on a 29-bit prefix for IPv4 and a 121-bit prefix for IPv6, since these take the longest. The worst case for memory occurs when all the entries are located in individual nodes of the trie, so that beyond some level each node has only one valid child and the trie degenerates into a number of linked lists.

First, the calculations for IPv4 are carried out. Assume that the forwarding table consists of 80,000 entries. In the worst case for K = 4, there is 1 node on level 0, 16 nodes on level 1, and then 256, 4,096, 65,536, 80,000, 80,000, and 80,000 nodes on levels 2 through 7, respectively. The total number of nodes is therefore 309,905. Every node consists of 16 words of 40 bits (5 bytes) each, so 309,905 × 16 × 5 ≅ 24 megabytes. The worst case should also account for multicast: if all 80,000 entries are multicast entries, the patterns require 5 × 4 × 80,000 ≅ 1.52 megabytes. Thus, the total memory requirement is about 26 megabytes.
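The worst-case memory figures above can be reproduced with a few lines of arithmetic:

```python
# Worst-case IPv4 memory (K = 4, 80,000 entries): per-level node counts,
# 16 words of 5 bytes per node, plus 4-word multicast patterns.

levels = [1, 16, 256, 4_096, 65_536, 80_000, 80_000, 80_000]
nodes = sum(levels)                 # total trie nodes: 309,905
node_bytes = nodes * 16 * 5         # ~24 megabytes of trie nodes
multicast_bytes = 5 * 4 * 80_000    # ~1.52 megabytes if every entry is multicast
total_bytes = node_bytes + multicast_bytes   # ~26 megabytes in all
```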

The worst case for the search is 8 memory cycles. For insertion of a 29-bit prefix, the following memory cycles are spent on each of the eight affected levels: 2 for obtaining a node from the ILL, 2 for updating an offset, and 2 for updating CC. In addition, 8 words have to be updated in the leaf node, at 2 memory cycles each. The total is therefore 8 × (2 + 2 + 2) + 8 × 2 = 64 memory cycles. For deletion, the calculation is carried out in a similar way, except that the deletion stack has to be built, which takes an additional 8 memory cycles; thus, 72 memory cycles is the worst case for deleting an entry.
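The cycle counts derived above, expressed directly:

```python
# Worst-case memory-cycle counts for a 29-bit IPv4 prefix (8 trie levels):
# per level, 2 cycles for an ILL fetch + 2 for the offset + 2 for CC,
# then 8 leaf-node word updates at 2 cycles each; deletion adds the
# 8 cycles needed to build the deletion stack.

LEVELS = 8
insertion = LEVELS * (2 + 2 + 2) + 8 * 2   # 64 memory cycles
deletion = insertion + LEVELS              # 72 memory cycles
```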

Carrying out similar calculations for IPv6 with K = 8 yields 1.3 terabytes of memory. In practice, this should not occur because of the hierarchical address structure of IPv6; more precise figures could be obtained by simulation. As for the time requirements, the worst case for the search is 16 memory cycles; for insertion, 352; and for deletion, 368. Table 1 summarizes these findings.

Table 1: Worst-case scenario: 80,000 entries

                                    IPv4 (K = 4)    IPv6 (K = 8)
  Memory requirements               26 megabytes    1.3 terabytes
  Search time (memory cycles)       8               16
  Insertion time (memory cycles)    64              352
  Deletion time (memory cycles)     72              368

The best case for memory occurs when all nodes used in the trie are completely filled with valid information. For IPv4, with K = 4, 80,000/16 = 5,000 nodes are used, each occupying 80 bytes: 5,000 × 80 ≅ 400 kilobytes. For IPv6, with K = 8, 80,000/256 ≅ 313 nodes are used, each occupying 1,280 bytes: 313 × 1,280 ≅ 400 kilobytes as well.
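The best-case memory arithmetic, checked explicitly:

```python
# Best-case memory: every trie node is completely filled, so the entry
# count divides across full nodes (16 children for K = 4, 256 for K = 8).
import math

ipv4_nodes = math.ceil(80_000 / 16)     # 5,000 nodes of 16 words x 5 bytes
ipv6_nodes = math.ceil(80_000 / 256)    # 313 nodes of 256 words x 5 bytes
ipv4_bytes = ipv4_nodes * 16 * 5        # ~400 kilobytes
ipv6_bytes = ipv6_nodes * 256 * 5       # also ~400 kilobytes
```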

For the best-case timing for IPv4, the shortest prefix used is 8 bits, so all operations happen on the second level: searching an 8-bit prefix takes 2 memory cycles, insertion takes 4, and removal takes 5. For IPv6, everything can happen on the root level when K = 16: the search takes 1 memory cycle, insertion 2, and removal 3. Table 2 summarizes the performance metrics for the best-case scenario.

Table 2: Best-case scenario: 80,000 entries

                                    IPv4 (K = 4)    IPv6 (K = 8)
  Memory requirements               400 kilobytes   400 kilobytes
  Search time (memory cycles)       2               1
  Insertion time (memory cycles)    4               2
  Deletion time (memory cycles)     5               3

It is anticipated that actual performance lies somewhere between the best and worst cases: memory requirements should be much closer to the best case, and time requirements closer to the worst case. Simulation is necessary to determine the real values.

4. Implementation

For speed, it is recommended that this algorithm be implemented in hardware. The circuit is responsible for three main operations: search, insertion, and removal of entries. Whereas insertion and removal require interfacing with the microprocessor, search does not. The circuit is informed of the arrival of a new packet by an appropriate signal, while the necessary fields are supplied to it on a bus. Once the signal marking the beginning of a new packet is asserted, the search process starts, regardless of whether the circuit is busy with some other activity, such as insertion. If the circuit is busy inserting an entry, all the values the insertion has acquired are saved in temporary registers, and the search starts immediately; once it finishes, the insertion resumes from the point at which it was interrupted. This immediate search is the key to wire-speed routing, which in turn allows IP quality-of-service (QoS) implementation. Other techniques, such as clock doubling and clock alignment, were used in the prototype circuit; see reference [2] for a detailed explanation of these implementation details.
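The preemption behavior can be modeled in software, purely as an illustration: an in-progress insertion is represented as a generator that yields between memory cycles, so an arriving packet's search runs immediately and the insertion then resumes exactly where it stopped. In the real design the saved state lives in hardware registers, not a generator.

```python
# Illustrative model of search preemption: the insertion is a generator
# whose suspended state plays the role of the temporary registers.

def insertion_steps(prefix):
    for level in range(8):               # one step per affected trie level
        yield f"update level {level} for {prefix}"

log = []
ins = insertion_steps("10.1/16")         # hypothetical prefix being inserted
log.append(next(ins))                    # insertion makes some progress...
log.append("SEARCH preempts")            # ...a packet arrives: search runs now
log.extend(ins)                          # ...then insertion resumes where it left off
```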


Figure 6: Block diagram of the hardware implementation

Figure 6 shows the block diagram of the hardware implementation of the forwarding part of the router. The table lookup controller interfaces with the memory where the trie is stored, with the microprocessor via a microprocessor interface, and with the circuits located before and after it. The speed of the memory is the main factor affecting the speed of the lookup. For the prototype, entities modeling the Hitachi SSRAM chip HM67S3632 were used; a cost-efficient design can use slower memory. For IPv4, 20-nanosecond (ns) static random access memory (SRAM) requires only 160 ns for a search; for IPv6, the same type of memory requires 320 ns. For example, on an OC-24 interface of an IPv4 router, the smallest packet that can be received is an Ethernet frame of at least 64 bytes, which takes (64 × 8 bits)/(1.25 Gbit/s) ≅ 410 ns. Therefore, after the search is done, 250 ns remain for other tasks, such as insertion or deletion of an entry, collection of statistical data, and so forth. Note that deletion of an entry requires 48 memory cycles, or 960 ns. This means that under the critical condition of shortest possible packets arriving back to back, one insertion or deletion can be performed for every four lookups. For IPv6, the same hardware allows one trie update per 80 lookups. It is thus possible to perform more than 30,000 updates per second, which is more than enough to satisfy the most demanding routing algorithms. To stress this again: these numbers are guaranteed for the worst case with respect to both the length of the incoming packets and the location of the entries in the trie.
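The OC-24 timing budget worked out above can be checked directly (20 ns SRAM cycle assumed, as in the prototype):

```python
# OC-24 / IPv4 timing budget: minimum packet time vs. worst-case search,
# with the leftover per-packet slack available for trie updates.

CYCLE_NS = 20                           # 20 ns SRAM cycle
slot_ns = 64 * 8 / 1.25e9 * 1e9         # minimum-size Ethernet frame: ~409.6 ns
search_ns = 8 * CYCLE_NS                # IPv4 worst-case search: 160 ns
spare_ns = slot_ns - search_ns          # ~250 ns left per packet
delete_ns = 48 * CYCLE_NS               # 960 ns for one deletion
lookups_per_update = delete_ns / spare_ns   # just under 4 lookups per update
```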

Any general-purpose microprocessor can be used with the controller; its interface has to be designed according to the microprocessor's characteristics, such as clock rate and the widths of the data and address buses.

First-in, first-out (FIFO) queues in the controller store instructions from the microprocessor and send acknowledgements back. Because updates, unlike searches, are not time critical, they can be queued.

5. Conclusion and future work

In this paper, we presented an algorithm for efficient table lookup. The most obvious application is IP forwarding, but others are also possible. The basic version of the algorithm was enhanced to handle variable-size prefixes and multicast. The algorithm uses no complex functions; its only operations are logical bit-wise operations, concatenation, and memory accesses, which enables a simple hardware implementation. Worst-case analysis showed that wire-speed routing can be accomplished using inexpensive random-access memory chips.

Simulation of the algorithm using real routing tables is the next step in this work. In the longer term, we are going to look at layer 4 routing and IP QoS. In both areas, we would like to investigate aspects of router architecture as well as issues related to the performance of the whole network, such as end-to-end delay. Our algorithm may be used in both of these areas.

Bibliography

[1] Andrey Belenkiy. Fast Memory Look-Up Controller for a Scalable IP Switching Router. Summer Research Report, Center for Advanced Telecommunication Technologies (CATT), Polytechnic University, Oct. 1997.

[2] Andrey Belenkiy. Fast Memory Look-Up Controller for a Scalable IP Switching Router. Master's thesis, Polytechnic University, June 1998.

[3] Pankaj Gupta, Steven Lin, and Nick McKeown. Routing Lookups in Hardware at Memory Access Speeds. White Paper, Computer Systems Laboratory, Stanford University, 1997.

[4] B. Lampson, V. Srinivasan, and G. Varghese. IP Lookups using Multiway and Multicolumn Search. INFOCOM'98, San Francisco, Mar.-Apr. 1998.

[5] Antony McAuley and P. Francis. Fast routing table lookup using CAMs. INFOCOM'93, pages 1382-1391, Mar.-Apr. 1993.

[6] Antony McAuley, Paul F. Tsuchiya, and Daniel V. Wilson. Fast multilevel hierarchical routing table using content addressable memory. United States Patent # 034444, Jan. 1995.

[7] Radia Perlman. Interconnections: Bridges and Routers. Addison-Wesley, 1992.

[8] Camilo Sars. Address Assignment and Management in the IPv6 Environment. Internet Publication, Helsinki University of Technology, 1996.

[9] Quality Semiconductors Staff. Content Addressable Memory ICs Expand QSI's Networking Offerings. Press Release, Quality Semiconductors Inc., 1998.

[10] Torrent Networking Technology Corporation Staff. High-Speed Routing Table Search Algorithms. Technical Paper, Torrent Networking Technology Corporation, 1998.

[11] Hugh M. Wilkinson, George Varghese, and Nigel T. Poole. Compressed Prefix Matching Database Searching. United States Patent # 5781772, Digital Equipment Corporation, 1998.
