Tatuya JINMEI <email@example.com>
Kazu YAMAMOTO <firstname.lastname@example.org>
Jun-ichiro HAGINO <email@example.com>
Internet Initiative Japan Inc.
Munechika SUMIKAWA <firstname.lastname@example.org>
Yoshinou INOUE <email@example.com>
Kazushi SUGYO <firstname.lastname@example.org>
Shoichi SAKANE <email@example.com>
Yokogawa Digital Computer Corporation
The KAME Project is a joint effort by seven companies in Japan to develop a referential implementation of advanced networking protocols such as Internet Protocol Version 6 (IPv6) and IP security. The products are distributed as free software and are widely used in the 6bone. They are based on Berkeley Software Distribution (BSD) variants and have several improvements to the original logic, which cannot handle advanced protocols. This paper describes such new mechanisms as efficient header processing, loop prevention for tunneling, and mobility support for IPv6 plug and play.
With the rapid growth of the Internet, more and more significant and unexpected problems have appeared. In response to some of these problems, Internet Protocol security (IPsec) was developed as a security mechanism while Internet Protocol Version 6 (IPv6) was designed to resolve the exhaustion of the IPv4 address space. As Berkeley Software Distribution (BSD) variants promoted to deploy Transmission Control Protocol (TCP)/IPv4, a free referential implementation of IPsec and IPv6 is necessary for their wide and early deployment.
For this purpose, the KAME Project was initiated in April 1998 in Japan by seven Japanese companies co-operating in the WIDE (Widely Integrated Distributed Environment) Project. It aimed to provide free, working, and "specification conformant" code based on BSD variants.
The logic of the network code of the current BSD variants are unsuitable for the new protocols. For instance, using the old logic, efficient processing of IPv6 extension headers is not possible. Also, it is time consuming to determine if an incoming packet should be accepted or forwarded. To overcome these difficulties, the KAME Project has explored some original approaches in its implementation.
This paper describes the implementation techniques developed by the KAME Project. It is organized as follows: Section 2 briefly describes the KAME Project itself. Section 3 explains the characteristics of KAME implementation. We describe future plans in section 4 and conclude in section 5.
The WIDE Project has organized the IPv6 working group and has been deeply involved in IPv6 since 1995. More than 100 researchers from universities and companies have committed themselves to the working group. The working group has developed nine independent IPv6 implementations, and the group has tested the interoperability between them through its testbed, called the WIDE 6bone .
Such heterogeneity was important in the early stages of IPv6 development because it helped to improve each implementation and to find problems of specification. As IPv6 has been deployed, heterogeneity has performed its role, and instead, efficient development of products and early deployment of the products have become important.
Thus the WIDE IPv6 working group organized the KAME Project. Eight core developers from seven companies came together to develop a single more efficient referential implementation of the advanced networking protocols. The project chose several BSD variants as its platform: FreeBSD, BSD/OS, and NetBSD. At the start, the product of the project was provided in a form of patches for the original operating systems, but the project also aimed to incorporate the product into official releases of the operating systems.
To promote these advanced protocols, the product is freely distributed via the project web site (http://www.kame.net/). The product is distributed in the following three ways:
The KAME product is actually used in the worldwide 6bone. It has also been tested for interoperability and conformity at the InterOperability Laboratory at the University of New Hampshire. The KAME Project is planning to start its own group to evaluate the product. Official releases will be made after quality assurance testing by the group.
This section describes some of KAME's remarkable network software characteristics including basic specifications and neighbor discovery. Loop prevention is related to IPsec as well as to IPv6.
One remarkable feature of IPv6 is its flexible extension headers  combined with the constant length IPv6 header. The limitations of each extension header's length and of the number of extension headers are looser than those of IPv4 options. Thanks to this flexibility, we can try to introduce a new optional function as a separated extension header or as an option of some existing options header.
Traditional BSD variants store an incoming packet in one or more mbufs. There are two types of mbuf. In this paper, we refer to them as an internal mbuf and an external mbuf, respectively. By default, an internal mbuf can contain only 100 to 108 bytes of data. An external mbuf has external storage, called a cluster, whose size is 2048 bytes by default. If an incoming packet fits in a single internal mbuf, the network driver simply stores the packet in the mbuf. Otherwise, the network driver stores the packet into mbufs one of two ways as follows:
In the latter mode, there is a possibility that the first mbuf cannot contain all the headers from the IPv6 header to an upper layer protocol header (e.g., a TCP header). Moreover, a header may be divided and spread over two or more mbufs (Figure 1).
Figure 1. An extension header may be spread over two mbufs
BSD variants usually expect that the network layer header and the transport layer header of an incoming packet lie on a contiguous memory space. If this is not the case, they perform the "pullup'' operation, which copies a specified length of data into a contiguous space in a single mbuf. If there are some IPv4 options, they strip the options before the operation.
This was a good idea when memory resource was a scarce commodity, but the overheads of the copy operation are heavy. In addition, the pullup operation is not suitable for IPv6 environments from another point of view. We must retain all the intermediate extension headers until the input operation terminates, because if we encounter an error while processing an incoming packet, we have to return an Internet Control Message Protocol Version 6 (ICMPv6) error message including as much of the offending packet as will fit . However, if we follow the original convention of BSD variants, some intermediate headers may have been stripped or even discarded before the pullup operation as mentioned above.
Because the cost of memory continues to rapidly decline, processor efficiency should be favored over memory efficiency. From this perspective, we require network drivers to store all the headers in a contiguous space in order to avoid pullup operations. That is, a network driver that is suitable for the KAME network software is required to store an incoming packet in the former of the two modes mentioned above.
To satisfy this requirement, we can write an input routine for a header more simply than before. Each input routine first determines if the requirement is satisfied, and if not, it discards the packet. Otherwise, the routine gets the beginning of the header by the head of the incoming packet and the offset to the header. Then the routine can process the header and, if necessary, the data following the header without any copies or deletions of other headers.
Though we have to rewrite a network driver if it does not satisfy this requirement, most recent network drivers do not have to be rewritten. This is a reflection of the current fashion wherein code simplicity is regarded as more important than memory efficiency.
For an incoming IPv4 packet, traditional BSD variants first determine whether the packet is destined for the node or not. If the packet is destined for the node, the receiving node passes the packet to an appropriate transport layer. Otherwise, the node tries to forward the packet if it is configured to act as a router.
This determination is usually done by comparing the destination address of the incoming packet with each address assigned to the interfaces of the node. Since the comparison linearly examines the addresses, in the worst case, the node has to examine all the addresses. However, this is not very harmful to IPv4 environments because even an IPv4 router does not have many addresses. Moreover, it is a relatively inexpensive task for today's computers to compare IPv4 addresses since the addresses are 32-bit integers.
In contrast, a linear comparison has a more significant effect on IPv6 environments. IPv6 capable nodes may assign multiple IPv6 addresses on a single interface, and consequently, routers may have numerous IPv6 addresses. Also, because today's typical computers are based on 32-bit (or at most 64-bit) architectures, it is fairly time consuming to compare IPv6 addresses, which are 128-bit integers.
Therefore, we have adopted a different scheme for deciding whether an incoming IPv6 packet should be accepted or forwarded. The kernel first looks up the radix routing table to resolve the next hop of an incoming packet and the outgoing interface to the next hop. If the interface is a loopback interface (usually called lo0 in BSD variants), it accepts the packet. Otherwise, it uses the next hop information to forward the packet.
For a router, our method is efficient since the number of packets to be forwarded is much higher than accepted packets, and there is no additional cost for forwarded packets. For a host, if an incoming packet is destined for the host itself, our method needs a few bit tests (to reach a leaf node of the radix tree) and a comparison of two IPv6 addresses. Figure 2 roughly depicts the process. The cost of the comparison can be ignored because it is definitely necessary at least once for any method. Also, it can be safely assumed that the radix tree for a host is not complex since a host usually has only a few host routes in addition to the default route, and, consequently, the cost of the tests can be considered relatively low. Even when the host accidentally receives a packet destined for another node, the host can quickly detect it thanks to the simplicity of its radix tree.
Figure 2. The cost of radix tree lookup is bit tests on the internal nodes and a comparison of addresses at the leaf node
Another improvement intended to avoid linear search is to use a hash technique. A node calculates in advance a hash value of each address assigned to it, then makes a hash table. Each entry of the table is a list of addresses that have a same hash value. For an incoming packet, the node calculates the hash value for the packet's destination address, and compares the address to each address that belongs to the list for the hash value. If we have a good hash algorithm, the comparison is fast enough. In some cases, it will be faster than the radix tree-based comparison. However, if a node is acting as a router, our approach is more efficient.
There are some notions of tunneling in advanced internetworking such as IPv6 and IPsec. For example, IPv6 over IPv4 tunneling is an essential technique during the transition period from IPv4 to IPv6. The IPsec tunnel mode is typically used to construct a VPN (Virtual Private Network).
To implement such tunnels, it is necessary to encapsulate one packet inside another. The implementation may use a pseudo network interface, and the output function associated with the interface receives a packet from a network layer (e.g., IPv4 layer), encrypts and/or authenticates the packet if necessary, and encapsulates it in an outer header. The input function, in contrast, decapsulates an incoming packet, decrypts and/or authenticates the packet, and passes it to a corresponding network layer. When a route for a destination is configured to the pseudo interface, all packets to the destination will be transferred through the tunnel associated with the pseudo interface.
Such an approach may, however, fall into an infinite loop of header creation. An infinite loop is essentially a sequence of processes that eventually reaches an already passed point. Because an output to a pseudo network interface may cause an output to the same interface, the sequence of the outputs has a possibility of an infinite loop.
As a simple example, suppose that we want to construct an IPv4 IPsec tunnel for a destination D and that we, perhaps mistakenly, configure the pseudo interface, for example I, with the same destination D. As we already mentioned, a route for destination D is set up to I. When we try to send a packet to D, the packet is passed to I according to the routing table, encrypted and/or authenticated in I, and encapsulated into a new header, whose destination address is also D. Then the encapsulated packet is sent to the IPv4 output routine, which sends the packet to I again. Thus we enter an infinite loop (Figure 3).
Figure 3. Example of an infinite loop
To deal with this problem, we implemented the IPsec tunnel mode not as an interface, but as a separate routine that is called from a network layer output function only once (Figure 4).
Figure 4. The IPsec encapsulation routine is called only once to prevent a loop
As a result, an IPsec tunnel can be constructed only once in a single node, though this is not a strict restriction because IPsec tunnels are not created more than once in a practical configuration.
An infinite loop may occur when constructing a tunnel that encapsulates a network protocol into a different network protocol, such as an IPv6 over IPv4 tunnel. For example, suppose that we construct two tunnels. One is an IPv6 over IPv4 tunnel, and the other is an IPv4 over IPv6 tunnel. Also suppose that the IPv6 over IPv4 tunnel encapsulates a packet to an IPv6 address D6 into a packet to an IPv4 address D4, and that the IPv4 over IPv6 tunnel encapsulates a packet to D4 into a packet to D6. If we try to send a packet to D6 we encounter a loop which infinitely constructs the two tunnels one after another (Figure 5).
Figure 5. An infinite loop consisted of two tunnels
However, such a combination of tunnels like the above example does not appear in a typical configuration. Hence we implemented a generic framework consisting of a pseudo interface and used it to construct IPv6 over IPv4 tunnels. To avoid possible loops, we introduced a counter in the output function for the interface, which is incremented each time the function is recursively called. If the counter reaches some limit, the function discards the packet, records that it detects a loop, and returns an error to the user. The flexibility of the pseudo interface is decreased to some extent by the use of this trick, but in this instance, we believe robustness should take priority over flexibility.
Plug and Play based on Neighbor Discover Protocol  is one of IPv6's major improvements. When a node attaches to an IPv6 network, the node can automatically configure its IPv6 address(es) and discover one or more available routers. Moreover, the configuration process needs no state machine in the server. Network managers do not have to configure a special server such as a DHCP server. They only have to configure network prefixes for routers, which is eventually necessary to achieve global connectivity.
Once a router is configured, it periodically advertises the configured network prefixes to each link attached to the router. Each host on the network receives the advertisements and configures itself using the prefix(es) contained in the advertisements. A host also regards the sender of the advertisement as a default router, thus achieving connectivity to the global IPv6 Internet.
The specification defines two types of lifetime for an advertised prefix. One is preferred lifetime (default value is 7 days) and the other is valid lifetime (default value is 30 days). They are used to invalidate prefixes smoothly. If the preferred lifetime of a prefix has expired, the addresses generated by the prefix should not be used as a connection's source address unless it is already established. If the valid lifetime of a prefix has expired, the addresses generated by the prefix become completely invalid; that is, the addresses must not be used as a source address for any connections. This mechanism is useful, for example, when a site is renumbering its addresses.
The default values of the lifetimes are, however, relatively long for mobile nodes. Consider this problematic case as an example. Suppose that there are two networks, N1 and N2, and that prefixes P1 and P2 are advertised in N1 and N2, respectively. Now consider that a mobile node M moves from N1 to N2. When M attached to N1, an IPv6 address P1:M was configured for M. Then M moves to N2 and configures a new address P2:M. If the movement happens quickly, which is usually the case, the lifetimes of the old prefix P1 do not expire and the old address P1:M is, as a result, still valid.
Let us assume that M sends a packet to an off-link node D through a router R in N2. There are two possible addresses as the source address of the packet; P1:M and P2:M. But if the former is used and D tries to send a response to it, D will send the response to P1:M. Since the prefix P1 lies on the network N1, M is not able to receive the response unless the host route for P1:M is fully advertised. Figure 6 depicts this situation.
Figure 6. A mobile node can send a packet, but cannot receive responses
There is another problem if M tries to send a packet to a node D in N1 whose address is P1:D. In this case, because D and M have the same prefix P1, M will regard D as on-link and try to send the packet to it directly instead of sending to the neighboring router R (Figure 7). But the packet will not reach D because in reality D is off-link.
Figure 7. A mobile node regards an off-link node as on-link by mistake
One simple solution to these problems is to force the node to discard the old prefix and the old address after moving. But this approach may cause an unexpected deletion of prefixes and addresses that should remain valid. For instance, suppose that there is a failure of the router on a network while a laptop computer is suspended. When the laptop's operation is resumed, it should discard the prefix and the address that were advertised before suspension, because there is a possibility that the laptop has moved from the network. The laptop, however, cannot get a prefix anymore because the router has already stopped; it therefore fails to communicate with any nodes, even with those within the network.
Thus we implemented a mobility support that is a bit more complicated. We introduced a data structure for each prefix, which has a list of routers that advertise the prefix. Note that the list may be empty when, for example, all routers that advertise the prefix are unreachable. Then we defined two states for a prefix, attached and detached. A prefix is called attached if its router list is not empty or if all the prefixes including the prefix do not have a non-empty router list. Otherwise, the prefix is called detached. A detached prefix is not regarded as on-link and the address derived from the prefix is not used as a source address even if the lifetimes of the prefix are not expired.
Based on the above model, we implemented a mechanism in the kernel to control the state of each advertised prefix. For instance, if a router turns out to be unreachable, which can quickly be detected using IPv6 Neighbor Unreachability Detection, the kernel automatically changes the state of each prefix advertised by the router.
To return to the above problems, since the prefix P1 of the mobile node M of Figure 6 is detached after detecting unreachability of the router in N1, the old address P1:M will be never used as a source address. Also, since the detached prefix P1 is no longer regarded as on-link, M will pass a packet to a neighboring router when the packet is destined for a node in N1. Now let us consider the case of the failure of the router. If the router R of Figure 6 stops, both P1 and P2 are attached since there is no other prefix that has an associated router. In this case, M can only communicate with nodes in N2, and there is no problem since M can use the addresses derived from the attached prefixes.
It is important to ensure the source-level portability of applications on various operating systems. Portability encourages people to develop a variety of applications, and consequently we have many opportunities to test and improve interoperability, which is also an important notion on the Internet.
Application Programming Interfaces (APIs) play a key role to ensure portability. There are two types of API defined by IETF: The first is the basic API, which defines some library functions, data structure, and macros for developing typical TCP and UDP applications. The other is the advanced API, which is designed to develop and support "advanced" applications such as routing daemons and network management tools. Interfaces designed to use IPv6 extension headers are also defined in the API.
Because we attach importance to portability, we have quickly adopted the latest version of these APIs. We have also tried to use the APIs both in the kernel and in userland applications. For example, all our "advanced" applications such as routing daemons, a router advertisement daemon, ping, and traceroute use the advanced API.
Adopting the latest specification of such APIs does not necessarily provide portability, and even may cause confusion due to version mismatches. This is because the specification tends to change in the early stages of its standardization. We believe, however, that in order to encourage the early standardization and deployment of the APIs, it is more important to actively introduce the latest specifications than to stick to short-term portability.
IPv6 and IPsec have already become fairly standardized, but there are still some topics that are neither well-documented nor fully standardized.
It is difficult for network managers to renumber their sites, and hence they rarely change their network providers. Router renumbering is one of the key techniques needed to remedy this situation. Though its specification has not been fully standardized, we are now implementing it experimentally and are planning to run it on the WIDE 6bone.
IPv6 multi-cast routing is also in the early stages of deployment. Some documentation for the IPv6 PIM exists, but there are few instances of implementation and interoperability between different implementations that are fully established. So far, we have implemented PIM dense mode for IPv6 and have confirmed that it works to some extent. Next, we will have to test it in practical applications and check its interoperability with other implementations.
The WIDE IPv6 working group organized the KAME Project in order to develop a referential implementation of advanced networking protocols for BSD variants and to promote the protocols through their implementation.
Because some of the logic used in BSD variants was not well suited to the advanced protocols, we explored different approaches to implementing the KAME network software; we implemented incoming IPv6 packets processing in an efficiency-conscious manner. We prevented infinite IPsec header creation loops by restricting the header construction to a single instance. Infinite loops in other tunnels such as IPv6 over IPv4 tunnels, meanwhile, was avoided by introducing an upper limit for nesting. IPv6 neighbor discovery was implemented so that it would be more compatible with mobile stations.
In the future, we plan to continue the project with special emphasis on such areas as the implementation of more advanced technologies such as router renumbering and IPv6 multicast routing.
We are indebted to the members of the WIDE IPv6 working group,for their detailed contributions. We also thank all the users of the KAME network software for their valuable comments.