The most compact configuration, D16R, distills a real-world BGP snapshot with 513,644 IPv4 prefixes and 239 distinct next hops into a structure consuming only 932 Kbytes, less than 2 bytes per prefix, and exceeds 100 million lookups per second (MLps) on a single commodity CPU core in synthetic tests with uniformly random queries. Some other DXR configurations exceed 200 MLps per CPU core at the cost of increased memory footprint, and deliver aggregate throughputs exceeding two billion lookups per second with multiple parallel worker threads.
DXR significantly outperforms a software implementation of DIR-24-8-BASIC, has better scalability, and consumes less DRAM bandwidth.
For DXR configurations with K >= 16, the start of each range can be encoded using only 16 bits, since (at least) the 16 most significant bits of the search key have already been resolved via the direct table lookup. Thus, after reserving another 16 bits for storing an index into the next hop table, each address range entry can be encoded using only four bytes. Since cache lines in most contemporary CPUs contain 64 bytes, a single cache line can hold 16 such address range descriptors. As the binary search for the address range matching the key progresses, subsequent search iterations become increasingly likely to be served directly from the L1 cache.
Because the majority of IPv4 prefixes announced in global routing tables are (still) /24 or less specific, a further optimization is possible for chunks which contain only such prefixes and point only to next hop table entries whose indices fit in 8 bits. In such cases the 8 least significant bits of each range base are guaranteed to be all zeros, which permits range entries to be encoded using only two bytes, further improving cache hit rates.
Additional details on data structures, incremental updating algorithms and performance analysis can be found in our CCR paper.
Note that there were other proposals predating DXR which explored range-based LPM, such as Lampson-Varghese multiway lookups. The novelty of DXR is in the compact encoding of range lookup structures combined with a trivial search procedure. DXR has its roots in RangeIPLookup, an experimental IPv4 routing lookup element distributed with the Click modular router since 2005. We refined and ported it to the FreeBSD kernel, where DXR was coupled with the historic BSD radix tree as its auxiliary database, which simplified adding support for fast incremental updating of the lookup structures.
DXRIPLookup is now available as a Click element, along with a few other related modules: BSDIPLookup, which can be used as a standalone lookup module but also serves as DXR's auxiliary database; DirectIPLookup, a (re)implementation of the DIR-24-8 lookup scheme which unlike the original version now also supports fast incremental updating; and finally BSDIP6Lookup, a Click-IP6 wrapper around the traditional BSD radix tree code.
Click modules are available here: dxr_click_20140423.tgz
To install the modules, unpack the above archive in Click's top-level directory. You'll have to delete the existing elements/ip/rangeiplookup.* files, since that module depends on an older version of the DirectIPLookup element and is made obsolete by the new DXRIPLookup implementation. Then configure, build and install Click using the standard procedure.
By default DXRIPLookup is configured to resolve the 20 most significant bits of lookup keys via direct table lookups (D20R). To experiment with different DXR configurations, edit the file dxriplookup.hh around line 100, set the macro DXR_DIRECT_BITS to the desired value, and then rebuild Click. The choice of DXR_DIRECT_BITS (or K) determines the balance between the overall memory footprint of the lookup structures and the number of binary search iterations required to find the next hop using the range tables, as outlined in the graph below:
The table shows the distribution of range chunk lengths for values of K between 16 and 22. Chunks with a single range are resolved via direct table lookup alone, while longer chunks require up to log2(N) binary search iterations to complete. With the LINX BGP snapshot, no range chunk contains more than 256 elements for any value of K, so a lookup completes in at most eight binary search iterations.
The SEQ (sequential) test introduces artificial dependencies between subsequent queries, which essentially prevent the out-of-order execution circuitry of modern CPUs from pipelining cache / memory accesses from multiple search requests. Since Click currently processes packets one by one (as most contemporary operating systems do), this test provides a reasonably realistic estimate of how fast the lookup elements would perform in real Click configurations.
The RND (random) test indicates how fast lookups could be performed if they were done in batches, a technique which is becoming increasingly popular (and necessary) in software packet processing systems (see Netmap or PacketShader, for example). There are no artificial dependencies between lookup requests, so they can be transparently pipelined by the CPU's out-of-order (OOO) circuitry, which significantly improves throughput compared to the one-by-one packet processing mode.
The REP (repetitive) test operates in the same mode as the RND test, but each key is used eight times within a short timeframe in an attempt to mimic locality in traffic patterns. Since the relevant lookup structures migrate higher in the CPU cache hierarchy after the first query, subsequent requests for the same key are resolved faster, thus yielding higher throughput.
The benchmarks below were obtained using the LINX BGP snapshot on four different machines:
E5-1650: Intel Xeon, 12 MB L3 cache, 3.2 GHz (turbo freq and HT disabled), FreeBSD 9.2
FX-8350: AMD FX, 8 MB L3 cache, 4.0 GHz (turbo freq disabled), FreeBSD 9.2
i7-4771: Intel Core i7, 8 MB L3 cache, 3.5 to 3.9 GHz (HT disabled), Debian 3.13.10-1
i5-3210M: Intel Mobile Core i5, 3 MB L3 cache, 2.5 to 3.1 GHz (HT disabled), FreeBSD 9.2
And here are the results:
SEQ test, single thread, uniformly random keys, MLps throughput:
RND test, single thread, uniformly random keys, MLps throughput:
REP test, single thread, uniformly random keys, MLps throughput:
How to reproduce:
Unpack one of the provided test configurations and start Click with a
control socket listener:
cpx% cd click
cpx% ./userlevel/click -p 1234 linx.click
cpx% telnet localhost 1234
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Click::ControlSocket/1.3
read rt.stat
200 Read handler 'rt.stat' OK
DATA 428
DXRIPLookup (D20R): 513644 prefixes, 239 unique nexthops
Lookup tables: 4194304 bytes direct, 512148 bytes range (9.1 bytes/prefix)
Direct table resolves 93.8% of IPv4 address space
Longest range chunk contains 176 fragments
Physical chunks: 40247 short, 1416 long
Physical fragments: 237728 short, 9173 long
Aggregated chunks: 63313 short, 1477 long
Aggregated fragments: 306484 short, 9354 long
Last update duration: 909.6 ms
write rt.bench_sel 1
200 Write handler 'rt.bench_sel' OK
read rt.bench
200 Read handler 'rt.bench' OK
DATA 108
DXRIPLookup (D20R), RND test, uniformly random keys:
268435456 lookups in 1.406164355 s (190.9 M lookups/s)