Route Cache

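Disclaimer:

This article is not original work. It is a set of reading notes on the route source code, compiled from the reference material below.

References:

http://bbs.chinaunix.net/thread-1919577-1-1.html

Understanding Linux Network Internals (深入理解Linux网络技术内幕)

Source code: 2.6.32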

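Before a packet can be sent, the kernel has to find the outgoing interface. Linux splits this work into three steps:
Step 1: look up the route cache;
Step 2: on a miss, look up the FIB table;
Step 3: store the result in the route cache for future lookups.

This article covers the route cache.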
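
The three steps can be sketched in user-space C. This is only a toy model under stated assumptions: the names toy_cache, toy_fib_lookup and toy_route_lookup are hypothetical, the "FIB" rule is made up, and the cache is a tiny direct-mapped array rather than the kernel's chained hash table.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define TOY_CACHE_SLOTS 8   /* hypothetical size, must be a power of two */

struct toy_cache_entry {
    uint32_t daddr;   /* lookup key: destination address */
    int      oif;     /* cached result: output interface index */
    int      valid;
};

static struct toy_cache_entry toy_cache[TOY_CACHE_SLOTS];
static unsigned toy_fib_lookups;  /* counts how often the slow path ran */

/* Stand-in for the FIB lookup (slow path); the routing rule is made up. */
static int toy_fib_lookup(uint32_t daddr)
{
    toy_fib_lookups++;
    return (daddr >> 24) == 10 ? 1 : 2;
}

static int toy_route_lookup(uint32_t daddr)
{
    struct toy_cache_entry *e = &toy_cache[daddr & (TOY_CACHE_SLOTS - 1)];
    int oif;

    /* Step 1: try the cache. */
    if (e->valid && e->daddr == daddr)
        return e->oif;

    /* Step 2: cache miss, consult the slow table. */
    oif = toy_fib_lookup(daddr);

    /* Step 3: remember the answer for later packets. */
    e->daddr = daddr;
    e->oif = oif;
    e->valid = 1;
    return oif;
}
```

Calling toy_route_lookup twice with the same address runs toy_fib_lookup only once; the second call is answered from the cache.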

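1. What is the route cache
Route lookup is one of the most important jobs of the IP layer, and also a time-consuming one. To make it faster, the Linux kernel introduces a route cache that reduces the number of routing table lookups. Caches are everywhere in computing, after all. The Linux route cache (sometimes abbreviated as DST below) is designed as an independent, protocol-agnostic subsystem. A typical route cache looks like this: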

root@kendo-ThinkpadT410:~# route -Cn
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
10.1.1.199 74.125.53.102 10.1.1.254 0 0 3 eth0
10.1.1.199 219.148.35.84 10.1.1.254 0 0 0 eth0
10.1.1.199 118.123.3.237 10.1.1.254 0 0 21 eth0
61.55.167.138 10.1.1.199 10.1.1.199 l 0 0 33 lo
10.1.1.199 203.208.37.22 10.1.1.254 0 0 3 eth0
10.1.1.183 10.1.1.255 10.1.1.255 ibl 0 0 1 lo
10.1.1.199 72.14.213.101 10.1.1.254 0 0 1 eth0
10.1.1.137 10.1.1.255 10.1.1.255 ibl 0 0 0 lo
10.1.1.199 61.139.2.69 10.1.1.254 0 0 53 eth0
10.1.1.199 8.8.8.8 10.1.1.254 0 0 45 eth0
10.1.1.199 220.166.65.249 10.1.1.254 0 0 21 eth0
10.1.1.199 207.46.193.178 10.1.1.254 0 0 3 eth0
219.148.35.84 10.1.1.199 10.1.1.199 l 0 0 2 lo
10.1.1.199 72.14.203.148 10.1.1.254 0 0 0 eth0
8.8.8.8 10.1.1.199 10.1.1.199 l 0 0 22 lo
10.1.1.199 207.46.193.178 10.1.1.254 0 0 1 eth0
10.1.1.199 219.232.243.91 10.1.1.254 0 0 2 eth0
10.1.1.199 118.123.3.236 10.1.1.254 0 0 21 eth0
……

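2. Route cache initialization
2.1 ip_rt_init
The route cache is stored in a hash table. The most important part of initialization is allocating the hash table and the SLAB cache used for its entries, which is done in ip_rt_init: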

int __init ip_rt_init(void)
{
……
/* Create the SLAB cache used to allocate DST entries */
ipv4_dst_ops.kmem_cachep =
kmem_cache_create("ip_dst_cache", sizeof(struct rtable), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
ipv4_dst_blackhole_ops.kmem_cachep = ipv4_dst_ops.kmem_cachep;
/* Allocate the route cache hash table, sized according to system memory */
rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);
/* Initialize the hash table */
memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
rt_hash_lock_init();
/* gc_thresh and ip_rt_max_size are used by garbage collection */
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
……

The pointer rt_hash_table points to the cache hash table. Each bucket is a struct rt_hash_bucket, and the entries chained under a bucket are of type struct rtable.

/*
* Route cache.
*/
/* The locking scheme is rather straight forward:
*
* 1) Read-Copy Update protects the buckets of the central route hash.
* 2) Only writers remove entries, and they hold the lock
* as they look at rtable reference counts.
* 3) Only readers acquire references to rtable entries,
* they do so with atomic increments and with the
* lock held.
*/
struct rt_hash_bucket {
struct rtable *chain;
};

rt_hash_bucket has a single member, a pointer to struct rtable. rtable describes one cache entry (for details on rtable and dst_entry, see Understanding Linux Network Internals):

struct fib_nh;
struct inet_peer;
struct rtable
{
union
{
struct dst_entry dst;
} u;
/* Cache lookup keys */
struct flowi fl;
struct in_device *idev;
int rt_genid;
unsigned rt_flags;
__u16 rt_type;
__be32 rt_dst; /* Path destination */
__be32 rt_src; /* Path source */
int rt_iif;
/* Info on neighbour */
__be32 rt_gateway;
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
struct inet_peer *peer; /* long-living peer info */
};

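rtable consists of two parts: a protocol-independent member of type struct dst_entry, and the remaining protocol-specific fields. The most elegant part of the route cache is the separate DST abstraction. Its designers made it protocol-independent, which means IPv4, IPv6 or any other network-layer protocol can use it. Note that the dst member is wrapped in a union placed first in the structure, so the dst_entry and the rtable share the same address and one pointer can be cast between the two types.

[Figure: layout of the route cache hash table]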
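
The cast between the two types can be illustrated with a small user-space sketch. The toy_ structures below are hypothetical, heavily reduced stand-ins for dst_entry and rtable; the only property they share with the kernel structures is that the generic part is the first member.

```c
#include <assert.h>
#include <stdint.h>

struct toy_dst_entry {          /* protocol-independent part */
    unsigned long lastuse;
    int           use_count;
};

struct toy_rtable {             /* IPv4-specific wrapper */
    union {
        struct toy_dst_entry dst;   /* must stay the first member */
    } u;
    uint32_t rt_dst;
    uint32_t rt_gateway;
};

/* Generic code only sees the dst part. */
static void toy_dst_touch(struct toy_dst_entry *dst, unsigned long now)
{
    dst->use_count++;
    dst->lastuse = now;
}

/* IPv4 code recovers the full entry from the generic pointer. */
static struct toy_rtable *toy_rtable_from_dst(struct toy_dst_entry *dst)
{
    return (struct toy_rtable *)dst;
}
```

Because the union is the first member, &rt and &rt.u.dst are the same address, so the cast in toy_rtable_from_dst needs no pointer arithmetic.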

2.2 Allocating the hash table
The DST cache hash table is allocated by the system API alloc_large_system_hash:

rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);

With the call above in mind, let's analyze the implementation of alloc_large_system_hash:

/*
* allocate a large system hash table from bootmem
* - it is assumed that the hash table must contain an exact power-of-2
* quantity of entries
* - limit is the number of hash buckets, not the total allocation size
*/
void *__init alloc_large_system_hash(const char *tablename, /* name of the hash table */
unsigned long bucketsize, /* size of each bucket */
unsigned long numentries, /* total number of entries (i.e. number of buckets) */
int scale,
int flags,
unsigned int *_hash_shift,
unsigned int *_hash_mask,
unsigned long limit)
{

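The first three parameters are self-explanatory; the rest become clear as we walk through the code. One interesting point: the caller does not have to specify the number of buckets up front, even though that number is crucial for a hash table.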

unsigned long long max = limit;
unsigned long log2qty, size;
void *table = NULL;

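If the caller does not pass a table size, the number of entries is derived from the amount of system memory. For the DST subsystem the value passed in is rhash_entries, a kernel command-line parameter that the user can set at boot: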

/* allow the kernel cmdline to have a say */
if (!numentries) {
/* round applicable memory size up to nearest megabyte */
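/**
* numentries starts from nr_kernel_pages, the actual number of pages in the DMA and normal zones
*/
numentries = nr_kernel_pages;
/**
* This step rounds numentries up to a whole number of megabytes, expressed in pages.
* On 32-bit x86 one megabyte holds 1UL << (20 - PAGE_SHIFT) pages, i.e. 256, which is the value used below.
* For example, 100 becomes 256, 257 becomes 512 and 1000 becomes 1024
* (assuming 4 KB pages).
*/
/**
* Adding "pages per MB" rounds up: 2 becomes 257 (and the shifts below then turn it into 256)
* rather than 0 (rounding down). The -1 handles exact boundaries: 0 stays 0 and 256 stays 256 (instead of being pushed up to 512).
*/
numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
/* The right shift followed by the left shift simply aligns the value to a multiple of 256 */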
numentries >>= 20 - PAGE_SHIFT;
numentries <<= 20 - PAGE_SHIFT;
/* limit to 1 bucket per 2^scale bytes of low memory */
/* scale is only used when numentries was passed as 0; it is the ratio for deriving numentries from memory size */
if (scale > PAGE_SHIFT)
numentries >>= (scale - PAGE_SHIFT);
else
numentries <<= (PAGE_SHIFT - scale);
/* Make sure we've got at least a 0-order allocation.. */
if (unlikely((numentries * bucketsize) < PAGE_SIZE))
numentries = PAGE_SIZE / bucketsize;
}
/* Round up to the nearest power of two */
numentries = roundup_pow_of_two(numentries);
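
The sizing arithmetic above can be checked in user space. The sketch below is an approximation under stated assumptions: TOY_PAGE_SHIFT is fixed at 12 (4 KiB pages), the toy_ function names are hypothetical, and toy_roundup_pow_of_two is a simple loop standing in for the kernel's roundup_pow_of_two.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Round a page count up to a whole number of megabytes' worth of pages. */
static unsigned long toy_round_up_to_mb(unsigned long pages)
{
    unsigned long pages_per_mb = 1UL << (20 - TOY_PAGE_SHIFT);  /* 256 */

    pages += pages_per_mb - 1;
    pages >>= 20 - TOY_PAGE_SHIFT;
    pages <<= 20 - TOY_PAGE_SHIFT;
    return pages;
}

/* One bucket per 2^scale bytes of memory. */
static unsigned long toy_apply_scale(unsigned long pages, int scale)
{
    if (scale > TOY_PAGE_SHIFT)
        return pages >> (scale - TOY_PAGE_SHIFT);
    return pages << (TOY_PAGE_SHIFT - scale);
}

/* Smallest power of two that is >= n (n must be non-zero). */
static unsigned long toy_roundup_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    assert(n != 0);
    while (p < n)
        p <<= 1;
    return p;
}
```

As a worked example, 262144 pages (1 GiB with 4 KiB pages) and scale 15 give 262144 >> 3 = 32768 candidate buckets, before the limit check that follows is applied.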

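The last parameter, limit, caps the number of buckets. If it is not given, one is computed automatically: by default the table may use at most 1/16 of total memory (a right shift by four):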

/* limit allocation size to 1/16 total memory by default */
if (max == 0) {
max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4;
do_div(max, bucketsize);
}

If numentries exceeds the limit, clamp it:

if (numentries > max)
numentries = max;

Take the base-2 logarithm of numentries (ilog2: log of base 2 of a 32-bit or 64-bit unsigned value). The table then holds 2^log2qty entries, which is conveniently written as 1 << log2qty.

log2qty = ilog2(numentries);

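The table itself can be allocated in one of three ways, chosen by the flags parameter and the global variable hashdist. If an allocation fails, the loop halves the table size and tries again: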

do {
size = bucketsize << log2qty;
/**
* This shows the role of the fifth parameter: if HASH_EARLY is set, the table is allocated during early boot
* from bootmem; otherwise one of the other methods is used.
*/
if (flags & HASH_EARLY)
table = alloc_bootmem_nopanic(size);
else if (hashdist)
table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
else {
/*
* If bucketsize is not a power-of-two, we may free
* some pages at the end of hash table which
* alloc_pages_exact() automatically does
*/
if (get_order(size) < MAX_ORDER) {
table = alloc_pages_exact(size, GFP_ATOMIC);
kmemleak_alloc(table, size, 1, GFP_ATOMIC);
}
}
} while (!table && size > PAGE_SIZE && --log2qty);

/* Allocation failed */

if (!table)
panic("Failed to allocate %s hash table\n", tablename);

/* The success message shows what the key computed values mean; they can be checked against dmesg */

printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n",
tablename,
(1U << log2qty),
ilog2(size) - PAGE_SHIFT,
size);

The sixth parameter returns log2qty to the caller; its meaning was covered above.

if (_hash_shift)
*_hash_shift = log2qty;

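1U << log2qty is the number of buckets. The seventh parameter, *_hash_mask, returns that count minus one, so the table has *_hash_mask + 1 buckets.
Because the bucket count is a power of two, count minus one is a bit mask: hash & mask always yields a valid bucket index.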

if (_hash_mask)
*_hash_mask = (1 << log2qty) - 1;
return table;
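
How the returned shift and mask are typically used can be sketched as follows. toy_ilog2 is a portable loop standing in for the kernel's ilog2, and toy_bucket_index is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(n)) for n > 0, a portable stand-in for the kernel's ilog2(). */
static unsigned toy_ilog2(unsigned long n)
{
    unsigned log = 0;

    assert(n != 0);
    while (n >>= 1)
        log++;
    return log;
}

/* Map a hash value to a bucket index of a table with (mask + 1) buckets. */
static uint32_t toy_bucket_index(uint32_t hash, uint32_t mask)
{
    return hash & mask;
}
```

For a power-of-two table size, hash & mask gives the same result as hash % size while avoiding a division.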

The "dist" in hashdist stands for distribution:

#define HASH_EARLY 0x00000001 /* Allocating during early boot? */
/* Only NUMA needs hash distribution. 64bit NUMA architectures have
* sufficient vmalloc space.
*/
#if defined(CONFIG_NUMA) && defined(CONFIG_64BIT)
#define HASHDIST_DEFAULT 1
#else
#define HASHDIST_DEFAULT 0
#endif
extern int hashdist; /* Distribute hashes across NUMA nodes? */
int hashdist = HASHDIST_DEFAULT;
#ifdef CONFIG_NUMA
static int __init set_hashdist(char *str)
{
if (!str)
return 0;
hashdist = simple_strtoul(str, &str, 0);
return 1;
}
__setup("hashdist=", set_hashdist);
#endif

So hashdist exists mainly to support NUMA, and the "distribution" presumably corresponds to a property of vmalloc: the memory need not be physically contiguous.

Once the role of each parameter of alloc_large_system_hash is clear, the allocation of the DST hash table is fully understood.

After this initialization the hash table of the route cache exists, but its chains are still empty. Next we look at how entries are looked up, added and removed.

3. Cache lookup

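When the route cache is queried, the routing subsystem selects a bucket based on a combination of source IP, destination IP, TOS, and the input or output device. There are two cases:

Forwarded packets: when the IP layer receives a packet that needs forwarding, the input device is used for the cache lookup.
For forwarded traffic, the cache lookup is done in ip_route_input:
if (((rth->fl.fl4_dst ^ daddr) |
(rth->fl.fl4_src ^ saddr) |
(rth->fl.iif ^ iif) | /* match on the input device */
rth->fl.oif |
(rth->fl.fl4_tos ^ tos)) == 0 &&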
Naturally, when the entry is inserted into the cache, the device recorded is also the input device.
For forwarded traffic, ip_mkroute_input inserts the cache entry:
/* put it into the cache */
hash = rt_hash(daddr, saddr, fl->iif, /* hash on the input device */
rt_genid(dev_net(rth->u.dst.dev)));
return rt_intern_hash(hash, rth, NULL, skb);
Locally generated packets: for locally generated traffic, the output device is used for the cache lookup.
For locally generated traffic, __ip_route_output_key does the cache lookup:
if (rth->fl.fl4_dst == flp->fl4_dst &&
rth->fl.fl4_src == flp->fl4_src &&
rth->fl.iif == 0 &&
rth->fl.oif == flp->oif && /* match on the output device */
rth->fl.mark == flp->mark &&
!((rth->fl.fl4_tos ^ flp->fl4_tos) &
(IPTOS_RT_MASK | RTO_ONLINK)) &&
net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)) {
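Likewise, when the entry is inserted into the cache, the device recorded is the output device.
For local traffic, ip_mkroute_output inserts the cache entry:
hash = rt_hash(oldflp->fl4_dst, oldflp->fl4_src, oldflp->oif, /* hash on the output device */
rt_genid(dev_net(dev_out)));
err = rt_intern_hash(hash, rth, rp, NULL);
Note that incoming traffic always has a known input device, whereas locally generated traffic does not always have a known output device.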

Next, taking forwarded packets as the example, we analyze the cache lookup.
When the IP layer receives a packet, ip_rcv_finish calls ip_route_input to do the route lookup:

static int ip_rcv_finish(struct sk_buff *skb)
{
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
/*
* Initialise the virtual path cache for the packet. It describes
* how the packet travels inside Linux networking.
*/
if (skb_dst(skb) == NULL) {
int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
skb->dev);
if (unlikely(err)) {
if (err == -EHOSTUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INADDRERRORS);
else if (err == -ENETUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INNOROUTES);
goto drop;
}
}

ip_route_input first tries the cache and falls back to the routing table on a miss. Only the cache lookup is analyzed here:

int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
u8 tos, struct net_device *dev)
{
struct rtable * rth;
unsigned hash;
int iif = dev->ifindex;
struct net *net;
net = dev_net(dev);
if (!rt_caching(net))
goto skip_cache;
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
rcu_read_lock();
for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.dst.rt_next)) {
if (((rth->fl.fl4_dst ^ daddr) |
(rth->fl.fl4_src ^ saddr) |
(rth->fl.iif ^ iif) |
rth->fl.oif |
(rth->fl.fl4_tos ^ tos)) == 0 &&
rth->fl.mark == skb->mark &&
net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)) {
dst_use(&rth->u.dst, jiffies);
RT_CACHE_STAT_INC(in_hit);
rcu_read_unlock();
skb_dst_set(skb, &rth->u.dst);
return 0;
}
RT_CACHE_STAT_INC(in_hlist_search);
}
rcu_read_unlock();

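ip_route_input first calls rt_hash to compute the hash value, which selects a bucket in rt_hash_table. It then walks every entry on that bucket's chain in a for loop and tries to match it. The match criteria are:
destination address
source address
input interface
output interface is zero and the ToS is equal
netfilter mark
the network namespace must be the same (net_eq compares network namespaces, not protocol families such as IPv4 and IPv6)
the entry has not expired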
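
The chain walk can be sketched in user space as below. This is a simplified model, not the kernel code: the toy_ types are hypothetical, the mark and namespace checks are left out, expiry is reduced to a flag, and there is no RCU or locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_key {
    uint32_t dst, src;
    int      iif, oif;
    uint8_t  tos;
};

struct toy_entry {
    struct toy_key    key;
    int               expired;
    unsigned          hits;
    struct toy_entry *next;     /* next entry in the same bucket */
};

/* Walk one bucket chain and return the first live entry that matches. */
static struct toy_entry *toy_input_lookup(struct toy_entry *chain,
                                          uint32_t dst, uint32_t src,
                                          int iif, uint8_t tos)
{
    struct toy_entry *e;

    for (e = chain; e != NULL; e = e->next) {
        if (e->key.dst == dst && e->key.src == src &&
            e->key.iif == iif && e->key.oif == 0 &&
            e->key.tos == tos && !e->expired) {
            e->hits++;          /* stands in for dst_use() */
            return e;
        }
    }
    return NULL;                /* miss: caller falls back to the slow path */
}
```

An expired entry with a matching key is skipped, so the walk continues to later entries on the same chain.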

On a cache hit, dst_use takes a reference on the entry and updates its use counter and timestamp:

static inline void dst_use(struct dst_entry *dst, unsigned long time)
{
dst_hold(dst);
dst->__use++;
dst->lastuse = time;
}

The RT_CACHE_STAT_INC macro increments the hit counter, and skb_dst_set sets the dst of the current skb:

static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
skb->_skb_dst = (unsigned long)dst;
}

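One important point: when a cache entry is created, its dst_entry records the function that handles the next step, such as output. Setting the skb's dst here therefore means the skb can go on to be processed and forwarded.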

A small change: note that the matching code differs noticeably from older versions:

for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.rt_next)) {
if (rth->fl.fl4_dst == daddr &&
rth->fl.fl4_src == saddr &&
rth->fl.iif == iif &&
rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
rth->fl.fl4_fwmark == skb->nfmark &&
#endif
rth->fl.fl4_tos == tos) {

The code above is from 2.6.12. The newer version adds two more checks:

net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)

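net_eq checks that the cached entry's device belongs to the same network namespace as the device the packet arrived on; it compares namespaces, not protocols. rt_is_expired checks whether the cache entry has expired.
Another change is that the (XX == XX) && (YY == YY) comparisons became (XX ^ XX) | (YY ^ YY). The reasoning is:
if A == B, then A ^ B == 0

The OR of all the XOR results is therefore zero only when every pair matches. Folding the comparisons this way replaces a chain of conditional tests with a single one, which is usually cheaper for the CPU. More examples of bit operations replacing other operations appear later.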
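
The equivalence of the two forms can be checked with a small sketch. The function names are hypothetical and the keys are reduced to three 32-bit pairs.

```c
#include <assert.h>
#include <stdint.h>

/* Short-circuit form: up to three separate conditional tests. */
static int toy_match_cmp(uint32_t a1, uint32_t b1,
                         uint32_t a2, uint32_t b2,
                         uint32_t a3, uint32_t b3)
{
    return a1 == b1 && a2 == b2 && a3 == b3;
}

/* Folded form: XOR is zero only for equal operands, OR keeps any set bit. */
static int toy_match_xor(uint32_t a1, uint32_t b1,
                         uint32_t a2, uint32_t b2,
                         uint32_t a3, uint32_t b3)
{
    return ((a1 ^ b1) | (a2 ^ b2) | (a3 ^ b3)) == 0;
}
```

Both functions return the same result for every input; whether the folded form is actually faster depends on the compiler and the CPU.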

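3. Adding cache entries

When the cache lookup misses, the system looks up the routing table. If that lookup succeeds, it creates a cache entry and inserts it into the route cache hash table, so later packets no longer need a routing table lookup. For example: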

/* Allocate a route cache entry */
rth = dst_alloc(&ipv4_dst_ops);
if (!rth)
goto e_nobufs;
/* Initialize the members of the rtable */
rth->u.dst.output= ip_rt_bug;
rth->rt_genid = rt_genid(net);
atomic_set(&rth->u.dst.__refcnt, 1);
rth->u.dst.flags= DST_HOST;
if (IN_DEV_CONF_GET(in_dev, NOPOLICY))
rth->u.dst.flags |= DST_NOPOLICY;
rth->fl.fl4_dst = daddr;
rth->rt_dst = daddr;
rth->fl.fl4_tos = tos;
rth->fl.mark = skb->mark;
rth->fl.fl4_src = saddr;
rth->rt_src = saddr;
#ifdef CONFIG_NET_CLS_ROUTE
rth->u.dst.tclassid = itag;
#endif
rth->rt_iif =
rth->fl.iif = dev->ifindex;
rth->u.dst.dev = net->loopback_dev;
dev_hold(rth->u.dst.dev);
rth->idev = in_dev_get(rth->u.dst.dev);
rth->rt_gateway = daddr;
rth->rt_spec_dst= spec_dst;
rth->u.dst.input= ip_local_deliver;
rth->rt_flags = flags|RTCF_LOCAL;
if (res.type == RTN_UNREACHABLE) {
rth->u.dst.input= ip_error;
rth->u.dst.error= -err;
rth->rt_flags &= ~RTCF_LOCAL;
}
rth->rt_type = res.type;
/* Compute the hash value */
hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
/* Insert the entry into the hash table */
err = rt_intern_hash(hash, rth, NULL, skb);

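Inserting a cache entry involves four actions:
1. Call dst_alloc to allocate a cache entry;
2. Initialize the members of rth (struct rtable); a full understanding of rtable requires analyzing the whole routing subsystem, so it is skipped here;
3. Compute the hash value to find the bucket;
4. Call rt_intern_hash to insert the cache entry.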

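3.1 Allocating a cache entry
To add an entry to the hash table, dst_alloc is first called to allocate a route cache entry. The allocation itself takes one object from the SLAB cache. Before allocating, the function runs garbage collection if the number of entries exceeds gc_thresh; garbage collection is described in detail later: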

void * dst_alloc(struct dst_ops * ops)
{
struct dst_entry * dst;
/* Garbage collection */
if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
if (ops->gc(ops))
return NULL;
}
/* Allocate the entry from the slab cache */
dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
if (!dst)
return NULL;
/* Initialize the members */
atomic_set(&dst->__refcnt, 0);
dst->ops = ops;
dst->lastuse = jiffies;
dst->path = dst;
dst->input = dst->output = dst_discard;
#if RT_CACHE_DEBUG >= 2
atomic_inc(&dst_total);
#endif
atomic_inc(&ops->entries);
return dst;
}

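As a protocol-independent subsystem, the DST cache has to interact with external events, so it defines a set of functions for that purpose. These functions are collected in struct dst_ops; for IPv4 the instance is ipv4_dst_ops.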

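Each SLAB object has size sizeof(struct rtable), so what dst_alloc allocates is not a bare dst but a whole rtable, which makes the function name slightly misleading. Since dst and rtable pointers can be converted into each other, this is not a problem. The name is not entirely wrong either: the function only initializes the dst part and leaves the rest of the rtable to the caller.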

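3.2 Inserting a cache entry
Insertion is done by rt_intern_hash, which does the following:

a. Walk the hash chain; if the entry is found, move it to the head of the chain (most recently used first).

b. If the entry is not found on the chain, insert it and bind an ARP entry to it, so that the layer-2 frame header can be filled in easily at transmit time.

c. While scanning the chain, select an entry suitable for eviction according to its score and free it if the chain is too long. If no candidate is found and the chain exceeds its maximum length, call rt_emergency_hash_rebuild.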

static int rt_intern_hash(unsigned hash, struct rtable *rt,
struct rtable **rp, struct sk_buff *skb)
{
struct rtable *rth, **rthp;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
int chain_length;
int attempts = !in_softirq();
restart:
chain_length = 0;
min_score = ~(u32)0;
cand = NULL;
candp = NULL;
now = jiffies;
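/* The first thing rt_intern_hash does is check whether the entry being inserted already exists in the cache hash table.
* Normally an insert follows a lookup miss, so this check looks redundant.
* But the route cache can be used by several CPUs in parallel: one CPU may miss on an entry while another CPU is inserting it...
*/
/* Route caching is disabled for this device's network namespace, so there is nothing to search */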
if (!rt_caching(dev_net(rt->u.dst.dev))) {
/*
* If we're not caching, just tell the caller we
* were successful and don't touch the route. The
* caller hold the sole reference to the cache entry, and
* it will be released when the caller is done with it.
* If we drop it here, the callers have no way to resolve routes
* when we're not caching. Instead, just point *rp at rt, so
* the caller gets a single use out of the route
* Note that we do rt_free on this new route entry, so that
* once its refcount hits zero, we are still able to reap it
* (Thanks Alexey)
* Note also the rt_free uses call_rcu. We don't actually
* need rcu protection here, this is just our path to get
* on the route gc list.
*/
/* For unicast forwarding or locally generated packets, try to bind to ARP */
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
if (err) {
/* Handle the failure */
if (net_ratelimit())
printk(KERN_WARNING
"Neighbour table failure & not caching routes.\n");
rt_drop(rt);
return err;
}
}
rt_free(rt);
goto skip_hashing;
}
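/* Get the hash chain. Unlike an ordinary list walk, rthp is a pointer to a pointer */
rthp = &rt_hash_table[hash].chain;
/* Take the chain lock */
spin_lock_bh(rt_hash_lock_addr(hash));
/* Walk the chain to see whether the entry already exists.
* Note how deletion works here: rthp is a pointer to a pointer, which makes list manipulation efficient.
* On each iteration rthp does not point to the next node in the chain but to the pointer that points to it (dst.rt_next):
* rthp = &rth->u.dst.rt_next; To delete a node, it is enough to make that pointer point to the node after the one being deleted:
* *rthp = rth->u.dst.rt_next; so no separate "prev" pointer has to be kept.
*/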
while ((rth = *rthp) != NULL) {
/* Remove expired entries along the way */
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
continue;
}
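/* Compare keys to see whether the entry to insert already exists */
if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
/* On a hit, move it to the head of the chain: it was just used, so it is likely to be wanted first in upcoming lookups */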
/* Put it first */
*rthp = rth->u.dst.rt_next;
/*
* Since lookup is lockfree, the deletion
* must be visible to another weakly ordered CPU before
* the insertion at the start of the hash chain.
*/
rcu_assign_pointer(rth->u.dst.rt_next,
rt_hash_table[hash].chain);
/*
* Since lookup is lockfree, the update writes
* must be ordered for consistency on SMP.
*/
rcu_assign_pointer(rt_hash_table[hash].chain, rth);
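/* Take a reference and update the use counter and timestamp (see dst_use) */
dst_use(&rth->u.dst, now);
/* Unlock */
spin_unlock_bh(rt_hash_lock_addr(hash));
/* The entry already exists, so drop the one we were about to insert */
rt_drop(rt);
/* If the caller supplied rp, return the found entry through it; otherwise attach it to the skb with skb_dst_set */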
if (rp)
*rp = rth;
else
skb_dst_set(skb, &rth->u.dst);
return 0;
}
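/* If the entry is unreferenced, call rt_score to see whether it is the best candidate for eviction.
* rt_score computes a score; the entry with the lowest score is recorded in cand. candp plays the same role as rthp:
* it lets cand be unlinked later without a prev pointer.
*/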
if (!atomic_read(&rth->u.dst.__refcnt)) {
u32 score = rt_score(rth);
/* min_score starts at the largest 32-bit value and is lowered whenever a smaller score is found, so it ends up holding the minimum */
if (score <= min_score) {
cand = rth;
candp = rthp;
min_score = score;
}
}
/* Count the chain length; used below to decide whether the garbage collection thresholds are exceeded */
chain_length++;
rthp = &rth->u.dst.rt_next;
}
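/* The chain has been walked without a match, so the entry will be inserted. Before each insert we try to evict the best candidate cand, to keep the cache from overflowing */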
if (cand) {
/* ip_rt_gc_elasticity used to be average length of chain
* length, when exceeded gc becomes really aggressive.
*
* The second limit is less certain. At the moment it allows
* only 2 entries per bucket. We will see.
*/
/* cand was found and the chain is longer than the garbage collection threshold: unlink it and free it with rt_free */
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
} else {
/* No cand was found but the chain is longer than the maximum chain length: garbage collection is still needed, this time via rt_emergency_hash_rebuild */
if (chain_length > rt_chain_length_max) {
struct net *net = dev_net(rt->u.dst.dev);
int num = ++net->ipv4.current_rt_cache_rebuild_count;
if (!rt_caching(dev_net(rt->u.dst.dev))) {
printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
rt->u.dst.dev->name, num);
}
rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
}
}
/* Try to bind route to arp only if it is output
route or unicast forwarding path.
*/
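/* For unicast forwarding or locally generated packets, try to bind the cache entry to ARP. The binding is a speed-up: at transmit time the layer-2 frame header can be filled in directly */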
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
/* Binding failed */
if (err) {
spin_unlock_bh(rt_hash_lock_addr(hash));
/* Any error other than -ENOBUFS: drop the entry and return the error */
if (err != -ENOBUFS) {
rt_drop(rt);
return err;
}
/* Neighbour tables are full and nothing
can be released. Try to shrink route cache,
it is most likely it holds some neighbour records.
*/
/* On -ENOBUFS, tighten the garbage collection thresholds, run rt_garbage_collect to reclaim actively, and retry */
if (attempts-- > 0) {
int saved_elasticity = ip_rt_gc_elasticity;
int saved_int = ip_rt_gc_min_interval;
ip_rt_gc_elasticity = 1;
ip_rt_gc_min_interval = 0;
rt_garbage_collect(&ipv4_dst_ops);
ip_rt_gc_min_interval = saved_int;
ip_rt_gc_elasticity = saved_elasticity;
goto restart;
}
/* Out of retries and still failing */
if (net_ratelimit())
printk(KERN_WARNING "Neighbour table overflow.\n");
rt_drop(rt);
return -ENOBUFS;
}
}
rt->u.dst.rt_next = rt_hash_table[hash].chain;
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
printk(KERN_DEBUG "rt_cache @%02x: %pI4",
hash, &rt->rt_dst);
for (trt = rt->u.dst.rt_next; trt; trt = trt->u.dst.rt_next)
printk(" . %pI4", &trt->rt_dst);
printk("\n");
}
#endif
/*
* Since lookup is lockfree, we must make sure
* previous writes to rt are comitted to memory
* before making rt visible to other CPUS.
*/
/* Insert the entry */
rcu_assign_pointer(rt_hash_table[hash].chain, rt);
/* Unlock */
spin_unlock_bh(rt_hash_lock_addr(hash));
skip_hashing:
if (rp)
*rp = rt;
else
skb_dst_set(skb, &rt->u.dst);
return 0;
}
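
The pointer-to-pointer technique used by rt_intern_hash for unlinking and for "put it first" can be isolated in a short user-space sketch. The toy_node type and both helpers are hypothetical; locking and RCU ordering are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
    int              key;
    struct toy_node *next;
};

/* Unlink the first node with the given key; returns it or NULL. */
static struct toy_node *toy_unlink(struct toy_node **head, int key)
{
    struct toy_node **pp = head;   /* address of the link we may rewrite */
    struct toy_node *n;

    while ((n = *pp) != NULL) {
        if (n->key == key) {
            *pp = n->next;         /* bypass n, no prev pointer needed */
            n->next = NULL;
            return n;
        }
        pp = &n->next;
    }
    return NULL;
}

/* Move the node with the given key to the head of the chain. */
static int toy_move_to_front(struct toy_node **head, int key)
{
    struct toy_node *n = toy_unlink(head, key);

    if (n == NULL)
        return 0;
    n->next = *head;
    *head = n;
    return 1;
}
```

Because pp always holds the address of the link that leads to the current node, removing the head and removing an inner node are the same operation, with no special case.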

4. Releasing cache entries
rt_intern_hash repeatedly calls rt_free to free an entry or rt_drop to drop one.
The two functions are very similar:

static inline void rt_free(struct rtable *rt)
{
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

static inline void rt_drop(struct rtable *rt)
{
ip_rt_put(rt);
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

rt_drop has one extra call, ip_rt_put, which decrements the entry's reference count through dst_release:

static inline void ip_rt_put(struct rtable * rt)
{
if (rt)
dst_release(&rt->u.dst);
}

static inline void dst_rcu_free(struct rcu_head *head)
{
struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
dst_free(dst);
}

static inline void dst_free(struct dst_entry * dst)
{
/* obsolete > 1 means the entry has already been handled; return immediately */
if (dst->obsolete > 1)
return;
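/* If the reference count is 0, destroy the entry with dst_destroy; otherwise hand it to __dst_free.
* Note that __dst_free is also called when dst_destroy returns a non-NULL entry (a child that is still referenced).
*/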
if (!atomic_read(&dst->__refcnt)) {
dst = dst_destroy(dst);
if (!dst)
return;
}
__dst_free(dst);
}

void __dst_free(struct dst_entry * dst)
{
spin_lock_bh(&dst_garbage.lock);
___dst_free(dst);
dst->next = dst_garbage.list;
dst_garbage.list = dst;
if (dst_garbage.timer_inc > DST_GC_INC) {
dst_garbage.timer_inc = DST_GC_INC;
dst_garbage.timer_expires = DST_GC_MIN;
cancel_delayed_work(&dst_gc_work);
schedule_delayed_work(&dst_gc_work, dst_garbage.timer_expires);
}
spin_unlock_bh(&dst_garbage.lock);
}

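__dst_free first calls ___dst_free. If the entry has no device or the device is not IFF_UP, ___dst_free sets the entry's input/output function pointers to dst_discard. In every case it sets obsolete = 2, which marks the entry as dead, that is, already handled by ___dst_free. This is the counterpart of the check in dst_free above, which returns immediately when obsolete > 1 because the entry has already been handled: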

static void ___dst_free(struct dst_entry * dst)
{
/* The first case (dev==NULL) is required, when
protocol module is unloaded.
*/
if (dst->dev == NULL || !(dst->dev->flags&IFF_UP)) {
dst->input = dst->output = dst_discard;
}
dst->obsolete = 2;
}

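Next, the dst is put on the global list dst_garbage.list. The entry should be freed, but because its reference count is non-zero it is parked here for now,
like a prisoner held for execution at a later date: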

dst->next = dst_garbage.list;
dst_garbage.list = dst;

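The delayed release is scheduled through delayed work. If the timer increment timer_inc is greater than DST_GC_INC, the work is rescheduled so that
dst_garbage.list is processed after the minimum delay DST_GC_MIN: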

if (dst_garbage.timer_inc > DST_GC_INC) {
dst_garbage.timer_inc = DST_GC_INC;
dst_garbage.timer_expires = DST_GC_MIN;
cancel_delayed_work(&dst_gc_work);
schedule_delayed_work(&dst_gc_work, dst_garbage.timer_expires);
}
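
The "free now if unreferenced, otherwise park on a garbage list" idea can be modelled in user space. This is a rough sketch under stated assumptions: all toy_ names are hypothetical, reference counts are plain ints, and toy_gc_run is an invented stand-in for the periodic work item, whose real implementation is not shown in this article.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
    int             refcnt;
    int             obsolete;
    struct toy_dst *next;      /* link on the garbage list */
};

static struct toy_dst *toy_garbage_list;
static unsigned toy_destroyed;     /* how many entries were really released */

static void toy_dst_free(struct toy_dst *dst)
{
    if (dst->obsolete > 1)
        return;                    /* already queued earlier */
    if (dst->refcnt == 0) {
        toy_destroyed++;           /* stands in for dst_destroy() */
        return;
    }
    dst->obsolete = 2;             /* dead, but still referenced */
    dst->next = toy_garbage_list;
    toy_garbage_list = dst;
}

/* Periodic pass: release every queued entry whose last user has gone. */
static void toy_gc_run(void)
{
    struct toy_dst **pp = &toy_garbage_list;
    struct toy_dst *dst;

    while ((dst = *pp) != NULL) {
        if (dst->refcnt == 0) {
            *pp = dst->next;       /* unlink from the garbage list */
            toy_destroyed++;
        } else {
            pp = &dst->next;       /* still in use, keep it queued */
        }
    }
}
```

An entry that is still referenced survives any number of toy_gc_run passes and is released on the first pass after its reference count drops to zero.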

The memory is finally released by dst_destroy. At its core, freeing a cache entry means returning memory to the SLAB:

kmem_cache_free(dst->ops->kmem_cachep, dst);

Because dst is tied to other subsystems, the actual procedure is a little more involved:

struct dst_entry *dst_destroy(struct dst_entry * dst)
{
struct dst_entry *child;
struct neighbour *neigh;
struct hh_cache *hh;
smp_rmb();
again:
neigh = dst->neighbour;
hh = dst->hh;
child = dst->child;
/* Release the cached layer-2 header */
dst->hh = NULL;
if (hh && atomic_dec_and_test(&hh->hh_refcnt))
kfree(hh);
/* Release the associated neighbour (layer-2) structure */
if (neigh) {
dst->neighbour = NULL;
neigh_release(neigh);
}
/* Decrement the total entry counter */
atomic_dec(&dst->ops->entries);
/* Call the protocol's destroy hook if it defines one */
if (dst->ops->destroy)
dst->ops->destroy(dst);
/* Drop the reference on the device */
if (dst->dev)
dev_put(dst->dev);
#if RT_CACHE_DEBUG >= 2
atomic_dec(&dst_total);
#endif
/* Free the memory */
kmem_cache_free(dst->ops->kmem_cachep, dst);
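/* The child pointer is used by the IPsec code and may head a chain of children. Freeing a dst also tries to free its child:
* if the child has DST_NOHASH set and is no longer referenced, goto again releases it in turn. If it is still referenced, it is returned to the caller.
* As noted in the analysis of dst_free, when dst_destroy returns non-NULL, __dst_free is called on the result.
*/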
dst = child;
if (dst) {
int nohash = dst->flags & DST_NOHASH;
if (atomic_dec_and_test(&dst->__refcnt)) {
/* We were real parent of this dst, so kill child. */
if (nohash)
goto again;
} else {
/* Child is still referenced, return it for freeing. */
if (nohash)
return dst;
/* Child is still in his hash table */
}
}
return NULL;
}

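5. Cache garbage collection
There are many reasons to collect garbage, for example the cache growing very large and taking up too much memory. We already saw that inserting a new entry always tries to evict a suitable old one.
That is one example of cache garbage collection.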

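The cache subsystem uses two garbage collection mechanisms:
Synchronous collection
When a new entry is allocated and the total number of entries already exceeds the threshold gc_thresh.
When a new entry is inserted into the hash table and the corresponding chain holds an entry suitable for eviction, as seen earlier.
When the neighbour subsystem needs memory. dst entries and layer-2 neighbour entries reference each other, so if the neighbour cache cannot allocate memory, synchronous collection runs and indirectly frees the memory held by neighbour entries.
Asynchronous collection
A timer periodically triggers garbage collection so that the cache size always stays within a reasonable range.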

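5.1 Synchronous collection
When dst_alloc allocates a new entry and finds that the total number of entries exceeds the threshold gc_thresh, it calls the ops->gc callback first, and the allocation fails if that callback returns non-zero.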
