Personal notes; corrections are welcome.
*Based on the Linux 4.1 kernel source.
Basic principles
Dynamic adjustment of the receive buffer size
For the receiver not to become the bottleneck, it must be able to advertise a window of at least win_size >= RTT * speed. The Linux kernel estimates the window needed for the current RTT (window_clamp) from what was actually received during the previous RTT, and then derives the required receive buffer size from that window. The adjustment is triggered when the connection copies data to user space; see tcp_rcv_space_adjust for details.
Dynamic adjustment of the receive window size
A connection generally starts with a small window and grows it gradually, so that with many concurrent connections they do not all advertise large windows at once and congest the link. Window growth is mainly bounded by rcv_ssthresh. During connection startup, rcv_ssthresh keeps growing as packets are received; once it is large enough it is capped by window_clamp and stops growing. In addition, when the system is under memory pressure the connection actively shrinks rcv_ssthresh so that it does not tie up too much memory.
Relevant variables
Because system memory is finite, both the send and receive buffers of a connection are capped so that a single connection cannot hog memory. On the receive side the cap is sk->sk_rcvbuf. The advertised window is recomputed every time the connection sends a packet, and the choice mainly depends on the following variables:
sk->sk_rcvbuf: size of the receive buffer, in bytes, allocated to the connection. It is normally recomputed from recent receive history after data is copied to user space (see tcp_rcv_space_adjust).
sk->sk_rmem_alloc: receive buffer space already in use. (It grows as packets are received and shrinks after data is handed to user space.)
tp->window_clamp: maximal window to advertise. It is normally the optimal window estimated from recent receive history (see tcp_rcv_space_adjust); the dynamic adjustment of rcv_ssthresh converges toward this value.
tp->rcv_ssthresh: current window clamp, i.e. the largest window the connection may advertise right now. Most of the time the advertised window is determined by rcv_ssthresh, which is adjusted dynamically with tp->window_clamp as its upper bound (see tcp_grow_window).
Source code analysis
Dynamic adjustment of the receive buffer size
The main functions involved are:
1) tcp_rcv_space_adjust(): estimates the window needed for the current RTT, and the corresponding receive buffer size, from what was received during the previous RTT; called after the connection copies data to user space.
2) tcp_rcv_rtt_measure(): RTT estimation when timestamps are not enabled on the connection.
3) tcp_rcv_rtt_measure_ts(): RTT estimation when timestamps are enabled on the connection.
The implementation:
/*
 * This function should be called every time data is copied to user space.
 * It calculates the appropriate TCP receive buffer space.
 */
void tcp_rcv_space_adjust(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int time;
	int copied;

	time = tcp_time_stamp - tp->rcvq_space.time;
	if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
		return;

	/* Number of bytes copied to user in last RTT */
	copied = tp->copied_seq - tp->rcvq_space.seq;
	if (copied <= tp->rcvq_space.space)
		goto new_measure;

	/* A bit of theory :
	 * copied = bytes received in previous RTT, our base window
	 * To cope with packet losses, we need a 2x factor
	 * To cope with slow start, and sender growing its cwin by 100 %
	 * every RTT, we need a 4x factor, because the ACK we are sending
	 * now is for the next RTT, not the current one :
	 * <prev RTT . ><current RTT .. ><next RTT .... >
	 */
	if (sysctl_tcp_moderate_rcvbuf &&
	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
		int rcvwin, rcvmem, rcvbuf;

		/* minimal window to cope with packet losses, assuming
		 * steady state. Add some cushion because of small variations.
		 */
		rcvwin = (copied << 1) + 16 * tp->advmss;

		/* If rate increased by 25%,
		 *	assume slow start, rcvwin = 3 * copied
		 * If rate increased by 50%,
		 *	assume sender can use 2x growth, rcvwin = 4 * copied
		 */
		if (copied >=
		    tp->rcvq_space.space + (tp->rcvq_space.space >> 2)) {
			if (copied >=
			    tp->rcvq_space.space + (tp->rcvq_space.space >> 1))
				rcvwin <<= 1;
			else
				rcvwin += (rcvwin >> 1);
		}

		rcvmem = SKB_TRUESIZE(tp->advmss + MAX_TCP_HEADER);
		while (tcp_win_from_space(rcvmem) < tp->advmss)
			rcvmem += 128;

		rcvbuf = min(rcvwin / tp->advmss * rcvmem, sysctl_tcp_rmem[2]);
		if (rcvbuf > sk->sk_rcvbuf) {
			sk->sk_rcvbuf = rcvbuf;

			/* Make the window clamp follow along. */
			tp->window_clamp = rcvwin;
		}
	}
	tp->rcvq_space.space = copied;

new_measure:
	tp->rcvq_space.seq = tp->copied_seq;
	tp->rcvq_space.time = tcp_time_stamp;
}
Dynamic adjustment of the receive window size
The main functions involved are:
1) tcp_select_window(): computes the final advertised window from the connection's current state.
2) tcp_grow_window(): called when the connection receives a packet (in-order or out-of-order); tries to grow rcv_ssthresh.
3) skb_set_owner_r(): increases sk_rmem_alloc and registers a destructor so that sk_rmem_alloc is decreased again when the skb is freed.
After a TCP connection receives a packet and queues it on sk->sk_receive_queue (the out-of-order case is TBD), it calls skb_set_owner_r() to increase sk_rmem_alloc, and calls tcp_grow_window() to try to grow rcv_ssthresh. When the connection copies the data to user space and frees the skb, the destructor registered in skb_set_owner_r() decreases sk_rmem_alloc. Whenever the connection transmits (data or a pure ACK), tcp_select_window() computes the current advertised window.
The implementation:
static u16 tcp_select_window(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	u32 old_win = tp->rcv_wnd;
	u32 cur_win = tcp_receive_window(tp);
	u32 new_win = __tcp_select_window(sk);

	/* Never shrink the offered window */
	if (new_win < cur_win) {
		/* Danger Will Robinson!
		 * Don't update rcv_wup/rcv_wnd here or else
		 * we will not be able to advertise a zero
		 * window in time.  --DaveM
		 *
		 * Relax Will Robinson.
		 */
		if (new_win == 0)
			NET_INC_STATS(sock_net(sk),
				      LINUX_MIB_TCPWANTZEROWINDOWADV);
		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
	}
	tp->rcv_wnd = new_win;
	tp->rcv_wup = tp->rcv_nxt;

	/* Make sure we do not exceed the maximum possible
	 * scaled window.
	 */
	if (!tp->rx_opt.rcv_wscale && sysctl_tcp_workaround_signed_windows)
		/* must not exceed 32767: some broken stacks treat the
		 * receive window as a signed quantity */
		new_win = min(new_win, MAX_TCP_WINDOW);
	else
		new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));

	/* RFC1323 scaling applied */
	new_win >>= tp->rx_opt.rcv_wscale;

	/* If we advertise zero window, disable fast path. */
	if (new_win == 0) {
		tp->pred_flags = 0;
		if (old_win)
			NET_INC_STATS(sock_net(sk),
				      LINUX_MIB_TCPTOZEROWINDOWADV);
	} else if (old_win == 0) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
	}

	return new_win;
}
u32 __tcp_select_window(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	/* MSS for the peer's data.  Previous versions used mss_clamp
	 * here.  I don't know if the value based on our guesses
	 * of peer's MSS is better for the performance.  It's more correct
	 * but may be worse for the performance because of rcv_mss
	 * fluctuations.  --SAW  1998/11/1
	 */
	int mss = icsk->icsk_ack.rcv_mss;
	int free_space = tcp_space(sk);
	int allowed_space = tcp_full_space(sk);
	int full_space = min_t(int, tp->window_clamp, allowed_space);
	int window;

	if (mss > full_space)
		mss = full_space;

	if (free_space < (full_space >> 1)) {
		icsk->icsk_ack.quick = 0;

		if (sk_under_memory_pressure(sk))
			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
					       4U * tp->advmss);

		/* free_space might become our new window, make sure we don't
		 * increase it due to wscale.
		 */
		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);

		/* if free space is less than mss estimate, or is below 1/16th
		 * of the maximum allowed, try to move to zero-window, else
		 * tcp_clamp_window() will grow rcv buf up to tcp_rmem[2], and
		 * new incoming data is dropped due to memory limits.
		 * With large window, mss test triggers way too late in order
		 * to announce zero window in time before rmem limit kicks in.
		 */
		if (free_space < (allowed_space >> 4) || free_space < mss)
			return 0;
	}

	if (free_space > tp->rcv_ssthresh)
		free_space = tp->rcv_ssthresh;

	/* Don't do rounding if we are using window scaling, since the
	 * scaled window will not line up with the MSS boundary anyway.
	 */
	window = tp->rcv_wnd;
	if (tp->rx_opt.rcv_wscale) {
		window = free_space;

		/* Advertise enough space so that it won't get scaled away.
		 * Import case: prevent zero window announcement if
		 * 1<<rcv_wscale > mss.
		 */
		if (((window >> tp->rx_opt.rcv_wscale) << tp->rx_opt.rcv_wscale) != window)
			window = (((window >> tp->rx_opt.rcv_wscale) + 1)
				  << tp->rx_opt.rcv_wscale);
	} else {
		/* Get the largest window that is a nice multiple of mss.
		 * Window clamp already applied above.
		 * If our current window offering is within 1 mss of the
		 * free space we just keep it.  This prevents the divide
		 * and multiply from happening most of the time.
		 * We also don't do any window rounding when the free space
		 * is too small.
		 */
		if (window <= free_space - mss || window > free_space)
			window = (free_space / mss) * mss;
		else if (mss == full_space &&
			 free_space > window + (full_space >> 1))
			window = free_space;
	}

	return window;
}
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	/* When the data has been copied to user space the skb is freed,
	 * which invokes the destructor registered here. sock_rfree()
	 * decreases sk->sk_rmem_alloc accordingly. */
	skb->destructor = sock_rfree;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_charge(sk, skb->truesize);
}
void sock_rfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	unsigned int len = skb->truesize;

	atomic_sub(len, &sk->sk_rmem_alloc);
	sk_mem_uncharge(sk, len);
}
References:
TCP receive window adjustment algorithm (part 2): https://blog.csdn.net/zhangskd/article/details/8603099
Dynamic adjustment of the TCP receive buffer size: https://blog.csdn.net/zhangskd/article/details/8200048