Skip to content

Commit

Permalink
tcp: Tail loss probe (TLP)
Browse files Browse the repository at this point in the history
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
Nandita Dukkipati authored and davem330 committed Mar 12, 2013
1 parent 83e519b commit 6ba8a3b
Show file tree
Hide file tree
Showing 12 changed files with 171 additions and 28 deletions.
8 changes: 6 additions & 2 deletions Documentation/networking/ip-sysctl.txt
Original file line number Diff line number Diff line change
Expand Up @@ -190,15 +190,19 @@ tcp_early_retrans - INTEGER
Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
for triggering fast retransmit when the amount of outstanding data is
small and when no previously unsent data can be transmitted (such
that limited transmit could be used).
that limited transmit could be used). Also controls the use of
Tail loss probe (TLP) that converts RTOs occuring due to tail
losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
Possible values:
0 disables ER
1 enables ER
2 enables ER but delays fast recovery and fast retransmit
by a fourth of RTT. This mitigates connection falsely
recovers when network has a small degree of reordering
(less than 3 packets).
Default: 2
3 enables delayed ER and TLP.
4 enables TLP only.
Default: 3

tcp_ecn - INTEGER
Control use of Explicit Congestion Notification (ECN) by TCP.
Expand Down
1 change: 0 additions & 1 deletion include/linux/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,6 @@ struct tcp_sock {
unused : 1;
u8 repair_queue;
u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */
early_retrans_delayed:1, /* Delayed ER timer installed */
syn_data:1, /* SYN includes data */
syn_fastopen:1, /* SYN includes Fast Open option */
syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
Expand Down
5 changes: 4 additions & 1 deletion include/net/inet_connection_sock.h
Original file line number Diff line number Diff line change
Expand Up @@ -133,6 +133,8 @@ struct inet_connection_sock {
#define ICSK_TIME_RETRANS 1 /* Retransmit timer */
#define ICSK_TIME_DACK 2 /* Delayed ack timer */
#define ICSK_TIME_PROBE0 3 /* Zero window probe timer */
#define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */
#define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */

static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
{
Expand Down Expand Up @@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
when = max_when;
}

if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) {
icsk->icsk_pending = what;
icsk->icsk_timeout = jiffies + when;
sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
Expand Down
6 changes: 4 additions & 2 deletions include/net/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk,
extern void tcp_push_one(struct sock *, unsigned int mss_now);
extern void tcp_send_ack(struct sock *sk);
extern void tcp_send_delayed_ack(struct sock *sk);
extern void tcp_send_loss_probe(struct sock *sk);
extern bool tcp_schedule_loss_probe(struct sock *sk);

/* tcp_input.c */
extern void tcp_cwnd_application_limited(struct sock *sk);
Expand Down Expand Up @@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
{
tp->do_early_retrans = sysctl_tcp_early_retrans &&
!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
tp->early_retrans_delayed = 0;
sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
sysctl_tcp_reordering == 3;
}

static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
Expand Down
1 change: 1 addition & 0 deletions include/uapi/linux/snmp.h
Original file line number Diff line number Diff line change
Expand Up @@ -202,6 +202,7 @@ enum
LINUX_MIB_TCPFORWARDRETRANS, /* TCPForwardRetrans */
LINUX_MIB_TCPSLOWSTARTRETRANS, /* TCPSlowStartRetrans */
LINUX_MIB_TCPTIMEOUTS, /* TCPTimeouts */
LINUX_MIB_TCPLOSSPROBES, /* TCPLossProbes */
LINUX_MIB_TCPRENORECOVERYFAIL, /* TCPRenoRecoveryFail */
LINUX_MIB_TCPSACKRECOVERYFAIL, /* TCPSackRecoveryFail */
LINUX_MIB_TCPSCHEDULERFAILED, /* TCPSchedulerFailed */
Expand Down
4 changes: 3 additions & 1 deletion net/ipv4/inet_diag.c
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,

#define EXPIRES_IN_MS(tmo) DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)

if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
r->idiag_timer = 1;
r->idiag_retrans = icsk->icsk_retransmits;
r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout);
Expand Down
1 change: 1 addition & 0 deletions net/ipv4/proc.c
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS),
SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
Expand Down
4 changes: 2 additions & 2 deletions net/ipv4/sysctl_net_ipv4.c
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@

static int zero;
static int one = 1;
static int two = 2;
static int four = 4;
static int tcp_retr1_max = 255;
static int ip_local_port_range_min[] = { 1, 1 };
static int ip_local_port_range_max[] = { 65535, 65535 };
Expand Down Expand Up @@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &zero,
.extra2 = &two,
.extra2 = &four,
},
{
.procname = "udp_mem",
Expand Down
24 changes: 15 additions & 9 deletions net/ipv4/tcp_input.c
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly;
int sysctl_tcp_thin_dupack __read_mostly;

int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
int sysctl_tcp_early_retrans __read_mostly = 2;
int sysctl_tcp_early_retrans __read_mostly = 3;

#define FLAG_DATA 0x01 /* Incoming frame contained data. */
#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
Expand Down Expand Up @@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
* max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
* available, or RTO is scheduled to fire first.
*/
if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
(flag & FLAG_ECE) || !tp->srtt)
return false;

delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
return false;

inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
tp->early_retrans_delayed = 1;
inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
TCP_RTO_MAX);
return true;
}

Expand Down Expand Up @@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
* interval if appropriate.
*/
if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
(tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
(tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
!tcp_may_send_now(sk))
return !tcp_pause_early_retransmit(sk, flag);

Expand Down Expand Up @@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
*/
void tcp_rearm_rto(struct sock *sk)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);

/* If the retrans timer is currently being used by Fast Open
Expand All @@ -3094,20 +3096,20 @@ void tcp_rearm_rto(struct sock *sk)
} else {
u32 rto = inet_csk(sk)->icsk_rto;
/* Offset the time elapsed after installing regular RTO */
if (tp->early_retrans_delayed) {
if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
struct sk_buff *skb = tcp_write_queue_head(sk);
const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
/* delta may not be positive if the socket is locked
* when the delayed ER timer fires and is rescheduled.
* when the retrans timer fires and is rescheduled.
*/
if (delta > 0)
rto = delta;
}
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
TCP_RTO_MAX);
}
tp->early_retrans_delayed = 0;
}

/* This function is called when the delayed ER timer fires. TCP enters
Expand Down Expand Up @@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (after(ack, tp->snd_nxt))
goto invalid_ack;

if (tp->early_retrans_delayed)
if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);

if (after(ack, prior_snd_una))
Expand Down Expand Up @@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (dst)
dst_confirm(dst);
}

if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
return 1;

no_queue:
Expand Down
4 changes: 3 additions & 1 deletion net/ipv4/tcp_ipv4.c
Original file line number Diff line number Diff line change
Expand Up @@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
__u16 srcp = ntohs(inet->inet_sport);
int rx_queue;

if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
timer_active = 1;
timer_expires = icsk->icsk_timeout;
} else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
Expand Down
128 changes: 124 additions & 4 deletions net/ipv4/tcp_output.c
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
/* Account for new data that has been sent to the network. */
static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
unsigned int prior_packets = tp->packets_out;

Expand All @@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
tp->frto_counter = 3;

tp->packets_out += tcp_skb_pcount(skb);
if (!prior_packets || tp->early_retrans_delayed)
if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
}

Expand Down Expand Up @@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk)
* snd_up-64k-mss .. snd_up cannot be large. However, taking into
* account rare use of URG, this is not a big flaw.
*
* Send at most one packet when push_one > 0. Temporarily ignore
* cwnd limit to force at most one packet out when push_one == 2.
* Returns true, if no segments are in flight and we have queued segments,
* but cannot send anything now because of SWS or another problem.
*/
Expand Down Expand Up @@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
goto repair; /* Skip network transmission */

cwnd_quota = tcp_cwnd_test(tp, skb);
if (!cwnd_quota)
break;
if (!cwnd_quota) {
if (push_one == 2)
/* Force out a loss probe pkt. */
cwnd_quota = 1;
else
break;
}

if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
break;
Expand Down Expand Up @@ -2049,10 +2059,120 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (likely(sent_pkts)) {
if (tcp_in_cwnd_reduction(sk))
tp->prr_out += sent_pkts;

/* Send one loss probe per tail loss episode. */
if (push_one != 2)
tcp_schedule_loss_probe(sk);
tcp_cwnd_validate(sk);
return false;
}
return !tp->packets_out && tcp_send_head(sk);
return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
}

bool tcp_schedule_loss_probe(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
u32 timeout, tlp_time_stamp, rto_time_stamp;
u32 rtt = tp->srtt >> 3;

if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS))
return false;
/* No consecutive loss probes. */
if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
tcp_rearm_rto(sk);
return false;
}
/* Don't do any loss probe on a Fast Open connection before 3WHS
* finishes.
*/
if (sk->sk_state == TCP_SYN_RECV)
return false;

/* TLP is only scheduled when next timer event is RTO. */
if (icsk->icsk_pending != ICSK_TIME_RETRANS)
return false;

/* Schedule a loss probe in 2*RTT for SACK capable connections
* in Open state, that are either limited by cwnd or application.
*/
if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out ||
!tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
return false;

if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
tcp_send_head(sk))
return false;

/* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account
* for delayed ack when there's one outstanding packet.
*/
timeout = rtt << 1;
if (tp->packets_out == 1)
timeout = max_t(u32, timeout,
(rtt + (rtt >> 1) + TCP_DELACK_MAX));
timeout = max_t(u32, timeout, msecs_to_jiffies(10));

/* If RTO is shorter, just schedule TLP in its place. */
tlp_time_stamp = tcp_time_stamp + timeout;
rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
s32 delta = rto_time_stamp - tcp_time_stamp;
if (delta > 0)
timeout = delta;
}

inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
TCP_RTO_MAX);
return true;
}

/* When probe timeout (PTO) fires, send a new segment if one exists, else
* retransmit the last segment.
*/
void tcp_send_loss_probe(struct sock *sk)
{
struct sk_buff *skb;
int pcount;
int mss = tcp_current_mss(sk);
int err = -1;

if (tcp_send_head(sk) != NULL) {
err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
goto rearm_timer;
}

/* Retransmit last segment. */
skb = tcp_write_queue_tail(sk);
if (WARN_ON(!skb))
goto rearm_timer;

pcount = tcp_skb_pcount(skb);
if (WARN_ON(!pcount))
goto rearm_timer;

if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
goto rearm_timer;
skb = tcp_write_queue_tail(sk);
}

if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
goto rearm_timer;

/* Probe with zero data doesn't trigger fast recovery. */
if (skb->len > 0)
err = __tcp_retransmit_skb(sk, skb);

rearm_timer:
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto,
TCP_RTO_MAX);

if (likely(!err))
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPLOSSPROBES);
return;
}

/* Push out any pending frames which were held back due to
Expand Down
Loading

0 comments on commit 6ba8a3b

Please sign in to comment.