Commit b9f6482

yuchungcheng authored and davem330 committed

tcp: track data delivery rate for a TCP connection
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK
  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control block
in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed or
ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct to
reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing (d1 - d0) in
rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case, we'll
re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to phase
mismatch between transmit and ACK events. To track the delivery rate, we
ensure that if packets are in flight then t0 and t1 are times at which
packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted by
some sort of noise. The noise we see most often is sending gaps caused
by delayed, compressed, or stretched acks. This either affects both RTTs
equally or artificially reduces the final RTT. We approach this by
recording the info we need to compute the initial RTT (the duration of
the "send phase" of the window) when we recorded the associated
inflight. Then, as a filter to avoid bandwidth overestimates, we
generalize the per-sample bandwidth computation from:

    bw = delivered / ack_phase_rtt

to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.

Signed-off-by: Van Jacobson <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Nandita Dukkipati <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
1 parent 0682e69 commit b9f6482
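
To make the arithmetic above concrete, here is a minimal standalone C
sketch (ordinary userspace C, not kernel code; the struct, function, and
all timestamps/counts are invented for illustration) of a per-ACK rate
sample computed as bw = delivered / max(send_phase_rtt, ack_phase_rtt):

#include <stdint.h>
#include <stdio.h>

/* Illustrative per-skb snapshot taken at transmit time (mirrors the
 * tx.delivered / tx.*_mstamp fields this patch adds to tcp_skb_cb). */
struct tx_snapshot {
	uint32_t delivered;        /* d0: tp->delivered when skb was sent */
	uint64_t delivered_mstamp; /* t0: tp->delivered_mstamp, in usec */
	uint64_t first_tx_mstamp;  /* start of the window's send phase */
};

/* bw = (d1 - d0) / max(send_phase_rtt, ack_phase_rtt), in pkts/sec. */
static long sample_bw(const struct tx_snapshot *s, uint32_t d1,
		      uint64_t t1, uint64_t skb_tx_mstamp)
{
	uint64_t send_us = skb_tx_mstamp - s->first_tx_mstamp;
	uint64_t ack_us = t1 - s->delivered_mstamp;
	uint64_t interval_us = send_us > ack_us ? send_us : ack_us;

	if (interval_us == 0)
		return -1; /* invalid sample */
	return (long)((uint64_t)(d1 - s->delivered) * 1000000 / interval_us);
}

int main(void)
{
	struct tx_snapshot s = {
		.delivered = 100, .delivered_mstamp = 5000,
		.first_tx_mstamp = 5200,
	};

	/* 10 pkts delivered; skb sent at t=15200us, ACK processed at
	 * t=15500us: the ack phase (10500us) exceeds the send phase
	 * (10000us), so the longer interval is used. */
	printf("bw = %ld pkts/sec\n", sample_bw(&s, 110, 15500, 15200));
	return 0;
}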

File tree

6 files changed, 222 insertions(+), 16 deletions(-)


include/linux/tcp.h

Lines changed: 2 additions & 0 deletions
@@ -268,6 +268,8 @@ struct tcp_sock {
 	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 	u32	delivered;	/* Total data packets delivered incl. rexmits */
 	u32	lost;		/* Total data packets lost incl. rexmits */
+	struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
+	struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
 	u32	rcv_wnd;	/* Current receiver window */
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */

include/net/tcp.h

Lines changed: 34 additions & 1 deletion
@@ -763,8 +763,14 @@ struct tcp_skb_cb {
 	__u32		ack_seq;	/* Sequence number ACK'd */
 	union {
 		struct {
-			/* There is space for up to 20 bytes */
+			/* There is space for up to 24 bytes */
 			__u32 in_flight;/* Bytes in flight when packet sent */
+			/* pkts S/ACKed so far upon tx of skb, incl retrans: */
+			__u32 delivered;
+			/* start of send pipeline phase */
+			struct skb_mstamp first_tx_mstamp;
+			/* when we reached the "delivered" count */
+			struct skb_mstamp delivered_mstamp;
 		} tx;   /* only used for outgoing skbs */
 		union {
 			struct inet_skb_parm	h4;
@@ -860,6 +866,26 @@ struct ack_sample {
 	u32 in_flight;
 };
 
+/* A rate sample measures the number of (original/retransmitted) data
+ * packets delivered "delivered" over an interval of time "interval_us".
+ * The tcp_rate.c code fills in the rate sample, and congestion
+ * control modules that define a cong_control function to run at the end
+ * of ACK processing can optionally chose to consult this sample when
+ * setting cwnd and pacing rate.
+ * A sample is invalid if "delivered" or "interval_us" is negative.
+ */
+struct rate_sample {
+	struct	skb_mstamp prior_mstamp; /* starting timestamp for interval */
+	u32  prior_delivered;	/* tp->delivered at "prior_mstamp" */
+	s32  delivered;		/* number of packets delivered over interval */
+	long interval_us;	/* time for tp->delivered to incr "delivered" */
+	long rtt_us;		/* RTT of last (S)ACKed packet (or -1) */
+	int  losses;		/* number of packets marked lost upon ACK */
+	u32  acked_sacked;	/* number of packets newly (S)ACKed upon ACK */
+	u32  prior_in_flight;	/* in flight before this ACK */
+	bool is_retrans;	/* is sample from retransmission? */
+};
+
 struct tcp_congestion_ops {
 	struct list_head	list;
 	u32 key;
@@ -946,6 +972,13 @@ static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
 		icsk->icsk_ca_ops->cwnd_event(sk, event);
 }
 
+/* From tcp_rate.c */
+void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb);
+void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
+			    struct rate_sample *rs);
+void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
+		  struct skb_mstamp *now, struct rate_sample *rs);
+
 /* These functions determine how the current flow behaves in respect of SACK
  * handling. SACK is negotiated with the peer, and therefore it can vary
  * between different flows.
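
To illustrate the intended consumer side of struct rate_sample: a
congestion control callback might read a sample roughly as below. This
is a hypothetical sketch, not part of this patch; the cong_control hook
itself arrives later in this series (with TCP BBR), and
example_cong_control and bw_pps are invented names.

/* Hypothetical consumer of struct rate_sample; not part of this patch. */
static void example_cong_control(struct sock *sk, const struct rate_sample *rs)
{
	u64 bw_pps;

	/* Per the comment above struct rate_sample, a sample with
	 * negative "delivered" or "interval_us" is invalid: skip it. */
	if (rs->delivered < 0 || rs->interval_us <= 0)
		return;

	/* Delivery rate over this sample's interval, in packets/sec. */
	bw_pps = (u64)rs->delivered * USEC_PER_SEC;
	do_div(bw_pps, (u32)rs->interval_us);

	/* A real module (e.g. TCP BBR) would feed bw_pps through a
	 * windowed max filter and derive cwnd and the pacing rate. */
}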

net/ipv4/Makefile

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
 	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
 	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
-	     tcp_recovery.o \
+	     tcp_rate.o tcp_recovery.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \

net/ipv4/tcp_input.c

Lines changed: 32 additions & 14 deletions
@@ -1112,6 +1112,7 @@ struct tcp_sacktag_state {
 	 */
 	struct skb_mstamp first_sackt;
 	struct skb_mstamp last_sackt;
+	struct rate_sample *rate;
 	int	flag;
 };
 
@@ -1279,6 +1280,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
 			start_seq, end_seq, dup_sack, pcount,
 			&skb->skb_mstamp);
+	tcp_rate_skb_delivered(sk, skb, state->rate);
 
 	if (skb == tp->lost_skb_hint)
 		tp->lost_cnt_hint += pcount;
@@ -1329,6 +1331,9 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 		tcp_advance_highest_sack(sk, skb);
 
 	tcp_skb_collapse_tstamp(prev, skb);
+	if (unlikely(TCP_SKB_CB(prev)->tx.delivered_mstamp.v64))
+		TCP_SKB_CB(prev)->tx.delivered_mstamp.v64 = 0;
+
 	tcp_unlink_write_queue(skb, sk);
 	sk_wmem_free_skb(sk, skb);
 
@@ -1558,6 +1563,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 						dup_sack,
 						tcp_skb_pcount(skb),
 						&skb->skb_mstamp);
+			tcp_rate_skb_delivered(sk, skb, state->rate);
 
 			if (!before(TCP_SKB_CB(skb)->seq,
 				    tcp_highest_sack_seq(tp)))
@@ -1640,8 +1646,10 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 
 	found_dup_sack = tcp_check_dsack(sk, ack_skb, sp_wire,
 					 num_sacks, prior_snd_una);
-	if (found_dup_sack)
+	if (found_dup_sack) {
 		state->flag |= FLAG_DSACKING_ACK;
+		tp->delivered++; /* A spurious retransmission is delivered */
+	}
 
 	/* Eliminate too old ACKs, but take into
 	 * account more or less fresh ones, they can
@@ -3071,10 +3079,11 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
  */
 static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			       u32 prior_snd_una, int *acked,
-			       struct tcp_sacktag_state *sack)
+			       struct tcp_sacktag_state *sack,
+			       struct skb_mstamp *now)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	struct skb_mstamp first_ackt, last_ackt, now;
+	struct skb_mstamp first_ackt, last_ackt;
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 prior_sacked = tp->sacked_out;
 	u32 reord = tp->packets_out;
@@ -3106,7 +3115,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			acked_pcount = tcp_tso_acked(sk, skb);
 			if (!acked_pcount)
 				break;
-
 			fully_acked = false;
 		} else {
 			/* Speedup tcp_unlink_write_queue() and next loop */
@@ -3142,6 +3150,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 
 		tp->packets_out -= acked_pcount;
 		pkts_acked += acked_pcount;
+		tcp_rate_skb_delivered(sk, skb, sack->rate);
 
 		/* Initial outgoing SYN's get put onto the write_queue
 		 * just like anything else we transmit.  It is not
@@ -3174,16 +3183,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
-	skb_mstamp_get(&now);
 	if (likely(first_ackt.v64) && !(flag & FLAG_RETRANS_DATA_ACKED)) {
-		seq_rtt_us = skb_mstamp_us_delta(&now, &first_ackt);
-		ca_rtt_us = skb_mstamp_us_delta(&now, &last_ackt);
+		seq_rtt_us = skb_mstamp_us_delta(now, &first_ackt);
+		ca_rtt_us = skb_mstamp_us_delta(now, &last_ackt);
 	}
 	if (sack->first_sackt.v64) {
-		sack_rtt_us = skb_mstamp_us_delta(&now, &sack->first_sackt);
-		ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
+		sack_rtt_us = skb_mstamp_us_delta(now, &sack->first_sackt);
+		ca_rtt_us = skb_mstamp_us_delta(now, &sack->last_sackt);
 	}
-
+	sack->rate->rtt_us = ca_rtt_us; /* RTT of last (S)ACKed packet, or -1 */
 	rtt_update = tcp_ack_update_rtt(sk, flag, seq_rtt_us, sack_rtt_us,
 					ca_rtt_us);
 
@@ -3211,7 +3219,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		tp->fackets_out -= min(pkts_acked, tp->fackets_out);
 
 	} else if (skb && rtt_update && sack_rtt_us >= 0 &&
-		   sack_rtt_us > skb_mstamp_us_delta(&now, &skb->skb_mstamp)) {
+		   sack_rtt_us > skb_mstamp_us_delta(now, &skb->skb_mstamp)) {
 		/* Do not re-arm RTO if the sack RTT is measured from data sent
 		 * after when the head was last (re)transmitted.  Otherwise the
 		 * timeout may continue to extend in loss recovery.
@@ -3548,17 +3556,21 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_sacktag_state sack_state;
+	struct rate_sample rs = { .prior_delivered = 0 };
 	u32 prior_snd_una = tp->snd_una;
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
 	bool is_dupack = false;
 	u32 prior_fackets;
 	int prior_packets = tp->packets_out;
-	u32 prior_delivered = tp->delivered;
+	u32 delivered = tp->delivered;
+	u32 lost = tp->lost;
 	int acked = 0; /* Number of packets newly acked */
 	int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
+	struct skb_mstamp now;
 
 	sack_state.first_sackt.v64 = 0;
+	sack_state.rate = &rs;
 
 	/* We very likely will need to access write queue head. */
 	prefetchw(sk->sk_write_queue.next);
@@ -3581,6 +3593,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
+	skb_mstamp_get(&now);
+
 	if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
 	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
 		tcp_rearm_rto(sk);
@@ -3591,6 +3605,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	}
 
 	prior_fackets = tp->fackets_out;
+	rs.prior_in_flight = tcp_packets_in_flight(tp);
 
 	/* ts_recent update must be made after we are sure that the packet
 	 * is in window.
@@ -3646,7 +3661,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
-				    &sack_state);
+				    &sack_state, &now);
 
 	if (tcp_ack_is_dubious(sk, flag)) {
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
@@ -3663,7 +3678,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
 		tcp_schedule_loss_probe(sk);
-	tcp_cong_control(sk, ack, tp->delivered - prior_delivered, flag);
+	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
+	lost = tp->lost - lost;			/* freshly marked lost */
+	tcp_rate_gen(sk, delivered, lost, &now, &rs);
+	tcp_cong_control(sk, ack, delivered, flag);
 	tcp_xmit_recovery(sk, rexmit);
 	return 1;
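
The tcp_rate_gen() call added above lives in the new net/ipv4/tcp_rate.c
(the sixth file in this commit, not rendered on this page). A simplified
sketch of its core, not the verbatim body, under the assumption that
tcp_rate_skb_delivered() has already copied the oldest-delivered skb's
(d0, t0) snapshot into rs->prior_delivered / rs->prior_mstamp and the
send-phase duration into rs->interval_us:

/* Simplified sketch of tcp_rate_gen(); helper details elided. */
static void tcp_rate_gen_sketch(struct tcp_sock *tp, u32 delivered, u32 lost,
				struct skb_mstamp *now, struct rate_sample *rs)
{
	long snd_us, ack_us;

	rs->acked_sacked = delivered;	/* freshly ACKed or SACKed */
	rs->losses = lost;		/* freshly marked lost */

	if (!rs->prior_mstamp.v64) {	/* nothing delivered on this ACK */
		rs->delivered = -1;
		rs->interval_us = -1;	/* mark the sample invalid */
		return;
	}
	rs->delivered = tp->delivered - rs->prior_delivered;	/* d1 - d0 */

	/* Use the longer of the send phase and the ack phase, so that
	 * ACK compression or stretched ACKs cannot inflate the rate. */
	snd_us = rs->interval_us;	/* send phase, set at SACK/ACK time */
	ack_us = skb_mstamp_us_delta(now, &rs->prior_mstamp);
	rs->interval_us = max(snd_us, ack_us);
}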

net/ipv4/tcp_output.c

Lines changed: 4 additions & 0 deletions
@@ -918,6 +918,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	skb_mstamp_get(&skb->skb_mstamp);
 	TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 		- tp->snd_una;
+	tcp_rate_skb_sent(sk, skb);
 
 	if (unlikely(skb_cloned(skb)))
 		skb = pskb_copy(skb, gfp_mask);
@@ -1213,6 +1214,9 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	tcp_set_skb_tso_segs(skb, mss_now);
 	tcp_set_skb_tso_segs(buff, mss_now);
 
+	/* Update delivered info for the new segment */
+	TCP_SKB_CB(buff)->tx = TCP_SKB_CB(skb)->tx;
+
 	/* If this packet has been sent out already, we must
 	 * adjust the various packet counters.
 	 */
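
The tcp_rate_skb_sent() call added to tcp_transmit_skb() is also defined
in the new tcp_rate.c. Conceptually it snapshots (d0, t0) into the skb's
control block at transmit time, roughly as follows (a sketch of the
logic described in the commit message, not the verbatim implementation):

/* Sketch: snapshot delivery state into the skb at transmit time. */
void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* With nothing in flight, the connection is starting up or was
	 * idle, so delivered_mstamp is invalid or outdated: re-stamp it
	 * and restart the send phase from this skb's send time. */
	if (!tp->packets_out) {
		tp->first_tx_mstamp = skb->skb_mstamp;
		tp->delivered_mstamp = skb->skb_mstamp;
	}

	TCP_SKB_CB(skb)->tx.first_tx_mstamp = tp->first_tx_mstamp;
	TCP_SKB_CB(skb)->tx.delivered_mstamp = tp->delivered_mstamp;
	TCP_SKB_CB(skb)->tx.delivered = tp->delivered;
}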
