Commit a87c83d

nealcardwell authored and davem330 committed
tcp_bbr: adjust TCP BBR for departure time pacing
Adjust TCP BBR for the new departure time pacing model in the recent commit ab408b6 ("tcp: switch tcp and sch_fq to new earliest departure time model").

With TSQ and pacing at lower layers, there are often several skbs queued in the pacing layer, and thus there is less data "in the network" than "in flight".

With departure time pacing at lower layers (e.g. fq or potential future NICs), the data in the pacing layer now has a pre-scheduled ("baked-in") departure time that cannot be changed, even if the congestion control algorithm decides to use a new pacing rate.

This means that there can be a non-trivial lag between when BBR makes a pacing rate change and when the inter-skb pacing delays change. After a pacing rate change, the number of packets in the network can gradually evolve to be higher or lower, depending on whether the sending rate is higher or lower than the delivery rate. Thus ignoring this lag can cause significant overshoot, with the flow ending up with too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount of data in the network that it estimates has already been "baked in" by previous departure time decisions. We estimate the number of our packets that will be in the network at the earliest departure time (EDT) for the next skb scheduled as:

  in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"), then we want to know if the transmit of the EDT skb will push in_network above the target, so our answer includes bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing in_network, then we want to know if in_network will sink too low just before the EDT transmit, so our answer does not include the segments from the skb departing at EDT.

Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case differently? The in_network curve is a step function: in_network goes up on transmits, and down on ACKs. To accurately predict when in_network will go beyond our target value, this will happen on different events, depending on whether we're concerned about in_network potentially going too high or too low:

 o if pushing in_network up (pacing_gain > 1.0),
   then in_network goes above target upon a transmit event

 o if pushing in_network down (pacing_gain < 1.0),
   then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated "packets in network" value to make its decisions.

Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
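
As a reading aid, the estimate described in the commit message can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: the function name model_packets_in_net_at_edt() and its explicit parameters are hypothetical stand-ins for values the kernel reads from the socket (tp->tcp_clock_cache, tp->tcp_wstamp_ns, bbr_bw(), bbr->pacing_gain, bbr_tso_segs_goal()). It assumes bw is expressed in packets per microsecond, left-shifted by BW_SCALE, as in tcp_bbr.c.

```c
#include <stdint.h>

#define BW_SCALE	24			/* fixed-point shift for bw (pkts/us) */
#define BBR_SCALE	8			/* fixed-point shift for gains */
#define BBR_UNIT	(1 << BBR_SCALE)	/* gain of 1.0 */
#define NSEC_PER_USEC	1000ULL

/* Hypothetical standalone model of the in-network estimate. All inputs are
 * passed explicitly instead of being read from a struct sock.
 */
static uint32_t model_packets_in_net_at_edt(uint64_t now_ns,
					    uint64_t wstamp_ns,
					    uint32_t bw,
					    uint32_t inflight_now,
					    uint32_t pacing_gain,
					    uint32_t tso_segs_goal)
{
	/* The EDT is never in the past. */
	uint64_t edt_ns = wstamp_ns > now_ns ? wstamp_ns : now_ns;
	uint64_t interval_us = (edt_ns - now_ns) / NSEC_PER_USEC;
	/* Packets expected to be delivered between now and the EDT. */
	uint32_t interval_delivered =
		(uint32_t)(((uint64_t)bw * interval_us) >> BW_SCALE);
	uint32_t inflight_at_edt = inflight_now;

	if (pacing_gain > BBR_UNIT)		/* increasing inflight */
		inflight_at_edt += tso_segs_goal;	/* include EDT skb */
	if (interval_delivered >= inflight_at_edt)
		return 0;			/* clamp instead of underflowing */
	return inflight_at_edt - interval_delivered;
}
```

For example, with bw = (1 << BW_SCALE) / 8 (one packet every 8 us), an EDT 80 us in the future, and 40 packets in flight, about 10 packets are expected to be delivered before the EDT. With a pacing gain of 5/4 and a 2-segment skb the estimate is 40 + 2 - 10 = 32; with a pacing gain of 3/4 the EDT skb is excluded and the estimate is 40 - 10 = 30.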
1 parent cb10c7c commit a87c83d

1 file changed, +35 -2 lines changed


net/ipv4/tcp_bbr.c

Lines changed: 35 additions & 2 deletions
@@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
 	return cwnd;
 }
 
+/* With pacing at lower layers, there's often less data "in the network" than
+ * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq),
+ * we often have several skbs queued in the pacing layer with a pre-scheduled
+ * earliest departure time (EDT). BBR adapts its pacing rate based on the
+ * inflight level that it estimates has already been "baked in" by previous
+ * departure time decisions. We calculate a rough estimate of the number of our
+ * packets that might be in the network at the earliest departure time for the
+ * next skb scheduled:
+ *   in_network_at_edt = inflight_at_edt - (EDT - now) * bw
+ * If we're increasing inflight, then we want to know if the transmit of the
+ * EDT skb will push inflight above the target, so inflight_at_edt includes
+ * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight,
+ * then estimate if inflight will sink too low just before the EDT transmit.
+ */
+static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 now_ns, edt_ns, interval_us;
+	u32 interval_delivered, inflight_at_edt;
+
+	now_ns = tp->tcp_clock_cache;
+	edt_ns = max(tp->tcp_wstamp_ns, now_ns);
+	interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC);
+	interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE;
+	inflight_at_edt = inflight_now;
+	if (bbr->pacing_gain > BBR_UNIT)              /* increasing inflight */
+		inflight_at_edt += bbr_tso_segs_goal(sk);  /* include EDT skb */
+	if (interval_delivered >= inflight_at_edt)
+		return 0;
+	return inflight_at_edt - interval_delivered;
+}
+
 /* An optimization in BBR to reduce losses: On the first round of recovery, we
  * follow the packet conservation principle: send P packets per P packets acked.
  * After that, we slow-start and send at most 2*P packets per P packets acked.
@@ -460,7 +493,7 @@ static bool bbr_is_next_cycle_phase(struct sock *sk,
 	if (bbr->pacing_gain == BBR_UNIT)
 		return is_full_length;		/* just use wall clock time */
 
-	inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
+	inflight = bbr_packets_in_net_at_edt(sk, rs->prior_in_flight);
 	bw = bbr_max_bw(sk);
 
 	/* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
@@ -741,7 +774,7 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 				bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
 	}	/* fall through to check if in-flight is already small: */
 	if (bbr->mode == BBR_DRAIN &&
-	    tcp_packets_in_flight(tcp_sk(sk)) <=
+	    bbr_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <=
 	    bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT))
 		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
 }
