Skip to content

Commit f9ac74e

Browse files
committed
Add middleware to prioritize download traffic
This new middleware layer will reject some requests as the database pool reaches capacity. At 20% load, the `in_flight_requests` count is added to the log output. At 70% load, all safe requests (`GET`, `HEAD`, `OPTIONS`, `TRACE`) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to send `PUT`, `POST`, or `DELETE` requests). This filter also helps avoid rejecting frontend requests that update the database where we don’t always provide good feedback for errors in the UI. Finally, at 80% load all non-download traffic is rejected. In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests. There is some overlap with the LogConnectionPoolStatus middleware. These may eventually be consolidated to avoid some duplicate work and to make smarter decisions regarding the instantaneous spare capacity of individual pools (primary vs read-only replica). The existing middleware does give us some insight into our current in_flight_request counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time. This is because download requests are completed very quickly and other API traffic accounts for about 10 requests per second.
1 parent 2867d08 commit f9ac74e

File tree

2 files changed

+122
-0
lines changed

2 files changed

+122
-0
lines changed

src/middleware.rs

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ use self::log_connection_pool_status::LogConnectionPoolStatus;
1212
use self::static_or_continue::StaticOrContinue;
1313

1414
pub mod app;
15+
mod balance_capacity;
1516
mod block_traffic;
1617
pub mod current_user;
1718
mod debug;
@@ -74,6 +75,27 @@ pub fn build_middleware(app: Arc<App>, endpoints: R404) -> MiddlewareBuilder {
7475

7576
// Note: The following `m.around()` middleware is run from bottom to top
7677

78+
// This is currently the final middleware to run. If a middleware layer requires a database
79+
// connection, it should be run after this middleware so that the potential pool usage can be
80+
// tracked here.
81+
//
82+
// In production we currently have 2 equally sized pools (primary and a read-only replica).
83+
// Because such a large portion of production traffic is for download requests (which update
84+
// download counts), we consider only the primary pool here.
85+
if let Ok(capacity) = env::var("DB_POOL_SIZE") {
86+
if let Ok(capacity) = capacity.parse() {
87+
if capacity >= 10 {
88+
println!(
89+
"Enabling BalanceCapacity middleware with {} pool capacity",
90+
capacity
91+
);
92+
m.around(balance_capacity::BalanceCapacity::new(capacity))
93+
} else {
94+
println!("BalanceCapacity middleware not enabled. DB_POOL_SIZE is too low.");
95+
}
96+
}
97+
}
98+
7799
// Serve the static files in the *dist* directory, which are the frontend assets.
78100
// Not needed for the backend tests.
79101
if env != Env::Test {

src/middleware/balance_capacity.rs

Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
//! Reject certain requests as instance load reaches capacity.
2+
//!
3+
//! The primary goal of this middleware is to avoid starving the download endpoint of resources.
4+
//! When bots send many parallel requests that run slow database queries, download requests may
5+
//! block and eventually timeout waiting for a database connection.
6+
//!
7+
//! Bots must continue to respect our crawler policy, but until we can manually block bad clients
8+
//! we should avoid dropping download requests even if that means rejecting some legitimate
9+
//! requests to other endpoints.
10+
11+
use std::sync::atomic::{AtomicUsize, Ordering};
12+
13+
use super::prelude::*;
14+
use conduit::{RequestExt, StatusCode};
15+
16+
#[derive(Default)]
17+
pub(super) struct BalanceCapacity {
18+
handler: Option<Box<dyn Handler>>,
19+
capacity: usize,
20+
in_flight_requests: AtomicUsize,
21+
}
22+
23+
impl BalanceCapacity {
24+
pub fn new(capacity: usize) -> Self {
25+
Self {
26+
handler: None,
27+
capacity,
28+
in_flight_requests: AtomicUsize::new(0),
29+
}
30+
}
31+
}
32+
33+
impl AroundMiddleware for BalanceCapacity {
34+
fn with_handler(&mut self, handler: Box<dyn Handler>) {
35+
self.handler = Some(handler);
36+
}
37+
}
38+
39+
impl Handler for BalanceCapacity {
40+
fn call(&self, request: &mut dyn RequestExt) -> AfterResult {
41+
// The _drop_on_exit ensures the counter is decremented for all exit paths (including panics)
42+
let (_drop_on_exit, count) = RequestCounter::add_one(&self.in_flight_requests);
43+
let handler = self.handler.as_ref().unwrap();
44+
let load = 100 * count / self.capacity;
45+
46+
// Begin logging request count at 20% capacity
47+
if load >= 20 {
48+
super::log_request::add_custom_metadata(request, "in_flight_requests", count);
49+
}
50+
51+
// Download requests are always accepted
52+
if request.path().starts_with("/api/v1/crates/") && request.path().ends_with("/download") {
53+
return handler.call(request);
54+
}
55+
56+
// Reject read-only requests after reaching 70% load. Bots are likely to send only safe
57+
// requests and this helps prioritize requests that users may be reluctant to retry.
58+
if load >= 70 && request.method().is_safe() {
59+
return over_capcity_response();
60+
}
61+
62+
// At 80% load, all non-download requests are rejected
63+
if load >= 80 {
64+
return over_capcity_response();
65+
}
66+
67+
handler.call(request)
68+
}
69+
}
70+
71+
fn over_capcity_response() -> AfterResult {
72+
// TODO: Generate an alert so we can investigate
73+
let body = "Service temporarily unavailable";
74+
Response::builder()
75+
.status(StatusCode::SERVICE_UNAVAILABLE)
76+
.header(header::CONTENT_LENGTH, body.len())
77+
.body(Body::from_static(body.as_bytes()))
78+
.map_err(box_error)
79+
}
80+
81+
// FIXME(JTG): I've copied the following from my `conduit-hyper` crate. Once we transition from
82+
// `civet`, we could pass the in_flight_request count from `condut-hyper` via a request extension.
83+
84+
/// A struct that stores a reference to an atomic counter so it can be decremented when dropped
85+
struct RequestCounter<'a> {
86+
counter: &'a AtomicUsize,
87+
}
88+
89+
impl<'a> RequestCounter<'a> {
90+
fn add_one(counter: &'a AtomicUsize) -> (Self, usize) {
91+
let previous = counter.fetch_add(1, Ordering::SeqCst);
92+
(Self { counter }, previous + 1)
93+
}
94+
}
95+
96+
impl<'a> Drop for RequestCounter<'a> {
97+
fn drop(&mut self) {
98+
self.counter.fetch_sub(1, Ordering::SeqCst);
99+
}
100+
}

0 commit comments

Comments
 (0)