Skip to content

fix: stream connections longer than 5 minutes are dropped #244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 13, 2023

Conversation

cwaldren-ld
Copy link
Contributor

@cwaldren-ld cwaldren-ld commented Sep 12, 2023

This fixes streaming connection dropping after 5 minutes.

The problem was applying a 5 minute timeout to an async_read operation, when the actual intent was to apply it to the act of receiving any data.

The timeout exists to prevent the client from hanging infinitely if the server stops responding. LaunchDarkly sends heartbeats to clear the timer.

With async_read, the operation won't complete until the response body is finished. Since the body will never finish until the client shuts down (or server interrupts it), it's not possible to institute a timeout.

The solution is to use async_read_some, which completes whenever data arrives.

This way, we can assign the 5 minute timeout as an upper bound on receiving any data, as was originally intended.


A side effect of this change was breaking the reconnection SSE contract tests, which proved to be somewhat of a rabbithole.

The upshot is that we weren't handling chunked encoding properly when chunked encoding ends. The original code assumes we either terminate an async read with an error, or it goes on forever.

In fact it should perform the backoff algorithm if we detect the end of chunked encoding.


I've also included more comments, more debug logging, and renamed some symbols for clarity. This should aid in debugging issues like this more efficiently.

  • Long running connection test (4 hours no drops)

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #216849: Fix stream timing out after 5 minutes.

: level_(Defaults<AnySDK>::LogLevel()), tag_(Defaults<AnySDK>::LogTag()) {}
: level_(GetLogLevelEnum(std::getenv("LD_LOG_LEVEL"),
Defaults<AnySDK>::LogLevel())),
tag_(Defaults<AnySDK>::LogTag()) {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we can just run LD_LOG_LEVEL=debug ./binary but the Console Logger constructed in the client passes in a default value, so the environment variable isn't inspected.

Now it works like you'd want.

* The timestamp of the last read is initialized after receiving successful
* headers, but before initiating the first body read. Future reads then
* update the timestamp.
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this helper so we can more easily tell how fast / when data is being received. For example, this clearly shows heartbeats arriving every 180 seconds.

@@ -79,21 +79,49 @@ class FoxyClient : public Client,
read_timeout_(read_timeout),
write_timeout_(write_timeout),
req_(std::move(req)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-arranging to match declaration order.

@@ -263,8 +293,22 @@ class FoxyClient : public Client,
void on_read_body(boost::system::error_code ec, std::size_t amount) {
boost::ignore_unused(amount);
if (ec == boost::asio::error::operation_aborted) {
logger_("read HTTP response body aborted");
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the key: if the parser indicates that the chunked encoding is finished, initiate backoff instead of attempting to keep reading the body.

@@ -276,6 +320,7 @@ class FoxyClient : public Client,
}

void do_shutdown(std::function<void()> completion) {
Copy link
Contributor Author

@cwaldren-ld cwaldren-ld Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing unnecessary hanging when the test service tells the client to close. By cancelling the backoff timer, that operation can complete immediately.

@cwaldren-ld cwaldren-ld force-pushed the cw/sc-216849/sse-read-timeout branch from 937b304 to ed5ff7f Compare September 13, 2023 00:37
@@ -290,13 +299,13 @@ class FoxyClient : public Client,

void on_read_body(boost::system::error_code ec, std::size_t amount) {
boost::ignore_unused(amount);
if (ec == boost::asio::error::operation_aborted) {
logger_("read HTTP response body aborted");
if (ec == boost::asio::error::operation_aborted && shutting_down_) {
Copy link

@jdegges jdegges Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If on_read_body() can ever be executed from a context that is different from where do_shutdown() is executed then we may have a race between modifying & accessing shutting_down_.

Edit: Never mind, I see this was removed in the last commit :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good callout, especially since I had to add the bool back.

Although the public async_shutdown method will post's to the session_'s executor, the run routine did not - I've fixed that.

Since those are the only two public entrypoints, we should be safe.

@cwaldren-ld cwaldren-ld force-pushed the cw/sc-216849/sse-read-timeout branch from 01bdc8b to 5e28860 Compare September 13, 2023 15:03
@cwaldren-ld cwaldren-ld force-pushed the cw/sc-216849/sse-read-timeout branch from 6d9ca86 to 24fcdfc Compare September 13, 2023 15:38
@@ -35,6 +35,11 @@ std::optional<std::string> EntityManager::create(ConfigParams const& params) {
std::chrono::milliseconds(*params.readTimeoutMs));
}

Copy link
Contributor Author

@cwaldren-ld cwaldren-ld Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This builder option was missing, but it's already implemented. Easy win to speed up tests.

@cwaldren-ld cwaldren-ld force-pushed the cw/sc-216849/sse-read-timeout branch from 23262e3 to d8e7b1d Compare September 13, 2023 16:43
@@ -72,11 +72,11 @@ class EventOutbox : public std::enable_shared_from_this<EventOutbox> {
private:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lints.

@@ -53,7 +59,7 @@ std::optional<std::string> EntityManager::create(ConfigParams const& params) {
return std::nullopt;
}

client->run();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed run to async_connect to make it clear this is kicking off an asynchronous operation.

@@ -14,8 +14,6 @@ auto const kOutboxCapacity = 1023;
EventOutbox::EventOutbox(net::any_io_executor executor,
std::string callback_url)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lints.

RequestType req;

req.set(http::field::host, callback_host_);
req.method(http::verb::get);
req.method(http::verb::post);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get was just plain wrong, but accepted by the contract tests.

@@ -38,6 +38,7 @@ int main(int argc, char* argv[]) {
srv.add_capability("report");
srv.add_capability("post");
srv.add_capability("reconnection");
srv.add_capability("read-timeout");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were already passing the parameter along, just hadn't advertised the capability.

errors_(std::move(errors)) {
backoff_timer_(std::move(executor)),
last_read_(std::nullopt),
shutting_down_(false) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Session construction now takes place in create_session(), so the logic can be reused for each new connection.

}

void run() override {
session_.async_connect(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After renaming run, I wrapped it in a post so we can ensure there is only one thread accessing session. This is the same as async_shutdown.

@cwaldren-ld cwaldren-ld marked this pull request as ready for review September 13, 2023 18:01
@cwaldren-ld cwaldren-ld requested review from a team and kinyoklion September 13, 2023 18:01
@cwaldren-ld cwaldren-ld merged commit e12664f into main Sep 13, 2023
@cwaldren-ld cwaldren-ld deleted the cw/sc-216849/sse-read-timeout branch September 13, 2023 21:46
@github-actions github-actions bot mentioned this pull request Sep 13, 2023
cwaldren-ld pushed a commit that referenced this pull request Sep 13, 2023
🤖 I have created a release *beep* *boop*
---


<details><summary>launchdarkly-cpp-client: 3.0.8</summary>

##
[3.0.8](launchdarkly-cpp-client-v3.0.7...launchdarkly-cpp-client-v3.0.8)
(2023-09-13)


### Bug Fixes

* stream connections longer than 5 minutes are dropped
([#244](#244))
([e12664f](e12664f))


### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * launchdarkly-cpp-internal bumped from 0.1.8 to 0.1.9
    * launchdarkly-cpp-common bumped from 0.3.5 to 0.3.6
    * launchdarkly-cpp-sse-client bumped from 0.1.2 to 0.1.3
</details>

<details><summary>launchdarkly-cpp-common: 0.3.6</summary>

##
[0.3.6](launchdarkly-cpp-common-v0.3.5...launchdarkly-cpp-common-v0.3.6)
(2023-09-13)


### Bug Fixes

* stream connections longer than 5 minutes are dropped
([#244](#244))
([e12664f](e12664f))
</details>

<details><summary>launchdarkly-cpp-internal: 0.1.9</summary>

### Dependencies

* The following workspace dependencies were updated
  * dependencies
    * launchdarkly-cpp-common bumped from 0.3.5 to 0.3.6
</details>

<details><summary>launchdarkly-cpp-sse-client: 0.1.3</summary>

##
[0.1.3](launchdarkly-cpp-sse-client-v0.1.2...launchdarkly-cpp-sse-client-v0.1.3)
(2023-09-13)


### Bug Fixes

* stream connections longer than 5 minutes are dropped
([#244](#244))
([e12664f](e12664f))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This was referenced Oct 23, 2023
@github-actions github-actions bot mentioned this pull request May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client stops receiving flag updates after about 5 minutes
3 participants