Lower memory consumption by streaming request data #2848

rtyley · 2021-11-16T07:37:05Z

Motivation and Context

Prior to this change, the UrlConnectionHttpClient implementation of SdkHttpClient would unintentionally cause the entire RequestBody to be loaded into memory before being sent, even if the body had been made using software.amazon.awssdk.core.sync.RequestBody.fromInputStream() and thus was a stream of known content-length. Even worse, as the buffering ByteArrayInputStream received the request, it would have to repeatedly grow capacity by allocating new arrays and copying the data over, so peak memory usage would be 1.5x the size of the request data. For instance, streaming 1 GB of data to S3 with putObject would lead to a peak memory allocation of 1.5 GB. This large allocation of memory can be avoided by instead streaming the request data as it is sent.

java.net.HttpURLConnection (the core Java class used by UrlConnectionHttpClient to implement SdkHttpClient) supports streaming the request, but will only do it if setFixedLengthStreamingMode() or setChunkedStreamingMode() has been called - this is true even if the Content-Length header has already been set. You can see this in the implementation of sun.net.www.protocol.http.HttpURLConnection where streaming() must be true to avoid the whole request being read into a PosterOutputStream (a simple extension of ByteArrayInputStream) before being sent.

Description

If Content-Length is known, we're actually free to call setFixedLengthStreamingMode() on the connection to activate streaming, the only question is where & when to do it! Unfortunately software.amazon.awssdk.http.urlconnection.UrlConnectionFactory, which provides a way for end-users of the AWS SDK to customise the creation & configuration of the HttpURLConnection, doesn't have access to the request or it's Content-Length, so can't call setFixedLengthStreamingMode(). In any case, this behaviour probably always makes sense, so it doesn't make sense to leave it to users to customise it. That leaves the best place to do this being the createAndConfigureConnection() method, where the newly created HttpURLConnection is available and the Content-Length can be read if it's present.

Consequently, I've updated createAndConfigureConnection() to call setFixedLengthStreamingMode if the Content-Length is available and above a streaming threshold of 16 MB. It's possible that buffering an entire request in memory is beneficial in some way for short requests, I don't know, but I believe for large request objects avoiding the large memory allocation is definitely better.

Testing

At present, the only way to workaround this issue without a change to the AWS SDK is to create a custom SdkHttpClient that wraps UrlConnectionHttpClient and overrides prepareRequest(). As the connection field is not visible in the ExecutableHttpRequest object (it's defined on the private software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient.RequestCallable class), it's necessary to use reflection to extract the connection field before operating on it. An example of this in Scala code can be seen here:

guardian/ophan-geoip-db-refresher@e2fc370#diff-5780ade773d1bdf8dd5dea40283e6fec88bacdb30c20c5b39955da1faa8e3ade

Running the ophan-geoip-db-refresher code locally, the heap required is:

560 MB (-Xmx560m) - without the workaround, request fully in memory
32 MB - with the workaround on, and so streaming enabled

Screenshots

These videos show memory usage before and after streaming being enabled:

Demo.Before.NoStreaming.mov

...at 31s in the above video, you can see memory usage suddenly massively increase - because the entire outgoing request has been buffered into memory. In the video below, with streaming enabled, memory consumption stays low for the entire duration of the run:

Demo.After.StreamingForced.mov

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

Checklist

I have read the CONTRIBUTING document
Local run of mvn install - fails on main, fails in same way on this branch
My code follows the code style of this project
My change requires a change to the Javadoc documentation
I have updated the Javadoc documentation accordingly
I have read the README document
I have added tests to cover my changes
All new and existing tests passed
A short description of the change has been added to the CHANGELOG

License

I confirm that this pull request can be released under the Apache 2 license

zoewangg · 2021-11-24T18:07:12Z

Hi @rtyley, apologies for the delayed response and thank you for your PR! We will take a look shortly

rtyley · 2021-11-26T11:00:05Z

Hi @rtyley, apologies for the delayed response and thank you for your PR! We will take a look shortly

Thanks @zoewangg I appreciate it! I can rebase on the master branch to repair the CHANGELOG.md merge conflict if that helps?

...-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java

CHANGELOG.md

Prior to this change, the `UrlConnectionHttpClient` implementation of `SdkHttpClient` would unintentionally cause **the entire `RequestBody`** to be loaded into memory before being sent, even if the body had been made using `software.amazon.awssdk.core.sync.RequestBody.fromInputStream()` and thus was a stream of known content-length. Even worse, as the buffering `ByteArrayInputStream` received the request, it would have to repeatedly grow capacity by allocating new arrays and copying the data over, so peak memory usage would be 1.5x the size of the request data. For instance, streaming 1 GB of data to S3 with `putObject` would lead to a peak memory allocation of 1.5 GB. This large allocation of memory can be avoided by instead *streaming* the request data as it is sent. `java.net.HttpURLConnection` (the core Java class used by `UrlConnectionHttpClient` to implement `SdkHttpClient`) supports streaming the request, but will only do it if `setFixedLengthStreamingMode()` or `setChunkedStreamingMode()` has been called - this is true even if the `Content-Length` header has already been set. You can see this in the implementation of `sun.net.www.protocol.http.HttpURLConnection` where `streaming()` must be true to avoid **the whole request** being read into a `PosterOutputStream` (a simple extension of `ByteArrayInputStream`) before being sent: https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/sun/net/www/protocol/http/HttpURLConnection.java#L1373-L1394 If `Content-Length` is known, we're actually free to call `setFixedLengthStreamingMode()` on the connection to activate streaming, the only question is where & when to do it! Unfortunately `software.amazon.awssdk.http.urlconnection.UrlConnectionFactory`, which provides a way for end-users of the AWS SDK to customise the creation & configuration of the `HttpURLConnection`, _doesn't have access to the request or it's `Content-Length`, so can't call `setFixedLengthStreamingMode()`. In any case, this behaviour probably _always_ makes sense, so it doesn't make sense to leave it to users to customise it. That leaves the best place to do this being the `createAndConfigureConnection()` method, where the newly created `HttpURLConnection` is available and the `Content-Length` can be read if it's present. Consequently, I've updated `createAndConfigureConnection()` to call `setFixedLengthStreamingMode` if the `Content-Length` is not null, which is similar behaviour to the `ApacheHttpClient` SDK client: aws#2848 (comment) https://github.com/aws/aws-sdk-java-v2/blob/db0b45a0cd29603a1ea95b5e5f3d243755720309/http-clients/apache-client/src/main/java/software/amazon/awssdk/http/apache/internal/impl/ApacheHttpRequestFactory.java#L139-L160 At present, the only way to workaround this issue without a change to the AWS SDK is to create a custom `SdkHttpClient` that wraps `UrlConnectionHttpClient` and overrides `prepareRequest()`. As the `connection` field is not visible in the `ExecutableHttpRequest` object (it's defined on the private `software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient.RequestCallable` class), it's necessary to use reflection to extract the `connection` field before operating on it. An example of this in Scala code can be seen here: guardian/ophan-geoip-db-refresher@e2fc370#diff-5780ade773d1bdf8dd5dea40283e6fec88bacdb30c20c5b39955da1faa8e3ade Running the `ophan-geoip-db-refresher` code locally, the heap required is: * 560 MB (`-Xmx560m`) - _without_ the workaround, request fully in memory * 32 MB - with the workaround on, and so streaming enabled

...-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java

…est-output

zoewangg

Thank you for your contribution!

…est-output

sonarqubecloud · 2021-12-03T22:20:12Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

100.0% Coverage
0.0% Duplication

zoewangg · 2021-12-03T22:49:41Z

@all-contributors please add @rtyley for code.

allcontributors · 2021-12-03T22:49:50Z

@zoewangg

I've put up a pull request to add @rtyley! 🎉

AWS SDK v2.17.97 includes aws/aws-sdk-java-v2#2848, which lowers memory consumption by streaming request data, rather than loading the entire request into memory before sending it, which can make a large difference to consumption of memory when uploading large files to S3. In the case of the `geoip-db-refresher`, it's actually economical to run with _more_ RAM than we need (we're billed for fewer GB-seconds when specifying a `MemorySize` of 1536MB than 1024MB, as AWS Lambdas are given more CPU to match increased memory) so this is no longer a serious issue for the 134MB file uploaded here! See also: * #7 * #8 * 1de50ef * https://aws.amazon.com/lambda/pricing/ * https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console

…40c9deb8b Pull request: release <- staging/d4522bf0-423c-4a6a-b2e4-bcc40c9deb8b

rtyley force-pushed the lower-memory-consumption-by-streaming-request-output branch 3 times, most recently from eaa0aac to 13a9122 Compare November 20, 2021 12:51

rtyley marked this pull request as ready for review November 20, 2021 22:07

rtyley requested a review from a team as a code owner November 20, 2021 22:07

lestephane mentioned this pull request Nov 26, 2021

Cold-start improvements with a lightweight client #2527

Closed

zoewangg reviewed Nov 29, 2021

View reviewed changes

...-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java Outdated Show resolved Hide resolved

CHANGELOG.md Outdated Show resolved Hide resolved

rtyley force-pushed the lower-memory-consumption-by-streaming-request-output branch from 13a9122 to e1feaa6 Compare November 29, 2021 23:36

zoewangg reviewed Nov 30, 2021

View reviewed changes

...-client/src/main/java/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.java Show resolved Hide resolved

Merge branch 'master' into lower-memory-consumption-by-streaming-requ…

7e952e4

…est-output

zoewangg approved these changes Dec 1, 2021

View reviewed changes

zoewangg added 4 commits December 1, 2021 19:01

Merge branch 'master' into lower-memory-consumption-by-streaming-requ…

65361b8

…est-output

Merge branch 'master' into lower-memory-consumption-by-streaming-requ…

33a7be7

…est-output

Merge branch 'master' into lower-memory-consumption-by-streaming-requ…

cadaa7a

…est-output

Merge branch 'master' into lower-memory-consumption-by-streaming-requ…

470adfa

…est-output

zoewangg merged commit d08f919 into aws:master Dec 3, 2021

allcontributors bot mentioned this pull request Dec 3, 2021

docs: add rtyley as a contributor for code #2888

Merged

lestephane mentioned this pull request Dec 6, 2021

[UrlConnectionHttpClient] propagate IOExceptions during request body sending (to benefit from the aws sdk retry policy) #2891

Closed

13 tasks

rtyley mentioned this pull request Dec 8, 2021

Improve memory usage with streaming of AWS requests guardian/ophan-geoip-db-refresher#10

Merged

aws-sdk-java-automation added a commit that referenced this pull request Dec 29, 2023

Merge pull request #2848 from aws/staging/d4522bf0-423c-4a6a-b2e4-bcc…

dd92ea0

…40c9deb8b Pull request: release <- staging/d4522bf0-423c-4a6a-b2e4-bcc40c9deb8b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Lower memory consumption by streaming request data #2848

Lower memory consumption by streaming request data #2848

Uh oh!

rtyley commented Nov 16, 2021 •

edited

Loading

Uh oh!

zoewangg commented Nov 24, 2021

Uh oh!

rtyley commented Nov 26, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zoewangg left a comment

Uh oh!

sonarqubecloud bot commented Dec 3, 2021

Uh oh!

zoewangg commented Dec 3, 2021

Uh oh!

allcontributors bot commented Dec 3, 2021

Uh oh!

Uh oh!

Lower memory consumption by streaming request data #2848

Lower memory consumption by streaming request data #2848

Uh oh!

Conversation

rtyley commented Nov 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Description

Testing

Screenshots

Types of changes

Checklist

License

Uh oh!

zoewangg commented Nov 24, 2021

Uh oh!

rtyley commented Nov 26, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zoewangg left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Dec 3, 2021

Uh oh!

zoewangg commented Dec 3, 2021

Uh oh!

allcontributors bot commented Dec 3, 2021

Uh oh!

Uh oh!

rtyley commented Nov 16, 2021 •

edited

Loading