Skip to content

Lower memory consumption by streaming request data #2848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

rtyley
Copy link
Contributor

@rtyley rtyley commented Nov 16, 2021

Motivation and Context

Prior to this change, the UrlConnectionHttpClient implementation of SdkHttpClient would unintentionally cause the entire RequestBody to be loaded into memory before being sent, even if the body had been made using software.amazon.awssdk.core.sync.RequestBody.fromInputStream() and thus was a stream of known content-length. Even worse, as the buffering ByteArrayInputStream received the request, it would have to repeatedly grow capacity by allocating new arrays and copying the data over, so peak memory usage would be 1.5x the size of the request data. For instance, streaming 1 GB of data to S3 with putObject would lead to a peak memory allocation of 1.5 GB. This large allocation of memory can be avoided by instead streaming the request data as it is sent.

java.net.HttpURLConnection (the core Java class used by UrlConnectionHttpClient to implement SdkHttpClient) supports streaming the request, but will only do it if setFixedLengthStreamingMode() or setChunkedStreamingMode() has been called - this is true even if the Content-Length header has already been set. You can see this in the implementation of sun.net.www.protocol.http.HttpURLConnection where streaming() must be true to avoid the whole request being read into a PosterOutputStream (a simple extension of ByteArrayInputStream) before being sent.

Description

If Content-Length is known, we're actually free to call setFixedLengthStreamingMode() on the connection to activate streaming, the only question is where & when to do it! Unfortunately software.amazon.awssdk.http.urlconnection.UrlConnectionFactory, which provides a way for end-users of the AWS SDK to customise the creation & configuration of the HttpURLConnection, doesn't have access to the request or it's Content-Length, so can't call setFixedLengthStreamingMode(). In any case, this behaviour probably always makes sense, so it doesn't make sense to leave it to users to customise it. That leaves the best place to do this being the createAndConfigureConnection() method, where the newly created HttpURLConnection is available and the Content-Length can be read if it's present.

Consequently, I've updated createAndConfigureConnection() to call setFixedLengthStreamingMode if the Content-Length is available and above a streaming threshold of 16 MB. It's possible that buffering an entire request in memory is beneficial in some way for short requests, I don't know, but I believe for large request objects avoiding the large memory allocation is definitely better.

Testing

At present, the only way to workaround this issue without a change to the AWS SDK is to create a custom SdkHttpClient that wraps UrlConnectionHttpClient and overrides prepareRequest(). As the connection field is not visible in the ExecutableHttpRequest object (it's defined on the private software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient.RequestCallable class), it's necessary to use reflection to extract the connection field before operating on it. An example of this in Scala code can be seen here:

guardian/ophan-geoip-db-refresher@e2fc370#diff-5780ade773d1bdf8dd5dea40283e6fec88bacdb30c20c5b39955da1faa8e3ade

Running the ophan-geoip-db-refresher code locally, the heap required is:

  • 560 MB (-Xmx560m) - without the workaround, request fully in memory
  • 32 MB - with the workaround on, and so streaming enabled

Screenshots

These videos show memory usage before and after streaming being enabled:

Demo.Before.NoStreaming.mov

...at 31s in the above video, you can see memory usage suddenly massively increase - because the entire outgoing request has been buffered into memory. In the video below, with streaming enabled, memory consumption stays low for the entire duration of the run:

Demo.After.StreamingForced.mov

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install - fails on main, fails in same way on this branch
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed
  • A short description of the change has been added to the CHANGELOG

License

  • I confirm that this pull request can be released under the Apache 2 license

@rtyley rtyley force-pushed the lower-memory-consumption-by-streaming-request-output branch 3 times, most recently from eaa0aac to 13a9122 Compare November 20, 2021 12:51
@rtyley rtyley marked this pull request as ready for review November 20, 2021 22:07
@rtyley rtyley requested a review from a team as a code owner November 20, 2021 22:07
@zoewangg
Copy link
Contributor

Hi @rtyley, apologies for the delayed response and thank you for your PR! We will take a look shortly

@rtyley
Copy link
Contributor Author

rtyley commented Nov 26, 2021

Hi @rtyley, apologies for the delayed response and thank you for your PR! We will take a look shortly

Thanks @zoewangg I appreciate it! I can rebase on the master branch to repair the CHANGELOG.md merge conflict if that helps?

Prior to this change, the `UrlConnectionHttpClient` implementation of
`SdkHttpClient` would unintentionally cause **the entire `RequestBody`**
to be loaded into memory before being sent, even if the body had been
made using `software.amazon.awssdk.core.sync.RequestBody.fromInputStream()`
and thus was a stream of known content-length. Even worse, as the buffering
`ByteArrayInputStream` received the request, it would have to repeatedly
grow capacity by allocating new arrays and copying the data over, so peak
memory usage would be 1.5x the size of the request data. For instance,
streaming 1 GB of data to S3 with `putObject` would lead to a peak memory
allocation of 1.5 GB. This large allocation of memory can be avoided by
instead *streaming* the request data as it is sent.

`java.net.HttpURLConnection` (the core Java class used by
`UrlConnectionHttpClient` to implement `SdkHttpClient`) supports streaming
the request, but will only do it if `setFixedLengthStreamingMode()` or
`setChunkedStreamingMode()` has been called - this is true even if the
`Content-Length` header has already been set. You can see this in the
implementation of `sun.net.www.protocol.http.HttpURLConnection` where
`streaming()` must be true to avoid **the whole request** being read into
a `PosterOutputStream` (a simple extension of `ByteArrayInputStream`)
before being sent:

https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/sun/net/www/protocol/http/HttpURLConnection.java#L1373-L1394

If `Content-Length` is known, we're actually free to call
`setFixedLengthStreamingMode()` on the connection to activate streaming,
the only question is where & when to do it! Unfortunately
`software.amazon.awssdk.http.urlconnection.UrlConnectionFactory`, which
provides a way for end-users of the AWS SDK to customise the creation &
configuration of the `HttpURLConnection`, _doesn't have access to the
request or it's `Content-Length`, so can't call
`setFixedLengthStreamingMode()`. In any case, this behaviour probably
_always_ makes sense, so it doesn't make sense to leave it to users to
customise it. That leaves the best place to do this being the
`createAndConfigureConnection()` method, where the newly created
`HttpURLConnection` is available and the `Content-Length` can be read if
it's present.

Consequently, I've updated `createAndConfigureConnection()` to call
`setFixedLengthStreamingMode` if the `Content-Length` is not null, which
is similar behaviour to the `ApacheHttpClient` SDK client:

aws#2848 (comment)
https://github.com/aws/aws-sdk-java-v2/blob/db0b45a0cd29603a1ea95b5e5f3d243755720309/http-clients/apache-client/src/main/java/software/amazon/awssdk/http/apache/internal/impl/ApacheHttpRequestFactory.java#L139-L160

At present, the only way to workaround this issue without a change to the
AWS SDK is to create a custom `SdkHttpClient` that wraps
`UrlConnectionHttpClient` and overrides `prepareRequest()`. As the
`connection` field is not visible in the `ExecutableHttpRequest` object
(it's defined on the private
`software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient.RequestCallable`
class), it's necessary to use reflection to extract the `connection` field
before operating on it. An example of this in Scala code can be seen here:

guardian/ophan-geoip-db-refresher@e2fc370#diff-5780ade773d1bdf8dd5dea40283e6fec88bacdb30c20c5b39955da1faa8e3ade

Running the `ophan-geoip-db-refresher` code locally, the heap required is:

* 560 MB (`-Xmx560m`) - _without_ the workaround, request fully in memory
*  32 MB - with the workaround on, and so streaming enabled
@rtyley rtyley force-pushed the lower-memory-consumption-by-streaming-request-output branch from 13a9122 to e1feaa6 Compare November 29, 2021 23:36
Copy link
Contributor

@zoewangg zoewangg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 3, 2021

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

100.0% 100.0% Coverage
0.0% 0.0% Duplication

@zoewangg zoewangg merged commit d08f919 into aws:master Dec 3, 2021
@zoewangg
Copy link
Contributor

zoewangg commented Dec 3, 2021

@all-contributors please add @rtyley for code.

@allcontributors
Copy link
Contributor

@zoewangg

I've put up a pull request to add @rtyley! 🎉

rtyley added a commit to guardian/ophan-geoip-db-refresher that referenced this pull request Dec 8, 2021
AWS SDK v2.17.97 includes aws/aws-sdk-java-v2#2848,
which lowers memory consumption by streaming request data, rather than
loading the entire request into memory before sending it, which can make
a large difference to consumption of memory when uploading large files
to S3.

In the case of the `geoip-db-refresher`, it's actually economical to
run with _more_ RAM than we need (we're billed for fewer GB-seconds
when specifying a `MemorySize` of 1536MB than 1024MB, as AWS Lambdas
are given more CPU to match increased memory) so this is no longer a
serious issue for the 134MB file uploaded here!

See also:

* #7
* #8
* 1de50ef
* https://aws.amazon.com/lambda/pricing/
* https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
aws-sdk-java-automation added a commit that referenced this pull request Dec 29, 2023
…40c9deb8b

Pull request: release <- staging/d4522bf0-423c-4a6a-b2e4-bcc40c9deb8b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants