Skip to content

TLS setup as part of channel acquire #3056

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

dagnir
Copy link
Contributor

@dagnir dagnir commented Feb 24, 2022

Motivation and Context

Fix confusing error behavior where acquisition completes, but then channel appears to be closed as request is made because TLS setup did not complete successfully.

Modifications

This commit correctly factors in the TLS negotiation into the channel pooling
acquire logic; this ensures that when the Netty client successfully acquires a
connection from the channel pool, TLS negotiation is guaranteed to have
completed successfully.

Testing

  • New unit tests
  • Running integ tests

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have added a changelog entry. Adding a new entry must be accomplished by running the scripts/new-change script and following the instructions. Commit the new file created by the script in .changes/next-release with your changes.
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

@dagnir dagnir force-pushed the ssl-handshake-lease branch 3 times, most recently from 478f176 to c2c0096 Compare February 24, 2022 21:04
This commit correctly factors in the TLS negotiation into the channel pooling
acquire logic; this ensures that when the Netty client successfully acquires a
connection from the channel pool, TLS negotiation is guaranteed to have
completed successfully.
@dagnir dagnir force-pushed the ssl-handshake-lease branch from c2c0096 to 103e9a5 Compare February 24, 2022 22:01
@dagnir dagnir marked this pull request as ready for review February 24, 2022 22:04
@dagnir dagnir requested a review from a team as a code owner February 24, 2022 22:04
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 8 Code Smells

96.2% 96.2% Coverage
0.0% 0.0% Duplication


/**
* This {@code ChannelPool} ensures that the channel has completed the TLS negotiation before giving it back to the caller. Note
* that it's possible for a channel to have multiple {@link SslHandler} instances its pipeline, for example if it's using a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/typo: "in its pipeline"

* that it's possible for a channel to have multiple {@link SslHandler} instances its pipeline, for example if it's using a
* proxy over HTTPS. This pool explicitly looks for the handler of the given name.
* <p>
* If TLS setup fails, the channel will be closed and released back to the wrapped pool and the future will be failed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention anything about how this will interact w/ retries?

* If TLS setup fails, the channel will be closed and released back to the wrapped pool and the future will be failed.
*/
@SdkInternalApi
public class TlsHandshakeEnsuringChannelPool implements ChannelPool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "ensuring" vs "awaiting"? (Not deeply opinionated since it's internal API.)

Comment on lines +84 to +87
if (!f.isSuccess()) {
promise.tryFailure(f.cause());
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we more explicitly propagate cancellation here? E.g., via NettyUtils#consumeOrPropagate.

}

private Optional<Future<Channel>> sslHandshakeFuture(Channel ch) {
ChannelHandler handlerByName = ch.pipeline().get(sslHandlerName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than looking by name, would it be simpler to just look for, e.g., the first handler of type SslHandler? Would that be guaranteed to be the outer-most layer? Or can we not make any assumptions about the order?

private Optional<Future<Channel>> sslHandshakeFuture(Channel ch) {
ChannelHandler handlerByName = ch.pipeline().get(sslHandlerName);
if (!(handlerByName instanceof SslHandler)) {
return Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps could simplify this slightly by also just returning an already-successful future.

@@ -205,4 +226,16 @@ private static SdkHttpFullRequest testSdkRequest() {
.build();
}

private static int jdkVersion() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared test utility, or too premature?

Comment on lines +86 to +88
return Arrays.asList(new TestCase(CloseTime.DURING_INIT, "Failed TLS connection setup"),
new TestCase(CloseTime.BEFORE_SSL_HANDSHAKE, "Failed TLS connection setup"),
new TestCase(CloseTime.DURING_SSL_HANDSHAKE, "Failed TLS connection setup"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is a big improvement on wording/clarity.

LOGGER.debug(channel, () -> "Failed TLS connection setup. Channel will be closed.", handshake.cause());
channel.close();
delegate.release(channel);
IOException error = new IOException("Failed TLS connection setup: " + channel, handshake.cause());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider more prescriptive wording if the error was, e.g., a timeout?

@zoewangg
Copy link
Contributor

Can we run benchmarks to see how much performance impact this change would cause?

@dagnir
Copy link
Contributor Author

dagnir commented Mar 25, 2022

Can we run benchmarks to see how much performance impact this change would cause?

Did a test with the following code

Test code
    public static void main(String[] args) throws InterruptedException {
        CloudWatchMetricPublisher metricPublisher = CloudWatchMetricPublisher.builder()
                .namespace("DongieTlsTesting")
                .cloudWatchClient(CloudWatchAsyncClient.builder()
                        .region(Region.US_WEST_2)
                        .credentialsProvider(EnvironmentVariableCredentialsProvider.create())
                        .build())
                .build();

        int nThreads = Runtime.getRuntime().availableProcessors() * 2;

        ExecutorService exec = Executors.newFixedThreadPool(nThreads);

        System.out.printf("Spinning up %d threads%n", nThreads);
        for (int i = 0; i < nThreads; ++i) {
            exec.submit(() -> {
                while (true) {
                    S3AsyncClient s3 = null;
                    try {
                        // Always create a new client to defeat pooling
                         s3 = S3AsyncClient.builder()
                                .region(Region.US_WEST_2)
                                .credentialsProvider(EnvironmentVariableCredentialsProvider.create())
                                .overrideConfiguration(o -> o.addMetricPublisher(metricPublisher))
                                .build();

                        s3.getObject(r -> r.bucket(BUCKET).key(KEY), AsyncResponseTransformer.toBytes()).join();
                    } catch (Exception e) {
                        System.out.printf("ERROR [%s]: %s%n", Thread.currentThread().getName(), e.getMessage());
                    } finally {
                        if (s3 != null) {
                            s3.close();
                        }
                    }
                }
            });
        }

        Thread.sleep(Duration.ofMinutes(15).toMillis());

        System.out.println("Done.");
        exec.shutdown();

Predictably, the ConcurrencyAcquireDuration is higher, 13-14ms vs 7ms, since the complete handshake is being factored in. Interestingly though, both the ServiceCallDuration and ApiCallDuration are lower overall, probably since previously they were including at least part of the TLS handshake:

Screen Shot 2022-03-25 at 3 51 18 PM

I'm also pulling this change in internally to test in the canaries.

@dagnir
Copy link
Contributor Author

dagnir commented Sep 29, 2022

Closing this for now because of performance regression concerns.

@dagnir dagnir closed this Sep 29, 2022
aws-sdk-java-automation added a commit that referenced this pull request Jun 4, 2024
…f4b406b40

Pull request: release <- staging/78627166-ab50-482c-9c30-34bf4b406b40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants