Skip to content

feat: add option to enable ApiTracer #3095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
49e6ac3
feat: add option to enable ApiTracer
olavloite May 6, 2024
62a5492
feat: add gfe-latency as attribute to span
olavloite May 8, 2024
990850b
fix: get span at request start
olavloite May 8, 2024
6958173
test: add tests for ApiTracer
olavloite May 10, 2024
319e889
test: fix tests when using mux sessions
olavloite May 14, 2024
ce205a5
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] May 14, 2024
90752a1
chore: cleanup
olavloite May 14, 2024
ab5e2dd
Merge branch 'add-option-for-api-tracer' of github.com:googleapis/jav…
olavloite May 14, 2024
644e3d3
build: declare test dependency that is being used
olavloite May 14, 2024
e03f21b
Merge branch 'main' into add-option-for-api-tracer
olavloite May 15, 2024
3392074
chore: remove TODOs
olavloite May 15, 2024
1510a12
chore: add env var for enabling api tracing
olavloite May 15, 2024
bfa1aac
chore: add clirr ignored diff
olavloite May 15, 2024
aa2d4da
chore: address review comments
olavloite May 21, 2024
fc7c7f0
Merge branch 'main' into add-option-for-api-tracer
olavloite May 21, 2024
2318515
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] May 21, 2024
dc9d284
Merge branch 'main' into add-option-for-api-tracer
olavloite May 29, 2024
c40f434
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] May 29, 2024
b6e9018
Merge branch 'main' into add-option-for-api-tracer
olavloite Jun 5, 2024
2081efd
test: add OpenCensus ApiTracer tests
olavloite Jun 6, 2024
2e2798c
🦉 Updates from OwlBot post-processor
gcf-owl-bot[bot] Jun 6, 2024
3b51a06
fix: clear spans after test
olavloite Jun 6, 2024
2ed8d2a
Merge branch 'add-option-for-api-tracer' of github.com:googleapis/jav…
olavloite Jun 6, 2024
1107a04
fix: ignore test traces coming from a previous test
olavloite Jun 6, 2024
715f209
test: skip OpenCensus ApiTracer test as OpenCensus is way too intrusive
olavloite Jun 6, 2024
52dfd06
test: increase timeout to reduce flakiness
olavloite Jun 6, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions .readme-partials.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,22 @@ custom_content: |
This option can also be enabled by setting the environment variable
`SPANNER_ENABLE_EXTENDED_TRACING=true`.

#### OpenTelemetry API Tracing
You can enable tracing of each API call that the Spanner client executes with the `enableApiTracing`
option. These traces also include any retry attempts for an API call:

```
SpannerOptions options = SpannerOptions.newBuilder()
.setOpenTelemetry(openTelemetry)
.setEnableApiTracing(true)
.build();
```

This option can also be enabled by setting the environment variable
`SPANNER_ENABLE_API_TRACING=true`.

> Note: The attribute keys that are used for additional information about retry attempts and the number of requests might change in a future release.

### Instrument with OpenCensus

> Note: OpenCensus project is deprecated. See [Sunsetting OpenCensus](https://opentelemetry.io/blog/2023/sunsetting-opencensus/).
Expand Down
22 changes: 19 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,13 +57,13 @@ implementation 'com.google.cloud:google-cloud-spanner'
If you are using Gradle without BOM, add this to your dependencies:

```Groovy
implementation 'com.google.cloud:google-cloud-spanner:6.68.0'
implementation 'com.google.cloud:google-cloud-spanner:6.68.1'
```

If you are using SBT, add this to your dependencies:

```Scala
libraryDependencies += "com.google.cloud" % "google-cloud-spanner" % "6.68.0"
libraryDependencies += "com.google.cloud" % "google-cloud-spanner" % "6.68.1"
```
<!-- {x-version-update-end} -->

Expand Down Expand Up @@ -272,6 +272,22 @@ SpannerOptions options = SpannerOptions.newBuilder()
This option can also be enabled by setting the environment variable
`SPANNER_ENABLE_EXTENDED_TRACING=true`.

#### OpenTelemetry API Tracing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we consider this a beta feature or a stable feature? If we consider this a beta feature, I would add @BetaApi to setEnableApiTracing and mention it here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any specific reason for why we would not consider this a stable feature, other than the comments already mentioned regarding the possiblity of changing attribute keys. In many ways, this is just an extension of something that was already supported (ApiTracer with OpenCensus) using the newer tracing framework, that is also considered a stable feature.

You can enable tracing of each API call that the Spanner client executes with the `enableApiTracing`
option. These traces also include any retry attempts for an API call:

```
SpannerOptions options = SpannerOptions.newBuilder()
.setOpenTelemetry(openTelemetry)
.setEnableApiTracing(true)
.build();
```

This option can also be enabled by setting the environment variable
`SPANNER_ENABLE_API_TRACING=true`.

> Note: The attribute keys that are used for additional information about retry attempts and the number of requests might change in a future release.

### Instrument with OpenCensus

> Note: OpenCensus project is deprecated. See [Sunsetting OpenCensus](https://opentelemetry.io/blog/2023/sunsetting-opencensus/).
Expand Down Expand Up @@ -671,7 +687,7 @@ Java is a registered trademark of Oracle and/or its affiliates.
[kokoro-badge-link-5]: http://storage.googleapis.com/cloud-devrel-public/java/badges/java-spanner/java11.html
[stability-image]: https://img.shields.io/badge/stability-stable-green
[maven-version-image]: https://img.shields.io/maven-central/v/com.google.cloud/google-cloud-spanner.svg
[maven-version-link]: https://central.sonatype.com/artifact/com.google.cloud/google-cloud-spanner/6.68.0
[maven-version-link]: https://central.sonatype.com/artifact/com.google.cloud/google-cloud-spanner/6.68.1
[authentication]: https://github.com/googleapis/google-cloud-java#authentication
[auth-scopes]: https://developers.google.com/identity/protocols/oauth2/scopes
[predefined-iam-roles]: https://cloud.google.com/iam/docs/understanding-roles#predefined_roles
Expand Down
7 changes: 7 additions & 0 deletions google-cloud-spanner/clirr-ignored-differences.xml
Original file line number Diff line number Diff line change
Expand Up @@ -688,6 +688,13 @@
<method>boolean isEnableExtendedTracing()</method>
</difference>

<!-- Added API tracing option -->
<difference>
<differenceType>7012</differenceType>
<className>com/google/cloud/spanner/SpannerOptions$SpannerEnvironment</className>
<method>boolean isEnableApiTracing()</method>
</difference>

<!-- Added ExcludeTxnFromChangeStreams -->
<difference>
<differenceType>7012</differenceType>
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,274 @@
/*
* Copyright 2024 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* https://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.google.cloud.spanner;

import com.google.api.gax.tracing.ApiTracer;
import com.google.api.gax.tracing.ApiTracerFactory.OperationType;
import com.google.common.base.Preconditions;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.common.AttributesBuilder;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import java.util.concurrent.atomic.AtomicLong;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import org.threeten.bp.Duration;

/**
* {@link com.google.api.gax.tracing.ApiTracer} for use with OpenTelemetry. Based on {@link
* com.google.api.gax.tracing.OpencensusTracer}.
*/
class OpenTelemetryApiTracer implements ApiTracer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to separate the logic to calculate the metrics and the logic to record the metrics while implementing metrics in Gax. Hence MetricsTracer and MetricsRecorder are generic classes without any knowledge of OpenTelemetry, only OpenTelemetryMetricsRecorder is dependent on OpenTelemetry apis.
Spanner does not have to follow the same design principle and it may not be feasible for traces either, but something to consider.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Spanner client supports tracing with both OpenCensus and OpenTelemetry. The customer makes a choice at startup, and then the client library configures itself accordingly. We want to support the same for the ApiTracer, so based on that, it would make sense to create a technology-agnostic ApiTracer. However, there is already an OpencensusTracer in gax that can be used with Spanner today (and that we need to continue to support). That means that it is easier to create an OpenTelemetry sibling of this tracer, and choose one or the other based on the user configuration when the client is created, instead of adding another layer of abstraction between the client library and the ApiTracer.

/** The attribute keys that are used by this tracer might change in a future release. */
private final AttributeKey<Long> ATTEMPT_COUNT_KEY = AttributeKey.longKey("attempt.count");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to document that these attribute keys are subject to change? If we are going to implement them in Gax, they may have different naming conventions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to the class, but I don't think many people will actually read it. I'm also not sure it will be a very big problem for customers if they change in the future. I also added a comment regarding this to the README file.

The keys are by the way taken from the existing OpencensusTracer implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added a comment regarding this to the README file.

Thanks, this sounds good to me!


private final AttributeKey<Long> TOTAL_REQUEST_COUNT_KEY =
AttributeKey.longKey("total_request_count");
private final AttributeKey<Long> TOTAL_RESPONSE_COUNT_KEY =
AttributeKey.longKey("total_response_count");
private final AttributeKey<String> EXCEPTION_MESSAGE_KEY =
AttributeKey.stringKey("exception.message");
private final AttributeKey<Long> ATTEMPT_NUMBER_KEY = AttributeKey.longKey("attempt.number");
private final AttributeKey<Long> ATTEMPT_REQUEST_COUNT_KEY =
AttributeKey.longKey("attempt.request_count");
private final AttributeKey<Long> ATTEMPT_RESPONSE_COUNT_KEY =
AttributeKey.longKey("attempt.response_count");
private final AttributeKey<String> CONNECTION_ID_KEY = AttributeKey.stringKey("connection");
private final AttributeKey<Long> RETRY_DELAY_KEY = AttributeKey.longKey("delay_ms");
private static final AttributeKey<Long> BATCH_SIZE_KEY = AttributeKey.longKey("batch.size");
private static final AttributeKey<Long> BATCH_COUNT_KEY = AttributeKey.longKey("batch.count");

private final Span span;
private final OperationType operationType;

private volatile String lastConnectionId;
private volatile long currentAttemptId;
private final AtomicLong attemptSentMessages = new AtomicLong(0);
private long attemptReceivedMessages = 0;
private final AtomicLong totalSentMessages = new AtomicLong(0);
private long totalReceivedMessages = 0;

OpenTelemetryApiTracer(@Nonnull Span span, @Nonnull OperationType operationType) {
this.span = Preconditions.checkNotNull(span);
this.operationType = Preconditions.checkNotNull(operationType);
}

Span getSpan() {
return this.span;
}

@Override
public Scope inScope() {
final io.opentelemetry.context.Scope openTelemetryScope = span.makeCurrent();
return openTelemetryScope::close;
}

@Override
public void operationSucceeded() {
span.setAllAttributes(baseOperationAttributes());
span.setStatus(StatusCode.OK);
span.end();
}

@Override
public void operationCancelled() {
span.setAllAttributes(baseOperationAttributes());
span.setStatus(StatusCode.ERROR, "Cancelled by caller");
span.end();
}

@Override
public void operationFailed(Throwable error) {
span.setAllAttributes(baseOperationAttributes());
span.setStatus(StatusCode.ERROR, error.getMessage());
span.end();
}

@Override
public void lroStartFailed(Throwable error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these methods are not currently used by anyone. For example, neither Bigtable's MetricsTracer or Gax's MetricsTracer overrides lroStartFailed.
In general any method starts with attempt or operation should be well tested, others like lroStartFailed, lroStartSucceeded, connectionSelected, batchRequestSent should still mostly work, but I would use them with caution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called from here: https://github.com/googleapis/sdk-platform-java/blob/5799827a864be627bac03969cc178efc9d6170aa/gax-java/gax/src/main/java/com/google/api/gax/tracing/TracedOperationInitialCallable.java#L85

This test case covers the kind of failure that would trigger this method to be called:

I agree that it is relatively unlikely to happen in production, as it requires an RPC that would normally return an LRO to fail in a way that does not return an LRO, but instead just returns an error. But it is needed to implement the interface, and in theory it could happen (I could for example imagine an RPC like this returning a RESOURCE_EXHAUSTED error directly, instead of an LRO, if there is a limit on the number of operations running in parallel).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is needed to implement the interface

All methods in ApiTracer are default no-op methods now, you don't have to override them.

What I'm trying to get to is, maybe we can start with traces/events that we know will be most useful, we don't have to implement each and every scenario. Especially for LROs and batches related methods, they may not be well tested before and I'm not confident enough that they are working as intended.
For example, maybe lroStartFailed should be called in places more than just in TracedOperationInitialCallable, and we did find a few small bugs related to the whole ApiTracer framework while implementing OpenTelemetry metrics in Gax.

That being said, if you do think they provide value to Spanner, we can start with it and fix bugs along the way when we find them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep this, because:

  1. Tracing long-running operations adds value for Spanner, specifically because DDL statements are executed as long-running operations. Giving customers (and us) insights into this can help in debugging issues, especially if the customer uses statements like create table if not exists..., which are relatively quick to execute, but still add significant latency if they are executed as part of for example the standard application startup.
  2. This method covers one of the potential end states of such an LRO. Failing to implement that method, would leave the trace open.
  3. If there are specific cases that do not call this method, then that will also leave the trace open, but getting the trace closed in most cases is still better than not implementing this method. The 'standard' way that this method should be called is covered by a test case, so I feel confident that the most common code flow that should be covered by this method works as intended.

span.addEvent(
"Operation failed to start", Attributes.of(EXCEPTION_MESSAGE_KEY, error.getMessage()));
span.setStatus(StatusCode.ERROR, error.getMessage());
span.end();
}

@Override
public void lroStartSucceeded() {
span.addEvent("Operation started");
}

@Override
public void connectionSelected(String id) {
lastConnectionId = id;
}

@Override
public void attemptStarted(int attemptNumber) {
attemptStarted(null, attemptNumber);
}

@Override
public void attemptStarted(@Nullable Object request, int attemptNumber) {
currentAttemptId = attemptNumber;
attemptSentMessages.set(0);
attemptReceivedMessages = 0;

// Attempts start counting a zero, so more than zero indicates a retry.
if (attemptNumber > 0 && operationType != OperationType.LongRunning) {
// Add an event if the RPC retries, as this is otherwise transparent to the user. Retries
// would then show up as higher latency without any logical explanation.
span.addEvent("Starting RPC retry " + attemptNumber);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered representing retries as child spans? It would definitely make things complicated but maybe easier for customers to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not considered it before implementing this. But also after having thought about it for a while, I don't think we should do that. The reason is that:

  1. We would need to make a choice between 'adding a span for each attempt (including the initial attempt)' and 'adding an extra span only for retry attempts'.
  2. The first would mean that we would add an extra span that does not really add any information in the vast majority of the cases when the RPC is not being retried. That would only add additional costs for customers.
  3. The second would mean that we would get a varying 'span depth' depending on whether an RPC is being retried or not. I'm not sure that makes it easier to understand for a customer, and I'm worried that it makes it harder for a customer to search for slow requests. With this, a slow request could show up both as a single slow attempt, and as a slow operation with multiple child spans, where the attributes would be spread across the parent and child spans.

} else if (operationType == OperationType.LongRunning) {
span.addEvent("Starting poll attempt " + attemptNumber);
}
}

@Override
public void attemptSucceeded() {
Attributes attributes = baseAttemptAttributes();

// Same infrastructure is used for both polling and retries, so need to disambiguate it here.
if (operationType == OperationType.LongRunning) {
span.addEvent("Polling completed", attributes);
} else {
span.addEvent("Attempt succeeded", attributes);
}
}

@Override
public void attemptCancelled() {
Attributes attributes = baseAttemptAttributes();

// Same infrastructure is used for both polling and retries, so need to disambiguate it here.
if (operationType == OperationType.LongRunning) {
span.addEvent("Polling was cancelled", attributes);
} else {
span.addEvent("Attempt cancelled", attributes);
}
lastConnectionId = null;
}

@Override
public void attemptFailed(Throwable error, Duration delay) {
AttributesBuilder builder = baseAttemptAttributesBuilder();
if (delay != null) {
builder.put(RETRY_DELAY_KEY, delay.toMillis());
}
if (error != null) {
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage());
}
Attributes attributes = builder.build();

// Same infrastructure is used for both polling and retries, so need to disambiguate it here.
if (operationType == OperationType.LongRunning) {
// The poll RPC was successful, but it indicated that the operation is still running.
span.addEvent("Scheduling next poll", attributes);
} else {
span.addEvent("Attempt failed, scheduling next attempt", attributes);
}
lastConnectionId = null;
}

@Override
public void attemptFailedRetriesExhausted(@Nonnull Throwable error) {
AttributesBuilder builder = baseAttemptAttributesBuilder();
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage());
Attributes attributes = builder.build();

// Same infrastructure is used for both polling and retries, so need to disambiguate it here.
if (operationType == OperationType.LongRunning) {
span.addEvent("Polling attempts exhausted", attributes);
} else {
span.addEvent("Attempts exhausted", attributes);
}
lastConnectionId = null;
}

@Override
public void attemptPermanentFailure(@Nonnull Throwable error) {
AttributesBuilder builder = baseAttemptAttributesBuilder();
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage());
Attributes attributes = builder.build();

// Same infrastructure is used for both polling and retries, so need to disambiguate it here.
if (operationType == OperationType.LongRunning) {
span.addEvent("Polling failed", attributes);
} else {
span.addEvent("Attempt failed, error not retryable", attributes);
}
lastConnectionId = null;
}

@Override
public void responseReceived() {
attemptReceivedMessages++;
totalReceivedMessages++;
}

@Override
public void requestSent() {
attemptSentMessages.incrementAndGet();
totalSentMessages.incrementAndGet();
}

@Override
public void batchRequestSent(long elementCount, long requestSize) {
span.setAllAttributes(
Attributes.of(BATCH_COUNT_KEY, elementCount, BATCH_SIZE_KEY, requestSize));
}

private Attributes baseOperationAttributes() {
AttributesBuilder builder = Attributes.builder();
builder.put(ATTEMPT_COUNT_KEY, currentAttemptId + 1);
long localTotalSentMessages = totalSentMessages.get();
if (localTotalSentMessages > 0) {
builder.put(TOTAL_REQUEST_COUNT_KEY, localTotalSentMessages);
}
if (totalReceivedMessages > 0) {
builder.put(TOTAL_RESPONSE_COUNT_KEY, totalReceivedMessages);
}
return builder.build();
}

private Attributes baseAttemptAttributes() {
return baseAttemptAttributesBuilder().build();
}

private AttributesBuilder baseAttemptAttributesBuilder() {
AttributesBuilder builder = Attributes.builder();
populateAttemptNumber(builder);

long localAttemptSentMessages = attemptSentMessages.get();
if (localAttemptSentMessages > 0) {
builder.put(ATTEMPT_REQUEST_COUNT_KEY, localAttemptSentMessages);
}
if (attemptReceivedMessages > 0) {
builder.put(ATTEMPT_RESPONSE_COUNT_KEY, attemptReceivedMessages);
}
String localLastConnectionId = lastConnectionId;
if (localLastConnectionId != null) {
builder.put(CONNECTION_ID_KEY, localLastConnectionId);
}

return builder;
}

private void populateAttemptNumber(AttributesBuilder builder) {
builder.put(ATTEMPT_NUMBER_KEY, currentAttemptId);
}
}
Loading
Loading