-
Notifications
You must be signed in to change notification settings - Fork 132
feat: add option to enable ApiTracer #3095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
49e6ac3
62a5492
990850b
6958173
319e889
ce205a5
90752a1
ab5e2dd
644e3d3
e03f21b
3392074
1510a12
bfa1aac
aa2d4da
fc7c7f0
2318515
dc9d284
c40f434
b6e9018
2081efd
2e2798c
3b51a06
2ed8d2a
1107a04
715f209
52dfd06
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -0,0 +1,274 @@ | ||||
/* | ||||
* Copyright 2024 Google LLC | ||||
* | ||||
* Licensed under the Apache License, Version 2.0 (the "License"); | ||||
* you may not use this file except in compliance with the License. | ||||
* You may obtain a copy of the License at | ||||
* | ||||
* https://www.apache.org/licenses/LICENSE-2.0 | ||||
* | ||||
* Unless required by applicable law or agreed to in writing, software | ||||
* distributed under the License is distributed on an "AS IS" BASIS, | ||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
* See the License for the specific language governing permissions and | ||||
* limitations under the License. | ||||
*/ | ||||
|
||||
package com.google.cloud.spanner; | ||||
|
||||
import com.google.api.gax.tracing.ApiTracer; | ||||
import com.google.api.gax.tracing.ApiTracerFactory.OperationType; | ||||
import com.google.common.base.Preconditions; | ||||
import io.opentelemetry.api.common.AttributeKey; | ||||
import io.opentelemetry.api.common.Attributes; | ||||
import io.opentelemetry.api.common.AttributesBuilder; | ||||
import io.opentelemetry.api.trace.Span; | ||||
import io.opentelemetry.api.trace.StatusCode; | ||||
import java.util.concurrent.atomic.AtomicLong; | ||||
import javax.annotation.Nonnull; | ||||
import javax.annotation.Nullable; | ||||
import org.threeten.bp.Duration; | ||||
|
||||
/** | ||||
* {@link com.google.api.gax.tracing.ApiTracer} for use with OpenTelemetry. Based on {@link | ||||
* com.google.api.gax.tracing.OpencensusTracer}. | ||||
*/ | ||||
class OpenTelemetryApiTracer implements ApiTracer { | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We decided to separate the logic to calculate the metrics and the logic to record the metrics while implementing metrics in Gax. Hence MetricsTracer and MetricsRecorder are generic classes without any knowledge of OpenTelemetry, only OpenTelemetryMetricsRecorder is dependent on OpenTelemetry apis. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The Spanner client supports tracing with both OpenCensus and OpenTelemetry. The customer makes a choice at startup, and then the client library configures itself accordingly. We want to support the same for the |
||||
/** The attribute keys that are used by this tracer might change in a future release. */ | ||||
private final AttributeKey<Long> ATTEMPT_COUNT_KEY = AttributeKey.longKey("attempt.count"); | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible to document that these attribute keys are subject to change? If we are going to implement them in Gax, they may have different naming conventions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've added a comment to the class, but I don't think many people will actually read it. I'm also not sure it will be a very big problem for customers if they change in the future. I also added a comment regarding this to the README file. The keys are by the way taken from the existing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thanks, this sounds good to me! |
||||
|
||||
private final AttributeKey<Long> TOTAL_REQUEST_COUNT_KEY = | ||||
AttributeKey.longKey("total_request_count"); | ||||
private final AttributeKey<Long> TOTAL_RESPONSE_COUNT_KEY = | ||||
AttributeKey.longKey("total_response_count"); | ||||
private final AttributeKey<String> EXCEPTION_MESSAGE_KEY = | ||||
AttributeKey.stringKey("exception.message"); | ||||
private final AttributeKey<Long> ATTEMPT_NUMBER_KEY = AttributeKey.longKey("attempt.number"); | ||||
private final AttributeKey<Long> ATTEMPT_REQUEST_COUNT_KEY = | ||||
AttributeKey.longKey("attempt.request_count"); | ||||
private final AttributeKey<Long> ATTEMPT_RESPONSE_COUNT_KEY = | ||||
AttributeKey.longKey("attempt.response_count"); | ||||
private final AttributeKey<String> CONNECTION_ID_KEY = AttributeKey.stringKey("connection"); | ||||
private final AttributeKey<Long> RETRY_DELAY_KEY = AttributeKey.longKey("delay_ms"); | ||||
private static final AttributeKey<Long> BATCH_SIZE_KEY = AttributeKey.longKey("batch.size"); | ||||
private static final AttributeKey<Long> BATCH_COUNT_KEY = AttributeKey.longKey("batch.count"); | ||||
|
||||
private final Span span; | ||||
private final OperationType operationType; | ||||
|
||||
private volatile String lastConnectionId; | ||||
private volatile long currentAttemptId; | ||||
private final AtomicLong attemptSentMessages = new AtomicLong(0); | ||||
private long attemptReceivedMessages = 0; | ||||
private final AtomicLong totalSentMessages = new AtomicLong(0); | ||||
private long totalReceivedMessages = 0; | ||||
|
||||
OpenTelemetryApiTracer(@Nonnull Span span, @Nonnull OperationType operationType) { | ||||
this.span = Preconditions.checkNotNull(span); | ||||
this.operationType = Preconditions.checkNotNull(operationType); | ||||
} | ||||
|
||||
Span getSpan() { | ||||
return this.span; | ||||
} | ||||
|
||||
@Override | ||||
public Scope inScope() { | ||||
final io.opentelemetry.context.Scope openTelemetryScope = span.makeCurrent(); | ||||
return openTelemetryScope::close; | ||||
} | ||||
|
||||
@Override | ||||
public void operationSucceeded() { | ||||
span.setAllAttributes(baseOperationAttributes()); | ||||
span.setStatus(StatusCode.OK); | ||||
span.end(); | ||||
} | ||||
|
||||
@Override | ||||
public void operationCancelled() { | ||||
span.setAllAttributes(baseOperationAttributes()); | ||||
span.setStatus(StatusCode.ERROR, "Cancelled by caller"); | ||||
span.end(); | ||||
} | ||||
|
||||
@Override | ||||
public void operationFailed(Throwable error) { | ||||
span.setAllAttributes(baseOperationAttributes()); | ||||
span.setStatus(StatusCode.ERROR, error.getMessage()); | ||||
span.end(); | ||||
} | ||||
|
||||
@Override | ||||
public void lroStartFailed(Throwable error) { | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some of these methods are not currently used by anyone. For example, neither Bigtable's MetricsTracer or Gax's MetricsTracer overrides There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's called from here: https://github.com/googleapis/sdk-platform-java/blob/5799827a864be627bac03969cc178efc9d6170aa/gax-java/gax/src/main/java/com/google/api/gax/tracing/TracedOperationInitialCallable.java#L85 This test case covers the kind of failure that would trigger this method to be called: Line 346 in bfa1aac
I agree that it is relatively unlikely to happen in production, as it requires an RPC that would normally return an LRO to fail in a way that does not return an LRO, but instead just returns an error. But it is needed to implement the interface, and in theory it could happen (I could for example imagine an RPC like this returning a RESOURCE_EXHAUSTED error directly, instead of an LRO, if there is a limit on the number of operations running in parallel). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
All methods in What I'm trying to get to is, maybe we can start with traces/events that we know will be most useful, we don't have to implement each and every scenario. Especially for LROs and batches related methods, they may not be well tested before and I'm not confident enough that they are working as intended. That being said, if you do think they provide value to Spanner, we can start with it and fix bugs along the way when we find them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer to keep this, because:
|
||||
span.addEvent( | ||||
"Operation failed to start", Attributes.of(EXCEPTION_MESSAGE_KEY, error.getMessage())); | ||||
span.setStatus(StatusCode.ERROR, error.getMessage()); | ||||
span.end(); | ||||
} | ||||
|
||||
@Override | ||||
public void lroStartSucceeded() { | ||||
span.addEvent("Operation started"); | ||||
} | ||||
|
||||
@Override | ||||
public void connectionSelected(String id) { | ||||
lastConnectionId = id; | ||||
} | ||||
|
||||
@Override | ||||
public void attemptStarted(int attemptNumber) { | ||||
attemptStarted(null, attemptNumber); | ||||
} | ||||
|
||||
@Override | ||||
public void attemptStarted(@Nullable Object request, int attemptNumber) { | ||||
currentAttemptId = attemptNumber; | ||||
attemptSentMessages.set(0); | ||||
attemptReceivedMessages = 0; | ||||
|
||||
// Attempts start counting a zero, so more than zero indicates a retry. | ||||
if (attemptNumber > 0 && operationType != OperationType.LongRunning) { | ||||
// Add an event if the RPC retries, as this is otherwise transparent to the user. Retries | ||||
// would then show up as higher latency without any logical explanation. | ||||
span.addEvent("Starting RPC retry " + attemptNumber); | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Have we considered representing retries as child spans? It would definitely make things complicated but maybe easier for customers to understand. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had not considered it before implementing this. But also after having thought about it for a while, I don't think we should do that. The reason is that:
|
||||
} else if (operationType == OperationType.LongRunning) { | ||||
span.addEvent("Starting poll attempt " + attemptNumber); | ||||
} | ||||
} | ||||
|
||||
@Override | ||||
public void attemptSucceeded() { | ||||
Attributes attributes = baseAttemptAttributes(); | ||||
|
||||
// Same infrastructure is used for both polling and retries, so need to disambiguate it here. | ||||
if (operationType == OperationType.LongRunning) { | ||||
span.addEvent("Polling completed", attributes); | ||||
} else { | ||||
span.addEvent("Attempt succeeded", attributes); | ||||
} | ||||
} | ||||
|
||||
@Override | ||||
public void attemptCancelled() { | ||||
Attributes attributes = baseAttemptAttributes(); | ||||
|
||||
// Same infrastructure is used for both polling and retries, so need to disambiguate it here. | ||||
if (operationType == OperationType.LongRunning) { | ||||
span.addEvent("Polling was cancelled", attributes); | ||||
} else { | ||||
span.addEvent("Attempt cancelled", attributes); | ||||
} | ||||
lastConnectionId = null; | ||||
} | ||||
|
||||
@Override | ||||
public void attemptFailed(Throwable error, Duration delay) { | ||||
AttributesBuilder builder = baseAttemptAttributesBuilder(); | ||||
if (delay != null) { | ||||
builder.put(RETRY_DELAY_KEY, delay.toMillis()); | ||||
} | ||||
if (error != null) { | ||||
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage()); | ||||
} | ||||
Attributes attributes = builder.build(); | ||||
|
||||
// Same infrastructure is used for both polling and retries, so need to disambiguate it here. | ||||
if (operationType == OperationType.LongRunning) { | ||||
// The poll RPC was successful, but it indicated that the operation is still running. | ||||
span.addEvent("Scheduling next poll", attributes); | ||||
} else { | ||||
span.addEvent("Attempt failed, scheduling next attempt", attributes); | ||||
} | ||||
lastConnectionId = null; | ||||
} | ||||
|
||||
@Override | ||||
public void attemptFailedRetriesExhausted(@Nonnull Throwable error) { | ||||
AttributesBuilder builder = baseAttemptAttributesBuilder(); | ||||
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage()); | ||||
Attributes attributes = builder.build(); | ||||
|
||||
// Same infrastructure is used for both polling and retries, so need to disambiguate it here. | ||||
if (operationType == OperationType.LongRunning) { | ||||
span.addEvent("Polling attempts exhausted", attributes); | ||||
} else { | ||||
span.addEvent("Attempts exhausted", attributes); | ||||
} | ||||
lastConnectionId = null; | ||||
} | ||||
|
||||
@Override | ||||
public void attemptPermanentFailure(@Nonnull Throwable error) { | ||||
AttributesBuilder builder = baseAttemptAttributesBuilder(); | ||||
builder.put(EXCEPTION_MESSAGE_KEY, error.getMessage()); | ||||
Attributes attributes = builder.build(); | ||||
|
||||
// Same infrastructure is used for both polling and retries, so need to disambiguate it here. | ||||
if (operationType == OperationType.LongRunning) { | ||||
span.addEvent("Polling failed", attributes); | ||||
} else { | ||||
span.addEvent("Attempt failed, error not retryable", attributes); | ||||
} | ||||
lastConnectionId = null; | ||||
} | ||||
|
||||
@Override | ||||
public void responseReceived() { | ||||
attemptReceivedMessages++; | ||||
totalReceivedMessages++; | ||||
} | ||||
|
||||
@Override | ||||
public void requestSent() { | ||||
attemptSentMessages.incrementAndGet(); | ||||
totalSentMessages.incrementAndGet(); | ||||
} | ||||
|
||||
@Override | ||||
public void batchRequestSent(long elementCount, long requestSize) { | ||||
span.setAllAttributes( | ||||
Attributes.of(BATCH_COUNT_KEY, elementCount, BATCH_SIZE_KEY, requestSize)); | ||||
} | ||||
|
||||
private Attributes baseOperationAttributes() { | ||||
AttributesBuilder builder = Attributes.builder(); | ||||
builder.put(ATTEMPT_COUNT_KEY, currentAttemptId + 1); | ||||
long localTotalSentMessages = totalSentMessages.get(); | ||||
if (localTotalSentMessages > 0) { | ||||
builder.put(TOTAL_REQUEST_COUNT_KEY, localTotalSentMessages); | ||||
} | ||||
if (totalReceivedMessages > 0) { | ||||
builder.put(TOTAL_RESPONSE_COUNT_KEY, totalReceivedMessages); | ||||
} | ||||
return builder.build(); | ||||
} | ||||
|
||||
private Attributes baseAttemptAttributes() { | ||||
return baseAttemptAttributesBuilder().build(); | ||||
} | ||||
|
||||
private AttributesBuilder baseAttemptAttributesBuilder() { | ||||
AttributesBuilder builder = Attributes.builder(); | ||||
populateAttemptNumber(builder); | ||||
|
||||
long localAttemptSentMessages = attemptSentMessages.get(); | ||||
if (localAttemptSentMessages > 0) { | ||||
builder.put(ATTEMPT_REQUEST_COUNT_KEY, localAttemptSentMessages); | ||||
} | ||||
if (attemptReceivedMessages > 0) { | ||||
builder.put(ATTEMPT_RESPONSE_COUNT_KEY, attemptReceivedMessages); | ||||
} | ||||
String localLastConnectionId = lastConnectionId; | ||||
if (localLastConnectionId != null) { | ||||
builder.put(CONNECTION_ID_KEY, localLastConnectionId); | ||||
} | ||||
|
||||
return builder; | ||||
} | ||||
|
||||
private void populateAttemptNumber(AttributesBuilder builder) { | ||||
builder.put(ATTEMPT_NUMBER_KEY, currentAttemptId); | ||||
} | ||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we consider this a beta feature or a stable feature? If we consider this a beta feature, I would add
@BetaApi
tosetEnableApiTracing
and mention it here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any specific reason for why we would not consider this a stable feature, other than the comments already mentioned regarding the possiblity of changing attribute keys. In many ways, this is just an extension of something that was already supported (
ApiTracer
with OpenCensus) using the newer tracing framework, that is also considered a stable feature.