
runs replication leader lock expiration fix #2050


Merged
merged 4 commits into main on May 14, 2025

Conversation

ericallam (Member) commented May 14, 2025

Summary by CodeRabbit

  • New Features

    • Added an integration test to verify seamless leadership handover and continued data replication between service instances.
  • Bug Fixes

    • Improved reliability of leader lock acquisition by switching to a time-based retry mechanism with enhanced logging.
  • Refactor

  • Changed the leader lock retry configuration from a retry count to an additional wait time (in milliseconds) for lock acquisition.
    • Updated environment variable and configuration options to reflect this change.
    • Enhanced logging and tracing for transaction handling and batch flushing.
    • Extended client configuration to support connection keep-alive and customizable logging levels.
    • Made Node.js memory limit configurable via environment variable.

changeset-bot bot commented May 14, 2025

⚠️ No Changeset found

Latest commit: 55ddfee

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


coderabbitai bot (Contributor) commented May 14, 2025

Walkthrough

The changes replace the fixed retry count in the leader lock acquisition mechanism with a time-based approach. Configuration keys, environment variables, and option interfaces are updated accordingly. The lock acquisition logic now retries until a maximum additional time elapses, with enhanced logging. A new integration test verifies leadership handover and lock extension. Additional improvements include enhanced logging, tracing spans, and HTTP connection keep-alive configuration for the ClickHouse client.
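The time-based retry described above can be sketched as a deadline loop. This is an illustrative reconstruction, not the actual `client.ts` code: function and option names are assumed, and the real implementation acquires the lock through Redlock against Redis.

```typescript
// Minimal sketch of time-based leader lock acquisition (hypothetical names).
// Instead of counting a fixed number of attempts, we retry until an
// additional time budget (leaderLockAcquireAdditionalTimeMs) is exhausted.
type Locker = { acquire: (keys: string[], ttlMs: number) => Promise<unknown> };

async function acquireLeaderLock(
  locker: Locker,
  key: string,
  lockTtlMs: number,
  additionalTimeMs: number,
  retryDelayMs = 500
): Promise<unknown | undefined> {
  const deadline = Date.now() + additionalTimeMs;
  let attempt = 0;
  while (true) {
    try {
      // Attempt to take the lock; on success we become the leader.
      const lock = await locker.acquire([key], lockTtlMs);
      console.debug("leader_lock_acquired", { key, attempt });
      return lock;
    } catch {
      attempt += 1;
      if (Date.now() >= deadline) {
        // Time budget spent: give up rather than retrying forever.
        console.debug("leader_lock_acquire_failed", { key, attempt });
        return undefined;
      }
      await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
    }
  }
}
```

Compared with a fixed retry count, the wait behavior stays predictable even when individual acquisition attempts take variable time.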

Changes

File(s) Change Summary
apps/webapp/app/env.server.ts Replaced environment variable RUN_REPLICATION_LEADER_LOCK_RETRY_COUNT (integer, default 240) with RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS (integer, default 10,000); added RUN_REPLICATION_KEEP_ALIVE_ENABLED (string, default "1") and RUN_REPLICATION_KEEP_ALIVE_IDLE_SOCKET_TTL_MS (integer, default 9,000) in the environment schema.
apps/webapp/app/services/runsReplicationInstance.server.ts Updated ClickHouse client initialization to include keepAlive config and logLevel from environment; replaced leaderLockRetryCount with leaderLockAcquireAdditionalTimeMs in constructor options.
apps/webapp/app/services/runsReplicationService.server.ts Replaced leaderLockRetryCount with leaderLockAcquireAdditionalTimeMs in options and constructor; improved logger fallback; replaced full transaction debug log with tracing span and concise debug logs; enhanced batch flush instrumentation and logging with timing and error renaming.
apps/webapp/test/runsReplicationService.test.ts Added a new containerized integration test verifying leadership handover and leader lock extension between two RunsReplicationService instances, ensuring replication continuity and data correctness after handover.
internal-packages/replication/src/client.ts Refactored leader lock acquisition logic from fixed retry count with Redlock internal retry to a manual time-based retry loop using leaderLockAcquireAdditionalTimeMs; updated options interface and class properties; enhanced logging for lock acquisition attempts, success, failure, release, and extension.
docker/scripts/entrypoint.sh Modified Node.js startup to use dynamic max old space size via NODE_MAX_OLD_SPACE_SIZE environment variable with default fallback to 8192 MB; logs the configured value before starting the server.
internal-packages/clickhouse/src/client/client.ts Added optional keepAlive, httpAgent, and logLevel configuration options to ClickhouseConfig; updated ClickhouseClient constructor to use these options and pass them to the internal client creation; logging verbosity control added.
internal-packages/clickhouse/src/index.ts Introduced new shared ClickhouseCommonConfig type consolidating common client options including keepAlive, httpAgent, and logLevel; refactored ClickHouseConfig to extend this common config; updated ClickHouse class constructor to pass new options to internal ClickhouseClient instances.
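The env-var changes in the table above can be summarized as a small mapping from environment to options. The variable names and defaults come from the change summary; the helper itself and its return shape are hypothetical.

```typescript
// Hypothetical helper mapping the new env vars onto replication options.
// Variable names and defaults are taken from the PR's env.server.ts summary.
function replicationOptionsFromEnv(env: Record<string, string | undefined>) {
  return {
    keepAlive: {
      // RUN_REPLICATION_KEEP_ALIVE_ENABLED defaults to "1" (enabled)
      enabled: (env.RUN_REPLICATION_KEEP_ALIVE_ENABLED ?? "1") === "1",
      // RUN_REPLICATION_KEEP_ALIVE_IDLE_SOCKET_TTL_MS defaults to 9,000 ms
      idleSocketTtlMs: Number(env.RUN_REPLICATION_KEEP_ALIVE_IDLE_SOCKET_TTL_MS ?? 9_000),
    },
    // RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS defaults to 10,000 ms,
    // replacing the old RUN_REPLICATION_LEADER_LOCK_RETRY_COUNT (default 240)
    leaderLockAcquireAdditionalTimeMs: Number(
      env.RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS ?? 10_000
    ),
  };
}
```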

Sequence Diagram(s)

sequenceDiagram
    participant ServiceA as RunsReplicationService A
    participant ServiceB as RunsReplicationService B
    participant Redis as Redis (Redlock)
    participant ClickHouse as ClickHouse
    participant Postgres as Postgres

    ServiceA->>Redis: Attempt to acquire leader lock (with timeout + additional time)
    Redis-->>ServiceA: Lock acquired
    ServiceA->>Postgres: Read new TaskRun
    ServiceA->>ClickHouse: Replicate TaskRun data

    Note over ServiceA: ServiceA running as leader

    ServiceB->>Redis: Attempt to acquire leader lock (while ServiceA is leader)
    Redis-->>ServiceB: Lock denied (retries until ServiceA stops)

    ServiceA-->>Redis: Release leader lock (ServiceA stops)
    ServiceB->>Redis: Acquire leader lock (succeeds after retries)
    Redis-->>ServiceB: Lock acquired

    ServiceB->>Postgres: Read new TaskRun
    ServiceB->>ClickHouse: Replicate TaskRun data

    Note over ServiceB: ServiceB now running as leader

Possibly related PRs

  • triggerdotdev/trigger.dev#2042: Modifies the same leader lock acquisition logic by introducing a time-based retry loop and updating related configuration keys and options.

Poem

In the warren of code, a new lock we devise,
No longer by count, but by time we apprise.
The leader may hand off, with retries anew,
Ensuring each bunny knows just what to do.
With logs that now sparkle and tests that assure,
Our replication hops on, robust and secure! 🐇⏱️



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32271ce and 55ddfee.

📒 Files selected for processing (6)
  • apps/webapp/app/env.server.ts (1 hunks)
  • apps/webapp/app/services/runsReplicationInstance.server.ts (2 hunks)
  • apps/webapp/app/services/runsReplicationService.server.ts (6 hunks)
  • docker/scripts/entrypoint.sh (1 hunks)
  • internal-packages/clickhouse/src/client/client.ts (2 hunks)
  • internal-packages/clickhouse/src/index.ts (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docker/scripts/entrypoint.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/env.server.ts
🧰 Additional context used
🧬 Code Graph Analysis (1)
internal-packages/clickhouse/src/index.ts (1)
internal-packages/clickhouse/src/client/client.ts (1)
  • ClickhouseClient (36-277)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (20)
internal-packages/clickhouse/src/client/client.ts (5)

18-20: Good addition of standardized imports for HTTP agents and logging.

These imports align with best practices for NodeJS HTTP agents and the logging system within the codebase.


26-31: Well-structured HTTP connection configuration options.

The addition of the keepAlive and httpAgent options provides important performance improvements for long-running services by allowing HTTP connection reuse. This is a best practice for services that make frequent requests to the same ClickHouse server.
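For illustration, the kind of agent an `httpAgent` option could accept is Node's built-in `http.Agent` with keep-alive enabled. The option values below are examples, not the wrapper's actual defaults.

```typescript
// Sketch of a keep-alive HTTP agent (standard node:http API; the values
// shown are illustrative, not the ClickHouse wrapper's actual settings).
import { Agent } from "node:http";

const keepAliveAgent = new Agent({
  keepAlive: true, // reuse sockets across requests instead of reconnecting
  maxSockets: 32,  // example cap on concurrent connections per host
});
```

Reusing sockets avoids per-request TCP (and TLS) handshakes, which is where the performance win for frequent ClickHouse inserts comes from.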


33-33: Good addition of log level configurability.

Adding the logLevel option allows for more granular control of logging verbosity per client instance.


44-44: Improved logger initialization with configurable log level.

Now correctly uses the provided logLevel with fallback to "info", improving observability configuration.


48-49: Properly passing connection configuration to underlying client.

The connection configuration options are now correctly passed to the ClickHouse client, enabling HTTP keep-alive and custom agent support.

internal-packages/clickhouse/src/index.ts (5)

6-8: Good consistency with client imports.

These imports mirror those in client.ts, maintaining consistency throughout the codebase.


12-21: Excellent extraction of common config into a shared type.

The creation of ClickhouseCommonConfig reduces duplication and improves maintainability by centralizing common configuration options. This follows the DRY principle and makes future changes easier.


23-37: Good refactoring of union type with composition.

Leveraging composition with the common config type is a clean way to handle the two different connection strategies without duplicating shared configuration options.


59-61: Correctly propagating configuration to single client.

All new configuration options are properly passed to the client constructor in the single-URL case.


73-75: Consistent configuration for both reader and writer clients.

The same configuration options are consistently passed to both reader and writer clients when split mode is used, ensuring uniform behavior.

Also applies to: 82-84

apps/webapp/app/services/runsReplicationService.server.ts (10)

46-46: Good shift to time-based leader lock acquisition.

Replacing the retry count with a timeout in milliseconds provides more predictable behavior, especially in environments with variable network latency.


105-105: Improved logger propagation to replication client.

Now correctly passes the logger or creates a new one with the appropriate log level, improving observability consistency.


109-109: Properly configured time-based lock acquisition.

Setting a default of 10 seconds for additional lock acquisition time ensures the system has a reasonable window to retry lock acquisition before giving up.


348-362: Enhanced transaction observability with tracing.

Adding a dedicated tracing span for transaction handling with detailed attributes significantly improves observability and debugging capabilities.


363-372: More consistent and structured transaction logging.

The transaction log now includes only relevant details in a structured format, making logs more readable and useful without overwhelming with too much data.


398-398: Better naming convention for log messages.

Renaming the log message to use underscores consistently with other log messages improves log parsability and searchability.


758-787: Improved flush batch functionality with timing and return value.

Now correctly returns the span result and adds performance timing metrics, which is essential for monitoring batch processing efficiency.


767-773: Enhanced batch flush logging for better observability.

Adding concurrency metrics to the flush logs provides vital information for understanding system performance and identifying bottlenecks.


775-782: Good performance measurement for batch processing.

Using performance.now() to measure precise durations is a best practice for performance monitoring.


790-807: Improved error handling with structured logging.

The error handling now includes more detailed logs with consistent naming conventions, making it easier to track and diagnose issues.

@ericallam ericallam merged commit eb39298 into main May 14, 2025
12 checks passed
@ericallam ericallam deleted the runs-replication-locking branch May 14, 2025 14:49