feat: Add healthcheck command in nerdctl #4302

Open · wants to merge 1 commit into base: main

Conversation

@subashkotha (Contributor) commented May 29, 2025:

This PR adds support for Docker-compatible health checks in nerdctl run, nerdctl create, and introduces a new nerdctl container healthcheck command to manually trigger health checks.

Key features:

  1. Health check configuration via CLI flags (--health-cmd, --health-interval, --health-timeout, etc.)
  2. Support for health checks defined in Docker images (HEALTHCHECK in the Dockerfile)
  3. Merging logic for health checks:
  • CLI flags override image-defined health checks
  • Fallback to image-defined health checks if CLI flags are not set
  • No health check is configured by default
  4. nerdctl container healthcheck command to manually run health checks
  5. Health check configuration is stored as an internal label nerdctl/healthcheck
  6. Health state (status and failing streak) is stored in the internal label nerdctl/healthstate for quick access
  7. Health check results are stored in a health.json file in the container’s runtime state directory
  8. Tests added for CLI flag parsing, merging behavior, execution of health checks, and health state inspection in nerdctl inspect

Future work (WIP):

  1. Automate periodic health checks using systemd.
  2. Generate systemd timer and service units based on container lifecycle events.
  3. Improve logic to store/fetch health check results.

Related issue - #4157

cc: @Shubhranshu153 @AkihiroSuda

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from a0bfd2c to b03c1f6 Compare May 29, 2025 17:52
@subashkotha subashkotha changed the title feat: Add healthcheck cmd, healthcheck related flags to create/run and include health config and status in inspect output feat: Add healthcheck support in nerdctl May 29, 2025
@subashkotha subashkotha force-pushed the health_checks branch 4 times, most recently from bebb32a to cced4f6 Compare May 29, 2025 21:31
@AkihiroSuda AkihiroSuda added this to the v2.1.3 milestone May 30, 2025

path := filepath.Join(stateDir, HealthLogFilename)

return os.WriteFile(path, data, 0o600)
@apostasie (Contributor) commented May 30, 2025:
A mutex only protects you from concurrent access within a single process; you can still have concurrent access across distinct binary invocations. Also, this is not an atomic write, so the file might be left incomplete.

You may either use filesystem lock and atomic write (see internal/filesystem), or leverage store.Store that provides higher level (safe) storage primitives.

The same comment applies to your reads further down, which do not take a lock.

Contributor:

There is a state.Store already that could be used / expanded to manage that log file as well.

Contributor (Author):

Ack, I'll switch to using internal/filesystem with locking to ensure atomic and concurrency-safe read/write operations.

@AkihiroSuda (Member):

CI is failing on Windows

options.HealthStartPeriod != 0 ||
options.HealthStartInterval != 0

if options.NoHealthcheck {
Contributor:

Disable any container-specified HEALTHCHECK

This should return a nil error: we are disabling the health check if one is present, and simply ignoring the flag if no health check is configured.

Contributor:

So the top level of this call should return with the health check set to NONE.

Contributor (Author):

We still need to persist NONE as part of the health check config so that, when the user inspects, they see NONE in the health check configuration (similar to Docker). This way the user can confirm that they explicitly disabled health checks via CLI flags, even if the underlying image has a health check configured.

// Start with health checks in image if present
hc := &healthcheck.Healthcheck{}
if ensuredImage != nil && ensuredImage.ImageConfig.Labels != nil {
if label := ensuredImage.ImageConfig.Labels[labels.HealthCheck]; label != "" {
Contributor:

This logic doesn't seem right. The health check is embedded in the image config by BuildKit, so we need to retrieve it from HealthcheckConfig; I don't think BuildKit writes it to the health check label.

Contributor:

Looks like the ImageConfig in the ensuredImage is read from getImageConfig, which unmarshals the config blob into the ocispec.ImageConfig type. In order to retrieve the Healthcheck config, can we not use DockerOCIImageConfig instead?

Contributor (Author):

The ImageConfig in the ensured image is of type ocispec.ImageConfig, which doesn’t include a healthcheck field. While we could unmarshal the raw config using DockerOCIImageConfig to extract the healthcheck data, that structure is not used to construct the ensured image. Instead, we extract the healthcheck config separately as a *healthcheck.Healthcheck and embed it in the image labels. So even though ImageConfig itself doesn’t carry the healthcheck field, the relevant info is still preserved and accessible through labels.

Contributor:

Is there a reason to preserve the healthcheck in the image labels? From what I understand we only need the healthcheck info once, during container create/run; can we not do it by implementing and calling a new getImageConfigWithHealthCheck? It seems simpler that way.

Contributor (Author):

Refactoring to introduce getImageConfigWithHealthCheck() is turning out to be a bit tricky. The main issue is that EnsuredImage doesn't retain enough context; specifically, it doesn't store ImageConfigDesc, which is needed to locate and unmarshal the raw image config blob.

Since EnsuredImage is the only thing available during both create/run and inspect (that's where we need the healthcheck config), we'd need to thread additional data through or change its structure, which adds complexity.


// updateHealthStatus updates the health status based on the health check result
func updateHealthStatus(ctx context.Context, container containerd.Container, hcConfig *Healthcheck, hcResult *HealthcheckResult) error {
// Get current health status from health log
currentHealth, err := readHealthLog(ctx, container)
Contributor:

Reading logs to get the current health seems not so robust. I think we should at least consider having a DB.
@AkihiroSuda thoughts? Or is having a DB for this over-engineering the problem?

Member:

How could the DB improve the robustness?

Member:

Maybe ocihook should run healthcheck commands periodically, and serve the health status via a Unix socket?

Then there does not need to be a static health log file nor a DB.

Contributor:

A DB would support atomicity of the writes.
In the case of the OCI hook, is the status stored in the runc process memory? I'm not very familiar with it.

Member:

ocihook is a process executed on events such as "createRuntime" and "postStop":

switch event {
case "createRuntime":
return onCreateRuntime(opts)
case "postStop":
return onPostStop(opts)
default:
return fmt.Errorf("unexpected event %q", event)
}

https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle

ocihook is currently used for CNI, logging, etc.
Periodic health checker could be added here, and it could serve gRPC (or REST) over a Unix socket to provide the latest health status.

Contributor:

db would support atomicity of the writes.

There is an effort to consolidate atomic, concurrency-safe filesystem operations in internal/filesystem. If needed, these can be used instead of filesystem access in almost all cases.

Generally, IMHO, we should avoid using the filesystem as an API and instead rely on a higher-level abstraction (e.g. store.Store). Leaking implementation details (filesystem or DB) into the consumer is bad API design; having a storage API will allow swapping out the fs implementation for something else if need be.

Contributor (Author):

Thanks for the feedback. I'll update the PR to store current health state and failing streak in the container labels, so we no longer need to read the log file during each health check to get current state.

Health check results are still written to a log file, and during inspect we fetch the last five entries to provide recent history. For health.json operations, we’ll use internal/filesystem along with proper locking as recommended to ensure concurrency safety.

@@ -629,6 +648,15 @@ func ImageFromNative(nativeImage *native.Image) (*Image, error) {
ExposedPorts: portSet,
}

// Add health check if present in labels
if healthStr, ok := imgOCI.Config.Labels[labels.HealthCheck]; ok && healthStr != "" {
Contributor:

This part seems confusing to me. I would expect imgOCI.Config.Healthcheck to have it if we parse the JSON of the image's config.json.

@subashkotha (Contributor, Author) commented Jun 2, 2025:

The OCI image spec doesn’t include a healthcheck field; that’s why we add the healthcheck info to labels when ensuring the image, and during docker-compat inspect we add it back to the image config.

Not sure if I got your question right

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from 6b91b90 to 5fb121a Compare June 6, 2025 22:29
@subashkotha subashkotha changed the title feat: Add healthcheck support in nerdctl feat: Add healthcheck command in nerdctl Jun 6, 2025
@Shubhranshu153 (Contributor):

LGTM. Let's fix the Windows test.


// writeHealthLog writes the latest health check result to the log file, appending it to existing logs.
func writeHealthLog(ctx context.Context, container containerd.Container, result *HealthcheckResult) error {
stateDir, err := getContainerStateDir(ctx, container)
Contributor:

Instead of writing the log file to the container state directory, should we create a health-specific store? That way we avoid unnecessarily locking the state dir and preventing other threads from writing to it.

@@ -0,0 +1,261 @@
/*
Contributor:

Nit: calling it log.go is not very accurate. Maybe we can split the log-specific methods and the containerd metadata update methods into their own files.

5 participants