feat: Add healthcheck command in nerdctl #4302

Open · wants to merge 1 commit into base: main

Conversation

@subashkotha (Contributor) commented May 29, 2025:

This PR adds support for Docker-compatible health checks in nerdctl run, nerdctl create, and introduces a new nerdctl container healthcheck command to manually trigger health checks.

Key features:

  1. Health check configuration via CLI flags (--health-cmd, --health-interval, --health-timeout, etc.)
  2. Support for health checks defined in Docker images (HEALTHCHECK in the Dockerfile)
  3. Merging logic for health checks:
  • CLI flags override image-defined health checks
  • Fallback to image-defined health checks if CLI flags are not set
  • No health check is configured by default
  4. nerdctl container healthcheck command to manually run health checks
  5. Health check configuration is stored as an internal label nerdctl/healthcheck
  6. Health state (status and failing streak) is stored in the internal label nerdctl/healthstate for quick access
  7. Health check results are stored in a health.json file in the container’s runtime state directory
  8. Tests added for CLI flag parsing, merging behavior, execution of health checks, and health state inspection in nerdctl inspect

Future work (WIP):

  1. Automate periodic health checks using systemd.
  2. Generate systemd timer and service units based on container lifecycle events.
  3. Improve logic to store/fetch health check results.

Related issue - #4157

cc: @Shubhranshu153 @AkihiroSuda

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from a0bfd2c to b03c1f6 Compare May 29, 2025 17:52
@subashkotha subashkotha changed the title feat: Add healthcheck cmd, healthcheck related flags to create/run and include health config and status in inspect output feat: Add healthcheck support in nerdctl May 29, 2025
@subashkotha subashkotha force-pushed the health_checks branch 4 times, most recently from bebb32a to cced4f6 Compare May 29, 2025 21:31
@AkihiroSuda AkihiroSuda added this to the v2.1.3 milestone May 30, 2025

path := filepath.Join(stateDir, HealthLogFilename)

return os.WriteFile(path, data, 0o600)
@apostasie (Contributor) commented May 30, 2025:
A mutex only protects you from concurrent access within a single process; you can still have concurrent access across distinct binary invocations. Also, this is not an atomic write, so the file might be left incomplete.

You may either use filesystem lock and atomic write (see internal/filesystem), or leverage store.Store that provides higher level (safe) storage primitives.

The same comment applies to your reads further down, which do not take a lock.

Contributor:

There is a state.Store already that could be used / expanded to manage that log file as well.

Contributor (Author):

Ack, I'll switch to using internal/filesystem with locking to ensure atomic and concurrency-safe read/write operations.

@AkihiroSuda (Member):

CI is failing on Windows

options.HealthStartPeriod != 0 ||
options.HealthStartInterval != 0

if options.NoHealthcheck {
Contributor:

Disable any container-specified HEALTHCHECK

This should return a nil error: we are disabling the health check if one is present, and simply ignoring the flag if no health check is configured.

Contributor:

So the top level of this call should return with the health check set to NONE.

Contributor (Author):

We still need to persist NONE as part of the health check config so that, when the user inspects, they see NONE in the health check configuration (similar to Docker). This way the user can confirm that they explicitly disabled health checks via CLI flags, even if the underlying image has a health check configured.

// Start with health checks in image if present
hc := &healthcheck.Healthcheck{}
if ensuredImage != nil && ensuredImage.ImageConfig.Labels != nil {
if label := ensuredImage.ImageConfig.Labels[labels.HealthCheck]; label != "" {
Contributor:

This logic doesn't seem right. The health check is embedded in the image config by BuildKit, so we need to retrieve it from HealthcheckConfig; I don't think BuildKit writes it to the health check label.

Contributor:

Looks like the ImageConfig in the ensuredImage is read from getImageConfig, which unmarshals the config blob into the ocispec.ImageConfig type. In order to retrieve the Healthcheck config, can we not use DockerOCIImageConfig instead?

Contributor (Author):

The ImageConfig in the ensured image is of type ocispec.ImageConfig, which doesn’t include a healthcheck field. While we could unmarshal the raw config using DockerOCIImageConfig to extract the healthcheck data, that structure is not used to construct the ensured image. Instead, we extract the healthcheck config separately as a *healthcheck.Healthcheck and embed it in the image labels. So even though ImageConfig itself doesn’t carry the healthcheck field, the relevant info is still preserved and accessible through labels.

Contributor:

Is there a reason to preserve the healthcheck in the image labels? From what I understand we only need the healthcheck info once, during container create/run; can we not do it by implementing and calling a new getImageConfigWithHealthCheck? It seems simpler that way.

Contributor (Author):

Refactoring to introduce getImageConfigWithHealthCheck() is turning out to be a bit tricky. The main issue is that EnsuredImage doesn't retain enough context; specifically, it doesn't store ImageConfigDesc, which is needed to locate and unmarshal the raw image config blob.

Since EnsuredImage is the only thing available during both create/run and inspect (that's where we need the healthcheck config), we'd need to thread additional data through or change its structure, which adds complexity.


// updateHealthStatus updates the health status based on the health check result
func updateHealthStatus(ctx context.Context, container containerd.Container, hcConfig *Healthcheck, hcResult *HealthcheckResult) error {
// Get current health status from health log
currentHealth, err := readHealthLog(ctx, container)
Contributor:

Reading logs to get the current health seems not so robust. I think we should at least consider having a DB.
@AkihiroSuda thoughts? Or is having a DB for this over-engineering the problem?

Member:

How could the DB improve the robustness?

Member:

Maybe ocihook should run healthcheck commands periodically, and serve the health status via a Unix socket?

Then there does not need to be a static health log file nor a DB.

Contributor:

A DB would support atomicity of the writes.
In the case of the OCI hook, is the status stored in the runc process memory? I'm not very familiar with it.

Member:

ocihook is a process executed on events such as "createRuntime" and "postStop":

switch event {
case "createRuntime":
return onCreateRuntime(opts)
case "postStop":
return onPostStop(opts)
default:
return fmt.Errorf("unexpected event %q", event)
}

https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle

ocihook is currently used for CNI, logging, etc.
Periodic health checker could be added here, and it could serve gRPC (or REST) over a Unix socket to provide the latest health status.

Contributor:

db would support atomicity of the writes.

There is an effort to consolidate atomic, concurrency-safe filesystem operations in internal/filesystem. If needed, these can be used instead of filesystem access in almost all cases.

Generally, IMHO, we should avoid using the filesystem as an API and instead rely on a higher-level abstraction (e.g. store.Store). Leaking implementation details (filesystem or DB) into the consumer is bad API design; having a storage API will allow swapping out the fs implementation for something else if need be.

Contributor (Author):

Thanks for the feedback. I'll update the PR to store current health state and failing streak in the container labels, so we no longer need to read the log file during each health check to get current state.

Health check results are still written to a log file, and during inspect we fetch the last five entries to provide recent history. For health.json operations, we’ll use internal/filesystem along with proper locking as recommended to ensure concurrency safety.

@@ -629,6 +648,15 @@ func ImageFromNative(nativeImage *native.Image) (*Image, error) {
ExposedPorts: portSet,
}

// Add health check if present in labels
if healthStr, ok := imgOCI.Config.Labels[labels.HealthCheck]; ok && healthStr != "" {
Contributor:

This part seems confusing to me. I would expect imgOCI.Config.Healthcheck to have it if we parse the JSON of the image's config.json.

@subashkotha (Contributor, Author) commented Jun 2, 2025:

The OCI image spec doesn’t include a healthcheck field; that’s why we add the healthcheck info to labels when ensuring the image, and during docker-compat inspect we add it back to the image config.

Not sure if I got your question right

@subashkotha subashkotha force-pushed the health_checks branch 2 times, most recently from 6b91b90 to 5fb121a Compare June 6, 2025 22:29
@subashkotha subashkotha changed the title feat: Add healthcheck support in nerdctl feat: Add healthcheck command in nerdctl Jun 6, 2025
@Shubhranshu153 (Contributor):

LGTM. Let's fix the Windows test.


// writeHealthLog writes the latest health check result to the log file, appending it to existing logs.
func writeHealthLog(ctx context.Context, container containerd.Container, result *HealthcheckResult) error {
stateDir, err := getContainerStateDir(ctx, container)
Contributor:

Instead of writing the log file to the container state directory, should we create a health-specific store? That way we avoid unnecessarily locking the state dir and preventing other threads from writing to it.

@@ -0,0 +1,261 @@
/*
Contributor:

Nit: calling it log.go is not very accurate. Maybe we can split the log-specific methods and the containerd metadata update methods into their own files.

5 participants