feat: Add healthcheck command in nerdctl #4302
Conversation
pkg/healthcheck/log.go
Outdated
path := filepath.Join(stateDir, HealthLogFilename)
return os.WriteFile(path, data, 0o600)
A mutex only protects you from concurrent access within a single process; you might still have concurrent access across distinct binary invocations. Also, this is not an atomic write, so the file might be left incomplete.
You may either use a filesystem lock and an atomic write (see internal/filesystem), or leverage store.Store, which provides higher-level (safe) storage primitives.
The same comment applies to your reads further down, which do not take a lock.
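Not the actual internal/filesystem API, but a minimal sketch of the lock-plus-atomic-rename pattern being suggested, assuming github.com/gofrs/flock for the advisory lock; the helper name is illustrative:

```go
package healthcheck

import (
	"os"
	"path/filepath"

	"github.com/gofrs/flock" // one possible file-locking library; illustrative only
)

// writeFileAtomically is a sketch of a concurrency-safe write: an advisory
// lock serializes distinct nerdctl invocations, and a temp-file-plus-rename
// makes the write atomic so readers never observe a partially written file.
func writeFileAtomically(path string, data []byte) error {
	lock := flock.New(path + ".lock")
	if err := lock.Lock(); err != nil {
		return err
	}
	defer lock.Unlock()

	tmp, err := os.CreateTemp(filepath.Dir(path), ".health-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless if the rename below already succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```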
There is a state.Store already that could be used / expanded to manage that log file as well.
Ack, I'll switch to using internal/filesystem with locking to ensure atomic and concurrency-safe read/write operations.
CI is failing on Windows.
pkg/cmd/container/create.go
Outdated
options.HealthStartPeriod != 0 ||
	options.HealthStartInterval != 0

if options.NoHealthcheck {
> Disable any container-specified HEALTHCHECK

This should return a nil error: we disable the health check if one is present, and simply ignore it if there is none.
So the top level of this call should return with the health check set to NONE.
We still need to persist NONE as part of the health check config, so that when the user inspects the container they see NONE in the health check configuration (similar to Docker). This way, the user can confirm that they explicitly disabled health checks via CLI flags, even if the underlying image has a health check configured.
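For illustration, a self-contained sketch of persisting that explicit NONE; the Healthcheck struct here just mirrors Docker's HealthConfig shape and is an assumption, not necessarily nerdctl's actual type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Healthcheck mirrors Docker's HealthConfig; the field name is an assumption
// for illustration, not necessarily nerdctl's exact definition.
type Healthcheck struct {
	Test []string `json:"Test,omitempty"`
}

func main() {
	noHealthcheck := true // what a --no-healthcheck style flag would set

	hc := &Healthcheck{}
	if noHealthcheck {
		// Keep an explicit NONE (Docker-compatible) instead of dropping the
		// config, so inspect can show that the user disabled health checks.
		hc = &Healthcheck{Test: []string{"NONE"}}
	}

	out, _ := json.Marshal(hc)
	fmt.Println(string(out)) // {"Test":["NONE"]}
}
```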
// Start with health checks in image if present
hc := &healthcheck.Healthcheck{}
if ensuredImage != nil && ensuredImage.ImageConfig.Labels != nil {
	if label := ensuredImage.ImageConfig.Labels[labels.HealthCheck]; label != "" {
This logic doesn't seem right. The health check is embedded in the image by BuildKit, so we need to retrieve it from HealthcheckConfig; I don't think BuildKit writes it to the health check label.
Looks like the ImageConfig in the ensuredImage is read from getImageConfig, which unmarshals the config blob into the ocispec.ImageConfig type. To retrieve the healthcheck config, can we not use DockerOCIImageConfig instead?
The ImageConfig in the ensured image is of type ocispec.ImageConfig, which doesn't include a healthcheck field. While we could unmarshal the raw config using DockerOCIImageConfig to extract the healthcheck data, that structure is not used to construct the ensured image. Instead, we extract the healthcheck config separately into *healthcheck.Healthcheck and embed it in the image labels. So even though ImageConfig itself doesn't carry the healthcheck field, the relevant info is still preserved and accessible through labels.
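For reference, a self-contained sketch of pulling the HEALTHCHECK out of a raw image config blob by unmarshalling only the fields that are needed; the struct here is illustrative and not nerdctl's actual DockerOCIImageConfig:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// imageConfigWithHealth keeps only the slice of the image config we care
// about: Docker stores the HEALTHCHECK under the inner "config" object.
type imageConfigWithHealth struct {
	Config struct {
		Healthcheck *struct {
			Test        []string      `json:"Test,omitempty"`
			Interval    time.Duration `json:"Interval,omitempty"`
			Timeout     time.Duration `json:"Timeout,omitempty"`
			StartPeriod time.Duration `json:"StartPeriod,omitempty"`
			Retries     int           `json:"Retries,omitempty"`
		} `json:"Healthcheck,omitempty"`
	} `json:"config"`
}

func main() {
	// A trimmed-down image config blob, as BuildKit/Docker would write it.
	raw := []byte(`{"config":{"Healthcheck":{"Test":["CMD-SHELL","curl -f http://localhost/ || exit 1"],"Interval":30000000000,"Retries":3}}}`)

	var cfg imageConfigWithHealth
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Config.Healthcheck.Test, cfg.Config.Healthcheck.Interval, cfg.Config.Healthcheck.Retries)
	// Output: [CMD-SHELL curl -f http://localhost/ || exit 1] 30s 3
}
```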
Is there a reason to preserve the healthcheck in the image labels? From what I understand, we only need the healthcheck info once, during container create/run, so can we not do it by just implementing and calling a new getImageConfigWithHealthCheck? It seems simpler that way.
Refactoring to introduce getImageConfigWithHealthCheck() is turning out to be a bit tricky. The main issue is that EnsuredImage doesn't retain enough context: specifically, it doesn't store ImageConfigDesc, which is needed to locate and unmarshal the raw image config blob.
Since EnsuredImage is the only thing available during both create/run and inspect (which is where we need the healthcheck config), we'd need to thread additional data through or change its structure, which adds complexity.
pkg/healthcheck/executor.go
Outdated
// updateHealthStatus updates the health status based on the health check result
func updateHealthStatus(ctx context.Context, container containerd.Container, hcConfig *Healthcheck, hcResult *HealthcheckResult) error {
	// Get current health status from health log
	currentHealth, err := readHealthLog(ctx, container)
Reading logs to get the current health seems not very robust. I think we should at least consider the possibility of having a DB.
@AkihiroSuda thoughts? Or is having a DB for this over-engineering the problem?
How could the DB improve the robustness?
Maybe ocihook should run healthcheck commands periodically and serve the health status via a Unix socket? Then there would be no need for a static health log file or a DB.
A DB would support atomicity of the writes.
In the case of the OCI hook, is the status stored in the runc process memory? I'm not very familiar with it.
ocihook is a process executed on events such as "createRuntime" and "postStop":
nerdctl/pkg/ocihook/ocihook.go, lines 123 to 130 in b8c4b3d
switch event {
case "createRuntime":
	return onCreateRuntime(opts)
case "postStop":
	return onPostStop(opts)
default:
	return fmt.Errorf("unexpected event %q", event)
}
https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle
ocihook is currently used for CNI, logging, etc.
A periodic health checker could be added here, and it could serve gRPC (or REST) over a Unix socket to provide the latest health status.
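Not part of this PR, but a rough sketch (REST flavor) of serving the latest status over a Unix socket; the socket path, type names, and status strings here are all hypothetical:

```go
package main

import (
	"encoding/json"
	"net"
	"net/http"
	"sync"
)

// healthServer holds the most recent health status computed by a periodic
// checker and exposes it over a Unix socket.
type healthServer struct {
	mu     sync.RWMutex
	status string // e.g. "starting", "healthy", "unhealthy"
}

func (s *healthServer) setStatus(status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status = status
}

func (s *healthServer) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"Status": s.status})
}

func main() {
	srv := &healthServer{status: "starting"}
	// A periodic checker (e.g. driven by the hook process) would call
	// srv.setStatus(...) after each probe; omitted here.
	l, err := net.Listen("unix", "/tmp/healthcheck.sock") // hypothetical socket path
	if err != nil {
		panic(err)
	}
	defer l.Close()
	if err := http.Serve(l, srv); err != nil {
		panic(err)
	}
}
```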
> A DB would support atomicity of the writes.

There is an effort to consolidate atomic, concurrency-safe filesystem operations in internal/filesystem. If needed, these can be used instead of raw filesystem access in almost all cases.
Generally, IMHO we should avoid using the filesystem as an API and instead rely on a higher-level abstraction (e.g. store.Store). Leaking implementation details (filesystem or DB) into the consumer makes for a bad API, and a storage API will allow swapping the filesystem implementation out for something else if need be.
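To make the point concrete, this is roughly the shape of abstraction being argued for; it is an illustrative interface, not nerdctl's actual store.Store:

```go
package healthcheck

// Store is a hypothetical storage abstraction: the consumer says what to
// persist, and the implementation (filesystem, DB, ...) decides how,
// including locking and atomic writes.
type Store interface {
	// Get returns the value stored under key.
	Get(key ...string) ([]byte, error)
	// Set atomically persists data under key.
	Set(data []byte, key ...string) error
	// WithLock runs fn while holding an exclusive lock on the store.
	WithLock(fn func() error) error
}
```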
Thanks for the feedback. I'll update the PR to store the current health state and failing streak in the container labels, so we no longer need to read the log file during each health check to get the current state.
Health check results are still written to a log file, and during inspect we fetch the last five entries to provide recent history. For health.json operations, we'll use internal/filesystem along with proper locking, as recommended, to ensure concurrency safety.
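A rough sketch of what that label-backed state could look like; the label key and field names are assumptions based on this thread, not the final implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthStateLabel is the (assumed) container label that caches the current
// health state so each probe does not have to re-read the log file.
const healthStateLabel = "nerdctl/healthstate"

// HealthState is the small piece of state carried between probes.
type HealthState struct {
	Status        string `json:"Status"`        // "starting", "healthy" or "unhealthy"
	FailingStreak int    `json:"FailingStreak"` // consecutive failed probes
}

// healthStateLabels marshals the state into a label map, e.g. for
// containerd's Container.SetLabels.
func healthStateLabels(st HealthState) (map[string]string, error) {
	data, err := json.Marshal(st)
	if err != nil {
		return nil, err
	}
	return map[string]string{healthStateLabel: string(data)}, nil
}

func main() {
	lbls, _ := healthStateLabels(HealthState{Status: "unhealthy", FailingStreak: 3})
	fmt.Println(lbls[healthStateLabel]) // {"Status":"unhealthy","FailingStreak":3}
}
```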
@@ -629,6 +648,15 @@ func ImageFromNative(nativeImage *native.Image) (*Image, error) {
	ExposedPorts: portSet,
}

// Add health check if present in labels
if healthStr, ok := imgOCI.Config.Labels[labels.HealthCheck]; ok && healthStr != "" {
This part seems confusing to me. I would expect imgOCI.Config.Healthcheck to have it, if we parse the JSON of the image's config.json.
The OCI image spec doesn't include a healthcheck field; that's why we add the healthcheck info to labels when ensuring the image, and during Docker-compat inspect we add it to the image config.
Not sure if I got your question right.
LGTM. Let's fix the Windows test.
// writeHealthLog writes the latest health check result to the log file, appending it to existing logs.
func writeHealthLog(ctx context.Context, container containerd.Container, result *HealthcheckResult) error {
	stateDir, err := getContainerStateDir(ctx, container)
Instead of writing the log file to the container state directory, should we create a health-specific store? That way we avoid unnecessarily locking the state dir and preventing other threads from writing to it.
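For illustration, a dedicated health store could live under its own directory so its lock does not contend with other state-dir writers; the path layout and helper name below are hypothetical:

```go
package healthcheck

import "path/filepath"

// healthLogPath sketches a hypothetical layout for a health-specific store,
// e.g. <dataStore>/healthcheck/<namespace>/<containerID>/health.json, kept
// apart from the container state directory so locking it does not block
// unrelated writers.
func healthLogPath(dataStore, namespace, containerID string) string {
	return filepath.Join(dataStore, "healthcheck", namespace, containerID, "health.json")
}
```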
@@ -0,0 +1,261 @@
/*
Nit: calling it log.go is not very accurate. Maybe we can split the log-specific methods and the containerd metadata update methods into their own files.
This PR adds support for Docker-compatible health checks in nerdctl run and nerdctl create, and introduces a new nerdctl container healthcheck command to manually trigger health checks.

Key features:
- nerdctl container healthcheck command to manually run health checks
- Health check configuration carried in the nerdctl/healthcheck label
- Current health state stored in the nerdctl/healthstate label for quick access
- Health check results written to a health.json file in the container's runtime state directory

Future work (WIP):

Related issue - #4157
cc: @Shubhranshu153 @AkihiroSuda