This repository was archived by the owner on Jan 25, 2019. It is now read-only.

Support for multiple resource types and controller-runtime refactor #43

Closed
wants to merge 10 commits

Conversation

joelanford
Member

This PR is meant to fix #41 and #42.

It is based on new scaffolding produced by running the following command from a build of operator-sdk's refactor/controller-runtime branch:

$ operator-sdk new helm-app-operator

It also adds support for watching multiple resource types and managing releases from multiple helm charts using a new YAML file for configuration. Support for the existing environment variables is maintained if the configuration file is not defined/present.
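
For example, the watches file could look something like this (the schema shown here is illustrative, not a final spec):

# watches.yaml: one entry per watched resource type / helm chart pair
- group: example.com
  version: v1alpha1
  kind: ExampleApp
  chart: /opt/helm/helm-charts/example-app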

Much of the design inspiration is taken from github.com/water-hole/ansible-operator

Other improvements were also made:

  • Automatic support for in-cluster and out-of-cluster mode.
  • Minor improvements to input validation (e.g., checking that helm chart directories are defined).

}

// Emit events once per minute regardless of detected changes.
r := newReconcileLoop(time.Minute*1, options.GVK, mgr.GetClient())

Collaborator

I'm wondering if we really need this (both here and in the ansible-operator).
@shawn-hurley Can you comment on why we can't just set the resync option for the manager (which sets it for the cache)?
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/manager.go#L91-L95

syncPeriod := time.Minute * 1
mgr, err := manager.New(cfg, manager.Options{Namespace: namespace, SyncPeriod: &syncPeriod})

Collaborator

Discussed offline.
For our use case the manager's resync is almost equivalent to the reconcile loop.
kubernetes-sigs/controller-runtime#88 (comment)
So we can use either.

However, the recently added RequeueAfter is probably the best way to periodically re-queue and re-trigger the reconcile for each CR.
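
Something like this, roughly (a sketch against the reconcile API; the reconciler type name is made up):

func (r *HelmAppReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... install or update the helm release for the CR ...

	// Re-queue the request so this CR is reconciled again in a minute,
	// even if no watch events arrive in the meantime.
	return reconcile.Result{RequeueAfter: time.Minute}, nil
}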

Member Author

Agreed. I'll make the change and verify expected behavior.

_, ok := s.(map[string]interface{})
if !ok {
	u.Object["spec"] = map[string]interface{}{}
	r.Client.Update(context.TODO(), u)

Collaborator

Check return error.

}

// Update the CR with the updated status.
r.Client.Update(context.TODO(), u)

Collaborator

Check return error.
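
For instance (a minimal sketch matching the snippet above):

// Update the CR and surface any error so the request gets re-queued
// instead of the failure being dropped silently.
if err := r.Client.Update(context.TODO(), u); err != nil {
	return reconcile.Result{}, err
}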

if apierrors.IsNotFound(err) {
	u, err = r.Installer.UninstallRelease(u)
	if err != nil {
		log.Print(err)

Collaborator

This isn't too important for this PR, but how about using logrus or something else for more structured logging (info, error, warning, debug)?
We still haven't decided on the default logger that the SDK should use, since that impacts all users. operator-framework/operator-sdk#503
But we can pick one for the helm-operator.

Member Author

It probably makes the most sense to go with the outcome of operator-framework/operator-sdk#503, since there's discussion and voting there about the preferences and technical merits of various loggers (and just for the sake of consistency). I'll open a new issue to follow up once that decision is made.

Member Author

New issue #44 to track logging updates.

m := map[schema.GroupVersionKind]Installer{}

// Determine the watches file to use
watchesFile, ok := os.LookupEnv(HelmChartWatchesEnvVar)

Collaborator

The HelmChartWatchesEnvVar could be set to empty, i.e. HELM_CHART_WATCHES="".
In that case ok=true and watchesFile="", so we'll end up trying to load the watches file, right?

Member Author

We would, in which case there would be an error returned by NewFromWatches. I figured that if the environment variable is set (regardless of its value), we should use it since the user explicitly set it.

I could go either way on this though. We could also fall back to the default file for the empty string case.
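
If we did fall back, the lookup could look roughly like this (a sketch; the default-path constant is hypothetical):

watchesFile, ok := os.LookupEnv(HelmChartWatchesEnvVar)
if !ok || watchesFile == "" {
	// Unset or explicitly empty: fall back to the default watches file.
	watchesFile = defaultHelmChartWatchesFile // hypothetical default path
}
return NewFromWatches(watchesFile)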

Member Author

Added a commit to make the logic in this section of code more obvious.

@hasbro17
Collaborator

hasbro17 commented Oct 5, 2018

It seems periodic reconciles will result in increasing release numbers even when there are no changes to the CR.
helm/helm#3177
But we do need to periodically re-trigger reconciles to account for changes to unwatched (created) resources.

@joelanford
Member Author

Agreed. I'll remove a check that skips upgrades for unchanged releases to allow unwatched resources to be reconciled. I'll investigate the ramifications of and possible workarounds for increasing helm release version numbers.

@joelanford
Member Author

I found one pretty significant issue with the constantly incrementing release version numbers. When we complete a release, we update the CR status with the full contents of the release, which changes every time we reconcile. Since the CR status is updated, it causes an immediate re-reconcile with the new status, which then triggers another update with an incremented release number, and another reconciliation and so on, forever. It looks like our operator gets quickly rate-limited, but this is clearly something that needs to be resolved.

A release update can be executed as a dry-run, which will produce the new complete manifest without making any changes to the tiller release cache or to the cluster state. As long as release manifests are created deterministically, comparing manifests may be the best way to determine whether a chart release needs to be made vs. just reconciling the existing resources associated with the chart.
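
In other words, the decision could boil down to something like this (a rough sketch; how the two manifests are obtained from tiller is elided, and the helper names are hypothetical):

// candidateManifest: rendered output of a dry-run upgrade.
// deployedManifest: manifest stored with the currently deployed release.
if candidateManifest == deployedManifest {
	// Nothing changed in the chart output: just reconcile the existing
	// resources that belong to the release; no new release version.
	err = reconcileResources(deployedManifest) // hypothetical helper
} else {
	// The rendered output changed: perform a real upgrade, which creates
	// a new release version.
	err = upgradeRelease(u) // hypothetical helper
}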

So the question is how to reconcile arbitrary resources. Does controller-runtime or client-go have any helpers to do the equivalent of kubectl apply -f manifest.yaml?

@hasbro17
Collaborator

hasbro17 commented Oct 8, 2018

One way to ignore an event resulting from a CR status update is via the status subresource
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#status-subresource

When the status subresource is enabled for a CRD, any change to cr.spec will increment the metadata.generation field, while changes to cr.status will not.

And we can filter out Update events that don't have the generation changed via a predicate in our controller:

pred := predicateForGenerationChange()
err = c.Watch(&source.Kind{Type: helmApp}, &handler.EnqueueRequestForObject{}, pred)

...

func predicateForGenerationChange() predicate.Predicate {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			return true
		},
		UpdateFunc: func(e event.UpdateEvent) bool {
			// extract and compare metadata.generation between old and new object
			return generationNotEqual(e.ObjectOld, e.ObjectNew)
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			return true
		},
	}
}
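
For reference, generationNotEqual could be written along these lines (a sketch, assuming the old/new objects are runtime.Objects; uses meta.Accessor from k8s.io/apimachinery/pkg/api/meta):

func generationNotEqual(oldObj, newObj runtime.Object) bool {
	oldMeta, err := meta.Accessor(oldObj)
	if err != nil {
		return true // fail open: enqueue if metadata can't be read
	}
	newMeta, err := meta.Accessor(newObj)
	if err != nil {
		return true
	}
	return oldMeta.GetGeneration() != newMeta.GetGeneration()
}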

The only problem with the above is that the status subresource isn't enabled by default for k8s 1.10; it's alpha and feature-gated. It's beta in k8s 1.11 and 1.12 and should be enabled by default there.

For 1.10 we'll need to figure out another way besides the generation.

And for kubectl apply -f, what kind of helper did you mean?
Did you mean doing something like a CreateOrUpdate() on a manifest.yaml?

* Replacing custom reconciliation loop with controller-runtime's reconcile.Result.RequeueAfter
* Fixing issue with infinite incrementing of helm release version numbers
* Adding resource reconciliation even when helm release is unchanged
* Improving logging
@joelanford
Member Author

I think this latest commit solves the problem without the need for the status subresource, though that's not something I knew about, so thanks for the tip.

For kubectl apply-like functionality, I went with a create or patch style reconciliation. I had first tried create or replace logic, but that caused various failures. For example, ClusterIP is an immutable Service field, so replacing the spec failed.
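
Roughly, per resource in the release manifest (a simplified sketch; the patch helper is hypothetical):

// Try to create the resource rendered from the chart.
err := r.Client.Create(context.TODO(), o)
if apierrors.IsAlreadyExists(err) {
	// It already exists: patch it rather than replacing it, so
	// server-populated immutable fields (e.g. a Service's clusterIP)
	// are left untouched.
	err = patchResource(r.Client, o) // hypothetical helper
}
if err != nil {
	return err
}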

I think this raises a question about whether we should support helm's force option, and if so, how that would be conveyed in a CR (perhaps an annotation?). See helm/helm#1958

@joelanford
Member Author

Added another commit which fixes a few more bugs:

Release retries
If a helm release update fails, tiller stores the failed release in its storage backend with details about the failure. During the next reconciliation, we pull the latest release from the storage backend, not the latest deployed release.

Most likely, the latest release is the one that just failed and now we're retrying it, so the latest release and the candidate release are identical. In this case, we reconcile the underlying resources when we should be retrying the failed update.

Fix: use the latest deployed release, not the latest release.

Failing uninstalls
Now that we're using reconcile.Result.RequeueAfter, the reconciliation loop runs one more time after the watched object has been deleted. The uninstall logic can't find a release for the now-deleted CR (its release has already been uninstalled), resulting in repeated reconciliation failures.

Fix: ignore missing release errors during uninstalls.

* new functions for resource reconciliation
* moved New* functions to a separate file
@joelanford
Member Author

@hasbro17 - This should be ready for another look.

The PR is getting pretty big and has now expanded beyond the original scope. I'm wondering if it should be split into multiple PRs?

There's probably still a bit more to do as well:

  • Updates to docs and examples
  • Repo layout updates, to more closely align with ansible-operator's layout

@alecmerdler - Are there any community members that should be aware of these proposed changes?

@alecmerdler
Member

@joelanford Nope, feel free to make whatever changes necessary.

@hasbro17
Collaborator

hasbro17 commented Oct 9, 2018

@joelanford If you could break this PR up into two, that would be great.

Perhaps a more logical split would be one for changes to the installer and another one for the new controller-runtime refactoring. But I'll let you decide how best to split it up.
You can keep this one open until then.

And yeah, updating the docs and the project layout can be done in follow-up PRs after that.

@joelanford
Member Author

Closing since this was split up among multiple PRs that have been merged (#45, #46, #47, #48, #51)

@joelanford joelanford closed this Oct 16, 2018
@joelanford joelanford deleted the refactor/controller-runtime branch October 19, 2018 18:06