This repository was archived by the owner on Jan 25, 2019. It is now read-only.

Support for multiple resource types and controller-runtime refactor #43

Closed
wants to merge 10 commits

Conversation

joelanford
Member

This PR is meant to fix #41 and #42.

It is based on new scaffolding produced by running the following command from a build of operator-sdk's refactor/controller-runtime branch:

$ operator-sdk new helm-app-operator

It also adds support for watching multiple resource types and managing releases from multiple helm charts using a new YAML file for configuration. Support for the existing environment variables is maintained if the configuration file is not defined/present.
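
For example, the watches file could look something like this (the schema shown here is illustrative, not a final spec):

# watches.yaml: one entry per watched resource type / helm chart pair
- group: example.com
  version: v1alpha1
  kind: ExampleApp
  chart: /opt/helm/helm-charts/example-app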

Much of the design inspiration is taken from github.com/water-hole/ansible-operator

Other improvements were also made:

  • Automatic support for in-cluster and out-of-cluster mode.
  • Minor improvements to input validation (e.g., checking that helm chart directories are defined).

}

// Emit events once per minute regardless of detected changes.
r := newReconcileLoop(time.Minute*1, options.GVK, mgr.GetClient())

Collaborator

I'm wondering if we really need this (both here and in the ansible-operator).
@shawn-hurley Can you comment on why we can't just set the resync option for the manager (which sets it for the cache)?
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/manager.go#L91-L95

syncPeriod := time.Minute * 1
mgr, err := manager.New(cfg, manager.Options{Namespace: namespace, SyncPeriod: &syncPeriod})

Collaborator

Discussed offline.
For our use case the manager's resync is almost equivalent to the reconcile loop.
kubernetes-sigs/controller-runtime#88 (comment)
So we can use either.

However, the recently added RequeueAfter is probably the best way to periodically re-queue and re-trigger the reconcile for each CR.
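
Something like this, roughly (a sketch against the reconcile API; the reconciler type name is made up):

func (r *HelmAppReconciler) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... install or update the helm release for the CR ...

	// Re-queue the request so this CR is reconciled again in a minute,
	// even if no watch events arrive in the meantime.
	return reconcile.Result{RequeueAfter: time.Minute}, nil
}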

Member Author

Agreed. I'll make the change and verify expected behavior.

_, ok := s.(map[string]interface{})
if !ok {
	u.Object["spec"] = map[string]interface{}{}
	r.Client.Update(context.TODO(), u)

Collaborator

Check return error.

}

// Update the CR with the updated status.
r.Client.Update(context.TODO(), u)

Collaborator

Check return error.
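
For instance (a minimal sketch matching the snippet above):

// Update the CR and surface any error so the request gets re-queued
// instead of the failure being dropped silently.
if err := r.Client.Update(context.TODO(), u); err != nil {
	return reconcile.Result{}, err
}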

if apierrors.IsNotFound(err) {
	u, err = r.Installer.UninstallRelease(u)
	if err != nil {
		log.Print(err)

Collaborator

This isn't too important for this PR, but how about using logrus or something else for more structured logging (info, error, warning, debug)?
We still haven't decided on the default logger that the SDK should use, since that impacts all users. operator-framework/operator-sdk#503
But we can pick one for the helm-operator.

Member Author

It probably makes the most sense to go with the outcome of operator-framework/operator-sdk#503, since there's discussion and voting there about the preferences and technical merits of various loggers (and just for the sake of consistency). I'll open a new issue to follow up once that decision is made.

Member Author

New issue #44 to track logging updates.

m := map[schema.GroupVersionKind]Installer{}

// Determine the watches file to use
watchesFile, ok := os.LookupEnv(HelmChartWatchesEnvVar)

Collaborator

The HelmChartWatchesEnvVar could be set to empty, i.e. HELM_CHART_WATCHES="".
In that case ok=true and watchesFile="", so we'll end up trying to load the watches file, right?

Member Author

We would, in which case there would be an error returned by NewFromWatches. I figured that if the environment variable is set (regardless of its value), we should use it since the user explicitly set it.

I could go either way on this though. We could also fall back to the default file for the empty string case.
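
If we did fall back, the lookup could look roughly like this (a sketch; the default-path constant is hypothetical):

watchesFile, ok := os.LookupEnv(HelmChartWatchesEnvVar)
if !ok || watchesFile == "" {
	// Unset or explicitly empty: fall back to the default watches file.
	watchesFile = defaultHelmChartWatchesFile // hypothetical default path
}
return NewFromWatches(watchesFile)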

Member Author

Added a commit to make the logic in this section of code more obvious.

@hasbro17
Collaborator

hasbro17 commented Oct 5, 2018

It seems periodic reconciles will result in increasing release numbers even when there are no changes to the CR.
helm/helm#3177
But we do need to periodically re-trigger reconciles to account for changes to unwatched (created) resources.

@joelanford
Member Author

Agreed. I'll remove a check that skips upgrades for unchanged releases to allow unwatched resources to be reconciled. I'll investigate the ramifications of and possible workarounds for increasing helm release version numbers.

@joelanford
Member Author

I found one pretty significant issue with the constantly incrementing release version numbers. When we complete a release, we update the CR status with the full contents of the release, which changes every time we reconcile. Since the CR status is updated, it causes an immediate re-reconcile with the new status, which then triggers another update with an incremented release number, and another reconciliation and so on, forever. It looks like our operator gets quickly rate-limited, but this is clearly something that needs to be resolved.

A release update can be executed as a dry-run, which will produce the new complete manifest without making any changes to the tiller release cache or to the cluster state. As long as release manifests are created deterministically, comparing manifests may be the best way to determine whether a chart release needs to be made vs. just reconciling the existing resources associated with the chart.
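
In other words, the decision could boil down to something like this (a rough sketch; how the two manifests are obtained from tiller is elided, and the helper names are hypothetical):

// candidateManifest: rendered output of a dry-run upgrade.
// deployedManifest: manifest stored with the currently deployed release.
if candidateManifest == deployedManifest {
	// Nothing changed in the chart output: just reconcile the existing
	// resources that belong to the release; no new release version.
	err = reconcileResources(deployedManifest) // hypothetical helper
} else {
	// The rendered output changed: perform a real upgrade, which creates
	// a new release version.
	err = upgradeRelease(u) // hypothetical helper
}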

So the question is how to reconcile arbitrary resources. Does controller-runtime or client-go have any helpers to do the equivalent of kubectl apply -f manifest.yaml?

@hasbro17
Collaborator

hasbro17 commented Oct 8, 2018

One way to ignore an event resulting from a CR status update is via the status subresource
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#status-subresource

When the status subresource is enabled for a CRD, any change to cr.spec will increment the metadata.generation field, while changes to cr.status will not.

And we can filter out Update events that don't have the generation changed via a predicate in our controller:

pred := predicateForGenerationChange()
err = c.Watch(&source.Kind{Type: helmApp}, &handler.EnqueueRequestForObject{}, pred)

...

func predicateForGenerationChange() predicate.Predicate {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			return true
		},
		UpdateFunc: func(e event.UpdateEvent) bool {
			// extract and compare metadata.generation between old and new object
			return generationNotEqual(e.ObjectOld, e.ObjectNew)
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			return true
		},
	}
}
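
For reference, generationNotEqual could be written along these lines (a sketch, assuming the old/new objects are runtime.Objects; uses meta.Accessor from k8s.io/apimachinery/pkg/api/meta):

func generationNotEqual(oldObj, newObj runtime.Object) bool {
	oldMeta, err := meta.Accessor(oldObj)
	if err != nil {
		return true // fail open: enqueue if metadata can't be read
	}
	newMeta, err := meta.Accessor(newObj)
	if err != nil {
		return true
	}
	return oldMeta.GetGeneration() != newMeta.GetGeneration()
}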

The only problem with the above is that the status subresource isn't enabled by default for k8s 1.10; it's alpha and feature-gated. It's beta in k8s 1.11 and 1.12 and should be enabled by default there.

For 1.10 we'll need to figure out another way besides the generation.

And for kubectl apply -f, what kind of helper did you mean?
Did you mean doing something like a CreateOrUpdate() on a manifest.yaml?

* Replacing custom reconciliation loop with controller-runtime's reconcile.Result.RequeueAfter
* Fixing issue with infinite incrementing of helm release version numbers
* Adding resource reconciliation even when helm release is unchanged
* Improving logging
@joelanford
Member Author

I think this latest commit solves the problem without the need for the status subresource, though that's not something I knew about, so thanks for the tip.

For kubectl apply-like functionality, I went with a create or patch style reconciliation. I had first tried create or replace logic, but that caused various failures. For example, ClusterIP is an immutable Service field, so replacing the spec failed.
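
Roughly, per resource in the release manifest (a simplified sketch; the patch helper is hypothetical):

// Try to create the resource rendered from the chart.
err := r.Client.Create(context.TODO(), o)
if apierrors.IsAlreadyExists(err) {
	// It already exists: patch it rather than replacing it, so
	// server-populated immutable fields (e.g. a Service's clusterIP)
	// are left untouched.
	err = patchResource(r.Client, o) // hypothetical helper
}
if err != nil {
	return err
}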

I think this raises a question about whether we should support helm's force option, and if so, how that would be conveyed in a CR (perhaps an annotation?). See helm/helm#1958

@joelanford
Member Author

Added another commit which fixes a few more bugs:

Release retries
If a helm release update fails, tiller stores the failed release in its storage backend with details about the failure. During the next reconciliation, we pull the latest release from the storage backend, not the latest deployed release.

Most likely, the latest release is the one that just failed and now we're retrying it, so the latest release and the candidate release are identical. In this case, we reconcile the underlying resources when we should be retrying the failed update.

Fix: use the latest deployed release, not the latest release.

Failing uninstalls
Now that we're using reconcile.Result.RequeueAfter, the reconciliation loop runs one more time after the watched object has been deleted. The uninstall logic can't find a release for the now-deleted CR (its release has already been uninstalled), resulting in repeated reconciliation failures.

Fix: ignore missing release errors during uninstalls.

* new functions for resource reconciliation
* moved New* functions to a separate file
@joelanford
Member Author

@hasbro17 - This should be ready for another look.

The PR is getting pretty big and has now expanded beyond the original scope. I'm wondering if it should be split into multiple PRs?

There's probably still a bit more to do as well:

  • Updates to docs and examples
  • Repo layout updates, to more closely align with ansible-operator's layout

@alecmerdler - Are there any community members that should be aware of these proposed changes?

@alecmerdler
Member

@joelanford Nope, feel free to make whatever changes necessary.

@hasbro17
Collaborator

hasbro17 commented Oct 9, 2018

@joelanford If you could break this PR up into two, that would be great.

Perhaps a more logical split would be one for changes to the installer and another one for the new controller-runtime refactoring. But I'll let you decide how best to split it up.
You can keep this one open until then.

And yeah, updating the docs and the project layout can be done in follow-up PRs after that.

@joelanford
Member Author

Closing since this was split up among multiple PRs that have been merged (#45, #46, #47, #48, #51)

@joelanford joelanford closed this Oct 16, 2018
@joelanford joelanford deleted the refactor/controller-runtime branch October 19, 2018 18:06