
Bug 1849164: don't store full manifests in installplan status (for bundle images) #1589

Merged

Conversation


@ecordell ecordell commented Jun 23, 2020

Description of the change:
If a manifest comes from a bundle image, we now store a reference to the configmap that contains the manifest in the installplan status, instead of storing the manifest itself. When installing, OLM looks up the referenced configmap and applies the correct manifest from it.

One additional change was necessary: role/rolebinding name generation is now deterministic. Some resources for an installplan are generated by OLM rather than read directly from the unpacked bundle, and we need stable identities for them so that we can track the status of an individual step.
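
For illustration, deterministic generation can be as simple as hashing the inputs and appending the result to a (possibly truncated) base name. This is a minimal sketch in the spirit of the change, not the exact helpers from this PR; the maxNameLength cap is an assumption:

    package rbac

    import (
        "fmt"
        "hash/fnv"

        utilrand "k8s.io/apimachinery/pkg/util/rand"
    )

    // maxNameLength is an assumed cap; Kubernetes object names are length-limited.
    const maxNameLength = 63

    // hashObject returns a short, deterministic suffix derived from its input.
    func hashObject(input string) string {
        hasher := fnv.New32a()
        hasher.Write([]byte(input))
        return utilrand.SafeEncodeString(fmt.Sprint(hasher.Sum32()))
    }

    // generateName appends the hash to base, truncating base so the result
    // stays within maxNameLength.
    func generateName(base, hash string) string {
        if len(base)+len(hash) > maxNameLength {
            base = base[:maxNameLength-len(hash)]
        }
        return base + hash
    }

A RoleBinding name can then be derived from the Role and ServiceAccount names, so re-running resolution produces the same identity for the same step.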

Steps that are references will look like this:

    - resolving: ibm-cloud-databases-redis.v1.0.0
      resource:
        group: rbac.authorization.k8s.io
        kind: ClusterRoleBinding
        manifest: >-
         {"kind":"ConfigMap","name":"cbd102757181a13f9cee81c42815586023b4840d95543e67c1d68f1572e4667","namespace":"openshift-marketplace","catalogSourceName":"aspera","catalogSourceNamespace":"openshift-marketplace","replaces":""}
        name: ibm-cloud-databases-redis.v1.0.079b98b8bd4-i7bdf84b456
        sourceName: aspera
        sourceNamespace: openshift-marketplace
        version: v1
      status: Created

where the configmap references the unpacked bundle for that step.
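
For illustration, the reference payload shown above can be modelled and decoded roughly like this (a sketch with field names inferred from the example; the actual type and helper in the PR may differ):

    package catalog

    import (
        "encoding/json"

        "github.com/operator-framework/api/pkg/operators/v1alpha1"
    )

    // UnpackedBundleReference identifies the configmap that holds the unpacked
    // bundle contents for a step, instead of embedding the manifest itself.
    type UnpackedBundleReference struct {
        Kind                   string `json:"kind"`
        Name                   string `json:"name"`
        Namespace              string `json:"namespace"`
        CatalogSourceName      string `json:"catalogSourceName"`
        CatalogSourceNamespace string `json:"catalogSourceNamespace"`
        Replaces               string `json:"replaces"`
    }

    // refForStep decodes the step's manifest field as a reference; it returns
    // nil when the step carries an inlined manifest rather than a reference.
    func refForStep(step *v1alpha1.Step) *UnpackedBundleReference {
        var ref UnpackedBundleReference
        if err := json.Unmarshal([]byte(step.Resource.Manifest), &ref); err != nil {
            return nil
        }
        if ref.Kind != "ConfigMap" || ref.Name == "" {
            return nil
        }
        return &ref
    }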

Motivation for the change:
It is possible to hit size limits on the installplan object (for example the gRPC request size limit or the etcd value size limit) if resolution results in many large manifests. Bundle images are already unpacked into configmaps on-cluster, so persisting the unpacked manifests in the installplan is not necessary.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress label Jun 23, 2020
@openshift-ci-robot openshift-ci-robot added the approved label Jun 23, 2020
@ecordell ecordell force-pushed the ip-status-is-status branch from 8cb0757 to 9816718 on June 23, 2020 13:58
@ecordell ecordell changed the title from "[WIP] don't copy bundle contents to ip status" to "Bug 1849164: don't store full manifests in installplan status (for bundle images)" Jun 23, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent and bugzilla/valid-bug labels and removed the do-not-merge/work-in-progress label Jun 23, 2020
@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1849164, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ecordell ecordell force-pushed the ip-status-is-status branch from 9816718 to 7ad841b on June 23, 2020 14:52
@ecordell
Member Author

/retest

Member

@exdx exdx left a comment


This is a good quick-and-dirty solution to the issue at hand -- I was unaware that we already unpack all objects in the bundle into separate configmaps, but if so this makes the most sense.

Let's see how the tests perform, but I'm a little concerned about the potential for transient read errors (failed to find configmap) impacting the actual step resolution. We are basically adding a roundtrip to the api-server for each object in the bundle.

Longer term, we are adding more and more objects to the bundle. Eventually we want to support arbitrary kube objects in the bundle. This approach may still work in that case, but I'm curious how it will perform.

@@ -1418,9 +1453,79 @@ func (o *Operator) ExecutePlan(plan *v1alpha1.InstallPlan) error {
ensurer := newStepEnsurer(kubeclient, crclient, dynamicClient)
b := newBuilder(kubeclient, dynamicClient, o.csvProvidedAPIsIndexer, o.logger)

refForStep := func(step *v1alpha1.Step) *UnpackedBundleReference {
Member

I'm wondering if it's a bit clearer to hang the refForStep and manifestForStep functions off the builder struct, so that they are logically grouped with the object that executes the steps. From what I recall, we wanted to eventually move the step logic in there. It also helps the readability of ExecutePlan.

Member Author

I chose not to do that originally because the builder does not exist prior to 4.5, and I knew we intended to backport this.

But this is a good point - instead of adding to the builder, I'll make a new interface / method for resolving manifests and handling the step cache. In 4.5, the builder can depend on that interface, and in < 4.4 it can be used directly in the catalog operator.
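
A rough sketch of what such an interface could look like (the name and signature here are illustrative, not taken from the PR):

    // ManifestResolver resolves the manifest to apply for a step, reading it
    // from the referenced unpacked-bundle configmap when the step only stores
    // a reference instead of the manifest itself.
    type ManifestResolver interface {
        ManifestForStep(step *v1alpha1.Step) (string, error)
    }

In 4.5+ the builder could hold a ManifestResolver, while earlier branches could call an implementation directly from the catalog operator.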

for i, step := range plan.Status.Plan {
// TODO: should lazy load only when the manifest is actually needed
manifest, err := manifestForStep(step)
Member

See above -- I think linking this to step creation via the builder helps with the lazy loading.

for i, step := range plan.Status.Plan {
// TODO: should lazy load only when the manifest is actually needed
manifest, err := manifestForStep(step)
Member

nit: WDYT about having a type Manifest string and using that in the code? It helps readability, and we can hang methods off the type.

Member Author

Since the type on the API is string, we'd have to handle manifests that are already unpacked differently from manifests that are in configmaps. Instead, I'll move all manifest handling into its own file, which should make it easier to keep track of what's what.
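
For reference, the suggested (but not adopted) wrapper type might have looked roughly like this, reusing the UnpackedBundleReference sketch above:

    // Manifest wraps the raw string stored on a step so helpers can hang off it.
    type Manifest string

    // Reference decodes the manifest as an unpacked-bundle configmap reference,
    // returning false when the manifest is an inlined object instead.
    func (m Manifest) Reference() (*UnpackedBundleReference, bool) {
        var ref UnpackedBundleReference
        if err := json.Unmarshal([]byte(m), &ref); err != nil || ref.Kind != "ConfigMap" {
            return nil, false
        }
        return &ref, true
    }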

@@ -297,7 +297,7 @@ func (b *BundleLookup) SetCondition(cond BundleLookupCondition) BundleLookupCond
if existing.Type != cond.Type {
continue
}
if existing.Status == cond.Status {
if existing.Status == cond.Status && existing.Reason == cond.Reason {
Member

Should we open up a PR in api for this?

Member Author

Good call - I will do that.

Member

It would be nice to separate our utility methods from the data type in the future. What about defining decorator types in OLM that embed the data type instead?

@exdx
Member

exdx commented Jun 23, 2020

I will lgtm since this is an urgent bugfix
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label Jun 23, 2020
@ecordell
Member Author

/hold

I'd like to test with the set of operators we discovered this with before merging

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold label Jun 23, 2020
@ecordell
Member Author

Let's see how the tests perform but I'm a little concerned about the potential for transient read errors (failed to find configmap) impacting the actual step resolution.

Resyncs should cover those cases, but I agree that this needs more stress testing to get numbers.

We are basically adding a roundtrip to the api-server for each object in the bundle.

The current implementation doesn't do any extra roundtripping - it reads the configmap data from the cache. It also only goes to the cache once per bundle (not once per object in the bundle).

I wanted to do a followup that lazy loads the data and only checks if the step needs some work done (so it would skip installed steps / bundles for a sync). The only reason I didn't do that here is because I would've needed to change the stepper interface a lot more than I did (the stepper would need access to the same cache/listers), and I was wary of backporting that.
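
As a sketch of the access pattern described here (configMapLister is an assumed ConfigMap lister backed by the shared informer cache, corev1 is k8s.io/api/core/v1, and the helper names are illustrative):

    // Cache one configmap per referenced bundle so the lister is consulted only
    // once per bundle during a single ExecutePlan pass.
    bundleCMs := map[string]*corev1.ConfigMap{}

    getBundleConfigMap := func(ref *UnpackedBundleReference) (*corev1.ConfigMap, error) {
        key := ref.Namespace + "/" + ref.Name
        if cm, ok := bundleCMs[key]; ok {
            return cm, nil
        }
        // Reads from the informer cache, not the API server, so no extra roundtrip.
        cm, err := configMapLister.ConfigMaps(ref.Namespace).Get(ref.Name)
        if err != nil {
            return nil, err
        }
        bundleCMs[key] = cm
        return cm, nil
    }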

@exdx
Member

exdx commented Jun 23, 2020

The current implementation doesn't do any extra roundtripping - it reads the configmap data from the cache. It also only goes to the cache once per bundle (not once per object in the bundle).

That makes sense - forgot we have our informer cache to work with

I wanted to do a followup that lazy loads the data and only checks if the step needs some work done (so it would skip installed steps / bundles for a sync). The only reason I didn't do that here is because I would've needed to change the stepper interface a lot more than I did (the stepper would need access to the same cache/listers), and I was wary of backporting that.

That makes sense

@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1849164, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Member

@njhale njhale left a comment


This looks like a solid, highly backwards-compatible fix. I expect we'll revisit some of the abstractions later (maybe when we implement/design the v2 APIs).

I do, however, have a single wrench:

IIRC, bundles are unpacked -- in their entirety -- into a single ConfigMap, which should run into the resource storage limit before the point that this fix will help.

^ this is true in the case of a single big bundle, but the patch introduced in this PR still definitely helps in the case where the number of dependencies included in an InstallPlan is the primary contributor to its size.

manifestForStep := func(step *v1alpha1.Step) (string, error) {
manifest := step.Resource.Manifest
ref := refForStep(step)
if ref != nil {
Member

nit: return early if ref == nil to drop the indentation below
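
The early-return shape the reviewer suggests would look roughly like this (manifestFromConfigMap is a hypothetical helper standing in for the configmap lookup):

    manifestForStep := func(step *v1alpha1.Step) (string, error) {
        ref := refForStep(step)
        if ref == nil {
            // No reference: the manifest is stored inline on the step.
            return step.Resource.Manifest, nil
        }
        // Reference: pull this step's object out of the unpacked-bundle configmap.
        return manifestFromConfigMap(ref, step)
    }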

@@ -94,7 +113,7 @@ func RBACForClusterServiceVersion(csv *v1alpha1.ClusterServiceVersion) (map[stri
// Create RoleBinding
roleBinding := &rbacv1.RoleBinding{
ObjectMeta: metav1.ObjectMeta{
Name: generateName(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName)),
Name: generateName(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName), HashObject(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName))),
Member

nit: can this constitute an info leak?

Member

nit: can the Role and RoleBindings share a name since we're generating them 1:1?


func (o *Operator) unpackBundles(plan *v1alpha1.InstallPlan) (bool, *v1alpha1.InstallPlan, error) {
out := plan.DeepCopy()
unpacked := true
unpackedCount := len(out.Status.BundleLookups)
Member

Doesn't look like unpackedCount is being used to trigger any conditions.

Member Author

whoops - a planned refactor that I abandoned. will remove.

@ecordell
Member Author

IIRC, bundles are unpacked -- in their entirety -- into a single ConfigMap, which should run into the resource storage limit before the point that this fix will help.

There is a limit on how big a bundle can be right now, which is the amount that we can fit into a single configmap. This PR removes (well, greatly extends) the limit on how many bundles can be unpacked and installed via a single installplan. This limit was hit with 19 operators in a cluster.

@ecordell ecordell force-pushed the ip-status-is-status branch from 7ad841b to 57cdfe1 on June 24, 2020 02:10
@openshift-ci-robot openshift-ci-robot removed the lgtm label Jun 24, 2020
@ecordell ecordell force-pushed the ip-status-is-status branch from 57cdfe1 to d0e858d on June 24, 2020 02:19
@ecordell
Member Author

@exdx @njhale Thanks for the thorough reviews - I've addressed all the feedback and I think it's ready to go

return utilrand.SafeEncodeString(fmt.Sprint(hasher.Sum32()))
}

func generateName(base string, hash string) string {
Contributor

Is the contract of this function to return the concatenation of base+hash, truncating base such that the result is short enough to be an object name? I'm worried about slice range out-of-bound panics and overly-long return values. Couldn't base have length 127 in the invocation on L116?

Member Author

pushed tests for this - thanks for calling it out. see rbac_test.go

@njhale
Member

njhale commented Jun 24, 2020

/retest

"encoding/json"
"fmt"

"github.com/operator-framework/api/pkg/operators/v1alpha1"
Member

note: I've been asking myself whether we treat in-org dependencies as "external" for the purposes of import grouping. Thoughts?

the installplan status.plan field stores a list of all manifests that
will be installed when the installplan is approved.

if the installplan is resolving a large set of dependencies, it is
possible for it to exceed various limits in kube (grpc request size
limit, etcd value size limit, etc).

this commit changes the way objects are tracked if they come from
bundle images. in that case, the manifests are already unpacked onto
the cluster into a configmap. the installplan now stores a reference
to the source configmap for each step, rather than storing the manifest
itself. this allows the installplan to scale to a large number of
resolved objects.
@ecordell ecordell force-pushed the ip-status-is-status branch from d0e858d to a7659db on June 24, 2020 12:43
Member

@njhale njhale left a comment


Nice!

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label Jun 24, 2020
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ecordell
Member Author

/cherry-pick release-4.5
/cherry-pick release-4.4

@openshift-cherrypick-robot

@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.5
/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ecordell
Member Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold label Jun 24, 2020
@openshift-merge-robot openshift-merge-robot merged commit a6f90c9 into operator-framework:master Jun 24, 2020
@openshift-ci-robot
Collaborator

@ecordell: All pull requests linked via external trackers have merged: operator-framework/operator-lifecycle-manager#1589. Bugzilla bug 1849164 has been moved to the MODIFIED state.

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@ecordell: new pull request created: #1594

In response to this:

/cherry-pick release-4.5
/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
