
Bug 1849164: don't store full manifests in installplan status (for bundle images) #1589

Merged

Conversation


@ecordell ecordell commented Jun 23, 2020

Description of the change:
If a manifest comes from a bundle image, we now store a reference to the configmap that contains the manifest in the installplan status, instead of storing the manifest itself. When installing, OLM looks up the referenced configmap and applies the correct manifest from it.

One additional change was necessary: role/rolebinding name generation is now deterministic. Some resources for an installplan are generated by OLM rather than read directly from the unpacked bundle, and we need stable identities for them so that we can track the status of an individual step.
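
For illustration, deterministic generation can be as simple as hashing the inputs and appending the result to a (possibly truncated) base name. This is a minimal sketch in the spirit of the change, not the exact helpers from this PR; the maxNameLength cap is an assumption:

    package rbac

    import (
        "fmt"
        "hash/fnv"

        utilrand "k8s.io/apimachinery/pkg/util/rand"
    )

    // maxNameLength is an assumed cap; Kubernetes object names are length-limited.
    const maxNameLength = 63

    // hashObject returns a short, deterministic suffix derived from its input.
    func hashObject(input string) string {
        hasher := fnv.New32a()
        hasher.Write([]byte(input))
        return utilrand.SafeEncodeString(fmt.Sprint(hasher.Sum32()))
    }

    // generateName appends the hash to base, truncating base so the result
    // stays within maxNameLength.
    func generateName(base, hash string) string {
        if len(base)+len(hash) > maxNameLength {
            base = base[:maxNameLength-len(hash)]
        }
        return base + hash
    }

A RoleBinding name can then be derived from the Role and ServiceAccount names, so re-running resolution produces the same identity for the same step.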

Steps that are references will look like this:

    - resolving: ibm-cloud-databases-redis.v1.0.0
      resource:
        group: rbac.authorization.k8s.io
        kind: ClusterRoleBinding
        manifest: >-
         {"kind":"ConfigMap","name":"cbd102757181a13f9cee81c42815586023b4840d95543e67c1d68f1572e4667","namespace":"openshift-marketplace","catalogSourceName":"aspera","catalogSourceNamespace":"openshift-marketplace","replaces":""}
        name: ibm-cloud-databases-redis.v1.0.079b98b8bd4-i7bdf84b456
        sourceName: aspera
        sourceNamespace: openshift-marketplace
        version: v1
      status: Created

where the configmap references the unpacked bundle for that step.
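
For illustration, the reference payload shown above can be modelled and decoded roughly like this (a sketch with field names inferred from the example; the actual type and helper in the PR may differ):

    package catalog

    import (
        "encoding/json"

        "github.com/operator-framework/api/pkg/operators/v1alpha1"
    )

    // UnpackedBundleReference identifies the configmap that holds the unpacked
    // bundle contents for a step, instead of embedding the manifest itself.
    type UnpackedBundleReference struct {
        Kind                   string `json:"kind"`
        Name                   string `json:"name"`
        Namespace              string `json:"namespace"`
        CatalogSourceName      string `json:"catalogSourceName"`
        CatalogSourceNamespace string `json:"catalogSourceNamespace"`
        Replaces               string `json:"replaces"`
    }

    // refForStep decodes the step's manifest field as a reference; it returns
    // nil when the step carries an inlined manifest rather than a reference.
    func refForStep(step *v1alpha1.Step) *UnpackedBundleReference {
        var ref UnpackedBundleReference
        if err := json.Unmarshal([]byte(step.Resource.Manifest), &ref); err != nil {
            return nil
        }
        if ref.Kind != "ConfigMap" || ref.Name == "" {
            return nil
        }
        return &ref
    }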

Motivation for the change:
It is possible to hit size limits on the installplan object (for example the gRPC request size limit or the etcd value size limit) if resolution results in many large manifests. Bundle images are already unpacked into configmaps on-cluster, so persisting the unpacked manifests in the installplan is not necessary.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress label Jun 23, 2020
@openshift-ci-robot openshift-ci-robot added the approved label Jun 23, 2020
@ecordell ecordell force-pushed the ip-status-is-status branch from 8cb0757 to 9816718 on June 23, 2020 13:58
@ecordell ecordell changed the title from "[WIP] don't copy bundle contents to ip status" to "Bug 1849164: don't store full manifests in installplan status (for bundle images)" Jun 23, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent and bugzilla/valid-bug labels and removed the do-not-merge/work-in-progress label Jun 23, 2020
@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1849164, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ecordell ecordell force-pushed the ip-status-is-status branch from 9816718 to 7ad841b on June 23, 2020 14:52
@ecordell
Member Author

/retest

Member

@exdx exdx left a comment


This is a good quick-and-dirty solution to the issue at hand -- I was unaware that we already unpack all objects in the bundle into separate configmaps, but if so this makes the most sense.

Let's see how the tests perform, but I'm a little concerned about the potential for transient read errors (failed to find configmap) impacting the actual step resolution. We are basically adding a roundtrip to the api-server for each object in the bundle.

Longer term, we are adding more and more objects to the bundle. Eventually we want to support arbitrary kube objects in the bundle. This approach may still work in that case, but I'm curious how it will perform.

@@ -1418,9 +1453,79 @@ func (o *Operator) ExecutePlan(plan *v1alpha1.InstallPlan) error {
ensurer := newStepEnsurer(kubeclient, crclient, dynamicClient)
b := newBuilder(kubeclient, dynamicClient, o.csvProvidedAPIsIndexer, o.logger)

refForStep := func(step *v1alpha1.Step) *UnpackedBundleReference {
Member

I'm wondering if it's a bit clearer to hang the refForStep and manifestForStep functions off the builder struct, so that they are logically grouped with the object that executes the steps. From what I recall, we wanted to eventually move the step logic in there. It also helps the readability of ExecutePlan.

Member Author

I chose not to do that originally because the builder does not exist prior to 4.5, and I knew we intended to backport this.

But this is a good point - instead of adding to the builder, I'll make a new interface / method for resolving manifests and handling the step cache. In 4.5, the builder can depend on that interface, and in < 4.4 it can be used directly in the catalog operator.
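
A rough sketch of what such an interface could look like (the name and signature here are illustrative, not taken from the PR):

    // ManifestResolver resolves the manifest to apply for a step, reading it
    // from the referenced unpacked-bundle configmap when the step only stores
    // a reference instead of the manifest itself.
    type ManifestResolver interface {
        ManifestForStep(step *v1alpha1.Step) (string, error)
    }

In 4.5+ the builder could hold a ManifestResolver, while earlier branches could call an implementation directly from the catalog operator.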

for i, step := range plan.Status.Plan {
// TODO: should lazy load only when the manifest is actually needed
manifest, err := manifestForStep(step)
Member

See above -- I think linking this to step creation via the builder helps with the lazy loading.

for i, step := range plan.Status.Plan {
// TODO: should lazy load only when the manifest is actually needed
manifest, err := manifestForStep(step)
Member

nit: WDYT about having a type Manifest string and using that in the code? It helps readability, and we can hang methods off the type.

Member Author

Since the type on the API is string, we'd have to handle manifests that are already unpacked differently from manifests that are in configmaps. Instead, I'll move all manifest handling into its own file, which should make it easier to keep track of what's what.
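
For reference, the suggested (but not adopted) wrapper type might have looked roughly like this, reusing the UnpackedBundleReference sketch above:

    // Manifest wraps the raw string stored on a step so helpers can hang off it.
    type Manifest string

    // Reference decodes the manifest as an unpacked-bundle configmap reference,
    // returning false when the manifest is an inlined object instead.
    func (m Manifest) Reference() (*UnpackedBundleReference, bool) {
        var ref UnpackedBundleReference
        if err := json.Unmarshal([]byte(m), &ref); err != nil || ref.Kind != "ConfigMap" {
            return nil, false
        }
        return &ref, true
    }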

@@ -297,7 +297,7 @@ func (b *BundleLookup) SetCondition(cond BundleLookupCondition) BundleLookupCond
if existing.Type != cond.Type {
continue
}
if existing.Status == cond.Status {
if existing.Status == cond.Status && existing.Reason == cond.Reason {
Member

Should we open up a PR in api for this?

Member Author

Good call - I will do that.

Member

It would be nice to separate our utility methods from the data type in the future. What about defining decorator types in OLM that embed the data type instead?

@exdx
Member

exdx commented Jun 23, 2020

I will lgtm since this is an urgent bugfix
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label Jun 23, 2020
@ecordell
Member Author

/hold

I'd like to test with the set of operators we discovered this with before merging

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold label Jun 23, 2020
@ecordell
Member Author

Let's see how the tests perform but I'm a little concerned about the potential for transient read errors (failed to find configmap) impacting the actual step resolution.

Resyncs should cover those cases, but I agree that this needs more stress testing to get numbers.

We are basically adding a roundtrip to the api-server for each object in the bundle.

The current implementation doesn't do any extra roundtripping - it reads the configmap data from the cache. It also only goes to the cache once per bundle (not once per object in the bundle).

I wanted to do a followup that lazy loads the data and only checks if the step needs some work done (so it would skip installed steps / bundles for a sync). The only reason I didn't do that here is because I would've needed to change the stepper interface a lot more than I did (the stepper would need access to the same cache/listers), and I was wary of backporting that.
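
As a sketch of the access pattern described here (configMapLister is an assumed ConfigMap lister backed by the shared informer cache, corev1 is k8s.io/api/core/v1, and the helper names are illustrative):

    // Cache one configmap per referenced bundle so the lister is consulted only
    // once per bundle during a single ExecutePlan pass.
    bundleCMs := map[string]*corev1.ConfigMap{}

    getBundleConfigMap := func(ref *UnpackedBundleReference) (*corev1.ConfigMap, error) {
        key := ref.Namespace + "/" + ref.Name
        if cm, ok := bundleCMs[key]; ok {
            return cm, nil
        }
        // Reads from the informer cache, not the API server, so no extra roundtrip.
        cm, err := configMapLister.ConfigMaps(ref.Namespace).Get(ref.Name)
        if err != nil {
            return nil, err
        }
        bundleCMs[key] = cm
        return cm, nil
    }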

@exdx
Member

exdx commented Jun 23, 2020

The current implementation doesn't do any extra roundtripping - it reads the configmap data from the cache. It also only goes to the cache once per bundle (not once per object in the bundle).

That makes sense - forgot we have our informer cache to work with

I wanted to do a followup that lazy loads the data and only checks if the step needs some work done (so it would skip installed steps / bundles for a sync). The only reason I didn't do that here is because I would've needed to change the stepper interface a lot more than I did (the stepper would need access to the same cache/listers), and I was wary of backporting that.

That makes sense

@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1849164, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Member

@njhale njhale left a comment


This looks like a solid, highly backwards-compatible fix. I expect we'll revisit some of the abstractions later (maybe when we implement/design the v2 APIs).

I do, however, have a single wrench:

IIRC, bundles are unpacked -- in their entirety -- into a single ConfigMap, which should run into the resource storage limit before the point that this fix will help.

^ this is true in the case of a single big bundle, but the patch introduced in this PR still definitely helps in the case where the number of dependencies included in an InstallPlan is the primary contributor to its size.

manifestForStep := func(step *v1alpha1.Step) (string, error) {
manifest := step.Resource.Manifest
ref := refForStep(step)
if ref != nil {
Member

nit: return early if ref == nil to drop the indentation below
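
The early-return shape the reviewer suggests would look roughly like this (manifestFromConfigMap is a hypothetical helper standing in for the configmap lookup):

    manifestForStep := func(step *v1alpha1.Step) (string, error) {
        ref := refForStep(step)
        if ref == nil {
            // No reference: the manifest is stored inline on the step.
            return step.Resource.Manifest, nil
        }
        // Reference: pull this step's object out of the unpacked-bundle configmap.
        return manifestFromConfigMap(ref, step)
    }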

@@ -94,7 +113,7 @@ func RBACForClusterServiceVersion(csv *v1alpha1.ClusterServiceVersion) (map[stri
// Create RoleBinding
roleBinding := &rbacv1.RoleBinding{
ObjectMeta: metav1.ObjectMeta{
Name: generateName(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName)),
Name: generateName(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName), HashObject(fmt.Sprintf("%s-%s", role.GetName(), permission.ServiceAccountName))),
Member

nit: can this constitute an info leak?

Member

nit: can the Role and RoleBindings share a name since we're generating them 1:1?


func (o *Operator) unpackBundles(plan *v1alpha1.InstallPlan) (bool, *v1alpha1.InstallPlan, error) {
out := plan.DeepCopy()
unpacked := true
unpackedCount := len(out.Status.BundleLookups)
Member

Doesn't look like unpackedCount is being used to trigger any conditions.

Member Author

whoops - a planned refactor that I abandoned. will remove.

@ecordell
Member Author

IIRC, bundles are unpacked -- in their entirety -- into a single ConfigMap, which should run into the resource storage limit before the point that this fix will help.

There is a limit on how big a bundle can be right now, which is the amount that we can fit into a single configmap. This PR removes (well, greatly extends) the limit on how many bundles can be unpacked and installed via a single installplan. This limit was hit with 19 operators in a cluster.

@ecordell ecordell force-pushed the ip-status-is-status branch from 7ad841b to 57cdfe1 on June 24, 2020 02:10
@openshift-ci-robot openshift-ci-robot removed the lgtm label Jun 24, 2020
@ecordell ecordell force-pushed the ip-status-is-status branch from 57cdfe1 to d0e858d on June 24, 2020 02:19
@ecordell
Member Author

@exdx @njhale Thanks for the thorough reviews - I've addressed all the feedback and I think it's ready to go

return utilrand.SafeEncodeString(fmt.Sprint(hasher.Sum32()))
}

func generateName(base string, hash string) string {
Contributor

Is the contract of this function to return the concatenation of base+hash, truncating base such that the result is short enough to be an object name? I'm worried about slice range out-of-bound panics and overly-long return values. Couldn't base have length 127 in the invocation on L116?

Member Author

pushed tests for this - thanks for calling it out. see rbac_test.go

@njhale
Member

njhale commented Jun 24, 2020

/retest

"encoding/json"
"fmt"

"github.com/operator-framework/api/pkg/operators/v1alpha1"
Member

note: I've been asking myself whether we treat in-org dependencies as "external" for the purposes of import grouping. Thoughts?

the installplan status.plan field stores a list of all manifests that
will be installed when the installplan is approved.

if the installplan is resolving a large set of dependencies, it is
possible for it to exceed various limits in kube (grpc request size
limit, etcd value size limit, etc).

this commit changes the way objects are tracked if they come from
bundle images. in that case, the manifests are already unpacked onto
the cluster into a configmap. the installplan now stores a reference
to the source configmap for each step, rather than storing the manifest
itself. this allows the installplan to scale to a large number of
resolved objects.
@ecordell ecordell force-pushed the ip-status-is-status branch from d0e858d to a7659db on June 24, 2020 12:43
Member

@njhale njhale left a comment


Nice!

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label Jun 24, 2020
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ecordell
Member Author

/cherry-pick release-4.5
/cherry-pick release-4.4

@openshift-cherrypick-robot

@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.5
/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ecordell
Member Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold label Jun 24, 2020
@openshift-merge-robot openshift-merge-robot merged commit a6f90c9 into operator-framework:master Jun 24, 2020
@openshift-ci-robot
Collaborator

@ecordell: All pull requests linked via external trackers have merged: operator-framework/operator-lifecycle-manager#1589. Bugzilla bug 1849164 has been moved to the MODIFIED state.

In response to this:

Bug 1849164: don't store full manifests in installplan status (for bundle images)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@ecordell: new pull request created: #1594

In response to this:

/cherry-pick release-4.5
/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
