
OCPBUGS-5523: Catalog, fatal error: concurrent map read and map write #429


Merged
merged 2 commits into from
Jan 20, 2023


@dtfranz (Contributor) commented Jan 16, 2023

Problem:
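The metrics package uses a shared map variable to track the Prometheus counters for Subscription objects, and access to that map is not protected against concurrent reads and writes. When two goroutines touch the map at the same time, the Go runtime can abort with "fatal error: concurrent map read and map write", which restarts the container.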

Solution:
This PR guards every access to the map with a read/write mutex (sync.RWMutex), locking before and unlocking after each read or write. Any number of reads may proceed concurrently, but a write gets exclusive access: nothing else can read or write the map until the write finishes.
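A minimal Go sketch of the read/write semantics described above, using only the standard library's sync.RWMutex. It does not touch OLM code; the helper name demoRWMutex and the 50 ms wait are illustrative assumptions. Two goroutines hold the read lock at the same time, and a writer calling Lock stays blocked until both have released it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoRWMutex (a hypothetical helper) reports whether two readers could hold
// the read lock at the same time and whether a writer was kept out until both
// readers released it.
func demoRWMutex() (readersShared, writerExcluded bool) {
	var mu sync.RWMutex
	var readersIn, readersOut sync.WaitGroup
	release := make(chan struct{})

	for i := 0; i < 2; i++ {
		readersIn.Add(1)
		readersOut.Add(1)
		go func() {
			defer readersOut.Done()
			mu.RLock()
			readersIn.Done() // this reader now holds the read lock
			<-release        // keep holding it until told to stop
			mu.RUnlock()
		}()
	}

	// Returns only if both readers are inside RLock simultaneously;
	// with an exclusive lock this would deadlock.
	readersIn.Wait()
	readersShared = true

	writerDone := make(chan struct{})
	go func() {
		mu.Lock() // blocks until every reader has called RUnlock
		close(writerDone)
		mu.Unlock()
	}()

	select {
	case <-writerDone:
		writerExcluded = false // would mean the writer ignored the readers
	case <-time.After(50 * time.Millisecond):
		writerExcluded = true // writer is still waiting, as expected
	}

	close(release)    // let the readers unlock
	readersOut.Wait() // both readers have released
	<-writerDone      // now the writer gets its exclusive turn
	return readersShared, writerExcluded
}

func main() {
	shared, excluded := demoRWMutex()
	fmt.Println("readers shared:", shared, "writer excluded:", excluded)
}
```

Running it prints `readers shared: true writer excluded: true`. The select always takes the timeout branch, because the writer cannot acquire the lock before the release channel is closed.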

Motivation for the change:

OCPBUGS-5523

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 16, 2023
@openshift-ci-robot

@dtfranz: This pull request references Jira Issue OCPBUGS-5523, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dtfranz (Contributor Author) commented Jan 16, 2023

/cherry-pick 4.10

@openshift-cherrypick-robot

@dtfranz: once the present PR merges, I will cherry-pick it on top of 4.10 in a new PR and assign it to you.


… (#2913)

* protected subscriptionSyncCounters access to prevent concurrent map writes

Signed-off-by: Daniel Franz <[email protected]>

* organize map and mutex into single struct

Signed-off-by: Daniel Franz <[email protected]>

* initialize struct

Signed-off-by: Daniel Franz <[email protected]>

* use RWMutex to allow concurrent reads

Signed-off-by: Daniel Franz <[email protected]>

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 2a49a4dddeb3e0fc38b44925bf9bd0d3931d4ff4
Signed-off-by: dtfranz <[email protected]>
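The commit list above (map and mutex organized into a single struct, the struct explicitly initialized, RWMutex used to allow concurrent reads) can be sketched as follows. This is a hypothetical illustration, not the code from metrics.go: apart from the subscriptionSyncCounters name taken from the first commit message, the type, field and method names are invented here, and a small labelValues struct stands in for whatever per-Subscription data the real map stores.

```go
package main

import (
	"fmt"
	"sync"
)

// labelValues is a hypothetical stand-in for the per-Subscription data the
// real metrics map tracks.
type labelValues struct {
	pkg     string
	channel string
}

// subscriptionSyncCounters keeps the map and the lock that guards it in one
// struct, so the map cannot be reached without going through the lock.
type subscriptionSyncCounters struct {
	mu       sync.RWMutex
	counters map[string]labelValues
}

// newSubscriptionSyncCounters initializes the map; writing to a nil map panics.
func newSubscriptionSyncCounters() *subscriptionSyncCounters {
	return &subscriptionSyncCounters{counters: make(map[string]labelValues)}
}

// get takes the read lock, so concurrent gets do not block each other.
func (s *subscriptionSyncCounters) get(name string) (labelValues, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.counters[name]
	return v, ok
}

// set takes the write lock, excluding all readers and writers until it returns.
func (s *subscriptionSyncCounters) set(name string, v labelValues) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counters[name] = v
}

// remove deletes an entry under the write lock.
func (s *subscriptionSyncCounters) remove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.counters, name)
}

// size reads the map length under the read lock.
func (s *subscriptionSyncCounters) size() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.counters)
}

func main() {
	s := newSubscriptionSyncCounters()
	var wg sync.WaitGroup
	// Hit the map from many goroutines at once; without the lock this
	// pattern can crash with a fatal concurrent map access error.
	for i := 0; i < 50; i++ {
		name := fmt.Sprintf("sub-%d", i%5)
		wg.Add(2)
		go func() { defer wg.Done(); s.set(name, labelValues{pkg: name}) }()
		go func() { defer wg.Done(); s.get(name) }()
	}
	wg.Wait()
	fmt.Println(s.size()) // 5 distinct Subscription names
}
```

Keeping the map unexported inside the struct means callers only reach it through methods that take the lock. Reads use RLock so they never block one another, and the make() call in the constructor avoids the panic Go raises on assignment to a nil map.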
@awgreene (Contributor)

/approve
/retest

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2023
@dtfranz (Contributor Author) commented Jan 17, 2023

/test e2e-gcp-ovn

@dtfranz (Contributor Author) commented Jan 17, 2023

/retest

@awgreene (Contributor)

@dtfranz seems like a GCP failure:

{  failed to acquire lease for "gcp-quota-slice": resources not found}

@awgreene (Contributor)

/retest

openshift-ci bot (Contributor) commented Jan 18, 2023

@dtfranz: all tests passed!


@dtfranz (Contributor Author) commented Jan 18, 2023

/jira refresh

@openshift-ci-robot

@dtfranz: This pull request references Jira Issue OCPBUGS-5523, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@dtfranz (Contributor Author) commented Jan 18, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 18, 2023
@openshift-ci-robot

@dtfranz: This pull request references Jira Issue OCPBUGS-5523, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@tmshort (Contributor) left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2023
openshift-ci bot (Contributor) commented Jan 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, dtfranz, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtfranz (Contributor Author) commented Jan 20, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 20, 2023
@openshift-ci-robot

@dtfranz: This pull request references Jira Issue OCPBUGS-5523, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 20, 2023
@openshift-merge-robot openshift-merge-robot merged commit 932e3a0 into openshift:master Jan 20, 2023
@openshift-ci-robot

@dtfranz: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-5523 has been moved to the MODIFIED state.


@openshift-cherrypick-robot

@dtfranz: cannot checkout 4.10: error checking out 4.10: exit status 1. output: error: pathspec '4.10' did not match any file(s) known to git


@dtfranz (Contributor Author) commented Jan 20, 2023

/cherry-pick release-4.10

@openshift-cherrypick-robot

@dtfranz: #429 failed to apply on top of branch "release-4.10":

Applying: OCPBUGS-5523: Catalog, fatal error: concurrent map read and map write (#2913)
Using index info to reconstruct a base tree...
M	staging/operator-lifecycle-manager/pkg/metrics/metrics.go
Falling back to patching base and 3-way merge...
Auto-merging staging/operator-lifecycle-manager/pkg/metrics/metrics.go
CONFLICT (content): Merge conflict in staging/operator-lifecycle-manager/pkg/metrics/metrics.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-5523: Catalog, fatal error: concurrent map read and map write (#2913)
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


dtfranz pushed a commit to dtfranz/operator-framework-olm that referenced this pull request Jan 20, 2023
OCPBUGS-5523: Catalog, fatal error: concurrent map read and map write

Signed-off-by: dtfranz <[email protected]>
dtfranz pushed a commit to dtfranz/operator-framework-olm that referenced this pull request Jan 20, 2023
OCPBUGS-5523: Catalog, fatal error: concurrent map read and map write

Signed-off-by: dtfranz <[email protected]>

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 2a49a4dddeb3e0fc38b44925bf9bd0d3931d4ff4
@awgreene (Contributor) commented Feb 1, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@awgreene: new pull request created: #436


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants