This error indicates that the API `apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition` is not available on-cluster: the GVK is not served by the api-server, which happens on Kubernetes 1.22+ with deprecated built-in types like v1beta1 CRDs or v1beta1 RBAC.
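
As a quick check, you can list the API versions the cluster serves; on Kubernetes 1.22+ the `v1beta1` version of this group will be absent:

```
# List the served versions of the apiextensions.k8s.io API group.
kubectl api-versions | grep apiextensions.k8s.io
```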

This error can also arise when installing an operator bundle that includes CustomResources OLM supports, such as `VerticalPodAutoscalers` and `PrometheusRules`, before the relevant CustomResourceDefinition has been installed. In that case, the error should eventually resolve itself, provided the required CustomResourceDefinition is installed on the cluster and accepted by the api-server.
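
To check whether a required CustomResourceDefinition is present and accepted, you can query it directly; the `prometheusrules.monitoring.coreos.com` name below is only an example:

```
# Check that the CRD exists and wait for the api-server to accept it.
kubectl get crd prometheusrules.monitoring.coreos.com
kubectl wait --for=condition=Established crd/prometheusrules.monitoring.coreos.com --timeout=60s
```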
### Subscriptions failing due to unpacking errors

If an operator bundle referenced by a subscription fails to unpack successfully, the subscription will fail with the following message:
```
bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than the specified deadline
```

This type of failure can happen due to many possible underlying reasons on the cluster, including:

1. Operator bundle image being unreachable:
   1. Misconfigured network, such as an incorrectly configured proxy/firewall
   2. Missing operator bundle images from the reachable image registries
   3. Invalid or missing image registry credentials/secrets
   4. Image registry rate limits
2. Resource limitations on the cluster:
   1. CPU or network limitations preventing operator bundle images from being pulled within the timeout (10 minutes)
   2. Inability to schedule pods for unpacking operator bundle images
   3. etcd performance issues
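
To narrow down the cause, it helps to inspect the pod created by the failing unpack job; the namespace, job name, and pod name below are placeholders:

```
# Jobs label their pods with job-name, so the unpack pod is easy to find.
kubectl -n <namespace> get pods -l job-name=<job-name>
# The pod's events and container statuses usually point at the failure.
kubectl -n <namespace> describe pod <pod-name>
```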

Once the underlying causes of the unpack failure have been addressed, deleting the failing unpack jobs and their owner configMaps will cause the subscription to retry unpacking the operator bundles.
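
A minimal sketch of that cleanup, assuming the unpack job and its owner configMap share the same generated name (as they do in current OLM releases); the namespace and job name are placeholders:

```
# List unpack jobs and identify the failing one.
kubectl -n <namespace> get jobs
# Delete the failing job and its owner configMap to trigger a retry.
kubectl -n <namespace> delete job <job-name>
kubectl -n <namespace> delete configmap <job-name>
```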

You can enable automated cleanup and retry of failed unpack jobs in a namespace by setting the `operatorframework.io/bundle-unpack-min-retry-interval` annotation on the operatorGroup in that namespace. This annotation specifies the minimum time after the last unpack failure before unpacking may be attempted again. It should not be set to an interval shorter than `5m`, to avoid placing unnecessary load on the cluster.

This annotation does not limit the number of times an operator bundle may be unpacked on failure; only 5 failing unpack attempts are preserved for inspection. Unless the underlying cause of the failure is addressed, this may cause OLM to keep attempting, and failing, to unpack the operator bundle indefinitely. Removing the annotation from the operatorGroup disables automated retries for failed unpack jobs in that namespace.
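
For example, to allow a retry at most every 10 minutes (the operatorGroup name and namespace are placeholders):

```
kubectl -n <namespace> annotate operatorgroup <operatorgroup-name> \
  operatorframework.io/bundle-unpack-min-retry-interval=10m
```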

With older versions of OLM, an installPlan may be generated for the failing subscription. In that case, back up the subscription; delete the failing installPlan, CSV, and subscription; and then reapply the subscription.
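
A sketch of that recovery sequence; all resource names and the namespace are placeholders, and you may need to strip server-generated fields (such as `status`, `metadata.resourceVersion`, and `metadata.uid`) from the backup before reapplying it:

```
# Back up the subscription before deleting anything.
kubectl -n <namespace> get subscription <subscription-name> -o yaml > subscription-backup.yaml

# Delete the failing installPlan, CSV, and subscription.
kubectl -n <namespace> delete installplan <installplan-name>
kubectl -n <namespace> delete csv <csv-name>
kubectl -n <namespace> delete subscription <subscription-name>

# Reapply the subscription to trigger a fresh installation attempt.
kubectl -n <namespace> apply -f subscription-backup.yaml
```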