
Commit c039fed

davidhou17, mircea-cosbuc, and dan-mckean authored
(DOCSP-33590): Create tutorial for OM & Operator recovery in case of cluster failure (#1484)
* Create OM recovery tutorial

* Tech review suggestions pt 1

Co-authored-by: mircea-cosbuc <[email protected]>
Co-authored-by: Dan Mckean <[email protected]>

* tech review pt 2

* Add last step

* Apply Mircea's comment

* copy review feedback

---------

Co-authored-by: mircea-cosbuc <[email protected]>
Co-authored-by: Dan Mckean <[email protected]>
1 parent 6488bca commit c039fed

File tree: 4 files changed, +236 -1 lines changed

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: "Configure the |k8s-op-short| in a new cluster."
level: 4
stepnum: 1
ref: recover-om-new-cluster
content: |

  Follow the instructions to :ref:`install the Kubernetes Operator
  <install-k8s>` in a new |k8s| cluster.

  .. note::

     If you plan to re-use a member cluster, ensure that the
     appropriate service account and role exist. These values can overlap
     and have different permissions between the central cluster and member
     cluster.

     To see the appropriate role required for the
     |k8s-op-short|, refer to the :github:`sample in the public repository
     </mongodb/mongodb-enterprise-kubernetes/blob/master/samples/multi-cluster-cli-gitops/resources/rbac/namespace_scoped_central_cluster.yaml>`.

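  For example, before re-using a member cluster, you might list the
  service accounts, roles, and role bindings that already exist in that
  cluster's namespace and compare them against the sample role. The
  ``mongodb`` namespace and the ``$MDB_CLUSTER_1_FULL_NAME`` context
  variable are placeholders for your own values:

  .. code-block:: sh

     # List the RBAC objects already present in the member cluster's
     # namespace so you can compare them against the sample role.
     kubectl --context "$MDB_CLUSTER_1_FULL_NAME" \
       --namespace "mongodb" \
       get serviceaccounts,roles,rolebindings
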
---
title: "Retrieve the backed-up resources from the failed |onprem| resource."
level: 4
stepnum: 2
ref: recover-om-retrieve-backups
content: |

  Copy the |k8s-obj| specification for the failed |onprem| resource and
  retrieve the following resources, replacing the placeholder text with
  your specific |onprem| resource name and namespace.

  .. list-table::
     :widths: 40 60
     :header-rows: 1

     * - Resource Type
       - Values

     * - Secrets
       - - ``<om-name>-db-om-password``
         - ``<om-name>-db-agent-password``
         - ``<om-name>-db-keyfile``
         - ``<om-name>-db-om-user-scram-credentials``
         - ``<om-namespace>-<om-name>-admin-key``
         - ``<om-name>-admin-secret``
         - ``<om-name>-gen-key``
         - |tls| certificate secrets (optional)

     * - ConfigMaps
       - - ``<om-name>-db-cluster-mapping``
         - ``<om-name>-db-member-spec``
         - Custom CA for |tls| certificates (optional)

     * - OpsManager
       - - ``<om-name>``

  Then, paste the specification that you copied into a new file and
  configure the new resource by using the preceding values. To
  learn more, see :ref:`deploy-om-container`.

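  For example, if the backed-up objects are still accessible with
  ``kubectl``, either from your backup tooling or from a surviving
  cluster, a sketch of exporting them might look like the following.
  The resource names and the ``mongodb`` namespace are placeholders:

  .. code-block:: sh

     # Export the backed-up Secrets, ConfigMaps, and the OpsManager
     # specification to files that you can edit and re-apply.
     kubectl get secret <om-name>-db-om-password \
       --namespace "mongodb" -o yaml > om-db-om-password.yaml

     kubectl get configmap <om-name>-db-cluster-mapping \
       --namespace "mongodb" -o yaml > om-db-cluster-mapping.yaml

     kubectl get om <om-name> \
       --namespace "mongodb" -o yaml > ops-manager.yaml
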
---
title: "Re-apply the |onprem| resource to the new cluster."
level: 4
stepnum: 3
ref: recover-om-re-apply-resource
content: |

  Use the following command to apply the updated resource:

  .. code-block:: sh

     kubectl apply \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       -f https://raw.githubusercontent.com/mongodb/mongodb-enterprise-kubernetes/master/samples/ops-manager/ops-manager-external.yaml

  To check the status of your |onprem| resource, use the following command:

  .. code-block:: sh

     kubectl get om -o yaml -w

  Once the central cluster reaches a ``Running`` state, you can
  re-scale the Application Database to your desired
  distribution of member clusters.

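  The following sketch shows one way to re-scale the Application
  Database by patching the |onprem| resource. The
  ``spec.applicationDatabase.clusterSpecList`` field path, cluster
  names, and member counts are assumptions; verify them against your
  own |onprem| specification before you apply the patch:

  .. code-block:: sh

     # Shift the Application Database members onto the healthy member
     # clusters (cluster names and member counts are examples only).
     kubectl patch om <om-name> \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       --type merge \
       --patch '
     spec:
       applicationDatabase:
         clusterSpecList:
           - clusterName: cluster-1
             members: 3
           - clusterName: cluster-2
             members: 2
     '
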
---
title: "Re-apply the MongoDB resources to the new cluster."
level: 4
stepnum: 4
ref: recover-om-apply-new-cluster
content: |

  To host your |k8s-mdbrsc| or |mongodb-multi| on the new
  |k8s-op-short| instance, apply the following resources to the
  new cluster:

  - The :ref:`ConfigMap <create-k8s-project>` used to create the initial project.

  - The :ref:`secrets <create-k8s-credentials>` used in the previous |k8s-op-short|
    instance.

  - The ``MongoDB`` or ``MongoDBMultiCluster`` |k8s-custom-resource| at its last
    available state on the source cluster, including any :k8sdocs:`Annotations
    </concepts/overview/working-with-objects/annotations/>` added by the |k8s-op-short|
    during its lifecycle.

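  For example, a minimal sketch of re-applying these objects, where the
  file names are placeholders for the manifests that you backed up or
  exported from the source cluster:

  .. code-block:: sh

     # Re-create the project ConfigMap, the credentials secret, and the
     # database resource on the new central cluster.
     kubectl apply \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       -f my-project-configmap.yaml \
       -f my-credentials-secret.yaml \
       -f my-mongodb-resource.yaml
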
  .. note::

     If you deployed a |k8s-mdbrsc| and not a |mongodb-multi|
     and wish to migrate the failed |k8s| cluster's data
     to the new cluster, you must complete the following additional steps:

     1. Create a new |k8s-mdbrsc| on the new cluster.
     #. Migrate the data to the new resource by
        :opsmgr:`Backing Up and Restoring </tutorial/nav/backup-use/>`
        the data in |onprem|.

     If you deployed a |mongodb-multi|, you must re-scale the resource that you
     applied on the new healthy clusters if the failed cluster contained any
     Application Database nodes.
...

source/multi-cluster-arch.txt

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ The following limitations exist for |multi-clusters|:
  If you host |onprem| in the same |k8s| cluster as the |k8s-op-short| and
  the cluster fails, you can restore the |multi-cluster| to a new |k8s|
  cluster. However, restoring |onprem| into another cluster in this case
- is a lengthy manual process.
+ is a lengthy manual process. To learn more, see :ref:`recover-om-appdb-deployments`.

  In addition to deploying the Application Database outside of |k8s|,
  you can deploy the Application Database on selected member |k8s| clusters

source/om-resources.txt

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,10 @@ Deploy and Configure Ops Manager Resources
  :ref:`cert-manager-integration`
     Configure automated certificate renewal for |onprem| deployments with ``cert-manager``.

+ :ref:`recover-om-appdb-deployments`
+    Manually recover the |k8s-op-short| and |onprem| for an |onprem| resource with
+    Multi-Cluster AppDB Deployments in the event that the |k8s| cluster fails.
+
  .. class:: hidden

  .. toctree::
@@ -59,3 +63,4 @@ Deploy and Configure Ops Manager Resources
     /tutorial/configure-kmip-backup-encryption
     /tutorial/configure-file-store
     /tutorial/cert-manager-integration
+    /tutorial/recover-om-appdb-deployments
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
.. _recover-om-appdb-deployments:

===========================================================================
Recover the |k8s-op-short| and |onprem| for Multi-Cluster AppDB Deployments
===========================================================================

.. default-domain:: mongodb

.. contents:: On this page
   :local:
   :backlinks: none
   :depth: 1
   :class: singlecol

If you host an |onprem| resource in the same |k8s| cluster as
the |k8s-op-short| and have the Application Database (AppDB)
deployed on selected member clusters in your |multi-cluster|,
you can manually recover the |k8s-op-short| and |onprem|
in the event that the cluster fails.

To learn more about deploying |onprem| on a central
cluster and the Application Database across member clusters,
see :ref:`om_with_multi-clusters`.

Prerequisites
-------------

Before you can recover the |k8s-op-short| and |onprem|, ensure
that you meet the following requirements:

- Configure backups for your |onprem| and
  Application Database resources, including any
  |k8s-configmaps| and |k8s-secrets| created by the |k8s-op-short|,
  to indicate the previous running state of |onprem|.
  To learn more, see :ref:`om-rsrc-backup`.

- The Application Database must have at least three healthy
  nodes remaining after failure of the |k8s-op-short|'s cluster.

- The healthy clusters in your |multi-cluster| must contain
  a sufficient number of members to elect a primary node.
  To learn more, see :ref:`appdb-architecture`.

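For example, you can list what remains on each healthy member cluster
before you start the recovery. The ``mongodb`` namespace and the
context variables are placeholders for your own member cluster
contexts:

.. code-block:: sh

   # List the StatefulSets and Pods on each surviving member cluster so
   # you can confirm that a voting majority of Application Database
   # members is still running.
   for ctx in "$MDB_CLUSTER_1_FULL_NAME" "$MDB_CLUSTER_2_FULL_NAME"; do
     kubectl --context "$ctx" --namespace "mongodb" get statefulsets,pods
   done
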
Considerations
--------------

.. _appdb-architecture:

Application Database Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because the |k8s-op-short| doesn't support forcing a replica set
reconfiguration, the healthy |k8s| clusters
must contain a sufficient number of Application Database members to elect a primary node
for this manual recovery process. A majority of the Application Database
members must be available to elect a primary. To learn more, see
:manual:`Replica Set Deployment Architectures </core/replica-set-architectures/>`.

If possible, use an odd number of member |k8s| clusters. Proper distribution of your
Application Database members can help to maximize the likelihood that
the remaining replica set members can form a majority during an outage.
To learn more, see :manual:`Replica Sets Distributed Across Two or More Data Centers
</core/replica-set-architecture-geographically-distributed/>`.

Consider the following examples:

.. tabs::

   .. tab:: Five-member Application Database
      :tabid: five-member

      For a five-member Application Database, some possible distributions of members include:

      - Two clusters: three members to Cluster 1 and two members to Cluster 2.

        - If Cluster 2 fails, there are enough members on Cluster 1 to elect a primary node.
        - If Cluster 1 fails, there are not enough members on Cluster 2 to elect a primary node.

      - Three clusters: two members to Cluster 1, two members to Cluster 2, and one member to Cluster 3.

        - If any single cluster fails, there are enough members on the remaining clusters to elect a primary node.
        - If two clusters fail, there are not enough members on any remaining cluster to elect a primary node.

   .. tab:: Seven-member Application Database
      :tabid: seven-member

      For a seven-member Application Database, consider the following distribution of members:

      - Two clusters: four members to Cluster 1 and three members to Cluster 2.

        - If Cluster 2 fails, there are enough members on Cluster 1 to elect a primary node.
        - If Cluster 1 fails, there are not enough members on Cluster 2 to elect a primary node.

          Although Cluster 2 meets the three-member minimum for the Application Database,
          a majority of the Application Database's seven members must be available
          to elect a primary node.
------------

Procedure
---------

To recover the |k8s-op-short| and |onprem|,
restore the |onprem| resource on a new |k8s| cluster:

.. include:: /includes/steps/recover-k8s-om-multi-appdb-deployments.rst
