
Commit c039fed

davidhou17, mircea-cosbuc, and dan-mckean authored
(DOCSP-33590): Create tutorial for OM & Operator recovery in case of cluster failure (#1484)
* Create OM recovery tutorial

* Tech review suggestions pt 1

Co-authored-by: mircea-cosbuc <[email protected]>
Co-authored-by: Dan Mckean <[email protected]>

* tech review pt 2

* Add last step

* Apply Mircea's comment

* copy review feedback

---------

Co-authored-by: mircea-cosbuc <[email protected]>
Co-authored-by: Dan Mckean <[email protected]>
1 parent 6488bca commit c039fed

File tree: 4 files changed, +236 -1 lines changed

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: "Configure the |k8s-op-short| in a new cluster."
level: 4
stepnum: 1
ref: recover-om-new-cluster
content: |

  Follow the instructions to :ref:`install the Kubernetes Operator
  <install-k8s>` in a new |k8s| cluster.

  .. note::

     If you plan to re-use a member cluster, ensure that the
     appropriate service account and role exist. These values can overlap
     and have different permissions between the central cluster and member
     cluster.

     To see the appropriate role required for the
     |k8s-op-short|, refer to the :github:`sample in the public repository
     </mongodb/mongodb-enterprise-kubernetes/blob/master/samples/multi-cluster-cli-gitops/resources/rbac/namespace_scoped_central_cluster.yaml>`.

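  For example, before re-using a member cluster, you might list the
  service accounts, roles, and role bindings that already exist in that
  cluster's namespace and compare them against the sample role. The
  ``mongodb`` namespace and the ``$MDB_CLUSTER_1_FULL_NAME`` context
  variable are placeholders for your own values:

  .. code-block:: sh

     # List the RBAC objects already present in the member cluster's
     # namespace so you can compare them against the sample role.
     kubectl --context "$MDB_CLUSTER_1_FULL_NAME" \
       --namespace "mongodb" \
       get serviceaccounts,roles,rolebindings
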
---
title: "Retrieve the backed-up resources from the failed |onprem| resource."
level: 4
stepnum: 2
ref: recover-om-retrieve-backups
content: |

  Copy the |k8s-obj| specification for the failed |onprem| resource and
  retrieve the following resources, replacing the placeholder text with
  your specific |onprem| resource name and namespace.

  .. list-table::
     :widths: 40 60
     :header-rows: 1

     * - Resource Type
       - Values

     * - Secrets
       - - ``<om-name>-db-om-password``
         - ``<om-name>-db-agent-password``
         - ``<om-name>-db-keyfile``
         - ``<om-name>-db-om-user-scram-credentials``
         - ``<om-namespace>-<om-name>-admin-key``
         - ``<om-name>-admin-secret``
         - ``<om-name>-gen-key``
         - |tls| certificate secrets (optional)

     * - ConfigMaps
       - - ``<om-name>-db-cluster-mapping``
         - ``<om-name>-db-member-spec``
         - Custom CA for |tls| certificates (optional)

     * - OpsManager
       - - ``<om-name>``

  Then, paste the specification that you copied into a new file and
  configure the new resource by using the preceding values. To
  learn more, see :ref:`deploy-om-container`.

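  For example, if the backed-up objects are still accessible with
  ``kubectl``, either from your backup tooling or from a surviving
  cluster, a sketch of exporting them might look like the following.
  The resource names and the ``mongodb`` namespace are placeholders:

  .. code-block:: sh

     # Export the backed-up Secrets, ConfigMaps, and the OpsManager
     # specification to files that you can edit and re-apply.
     kubectl get secret <om-name>-db-om-password \
       --namespace "mongodb" -o yaml > om-db-om-password.yaml

     kubectl get configmap <om-name>-db-cluster-mapping \
       --namespace "mongodb" -o yaml > om-db-cluster-mapping.yaml

     kubectl get om <om-name> \
       --namespace "mongodb" -o yaml > ops-manager.yaml
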
---
title: "Re-apply the |onprem| resource to the new cluster."
level: 4
stepnum: 3
ref: recover-om-re-apply-resource
content: |

  Use the following command to apply the updated resource:

  .. code-block:: sh

     kubectl apply \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       -f https://raw.githubusercontent.com/mongodb/mongodb-enterprise-kubernetes/master/samples/ops-manager/ops-manager-external.yaml

  To check the status of your |onprem| resource, use the following command:

  .. code-block:: sh

     kubectl get om -o yaml -w

  Once the central cluster reaches a ``Running`` state, you can
  re-scale the Application Database to your desired
  distribution of member clusters.

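  The following sketch shows one way to re-scale the Application
  Database by patching the |onprem| resource. The
  ``spec.applicationDatabase.clusterSpecList`` field path, cluster
  names, and member counts are assumptions; verify them against your
  own |onprem| specification before you apply the patch:

  .. code-block:: sh

     # Shift the Application Database members onto the healthy member
     # clusters (cluster names and member counts are examples only).
     kubectl patch om <om-name> \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       --type merge \
       --patch '
     spec:
       applicationDatabase:
         clusterSpecList:
           - clusterName: cluster-1
             members: 3
           - clusterName: cluster-2
             members: 2
     '
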
---
title: "Re-apply the MongoDB resources to the new cluster."
level: 4
stepnum: 4
ref: recover-om-apply-new-cluster
content: |

  To host your |k8s-mdbrsc| or |mongodb-multi| on the new
  |k8s-op-short| instance, apply the following resources to the
  new cluster:

  - The :ref:`ConfigMap <create-k8s-project>` used to create the initial project.

  - The :ref:`secrets <create-k8s-credentials>` used in the previous |k8s-op-short|
    instance.

  - The ``MongoDB`` or ``MongoDBMultiCluster`` |k8s-custom-resource| at its last
    available state on the source cluster, including any :k8sdocs:`Annotations
    </concepts/overview/working-with-objects/annotations/>` added by the |k8s-op-short|
    during its lifecycle.

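  For example, a minimal sketch of re-applying these objects, where the
  file names are placeholders for the manifests that you backed up or
  exported from the source cluster:

  .. code-block:: sh

     # Re-create the project ConfigMap, the credentials secret, and the
     # database resource on the new central cluster.
     kubectl apply \
       --context "$MDB_CENTRAL_CLUSTER_FULL_NAME" \
       --namespace "mongodb" \
       -f my-project-configmap.yaml \
       -f my-credentials-secret.yaml \
       -f my-mongodb-resource.yaml
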
  .. note::

     If you deployed a |k8s-mdbrsc| and not a |mongodb-multi|
     and wish to migrate the failed |k8s| cluster's data
     to the new cluster, you must complete the following additional steps:

     1. Create a new |k8s-mdbrsc| on the new cluster.
     #. Migrate the data to the new resource by
        :opsmgr:`Backing Up and Restoring </tutorial/nav/backup-use/>`
        the data in |onprem|.

     If you deployed a |mongodb-multi|, you must re-scale the resource that you
     applied on the new healthy clusters if the failed cluster contained any
     Application Database nodes.
...

source/multi-cluster-arch.txt

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ The following limitations exist for |multi-clusters|:
  If you host |onprem| in the same |k8s| cluster as the |k8s-op-short| and
  the cluster fails, you can restore the |multi-cluster| to a new |k8s|
  cluster. However, restoring |onprem| into another cluster in this case
- is a lengthy manual process.
+ is a lengthy manual process. To learn more, see :ref:`recover-om-appdb-deployments`.

  In addition to deploying the Application Database outside of |k8s|,
  you can deploy the Application Database on selected member |k8s| clusters

source/om-resources.txt

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,10 @@ Deploy and Configure Ops Manager Resources
  :ref:`cert-manager-integration`
     Configure automated certificate renewal for |onprem| deployments with ``cert-manager``.

+ :ref:`recover-om-appdb-deployments`
+    Manually recover the |k8s-op-short| and |onprem| for an |onprem| resource with
+    Multi-Cluster AppDB Deployments in the event that the |k8s| cluster fails.
+
  .. class:: hidden

  .. toctree::
@@ -59,3 +63,4 @@ Deploy and Configure Ops Manager Resources
     /tutorial/configure-kmip-backup-encryption
     /tutorial/configure-file-store
     /tutorial/cert-manager-integration
+    /tutorial/recover-om-appdb-deployments
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
.. _recover-om-appdb-deployments:

===========================================================================
Recover the |k8s-op-short| and |onprem| for Multi-Cluster AppDB Deployments
===========================================================================

.. default-domain:: mongodb

.. contents:: On this page
   :local:
   :backlinks: none
   :depth: 1
   :class: singlecol

If you host an |onprem| resource in the same |k8s| cluster as
the |k8s-op-short| and have the Application Database (AppDB)
deployed on selected member clusters in your |multi-cluster|,
you can manually recover the |k8s-op-short| and |onprem|
in the event that the cluster fails.

To learn more about deploying |onprem| on a central
cluster and the Application Database across member clusters,
see :ref:`om_with_multi-clusters`.

Prerequisites
-------------

Before you can recover the |k8s-op-short| and |onprem|, ensure
that you meet the following requirements:

- Configure backups for your |onprem| and
  Application Database resources, including any
  |k8s-configmaps| and |k8s-secrets| created by the |k8s-op-short|,
  to indicate the previous running state of |onprem|.
  To learn more, see :ref:`om-rsrc-backup`.

- The Application Database must have at least three healthy
  nodes remaining after failure of the |k8s-op-short|'s cluster.

- The healthy clusters in your |multi-cluster| must contain
  a sufficient number of members to elect a primary node.
  To learn more, see :ref:`appdb-architecture`.

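For example, you can list what remains on each healthy member cluster
before you start the recovery. The ``mongodb`` namespace and the
context variables are placeholders for your own member cluster
contexts:

.. code-block:: sh

   # List the StatefulSets and Pods on each surviving member cluster so
   # you can confirm that a voting majority of Application Database
   # members is still running.
   for ctx in "$MDB_CLUSTER_1_FULL_NAME" "$MDB_CLUSTER_2_FULL_NAME"; do
     kubectl --context "$ctx" --namespace "mongodb" get statefulsets,pods
   done
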
Considerations
--------------

.. _appdb-architecture:

Application Database Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because the |k8s-op-short| doesn't support forcing a replica set
reconfiguration, the healthy |k8s| clusters
must contain a sufficient number of Application Database members to elect a primary node
for this manual recovery process. A majority of the Application Database
members must be available to elect a primary. To learn more, see
:manual:`Replica Set Deployment Architectures </core/replica-set-architectures/>`.

If possible, use an odd number of member |k8s| clusters. Proper distribution of your
Application Database members can help to maximize the likelihood that
the remaining replica set members can form a majority during an outage.
To learn more, see :manual:`Replica Sets Distributed Across Two or More Data Centers
</core/replica-set-architecture-geographically-distributed/>`.

Consider the following examples:

.. tabs::

   .. tab:: Five-member Application Database
      :tabid: five-member

      For a five-member Application Database, some possible distributions of members include:

      - Two clusters: three members to Cluster 1 and two members to Cluster 2.

        - If Cluster 2 fails, there are enough members on Cluster 1 to elect a primary node.
        - If Cluster 1 fails, there are not enough members on Cluster 2 to elect a primary node.

      - Three clusters: two members to Cluster 1, two members to Cluster 2, and one member to Cluster 3.

        - If any single cluster fails, there are enough members on the remaining clusters to elect a primary node.
        - If two clusters fail, there are not enough members on any remaining cluster to elect a primary node.

   .. tab:: Seven-member Application Database
      :tabid: seven-member

      For a seven-member Application Database, consider the following distribution of members:

      - Two clusters: four members to Cluster 1 and three members to Cluster 2.

        - If Cluster 2 fails, there are enough members on Cluster 1 to elect a primary node.
        - If Cluster 1 fails, there are not enough members on Cluster 2 to elect a primary node.

          Although Cluster 2 meets the three-member minimum for the Application Database,
          a majority of the Application Database's seven members must be available
          to elect a primary node.
------------

Procedure
---------

To recover the |k8s-op-short| and |onprem|,
restore the |onprem| resource on a new |k8s| cluster:

.. include:: /includes/steps/recover-k8s-om-multi-appdb-deployments.rst
