DOCS-309 Migrate sharding failover #168

Merged: 3 commits, Aug 31, 2012
source/administration/sharding-architectures.txt (103 changes: 74 additions & 29 deletions)
@@ -31,49 +31,31 @@ and development. Such a cluster will have the following components:

- 1 :program:`mongos` instance.

.. _sharding-production-deployment:

Deploying a Production Cluster
------------------------------

When deploying a shard cluster to production, you must ensure that the data
is redundant and that your individual nodes are highly available. To that end,
a production-level shard cluster should have the following:

- 3 :ref:`config servers <sharding-config-server>`, each residing on a separate node.

- For each shard, a three-member :term:`replica set <replica set>` consisting of:

- 3 :program:`mongod` replicas or

.. seealso:: ":doc:`/administration/replication-architectures`"
and ":doc:`/administration/replica-sets`."

- 2 :program:`mongod` replicas and a single
:program:`mongod` instance acting as an :term:`arbiter`.

.. optional::

All replica set configurations and options are available.

You may also choose to deploy a :ref:`hidden member
<replica-set-hidden-members>` for backups or a
:ref:`delayed member <replica-set-delayed-members>`.
- 3 :ref:`config servers <sharding-config-server>`, each residing on a separate system.

You might also keep a member of each replica set in a
geographically distinct data center in case the primary data
center becomes unavailable.

See ":doc:`/replication`" for more information on replication
and :term:`replica sets <replica set>`.

.. seealso:: The ":ref:`sharding-procedure-add-shard`" and
":ref:`sharding-procedure-remove-shard`" procedures for more
information.
- A 3-member :term:`replica set <replica set>` for each shard.

- :program:`mongos` instances. Typically, you will deploy a single
:program:`mongos` instance on every application server. Alternatively,
you may deploy several :program:`mongos` instances and let your application
connect to them via a load balancer.

.. seealso:: ":doc:`/administration/replication-architectures`"
and ":doc:`/administration/replica-sets`."

.. seealso:: The ":ref:`sharding-procedure-add-shard`" and
":ref:`sharding-procedure-remove-shard`" procedures for more
information.
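
A minimal sketch of how the processes in such a deployment might be
started. The hostnames, ports, and data paths here are placeholders,
not recommendations:

.. code-block:: sh

   # On each of the 3 config server systems:
   mongod --configsvr --port 27019 --dbpath /data/configdb

   # On each of the 3 members of a shard's replica set:
   mongod --replSet shard0 --port 27018 --dbpath /data/db

   # On every application server:
   mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019

After initiating each replica set, you would add it as a shard from a
:program:`mongo` shell connected to a :program:`mongos`:

.. code-block:: javascript

   sh.addShard("shard0/node1.example.net:27018")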

Sharded and Non-Sharded Data
----------------------------

@@ -118,3 +100,66 @@ created subsequently, may reside on any shard in the cluster.
.. [#overloaded-primary-term] The term "primary" in the context of
databases and sharding has nothing to do with the term
:term:`primary` in the context of :term:`replica sets <replica set>`.

Failover Scenarios within MongoDB
---------------------------------
Contributor comment:
this section is not publishable in its current form.

I'm revising now.


A properly deployed MongoDB shard cluster has no single point of
failure. This section describes potential points of failure within a
shard cluster and their recovery methods.

For reference, a properly deployed MongoDB shard cluster consists of:

- 3 :term:`config databases <config database>`,

- a 3-member :term:`replica set` for each shard, and

- :program:`mongos` running on each application server.

Potential failure scenarios:

- A :program:`mongos` instance or an application server fails.

Because each application server runs its own :program:`mongos`
instance, the cluster remains accessible to the other application
servers and the data is intact. :program:`mongos` is stateless, so
if it fails, no critical information is lost. When :program:`mongos`
restarts, it will retrieve a copy of the configuration from the
:term:`config database` and resume working.

Suggested course of action: restart the application server and/or
its :program:`mongos` instance.
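
Restarting a :program:`mongos` requires only the addresses of the
config servers. For example, with placeholder hostnames:

.. code-block:: sh

   mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019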

- A single :program:`mongod` instance in a shard suffers a failure.

If a single :program:`mongod` instance fails within a shard, a
:term:`secondary` member of the shard's :term:`replica set` can take
its place. Because each shard's replica set has two :term:`secondary`
members, each holding a complete copy of the shard's data, one of
them can be elected to replace a failed :term:`primary` member.

Suggested course of action: investigate the failure and replace the
failed member as soon as possible. The loss of additional members on
the same shard will reduce the availability and reliability of the
shard cluster's data set.
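
To confirm that a :term:`secondary` has taken over, you might connect
a :program:`mongo` shell to a surviving member of the shard's replica
set and check the state of each member:

.. code-block:: javascript

   rs.status()   // reports each member's state, e.g. PRIMARY or SECONDARY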

- All three replica set members of a shard fail.

All data within that shard will be unavailable, but the rest of the
shard cluster's data remains available to applications, and new data
can still be written to the other shards.

Suggested course of action: investigate the situation immediately.
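
To identify the unavailable shard, you might inspect the cluster from
a :program:`mongo` shell connected to a :program:`mongos`:

.. code-block:: javascript

   sh.status()   // lists the cluster's shards, databases, and chunks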

- A :term:`config database` server suffers a failure.

The :term:`config database` is deployed in a 3-member configuration
that uses two-phase commits to maintain synchronization between all
members. If one member fails, shard cluster operation will continue
as normal, but :ref:`chunk migration` will not occur.

Suggested course of action: replace the failed :term:`config
database` server as soon as possible. Without the ability to migrate
chunks, the shards will become unbalanced. The loss of additional
:term:`config database` servers will put the shard cluster's
metadata in jeopardy.
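
A replacement :term:`config database` server is started like the
failed one and must typically be reachable at the same hostname; the
data path here is a placeholder:

.. code-block:: sh

   mongod --configsvr --port 27019 --dbpath /data/configdb

You can also verify from a :program:`mongo` shell connected to a
:program:`mongos` whether the balancer is enabled:

.. code-block:: javascript

   sh.getBalancerState()   // true if the balancer is enabled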