DOCS-1731: Split manage chunks in sharded cluster tutorial page #1284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
118 changes: 93 additions & 25 deletions source/core/bulk-inserts.txt
@@ -6,47 +6,115 @@ Bulk Inserts in MongoDB

.. default-domain:: mongodb

.. contents::
   :backlinks: none
   :local:

In some situations you may need to insert or ingest a large amount of
data into a MongoDB database. These *bulk inserts* have some
special considerations that are different from other write
operations.

.. TODO: section on general write operation considerations?

Use the ``insert()`` Method
---------------------------

The :method:`insert() <db.collection.insert()>` method, when passed an
array of documents, performs a bulk insert, and inserts each document
atomically. Bulk inserts can significantly increase performance by
amortizing :ref:`write concern <write-operations-write-concern>` costs.

.. versionadded:: 2.2
   :method:`insert() <db.collection.insert()>` in the :program:`mongo`
   shell gained support for bulk inserts in version 2.2.
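
For illustration only, a minimal :program:`mongo` shell sketch of a bulk
insert; the ``users`` collection and the documents are made up for this
example:

.. code-block:: javascript

   // Passing an array to insert() sends the documents as a single bulk insert.
   db.users.insert( [
      { name : "ada" ,   score : 1 },
      { name : "bob" ,   score : 2 },
      { name : "carol" , score : 3 }
   ] );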

In the :doc:`drivers </applications/drivers>`, you can configure write
concern for batches rather than on a per-document level.

Drivers have a ``ContinueOnError`` option in their insert operation, so
that the bulk operation will continue to insert remaining documents in a
batch even if an insert fails.

.. note::

   .. versionadded:: 2.0
      Support for ``ContinueOnError`` depends on version 2.0 of the
      core :program:`mongod` and :program:`mongos` components.

   If multiple errors occur during a bulk insert, clients only receive
   the last error generated.

.. seealso::

   :doc:`Driver documentation </applications/drivers>` for details
   on performing bulk inserts in your application. Also see
   :doc:`/core/import-export`.

Bulk Inserts on Sharded Clusters
--------------------------------

While ``ContinueOnError`` is optional on unsharded clusters, all bulk
operations to a :term:`sharded collection <sharded cluster>` run with
``ContinueOnError``, which cannot be disabled.

Large bulk insert operations, including initial data inserts or routine
data import, can affect :term:`sharded cluster` performance. For
bulk inserts, consider the following strategies:

Pre-Split the Collection
~~~~~~~~~~~~~~~~~~~~~~~~

If the sharded collection is empty, then the collection has only
one initial :term:`chunk`, which resides on a single shard.
MongoDB must then take time to receive data, create splits, and
distribute the split chunks to the available shards. To avoid this
performance cost, you can pre-split the collection, as described in
:doc:`/tutorial/split-chunks-in-sharded-cluster`.
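
For a rough sense of what pre-splitting involves, the following sketch
assumes an empty collection ``records.logs`` sharded on a numeric
``userid`` field; the namespace and split points are illustrative, not a
recommendation:

.. code-block:: javascript

   // Shard the empty collection on { userid: 1 }.
   sh.enableSharding( "records" );
   sh.shardCollection( "records.logs", { userid : 1 } );

   // Create split points before loading any data, so the balancer can
   // distribute the resulting empty chunks across the shards.
   for ( var i = 10000; i < 100000; i += 10000 ) {
      db.adminCommand( { split : "records.logs", middle : { userid : i } } );
   }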

Insert to Multiple ``mongos``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To parallelize import processes, send insert operations to more than
one :program:`mongos` instance. Pre-split empty collections first as
described in :doc:`/tutorial/split-chunks-in-sharded-cluster`.
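
As one hedged sketch of this pattern, the following shell snippet
round-robins insert batches across two hypothetical :program:`mongos`
hosts (``mongos0.example.net`` and ``mongos1.example.net``); the
database, collection, and document shapes are placeholders:

.. code-block:: javascript

   // Open one connection per mongos router.
   var conns = [ new Mongo( "mongos0.example.net:27017" ),
                 new Mongo( "mongos1.example.net:27017" ) ];

   // Send each batch of documents to the routers in turn.
   for ( var i = 0; i < 100; i++ ) {
      var batch = [];
      for ( var j = 0; j < 1000; j++ ) {
         batch.push( { batch : i , seq : j , payload : "x" } );
      }
      conns[ i % conns.length ].getDB( "records" )
                               .getCollection( "logs" )
                               .insert( batch );
   }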

Avoid Monotonic Throttling
~~~~~~~~~~~~~~~~~~~~~~~~~~

If your shard key increases monotonically during an insert, then all
inserted data goes to the last chunk in the collection, which will
always end up on a single shard. Therefore, the insert capacity of the
cluster will never exceed the insert capacity of that single shard.

If your insert volume is larger than what a single shard can process,
and if you cannot avoid a monotonically increasing shard key, then
consider the following modifications to your application:

- Reverse the binary bits of the shard key. This preserves the
  information and avoids correlating insertion order with increasing
  sequence of values.

- Swap the first and last 16-bit words to "shuffle" the inserts.

.. example:: The following example, in C++, swaps the leading and
   trailing 16-bit word of :term:`BSON` :term:`ObjectIds <ObjectId>`
   generated so that they are no longer monotonically increasing.

   .. code-block:: cpp

      using namespace mongo;
      OID make_an_id() {
        OID x = OID::gen();
        const unsigned char *p = x.getData();
        swap( (unsigned short&) p[0], (unsigned short&) p[10] );
        return x;
      }

      void foo() {
        // create an object
        BSONObj o = BSON( "_id" << make_an_id() << "x" << 3 << "name" << "jane" );
        // now we may insert o into a sharded collection
      }

.. seealso:: :ref:`sharding-shard-key` for information
   on choosing a shard key. Also see :ref:`Shard Key
   Internals <sharding-internals-shard-keys>` (in particular,
   :ref:`sharding-internals-operations-and-reliability`).

4 changes: 2 additions & 2 deletions source/core/sharded-cluster-high-availability.txt
@@ -54,8 +54,8 @@ Three distinct :program:`mongod` instances provide the :term:`config
database` using a special two-phase commit to maintain consistent state
between these :program:`mongod` instances. Cluster operation will
continue as normal, but :ref:`chunk migration <sharding-balancing>` cannot
occur and the cluster can create no new :doc:`chunk splits
</tutorial/split-chunks-in-sharded-cluster>`. Replace the config server as soon as
possible. If all config databases become unavailable, the
cluster can become inoperable.

4 changes: 2 additions & 2 deletions source/core/sharding-chunk-migration.txt
@@ -22,8 +22,8 @@ chunks of a sharded collection evenly among shards. Migrations may be
either:

- Manual. Only use manual migration in limited cases, such as
  to distribute data during bulk inserts. See :doc:`Migrating Chunks
  Manually </tutorial/migrate-chunks-in-sharded-cluster>` for more details.

- Automatic. The :doc:`balancer </core/sharding-balancing>` process
automatically migrates chunks when there is an uneven distribution of
4 changes: 2 additions & 2 deletions source/core/sharding-chunk-splitting.txt
@@ -23,8 +23,8 @@ Chunk Size
.. todo:: link this section to <glossary:chunk size>

The default :term:`chunk` size in MongoDB is 64 megabytes. You can
:doc:`increase or reduce the chunk size
</tutorial/modify-chunk-size-in-sharded-cluster>`, mindful of its effect on the
cluster's efficiency.

#. Small chunks lead to a more even distribution of data at the
4 changes: 2 additions & 2 deletions source/faq/diagnostics.txt
@@ -269,7 +269,7 @@ As a related problem, the system will split chunks only on
inserts or updates, which means that if you configure sharding and do not
continue to issue insert and update operations, the database will not
create any chunks. You can either wait until your application inserts
data *or* :doc:`split chunks manually </tutorial/split-chunks-in-sharded-cluster>`.

Finally, if your shard key has a low :ref:`cardinality
<sharding-shard-key-cardinality>`, MongoDB may not be able to create
@@ -339,7 +339,7 @@ consider the following options, depending on the nature of the impact:
#. If the balancer is always migrating chunks to the detriment of
overall cluster performance:

- You may want to attempt :doc:`decreasing the chunk size </tutorial/modify-chunk-size-in-sharded-cluster>`
  to limit the size of the migration.

- Your cluster may be over capacity, and you may want to attempt to
2 changes: 1 addition & 1 deletion source/faq/sharding.txt
@@ -55,7 +55,7 @@ is to:

- configure sharding using a more ideal shard key.

- :doc:`pre-split </tutorial/create-chunks-in-sharded-cluster>` the shard
  key range to ensure initial even distribution.

- restore the dumped data into MongoDB.
2 changes: 1 addition & 1 deletion source/reference/command/moveChunk.txt
@@ -55,6 +55,6 @@ side effects.
In most cases allow the balancer to create and balance chunks
in sharded clusters.
See
:doc:`/tutorial/create-chunks-in-sharded-cluster` for more information.

.. admin-only
2 changes: 1 addition & 1 deletion source/reference/config-database.txt
@@ -280,7 +280,7 @@ Collections
sharding configuration settings:

- Chunk size. To change chunk size,
  see :doc:`/tutorial/modify-chunk-size-in-sharded-cluster`.

- Balancer status. To change status,
  see :ref:`sharding-balancing-disable-temporarily`.
14 changes: 7 additions & 7 deletions source/reference/configuration-options.txt
@@ -213,7 +213,7 @@ Settings

*Default:* None.

Specify a file location to hold the :term:`PID` or process ID of the
:program:`mongod` process. Useful for tracking the :program:`mongod` process in
combination with the :setting:`fork` setting.

@@ -230,7 +230,7 @@ Settings
information. This option is only useful for the connection between
replica set members.

.. seealso:: ":ref:`Replica Set Security <replica-set-security>`"
.. seealso:: :ref:`Replica Set Security <replica-set-security>`

.. setting:: nounixsocket

@@ -556,7 +556,7 @@ Settings
profiler is on, :program:`mongod` writes to the
``system.profile`` collection.

.. seealso:: ":setting:`profile`"
.. seealso:: :setting:`profile`

.. setting:: smallfiles

@@ -698,9 +698,9 @@ Replication Options
sets. Specify a replica set name as an argument to this set. All
hosts must have the same set name.

.. seealso:: ":doc:`/replication`,"
":doc:`/administration/replica-set-deployment`," and
":doc:`/reference/replica-configuration`"
.. seealso:: :doc:`/replication`,
:doc:`/administration/replica-set-deployment`, and
:doc:`/reference/replica-configuration`

.. setting:: oplogSize

@@ -897,7 +897,7 @@ Sharded Cluster Options
:setting:`chunkSize` *only* sets the chunk size when initializing
the cluster for the first time. If you modify the run-time option
later, the new value will have no effect. See the
":ref:`sharding-balancing-modify-chunk-size`" procedure if you
:doc:`/tutorial/modify-chunk-size-in-sharded-cluster` procedure if you
need to change the chunk size on an existing sharded cluster.

.. setting:: localThreshold
2 changes: 1 addition & 1 deletion source/reference/glossary.txt
@@ -567,7 +567,7 @@ Glossary
expedites the initial distribution of documents in :term:`sharded
cluster` by manually dividing the collection rather than waiting
for the MongoDB :term:`balancer` to do so. See
:doc:`/tutorial/create-chunks-in-sharded-cluster`.

primary
In a :term:`replica set`, the primary member is the current
6 changes: 3 additions & 3 deletions source/reference/program/mongos.txt
@@ -50,7 +50,7 @@ Options
runtime-configurations. While the options are equivalent and
accessible via the other command line arguments, the configuration
file is the preferred method for runtime configuration of
mongod. See the ":doc:`/reference/configuration-options`" document
mongod. See the :doc:`/reference/configuration-options` document
for more information about these options.

Not all configuration options for :program:`mongod` make sense in
@@ -153,7 +153,7 @@ Options

.. option:: --pidfilepath <path>

Specify a file location to hold the :term:`PID` or process ID of the
:program:`mongos` process. Useful for tracking the :program:`mongos` process in
combination with the :option:`mongos --fork` option.

@@ -239,7 +239,7 @@ Options
This option *only* sets the chunk size when initializing the
cluster for the first time. If you modify the run-time option
later, the new value will have no effect. See the
":ref:`sharding-balancing-modify-chunk-size`" procedure if you
:doc:`/tutorial/modify-chunk-size-in-sharded-cluster` procedure if you
need to change the chunk size on an existing sharded cluster.

.. option:: --ipv6
2 changes: 1 addition & 1 deletion source/tutorial/configure-sharded-cluster-balancer.txt
@@ -46,7 +46,7 @@ details, see :ref:`sharding-chunk-size`.
Changing the default chunk size affects chunks that are processed during
migrations and auto-splits but does not retroactively affect all chunks.

To configure default chunk size, see :doc:`modify-chunk-size-in-sharded-cluster`.
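
For orientation, the procedure in that tutorial amounts to updating the
``settings`` collection in the config database through a :program:`mongos`;
a sketch, with 32 megabytes chosen only as an example value:

.. code-block:: javascript

   // Connect to a mongos and switch to the config database.
   use config

   // Set the cluster-wide default chunk size, in megabytes.
   db.settings.save( { _id : "chunksize", value : 32 } )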

.. _sharded-cluster-config-max-shard-size:

66 changes: 66 additions & 0 deletions source/tutorial/create-chunks-in-sharded-cluster.txt
@@ -0,0 +1,66 @@
==================================
Create Chunks in a Sharded Cluster
==================================

.. default-domain:: mongodb

Pre-splitting the chunk ranges in an empty sharded collection allows
clients to insert data into an already partitioned collection. In most
situations a :term:`sharded cluster` will create and distribute chunks
automatically without user intervention. However, in a limited number
of cases, MongoDB cannot create enough chunks or distribute
data fast enough to support required throughput. For example:

- If you want to partition an existing data collection that resides on a
  single shard.

- If you want to ingest a large volume of data into a cluster that isn't
  balanced, or where the ingestion of data will lead to data imbalance.
  For example, monotonically increasing or decreasing shard keys insert
  all data into a single chunk.

These operations are resource intensive for several reasons:

- Chunk migration requires copying all the data in the chunk from one shard to
  another.

- MongoDB can migrate only a single chunk at a time.

- MongoDB creates splits only after an insert operation.

.. warning::

   Only pre-split an empty collection. If a collection already has data,
   MongoDB automatically splits the collection's data when you enable
   sharding for the collection. Subsequent attempts to manually create
   splits can lead to unpredictable chunk ranges and sizes as well as
   inefficient or ineffective balancing behavior.

To create chunks manually, use the following procedure:

#. Split empty chunks in your collection by manually performing
   the :dbcommand:`split` command on chunks.

   .. example::

      To create chunks for documents in the ``myapp.users``
      collection using the ``email`` field as the :term:`shard key`,
      use the following operation in the :program:`mongo` shell:

      .. code-block:: javascript

         for ( var x=97; x<97+26; x++ ){
           for( var y=97; y<97+26; y+=6 ) {
             var prefix = String.fromCharCode(x) + String.fromCharCode(y);
             db.runCommand( { split : "myapp.users" , middle : { email : prefix } } );
           }
         }

      This assumes a collection size of 100 million documents.

For information on the balancer and automatic distribution of
chunks across shards, see :ref:`sharding-balancing-internals`
and :ref:`sharding-chunk-migration`. For
information on manually migrating chunks, see
:doc:`/tutorial/migrate-chunks-in-sharded-cluster`.
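
After running a pre-split like the one above, one way to check the result
is to inspect the cluster from a :program:`mongos`; a sketch, reusing the
``myapp.users`` namespace from the example:

.. code-block:: javascript

   // Summarize shards, databases, and chunk distribution.
   sh.status();

   // Or list this collection's chunk ranges directly from the config database.
   db.getSiblingDB( "config" ).chunks.find( { ns : "myapp.users" } ).sort( { min : 1 } );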
