DOCS-1731: Split manage chunks in sharded cluster tutorial page #1284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
118 changes: 93 additions & 25 deletions source/core/bulk-inserts.txt
@@ -6,47 +6,115 @@ Bulk Inserts in MongoDB

.. default-domain:: mongodb

.. contents::
   :backlinks: none
   :local:

In some situations you may need to insert or ingest a large amount of
data into a MongoDB database. These *bulk inserts* have some
special considerations that are different from other write
operations.

.. TODO: section on general write operation considerations?

Use the ``insert()`` Method
---------------------------

The :method:`insert() <db.collection.insert()>` method, when passed an
array of documents, performs a bulk insert, and inserts each document
atomically. Bulk inserts can significantly increase performance by
amortizing :ref:`write concern <write-operations-write-concern>` costs.

.. versionadded:: 2.2
   :method:`insert() <db.collection.insert()>` in the :program:`mongo`
   shell gained support for bulk inserts in version 2.2.
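
For illustration only, a minimal :program:`mongo` shell sketch of a bulk
insert; the ``users`` collection and the documents are made up for this
example:

.. code-block:: javascript

   // Passing an array to insert() sends the documents as a single bulk insert.
   db.users.insert( [
      { name : "ada" ,   score : 1 },
      { name : "bob" ,   score : 2 },
      { name : "carol" , score : 3 }
   ] );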

In the :doc:`drivers </applications/drivers>`, you can configure write
concern for batches rather than on a per-document level.

Drivers have a ``ContinueOnError`` option in their insert operation, so
that the bulk operation will continue to insert remaining documents in a
batch even if an insert fails.

.. note::

   .. versionadded:: 2.0
      Support for ``ContinueOnError`` depends on version 2.0 of the
      core :program:`mongod` and :program:`mongos` components.

   If multiple errors occur during a bulk insert, clients only receive
   the last error generated.

.. seealso::

   :doc:`Driver documentation </applications/drivers>` for details
   on performing bulk inserts in your application. Also see
   :doc:`/core/import-export`.

Bulk Inserts on Sharded Clusters
--------------------------------

While ``ContinueOnError`` is optional on unsharded clusters, all bulk
operations to a :term:`sharded collection <sharded cluster>` run with
``ContinueOnError``, which cannot be disabled.

Large bulk insert operations, including initial data inserts or routine
data import, can affect :term:`sharded cluster` performance. For
bulk inserts, consider the following strategies:

Pre-Split the Collection
~~~~~~~~~~~~~~~~~~~~~~~~

If the sharded collection is empty, then the collection has only
one initial :term:`chunk`, which resides on a single shard.
MongoDB must then take time to receive data, create splits, and
distribute the split chunks to the available shards. To avoid this
performance cost, you can pre-split the collection, as described in
:doc:`/tutorial/split-chunks-in-sharded-cluster`.
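
For a rough sense of what pre-splitting involves, the following sketch
assumes an empty collection ``records.logs`` sharded on a numeric
``userid`` field; the namespace and split points are illustrative, not a
recommendation:

.. code-block:: javascript

   // Shard the empty collection on { userid: 1 }.
   sh.enableSharding( "records" );
   sh.shardCollection( "records.logs", { userid : 1 } );

   // Create split points before loading any data, so the balancer can
   // distribute the resulting empty chunks across the shards.
   for ( var i = 10000; i < 100000; i += 10000 ) {
      db.adminCommand( { split : "records.logs", middle : { userid : i } } );
   }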

Insert to Multiple ``mongos``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To parallelize import processes, send insert operations to more than
one :program:`mongos` instance. Pre-split empty collections first as
described in :doc:`/tutorial/split-chunks-in-sharded-cluster`.
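
As one hedged sketch of this pattern, the following shell snippet
round-robins insert batches across two hypothetical :program:`mongos`
hosts (``mongos0.example.net`` and ``mongos1.example.net``); the
database, collection, and document shapes are placeholders:

.. code-block:: javascript

   // Open one connection per mongos router.
   var conns = [ new Mongo( "mongos0.example.net:27017" ),
                 new Mongo( "mongos1.example.net:27017" ) ];

   // Send each batch of documents to the routers in turn.
   for ( var i = 0; i < 100; i++ ) {
      var batch = [];
      for ( var j = 0; j < 1000; j++ ) {
         batch.push( { batch : i , seq : j , payload : "x" } );
      }
      conns[ i % conns.length ].getDB( "records" )
                               .getCollection( "logs" )
                               .insert( batch );
   }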

Avoid Monotonic Throttling
~~~~~~~~~~~~~~~~~~~~~~~~~~

If your shard key increases monotonically during an insert, then all
inserted data goes to the last chunk in the collection, which will
always end up on a single shard. Therefore, the insert capacity of the
cluster will never exceed the insert capacity of that single shard.

If your insert volume is larger than what a single shard can process,
and if you cannot avoid a monotonically increasing shard key, then
consider the following modifications to your application:

- Reverse the binary bits of the shard key. This preserves the
  information and avoids correlating insertion order with increasing
  sequence of values.

- Swap the first and last 16-bit words to "shuffle" the inserts.

.. example:: The following example, in C++, swaps the leading and
   trailing 16-bit word of :term:`BSON` :term:`ObjectIds <ObjectId>`
   generated so that they are no longer monotonically increasing.

   .. code-block:: cpp

      using namespace mongo;
      OID make_an_id() {
        OID x = OID::gen();
        const unsigned char *p = x.getData();
        swap( (unsigned short&) p[0], (unsigned short&) p[10] );
        return x;
      }

      void foo() {
        // create an object
        BSONObj o = BSON( "_id" << make_an_id() << "x" << 3 << "name" << "jane" );
        // now we may insert o into a sharded collection
      }

.. seealso:: :ref:`sharding-shard-key` for information
   on choosing a shard key. Also see :ref:`Shard Key
   Internals <sharding-internals-shard-keys>` (in particular,
   :ref:`sharding-internals-operations-and-reliability`).

4 changes: 2 additions & 2 deletions source/core/sharded-cluster-high-availability.txt
@@ -54,8 +54,8 @@ Three distinct :program:`mongod` instances provide the :term:`config
database` using a special two-phase commit to maintain consistent state
between these :program:`mongod` instances. Cluster operation will
continue as normal, but :ref:`chunk migration <sharding-balancing>` cannot
occur and the cluster can create no new :doc:`chunk splits
</tutorial/split-chunks-in-sharded-cluster>`. Replace the config server as soon as
possible. If all config databases become unavailable, the
cluster can become inoperable.

4 changes: 2 additions & 2 deletions source/core/sharding-chunk-migration.txt
@@ -22,8 +22,8 @@ chunks of a sharded collection evenly among shards. Migrations may be
either:

- Manual. Only use manual migration in limited cases, such as
  to distribute data during bulk inserts. See :doc:`Migrating Chunks
  Manually </tutorial/migrate-chunks-in-sharded-cluster>` for more details.

- Automatic. The :doc:`balancer </core/sharding-balancing>` process
automatically migrates chunks when there is an uneven distribution of
4 changes: 2 additions & 2 deletions source/core/sharding-chunk-splitting.txt
@@ -23,8 +23,8 @@ Chunk Size
.. todo:: link this section to <glossary:chunk size>

The default :term:`chunk` size in MongoDB is 64 megabytes. You can
:doc:`increase or reduce the chunk size
</tutorial/modify-chunk-size-in-sharded-cluster>`, mindful of its effect on the
cluster's efficiency.

#. Small chunks lead to a more even distribution of data at the
4 changes: 2 additions & 2 deletions source/faq/diagnostics.txt
@@ -269,7 +269,7 @@ As a related problem, the system will split chunks only on
inserts or updates, which means that if you configure sharding and do not
continue to issue insert and update operations, the database will not
create any chunks. You can either wait until your application inserts
data *or* :doc:`split chunks manually </tutorial/split-chunks-in-sharded-cluster>`.

Finally, if your shard key has a low :ref:`cardinality
<sharding-shard-key-cardinality>`, MongoDB may not be able to create
@@ -339,7 +339,7 @@ consider the following options, depending on the nature of the impact:
#. If the balancer is always migrating chunks to the detriment of
overall cluster performance:

- You may want to attempt :doc:`decreasing the chunk size </tutorial/modify-chunk-size-in-sharded-cluster>`
  to limit the size of the migration.

- Your cluster may be over capacity, and you may want to attempt to
2 changes: 1 addition & 1 deletion source/faq/sharding.txt
@@ -55,7 +55,7 @@ is to:

- configure sharding using a more ideal shard key.

- :doc:`pre-split </tutorial/create-chunks-in-sharded-cluster>` the shard
  key range to ensure initial even distribution.

- restore the dumped data into MongoDB.
2 changes: 1 addition & 1 deletion source/reference/command/moveChunk.txt
@@ -55,6 +55,6 @@ side effects.
In most cases allow the balancer to create and balance chunks
in sharded clusters.
See
:doc:`/tutorial/create-chunks-in-sharded-cluster` for more information.

.. admin-only
2 changes: 1 addition & 1 deletion source/reference/config-database.txt
@@ -280,7 +280,7 @@ Collections
sharding configuration settings:

- Chunk size. To change chunk size,
  see :doc:`/tutorial/modify-chunk-size-in-sharded-cluster`.

- Balancer status. To change status,
  see :ref:`sharding-balancing-disable-temporarily`.
14 changes: 7 additions & 7 deletions source/reference/configuration-options.txt
@@ -213,7 +213,7 @@ Settings

*Default:* None.

Specify a file location to hold the :term:`PID` or process ID of the
:program:`mongod` process. Useful for tracking the :program:`mongod` process in
combination with the :setting:`fork` setting.

@@ -230,7 +230,7 @@ Settings
information. This option is only useful for the connection between
replica set members.

.. seealso:: ":ref:`Replica Set Security <replica-set-security>`"
.. seealso:: :ref:`Replica Set Security <replica-set-security>`

.. setting:: nounixsocket

@@ -556,7 +556,7 @@ Settings
profiler is on, :program:`mongod` writes to the
``system.profile`` collection.

.. seealso:: ":setting:`profile`"
.. seealso:: :setting:`profile`

.. setting:: smallfiles

@@ -698,9 +698,9 @@ Replication Options
sets. Specify a replica set name as an argument to this set. All
hosts must have the same set name.

.. seealso:: ":doc:`/replication`,"
":doc:`/administration/replica-set-deployment`," and
":doc:`/reference/replica-configuration`"
.. seealso:: :doc:`/replication`,
:doc:`/administration/replica-set-deployment`, and
:doc:`/reference/replica-configuration`

.. setting:: oplogSize

@@ -897,7 +897,7 @@ Sharded Cluster Options
:setting:`chunkSize` *only* sets the chunk size when initializing
the cluster for the first time. If you modify the run-time option
later, the new value will have no effect. See the
":ref:`sharding-balancing-modify-chunk-size`" procedure if you
:doc:`/tutorial/modify-chunk-size-in-sharded-cluster` procedure if you
need to change the chunk size on an existing sharded cluster.

.. setting:: localThreshold
2 changes: 1 addition & 1 deletion source/reference/glossary.txt
@@ -567,7 +567,7 @@ Glossary
expedites the initial distribution of documents in :term:`sharded
cluster` by manually dividing the collection rather than waiting
for the MongoDB :term:`balancer` to do so. See
:doc:`/tutorial/create-chunks-in-sharded-cluster`.

primary
In a :term:`replica set`, the primary member is the current
6 changes: 3 additions & 3 deletions source/reference/program/mongos.txt
@@ -50,7 +50,7 @@ Options
runtime-configurations. While the options are equivalent and
accessible via the other command line arguments, the configuration
file is the preferred method for runtime configuration of
mongod. See the ":doc:`/reference/configuration-options`" document
mongod. See the :doc:`/reference/configuration-options` document
for more information about these options.

Not all configuration options for :program:`mongod` make sense in
@@ -153,7 +153,7 @@ Options

.. option:: --pidfilepath <path>

Specify a file location to hold the :term:`PID` or process ID of the
:program:`mongos` process. Useful for tracking the :program:`mongos` process in
combination with the :option:`mongos --fork` option.

@@ -239,7 +239,7 @@ Options
This option *only* sets the chunk size when initializing the
cluster for the first time. If you modify the run-time option
later, the new value will have no effect. See the
":ref:`sharding-balancing-modify-chunk-size`" procedure if you
:doc:`/tutorial/modify-chunk-size-in-sharded-cluster` procedure if you
need to change the chunk size on an existing sharded cluster.

.. option:: --ipv6
2 changes: 1 addition & 1 deletion source/tutorial/configure-sharded-cluster-balancer.txt
@@ -46,7 +46,7 @@ details, see :ref:`sharding-chunk-size`.
Changing the default chunk size affects chunks that are processed during
migrations and auto-splits but does not retroactively affect all chunks.

To configure default chunk size, see :doc:`modify-chunk-size-in-sharded-cluster`.
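
For orientation, the procedure in that tutorial amounts to updating the
``settings`` collection in the config database through a :program:`mongos`;
a sketch, with 32 megabytes chosen only as an example value:

.. code-block:: javascript

   // Connect to a mongos and switch to the config database.
   use config

   // Set the cluster-wide default chunk size, in megabytes.
   db.settings.save( { _id : "chunksize", value : 32 } )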

.. _sharded-cluster-config-max-shard-size:

66 changes: 66 additions & 0 deletions source/tutorial/create-chunks-in-sharded-cluster.txt
@@ -0,0 +1,66 @@
==================================
Create Chunks in a Sharded Cluster
==================================

.. default-domain:: mongodb

Pre-splitting the chunk ranges in an empty sharded collection allows
clients to insert data into an already partitioned collection. In most
situations a :term:`sharded cluster` will create and distribute chunks
automatically without user intervention. However, in a limited number
of cases, MongoDB cannot create enough chunks or distribute
data fast enough to support required throughput. For example:

- If you want to partition an existing data collection that resides on a
  single shard.

- If you want to ingest a large volume of data into a cluster that isn't
  balanced, or where the ingestion of data will lead to data imbalance.
  For example, monotonically increasing or decreasing shard keys insert
  all data into a single chunk.

These operations are resource intensive for several reasons:

- Chunk migration requires copying all the data in the chunk from one shard to
  another.

- MongoDB can migrate only a single chunk at a time.

- MongoDB creates splits only after an insert operation.

.. warning::

   Only pre-split an empty collection. If a collection already has data,
   MongoDB automatically splits the collection's data when you enable
   sharding for the collection. Subsequent attempts to manually create
   splits can lead to unpredictable chunk ranges and sizes as well as
   inefficient or ineffective balancing behavior.

To create chunks manually, use the following procedure:

#. Split empty chunks in your collection by manually performing
   the :dbcommand:`split` command on chunks.

   .. example::

      To create chunks for documents in the ``myapp.users``
      collection using the ``email`` field as the :term:`shard key`,
      use the following operation in the :program:`mongo` shell:

      .. code-block:: javascript

         for ( var x=97; x<97+26; x++ ){
           for( var y=97; y<97+26; y+=6 ) {
             var prefix = String.fromCharCode(x) + String.fromCharCode(y);
             db.runCommand( { split : "myapp.users" , middle : { email : prefix } } );
           }
         }

      This assumes a collection size of 100 million documents.

For information on the balancer and automatic distribution of
chunks across shards, see :ref:`sharding-balancing-internals`
and :ref:`sharding-chunk-migration`. For
information on manually migrating chunks, see
:doc:`/tutorial/migrate-chunks-in-sharded-cluster`.
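
After running a pre-split like the one above, one way to check the result
is to inspect the cluster from a :program:`mongos`; a sketch, reusing the
``myapp.users`` namespace from the example:

.. code-block:: javascript

   // Summarize shards, databases, and chunk distribution.
   sh.status();

   // Or list this collection's chunk ranges directly from the config database.
   db.getSiblingDB( "config" ).chunks.find( { ns : "myapp.users" } ).sort( { min : 1 } );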
