DOCS-458 draft pre-splitting doc #165

Merged
merged 4 commits into from
Aug 31, 2012
76 changes: 35 additions & 41 deletions draft/faq-sharding-addition.txt
@@ -1,54 +1,48 @@
What are the best ways to successfully insert larger volumes of data into a sharded collection?
-----------------------------------------------------------------------------------------------
For high-volume inserts, when is it necessary to first pre-split data?
----------------------------------------------------------------------

- what is pre-splitting
Whether to pre-split before a high-volume insert depends on the
:term:`shard key`, the existing distribution of :term:`chunks <chunk>`,
and how evenly distributed the insert operation is.

In sharded environments, MongoDB distributes data into :term:`chunks
<chunk>`, each defined by a range of shard key values. Pre-splitting is a command run
prior to data insertion that specifies the shard key values on which to split up chunks.
In the following cases, we recommend pre-splitting before a large insert:

- Pre-splitting is useful before large inserts into a sharded collection when:
- Inserting data into an empty collection

1. inserting data into an empty collection
If a collection is empty, the database takes time to determine the
optimal key distribution. If you insert many documents in rapid
succession, MongoDB initially directs writes to a single chunk, which
can affect performance. Predefining splits improves write performance
in the early stages of a bulk import by eliminating the database's
"learning" period.

If a collection is empty, the database takes time to determine the optimal key
distribution. If you insert many documents in rapid succession, MongoDB will initially
direct writes to a single chunk, potentially having significant impacts on performance.
Predefining splits may improve write performance in the early stages of a bulk import by
eliminating the database's "learning" period.
- Data is not evenly distributed

2. data is not evenly distributed
Even if the sharded collection contains existing documents balanced
over multiple chunks, :term:`pre-splitting` is beneficial if the write
operation itself isn't evenly distributed, i.e., if the inserts
include shard-key values that are contained on only a small number of
chunks. By pre-splitting and using an increasing shard key, you can
prevent writes from monopolizing a single :term:`shard`.

Even if the sharded collection was previously populated with documents and contains multiple
chunks, pre-splitting may be beneficial if the write operation isn't evenly distributed, in
other words, if the inserts have shard key values contained on a small number of chunks.
- Monotonically increasing shard key

3. monotonically increasing shard key
If you attempt to insert data with monotonically increasing shard
keys, the writes will always occur on the last chunk in the
collection. Predefining splits helps to cycle a large write operation
around the cluster; however, pre-splitting in this instance will not
prevent consecutive inserts from hitting a single shard.

If you attempt to insert data with monotonically increasing shard keys, the writes will
always hit the last chunk in the collection. Predefining splits may help to cycle a large
write operation around the cluster; however, pre-splitting in this instance will not
prevent consecutive inserts from hitting a single shard.
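
The routing behavior described in the cases above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models chunks as ranges over an integer shard key, with the lower bound of each range inclusive, and the helper names ``chunk_index`` and ``writes_per_chunk`` are hypothetical. It does not talk to a real cluster.

```python
from bisect import bisect_right
from collections import Counter
from typing import Iterable, List


def chunk_index(boundaries: List[int], key: int) -> int:
    """Return the index of the chunk that would receive `key`.

    `boundaries` is a sorted list of split points. Chunk 0 holds keys
    below boundaries[0]; chunk i holds keys in
    [boundaries[i - 1], boundaries[i]); the last chunk holds everything
    at or above the final boundary.
    """
    return bisect_right(boundaries, key)


def writes_per_chunk(boundaries: List[int], keys: Iterable[int]) -> Counter:
    """Count how many of the inserted keys land in each chunk."""
    return Counter(chunk_index(boundaries, k) for k in keys)


# Hypothetical split points for an existing collection: four chunks.
existing = [100, 200, 300]

# Keys spread over the whole range are shared by all four chunks.
spread = writes_per_chunk(existing, range(0, 400))

# Monotonically increasing keys above the last boundary all land in
# the final chunk.
monotonic = writes_per_chunk(existing, range(300, 400))

# Pre-splitting the future key range lets the load move from chunk to
# chunk over time, but each run of consecutive inserts still targets
# one chunk at a time.
presplit = writes_per_chunk([100, 200, 300, 400, 500], range(300, 600))
```

In this model, ``spread`` assigns 100 writes to each of the four chunks, ``monotonic`` assigns all 100 writes to the last chunk, and ``presplit`` assigns 100 writes each to chunks 3, 4, and 5 in turn.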
Pre-splitting might *not* be necessary in the following cases:

- when does it not matter
- If data insertion is not rapid, MongoDB may have enough time to split
and migrate chunks without affecting performance.

If data insertion is not rapid, MongoDB may have enough time to split and migrate chunks without
impacts on performance. In addition, if the collection already has chunks with an even key
distribution, pre-splitting may not be necessary.
- If the collection already has chunks with an even key distribution,
pre-splitting may not be necessary.

See the ":doc:`/tutorial/inserting-documents-into-a-sharded-collection`" tutorial for more
information.
For more information, see :doc:`/tutorial/inserting-documents-into-a-sharded-collection`.
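
As a rough illustration of predefining splits, the following sketch computes evenly spaced split points for an integer shard key and builds one ``split`` command document per point, using the ``middle`` form of the ``split`` admin command. The namespace ``records.users``, the field ``user_id``, the key range, and both helper functions are assumptions made for this example. The code only builds the command documents; sending them to a ``mongos`` is left to whichever driver or shell you use.

```python
from typing import Any, Dict, List


def even_split_points(low: int, high: int, num_chunks: int) -> List[int]:
    """Return the boundaries that divide [low, high) into num_chunks ranges."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if high <= low:
        raise ValueError("high must be greater than low")
    width = high - low
    points = {low + (width * i) // num_chunks for i in range(1, num_chunks)}
    # Drop duplicates and degenerate points when the range is narrow.
    return sorted(p for p in points if low < p < high)


def split_commands(namespace: str, key_field: str,
                   points: List[int]) -> List[Dict[str, Any]]:
    """Build one {split: ..., middle: ...} document per split point.

    The command name is inserted first because command documents are
    order-sensitive.
    """
    return [{"split": namespace, "middle": {key_field: p}} for p in points]


# Hypothetical example: four chunks over user_id values 0..99.
points = even_split_points(0, 100, 4)
commands = split_commands("records.users", "user_id", points)
```

Evenly spaced points only help if the inserted keys are themselves roughly uniform over the range; for a skewed key distribution, the split points would need to come from a sample of the data instead.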


Is it necessary to pre-split data before high volume inserts into a sharded collection?
---------------------------------------------------------------------------------------

The answer depends on the shard key, the existing distribution of chunks, and how
evenly distributed the insert operation is. If a collection is empty prior to a
bulk insert, the database will take time to determine the optimal key
distribution. Predefining splits improves write performance in the early stages
of a bulk import.

Pre-splitting is also important if the write operation isn't evenly distributed.
When using an increasing shard key, for example, pre-splitting data can prevent
writes from hammering a single shard.
.. SK, I flipped the above sentence, which could instead read:
.. See :doc:`/tutorial/inserting-documents-into-a-sharded-collection` for more information.
.. I prefer the former, but I think you prefer the latter. Let me know. -BG