Skip to content

Map reduce 1 #437

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 30, 2012
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions source/aggregation.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,4 +25,5 @@ The following is the outline of the aggregation documentation:
applications/aggregation
tutorial/aggregation-examples
reference/aggregation
applications/map-reduce

2 changes: 2 additions & 0 deletions source/applications.txt
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ The following documents outline basic application development topics:
- :doc:`/applications/replication`
- :doc:`/applications/indexes`
- :doc:`/applications/aggregation`
- :doc:`/applications/map-reduce`

.. _application-patterns:

Expand All @@ -50,5 +51,6 @@ The following documents provide patterns for developing application features:
.. toctree::
:maxdepth: 2

tutorial/troubleshoot-map-reduce
tutorial/perform-two-phase-commits
tutorial/expire-data
287 changes: 287 additions & 0 deletions source/applications/map-reduce.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,287 @@
==========
Map-Reduce
==========

.. default-domain:: mongodb

Map-reduce operations can handle complex aggregation tasks. To perform
map-reduce operations, MongoDB provides the :dbcommand:`mapReduce`
command and, in the :program:`mongo` shell, the
:method:`db.collection.mapReduce()` wrapper method.

.. contents:: This overview will cover:
:backlinks: none
:local:
:depth: 1

For many simple aggregation tasks, see the :doc:`aggregation framework
</applications/aggregation>`.

.. _map-reduce-examples:

Map-Reduce Examples
-------------------

This section provides some map-reduce examples in the :program:`mongo`
shell using the :method:`db.collection.mapReduce()` method:

.. code-block:: javascript

db.collection.mapReduce(
<map>,
<reduce>,
{
<out>,
<query>,
<sort>,
<limit>,
<finalize>,
<scope>,
<jsMode>,
<verbose>
}
)

For more information on the parameters, see the
:method:`db.collection.mapReduce()` reference page .

.. include:: /includes/examples-map-reduce.rst
:start-after: map-reduce-document-prototype-begin

.. _map-reduce-incremental:

Incremental Map-Reduce
----------------------

If the map-reduce dataset is constantly growing, then rather than
performing the map-reduce operation over the entire dataset each time
you want to run map-reduce, you may want to perform an incremental
map-reduce.

To perform incremental map-reduce:

#. Run a map-reduce job over the current collection and output the
result to a separate collection.

#. When you have more data to process, run subsequent map-reduce job
with:

- the ``<query>`` parameter that specifies conditions that match
*only* the new documents.

- the ``<out>`` parameter that specifies the ``reduce`` action to
merge the new results into the existing output collection.

Consider the following example where you schedule a map-reduce
operation on a ``sessions`` collection to run at the end of each day.

**Data Setup**

The ``sessions`` collection contains documents that log users' session
each day and can be simulated as follows:

.. code-block:: javascript

db.sessions.save( { userid: "a", ts: ISODate('2011-11-03 14:17:00'), length: 95 } );
db.sessions.save( { userid: "b", ts: ISODate('2011-11-03 14:23:00'), length: 110 } );
db.sessions.save( { userid: "c", ts: ISODate('2011-11-03 15:02:00'), length: 120 } );
db.sessions.save( { userid: "d", ts: ISODate('2011-11-03 16:45:00'), length: 45 } );

db.sessions.save( { userid: "a", ts: ISODate('2011-11-04 11:05:00'), length: 105 } );
db.sessions.save( { userid: "b", ts: ISODate('2011-11-04 13:14:00'), length: 120 } );
db.sessions.save( { userid: "c", ts: ISODate('2011-11-04 17:00:00'), length: 130 } );
db.sessions.save( { userid: "d", ts: ISODate('2011-11-04 15:37:00'), length: 65 } );

**Initial Map-Reduce of Current Collection**

#. Define the ``<map>`` function that maps the ``userid`` to an
object that contains the fields ``userid``, ``total_time``, ``count``,
and ``avg_time``:

.. code-block:: javascript

var mapFunction = function() {
var key = this.userid;
var value = {
userid: this.userid,
total_time: this.length,
count: 1,
avg_time: 0
};

emit( key, value );
};

#. Define the corresponding ``<reduce>`` function with two arguments
``key`` and ``values`` to calculate the total time and the count.
The ``key`` corresponds to the ``userid``, and the ``values`` is an
array whose elements corresponds to the individual objects mapped to the
``userid`` in the ``mapFunction``.

.. code-block:: javascript

var reduceFunction = function(key, values) {

var reducedObject = {
userid: key,
total_time: 0,
count:0,
avg_time:0
};

values.forEach( function(value) {
reducedObject.total_time += value.total_time;
reducedObject.count += value.count;
}
);
return reducedObject;
};

#. Define ``<finalize>`` function with two arguments ``key`` and
``reducedValue``. The function modifies the ``reducedValue`` document
to add another field ``average`` and returns the modified document.

.. code-block:: javascript

var finalizeFunction = function (key, reducedValue) {

if (reducedValue.count > 0)
reducedValue.avg_time = reducedValue.total_time / reducedValue.count;

return reducedValue;
};

#. Perform map-reduce on the ``session`` collection using the
``mapFunction``, the ``reduceFunction``, and the
``finalizeFunction`` functions. Output the results to a collection
``session_stat``. If the ``session_stat`` collection already exists,
the operation will replace the contents:

.. code-block:: javascript

db.sessions.mapReduce( mapFunction,
reduceFunction,
{
out: { reduce: "session_stat" },
finalize: finalizeFunction
}
)

**Subsequent Incremental Map-Reduce**

Assume the next day, the ``sessions`` collection grows by the following documents:

.. code-block:: javascript

db.sessions.save( { userid: "a", ts: ISODate('2011-11-05 14:17:00'), length: 100 } );
db.sessions.save( { userid: "b", ts: ISODate('2011-11-05 14:23:00'), length: 115 } );
db.sessions.save( { userid: "c", ts: ISODate('2011-11-05 15:02:00'), length: 125 } );
db.sessions.save( { userid: "d", ts: ISODate('2011-11-05 16:45:00'), length: 55 } );

5. At the end of the day, perform incremental map-reduce on the
``sessions`` collection but use the ``query`` field to select only the
new documents. Output the results to the collection ``session_stat``,
but ``reduce`` the contents with the results of the incremental
map-reduce:

.. code-block:: javascript

db.sessions.mapReduce( mapFunction,
reduceFunction,
{
query: { ts: { $gt: ISODate('2011-11-05 00:00:00') } },
out: { reduce: "session_stat" },
finalize: finalizeFunction
}
);

.. _map-reduce-temporay-collection:

Temporary Collection
--------------------

The map-reduce operation uses a temporary collection during processing.
At completion, the temporary collection will be renamed to the
permanent name atomically. Thus, one can perform a map-reduce operation
periodically with the same target collection name without worrying
about a temporary state of incomplete data. This is very useful when
generating statistical output collections on a regular basis.

.. _map-reduce-sharded-cluster:

Sharded Cluster
---------------

Sharded Input
~~~~~~~~~~~~~

If the input collection is sharded, :program:`mongos` will
automatically dispatch the map-reduce job to each shard to be executed
in parallel. There is no special option required. :program:`mongos`
will wait for jobs on all shards to finish.

Sharded Output
~~~~~~~~~~~~~~

By default the output collection will not be sharded. The process is:

- :program:`mongos` dispatches a map-reduce finish job to the shard
that will store the target collection.

- The target shard will pull results from all other shards, run a final
reduce/finalize, and write to the output.

- If using the sharded option in the ``<out>`` parameter, the output will be
sharded using ``_id`` as the shard key.

.. versionchanged:: 2.2

- If the output collection does not exist, the collection is created
and sharded on the ``_id`` field. Even if empty, its initial chunks
are created based on the result of the first step of the map-reduce
operation.

- :program:`mongos` dispatches, in parallel, a map-reduce finish job
to every shard that owns a chunk.

- Each shard will pull the results it owns from all other shards, run a
final reduce/finalize, and write to the output collection.

.. note::

- During additional map-reduce jobs, chunk splitting will be done as needed.

- Balancing of chunks for the output collection is automatically
prevented during post-processing to avoid concurrency issues.

Prior to version 2.1:

- :program:`mongos` retrieves the results from each shard, doing a
merge sort to order the results, and performs a reduce/finalize as
needed. :program:`mongos` then writes the result to the output
collection in sharded mode.

- Only a small amount of memory is required even for large datasets.

- Shard chunks do not get automatically split and migrated during
insertion. Manual intervention is required until the chunks are
granular and balanced.

.. warning::

Sharded output for mapreduce has been overhauled in v2.2. Its use in
earlier versions is not recommended.

.. _map-reduce-additional-references:

Additional References
---------------------

.. seealso::

- :doc:`/tutorial/troubleshoot-map-reduce`

- :wiki:`Map-Reduce Concurrency
<How+does+concurrency+work#Howdoesconcurrencywork-MapReduce>`

- `MapReduce, Geospatial Indexes, and Other Cool Features <http://www.slideshare.net/mongosf/mapreduce-geospatial-indexing-and-other-cool-features-kristina-chodorow>`_ - Kristina Chodorow at MongoSF (April 2010)
Loading