Skip to content

PHPLIB-53 Add documentation for mapReduce #419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 25, 2017

Conversation

kvwalker
Copy link
Contributor

type: boolean
description: |
Specifies whether to convert intermediate data into BSON format between the execution
of the map and reduce functions. The default is false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with other doc pages, let's wrap boolean literals true and false with two back-ticks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: if it's not too much trouble to setup in your editor, most of the docs pages wrap at 80 characters. I noticed that some of these lines ran a bit long, and paragraphs in MongoDBCollection-mapReduce.txt wrapped well before 80.

name: scope
type: document
description: |
Specifies global variables that are accessible in the map, reduce and finalize functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oxford comma after "reduce, "

type: boolean
description: |
Specifies whether to include the timing information in the result information.
The default is true.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true


The :manual:`mapReduce </core/map-reduce>` command
allows you to run map-reduce aggregation operations
over a collection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this is based on the intro paragraph on db.collection.mapReduce(), I think we can link to both the command reference and "core" manual section here.

If we're referring to the mapReduce command, let's use:

:manual:`mapReduce </reference/command/mapReduce>`

Later in the sentence "map-reduce" can then link to the page in "/core".


.. code-block:: php

function mapReduce($map, $reduce, $out, array $options = []) MongoDB\MapReduceResult
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a ": " before the return type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that $map and $reduce may get type hints after other PR changes.

result documents must be within the :limit:`BSON Document Size` limit,
which is currently 16 megabytes. For additional information on limits
and restrictions on map-reduce operations, see the
:manual:`mapReduce </reference/command/mapReduce>` reference page.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this last sentence comes straight from https://docs.mongodb.com/manual/core/map-reduce/#map-reduce-behavior, but the mapReduce command page doesn't really have any additional info. The Output Inline section merely repeats the BSON document limit, so I propose we just drop the last sentence here.

or return the results inline. If you write map-reduce output to a
collection, you can perform subsequent map-reduce operations on the
same input collection that merge replace, merge, or reduce new results
with previous results. See :manual:`mapReduce </core/map-reduce>` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As done for the intro section, the link to "core" can be "Map-Reduce" since we're referring to the title of the linked page and not the command reference.


Example
-------
This example will get all users with at least one "sale" event and count the number of sales for each of these users.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RST doesn't seem to mind, but a link break between the heading and following paragraph would be consistent with other doc sources.

$out = array("merge" => "eventCounts");

$sales = new MapReduce($this->getDatabaseName(), $this->getCollectionName(), $map, $reduce, $out);
$sales->execute($this->getPrimaryServer());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that #392 never added a Collection::mapReduce() method for the operation. Would you mind taking a crack at adding that?

I'd like to steer clear of talking about the operation classes in our documentation, so this example would ideally just use the collection method. $this->getPrimaryServer() refers to a method in the test suite, so that will naturally get removed.

Existing Collection methods should provide examples of how to merge in special options such as readConcern and writeConcern.

Lastly, I would propose we change the operation class to check map, reduce, and finalize as an instance of MongoDB\BSON\JavascriptInterface instead of MongoDB\BSON\Javascript. That type hint can also be enforced on the Collection method for $map and $reduce. Since $out allows multiple types, that won't need a type hint (we can reply on the operation's constructor to validate).


When the time comes to rewrite the example using the Collection method, the examples for find() and the various findOneAnd___() methods might be helpful. Those are all based on existing data from one of the example data collections, so if you can cook up a mapReduce command for one of those data sets, I might suggest running it in inline mode and then printing output as is done for find() (iterating on the MapReduceResult and dumping each document).

$sales = new MapReduce($this->getDatabaseName(), $this->getCollectionName(), $map, $reduce, $out);
$sales->execute($this->getPrimaryServer());

This will create a MapReduceResult.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the example ends up printing output and using The output would then resemble:: boilerplate, we can drop this line.

Otherwise, we'll want to use the :phpclass: notation so we can link to a forthcoming documentation page for the result class.

-------------

A :phpclass:`MongoDB\\MapReduceResult` object, which allows for iteration of mapReduce results
irrespective of the output method (e.g. inline, collection) via the IteratorAggregate interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the following to link to IteratorAggregate:

:php:`IteratorAggregate <iteratoraggregate>`

Return Values
-------------

A :phpclass:`MongoDB\\MapReduceResult` object, which allows for iteration of mapReduce results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also realize that we don't have an existing page for MapReduceResult. We can take care of that in a later PR (a usage example on this page is going to be more helpful for users anyway). Once a stub for the class exists, this syntax should start linking to it automatically.

The library reference has sections for both write result and enumeration classes, and MapReduceResult overlaps with both. Down the line, I imagine we can consolidate both of those pages into a single, large, "result classes" page.

$collection = (new MongoDB\Client)->test->zips;

$map = new Javascript('function() { emit(this.state, this.pop); }');
$reduce = new Javascript('function(key, values) { return Array.sum(values) }');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MongoDB\BSON\Javascript for the class names. Alternatively, you can add a use statement at the top of the example.


$map = new Javascript('function() { emit(this.state, this.pop); }');
$reduce = new Javascript('function(key, values) { return Array.sum(values) }');
$out = '{ inline: 1 }';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant ['inline' => 1] here. If a string is provided for $out, it's assumed to be the collection name (see: out options).

$reduce = new Javascript('function(key, values) { return Array.sum(values) }');
$out = '{ inline: 1 }';

$populations = new MapReduce($map, $reduce, $out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$collection->mapReduce(...)?

*/
public function mapReduce(JavascriptInterface $map, JavascriptInterface $reduce, $out, array $options = [])
{
$operation = new MapReduce($this->databaseName, $this->collectionName, $map, $reduce, $out, $options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a use statement for MongoDB\Operation\MapReduce.

string(2) "AZ"
["value"]=>
float(3665228)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent example!

* @param array $options Command options
* @return MapReduceResult
* @throws UnsupportedException if options are not supported by the selected server
* @throws InvalidArgumentException for parameter/option parsing errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@throws UnexpectedValueException if the command response was malformed can be added here. Similar to what we have in aggregate().

public function mapReduce(JavascriptInterface $map, JavascriptInterface $reduce, $out, array $options = [])
{
$operation = new MapReduce($this->databaseName, $this->collectionName, $map, $reduce, $out, $options);
$server = $this->manager->selectServer(new ReadPreference(ReadPreference::RP_PRIMARY));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read preference is configurable, but we will want to coerce a primary read preference if the out option is not inline. That will be similar to how aggregate() checks if the $out pipeline operator is being used and, if so, forces a primary read preference. After we establish the read preference, we then use that to select a server upon which the operation class will be executed.

Since the out option can be a document (i.e. PHP array or object), I'd suggest looking at is_last_pipeline_operator_out() in functions.php for a resilient way to check. Something like the following should work:

// In this case, $out is probably a string (i.e. collection name)
if ( ! is_array($out) && ! is_object($out)) {
    return false;
}

// Cast a possible object to an array so we can conveniently check the first field name
$out = (array) $out;

return key($out) === 'inline';

Additionally, the type map, read concern, and write concern can all be inherited from the collection (if unspecified in the options). See aggregate() for an example of how we merge those in. Notably, read concern depends on the server's wire version, so we only inject a default value if we know it's supported.

@@ -810,6 +811,28 @@ public function listIndexes(array $options = [])
}

/**
* Runs map-reduce aggregation operations as defined by the $map and $reduce parameters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest "Executes a map-reduce aggregation on the collection." based on aggregate().

@@ -78,7 +79,7 @@ name: verbose
type: boolean
description: |
Specifies whether to include the timing information in the result information.
The default is true.
The default is `true`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here re: two back-ticks.

Specifies whether to convert intermediate data into BSON format between the execution
of the map and reduce functions. The default is false.
Specifies whether to convert intermediate data into BSON format between the
execution of the map and reduce functions. The default is `false`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think RST requires two back-ticks, unlike Markdown, which allows one.


// Check if the out option is inline because we will want to coerce a primary read preference if not
if ( ! is_array($out) && ! is_object($out)) {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole logic for determining if $out is inline would be better relegated to a function. A private method at the bottom of the Collection class can work (it doesn't need to be in functions.php if we won't need to use it outside of this class).

As-is, this logic might return false and abort the mapReduce() method if $out is a collection name string (a valid use case).


$server = $this->manager->selectServer($options['readPreference']);

// Set options if not set already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest copying the same comment from aggregate(), as the explanation for readConcern is the same (see: SERVER-20214).

/* A "majority" read concern is not compatible with the $out stage, so
 * avoid providing the Collection's read concern if it would conflict.
 */

Also, a majority read concern is only a problem if $out is not inline (i.e. mapReduce will write results to an output collection). To that end, we probably want to track $isOutInline (or its inverse, $hasOutputCollection) and implement similar logic as aggregate() does with the $hasOutStage variable. For aggregate(), $hasOutStage means the command will write to a collection.

$options['readConcern'] = $this->readConcern;
}

if ( ! isset($options['typeMap']) && ( ! isset($options['useCursor']) || $options['useCursor'])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapReduce doesn't have a useCursor option, so this extra logic isn't relevant.

Aggregate happens to be limited because we never implemented support for type maps when it returns results inline without using a cursor. MapReduce::createGetIteratorCallable() actually gets around that limitation, which I didn't think of at the time when implementing Aggregate. That may be something to fix down the line for Aggregate, depending on how long we'll keep supporting non-cursor replies (that support may end when we remove support for MongoDB 2.6).


$options = ["readPreference" => ReadPreference::RP_SECONDARY_PREFERRED];

$operation = new MapReduce($this->getDatabaseName(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be testing the Operation class directly, as the test doesn't interact with the Collection object at all.

I think we can just remove the testMapReduce() test, as the read preference option isn't testing much. In most environments, this is just going to select the primary server.

$this->createFixtures(3);

$map = new Javascript('function() { emit(this._id, this.x); }');
$reduce = new Javascript('function(key, values) { return Array.sum(values); }');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest emitting on 1 instead of this._id, so that we collect all documents into a single bucket. Then, the reduce function can compute the sum of all x values. We can later assert that the result of map reduce is a single document with [ '_id' => 1, 'x' => 66 ]. Something like:

$expected = [
    [ '_id' => 1, 'x' => 66 ],
];

$this->assertSameDocuments($expected, $mapReduceResult);

$options['readPreference'] = new ReadPreference(ReadPreference::RP_PRIMARY);
}

$server = $this->manager->selectServer($options['readPreference']);

// Set options if not set already
if ( ! isset($options['readConcern']) && ! ($this->readConcern->getLevel() === ReadConcern::MAJORITY) && \MongoDB\server_supports_feature($server, self::$wireVersionForReadConcern)) {
/* A "majority" read concern is not compatible with $out inline, so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"$out" was a reference to the aggregation pipeline operator. "with inline output, so" may read better.

/* A "majority" read concern is not compatible with $out inline, so
* avoid providing the Collection's read concern if it would conflict.
*/
if ( ! isset($options['readConcern']) && ! ($isOutInline && $this->readConcern->getLevel() === ReadConcern::MAJORITY) && \MongoDB\server_supports_feature($server, self::$wireVersionForReadConcern)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a logic error in the middle condition:

! ($isOutInline && $this->readConcern->getLevel() === ReadConcern::MAJORITY)

If we apply De Morgan's Law to distribute the negation, we get:

( ! $isOutInline || $this->readConcern->getLevel() !== ReadConcern::MAJORITY )

We can write this in prose as:

mapReduce is writing to a collection OR read concern is not "majority"

And back in the original condition:

  • the user hasn't specified a read concern AND
  • (mapReduce is writing to a collection OR read concern is not "majority") AND
  • the server supports read concerns

Given this, it's possible that the Collection could have a majority read concern and we'd still inherit it as a default when mapReduce is writing to a collection. Bad news bears.

While we can fix this by negating $isOutInline within the middle conditional, it's already hard to read as it is. I'd suggest changing the variable to track the inverse (e.g. $hasOutputCollection as mentioned in #419 (comment)). Then, this condition basically copies the aggregate() snippet with one variable rename:

/* A "majority" read concern is not compatible with writing output to a
 * collection, so avoid providing the Collection's read concern if it
 * would conflict.
 */
if ( ! isset($options['readConcern']) &&
     ! ($hasOutputCollection && $this->readConcern->getLevel() === ReadConcern::MAJORITY) &&
    \MongoDB\server_supports_feature($server, self::$wireVersionForReadConcern)) {
    $options['readConcern'] = $this->readConcern;
}


if (key($out) !== 'inline') {
if ($this->isOutInline($out)) {
$isOutInline = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues here:

  1. isOutInline is used later in a conditional, but its assignment here is nested within a condition. As such, it's possible we'd get an undefined variable error later on.
  2. This condition forces a primary read preference is output is inline (i.e. not writing to a collection), which is backwards.

If we make the swap to $hasOutputCollection, I'd suggest assigning $hasOutputCollection = ! $this->isOutInline($out); in the method's top scope and then replacing this condition with if ($hasOutputCollection) to force a primary read preference.

@jmikola
Copy link
Member

jmikola commented Oct 25, 2017

LGTM. I noticed there were a few documentation changes mixed in with the last four commits related to Collection::mapReduce(), so feel free to squash everything into a single commit for merging.

@kvwalker kvwalker merged commit d7ba99e into mongodb:master Oct 25, 2017
@kvwalker kvwalker deleted the PHPLIB-53 branch October 25, 2017 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants