MongoDB Map-Reduce Basics

Note: Updated on Feb. 25, 2011 for MongoDB v1.8

If you’re accustomed to working with relational databases, the thought of specifying aggregations with map-reduce may be a bit intimidating. Here, in the third in a series of articles on MongoDB aggregation, I explain map-reduce. After reading this, and with a little practice, you’ll be able to apply the map-reduce paradigm to a huge number of aggregation problems.

A comments example, of course

Let’s start with an example. Suppose we have a collection of comments with the following document structure:

{
    text: "lmao! great article!",
    author: 'kbanker',
    votes: 2
}

Here we have a comment authored by ‘kbanker’ with two votes. Now, we want to find the total number of votes each comment author has earned across the entire comment collection. It’s a problem easily solved with map-reduce.

Mapping

As its name suggests, map-reduce essentially involves two operations. The first, specified by our map function, formats our data as a series of key-value pairs. Our key is the comment author’s name (this makes sense only if this username is unique). Our value is a document containing the number of votes. We generate these key-value pairs by emitting them. See below:

// Our key is author's userame; 
// our value, the number of votes for the current comment.
var map = function() {
    emit(this.author, {votes: this.votes});
};

When we run map-reduce, the map function is applied to each document. This results in a collection of key-value pairs. What do we do with these results? It turns out that we don’t even have to think about them because they’re automatically passed on to our reduce function.

Reducing

Specifically, the reduce function will be invoked with two arguments: a key and an array of values associated with that key. Returning to our example, we can imagine our reduce function receiving something like this:

reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);

Given that, it’s easy to come up with a reduce function for tallying these votes:

// Add up all the votes for each key.
var reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(doc) {
        sum += doc.votes;
    });
    return {votes: sum};
};

Results

So how do we we run it? From the shell, we pass our map and reduce functions to the mapReduce helper. Note that as of MongoDB v1.8, you must specify an output collection name.

// Running mapReduce.
var op = db.comments.mapReduce(map, reduce, {out: "mr_results"});

{
    "result" : "mr_results",
    "timeMillis" : 8,
    "counts" : {
        "input" : 6,
        "emit" : 6,
        "output" : 2
    },
    "ok" : 1
}

Notice that running the mapReduce helper returns stats on the operation; the results of the operation itself are stored in the collection specified, here called “mr_results”.

The other stats also prove informative. First is the operation time in milliseconds. Next are the number of input documents, the number of times we called emit (this can be more than once per document), and the number of output documents. Finally, we can be assured that the operation has succeeded because “ok” is 1.

Of course, what we really want are the results. To get them, just query the output collection:

// Getting the results from the shell
db[op.result].find();

{ "_id" : "hwaet", "value" : { "votes" : 21 } }
{ "_id" : "kbanker", "value" : { "votes" : 13 } }

Output types: merge, reduce, and inline

MongoDB v1.8 introduces a couple changes to map-reduce’s output. First, temporary collections are no more. All output now must go into a specifically named collection or be returned inline.

If you specify the collection name only, then any existing collection of the same name will be overwritten.

But there are now two new options that allow you to preserve the existing collection either by merging the new set of results or folding them in with the reduce function. Let’s take a closer look at these.

The syntax for merging is simple:

db.comments.mapReduce(map, reduce, {out: {merge: "mr_results"}}); A merge will overwrite any existing keys with the newly-generated keys.

The syntax for reducing is similar:

db.comments.mapReduce(map, reduce, {out: {reduce: "mr_results"}});

Reducing works like this: whenever a key in the new results already exists in the output collection, the reduce function is applied to both keys, and the return value replaces the existing key.

Definitions are always hard to visualize; an example should help to clarify the difference between merge and reduce. So suppose an earlier map-reduce places the following two results into an output collection.

{ "_id" : "hwaet", "value" : { "votes" : 5 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }

Now imagine that we run map-reduce again with a more recent data set, producing the following results:

{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

If we run a map-reduce merge, the final collection will contain these values:

{ "_id" : "hwaet", "value" : { "votes" : 1 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

With reduce, the values will be these:

{ "_id" : "hwaet", "value" : { "votes" : 6 } }
{ "_id" : "kbanker", "value" : { "votes" : 5 } }
{ "_id" : "jones", "value" : { "votes" : 100 } }

Notice that the values for the “hwaet” key are reduced using the reduce function. In this case, this simply means adding the votes together.

The final new map-reduce option allows you to return the results rather than writing them to a collection. Here’s how you use it in JavaScript:

db.comments.mapReduce(map, reduce, {out: {inline: 1}});

If you’re using :out => {:inline => 1} with the Ruby driver, be sure that you also specify the :raw => true parameter to prevent the Collection#map_reduce method from attempting to return an instance of Mongo::Collection.

How do I execute map-reduce from Ruby?

Like this:

# Running map-reduce from Ruby (irb) assuming
# that @comments references the comments collection

# Specify the map and reduce functions in JavaScript, as strings
>> map    = "function() { emit(this.author, {votes: this.votes}); }"
>> reduce = "function(key, values) { " +
    "var sum = 0; " +
    "values.forEach(function(doc) { " +
    " sum += doc.votes; " +
    "}); " +
    "return {votes: sum}; " +
"};"

# Pass those to the map_reduce helper method
@results = @comments.map_reduce(map, reduce, :out => "mr_results")

# Since this method returns an instantiated results collection,
# we just have to query that collection and iterate over the cursor.
>> @results.find().to_a
=> [
    {"_id" => "hwaet",   "value"=>{"votes"=>21.0}},
    {"_id" => "kbanker", "value"=>{"votes"=>13.0}}
]

Practice

If you’ve followed along, you should understand the basics of map-reduce in MongoDB. For all the details on options, see the docs. For extra practice, fire up the MongoDB shell and experiment away.

Original: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/

A+ a-
Clip in Evernote