Understanding Aggregation In MongoDB: Aggregation Pipeline and MapReduce

Applications today generate huge amounts of data, and most of it is raw. To run statistical analysis on this data, we first need to present it in a summarized form. This is where data aggregation comes in: it collects and summarizes the data for statistical analysis.

MongoDB offers three ways to aggregate the data in its collections: the aggregation pipeline, the MapReduce function, and single-purpose aggregation methods. In this article, we will learn how to use the aggregation pipeline and the MapReduce function. At the end of the article, I will show you how to write Python code that performs MapReduce on a MongoDB database.

Aggregation Pipeline:

For data aggregation, MongoDB provides a framework called the aggregation pipeline, which is modeled on the concept of data processing pipelines.
The aggregation pipeline is multistage. Documents pass through the stages in order, each stage transforms them, and the last stage returns the aggregated results.

MongoDB provides the db.collection.aggregate() method for running an aggregation pipeline in the mongo shell.

Let us see how to connect to a MongoDB cluster running in the cloud (MongoDB Atlas) and get a mongo shell.

1. Create a database user for the cluster. Refer to this article to know more.

2. Go to Clusters and click the Connect button.

3. You will be asked to choose a connection method. Choose the Connect with the mongo shell option.

4. If you don’t have the mongo shell installed, download and install it. Then copy the connection string (command).

5. Run the connection string in the terminal. It will prompt for a password. Enter the password of the database user you created (admin in this case) and press Enter.

Now, we are all set to run MongoDB commands in this mongo shell.

To insert multiple documents at once, MongoDB provides the db.collection.insertMany() method.
Run the following command to insert the data into the collection.

Let’s check if data is inserted.
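As a minimal sketch, the insert and the follow-up check could look like this from Python with PyMongo. The sample documents are invented for illustration (only the resource and status field names come from this article), and collection is assumed to be a PyMongo Collection object obtained elsewhere.

```python
# Hypothetical sample documents: the field names `resource` and `status`
# follow the article, the values are invented for illustration.
SAMPLE_DOCS = [
    {"resource": "printer", "status": "verified"},
    {"resource": "printer", "status": "verified"},
    {"resource": "scanner", "status": "verified"},
    {"resource": "scanner", "status": "pending"},
    {"resource": "projector", "status": "verified"},
]


def insert_and_check(collection, docs=SAMPLE_DOCS):
    """Insert `docs` with insert_many() and return the collection's document count.

    `collection` is assumed to be a pymongo Collection, for example
    MongoClient(uri)["mydb"]["myColl"] (the database name is hypothetical).
    """
    # Copy each dict because insert_many() adds an `_id` key to the dicts it receives.
    result = collection.insert_many([dict(d) for d in docs])
    print("inserted:", len(result.inserted_ids))
    # An empty filter counts every document in the collection.
    return collection.count_documents({})
```

In the mongo shell, the same two steps correspond to db.collection.insertMany() followed by a find() or count on the collection.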

The aggregation pipeline consists of multiple stages. Here, I will be using a two-stage aggregation pipeline.

First stage: $match is the first stage in this aggregation pipeline. It keeps only the documents whose status field is verified and passes them to the next stage.

Second stage: $group is the second stage. It groups the documents by the resource field and calculates the total number of documents for each resource.

Running this pipeline returns one document per resource, with the resource as its _id and the total count for that resource.
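As a rough sketch, the two stages can be written as a PyMongo-style pipeline and emulated in plain Python to show what each stage does. The output field name total and the sample documents are assumptions made for illustration, not the article's original data.

```python
from collections import Counter

# The two stages described above, in the form PyMongo's aggregate() accepts.
# The output field name `total` is an assumption.
pipeline = [
    {"$match": {"status": "verified"}},
    {"$group": {"_id": "$resource", "total": {"$sum": 1}}},
]


def emulate_pipeline(docs):
    """Pure-Python stand-in for the pipeline, to show what each stage does."""
    # Stage 1 ($match): keep only documents whose status is "verified".
    matched = [d for d in docs if d.get("status") == "verified"]
    # Stage 2 ($group): one output document per distinct resource, with a count.
    counts = Counter(d["resource"] for d in matched)
    # Sorted here only to make the result deterministic; MongoDB does not
    # guarantee the order of $group output.
    return [{"_id": r, "total": n} for r, n in sorted(counts.items())]


# Hypothetical input documents.
docs = [
    {"resource": "printer", "status": "verified"},
    {"resource": "printer", "status": "verified"},
    {"resource": "scanner", "status": "verified"},
    {"resource": "scanner", "status": "pending"},
    {"resource": "projector", "status": "verified"},
]

for row in emulate_pipeline(docs):
    print(row)
```

Against a real collection, the equivalent call would be list(collection.aggregate(pipeline)).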

You now have an idea of the aggregation framework in MongoDB and why aggregation is useful.
Let us see how to do aggregation with the help of MapReduce in MongoDB.

MapReduce in MongoDB:

MapReduce is used for reducing large volumes of raw data into meaningful aggregated results. It consists of two parts: the mapper, which filters and transforms the input into key-value pairs, and the reducer, which performs a summary operation on the values for each key, such as counting the number of words.

To perform MapReduce operations, MongoDB provides the db.collection.mapReduce() method. Note that map-reduce is deprecated as of MongoDB 5.0 in favor of the aggregation pipeline.

This method requires two functions: the mapper function and the reducer function. Optionally, we can store the output of mapReduce() in a new collection using the out option, and we can restrict the input documents using the query option.

Mapper function:

function() {
    emit(this.resource, 1);
}

This map function is applied to every document in the collection that matches the query. For each document, it emits a key-value pair: the resource as the key and 1 as the value.

After that, the reducer function is applied to the values emitted for each key.

Reducer function:

function(key, values) {
    return Array.sum(values);
}

This reducer function sums the values emitted for each key, which gives the total count of each resource.

The query option filters the documents so that only those with the verified status reach the mapper.

Finally, we will store the result of this map-reduce operation in a collection named total_resources using the out option.
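To make the flow concrete, here is a small plain-Python emulation of the query, map, group, and reduce steps. It is only a sketch of the idea, not how MongoDB implements it; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_fn(doc):
    """Mapper: emit one (key, value) pair per document, here (resource, 1)."""
    yield doc["resource"], 1


def reduce_fn(key, values):
    """Reducer: collapse all values emitted for one key into a single value."""
    return sum(values)


def emulate_map_reduce(docs, query):
    # The query filter is applied before the mapper sees a document.
    selected = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    # Collect the emitted values under their keys.
    grouped = defaultdict(list)
    for doc in selected:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce each group. MongoDB skips the reducer for keys that have a single
    # value; calling it anyway gives the same result for a sum.
    # The {_id, value} shape mirrors the documents written to the `out` collection.
    return [{"_id": k, "value": reduce_fn(k, v)} for k, v in sorted(grouped.items())]


# Hypothetical input documents.
docs = [
    {"resource": "printer", "status": "verified"},
    {"resource": "printer", "status": "verified"},
    {"resource": "scanner", "status": "verified"},
    {"resource": "scanner", "status": "pending"},
]

total_resources = emulate_map_reduce(docs, {"status": "verified"})
print(total_resources)
```

Because the mapper emits the number 1 for every document, summing the values for a key is the same as counting the verified documents for that resource.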

Note:
Map-reduce operations work on both sharded and unsharded collections, so the collection does not have to be sharded.

Now, let's run this mapReduce() function.

Command failed!

The error message says that the command is not allowed.

This is because of a limitation of the MongoDB Atlas free tier.

The Atlas free tier does not support some commands, and mapReduce() is one of them. You can get the list of all the commands unsupported in the Atlas free tier here.

So, what is the solution? How can we run map-reduce operations on a MongoDB collection?

The solution is to install MongoDB locally or upgrade your Atlas account. With a local installation of MongoDB, there is no restriction on the commands.

Note:
You can still use a remote MongoDB database. You just have to change the data source from local to the Atlas database in the local MongoDB installation.

I will not cover that in this article. It may be the topic of another one.

For MongoDB installation, refer to this documentation.

After installing MongoDB locally, we will run the map-reduce operations on this local database.

Insert the data into a local MongoDB collection.

Now, run the mapReduce() method.

Query succeeded!

This will insert the result of the map-reduce operation into the total_resources collection.

Let’s see what the total_resources collection contains.

The aggregated data is added to the total_resources collection.

Now, we have understood how the mapReduce() function works in MongoDB.

Let’s write a Python program for running the map-reduce operations on MongoDB.

Writing a program to perform MapReduce in Python

We will write the MapReduce program in Python. To use the MongoDB database from Python, we need a MongoDB driver.

pymongo is a MongoDB driver for using the MongoDB database from a Python program.

Let’s install the pymongo driver.

$ pip install pymongo

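The mapper and reducer functions that we will be writing are in JavaScript. To pass JavaScript code from Python, we can use the bson package (for example, bson.code.Code). This package is installed together with pymongo, so no separate installation is needed. Do not run pip install bson: that installs an unrelated third-party package that conflicts with the one shipped with pymongo.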

Now, we are all set to write a MapReduce program.

The program first inserts the records into the myColl collection.

After that, the MapReduce step counts the resources and stores the aggregated result in the myresult collection.

Finally, we have written Python code that performs MapReduce operations on a MongoDB collection.
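The following is a minimal sketch of such a program, not a definitive implementation. It assumes a local MongoDB server on the default port and a database named mydb (both hypothetical), and it sends the mapReduce database command through db.command(), because PyMongo 4 removed the older Collection.map_reduce() helper. The JavaScript functions are passed as plain strings on the assumption that the server accepts them in that form; if it does not, they can be wrapped in bson.code.Code instead.

```python
MAP_JS = "function() { emit(this.resource, 1); }"
REDUCE_JS = "function(key, values) { return Array.sum(values); }"

# Hypothetical records; only the field names come from the article.
RECORDS = [
    {"resource": "printer", "status": "verified"},
    {"resource": "printer", "status": "verified"},
    {"resource": "scanner", "status": "pending"},
]


def run_map_reduce(db, source="myColl", target="myresult"):
    """Run the mapReduce command on `source` and return the documents in `target`."""
    db.command(
        "mapReduce",
        source,                        # collection to read from
        map=MAP_JS,
        reduce=REDUCE_JS,
        query={"status": "verified"},  # only verified documents are mapped
        out=target,                    # collection that receives the result
    )
    return list(db[target].find())


def main():
    # Imported here so the helper above can be read and tested without a server.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local server
    db = client["mydb"]                                # hypothetical database name
    db["myColl"].insert_many([dict(r) for r in RECORDS])
    for doc in run_map_reduce(db):
        print(doc)
```

Calling main() runs the whole flow against the local server. Each document in myresult has the resource as its _id and the count in its value field.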

I hope you now have an idea of the usefulness of data aggregation and of the ways to perform aggregation on a MongoDB database. I would love to know if you found this article helpful. If you liked it, show some appreciation with applause.

For any help or suggestions connect with me on Twitter at @TheNameIsAnkush or find me on LinkedIn.

Tech blogger, researcher and integrator