
Lab Workbook: MongoDB GridFS, Aggregation for Data Analysis, and Sharding

Last edited 191 days ago by Peter Sigurdson

Section 1: MongoDB's GridFS system for storing and retrieving large files


Theoretical Overview:

MongoDB's GridFS system is used to store and retrieve large files in a database system. It allows you to access information from portions of large files without having to load the whole file into memory.
GridFS stores files in two collections: fs.files for metadata and fs.chunks for the file content. Each file is split into smaller chunks (255 kB each by default) and stored in the chunks collection.
The metadata collection stores information about the file, such as the filename, content type, and file size.
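To make the chunking idea concrete, here is a small plain-JavaScript sketch (an illustration only, not the driver's implementation; it uses a 256-byte chunk size instead of GridFS's 255 kB default):

```javascript
// Sketch of how GridFS splits a file into fixed-size, numbered chunks
// and reassembles it. Chunk size is shrunk to 256 bytes for the example.
const CHUNK_SIZE = 256;

// Split a buffer into numbered chunks, like documents in fs.chunks
function splitIntoChunks(buffer) {
  const chunks = [];
  for (let n = 0; n * CHUNK_SIZE < buffer.length; n++) {
    chunks.push({ n: n, data: buffer.slice(n * CHUNK_SIZE, (n + 1) * CHUNK_SIZE) });
  }
  return chunks;
}

// Reassemble the original file by sorting chunks on n and concatenating
function reassemble(chunks) {
  return Buffer.concat(chunks.sort((a, b) => a.n - b.n).map(c => c.data));
}

const file = Buffer.alloc(1000, 7);           // a 1000-byte "file"
const chunks = splitIntoChunks(file);         // 4 chunks: 256 + 256 + 256 + 232 bytes
console.log(chunks.length);                   // 4
console.log(reassemble(chunks).equals(file)); // true
```

Because each chunk records its sequence number n, a reader can fetch only the chunks covering the byte range it needs instead of loading the whole file.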

Practical Exercise:
Create a sample database and a sample file to use in the exercise.
use mydb
// Create the metadata document and keep its _id for the chunk documents
var result = db.fs.files.insertOne({ filename: "sample_file.txt", contentType: "text/plain", length: 1024 })
var file = Buffer.alloc(1024)
for (var i = 0; i < 1024; i++) { file[i] = i % 256 }
db.fs.chunks.insertOne({ files_id: result.insertedId, n: 0, data: file })

Retrieve a section of the file using GridFS.
// Read the chunks in order and write them to a local file (mongosh runs on Node.js)
const fsmod = require('fs')
const writeStream = fsmod.createWriteStream('./output.txt')
db.fs.chunks.find({ files_id: result.insertedId }).sort({ n: 1 }).forEach(function(chunk) {
  writeStream.write(Buffer.from(chunk.data.buffer)) // chunk.data is BinData; .buffer holds the raw bytes
})
writeStream.end()

Concept Review Questions:

What is the maximum file size that can be stored in MongoDB's GridFS system?
What are the advantages of using GridFS over a file system for storing large files in a database system?

Section 2: Aggregation in MongoDB for data analysis

Theoretical Overview:

Aggregation in MongoDB is a framework for data analysis that allows you to perform complex queries on a collection and return the results in a structured format.
Aggregation pipelines consist of stages, each of which performs a specific operation on the data.
The output of one stage serves as the input to the next stage. There are many operators that can be used in the pipeline to perform various operations, such as filtering, grouping, and projecting.
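As a mental model (plain JavaScript, not MongoDB's implementation), a pipeline can be pictured as a chain of functions, each taking an array of documents and returning a new one:

```javascript
// Illustrative model of an aggregation pipeline: each stage is a function
// from an array of documents to an array of documents, and the output of
// one stage feeds the next.
const docs = [
  { item: "apple", qty: 5 },
  { item: "banana", qty: 10 },
  { item: "orange", qty: 15 },
];

// $match-like stage: keep documents where qty >= 10
const match = rows => rows.filter(d => d.qty >= 10);

// $project-like stage: reshape each document
const project = rows => rows.map(d => ({ item: d.item, big: d.qty > 12 }));

// Run the "pipeline" by threading the data through the stages in order
const result = [match, project].reduce((rows, stage) => stage(rows), docs);
console.log(result);
// [ { item: 'banana', big: false }, { item: 'orange', big: true } ]
```

The analogy is loose (MongoDB streams documents and optimizes stage order internally), but it captures the stage-to-stage data flow.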

Practical Exercise:

Create a sample database and a sample collection to use in the exercise.

use mydb
db.sales.insertMany([
  { _id: 1, item: "apple", qty: 5, price: 0.5 },
  { _id: 2, item: "banana", qty: 10, price: 0.25 },
  { _id: 3, item: "orange", qty: 15, price: 0.75 },
  { _id: 4, item: "peach", qty: 20, price: 1 }
])

Use aggregation to calculate the total revenue from sales.

db.sales.aggregate([
  { $project: { revenue: { $multiply: [ "$qty", "$price" ] } } },
  { $group: { _id: null, totalRevenue: { $sum: "$revenue" } } }
])
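To sanity-check the pipeline's result, the same computation can be done by hand in plain JavaScript over the four sample documents:

```javascript
// Recompute total revenue from the sample sales documents in plain JS,
// mirroring the $project ($multiply) and $group ($sum) stages.
const sales = [
  { _id: 1, item: "apple", qty: 5, price: 0.5 },
  { _id: 2, item: "banana", qty: 10, price: 0.25 },
  { _id: 3, item: "orange", qty: 15, price: 0.75 },
  { _id: 4, item: "peach", qty: 20, price: 1 },
];
const totalRevenue = sales
  .map(d => d.qty * d.price)       // $project: revenue = qty * price
  .reduce((sum, r) => sum + r, 0); // $group: totalRevenue = $sum of revenue
console.log(totalRevenue);         // 36.25
```

The aggregation above should return the same figure: 2.5 + 2.5 + 11.25 + 20 = 36.25.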

Concept Review Questions:

What are the stages involved in the aggregation pipeline?
What are some of the operators that can be used in the aggregation pipeline?

Section 3: MongoDB's sharding feature for horizontally scaling a database system to handle large volumes of data

Theoretical Overview:

Sharding in MongoDB is a feature that allows you to horizontally scale a database system to handle large volumes of data. It distributes data across multiple servers, or shards, and balances the load between them. Sharding is typically used when a single server is no longer capable of handling the amount of data or traffic in the system.
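The core idea can be sketched in plain JavaScript (a toy hash is used here purely for illustration; MongoDB's hashed sharding uses an MD5-based 64-bit hash and chunk ranges, not a simple modulo):

```javascript
// Sketch of the idea behind hashed sharding: a hash of the shard key
// decides which shard owns a document, spreading writes evenly.
const NUM_SHARDS = 2;

function toyHash(key) {
  // simple deterministic string hash, for illustration only
  let h = 0;
  for (const ch of String(key)) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function shardFor(key) {
  return toyHash(key) % NUM_SHARDS; // shard index 0 or 1
}

// Route 1000 documents by _id and count how many land on each shard
const counts = [0, 0];
for (let i = 1; i <= 1000; i++) counts[shardFor(i)]++;
console.log(counts); // roughly even split between the two shards
```

Hashing the key before routing is what keeps monotonically increasing keys (like sequential _id values) from piling onto a single shard.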

Practical Exercise:

Create a sample sharded cluster with two shards.

# Start two shard servers (each a single-node replica set)
mongod --shardsvr --replSet shard1 --port 27018 --dbpath /data/db1
mongod --shardsvr --replSet shard2 --port 27019 --dbpath /data/db2

# Start the mongos router (since MongoDB 3.4 the config servers must form a
# replica set, named here configReplSet)
mongos --configdb configReplSet/configserver:27017 --port 27017

# In the mongos shell, register the shards
sh.addShard("shard1/shard1:27018")
sh.addShard("shard2/shard2:27019")

Insert sample data into the sharded cluster.
use mydb
sh.enableSharding("mydb")
sh.shardCollection("mydb.sales", { "_id": "hashed" })
for (var i = 1; i <= 1000000; i++) {
  db.sales.insertOne({ _id: i, item: "item" + i, qty: Math.floor(Math.random() * 1000), price: Math.floor(Math.random() * 100) / 100 })
}

Concept Review Questions:

What is the difference between sharding and replication in MongoDB?
What are the key components of a sharded cluster in MongoDB?

Conclusion:
Congratulations! You have completed this lab workbook on MongoDB.
By using GridFS for storing and retrieving large files, aggregation for data analysis, and MongoDB's sharding feature for horizontally scaling a database system, you have learned about some of the most powerful features of MongoDB. We hope that these exercises have helped you to improve your MongoDB query skills.

Lab Workbook: Aggregation in MongoDB for Data Analysis


What is aggregation in MongoDB, and how is it used to analyze data?
Aggregation in MongoDB is a powerful tool that allows for the processing and analysis of large amounts of data.
In this lab workbook, we will explore what aggregation is, how it is used to analyze data, and provide practical exercises to help you master this important concept.

Section 1: Understanding Aggregation in MongoDB

What is aggregation in MongoDB?
Aggregation in MongoDB refers to the process of grouping together multiple documents from one or more collections and performing operations on the grouped data to return a single result.
Aggregation operations can be used to analyze data changes over time and to extract meaningful insights from large datasets. [1]
What are the stages involved in the aggregation pipeline?
The aggregation pipeline consists of a series of stages that are executed in sequence.
Each stage takes the output of the previous stage and performs a specific operation on the data.
The stages include: $match, $project, $group, $sort, $limit, and $skip. [2]
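These stages can be modeled in plain JavaScript with array methods (an analogy only, not MongoDB's implementation), which also shows why stage order matters: filtering first means later stages touch fewer documents.

```javascript
// Toy model of $match -> $sort -> $limit over an array of documents.
const orders = [
  { product: "apple", quantity: 10 },
  { product: "banana", quantity: 20 },
  { product: "orange", quantity: 15 },
  { product: "pear", quantity: 12 },
];

const top2 = orders
  .filter(o => o.quantity >= 12)           // $match: quantity >= 12
  .sort((a, b) => b.quantity - a.quantity) // $sort: quantity descending
  .slice(0, 2);                            // $limit: 2

console.log(top2.map(o => o.product)); // [ 'banana', 'orange' ]
```

Running $sort before $match would produce the same answer here but would sort documents that $match is about to discard, which is why MongoDB's optimizer tries to move $match as early as possible.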

Section 2: Practical Exercises

In this exercise, we will create a sample database and use aggregation to analyze data. Follow the instructions in the provided file "Aggregation_Exercise.md".
In this exercise, we will use aggregation to analyze a dataset of customer orders. Follow the instructions in the provided file "Customer_Orders.md".

Concept Review Questions:

What are some of the operators that can be used in the aggregation pipeline?
How does the $group stage work in the aggregation pipeline?
Learning Outcomes:
Understand aggregation in MongoDB and the stages involved in the aggregation pipeline.
Use practical exercises to analyze data and gain a deeper understanding of how MongoDB can be used to extract insights from large datasets.



Section 1: Introduction


What is aggregation in MongoDB and how is it used to analyze data? [1]
Aggregation in MongoDB is a process of extracting data from multiple documents and performing transformations on the data to produce a result. The aggregation framework provides a set of operators to group, filter, sort, and perform mathematical computations on the data. This makes it easier to analyze data and extract insights from large datasets.
What are some of the operators that can be used in the aggregation pipeline? [2]
Some of the operators that can be used in the aggregation pipeline include $match, $group, $project, $sort, $skip, and $limit.


Section 2: Practical Exercises

Exercise 1: Creating a sample database and using aggregation to analyze data

a. Create a sample database "mydb" with a collection "orders" using the following command:


use mydb
db.createCollection("orders")

b. Insert sample data into the "orders" collection using the following command:

db.orders.insertMany([
  { "_id": 1, "product": "apple", "price": 0.5, "quantity": 10 },
  { "_id": 2, "product": "banana", "price": 0.25, "quantity": 20 },
  { "_id": 3, "product": "orange", "price": 0.3, "quantity": 15 },
  { "_id": 4, "product": "pear", "price": 0.4, "quantity": 12 },
  { "_id": 5, "product": "kiwi", "price": 0.6, "quantity": 8 }
])

c. Use aggregation pipeline to analyze data and find the total revenue generated from each product using the following command:


db.orders.aggregate([
  { $project: { product: 1, revenue: { $multiply: ["$price", "$quantity"] } } },
  { $group: { _id: "$product", total_revenue: { $sum: "$revenue" } } }
])
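The same per-product grouping can be reproduced in plain JavaScript as a check (note the tiny floating-point error on some products, which MongoDB's double arithmetic shares):

```javascript
// Recompute per-product revenue from the sample orders in plain JS,
// mirroring the $project ($multiply) and $group ($sum) stages.
const orders = [
  { _id: 1, product: "apple", price: 0.5, quantity: 10 },
  { _id: 2, product: "banana", price: 0.25, quantity: 20 },
  { _id: 3, product: "orange", price: 0.3, quantity: 15 },
  { _id: 4, product: "pear", price: 0.4, quantity: 12 },
  { _id: 5, product: "kiwi", price: 0.6, quantity: 8 },
];

const totals = {};
for (const o of orders) {
  const revenue = o.price * o.quantity;                   // $project: $multiply
  totals[o.product] = (totals[o.product] || 0) + revenue; // $group:   $sum
}
console.log(totals);
// apple: 5, banana: 5, orange: 4.5, pear: ~4.8, kiwi: ~4.8
```

Expect the pipeline to return one document per product with these total_revenue values.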


Exercise 2: Using aggregation to analyze a dataset of customer orders