Data Engineering and ML Track

DE and ML Lessons

Topics

Sub-Topic

Lesson

Status

Notes

Introduction to Scalable Data Pipelines

What is a data pipeline?

Definition and use cases of data pipelines

Open

Importance in data-driven decision making

Open

Real-time vs batch processing

Open

Key components of a scalable data pipeline:

Data sources, data ingestion, data transformation, and data storage

Open

Tools Overview: Apache Kafka, Apache NiFi, Apache Airflow

Open

Use Case

Use Apache Airflow to build and schedule a basic ETL pipeline that reads data from a file and writes to a database.

Open
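As a rough sketch of the file-to-database use case above, the three ETL steps can be written as plain Python functions. This is not an Airflow DAG: in Airflow each function would typically be wrapped as a task and scheduled, which is left out here. The sample data, the `payments` table, and the use of an in-memory SQLite database are all assumptions made so the example runs without any external services.

```python
import csv
import io
import sqlite3

# Hypothetical input file contents; a real pipeline would open a file on disk.
SAMPLE_CSV = """name,amount
alice,10.5
bob,
carol,7
"""


def extract(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def transform(rows):
    """Drop rows with a missing amount and normalize the remaining fields."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip records with no amount
        cleaned.append((row["name"].strip().title(), float(row["amount"])))
    return cleaned


def load(records, conn):
    """Write the cleaned records into a (hypothetical) payments table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


# Run the three steps in order against an in-memory database.
conn = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(SAMPLE_CSV))), conn)
rows = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

Keeping extract, transform, and load as separate functions makes it straightforward to map each one to its own scheduled task later.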

Data Integration and ETL Processes

ETL Overview

What is ETL (Extract, Transform, Load)

Open

The role of ETL in data pipelines

Open

Types of ETL (Batch vs. Real-Time)

Open

Practical

Build an ETL pipeline using Apache Spark that extracts data from CSV, transforms it (e.g., cleaning data), and loads it into a MySQL database.

Open

Data Transformation and Cleaning

Practical

Use Python and Pandas to clean and transform a messy dataset (e.g., removing NaNs, converting data types, and scaling numeric columns).

Open

Perform data transformations (e.g., filtering, joining) using PySpark for larger datasets.

Open

Data Transformation Techniques

Normalization, aggregation, filtering, and reshaping data

Open

Data Cleaning

Handling missing data, removing duplicates, identifying outliers

Open

Tools Overview

Pandas, NumPy, PySpark

Open
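The cleaning practical above (removing NaNs, converting data types, scaling numeric columns) might look something like the following with Pandas. The column names and sample values are made up for illustration, and the choices of median imputation and min-max scaling are assumptions rather than part of the lesson plan.

```python
import pandas as pd

# Hypothetical messy dataset: string-typed numbers, a duplicate row,
# a non-numeric age, and a missing spend value.
raw = pd.DataFrame({
    "customer_id": ["1", "2", "2", "3", "4"],
    "age": ["34", "n/a", "n/a", "51", "29"],
    "spend": [120.0, 80.0, 80.0, None, 200.0],
})


def clean(df):
    """Return a cleaned copy of df with a min-max scaled spend column."""
    out = df.drop_duplicates().copy()
    # Convert data types; unparseable ages become NaN.
    out["customer_id"] = out["customer_id"].astype(int)
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Remove rows where age could not be parsed.
    out = out.dropna(subset=["age"])
    # Fill missing spend with the median of the remaining rows.
    out["spend"] = out["spend"].fillna(out["spend"].median())
    # Min-max scale spend to [0, 1]; assumes spend is not constant.
    low, high = out["spend"].min(), out["spend"].max()
    out["spend_scaled"] = (out["spend"] - low) / (high - low)
    return out.reset_index(drop=True)


cleaned = clean(raw)
```

The same steps (deduplicate, cast, drop or fill missing values, scale) carry over to PySpark for larger datasets, though the API calls differ.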

Database Schemas and Design

Practical

Design a normalized database schema in MySQL or PostgreSQL for an e-commerce application (products, orders, customers).

Open

Set up a NoSQL database (MongoDB) and design a flexible schema for a product catalog.

Open

Tools Overview

MySQL, PostgreSQL, MongoDB, DBML for schema modeling

Open

Database Design

Relational vs NoSQL Databases

Open

Key concepts: normalization, primary/foreign keys, indexing

Open

Database Schema Creation:

Designing efficient schemas for scalable systems

Open
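One possible normalized schema for the e-commerce practical (products, orders, customers) is sketched below. It uses Python's built-in `sqlite3` module as a stand-in so it runs without a server; the lesson itself targets MySQL or PostgreSQL, where column types and syntax differ slightly. All table and column names are illustrative assumptions.

```python
import sqlite3

# Hypothetical schema: order_items resolves the many-to-many relationship
# between orders and products, so product data is not repeated per order.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    price_cents INTEGER NOT NULL CHECK (price_cents >= 0)
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    ordered_at  TEXT NOT NULL
);
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    quantity   INTEGER NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)
);
-- Index the foreign key used to look up a customer's orders.
CREATE INDEX idx_orders_customer ON orders(customer_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)

# Sample rows to exercise the relationships.
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO products VALUES (1, 'Mug', 1200)")
conn.execute("INSERT INTO orders VALUES (1, 1, '2024-01-01')")
conn.execute("INSERT INTO order_items VALUES (1, 1, 2)")

# Order total is derived by joining, not stored redundantly.
total = conn.execute(
    """
    SELECT SUM(p.price_cents * oi.quantity)
    FROM order_items AS oi
    JOIN products AS p ON p.product_id = oi.product_id
    WHERE oi.order_id = 1
    """
).fetchone()[0]
```

Storing prices as integer cents is a design choice made here to avoid floating-point rounding; a `DECIMAL` column would be the usual alternative in MySQL or PostgreSQL.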

Tableau

Working with Tableau

Basic Charts: Bar, Line, Scatter

Open

Filters & Sorting

Open

Interactive Dashboards

Open

Calculated Fields & Parameters

Open

Best Practices

Open

Introduction to Machine Learning

Machine Learning Basics

What is machine learning? (Supervised vs Unsupervised learning)

Open

Common ML algorithms: Linear Regression, Decision Trees, K-Means Clustering

Open

Tools Overview: Scikit-learn, TensorFlow, Keras

Open

Practicals

Use Scikit-learn to train a simple linear regression model for predicting housing prices based on features like square footage, location, etc.

Open

Evaluate model performance using metrics suited to the task (e.g., MAE, RMSE, and R² for regression; accuracy, precision, recall, and F1-score for classification).

Open
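To make the classification metrics named above concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1-score from scratch for a binary problem. The labels are invented for illustration. Scikit-learn offers equivalent functions in `sklearn.metrics`; plain Python is used here only to show the arithmetic. Note that these metrics apply to classifiers; a regression model such as the housing price example is normally scored with error measures instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With these labels there are 3 true positives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.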

Feature Engineering and Selection

Practical

Use Python and Scikit-learn to engineer new features from raw data (e.g., extracting date parts, creating interaction terms).

Open

Apply recursive feature elimination (RFE) to select important features for an ML model.

Open

Feature Engineering

Creating new features from existing data (e.g., log transformations, polynomial features)

Open

One-hot encoding, binning, and feature scaling (standardization, normalization)

Open

Feature Selection

Methods for selecting important features (e.g., recursive feature elimination)

Open
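The feature engineering techniques listed above (date parts, log transformations, one-hot encoding, standardization) can be illustrated with a few small helpers. This sketch uses only the standard library so the arithmetic is visible; in practice Pandas and Scikit-learn provide these transformations, and recursive feature elimination is not shown. The function names and sample values are hypothetical.

```python
import math
from datetime import date
from statistics import mean, pstdev


def date_parts(d):
    """Split a date into separate numeric features."""
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}


def log_transform(values):
    """Apply log(1 + x) to compress right-skewed, non-negative values."""
    return [math.log1p(v) for v in values]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1 if value == c else 0 for c in categories]


def standardize(values):
    """Rescale to zero mean and unit variance (assumes values are not constant)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw inputs.
parts = date_parts(date(2024, 3, 15))
encoded = one_hot("red", ["blue", "red", "green"])
scaled = standardize([2.0, 4.0, 6.0])
logged = log_transform([0.0, 9.0])
```

Here `weekday()` counts from Monday as 0, so a Friday yields 4.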

Advanced Machine Learning Algorithms

Advanced Algorithms

Decision Trees, Random Forests, Support Vector Machines (SVM)

Open

Unsupervised Learning: K-Means Clustering, Hierarchical Clustering

Open

Model Deployment

Practicals

Create a Flask REST API to expose your trained ML model as a web service.

Open

Create a Docker image for the Flask application and run it in a container.

Open

Introduction to Model Deployment:

Deploying ML models using APIs (Flask)

Open

Dockerizing applications for portability

Open

Tools Overview: Flask, Docker

Open

Introduction to MLOps

Practicals

Use Jenkins to automate the process of training and deploying an ML model.

Open

MLOps Concepts

Introduction to MLOps

Open

Continuous Integration and Continuous Deployment (CI/CD) for machine learning

Open

Model Monitoring, Retraining, and Versioning

Open
