How to become a Data Engineer?
Watch the video below
Must go through (free resources):
Azure Data Factory
- Azure Data Factory by WafaStudies [Mandatory]
- Azure Data Factory by Adam Marczak [Mandatory]
Azure Synapse
- Azure Synapse Analytics by WafaStudies [Optional]
Azure Databricks
- Azure Databricks by WafaStudies [Mandatory]
- Azure Databricks by Adam Marczak [Mandatory]
PySpark
- PySpark wiki [Mandatory]
- PySpark Notes [Interview Questions]
SQL
- SQL by kudvenkat [Mandatory]
- SQL Practice HackerRank [Mandatory]
- SQL Practice Naukri [Mandatory]
- SQL Practice Datalemur [Mandatory]
Python
- Python by Corey Schafer [Mandatory]
Git
- Git Essentials by Corey Schafer [Preferred]
- Git Hindi by CodeWithHarry [Optional]
Azure Fundamentals
- Azure Data Fundamentals [Preferred]
Advanced Spark videos with slides (a must for interviews)
- Making Apache Spark Better with Delta Lake [Presentation slides here]
- Understanding Query Plans and Spark UIs - Xiao Li Databricks [Presentation slides here]
- Optimizing Delta Parquet Data Lakes for Apache Spark - Matthew Powers [Presentation slides here]
- Everyday I'm Shuffling - Tips for Writing Better Apache Spark Programs [Presentation slides here]
- Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha [Presentation slides here]
- Apache Spark Core—Deep Dive—Proper Optimization Daniel Tomes Databricks [Presentation slides here]
- The Parquet Format and Performance Optimization Opportunities Boudewijn Braams [Presentation slides here]
- Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark [Presentation slides here]
- Spark Architecture, Alexey Grishchenko [Presentation slides here]
- Deeper Understanding of Spark Internals - Aaron Davidson [Presentation slides here]
- Advanced Apache Spark Training - Sameer Farooqui [Presentation slides here]
- Top 5 Mistakes When Writing Spark Applications [Presentation slides here]
- Spark SQL: A Compiler from Queries to RDDs with Sameer Agarwal [Presentation slides here]
- Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenchen Fan [Presentation slides here]
- A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai [Presentation slides here]
- Spark + Parquet In Depth: Spark Summit East talk by: Emily Curtin and Robbie Strickland [Presentation slides here]
- Change Data Feed in Delta [Presentation slides here]
- Deep Dive into Delta Lake [Presentation slides here]
- Diving into Delta Lake: Unpacking the Transaction Log [Presentation slides here]
- Delta Lake 2.0 Overview [Presentation slides here]
- Accelerating Data Ingestion with Databricks Autoloader Simon [Presentation slides here]
- Tuning and Debugging in Apache Spark Patrick Wendell [Presentation slides here]
- Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia [Presentation slides here]
- Understanding the Performance of Spark Applications - Patrick Wendell [Presentation slides here]
- SQL, DataFrames, Datasets And Streaming - by Michael Armbrust [Presentation slides here]
- Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das [Presentation slides here]
- Designing ETL Pipelines with Structured Streaming and Delta Lake How to Architect Things Right [Presentation slides here]
- Deep Dive: Apache Spark Memory Management [Presentation slides here]
Fast-track Interview
WIP
Optional resources for Mechanical Engineers:
- Azure IoT Hub Part 1
- Azure IoT Hub Part 2
- Modern Industrial IoT Analytics on Azure - Part 1
- Modern Industrial IoT Analytics on Azure - Part 2
- Modern Industrial IoT Analytics on Azure - Part 3
- Azure Stream Analytics
- Azure Stream Analytics by Adam
- Azure Stream Analytics by Pragmatics
Certifications:
- Exam DP-203: Data Engineering on Microsoft Azure [Preferred] [Questions]
- Databricks Certified Data Engineer Associate [Preferred] [Questions]
- Databricks Certified Data Engineer Professional [Preferred] [Questions]
- Exam DP-900: Microsoft Azure Data Fundamentals [Optional]
Advanced Reads:
- Delta Lake Essentials [Preferred]
- Delta Lake blogs [Preferred]
Free Cloud Resources:
- Azure Free Account [Use a new credit card]
- Databricks Community [choose - “Get started with Community Edition”]
Data Engineering Burger
Road Map to Data Engineer
Data Warehouse vs Lake vs Mesh
Data Warehouse vs Lake vs Lakehouse vs Mesh
Cloud Platform Models
ETL vs ELT vs reverse ETL
Star vs Snowflake Schema
Medallion Architecture
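The medallion layers (bronze for raw data, silver for cleaned data, gold for business-level aggregates) can be illustrated with a minimal sketch. This is plain Python on made-up sample records, not Delta tables or Spark; the function names `to_silver` and `to_gold` and the cleaning rules are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (hypothetical sample data),
# including a duplicate and a row with an unparseable amount.
bronze = [
    {"order_id": "1", "country": "IN", "amount": "100.5"},
    {"order_id": "1", "country": "IN", "amount": "100.5"},  # duplicate
    {"order_id": "2", "country": "US", "amount": "abc"},    # invalid amount
    {"order_id": "3", "country": "IN", "amount": "50"},
    {"order_id": "4", "country": "US", "amount": "75.25"},
]


def to_silver(rows):
    """Silver: drop invalid rows, remove duplicates, cast to proper types."""
    seen = set()
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicate order ids
        seen.add(row["order_id"])
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"],
                "amount": amount,
            }
        )
    return clean


def to_gold(rows):
    """Gold: aggregate cleaned rows into a reporting-ready summary."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real lakehouse each layer would typically be persisted as its own table, so that the raw bronze data stays available for reprocessing.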

Database Types
Database Indexing

SQL Execution Order
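A small sketch of the logical order in which a SELECT statement is evaluated, using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration, and the numbered comments show the logical order only; a real query optimizer may physically execute the steps differently.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", 50),
        ("alice", 200),
        ("bob", 300),
        ("bob", 20),
        ("carol", 500),
    ],
)

# The clauses are written in one order but logically evaluated in another.
query = """
SELECT customer, SUM(amount) AS total   -- 5. SELECT: compute output columns
FROM orders                             -- 1. FROM: pick the source rows
WHERE amount >= 50                      -- 2. WHERE: filter individual rows
GROUP BY customer                       -- 3. GROUP BY: form groups
HAVING SUM(amount) > 250                -- 4. HAVING: filter whole groups
ORDER BY total DESC                     -- 6. ORDER BY: sort the result
LIMIT 2                                 -- 7. LIMIT: keep the first rows
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('carol', 500), ('bob', 300)]
```

Bob's total is 300 rather than 320 because WHERE removes his 20 order before grouping, and alice (total 250) is dropped by HAVING only after the groups have been aggregated.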

HTTP Status Codes

Containerization

