Edited By Siddhartha Reddy Jonnalagadda, PhD
Written By Hundreds of Parents
Introduction
Welcome back! This book is the next step in your coding journey. Having built a strong foundation with Python, you’re now ready to apply those skills to the world of data engineering. It’s a field for people who love systems, thrive on finding patterns, and enjoy turning messy information into a clear, usable resource.
The first book was about the parts of a car—the engine (variables), the steering wheel (operators), and the gas pedal (functions). This book is about building the road itself and creating a seamless, efficient transportation system. We’ll learn how to get data from a starting point to an ending point, making it cleaner and more useful along the way.
The world of data is changing quickly, and this book reflects that. We’ll explore new ideas like data lakes and the use of AI to help us solve problems. We’ll keep the same neurodiversity-friendly approach, using metaphors, clear visualizations, and hands-on practice. There is no single "right" way to learn; the most important thing is that you feel comfortable and confident throughout the process.
Let’s begin.
The Problem:
So far, we’ve only worked with data that lives in our Python programs or in simple text files. But in the real world, data comes from many sources and in many different formats. It can be a massive collection of customer reviews, a live stream of weather updates, or a neatly organized sales report. How do we store all this information and make it useful? We need to understand the different kinds of places data lives.
Imagine you have a big collection of documents, photos, and handwritten notes. You wouldn’t just throw them all in one giant pile on the floor. You would organize them in different ways depending on what they are and how you plan to use them. In data engineering, we have similar, specialized places for our data.
Data Lakes vs. Data Warehouses
Think of your data as things you might put in a house. You have a kitchen where you can store raw, unwashed vegetables in a basket. You also have a pantry where you store neatly packaged and labeled cans and boxes. In data engineering, we have two similar concepts for storing data: a data lake and a data warehouse.
A Data Lake: Think of a data lake as that kitchen basket. It’s a place where you can store all of your raw data, regardless of its format. It can be images, videos, unstructured text, or messy, uncleaned files. The data is kept in its original form, which means you can always go back to the raw source. This is great for when you don’t know exactly what you’ll need the data for in the future. It’s like keeping all your ingredients on hand without having a specific recipe in mind.
A Data Warehouse: A data warehouse is like your pantry. It’s a highly structured and organized system. The data has been cleaned, transformed, and neatly arranged in tables so that it is easy to find and use for a specific purpose, like creating reports or running analysis. This is not for raw data; it’s for data that has been processed and is ready to be used. It’s perfect when you have a specific goal in mind, like figuring out how many cans of tomatoes you have.
The most important difference is this: a data warehouse stores data in a structured format, ready for a specific purpose. A data lake stores data in its raw, unstructured format, making it flexible for many future uses.
Visualizing the Difference:
graph TD
A[Raw Data] --> B(Data Lake)
B --> C{Process & Clean}
C --> D[Data Warehouse]
D --> E(Reports)
D --> F(Dashboards)
E --> G[Business Decisions]
F --> G
This simple diagram shows the flow. You can put raw data directly into a data lake. From the data lake, you can choose what to process and clean, and then move it into a data warehouse for specific, structured uses like reports.
Hands-on: Simulating a Data Lake and Data Warehouse
We don’t need to set up a real data lake or data warehouse to understand the concepts. We can use our local file system and a simple, built-in Python database to simulate them.
First, let’s create a simulated data lake. We’ll just create a folder and put some raw data in it. In Google Colab, you can do this easily.
In a code cell, run this to create a folder and some raw data files:
import os

# Create a folder to act as our data lake
if not os.path.exists("data_lake"):
    os.makedirs("data_lake")

# Create some raw, messy data files
with open("data_lake/sales_data_raw.csv", "w") as f:
    f.write("id,sale_amount,store\n")
    f.write("1,100,A\n")
    f.write("2,150,B\n")
    f.write("3,200,A\n")
    f.write("4,50,C\n")

with open("data_lake/web_traffic_raw.json", "w") as f:
    f.write('{"user_id": 1, "page": "/home"}\n')
    f.write('{"user_id": 2, "page": "/products"}\n')
Now, we have a folder named data_lake with a raw CSV file and a raw JSON file. It’s unorganized and contains different types of data, just like a real data lake.
Next, we’ll learn how to take a piece of this raw data, process it, and load it into a structured data warehouse.
Building a Simple Data Warehouse
The raw data in our data_lake is not very useful for analysis in its current state. For example, to find out the total sales, we would have to write a custom Python script that reads the file, extracts the numbers, and adds them up. What if we want to run that report every day? We’d have to run the script every time.
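To see what we would be signing up for, here is roughly what that one-off script might look like (a quick sketch that reads the raw CSV from our data_lake and adds up the amounts by hand):

# A one-off script we'd have to re-run (and maintain) for every report
total = 0.0
with open("data_lake/sales_data_raw.csv") as f:
    next(f)  # skip the header row
    for line in f:
        parts = line.strip().split(",")
        total += float(parts[1])  # sale_amount is the second column
print(f"Total sales (ad-hoc script): {total}")

It works, but every new question means another script like this one.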
A data warehouse solves this problem by providing a structured home for our data. For our simple example, we’ll use SQLite, which is a lightweight, serverless database that is perfect for learning. It is built into Python, so you don’t need to install anything. Think of SQLite as a single, neat filing cabinet that holds all your important documents.
The core of a data warehouse is the table, which is a structured grid of data with rows and columns, just like a spreadsheet. Each column has a specific data type, such as a number or a text string. This structure is what makes a data warehouse so powerful for quick analysis.
Let’s create a database and a table for our sales data.
First, we need to import the sqlite3 library. Then, we can create a connection to a database file. If the file doesn’t exist, Python will create it for you. We’ll name our database analytics_warehouse.db.
import sqlite3
# Connect to the database. This will create the file if it doesn't exist.
conn = sqlite3.connect('analytics_warehouse.db')
# Create a cursor object to execute SQL commands
cursor = conn.cursor()
Next, we’ll create a table named sales in our database. We’ll define the columns and their data types.
id: an integer that automatically increases.
sale_amount: a number with a decimal point.
store: text that represents the store name.
Here is the SQL command to create the table. We’ll use a try-except block, so we don’t get an error if the table already exists.
try:
    cursor.execute('''
        CREATE TABLE sales (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            sale_amount REAL,
            store TEXT
        )
    ''')
    print("Table 'sales' created successfully!")
except sqlite3.OperationalError as e:
    print(f"Table 'sales' already exists. Error: {e}")
The table is now created. Let’s insert the raw data from our data_lake/sales_data_raw.csv file into our new database table. We’ll need to read the file, process each line, and then insert it into our database. This is a simple Extract, Transform, and Load (ETL) process.
# Insert data from the raw CSV file into our new database table
with open("data_lake/sales_data_raw.csv", "r") as f:
    # Skip the header row
    next(f)
    # Process each line and insert it into the database
    for line in f:
        # Split the line by comma and clean up any whitespace
        parts = [part.strip() for part in line.split(',')]
        # Check if the parts are valid before trying to insert
        if len(parts) == 3:
            try:
                # Cast the sale amount to a number (REAL)
                sale_amount = float(parts[1])
                store = parts[2]
                # Use a parameterized query to prevent SQL injection and errors
                cursor.execute("INSERT INTO sales (sale_amount, store) VALUES (?, ?)", (sale_amount, store))
            except ValueError:
                print(f"Skipping invalid data: {line.strip()}")

# Commit the changes to the database
conn.commit()
print("Data loaded into 'sales' table.")
What happened here?
We opened our raw CSV file.
We looped through each line, skipping the header.
We extracted the values and transformed the sale_amount into a number.
We loaded the clean data into our sales table using a SQL command.
Now, our data is structured and ready for easy analysis. We can run a single SQL command to find the total sales, which is much easier than writing a new Python script every time.
# A simple SQL query to find the total sales
cursor.execute("SELECT SUM(sale_amount) FROM sales")
total_sales = cursor.fetchone()[0]
print(f"The total sales amount is: {total_sales}")
The data is now in its proper place—the data warehouse—ready for a specific task. We’ve gone from a pile of raw data in our data lake to a clean, structured resource.
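And because the data now lives in a structured table, follow-up questions are a single query away. For example, here is a quick sketch (run while conn and cursor are still open) that totals sales per store using GROUP BY:

# Total sales broken down by store
cursor.execute("SELECT store, SUM(sale_amount) FROM sales GROUP BY store")
for store, store_total in cursor.fetchall():
    print(f"Store {store}: {store_total}")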
In the next chapter, we’ll dive deeper into the different ways we can talk to our data using SQL and the command line, which are essential tools for any data engineer.
The Problem:
In the last chapter, we saw how SQL is the language for interacting with a database. But as a data engineer, you’ll need to do more than just write simple queries. You’ll need to work with files, manage directories, and execute scripts from a powerful text-based interface. This is where the command line comes in.
Think of the command line as a direct line of communication with your computer’s operating system. Instead of clicking on icons and menus, you type in commands to tell the computer what to do. It can seem intimidating at first, but it is one of the most powerful and efficient tools in a programmer’s toolkit. Learning a few basic commands will give you full control over your files and scripts.
Revisiting SQL
We’ve already used a simple INSERT and SELECT statement in the last chapter. Let’s review the main actions you can perform with SQL.
SELECT: The "show me" command. It retrieves data from one or more tables. This is how you ask questions about your data.
INSERT: The "add this" command. It adds new rows of data to a table.
UPDATE: The "change this" command. It modifies existing data in a table.
DELETE: The "remove this" command. It removes rows from a table.
Together, these are often referred to as CRUD operations: Create, Read, Update, and Delete. Mastering these allows you to manage data in a database.
Let’s try a few more SQL statements with our analytics_warehouse.db database.
First, let’s connect to our database again and get a cursor.
import sqlite3
conn = sqlite3.connect('analytics_warehouse.db')
cursor = conn.cursor()
Now, let’s use the SELECT command to get all the data from our sales table.
# Select all rows from the sales table
cursor.execute("SELECT * FROM sales")
# Fetch all the results and print them
rows = cursor.fetchall()
for row in rows:
    print(row)
The output should be the data we inserted in the last chapter.
Now let’s UPDATE a record. What if we made a mistake and the sale amount for ID 1 should have been 105? We can fix it with a single command.
# Update the sale_amount for the record with id = 1
cursor.execute("UPDATE sales SET sale_amount = 105 WHERE id = 1")
conn.commit()
# Verify the change
cursor.execute("SELECT * FROM sales WHERE id = 1")
updated_row = cursor.fetchone()
print(f"Updated record: {updated_row}")
Finally, let’s DELETE a record. We can remove the record for ID 4.
# Delete the record with id = 4
cursor.execute("DELETE FROM sales WHERE id = 4")
conn.commit()
# Verify the deletion
cursor.execute("SELECT * FROM sales")
rows = cursor.fetchall()
print("Records after deletion:")
for row in rows:
    print(row)
You can now see how powerful SQL is for directly manipulating the data in your data warehouse.
Introducing the Command Line
While SQL is great for databases, the command line is for everything else. It’s a text-based interface where you type commands to interact with your computer. This might feel like a step back from graphical user interfaces (GUIs) with their icons and windows, but it’s much faster and more powerful for a data engineer’s tasks. It allows for automation and scripting that a GUI can’t easily do.
In Google Colab, you can use the command line by starting a line in a code cell with a !. This tells Colab to run that command directly on the underlying operating system.
Let’s learn a few basic commands.
ls (list): This command lists all the files and folders in your current directory.
!ls
You should see data_lake and analytics_warehouse.db in the output, which are the files and folders we created.
cd (change directory): This command allows you to move into a different folder.
# Move into the data_lake folder
!cd data_lake
!ls
Note: When you run !cd in Colab, each ! command runs in its own temporary shell, so the directory change doesn’t carry over to the next command or cell. A better way is to combine cd with another command on the same line, as shown below (or use the %cd magic, which does persist).
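For example, chaining cd with ls keeps both commands in the same shell (a small sketch; the && runs the second command only if the first one succeeds):

!cd data_lake && ls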
cat (concatenate and display): This command is used to display the contents of a file directly in your terminal. It’s useful for quickly checking the contents of a small text file.
!cat data_lake/sales_data_raw.csv
mkdir (make directory): This command creates a new folder.
!mkdir new_project_folder
!ls
You can see that a new folder named new_project_folder has been created. The command line is an essential tool for navigating, managing, and interacting with files and folders, which are the bedrock of data engineering.
In the next chapter, we’ll combine these skills to build a complete ETL data pipeline from start to finish.
The Problem:
In the first part of this book, we learned about the places where data lives (data lakes and warehouses) and the languages we use to talk to them (Python and SQL). Now, we need to create a system that moves data between these places and prepares it for use. This is a data pipeline, and the most common type is an ETL pipeline.
ETL stands for Extract, Transform, Load. Think of it as a three-step recipe for making data useful:
Extract: This is the first step, where you get the data from its source. The source can be a file, a database, a website, or a simple text document. It’s like gathering all your raw ingredients before you start cooking.
Transform: This is the most crucial step. It’s where you clean, filter, and shape the data so it’s ready for its final destination. This could involve removing bad data, converting data types, or combining data from multiple sources. This is like washing and chopping your vegetables before you cook them.
Load: This is the final step, where you put the cleaned, prepared data into its final home, such as a data warehouse or a new file. This is like putting the finished meal on a plate, ready to be served.
This method of processing data in large chunks is called batch processing. It’s useful when you have a lot of data that arrives at a specific time, like a daily sales report or a monthly inventory count.
Hands-on: Building a Simple ETL Script
In this project, we’ll build a simple ETL script that extracts data from a simulated web API (a JSON file), transforms it by cleaning and filtering, and then loads it into our analytics_warehouse.db database.
First, let’s create a simulated API file in our data_lake folder. This data is messy and contains information we don’t need.
In a code cell, run this to create the file:
import os
import json
if not os.path.exists("data_lake"):
    os.makedirs("data_lake")

# Create a simulated raw JSON file with some messy data
raw_data = [
    {"user_id": 1, "page": "/home", "timestamp": "2025-07-01T08:00:00Z", "extra_info": "irrelevant"},
    {"user_id": 2, "page": "/products", "timestamp": "2025-07-01T08:01:00Z", "extra_info": "irrelevant"},
    {"user_id": 3, "page": None, "timestamp": "2025-07-01T08:02:00Z", "extra_info": "irrelevant"},
    {"user_id": 4, "page": "/cart", "timestamp": "2025-07-01T08:03:00Z", "extra_info": "irrelevant"}
]

with open("data_lake/web_traffic_raw.json", "w") as f:
    json.dump(raw_data, f, indent=2)
Now, let’s build our ETL script. We’ll extract the data, filter out any records where the page is None, and then load the clean data into a new table in our database.
import sqlite3
import json
# E: Extract - Get the data from the source (the JSON file)
def extract_data(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

# T: Transform - Clean and filter the data
def transform_data(raw_data):
    clean_data = []
    for record in raw_data:
        # Check if the 'page' key exists and is not None
        if record.get("page"):
            # Create a new dictionary with only the useful information
            clean_record = {
                "user_id": record["user_id"],
                "page": record["page"]
            }
            clean_data.append(clean_record)
    return clean_data

# L: Load - Put the clean data into the data warehouse
def load_data(data, db_name):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    # Create a new table for web traffic
    try:
        cursor.execute('''
            CREATE TABLE web_traffic (
                user_id INTEGER,
                page TEXT
            )
        ''')
        print("Table 'web_traffic' created successfully!")
    except sqlite3.OperationalError:
        print("Table 'web_traffic' already exists. Skipping creation.")
    # Insert the clean data into the table
    for record in data:
        cursor.execute("INSERT INTO web_traffic (user_id, page) VALUES (?, ?)",
                       (record["user_id"], record["page"]))
    conn.commit()
    conn.close()
    print("Data loaded into 'web_traffic' table.")

# Run the full ETL pipeline
raw_web_traffic = extract_data('data_lake/web_traffic_raw.json')
clean_web_traffic = transform_data(raw_web_traffic)
load_data(clean_web_traffic, 'analytics_warehouse.db')

# Verify the results
conn = sqlite3.connect('analytics_warehouse.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM web_traffic")
print("\nContents of the web_traffic table:")
for row in cursor.fetchall():
    print(row)
conn.close()
What happened here?
Extract: The extract_data function read the JSON file from our data lake and gave us the raw data as a list of dictionaries.
Transform: The transform_data function took that raw data. It then looped through each item, checked if the page was valid, and created a new list of dictionaries with only the information we needed.
Load: The load_data function took the clean list and loaded it into a new web_traffic table in our database.
You’ve now built your first complete ETL data pipeline! This is a core task of a data engineer—automating the flow of data from its source to a usable destination.
In the next chapter, we’ll introduce a new challenge: what if the data doesn’t arrive in big, scheduled batches, but comes in a continuous stream? We’ll learn how to build pipelines that handle real-time data.
The Problem:
Our ETL script in the last chapter was perfect for batch processing, where we collect data over a period of time and process it all at once. This is great for things like a daily sales report. But what if we need to process data as it’s created, in real-time? Think of a stock market ticker, a social media feed, or a live sensor reading. In these cases, we need a streaming data pipeline.
Streaming vs. Batch:
The difference between these two types of data processing is like the difference between getting a physical letter in the mail once a day versus getting a text message instantly.
Batch Processing: You wait for all the data to arrive before you start working on it. It’s efficient for large volumes of historical data and is usually scheduled to run at specific intervals.
Streaming Processing: You work on the data as soon as it arrives, one piece at a time. This is essential for applications that need up-to-the-second information to make decisions.
In a streaming pipeline, we think in terms of producers and consumers.
A producer is an application or device that sends data into a stream.
A consumer is an application that reads data from the stream and processes it.
Think of a social media app. Every time someone posts, that post is a data point from a producer. Your application, which displays a real-time feed, is a consumer that reads each post as it happens.
Visualizing the difference:
graph TD
subgraph "Batch Processing"
A(Source) --> B[Collect Data Over Time] --> C[Run ETL] --> D(Destination)
end
subgraph "Streaming Processing"
E(Producer) --> F[Continuous Stream] --> G[Process as it Arrives] --> H(Consumer)
end
Hands-on: Simulating a Streaming Data Pipeline
We can simulate a streaming data source with a simple Python script. We’ll have a producer function that generates new data points (like sensor readings) and a consumer function that processes them as they are created.
First, let’s create a "producer" function that sends a data point every few seconds. We’ll use the time library to create a delay.
import time
import random
import json
def produce_sensor_data():
    """Simulates a sensor generating data points."""
    while True:
        # Generate a random temperature and a timestamp
        temperature = round(random.uniform(20.0, 30.0), 2)
        timestamp = time.time()
        data_point = {
            "sensor_id": "sensor_A",
            "temperature": temperature,
            "timestamp": timestamp
        }
        # Print the data point as if it's being sent to a stream
        print(f"PRODUCER: Sending data point: {json.dumps(data_point)}")
        # Pause for 3 seconds before sending the next data point
        time.sleep(3)
Now, let’s create a "consumer" function that processes this data as it arrives. For this simple example, we’ll just read from a stream that we simulate with a Python list. In a real-world scenario, the data would be coming from a service like Apache Kafka or AWS Kinesis.
def consume_and_process_data():
    """Simulates a consumer processing data from a stream."""
    print("CONSUMER: Starting to listen for data...")
    # This loop simulates listening to a continuous stream
    for i in range(3):  # We will only run this for 3 data points for the example
        # In a real pipeline, we would read from a data stream here.
        # For our simulation, we 'get' a data point by generating one,
        # the same shape of data point produce_sensor_data would send.
        simulated_data = {
            "sensor_id": "sensor_B",
            "temperature": round(random.uniform(20.0, 30.0), 2),
            "timestamp": time.time()
        }
        # Process the data as it arrives
        temperature = simulated_data["temperature"]
        if temperature > 25.0:
            print(f"CONSUMER: Alert! High temperature detected: {temperature}°C")
        else:
            print(f"CONSUMER: Temperature is normal: {temperature}°C")
# Run the consumer to see it in action
consume_and_process_data()
What happened here?
The consume_and_process_data function acts as a consumer. It "listens" for data and processes each piece as it becomes available (we limited it to three data points, and we didn’t call produce_sensor_data directly because its while True loop would run forever in a notebook). This is the fundamental idea of streaming data: processing happens immediately, on a per-event basis, rather than on a pre-defined schedule.
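If you’d like to see the producer and consumer actually connected, here is a minimal sketch that uses a Python generator to stand in for the stream: a finite producer yields data points, and the consumer processes each one the moment it arrives.

import time
import random
import json

def produce_sensor_stream(num_points=3):
    """A finite producer: yields one data point at a time, like a stream."""
    for _ in range(num_points):
        data_point = {
            "sensor_id": "sensor_A",
            "temperature": round(random.uniform(20.0, 30.0), 2),
            "timestamp": time.time(),
        }
        print(f"PRODUCER: Sending data point: {json.dumps(data_point)}")
        yield data_point
        time.sleep(1)  # short pause between events

def consume_stream(stream):
    """Consumer: processes each data point the moment it arrives."""
    for data_point in stream:
        if data_point["temperature"] > 25.0:
            print(f"CONSUMER: Alert! High temperature: {data_point['temperature']}°C")
        else:
            print(f"CONSUMER: Temperature is normal: {data_point['temperature']}°C")

consume_stream(produce_sensor_stream())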
Streaming data pipelines are at the heart of many modern applications. They allow companies to make instant decisions, such as a ride-sharing app matching a rider with the nearest driver or a fraud detection system flagging a suspicious transaction in real time.
In the next chapter, we’ll explore serverless data processing, which makes it even easier to build and manage these pipelines without having to worry about the underlying infrastructure.
The Problem:
In our last few chapters, we’ve built data pipelines using Python code that runs on a computer. In a real-world setting, that means we have to manage a server—a powerful computer that is always on, ready to run our code. This involves a lot of work: setting it up, maintaining it, and ensuring it has enough power to handle our tasks. What if our pipeline only runs a few times a day? We would still have to pay for the server to be on all the time. Can someone else handle that for us?
Introducing Serverless Computing:
This is where serverless computing comes in. The name is a bit misleading—there are still servers involved, but you don’t have to manage them. Instead, you write your code and give it to a cloud provider (like Google Cloud, Amazon Web Services, or Microsoft Azure), who handles all the details of running it for you. You only pay for the time your code is actually running.
Think of a serverless function like a vending machine. When you want a soda, you don’t have to buy a refrigerator, fill it with drinks, and plug it in. You just walk up to the machine, pay for your drink, and it’s dispensed. With serverless computing, you write the code for a specific task (your "soda"), and the cloud provider handles the "machine." Your code runs instantly when it’s needed, and you don’t pay for the time it’s sitting idle.
This model is perfect for data engineering tasks that happen in response to events, such as:
Responding to a file upload: A serverless function can be triggered every time a new sales report is added to a data lake.
Processing a streaming data point: A function can process each new piece of data as it arrives in a data stream.
Running a scheduled task: A function can be set to run every night to clean up data.
The Magic Behind the Scenes:
The core of a serverless architecture is a function that is triggered by an event. The cloud provider manages the underlying server and automatically scales up and down to meet demand. If your function is called a million times, the provider will run a million instances of it in parallel without any extra work from you. When it’s not being used, it costs you nothing.
Hands-on: A Conceptual Serverless Exercise
Since we can’t set up a full serverless environment here, let’s use a simple conceptual example to understand how it works. We’ll imagine we have a process_data function that runs every time a new file is added to a folder.
# A simple function that represents our serverless task
def process_data(file_path):
    print(f"EVENT: A new file has been added to our data lake: {file_path}")
    # We would add our ETL logic here
    # For now, we'll just print a message
    print(f"FUNCTION: Processing file at {file_path}...")
    # A serverless function would then automatically return a result
    return "Processing complete!"
# Let's simulate a serverless event
# The 'file_added_event' is the trigger
file_added_event = "data_lake/new_sales_report.csv"
# This is how the serverless platform would call our function
result = process_data(file_added_event)
print(f"PLATFORM: The function returned: {result}")
In this simulation, the process_data function is our "serverless function." The file_added_event is the "trigger." The imaginary "serverless platform" calls our function and handles all the complexities of running it, giving us the result without us ever needing to worry about a server.
Serverless computing is a powerful and increasingly popular tool in data engineering. It allows you to focus on writing the code that solves your problem, not on the servers that run it.
In the next chapter, we’ll explore Data Mesh, a new way to think about how data teams are organized to scale with the needs of a growing business.
The Problem:
As companies grow and their data needs become more complex, a single, central data team can become a bottleneck. This team, responsible for building and maintaining all data pipelines, can become overwhelmed by the sheer volume and variety of requests from different departments. The team might not fully understand the specific needs of each department, leading to slow delivery and solutions that don’t quite fit the problem.
The Data Mesh Concept:
Data Mesh is an organizational and technical approach that addresses this problem by decentralizing data ownership. Instead of one central team, the responsibility for data is given to the business domains that create and use it. This makes the data teams smaller and more focused, allowing them to better understand their specific data and serve their users more efficiently.
Think of it this way: a traditional, centralized data team is like a single, massive restaurant kitchen that prepares all the food for a huge city. It’s hard for the chefs to know exactly what each person wants. A Data Mesh is like a city of small, specialized food trucks. Each truck is an expert at a specific type of food, and it can serve its customers much faster and more personally.
The key principles of Data Mesh are:
Domain Ownership: The teams that own and create the data are also responsible for making it available to others. This means a sales team is responsible for their sales data, and a marketing team is responsible for their marketing data.
Data as a Product: Data is treated like a finished product. It must be easy to find, trustworthy, and well-documented. Other teams should be able to use it without having to ask the original team for help (a small illustrative sketch follows this list).
Self-Serve Data Platform: A central platform team provides the tools and infrastructure that allow the domain teams to build their data products independently, without having to be data engineering experts.
Federated Governance: There are global rules and standards for how data should be handled across the organization, but the specific implementation is up to the individual teams.
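To make "data as a product" a little more concrete, a domain team might publish a small, machine-readable description alongside its dataset. The sketch below is purely illustrative; the field names are assumptions, not a standard:

# A hypothetical "data product" descriptor the sales domain team might publish
sales_data_product = {
    "name": "sales_daily",
    "owner_team": "sales-domain",
    "description": "Cleaned daily sales, one row per transaction",
    "schema": {"id": "INTEGER", "sale_amount": "REAL", "store": "TEXT"},
    "refresh_schedule": "daily",
    "contact": "sales-data@example.com",
}

print(f"Data product '{sales_data_product['name']}' is owned by {sales_data_product['owner_team']}.")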
Visualizing the Change:
This diagram illustrates the shift from a centralized to a decentralized data architecture.
graph LR
subgraph "Centralized"
direction LR
A[Sales App] --> C(Central Data Team)
B[Marketing App] --> C
C --> D[Data Warehouse]
D --> E[Analysts]
end
subgraph "Data Mesh"
direction LR
F[Sales App] --> G(Sales Data Product)
G --> J[Analysts]
H[Marketing App] --> I(Marketing Data Product)
I --> J
end
In the centralized model, all data flows through one team. In the Data Mesh model, the data is managed by the teams that own it, and it is made directly available to the analysts.
Data Mesh is more of a philosophy than a specific tool, and it represents a significant change in how companies approach their data. It’s all about empowering teams, reducing bottlenecks, and treating data as a valuable asset that can be shared easily across an organization.
In the next chapter, we’ll shift our focus from architecture to the hands-on process of data exploration and preparation, which is a crucial first step in any data project.
The Problem:
You’ve built a data pipeline that moves data from a source to a destination. But when you get a brand new dataset, where do you even begin? It’s rarely perfect. It might be missing information, have inconsistent values, or contain errors. Before you can build a model or create a report, you need to understand the data, and then prepare it. This is a crucial, often-overlooked step in any data project.
Workflow Discovery and Visualization:
The first thing a data professional does when given a new dataset is explore it. This process, which we can call workflow discovery, is about asking questions to figure out what the data is, how it’s structured, and what problems it might have. This is where you use visualization to gain a feel for the data.
Think of it like being a detective. You are handed a mysterious file (the dataset). You can’t just jump to conclusions. You need to look through it, organize the clues, and find inconsistencies or missing pieces. This helps you form a plan for how you’ll solve the mystery.
For this chapter, we will use a real tool for data exploration and preparation: the Pandas library. Pandas is a powerful Python library that provides a way to work with structured data in a format called a DataFrame. Think of a DataFrame as a spreadsheet with superpowers. It organizes data in rows and columns and comes with a set of built-in functions for quick analysis and cleaning.
Hands-on: Data Exploration with Pandas
For this exercise, we will use a publicly available dataset of Olympic athletes.
First, let’s install the Pandas library and a library to help us get the data. We do this by using pip, the package installer for Python, from the command line.
!pip install pandas
!pip install xlrd openpyxl
Now, let’s load our data. We will get our data from an Excel file.
import pandas as pd
# Load the Olympic data from a URL
data_url = 'https://github.com/microsoft/O-r-a-l/raw/main/data/olympics.xlsx'
df = pd.read_excel(data_url, sheet_name='ALL')
print("Data loaded successfully! Here are the first 5 rows:")
print(df.head())
The output will show the first few rows of the DataFrame. This is your first look at the data. It’s an important step.
Now, let’s ask some questions to discover our data.
What’s the overall structure of the data? We can use the .info() method to get a summary. It tells us the number of rows and columns, the column names, the number of non-missing values in each column, and the data types.
print("\nData structure information:")
df.info()
The output will tell you if any columns have missing values and what kind of data they hold (e.g., numbers or text).
What are the key statistics of the data? We can use the .describe() method to get a statistical summary of the numerical columns. This gives us the count, mean, standard deviation, and other key information.
print("\nStatistical summary of numerical data:")
print(df.describe())
Are there any inconsistencies? We can check for unique values in a specific column to see if there are any typos or variations. For example, let’s look at the Discipline column.
print("\nUnique values in the 'Discipline' column:")
print(df['Discipline'].unique())
You might find variations like swimming and Swimming or other inconsistencies.
This process of data exploration and preparation is iterative. You start by looking at the raw data, asking questions, and then using tools like Pandas to find answers. Once you have a good understanding of your data, you can start the data preparation phase, where you clean and transform it.
Data Preparation:
Based on our exploration, we can start to clean the data. For example, if we found that the Discipline column has inconsistencies due to capitalization, we can fix that.
# Convert all values in the 'Discipline' column to lowercase to make them consistent
df['Discipline'] = df['Discipline'].str.lower()
print("\nUnique values after cleaning the 'Discipline' column:")
print(df['Discipline'].unique())
This is just a small example, but it shows the power of data preparation. A clean dataset is the foundation for any successful data project.
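Another common preparation step is handling missing values, which the .info() summary would have surfaced earlier. Here is a small sketch of two options, assuming some rows may be missing a Discipline value:

# How many values are missing in each column?
print(df.isnull().sum())

# One option: drop rows that are missing the value we care about
df_clean = df.dropna(subset=['Discipline']).copy()

# Another option: fill missing values with a clearly labeled placeholder instead
df['Discipline'] = df['Discipline'].fillna('unknown')

print(f"Rows before: {len(df)}, rows after dropping missing disciplines: {len(df_clean)}")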
In the next chapter, we will learn how to leverage powerful AI models like Gemini and ChatGPT to assist with these tasks, making our lives as data engineers even easier.
The Problem:
You now have a solid understanding of how to collect, organize, and prepare data. But the ultimate goal of data engineering is to use that data to solve real business problems. This is where the magic happens, and modern AI models can be powerful allies in this process. AI models like Gemini and ChatGPT are not just for writing essays or generating images; they can act as a partner in your data workflow. They can help you brainstorm solutions, write and explain code, and troubleshoot problems faster than ever before.
The Power of AI Models:
Think of an AI model as an incredibly knowledgeable and fast-thinking assistant. You don’t just ask it to solve the problem for you; you work with it. You provide the context, and it provides suggestions, code snippets, and explanations. This partnership allows you to focus on the higher-level logic of your solution while the AI handles some of the more repetitive or complex coding tasks. This is called prompt engineering, which is the art of crafting effective questions to get the best results from an AI.
Code Generation and Troubleshooting:
One of the most useful things an AI model can do is help you write and debug code. For example, let’s say we have our list of Olympic athletes, and we want to find the top 5 most common sports. We could try to write the code from scratch, or we could ask an AI for help.
Let’s imagine our "prompt" to the AI is something like this:
"I have a Pandas DataFrame named df with a column called 'Discipline'. Can you write Python code to find the five most common disciplines and print them?"
The AI might respond with a code snippet similar to this:
# Assuming 'df' is your DataFrame
top_5_disciplines = df['Discipline'].value_counts().head(5)
print(top_5_disciplines)
You can then run this code in a cell and see the results. The AI can also explain each part of the code, so you understand why it works. For example, it would explain that .value_counts() counts the occurrences of each unique value in the column, and .head(5) returns the top five results. This helps you learn new methods and debug your own code when it’s not working as expected.
Hands-on: Using an AI to Solve a Business Problem
For our final project, we’ll solve a simple business problem: "What are the most popular sports among athletes from the United States?" We’ll use our skills from Chapter 7 and then use a prompt to an AI model to help us write the final code.
Step 1: Get the Data
We’ll use the same Olympic data from the last chapter.
import pandas as pd
data_url = 'https://github.com/microsoft/O-r-a-l/raw/main/data/olympics.xlsx'
df = pd.read_excel(data_url, sheet_name='ALL')
Step 2: Define the Problem and Prompt the AI
Our goal is to find the most popular sports for a specific country. A good prompt for an AI would be:
"I have a Pandas DataFrame named df with a 'NOC' column (which stands for National Olympic Committee, like USA or CHN) and a 'Discipline' column. I need to find the top 5 most common disciplines for athletes from the United States. Can you give me the Python code to do this?"
The AI would likely provide a solution that involves filtering the DataFrame and then counting the values. Let’s see what that looks like:
# Filter the DataFrame to only include athletes from the USA
usa_athletes_df = df[df['NOC'] == 'USA']
# Now, find the top 5 disciplines for these athletes
top_5_usa_disciplines = usa_athletes_df['Discipline'].value_counts().head(5)
print("The top 5 most popular disciplines for USA athletes are:")
print(top_5_usa_disciplines)
Step 3: Run and Refine the Code
You would run this code in a cell. The output would likely show the top 5 disciplines. This process demonstrates how you can use an AI model to break down a problem and provide the specific code snippets you need, allowing you to focus on the overall logic of your solution.
This is just a small example, but it shows how AI can be a powerful partner in your data engineering work, helping you with everything from writing new functions to troubleshooting complex problems.
The Problem:
As your programs grow in size and complexity, you’ll find that organizing your code is just as important as writing it. This is where Object-Oriented Programming (OOP) comes in. OOP is a powerful way to organize your code that lets you model real-world things as digital objects, each with its own information and abilities.
Think of an object as a single, self-contained unit. For example, your phone is an object. It has attributes (data) like its color, brand, and battery level. It also has methods (abilities or functions) like making calls, taking pictures, and sending texts. In OOP, you create a blueprint for an object, and then you can create as many objects as you need from that blueprint.
Classes: The Blueprints
A class is the blueprint for creating objects. It defines the attributes and methods that all objects of that class will have. You define a class using the class keyword.
Let’s create a class called DataPipeline. This will be our blueprint for all the pipelines we might build.
class DataPipeline:
    def __init__(self, name, source, destination):
        self.name = name
        self.source = source
        self.destination = destination
        self.status = "idle"

    def run(self):
        print(f"Starting pipeline '{self.name}' from {self.source} to {self.destination}.")
        self.status = "running"
        # In a real pipeline, we'd add our ETL/streaming code here.
        print(f"Pipeline '{self.name}' finished successfully.")
        self.status = "completed"

    def get_status(self):
        return f"Pipeline '{self.name}' is {self.status}."
class DataPipeline:: This line tells Python you’re creating a blueprint for a DataPipeline.
def __init__(self, name, source, destination):: This is a special method called the constructor. Its job is to set up the object’s initial attributes. The self parameter refers to the specific object you are creating.
self.status = "idle": We’ve given our pipeline a default status.
def run(self): and def get_status(self):: These are methods (functions) that belong to our DataPipeline object. They can access the object’s attributes using self.
Objects: The Instances
An object is a specific instance created from a class. Once you have a DataPipeline blueprint, you can create as many DataPipeline objects as you like, each with its own unique attributes.
To create an object, you call the class name as if it were a function:
# Create two different pipeline objects from our blueprint
daily_report_pipeline = DataPipeline("Daily Sales Report", "sales.csv", "analytics_warehouse.db")
streaming_data_pipeline = DataPipeline("Real-time Sensor Data", "sensor_stream", "data_lake")
# Accessing attributes and calling methods
print(daily_report_pipeline.get_status())
print(streaming_data_pipeline.get_status())
print("\nRunning the daily report pipeline:")
daily_report_pipeline.run()
print("\nChecking the status again:")
print(daily_report_pipeline.get_status())
The output will show that each object has its own separate status and attributes. When you call daily_report_pipeline.run(), the method knows to use the attributes of that specific object.
OOP is a fundamental concept in all programming languages. It lets you organize your code into logical, reusable units. In the context of data engineering, this allows you to create flexible, maintainable, and scalable systems. For example, a single DataPipeline class can be used to manage hundreds of different pipelines, each with its own unique name, source, and destination.
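As a small sketch of what that might look like, we can keep several pipeline objects in a list and run them in a loop (the pipeline names and sources here are made up for illustration):

# A hypothetical registry of pipelines, all built from the same blueprint
pipelines = [
    DataPipeline("Daily Sales Report", "sales.csv", "analytics_warehouse.db"),
    DataPipeline("Web Traffic ETL", "data_lake/web_traffic_raw.json", "analytics_warehouse.db"),
    DataPipeline("Real-time Sensor Data", "sensor_stream", "data_lake"),
]

# Run every pipeline and report its status
for pipeline in pipelines:
    pipeline.run()
    print(pipeline.get_status())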
In the next chapter, we’ll dive into errors and debugging, which are crucial skills for any programmer.
The Problem:
No matter how carefully you write your code, you will encounter errors. This isn’t a sign of failure; it’s a normal and expected part of the process. A good programmer isn’t someone who never makes a mistake, but someone who knows how to deal with them effectively. This chapter will teach you how to understand errors, find bugs in your code, and build a more resilient mindset.
Understanding Errors and Exceptions
In Python, errors often come with a detailed message called a traceback. A traceback can look intimidating, but it’s a valuable map that tells you exactly where and why your code broke. It reads from the bottom up, with the last line being the most important.
Let’s look at a common error.
x = 10
y = "hello"
z = x + y
When you run this, you’ll get an error message that looks something like this:
Traceback (most recent call last):
  File "<ipython-input-1-c12e8b2f9a94>", line 3, in <module>
    z = x + y
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Let’s break it down:
The last line, TypeError: unsupported operand..., is the error message itself. It tells you the type of error (TypeError) and a short description of what went wrong. In this case, you can’t add a number (int) and a string (str).
The lines above that are the traceback. They show the sequence of calls that led to the error. The last line in the traceback points to the line in your code where the error occurred.
Most errors are exceptions, which are problems that arise during the execution of your code. Common exceptions include the following (short snippets that trigger each one appear right after this list):
NameError: You’ve tried to use a variable that hasn’t been created yet.
IndexError: You’ve tried to access an item in a list or tuple using an index that doesn’t exist.
KeyError: You’ve tried to get a value from a dictionary using a key that isn’t there.
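Here are three tiny snippets, each deliberately broken, that raise these exceptions (a sketch; run them one at a time, since each one stops the cell):

# NameError: using a variable that was never created
print(favorite_number)

# IndexError: the list has indexes 0, 1, and 2, so index 3 doesn't exist
scores = [10, 20, 30]
print(scores[3])

# KeyError: the dictionary has no 'store' key
sale = {"id": 1, "sale_amount": 100}
print(sale["store"])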
Handling Exceptions
Sometimes, you can anticipate that an error might happen. For example, what if a user is asked to enter a number but types a word instead? Your program would crash. You can use try and except blocks to handle these situations gracefully.
try:
    user_input = int(input("Enter a number: "))
    result = 10 / user_input
    print(f"The result is: {result}")
except ValueError:
    print("That's not a valid number. Please try again.")
except ZeroDivisionError:
    print("You can't divide by zero. Please try again.")
The code inside the try block is what you want to run.
If a ValueError occurs (e.g., the user types a word), the code inside the except ValueError block runs. The program doesn’t crash; it simply prints the helpful message.
Similarly, if a ZeroDivisionError occurs, the program handles it and continues.
You can also use a generic except to catch any type of error, but it’s often better to be specific so you know what went wrong.
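If you do use a generic handler, it helps to at least capture and print the error so you can see what happened. A short sketch:

try:
    result = 10 / 0
except Exception as e:
    # Catches any exception, but still tells us what went wrong
    print(f"Something went wrong: {type(e).__name__}: {e}")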
Debugging Techniques
Debugging is the process of finding and fixing bugs. It’s a key skill for any programmer.
Read the Traceback Carefully: The traceback is your best friend. Pay close attention to the last line, as it often contains the most important information.
Use print() statements: The simplest way to debug is to print the values of your variables at different points in your code. This helps you see how data is changing and where things might be going wrong (see the short sketch after this list).
Break It Down: If a large piece of code isn’t working, break it into smaller parts. Test each part individually to find the exact location of the bug.
Google the Error: Copy and paste the error message into a search engine. Chances are, someone else has had the same problem, and the solution is online.
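For instance, here is a quick print-debugging sketch based on the sales-loading loop from earlier, showing each value right before it gets converted:

with open("data_lake/sales_data_raw.csv") as f:
    next(f)  # skip the header
    for line in f:
        parts = [part.strip() for part in line.split(',')]
        # Debug print: see exactly what we're about to convert
        print(f"DEBUG: parts = {parts}")
        sale_amount = float(parts[1])
        print(f"DEBUG: sale_amount = {sale_amount}")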
Debugging requires patience and a systematic approach. With practice, you’ll become more adept at identifying and solving problems, which will make you a more confident programmer.
Congratulations! You’ve successfully completed this book. In just a short time, you’ve gone from a Python beginner to a data practitioner who can build pipelines, work with modern data architectures, and even leverage AI to solve problems. This is a significant milestone.
This journey has taught you valuable skills that go beyond just coding:
Patience and Resilience: You’ve learned that encountering errors is a normal part of the process, and you now have the tools to troubleshoot them.
Systematic Thinking: You’ve learned to break down complex problems into manageable, sequential steps, whether it’s building a pipeline or exploring a new dataset.
Adaptability: You’ve seen how the field of data engineering is constantly evolving, and you now have a foundation to adapt and learn new tools and concepts as they emerge.
The world of data is vast and exciting. The skills you’ve acquired in this book are the bedrock upon which you can build a rewarding career or simply use data to enrich your life. Happy data crunching!