Python for AI, Part 3: Mastering Pandas for Data Handling
Introduction
Welcome to Part 3 of your journey into Python for AI! With your Java experience and your growing Python skills, you’re ready to take on Pandas, a leading library for working with structured data such as tables and spreadsheets. Pandas is a core building block in AI work: you use it to preprocess and explore a dataset, an essential step before training any model. Let’s get down to the nitty-gritty with concrete examples!
What is Pandas and Why Use It?
Pandas provides two core structures: Series (1D, similar to a column) and DataFrame (2D, like a table). It’s ideal for loading data, cleaning it, and transforming it for machine learning tasks. Install it with:
pip install pandas
Then import it (with NumPy, its foundation):
import pandas as pd
import numpy as np
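To make the Series/DataFrame distinction concrete, here is a minimal sketch. The names and scores are illustrative values, not part of any real dataset.

```python
import pandas as pd

names = ['Alice', 'Bob', 'Charlie']

# A Series is one labeled column of values.
math = pd.Series([85, 92, 78], index=names, name='Math')
print(math['Bob'])   # look up a value by its label
print(math.mean())   # Series carry their own statistics methods

# A DataFrame is a table; selecting one column gives back a Series.
table = pd.DataFrame(
    {'Math': [85, 92, 78], 'Science': [88, 85, 90]},
    index=names,
)
column = table['Math']
print(type(column).__name__)
print(table.shape)   # (rows, columns)
```

Coming from Java, you can loosely think of a DataFrame as a map from column names to typed, index-aligned arrays.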
Creating and Exploring a DataFrame
Let’s create a DataFrame for student exam scores—an AI-relevant dataset:
# Dictionary with data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)
Output:
      Name  Math  Science  Attendance
0    Alice    85       88        0.95
1      Bob    92       85        0.80
2  Charlie    78       90        0.99
3     Dana    95       92        0.90
Check its summary statistics:
print(df.describe())
Output (summary stats):
            Math    Science  Attendance
count   4.000000   4.000000    4.000000
mean   87.500000  88.750000    0.910000
std     7.593857   2.986079    0.082057
min    78.000000  85.000000    0.800000
25%    83.250000  87.250000    0.875000
50%    88.500000  89.000000    0.925000
75%    92.750000  90.500000    0.960000
max    95.000000  92.000000    0.990000
This is gold for AI work: the means and spreads show each feature’s typical value and variability, which helps you spot patterns.
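describe() summarizes the values; to inspect the structure itself (size, column types, missing counts), a short sketch like the following may help. It rebuilds the same student DataFrame so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # prints columns, non-null counts, and memory use
print(df.head(2))  # first two rows, handy for large datasets
```

Note that info() prints its report itself, so there is no need to wrap it in print().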
Selecting and Filtering Data
In AI, you’ll grab specific data chunks. Select the Math column:
math_scores = df['Math']
print(math_scores)
Output:
0    85
1    92
2    78
3    95
Name: Math, dtype: int64
Filter for high Math scores (>90):
high_math = df[df['Math'] > 90]
print(high_math)
Output:
   Name  Math  Science  Attendance
1   Bob    92       85         0.8
3  Dana    95       92         0.9
This is like picking “good examples” for an AI model.
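Filters can also be combined. The sketch below uses thresholds chosen purely for illustration: it joins two conditions with & (each condition needs its own parentheses) and uses .loc to filter rows and pick columns in one step.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

# Rows where BOTH conditions hold; use | instead of & for "either".
strong_both = df[(df['Math'] > 80) & (df['Science'] > 86)]
print(strong_both)

# .loc takes a row filter and a list of columns to keep.
low_attendance = df.loc[df['Attendance'] < 0.9, ['Name', 'Attendance']]
print(low_attendance)
```

Python’s `and` / `or` keywords do not work here because the comparison produces a whole column of booleans, not a single value.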
Modifying Data
Add an Average column:
df['Average'] = (df['Math'] + df['Science']) / 2
print(df)
Output:
      Name  Math  Science  Attendance  Average
0    Alice    85       88        0.95     86.5
1      Bob    92       85        0.80     88.5
2  Charlie    78       90        0.99     84.0
3     Dana    95       92        0.90     93.5
Fix Bob’s attendance:
df.loc[df['Name'] == 'Bob', 'Attendance'] = 0.85
print(df)
Output:
      Name  Math  Science  Attendance  Average
0    Alice    85       88        0.95     86.5
1      Bob    92       85        0.85     88.5
2  Charlie    78       90        0.99     84.0
3     Dana    95       92        0.90     93.5
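If you would rather keep the original table untouched while you experiment, one option is assign(), which returns a new DataFrame instead of editing in place. This is a sketch of that alternative; the name df_clean is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

# assign() builds a new DataFrame with the extra column; df is unchanged.
df_clean = df.assign(Average=(df['Math'] + df['Science']) / 2)

# Correct Bob's attendance on the copy only.
df_clean.loc[df_clean['Name'] == 'Bob', 'Attendance'] = 0.85

print(df_clean)
print('Average' in df.columns)  # the original has no Average column
```

Keeping the raw data separate from the cleaned version makes it easier to re-run or audit preprocessing steps later.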
Handling Missing Data
Real AI data often has gaps. Introduce one:
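# Introduce a missing value
# Science holds integers; cast to float first, because recent pandas
# versions may warn about or reject putting NaN into an integer column.
df['Science'] = df['Science'].astype(float)
df.loc[1, 'Science'] = np.nan
print(df)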
Output:
      Name  Math  Science  Attendance  Average
0    Alice    85     88.0        0.95     86.5
1      Bob    92      NaN        0.85     88.5
2  Charlie    78     90.0        0.99     84.0
3     Dana    95     92.0        0.90     93.5
Fill with the mean:
mean_science = df['Science'].mean()
df['Science'] = df['Science'].fillna(mean_science)
print(df)
Output:
      Name  Math  Science  Attendance  Average
0    Alice    85     88.0        0.95     86.5
1      Bob    92     90.0        0.85     88.5
2  Charlie    78     90.0        0.99     84.0
3     Dana    95     92.0        0.90     93.5
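Before choosing a fix, it helps to count the gaps and compare your options. The sketch below rebuilds the table with Bob’s Science score already missing, then shows dropping versus filling. It also recomputes Average afterwards, because the Average column above was calculated before the gap appeared and still reflects Bob’s original Science score.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, np.nan, 90, 92],
})

# How many values are missing in each column?
print(df.isna().sum())

# Option 1: drop any row that has a missing value.
dropped = df.dropna()
print(len(dropped))

# Option 2: fill the gap with the column mean (NaN is skipped by mean()).
filled = df.fillna({'Science': df['Science'].mean()})

# Derived columns should be computed after filling so they stay in sync.
filled['Average'] = (filled['Math'] + filled['Science']) / 2
print(filled)
```

Mean filling is only one strategy; whether it suits your data depends on why the values are missing.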
Feature Engineering for AI
Create a Pass/Fail feature (Average > 85):
df['Pass'] = df['Average'] > 85
print(df)
Output:
      Name  Math  Science  Attendance  Average   Pass
0    Alice    85     88.0        0.95     86.5   True
1      Bob    92     90.0        0.85     88.5   True
2  Charlie    78     90.0        0.99     84.0  False
3     Dana    95     92.0        0.90     93.5   True
This could be a target for a classification model.
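Many machine learning libraries expect numeric labels, so a common next step is to turn the boolean into 0/1 and separate features from the target. The following is a sketch of that shape only: the column name Pass_Int and the variables X and y are illustrative, and in a real project you would not derive the target directly from the features as done here.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.85, 0.99, 0.90],
})
df['Average'] = (df['Math'] + df['Science']) / 2
df['Pass'] = df['Average'] > 85

# True/False becomes 1/0.
df['Pass_Int'] = df['Pass'].astype(int)

# Feature matrix (inputs) and target vector (what a model would predict).
X = df[['Math', 'Science', 'Attendance']]
y = df['Pass_Int']

print(X.shape)
print(y.value_counts())  # class balance: how many 1s versus 0s
```

Checking the class balance early tells you whether one label dominates, which affects how you evaluate a classifier.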
Grouping and Aggregation
Group by Pass/Fail and compute averages:
grouped = df.groupby('Pass').mean(numeric_only=True)
print(grouped)
Output:
            Math  Science  Attendance  Average
Pass                                          
False  78.000000     90.0        0.99     84.0
True   90.666667     90.0        0.90     89.5
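See the difference? In this small sample, passing students average about 90.7 in Math versus 78 for the one failing student. Group comparisons like this are AI insight in action.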
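When you want different statistics per column, agg() with named aggregations is one way to do it. In this sketch the output column names (students, math_mean, attendance_min) are arbitrary labels chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.85, 0.99, 0.90],
})
df['Average'] = (df['Math'] + df['Science']) / 2
df['Pass'] = df['Average'] > 85

# Each keyword is new_column=(source_column, aggregation).
summary = df.groupby('Pass').agg(
    students=('Name', 'count'),
    math_mean=('Math', 'mean'),
    attendance_min=('Attendance', 'min'),
)
print(summary)
```

The group sizes matter as much as the means: with a single failing student, any comparison here is only a toy illustration.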
Try It Yourself: An Exercise
Create a DataFrame for weather data:
data = {
    'Day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'Temp': [22, 25, 19, 28],
    'Rain': [0.1, 0.0, 0.3, np.nan]
}
df_weather = pd.DataFrame(data)
Tasks:
- Add a column for “Feels Like” (Temp - 2 if Rain > 0, else Temp).
- Fill the missing Rain value with the mean.
- Filter for days where Temp > 20.
Hint: Use conditionals, fillna(), and filtering. Try it, then check below!
Solution
data = {
    'Day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'Temp': [22, 25, 19, 28],
    'Rain': [0.1, 0.0, 0.3, np.nan]
}
df_weather = pd.DataFrame(data)

# 1. Add Feels Like column
# Thu's Rain is still NaN here, and NaN > 0 is False, so Thu keeps its Temp.
df_weather['Feels_Like'] = df_weather.apply(
    lambda row: row['Temp'] - 2 if row['Rain'] > 0 else row['Temp'],
    axis=1
)

# 2. Fill missing Rain
mean_rain = df_weather['Rain'].mean()
df_weather['Rain'] = df_weather['Rain'].fillna(mean_rain)

# 3. Filter Temp > 20
hot_days = df_weather[df_weather['Temp'] > 20]
print(hot_days)
Output:
   Day  Temp      Rain  Feels_Like
0  Mon    22  0.100000          20
1  Tue    25  0.000000          25
3  Thu    28  0.133333          28
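As a variation on the solution (not a replacement for it), you could fill the missing Rain value first and then build Feels_Like with np.where, which works on whole columns at once instead of row by row. With that order, Thursday counts as a rainy day, so its Feels_Like drops by 2.

```python
import numpy as np
import pandas as pd

df_weather = pd.DataFrame({
    'Day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'Temp': [22, 25, 19, 28],
    'Rain': [0.1, 0.0, 0.3, np.nan],
})

# Fill the gap first so every later step sees complete data.
df_weather['Rain'] = df_weather['Rain'].fillna(df_weather['Rain'].mean())

# np.where(condition, value_if_true, value_if_false), applied column-wide.
df_weather['Feels_Like'] = np.where(
    df_weather['Rain'] > 0,
    df_weather['Temp'] - 2,
    df_weather['Temp'],
)

hot_days = df_weather[df_weather['Temp'] > 20]
print(hot_days)
```

Which order is right depends on what the missing value means; the point is that the order of cleaning steps changes the result.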
Next Steps
You’ve covered the Pandas fundamentals: creating DataFrames, filtering, handling missing data, and feature engineering! Next up, we’ll use Matplotlib to visualize this data (plotting temperatures, for instance). Practice with the exercise, tweak it (for example, change the thresholds), and get ready for graphing!
Code Demo
import pandas as pd
import numpy as np

# Dictionary with data
# This DataFrame has rows (students) and columns (attributes).
# In AI, this could be a dataset for predicting student performance.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

# In real AI projects, you’ll load data from files.
# Let’s assume you have a CSV file named students.csv with the same data.
# Here’s how to load it (the raw string r'...' keeps the backslashes in
# the Windows path from being read as escape sequences):
df = pd.read_csv(r'D:\PYTHON-PROJECTS\DataFrame\students.csv')
print(df)

# Exploring the Data
# Basic Info: This shows column names, data types, and if any values are missing.
# info() prints its report itself, so it is not wrapped in print().
df.info()

# Summary Statistics:
# This gives you means, standard deviations, and ranges, which are
# super useful for understanding data distributions in AI.
print(df.describe())

# Selecting and Filtering Data
# In AI, you often need specific parts of the data. Here’s how:
# Select Columns:
math_scores = df['Math']
print(math_scores)

# Select Multiple Columns:
scores = df[['Math', 'Science']]
print(scores)

# Filter Rows: Let’s find students with Math scores above 90:
high_math = df[df['Math'] > 90]
print(high_math)

# Adding and Modifying Data
# Let’s add a new column for “Average Score”:
df['Average'] = (df['Math'] + df['Science']) / 2
print(df)

# Now, let’s say Bob’s attendance was recorded incorrectly. Update it:
# In AI, you might update data like this to correct errors before modeling.
df.loc[df['Name'] == 'Bob', 'Attendance'] = 0.85
print(df)

# Handling Missing Data
# Real-world AI datasets often have missing values. Let’s introduce one and fix it:
# Introduce a missing value (cast to float first, because recent pandas
# versions may warn about or reject putting NaN into an integer column)
df['Science'] = df['Science'].astype(float)
df.loc[1, 'Science'] = np.nan
print(df)

# Drop Missing Values:
df_dropped = df.dropna()
print(df_dropped)

# Fill Missing Values: Let’s fill Bob’s Science score with the mean:
# In AI, you choose dropping or filling based on how much data you can afford to lose.
mean_science = df['Science'].mean()
df['Science'] = df['Science'].fillna(mean_science)
print(df)

# Practical AI Example: Feature Engineering
# In machine learning, you create new features to improve models.
# Let’s add a feature: “Pass/Fail” based on Average > 85.
# This binary feature could be a target variable for a classification model.
df['Pass'] = df['Average'] > 85
print(df)

# Grouping and Aggregation
# Suppose you’re analyzing data across groups.
# Let’s group by Pass/Fail and compute averages:
# This shows how passing students differ from failing ones, useful for insights in AI.
grouped = df.groupby('Pass').mean(numeric_only=True)
print(grouped)
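The demo reads students.csv from a path on the author’s machine. If you don’t have that file, here is a small sketch that practices the same read_csv call on CSV text held in memory. The CSV content simply mirrors the student table, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# CSV text that mirrors the student table; io.StringIO makes it file-like.
csv_text = """Name,Math,Science,Attendance
Alice,85,88,0.95
Bob,92,85,0.80
Charlie,78,90,0.99
Dana,95,92,0.90
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv() with no path returns the CSV as a string;
# index=False leaves out the 0..3 row labels.
out = df.to_csv(index=False)
print(out)
```

With a real file you would pass its path instead, for example pd.read_csv('students.csv') when the file sits next to your script.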
Conclusion
Pandas gives you the superpower to read data like an AI expert. Keep moving forward, and soon you’ll be able to visualize and model. Happy coding!