Python for AI (Part 3): Mastering Pandas for Data Handling

Introduction

Welcome to Part 3 of your journey into Python for AI! With your Java background and your growing Python skills, you’re ready to tackle Pandas, the go-to library for working with structured data such as tables and spreadsheets. In AI work, Pandas is the building block for exploring and preprocessing datasets, a step you can’t skip before training a model. Let’s get down to the nitty-gritty with concrete examples!

What is Pandas and Why Use It?

Pandas provides two core structures: Series (1D, similar to a single column) and DataFrame (2D, like a table). It’s ideal for loading, cleaning, and transforming data for machine learning tasks. Install it with:

pip install pandas

Then import it (with NumPy, its foundation):

import pandas as pd
import numpy as np
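To make the Series/DataFrame distinction concrete, here is a minimal sketch. The names temps and table and the Humid column are made up for illustration only:

```python
import pandas as pd

# A Series is a single labeled column of values.
temps = pd.Series([22, 25, 19], name='Temp')

# A DataFrame is a table whose columns are Series sharing one index.
table = pd.DataFrame({'Temp': temps, 'Humid': [0.40, 0.55, 0.62]})

print(type(table['Temp']))  # selecting one column gives back a Series
print(table.shape)          # (3, 2): three rows, two columns
```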

Creating and Exploring a DataFrame

Let’s create a DataFrame for student exam scores—an AI-relevant dataset:

# Dictionary with data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

Output:

      Name  Math  Science  Attendance
0    Alice    85       88        0.95
1      Bob    92       85        0.80
2  Charlie    78       90        0.99
3     Dana    95       92        0.90

Summarize its numeric columns:

print(df.describe())

Output (summary stats):

            Math    Science  Attendance
count   4.000000   4.000000    4.000000
mean   87.500000  88.750000    0.910000
std     7.593857   2.986079    0.082057
min    78.000000  85.000000    0.800000
25%    83.250000  87.250000    0.875000
50%    88.500000  89.000000    0.925000
75%    92.750000  90.500000    0.980000
max    95.000000  92.000000    0.990000

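This is gold for AI: the mean and standard deviation show each feature’s center and spread, and the min and max reveal its range, which helps you spot outliers and decide whether features need scaling.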
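Summary statistics cover only the numeric columns. To check the structure itself (size, column types, missing values), a sketch along these lines may help; it rebuilds the same student data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

print(df.shape)    # (4, 4): four rows, four columns
print(df.dtypes)   # data type of each column
df.info()          # prints its own report: columns, non-null counts, dtypes
print(df.head(2))  # first two rows only
```

On large datasets, head() and info() are usually the first calls worth making, since printing the whole table is impractical.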

Selecting and Filtering Data

In AI, you’ll grab specific data chunks. Select the Math column:

math_scores = df['Math']
print(math_scores)

Output:

0    85
1    92
2    78
3    95
Name: Math, dtype: int64

Filter for high Math scores (>90):

high_math = df[df['Math'] > 90]
print(high_math)

Output:

   Name  Math  Science  Attendance
1   Bob    92       85        0.80
3  Dana    95       92        0.90

This is like picking “good examples” for an AI model.
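Filters can also be combined. The sketch below rebuilds the same DataFrame so it runs on its own; the thresholds are arbitrary and chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

# AND: wrap each condition in parentheses and join with &
strong_both = df[(df['Math'] > 80) & (df['Science'] > 87)]

# OR: join with |
math_or_attendance = df[(df['Math'] > 90) | (df['Attendance'] > 0.95)]

# .loc filters rows and picks columns in one step
names_only = df.loc[df['Math'] > 90, ['Name', 'Math']]

print(strong_both['Name'].tolist())         # ['Alice', 'Dana']
print(math_or_attendance['Name'].tolist())  # ['Bob', 'Charlie', 'Dana']
print(names_only)
```

Coming from Java, note that `&&` and `||` have no equivalent here: Python’s `and`/`or` do not work element-wise on a Series, so `&` and `|` with parentheses are required.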

Modifying Data

Add an Average column:

df['Average'] = (df['Math'] + df['Science']) / 2
print(df)

Output:

      Name  Math  Science  Attendance  Average
0    Alice    85       88        0.95     86.5
1      Bob    92       85        0.80     88.5
2  Charlie    78       90        0.99     84.0
3     Dana    95       92        0.90     93.5

Fix Bob’s attendance:

df.loc[df['Name'] == 'Bob', 'Attendance'] = 0.85
print(df)

Output:

      Name  Math  Science  Attendance  Average
0    Alice    85       88        0.95     86.5
1      Bob    92       85        0.85     88.5
2  Charlie    78       90        0.99     84.0
3     Dana    95       92        0.90     93.5

Handling Missing Data

Real AI data often has gaps. Introduce one:

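# Introduce a missing value. NaN is a float, so cast the integer column
# first; newer pandas versions may warn or raise on the implicit upcast.
df['Science'] = df['Science'].astype(float)
df.loc[1, 'Science'] = np.nan
print(df)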

Output:

      Name  Math  Science  Attendance  Average
0    Alice    85     88.0        0.95     86.5
1      Bob    92      NaN        0.85     88.5
2  Charlie    78     90.0        0.99     84.0
3     Dana    95     92.0        0.90     93.5

Fill with the mean:

mean_science = df['Science'].mean()
df['Science'] = df['Science'].fillna(mean_science)
print(df)

Output:

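      Name  Math  Science  Attendance  Average
0    Alice    85     88.0        0.95     86.5
1      Bob    92     90.0        0.85     88.5
2  Charlie    78     90.0        0.99     84.0
3     Dana    95     92.0        0.90     93.5

Note that Average still reflects Bob’s original Science score of 85. Recompute derived columns after filling if they should use the new value.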
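Before choosing how to handle gaps, it helps to count them and compare the options. This is a minimal sketch on a cut-down version of the data; the frame name scores and the use of the median as an alternative fill value are illustrative choices, not part of the walkthrough above:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Science': [88.0, np.nan, 90.0, 92.0],
})

# Count missing values per column
missing_per_column = scores.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value
dropped = scores.dropna()

# Option 2: fill the gap, here with the median, which is less
# sensitive to outliers than the mean
filled = scores.fillna({'Science': scores['Science'].median()})

print(len(dropped))               # 3
print(filled.loc[1, 'Science'])   # 90.0
```

Both dropna() and fillna() return new objects, so scores itself still contains the NaN afterwards.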

Feature Engineering for AI

Create a Pass/Fail feature (Average > 85):

df['Pass'] = df['Average'] > 85
print(df)

Output:

      Name  Math  Science  Attendance  Average   Pass
0    Alice    85     88.0        0.95     86.5   True
1      Bob    92     90.0        0.85     88.5   True
2  Charlie    78     90.0        0.99     84.0  False
3     Dana    95     92.0        0.90     93.5   True

This could be a target for a classification model.
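Many models expect numeric inputs on comparable scales. As a sketch of two common follow-up steps (the column names Pass_Int and Average_Scaled are invented for this example), you could encode the boolean as 0/1 and min-max scale the average into the range 0 to 1:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Average': [86.5, 88.5, 84.0, 93.5],
})
df['Pass'] = df['Average'] > 85

# Encode True/False as 1/0 for use as a numeric label
df['Pass_Int'] = df['Pass'].astype(int)

# Min-max scaling: (x - min) / (max - min)
avg = df['Average']
df['Average_Scaled'] = (avg - avg.min()) / (avg.max() - avg.min())

print(df)
```

Here the lowest average (Charlie, 84.0) maps to 0.0 and the highest (Dana, 93.5) maps to 1.0.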

Grouping and Aggregation

Group by Pass/Fail and compute averages:

grouped = df.groupby('Pass').mean(numeric_only=True)
print(grouped)

Output:

            Math  Science  Attendance  Average
Pass
False  78.000000     90.0        0.99     84.0
True   90.666667     90.0        0.90     89.5

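See the difference? Passing students average about 12.7 points higher in Math, while the single failing student actually has the highest attendance. With only four rows this is an illustration rather than a finding, but the same comparison scales to real datasets.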
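When you want different statistics per column rather than the mean of everything, named aggregation is one option. A sketch, rebuilding the data with Bob’s corrected attendance; the output column names students, math_mean, and attendance_min are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Attendance': [0.95, 0.85, 0.99, 0.90],
    'Pass': [True, True, False, True],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby('Pass').agg(
    students=('Name', 'count'),
    math_mean=('Math', 'mean'),
    attendance_min=('Attendance', 'min'),
)
print(summary)
```

Including a count alongside the averages is a useful habit: it shows at a glance that the False group here is a single student.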

Try It Yourself: An Exercise

Create a DataFrame for weather data:

data = {
    'Day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'Temp': [22, 25, 19, 28],
    'Rain': [0.1, 0.0, 0.3, np.nan]
}
df_weather = pd.DataFrame(data)

Tasks:

  1. Add a column for “Feels Like” (Temp - 2 if Rain > 0, else Temp).
  2. Fill the missing Rain value with the mean.
  3. Filter for days where Temp > 20.

Hint: Use conditionals, fillna(), and filtering. Try it, then check below!

Solution

data = {
    'Day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'Temp': [22, 25, 19, 28],
    'Rain': [0.1, 0.0, 0.3, np.nan]
}
df_weather = pd.DataFrame(data)

# 1. Add Feels Like column (vectorized conditional).
# Thu's Rain is still NaN here, and NaN > 0 is False, so Thu keeps its Temp.
df_weather['Feels_Like'] = np.where(
    df_weather['Rain'] > 0, df_weather['Temp'] - 2, df_weather['Temp']
)

# 2. Fill missing Rain
mean_rain = df_weather['Rain'].mean()
df_weather['Rain'] = df_weather['Rain'].fillna(mean_rain)

# 3. Filter Temp > 20
hot_days = df_weather[df_weather['Temp'] > 20]
print(hot_days)

Output:

   Day  Temp      Rain  Feels_Like
0  Mon    22  0.100000          20
1  Tue    25  0.000000          25
3  Thu    28  0.133333          28

Next Steps

You’ve covered the Pandas fundamentals: creating DataFrames, filtering, handling missing data, and feature engineering! Next up, we’ll use Matplotlib to visualize this data (plotting temperatures, for instance). Practice with the exercise, try changing it (different thresholds, for example), and get ready for graphing!

Code Demo

import pandas as pd
import numpy as np

# Dictionary with data
# This DataFrame has rows (students) and columns (attributes).
# In AI, this could be a dataset for predicting student performance.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

# In real AI projects, you’ll load data from files.
# Let’s assume you have a CSV file named students.csv with the same data.
# Here’s how to load it (the r prefix keeps the backslashes in the
# Windows path from being read as escape sequences):
df = pd.read_csv(r'D:\PYTHON-PROJECTS\DataFrame\students.csv')
print(df)

# Exploring the Data
# Basic Info: This shows column names, data types, and non-null counts.
# info() prints its report itself and returns None, so no print() is needed.
df.info()

# Summary Statistics:
# This gives you means, standard deviations, and ranges—super
# useful for understanding data distributions in AI.
print(df.describe())

# Selecting and Filtering Data
# In AI, you often need specific parts of the data. Here’s how:
# Select Columns:
math_scores = df['Math']
print(math_scores)

# Select Multiple Columns:
scores = df[['Math', 'Science']]
print(scores)

# Filter Rows: Let’s find students with Math scores above 90:
high_math = df[df['Math'] > 90]
print(high_math)

# Adding and Modifying Data
# Let’s add a new column for “Average Score”:
df['Average'] = (df['Math'] + df['Science']) / 2
print(df)

# Now, let’s say Bob’s attendance was recorded incorrectly. Update it:
# In AI, you might update data like this to correct errors before modeling.
df.loc[df['Name'] == 'Bob', 'Attendance'] = 0.85
print(df)

# Handling Missing Data
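# Real-world AI datasets often have missing values. Let’s introduce one and fix it:
# Introduce a missing value. NaN is a float, so cast the integer column
# first; newer pandas versions may warn or raise on the implicit upcast.
df['Science'] = df['Science'].astype(float)
df.loc[1, 'Science'] = np.nan
print(df)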

# Drop Missing Values:
df_dropped = df.dropna()
print(df_dropped)

# Fill Missing Values: Let’s fill Bob’s Science score with the mean:
# In AI, you choose dropping or filling based on how much data you can afford to lose.
mean_science = df['Science'].mean()
df['Science'] = df['Science'].fillna(mean_science)
print(df)

# Practical AI Example: Feature Engineering
# In machine learning, you create new features to improve models.
# Let’s add a feature: “Pass/Fail” based on Average > 85.
# This binary feature could be a target variable for a classification model.
df['Pass'] = df['Average'] > 85
print(df)

# Grouping and Aggregation
# Suppose you’re analyzing data across groups.
# Let’s group by Pass/Fail and compute averages:
# This shows how passing students differ from failing ones—useful for insights in AI.
grouped = df.groupby('Pass').mean(numeric_only=True)
print(grouped)
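The demo assumes a students.csv file already exists at a fixed Windows path. If you don’t have one, the following sketch writes the same data to a temporary folder and reads it back, so the loading step can be tried anywhere. The temporary location is an assumption of this example, not part of the original demo:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Math': [85, 92, 78, 95],
    'Science': [88, 85, 90, 92],
    'Attendance': [0.95, 0.80, 0.99, 0.90],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'students.csv'
    # index=False keeps the 0..3 row labels out of the file
    df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded)
```

Without index=False, the row labels are written as an extra unnamed column, which then shows up as "Unnamed: 0" when the file is read back.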

Conclusion

Pandas gives you the tools to load, clean, and explore data the way AI practitioners do. Keep going, and soon you’ll be visualizing data and training models. Happy coding!
