Python for AI (Part 6): Introduction to Machine Learning with scikit-learn
Introduction
Welcome to Part 6 of your Python-for-AI journey! After learning data visualization with Seaborn in Part 5, you’re ready to move into machine learning with scikit-learn, a library of tools for building and evaluating predictive models. This part builds on your Java background and growing Python proficiency to cover supervised learning, specifically classification. We will predict pass/fail status on the student dataset from the previous parts, closing the gap between the concepts you have learned and their practical use. The tutorial walks you through the basics of installation, data preprocessing, model training, and evaluation, which makes it a good starting point for machine learning in AI projects.
Machine Learning Explained: How and Why to Use It?
Machine learning is a branch of AI focused on algorithms that learn from data and make predictions or decisions without being explicitly programmed for each case. It’s a staple of AI, allowing computers to do things like filter spam email, forecast stock prices, and, in our case, predict whether students will pass or fail. We will focus on supervised learning, where the algorithm is trained on labeled data: data with known outcomes, such as the Pass column (True/False) in our student dataset. This is especially relevant for applied AI, because it is how you might train a model that a Java application then uses to classify new data based on historical examples.
Supervised learning is a good starting point for beginners. It covers two main kinds of task: classification (predicting categories) and regression (predicting numbers). scikit-learn is approachable here because it provides a consistent, easy-to-use API for popular algorithms such as logistic regression, decision trees, and support vector machines. We’ll start with logistic regression for its interpretability and ease of use, which should suit your Java experience of writing structured, predictable systems.
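To make the "consistent API" point concrete, here is a minimal sketch that fits the three algorithms mentioned above with the same two calls, fit and predict. The toy data, the variable names, and the choice of max_iter=1000 and random_state=0 are assumptions made for illustration; they are not part of the student dataset used later.

```python
# A minimal sketch on made-up toy data (not the student dataset).
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Two hypothetical features per row, e.g. [score, study_hours]
X_toy = [[60, 2], [65, 3], [70, 3], [80, 5], [85, 6], [90, 7]]
y_toy = [0, 0, 0, 1, 1, 1]  # 0 = fail, 1 = pass

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "support vector machine": SVC(),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_toy, y_toy)  # learn from the labeled examples
    predictions[name] = clf.predict([[75, 4], [95, 8]])  # classify two new rows
    print(name, predictions[name])
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.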
Setting Up scikit-learn
Before first using scikit-learn, make sure it’s installed. Enter this in your command line:
pip install scikit-learn
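To confirm the installation worked, one simple check (a sketch, run from a Python prompt or script) is to import the package and print its version. Note that the package is installed as scikit-learn but imported as sklearn.

```python
# Quick sanity check that scikit-learn is importable
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one you are running.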
Then, import the libraries we’ll need, building on your Pandas knowledge:
# Import Pandas for data handling
import pandas as pd
# Import train_test_split to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Import LogisticRegression for our classification model
from sklearn.linear_model import LogisticRegression
# Import metrics for evaluating model performance
from sklearn import metrics
These imports give us tools for data management (pandas), splitting data (train_test_split), modeling (LogisticRegression), and evaluation (metrics). Check the Pandas Documentation and scikit-learn User Guide for more details.
Using the Student Dataset
We’ll use the student dataset from previous parts:
# Define the dataset as a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],  # Student names
    'Math': [85, 92, 78, 95, 88, 65, 90, 82, 77, 94],  # Math scores
    'Science': [88, 85, 90, 92, 80, 70, 87, 83, 79, 91],  # Science scores
    'English': [90, 88, 85, 93, 87, 75, 89, 84, 80, 92],  # English scores
    'Study_Hours': [5, 6, 4, 7, 5.5, 3, 6.5, 4.5, 4, 6],  # Hours studied
    'Grade_Level': ['Freshman', 'Sophomore', 'Freshman', 'Sophomore', 'Freshman',
                    'Freshman', 'Sophomore', 'Sophomore', 'Freshman', 'Sophomore'],  # Grade category
    'Pass': [True, True, False, True, True, False, True, True, False, True]  # Pass/Fail status
}
# Create a DataFrame for structured data
df = pd.DataFrame(data)
This dataset has 10 students with numerical features (Math, Science, English, Study_Hours), one categorical feature (Grade_Level), and a Boolean target (Pass). It’s small but well suited to learning: real AI projects use much larger datasets, but this keeps things manageable. It’s a practical example, like predicting outcomes in a Java app.
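Before modeling, it helps to look at what the DataFrame actually contains. The sketch below rebuilds the same dataset and checks its shape, column types, and how many students passed. The variable names n_passed and n_failed are introduced here for illustration.

```python
import pandas as pd

# Same student dataset as above
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Math': [85, 92, 78, 95, 88, 65, 90, 82, 77, 94],
    'Science': [88, 85, 90, 92, 80, 70, 87, 83, 79, 91],
    'English': [90, 88, 85, 93, 87, 75, 89, 84, 80, 92],
    'Study_Hours': [5, 6, 4, 7, 5.5, 3, 6.5, 4.5, 4, 6],
    'Grade_Level': ['Freshman', 'Sophomore', 'Freshman', 'Sophomore', 'Freshman',
                    'Freshman', 'Sophomore', 'Sophomore', 'Freshman', 'Sophomore'],
    'Pass': [True, True, False, True, True, False, True, True, False, True],
}
df = pd.DataFrame(data)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # which columns are numeric, text, or Boolean

# Count the two classes of the target column
n_passed = int(df['Pass'].sum())  # True counts as 1
n_failed = len(df) - n_passed
print(f"Passed: {n_passed}, Failed: {n_failed}")
```

Seven of the ten students pass, so the classes are imbalanced. A model that always predicted "pass" would already be right 70% of the time on this data, which is worth keeping in mind when reading accuracy numbers later.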
Preprocessing the Data
We need to prepare the data for the model with two steps:
1. Handle Categorical Variables
# Convert Grade_Level to numerical columns with one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Grade_Level'], drop_first=True)
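pd.get_dummies turns Grade_Level into binary columns (e.g., Grade_Level_Sophomore). drop_first=True drops the first category’s column (Grade_Level_Freshman), which is redundant here because a student who is not a sophomore must be a freshman. It is like normalizing data in Java so the same information is not stored twice.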
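To see the effect of drop_first, the following sketch encodes a three-row Grade_Level column both ways. The small levels DataFrame is made up for illustration. The .astype(int) call is only there to print 0/1 consistently, since newer pandas versions return Boolean dummy columns.

```python
import pandas as pd

# A tiny illustrative column (not the full student dataset)
levels = pd.DataFrame({'Grade_Level': ['Freshman', 'Sophomore', 'Freshman']})

# Without drop_first: one column per category
full = pd.get_dummies(levels, columns=['Grade_Level'])

# With drop_first=True: the first category (alphabetically) is dropped
reduced = pd.get_dummies(levels, columns=['Grade_Level'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced.astype(int))  # show the dummy column as 0/1
```

With two categories, the single remaining column carries all the information: 1 means sophomore, 0 means freshman.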
2. Split the Data
# Select features by dropping non-predictive columns
X = df_encoded.drop(['Name', 'Pass'], axis=1)
# Set the target variable
y = df_encoded['Pass']
# Split into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
drop removes Name and Pass, leaving the predictors in X, with Pass as the target (y). train_test_split divides the data, using test_size=0.2 to hold out 20% for testing (2 of our 10 rows) and random_state=42 to make the split reproducible. This ensures the model trains on most of the data and is tested on data it has not seen, a key AI practice.
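The role of test_size and random_state can be checked directly. In this sketch, a plain list of row numbers stands in for the 10 student rows (an assumption made to keep the example short), and the same split is run twice.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # stand-ins for the 10 student rows

# Run the same split twice with the same seed
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))  # sizes of the training and test sets
print(test_a == test_b)           # same seed, same split
print(sorted(train_a + test_a) == rows)  # every row lands in exactly one set
```

Leaving out random_state gives a different shuffle on each run, which is fine for experiments but makes results harder to compare and debug.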
What is Logistic Regression?
Logistic regression is a statistical model used for binary classification, meaning it predicts one of two possible outcomes, like pass or fail. It estimates the probability of the positive class and, in effect, separates the data into two groups with a straight line (or a flat hyperplane when there are more than two features). The model learns from the training data and can then make predictions on new, unseen data.
In our case, the model will look at students' scores and study hours to decide whether they are likely to pass or fail.
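Under the hood, logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability between 0 and 1. The sketch below shows that calculation in plain Python. The weights and bias are hypothetical values chosen by hand for illustration, not values learned from the student dataset, and pass_probability is a helper name introduced here.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, hand-picked parameters (a trained model would learn these)
weights = {'Math': 0.05, 'Study_Hours': 0.8}
bias = -8.0

def pass_probability(student):
    # Weighted sum of the features plus the bias, then squashed by the sigmoid
    z = bias + sum(weights[name] * student[name] for name in weights)
    return sigmoid(z)

strong = {'Math': 90, 'Study_Hours': 6}  # z = -8 + 4.5 + 4.8 = 1.3
weak = {'Math': 65, 'Study_Hours': 3}    # z = -8 + 3.25 + 2.4 = -2.35

for label, student in (('strong', strong), ('weak', weak)):
    p = pass_probability(student)
    print(label, round(p, 3), 'pass' if p >= 0.5 else 'fail')
```

A probability of 0.5 corresponds to z = 0, which is exactly the separating line (or hyperplane) described above.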
Training a Logistic Regression Model
Let’s train the model:
# Create a logistic regression model instance
model = LogisticRegression()
# Train the model on the training data
model.fit(X_train, y_train)
LogisticRegression() sets up the model, and fit teaches it to predict Pass using X_train and y_train. It’s like fitting a line that separates passing from failing students, an intuitive idea that fits your Java experience.
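After fit, the learned parameters are stored on the model and can be inspected, which is part of what makes logistic regression interpretable. This is a minimal sketch on made-up toy data with two hypothetical features; max_iter=1000 is an assumption added here to give the solver room to converge on unscaled inputs.

```python
from sklearn.linear_model import LogisticRegression

# Made-up toy data: [score, study_hours] per row
X_toy = [[60, 2], [65, 3], [70, 3], [80, 5], [85, 6], [90, 7]]
y_toy = [False, False, False, True, True, True]

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)

print(model.classes_)    # the two labels the model learned
print(model.coef_)       # one weight per feature
print(model.intercept_)  # the bias term

# predict_proba returns one probability per class, in the order of classes_
probabilities = model.predict_proba([[75, 4]])
print(probabilities)
```

A positive weight means that higher values of that feature push the prediction toward the positive class (here, True).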
Evaluating the Model
Now, evaluate its performance:
# Predict Pass/Fail for the test set
y_pred = model.predict(X_test)
# Calculate accuracy (correct predictions / total)
accuracy = metrics.accuracy_score(y_test, y_pred)
# Calculate precision (true positives / predicted positives)
precision = metrics.precision_score(y_test, y_pred)
# Calculate recall (true positives / actual positives)
recall = metrics.recall_score(y_test, y_pred)
# Calculate F1-score (balance of precision and recall)
f1 = metrics.f1_score(y_test, y_pred)
# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
# Generate and print confusion matrix (true vs. predicted outcomes)
confusion = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ")
print(confusion)
predict generates predictions, and we use metrics to compute accuracy (overall correctness), precision (reliability of positive predictions), recall (coverage of actual positives), and f1_score (balance of precision and recall). The confusion_matrix shows detailed results (e.g., true positives). With our small dataset the results can vary a lot from one split to another; in practice, larger datasets make these estimates more reliable, an important lesson for beginners.
predict(X_test) uses the model to predict Pass/Fail for the test set.
accuracy_score calculates the proportion of correct predictions, a common metric in AI.
precision_score measures the proportion of true positives among all positive predictions.
recall_score measures the proportion of true positives among all actual positives.
f1_score is the harmonic mean of precision and recall, providing a balanced measure.
confusion_matrix shows a table of true positives, false positives, true negatives, and false negatives, giving a deeper view of model performance.
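To see where these numbers come from, the sketch below computes the four confusion-matrix counts and the metrics by hand in plain Python. The labels are hypothetical values for eight students, made up for illustration; they are not output from the model above.

```python
# Hypothetical labels for eight students (1 = pass, 0 = fail); not model output
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted pass, did pass
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted pass, did fail
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted fail, did pass
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted fail, did fail

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

For a binary problem, scikit-learn's confusion_matrix arranges the same four counts with actual classes as rows and predicted classes as columns, so the printed array reads [[TN, FP], [FN, TP]].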
Try It Yourself: An Exercise
Here’s the full code to run:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Define student dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Math': [85, 92, 78, 95, 88, 65, 90, 82, 77, 94],
    'Science': [88, 85, 90, 92, 80, 70, 87, 83, 79, 91],
    'English': [90, 88, 85, 93, 87, 75, 89, 84, 80, 92],
    'Study_Hours': [5, 6, 4, 7, 5.5, 3, 6.5, 4.5, 4, 6],
    'Grade_Level': ['Freshman', 'Sophomore', 'Freshman', 'Sophomore', 'Freshman',
                    'Freshman', 'Sophomore', 'Sophomore', 'Freshman', 'Sophomore'],
    'Pass': [True, True, False, True, True, False, True, True, False, True]
}
df = pd.DataFrame(data)

# Preprocess: One-hot encoding
df_encoded = pd.get_dummies(df, columns=['Grade_Level'], drop_first=True)

# Separate features and target
X = df_encoded.drop(['Name', 'Pass'], axis=1)
y = df_encoded['Pass']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
confusion = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ")
print(confusion)
Task: Change test_size to 0.3 and rerun. How do accuracy, precision, recall, and f1_score change? Check the confusion_matrix: what does it tell you about model reliability? Try it and compare!
This exercise helps you see how data split affects performance, a key AI concept. With only 10 rows, results may fluctuate, but it’s great practice for understanding model evaluation.
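If you would rather compare the two splits side by side than edit and rerun by hand, one possible approach is to wrap the pipeline in a function. This is a sketch, not the only way to do it. The evaluate helper and the results dictionary are names introduced here. The max_iter=1000 and zero_division=0 arguments are additions of this sketch: the first gives the solver more iterations on unscaled features, and the second returns 0 instead of warning when a metric's denominator is zero, which can happen with very small test sets.

```python
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same student dataset as in the full listing
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve', 'Frank', 'Grace', 'Hank', 'Ivy', 'Jack'],
    'Math': [85, 92, 78, 95, 88, 65, 90, 82, 77, 94],
    'Science': [88, 85, 90, 92, 80, 70, 87, 83, 79, 91],
    'English': [90, 88, 85, 93, 87, 75, 89, 84, 80, 92],
    'Study_Hours': [5, 6, 4, 7, 5.5, 3, 6.5, 4.5, 4, 6],
    'Grade_Level': ['Freshman', 'Sophomore', 'Freshman', 'Sophomore', 'Freshman',
                    'Freshman', 'Sophomore', 'Sophomore', 'Freshman', 'Sophomore'],
    'Pass': [True, True, False, True, True, False, True, True, False, True],
}
df_encoded = pd.get_dummies(pd.DataFrame(data), columns=['Grade_Level'], drop_first=True)
X = df_encoded.drop(['Name', 'Pass'], axis=1)
y = df_encoded['Pass']

def evaluate(test_size):
    """Split, train, and score once for the given test_size (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        'test_rows': len(y_test),
        'accuracy': metrics.accuracy_score(y_test, y_pred),
        'precision': metrics.precision_score(y_test, y_pred, zero_division=0),
        'recall': metrics.recall_score(y_test, y_pred, zero_division=0),
        'f1': metrics.f1_score(y_test, y_pred, zero_division=0),
    }

# Compare the original split with the one from the task
results = {size: evaluate(size) for size in (0.2, 0.3)}
for size, scores in results.items():
    print(size, scores)
```

With only two or three test rows, a single misclassified student moves accuracy by a third or a half, so treat any difference between the two runs as a rough signal rather than a measurement.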
Next Steps
You’ve started machine learning with scikit-learn! Next, Part 7 could explore advanced algorithms like decision trees or random forests, or dive into unsupervised learning (e.g., clustering). Experiment with this code—add features or find a bigger dataset—to deepen your skills.
Conclusion
Scikit-learn opens the door to AI by turning data into predictions. You’ve learned to preprocess, train, and evaluate a model, bridging your Java roots to Python’s AI power. Keep practicing, and you’ll soon tackle more complex machine learning tasks. Happy coding!