Scikit-learn (sklearn) is a free Python library that helps you analyze data and make predictions. It’s like a toolbox for "machine learning," which means teaching a computer to learn patterns from data and make decisions.
In this tutorial, we’ll:
- Install scikit-learn.
- Load some simple data.
- Use a machine learning model to predict something.
- Explain each step in plain language!
Before we start, you need Python and scikit-learn installed on your computer.
- Install Python: If you don’t have Python, download it from python.org. Use version 3.8 or higher.
- Install scikit-learn: Open your computer’s command line (Terminal on Mac or Command Prompt on Windows) and type:
Press Enter, and it’ll download scikit-learn.
pip install scikit-learn - Install a code editor: Use something simple like Jupyter Notebook or VS Code. For this tutorial, I’ll assume you’re using a basic Python file (e.g.,
mycode.py).
Open your code editor and create a new file. Let’s start by bringing in the tools we need.
# Import the tools we’ll use from scikit-learn
from sklearn.linear_model import LogisticRegression # This is our prediction tool
import numpy as np # Helps us work with numbers easily- What’s happening here?
LogisticRegressionis a tool that guesses "yes or no" answers (like pass or fail).numpy(np) helps us handle lists of numbers.
Imagine we have data about students: how many hours they studied and whether they passed (1 = pass, 0 = fail). Let’s make up some numbers.
# Hours studied (our "input" data)
hours_studied = np.array([[1], [2], [3], [4], [5]])
# Pass (1) or Fail (0) (our "output" data)
pass_or_fail = np.array([0, 0, 1, 1, 1])- What’s this?
hours_studiedis a list of numbers: 1 hour, 2 hours, etc.pass_or_failtells us the result: 0 means fail, 1 means pass.- For example, a student who studied 1 hour failed (0), but one who studied 5 hours passed (1).
Now, let’s teach the computer to predict if a student passes based on study hours.
# Create the prediction tool (model)
model = LogisticRegression()
# Teach the model using our data
model.fit(hours_studied, pass_or_fail)- What’s happening?
model = LogisticRegression()is like hiring a tutor to learn from your data.model.fit()is like the tutor studying the hours and pass/fail results to find a pattern.
Let’s ask the model: "If a student studies 3.5 hours, will they pass?"
# Test with 3.5 hours
new_hours = np.array([[3.5]])
# Predict: 0 (fail) or 1 (pass)
prediction = model.predict(new_hours)
# Show the result
print("Prediction for 3.5 hours:", "Pass" if prediction[0] == 1 else "Fail")- What’s this?
new_hoursis the new data we’re testing (3.5 hours).model.predict()guesses if the student passes or fails.- The
printline shows the answer in plain words.
When you run this, it might say something like:
Prediction for 3.5 hours: Pass
Let’s see how well our model learned by testing it on the original data.
# Check predictions for all our original hours
predictions = model.predict(hours_studied)
# Show the results
print("Hours Studied:", hours_studied.flatten()) # Flatten makes it a simple list
print("Actual Results:", pass_or_fail)
print("Predicted Results:", predictions)- Output might look like:
Hours Studied: [1 2 3 4 5]
Actual Results: [0 0 1 1 1]
Predicted Results: [0 0 1 1 1]
- What’s this?
- It shows what the model guessed for 1, 2, 3, 4, and 5 hours.
- Compare “Predicted” to “Actual” to see if it’s right!
Here’s everything put together. Copy this into a file (e.g., learn_sklearn.py) and run it with Python:
# Import tools
from sklearn.linear_model import LogisticRegression
import numpy as np
# Our data
hours_studied = np.array([[1], [2], [3], [4], [5]])
pass_or_fail = np.array([0, 0, 1, 1, 1])
# Create and teach the model
model = LogisticRegression()
model.fit(hours_studied, pass_or_fail)
# Predict for 3.5 hours
new_hours = np.array([[3.5]])
prediction = model.predict(new_hours)
print("Prediction for 3.5 hours:", "Pass" if prediction[0] == 1 else "Fail")
# Check all predictions
predictions = model.predict(hours_studied)
print("Hours Studied:", hours_studied.flatten())
print("Actual Results:", pass_or_fail)
print("Predicted Results:", predictions)- Data: We used hours studied to predict passing or failing.
- Model: LogisticRegression learned the pattern (more hours = more likely to pass).
- Prediction: We guessed results for new data and checked how good our guesses were.
- Change the
hours_studiedandpass_or_failnumbers to something else (e.g.,[1, 3, 5, 7]and[0, 1, 1, 1]). - Predict for a different number of hours, like 6 or 2.5.
- See what happens!