Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 

README.md

Getting Started with scikit-learn

What is scikit-learn?

Scikit-learn (sklearn) is a free Python library that helps you analyze data and make predictions. It’s like a toolbox for "machine learning," which means teaching a computer to learn patterns from data and make decisions.

In this tutorial, we’ll:

  1. Install scikit-learn.
  2. Load some simple data.
  3. Use a machine learning model to predict something.
  4. Explain each step in plain language!

Step 1: Set Up Your Tools

Before we start, you need Python and scikit-learn installed on your computer.

  1. Install Python: If you don’t have Python, download it from python.org. Use version 3.8 or higher.
  2. Install scikit-learn: Open your computer’s command line (Terminal on Mac or Command Prompt on Windows) and type:
    pip install scikit-learn
    
    Press Enter, and it’ll download scikit-learn.
  3. Install a code editor: Use something simple like Jupyter Notebook or VS Code. For this tutorial, I’ll assume you’re using a basic Python file (e.g., mycode.py).

Step 2: Write Your First Python Code

Open your code editor and create a new file. Let’s start by bringing in the tools we need.

# Import the tools we’ll use from scikit-learn
from sklearn.linear_model import LogisticRegression  # This is our prediction tool
import numpy as np  # Helps us work with numbers easily
  • What’s happening here?
    • LogisticRegression is a tool that guesses "yes or no" answers (like pass or fail).
    • numpy (np) helps us handle lists of numbers.

Step 3: Create Some Simple Data

Imagine we have data about students: how many hours they studied and whether they passed (1 = pass, 0 = fail). Let’s make up some numbers.

# Hours studied (our "input" data)
hours_studied = np.array([[1], [2], [3], [4], [5]])

# Pass (1) or Fail (0) (our "output" data)
pass_or_fail = np.array([0, 0, 1, 1, 1])
  • What’s this?
    • hours_studied is a list of numbers: 1 hour, 2 hours, etc.
    • pass_or_fail tells us the result: 0 means fail, 1 means pass.
    • For example, a student who studied 1 hour failed (0), but one who studied 5 hours passed (1).

Step 4: Build a Prediction Model

Now, let’s teach the computer to predict if a student passes based on study hours.

# Create the prediction tool (model)
model = LogisticRegression()

# Teach the model using our data
model.fit(hours_studied, pass_or_fail)
  • What’s happening?
    • model = LogisticRegression() is like hiring a tutor to learn from your data.
    • model.fit() is like the tutor studying the hours and pass/fail results to find a pattern.

Step 5: Make Predictions

Let’s ask the model: "If a student studies 3.5 hours, will they pass?"

# Test with 3.5 hours
new_hours = np.array([[3.5]])

# Predict: 0 (fail) or 1 (pass)
prediction = model.predict(new_hours)

# Show the result
print("Prediction for 3.5 hours:", "Pass" if prediction[0] == 1 else "Fail")
  • What’s this?
    • new_hours is the new data we’re testing (3.5 hours).
    • model.predict() guesses if the student passes or fails.
    • The print line shows the answer in plain words.

When you run this, it might say something like:

Prediction for 3.5 hours: Pass

Step 6: Check How Good the Model Is

Let’s see how well our model learned by testing it on the original data.

# Check predictions for all our original hours
predictions = model.predict(hours_studied)

# Show the results
print("Hours Studied:", hours_studied.flatten())  # Flatten makes it a simple list
print("Actual Results:", pass_or_fail)
print("Predicted Results:", predictions)
  • Output might look like:
Hours Studied: [1 2 3 4 5]
Actual Results: [0 0 1 1 1]
Predicted Results: [0 0 1 1 1]
  • What’s this?
    • It shows what the model guessed for 1, 2, 3, 4, and 5 hours.
    • Compare “Predicted” to “Actual” to see if it’s right!

Full Code Example

Here’s everything put together. Copy this into a file (e.g., learn_sklearn.py) and run it with Python:

# Import tools
from sklearn.linear_model import LogisticRegression
import numpy as np

# Our data
hours_studied = np.array([[1], [2], [3], [4], [5]])
pass_or_fail = np.array([0, 0, 1, 1, 1])

# Create and teach the model
model = LogisticRegression()
model.fit(hours_studied, pass_or_fail)

# Predict for 3.5 hours
new_hours = np.array([[3.5]])
prediction = model.predict(new_hours)
print("Prediction for 3.5 hours:", "Pass" if prediction[0] == 1 else "Fail")

# Check all predictions
predictions = model.predict(hours_studied)
print("Hours Studied:", hours_studied.flatten())
print("Actual Results:", pass_or_fail)
print("Predicted Results:", predictions)

What Did We Learn?

  1. Data: We used hours studied to predict passing or failing.
  2. Model: LogisticRegression learned the pattern (more hours = more likely to pass).
  3. Prediction: We guessed results for new data and checked how good our guesses were.

Try It Yourself!

  • Change the hours_studied and pass_or_fail numbers to something else (e.g., [1, 3, 5, 7] and [0, 1, 1, 1]).
  • Predict for a different number of hours, like 6 or 2.5.
  • See what happens!