Kickstart ML with Python snippets
Getting started with classification (supervised learning) in Python
Classification is a supervised learning method used to predict a categorical outcome (often referred to as the target or class label) based on one or more predictor variables (features). The goal of classification is to assign new observations to one of the predefined classes.
Key Concepts in Classification
1. Types of Classification
- Binary Classification: The target variable has two possible classes.
  - Examples: Spam vs. Not Spam, Fraudulent vs. Non-Fraudulent.
- Multi-Class Classification: The target variable has more than two possible classes.
  - Examples: Handwritten digit recognition (0-9), Species classification in biology.
- Multi-Label Classification: Each instance may belong to multiple classes simultaneously.
  - Examples: Tagging a text document with multiple topics.
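To make the three settings concrete, here is a minimal sketch (toy arrays, purely illustrative) of what the target variable typically looks like in each case:

```python
import numpy as np

# Binary classification: one label per sample, two possible values
y_binary = np.array([0, 1, 1, 0, 1])              # e.g. 0 = not spam, 1 = spam

# Multi-class classification: one label per sample, more than two possible values
y_multiclass = np.array([3, 0, 7, 7, 2])          # e.g. handwritten digits 0-9

# Multi-label classification: one binary indicator per (sample, label) pair
y_multilabel = np.array([[1, 0, 1],               # sample 1: topics A and C
                         [0, 1, 0],               # sample 2: topic B only
                         [1, 1, 1]])              # sample 3: all three topics
```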
2. Model Representation
Classification models predict the probability of each class for a given input. The class with the highest probability is chosen as the predicted class.
Logistic Regression Equation (Binary Classification): $$ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$
- \( P(y=1|x) \): Probability that the output is class 1 given the input features \( x \).
- \( x_1, x_2, \ldots, x_n \): Input features.
- \( \beta_0, \beta_1, \ldots, \beta_n \): Model parameters.
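To see the equation in action, here is a minimal sketch that evaluates \( P(y=1|x) \) for a single observation; the coefficients and feature values below are made up purely for illustration, not fitted to any data:

```python
import numpy as np

# Hypothetical model parameters (illustrative values only)
beta_0 = -1.5                       # intercept
betas = np.array([0.8, 2.0])        # beta_1, beta_2

# One observation with two features (also made up)
x = np.array([1.2, 0.4])

# Linear combination: beta_0 + beta_1*x_1 + beta_2*x_2
z = beta_0 + np.dot(betas, x)

# Logistic (sigmoid) function gives the probability of class 1
p_class_1 = 1 / (1 + np.exp(-z))
print(f"P(y=1|x) = {p_class_1:.3f}")   # predict class 1 if this exceeds 0.5
```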
3. Key Metrics
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.
- F1 Score: The harmonic mean of precision and recall.
- Confusion Matrix: A table used to describe the performance of a classification model by comparing actual vs. predicted values.
- ROC Curve and AUC: Receiver Operating Characteristic curve and Area Under the Curve measure the performance of binary classifiers.
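To show how the first four metrics follow from the confusion matrix, the sketch below computes them from made-up counts of true/false positives and negatives (the numbers are purely illustrative):

```python
# Hypothetical confusion-matrix counts for a binary problem (illustrative only)
tp, fp, fn, tn = 80, 10, 20, 90

accuracy  = (tp + tn) / (tp + fp + fn + tn)                  # correct predictions / all predictions
precision = tp / (tp + fp)                                   # of predicted positives, how many are truly positive
recall    = tp / (tp + fn)                                   # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
```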
4. Common Algorithms
- Logistic Regression: A statistical model for binary classification.
- Decision Trees: A tree-like model of decisions used for classification tasks.
- Random Forest: An ensemble method that uses multiple decision trees.
- Support Vector Machines (SVM): A model that finds the hyperplane that best separates the classes.
- K-Nearest Neighbors (KNN): A model that classifies based on the majority class among the k-nearest neighbors.
- Naive Bayes: A probabilistic classifier based on Bayes' theorem with the assumption of independence among features.
- Neural Networks: Models inspired by the human brain, used for complex pattern recognition.
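A practical note: scikit-learn offers implementations of all of these algorithms behind the same fit/predict interface, so switching models is usually a one-line change. The sketch below is a rough comparison of a few of them on a synthetic dataset, using default hyperparameters only as a starting point:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic data and a train/test split
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Every model below follows the same fit/predict interface
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.2f}")
```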
Practical Example of Logistic Regression in Python
Here’s how you can implement a simple binary classification model using Python and the `scikit-learn` library:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
# ROC Curve and AUC
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.2f}")
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
```
- Data Preparation:
  - We generate a synthetic binary classification dataset using `make_classification`.
  - We split the dataset into training and testing sets using `train_test_split`.
- Model Fitting:
  - We create an instance of the `LogisticRegression` model and fit it to the training data.
- Making Predictions:
  - We use the `predict` method to make predictions on the test data.
- Model Evaluation:
  - We calculate the accuracy score.
  - We generate a confusion matrix to understand the model's performance.
  - We generate a classification report to get precision, recall, and F1 score.
  - We plot the ROC curve and calculate the AUC to evaluate the model's performance.
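As a small follow-up, the fitted model from the example can also be used to score new observations; a minimal sketch (the feature values below are made up, since the synthetic features have no real-world meaning):

```python
# Score a hypothetical new observation (feature values are made up for illustration)
x_new = np.array([[0.5, -1.2]])                       # one sample, same two features as the training data
print("Predicted class:", model.predict(x_new)[0])
print("P(y=1 | x_new): ", model.predict_proba(x_new)[0, 1])
```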