Kickstart ML with Python snippets
Understanding the basics of ROC curves
True Positive Rate (TPR), also known as Sensitivity or Recall:
- TPR measures the proportion of actual positives that are correctly identified by the model.
- Formula: $$ \text{TPR} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$
False Positive Rate (FPR):
- FPR measures the proportion of actual negatives that are incorrectly identified as positives by the model.
- Formula: $$ \text{FPR} = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Negatives (TN)}} $$
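As a quick numeric check, both rates fall straight out of a confusion matrix. A minimal sketch using scikit-learn's confusion_matrix on hypothetical toy labels:
import numpy as np
from sklearn.metrics import confusion_matrix
# Hypothetical toy labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.75, FPR = 0.25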
Threshold:
- The classification threshold determines the point at which the model's predicted probability is converted into a binary decision (positive or negative).
- By varying this threshold, you can generate different TPR and FPR values, which are then plotted to form the ROC curve.
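A minimal sketch of the thresholding step itself, assuming a hypothetical array y_prob of predicted probabilities:
import numpy as np
# Hypothetical predicted probabilities from some classifier
y_prob = np.array([0.15, 0.40, 0.55, 0.70, 0.90])
for threshold in (0.3, 0.5, 0.7):
    # Probabilities at or above the threshold become positive predictions
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: predictions={y_pred}")
Lowering the threshold flags more cases as positive, which raises both TPR and FPR; raising it does the opposite.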
How to Construct an ROC Curve:
Calculate TPR and FPR:
- For a series of threshold values (typically from 0 to 1), calculate the TPR and FPR.
Plot the Points:
- Plot the TPR against the FPR for each threshold value.
Connect the Points:
- Connect these points to form the ROC curve.
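These three steps can also be done by hand without sklearn's roc_curve; a minimal sketch, assuming hypothetical y_true labels and y_prob scores:
import numpy as np
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])  # hypothetical scores
points = []
for threshold in np.linspace(0, 1, 11):  # step 1: sweep thresholds from 0 to 1
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # step 2: one (FPR, TPR) point per threshold
# step 3: sorting by FPR and connecting the points traces the ROC curve
for fpr, tpr in sorted(points):
    print(f"FPR={fpr:.2f}, TPR={tpr:.2f}")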
Interpreting the ROC Curve:
Diagonal Line (45-degree line):
- This represents a random classifier with no discrimination capability (TPR = FPR).
Above the Diagonal:
- Points above the diagonal indicate better-than-random performance.
Closer to the Top-Left Corner:
- The closer the ROC curve is to the top-left corner, the better the model's performance. This point represents high TPR and low FPR.
Area Under the ROC Curve (AUC-ROC):
- AUC (Area Under the Curve):
- The AUC value represents the overall ability of the model to discriminate between positive and negative classes.
- AUC ranges from 0 to 1:
  - 0.5: represents a random classifier.
  - 1.0: represents a perfect classifier.
  - >0.5: represents a model better than random.
  - <0.5: represents a model worse than random (its scores rank negatives above positives).
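In scikit-learn, the AUC can be computed directly from labels and scores with roc_auc_score; a minimal sketch on hypothetical toy arrays:
from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]           # hypothetical labels
y_prob = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
# AUC is the probability that a random positive is scored above a random negative
print(roc_auc_score(y_true, y_prob))  # 0.75 for this toy data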
Python Example:
Here’s how you can plot an ROC curve using the sklearn library:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
Interpreting the ROC Curve and AUC
ROC Curve Shape:
- The curve starts at (0, 0) and ends at (1, 1).
- A curve that bows towards the top-left corner indicates good performance. The model achieves high TPR while maintaining low FPR.
Diagonal Line (Random Classifier):
- The dashed red line from (0, 0) to (1, 1) represents a random classifier with no skill. Points above this line indicate better-than-random performance.
Area Under the Curve (AUC):
- The AUC provides a single number summarizing overall model performance; useful models typically score between 0.5 (random) and 1.0 (perfect). An AUC of 0.93, for example, indicates excellent discriminative ability.
Practical Tips
- Threshold Selection: Depending on the application, you might choose different thresholds to balance between TPR and FPR. For example, in medical diagnostics, you might prefer a higher TPR (sensitivity) even at the cost of a higher FPR; see the sketch after this list for one way to pick a threshold.
- Comparison of Models: ROC curves are useful for comparing the performance of different models. The model with the higher AUC is generally considered better.
- Understanding Trade-offs: The ROC curve helps visualize the trade-off between sensitivity (TPR) and specificity (1 - FPR). This is crucial in applications where the cost of false positives and false negatives differs.
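For the threshold-selection tip above, one common heuristic (one choice among several, not the only one) is Youden's J statistic, which picks the threshold maximizing TPR - FPR. A minimal sketch on hypothetical labels and scores:
from sklearn.metrics import roc_curve
import numpy as np
y_true = [0, 0, 1, 1, 0, 1]               # hypothetical labels
y_prob = [0.2, 0.6, 0.55, 0.8, 0.1, 0.9]  # hypothetical scores
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J statistic: pick the threshold where TPR - FPR is largest
best_idx = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best_idx]:.2f} "
      f"(TPR={tpr[best_idx]:.2f}, FPR={fpr[best_idx]:.2f})")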