Kickstart ML with Python snippets
Understanding the basics of ROC curves
True Positive Rate (TPR), also known as Sensitivity or Recall:
- TPR measures the proportion of actual positives that are correctly identified by the model.
- Formula: $$ \text{TPR} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$
False Positive Rate (FPR):
- FPR measures the proportion of actual negatives that are incorrectly identified as positives by the model.
- Formula: $$ \text{FPR} = \frac{\text{False Positives (FP)}}{\text{False Positives (FP)} + \text{True Negatives (TN)}} $$
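As a quick numeric check, both rates fall straight out of a confusion matrix. A minimal sketch using scikit-learn's confusion_matrix on hypothetical toy labels:
import numpy as np
from sklearn.metrics import confusion_matrix
# Hypothetical toy labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.75, FPR = 0.25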
Threshold:
- The classification threshold determines the point at which the model's predicted probability is converted into a binary decision (positive or negative).
- By varying this threshold, you can generate different TPR and FPR values, which are then plotted to form the ROC curve.
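A minimal sketch of the thresholding step itself, assuming a hypothetical array y_prob of predicted probabilities:
import numpy as np
# Hypothetical predicted probabilities from some classifier
y_prob = np.array([0.15, 0.40, 0.55, 0.70, 0.90])
for threshold in (0.3, 0.5, 0.7):
    # Probabilities at or above the threshold become positive predictions
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: predictions={y_pred}")
Lowering the threshold flags more cases as positive, which raises both TPR and FPR; raising it does the opposite.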
How to Construct an ROC Curve:
Calculate TPR and FPR:
- For a series of threshold values (typically from 0 to 1), calculate the TPR and FPR.
Plot the Points:
- Plot the TPR against the FPR for each threshold value.
Connect the Points:
- Connect these points to form the ROC curve.
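These three steps can also be done by hand without sklearn's roc_curve; a minimal sketch, assuming hypothetical y_true labels and y_prob scores:
import numpy as np
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # hypothetical labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])  # hypothetical scores
points = []
for threshold in np.linspace(0, 1, 11):  # step 1: sweep thresholds from 0 to 1
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))  # step 2: one (FPR, TPR) point per threshold
# step 3: sorting by FPR and connecting the points traces the ROC curve
for fpr, tpr in sorted(points):
    print(f"FPR={fpr:.2f}, TPR={tpr:.2f}")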
Interpreting the ROC Curve:
Diagonal Line (45-degree line):
- This represents a random classifier with no discrimination capability (TPR = FPR).
Above the Diagonal:
- Points above the diagonal indicate better-than-random performance.
Closer to the Top-Left Corner:
- The closer the ROC curve is to the top-left corner, the better the model's performance. This point represents high TPR and low FPR.
Area Under the ROC Curve (AUC-ROC):
- AUC (Area Under the Curve):
- The AUC value represents the overall ability of the model to discriminate between positive and negative classes.
- AUC ranges from 0 to 1:
  - 0.5: represents a random classifier.
  - 1.0: represents a perfect classifier.
  - >0.5: represents a model better than random.
  - <0.5: represents a model worse than random (its scores rank negatives above positives).
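In scikit-learn, the AUC can be computed directly from labels and scores with roc_auc_score; a minimal sketch on hypothetical toy arrays:
from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]           # hypothetical labels
y_prob = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities
# AUC is the probability that a random positive is scored above a random negative
print(roc_auc_score(y_true, y_prob))  # 0.75 for this toy data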
Python Example:
Here’s how you can plot an ROC curve using the sklearn library:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create a logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Predict probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
Interpreting the ROC Curve and AUC
ROC Curve Shape:
- The curve starts at (0, 0) and ends at (1, 1).
- A curve that bows towards the top-left corner indicates good performance. The model achieves high TPR while maintaining low FPR.
Diagonal Line (Random Classifier):
- The dashed red line from (0, 0) to (1, 1) represents a random classifier with no skill. Points above this line indicate better-than-random performance.
Area Under the Curve (AUC):
- The AUC provides a single number summarizing overall model performance; useful models typically score between 0.5 (random) and 1.0 (perfect). An AUC of 0.93, for example, indicates excellent discriminative ability.
Practical Tips
- Threshold Selection: Depending on the application, you might choose different thresholds to balance between TPR and FPR. For example, in medical diagnostics, you might prefer a higher TPR (sensitivity) even at the cost of a higher FPR; see the sketch after this list for one way to pick a threshold.
- Comparison of Models: ROC curves are useful for comparing the performance of different models. The model with the higher AUC is generally considered better.
- Understanding Trade-offs: The ROC curve helps visualize the trade-off between sensitivity (TPR) and specificity (1 - FPR). This is crucial in applications where the cost of false positives and false negatives differs.
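For the threshold-selection tip above, one common heuristic (one choice among several, not the only one) is Youden's J statistic, which picks the threshold maximizing TPR - FPR. A minimal sketch on hypothetical labels and scores:
from sklearn.metrics import roc_curve
import numpy as np
y_true = [0, 0, 1, 1, 0, 1]               # hypothetical labels
y_prob = [0.2, 0.6, 0.55, 0.8, 0.1, 0.9]  # hypothetical scores
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J statistic: pick the threshold where TPR - FPR is largest
best_idx = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best_idx]:.2f} "
      f"(TPR={tpr[best_idx]:.2f}, FPR={fpr[best_idx]:.2f})")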