Kickstart ML with Python snippets
Evaluating clustering models
The Rand Index and Adjusted Rand Index are metrics used to evaluate the quality of clustering results by measuring the similarity between the clusters produced by a clustering algorithm and a ground-truth clustering.
Rand Index (RI)
The Rand Index (RI) measures the similarity between two data clusterings by considering all pairs of samples and counting pairs that are either assigned to the same cluster or to different clusters in both the predicted and ground truth clusterings.
How Rand Index Works:
- Count Pairs:
- a: The number of pairs of elements that are in the same cluster in both the predicted clustering and the ground truth clustering.
- b: The number of pairs of elements that are in different clusters in both the predicted clustering and the ground truth clustering.
- c: The number of pairs of elements that are in the same cluster in the predicted clustering but in different clusters in the ground truth clustering.
- d: The number of pairs of elements that are in different clusters in the predicted clustering but in the same cluster in the ground truth clustering.
- Rand Index Formula: $$ \text{RI} = \frac{a + b}{a + b + c + d} $$
The Rand Index ranges from 0 to 1, where:
- 0: No agreement between the clusterings.
- 1: Perfect agreement between the clusterings.
However, the Rand Index does not account for the possibility of the agreement occurring by chance.
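To make the pair counting concrete, here is a minimal from-scratch sketch of the Rand Index (the function name rand_index is ours, not from any library; sklearn's built-in version appears in the Example section below):
from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Tally the four pair categories over all unordered pairs of samples.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1  # same cluster in both clusterings
        elif not same_true and not same_pred:
            b += 1  # different clusters in both clusterings
        elif same_pred:
            c += 1  # same in predicted, different in ground truth
        else:
            d += 1  # different in predicted, same in ground truth
    return (a + b) / (a + b + c + d)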
Adjusted Rand Index (ARI)
The Adjusted Rand Index (ARI) adjusts the Rand Index for the chance grouping of elements, providing a more accurate measure of clustering quality. ARI corrects the RI to ensure that a random clustering will have an ARI close to 0, while a perfect match will have an ARI of 1.
How Adjusted Rand Index Works:
- Define Contingency Table:
- Create a contingency table where rows represent the clusters in the ground truth, and columns represent the clusters in the predicted clustering.
- Calculate Index Components:
- a: The number of pairs of elements that are in the same cluster in both the predicted and the ground truth clusterings.
- b: The number of pairs of elements that are in different clusters in both the predicted and the ground truth clusterings.
- Expected Index:
- Calculate the expected index \( E(\text{RI}) \) by considering the number of elements in each cluster and the total number of possible pairs.
- Adjusted Rand Index Formula: $$ \text{ARI} = \frac{\text{RI} - E(\text{RI})}{\max(\text{RI}) - E(\text{RI})} $$
Where:
- \( \text{RI} \): Rand Index.
- \( E(\text{RI}) \): Expected Rand Index, which accounts for chance.
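In terms of the contingency table, with cell counts \( n_{ij} \), row sums \( a_i \), column sums \( b_j \), and \( n \) samples, this expands to the standard form:
$$ \text{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}} $$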
The ARI ranges from -1 to 1, where:
- 1: Perfect agreement between the clusterings.
- 0: Agreement is no better than random chance.
- Negative values: Indicate that the clustering is worse than random chance.
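Putting those pieces together, here is a minimal from-scratch sketch of the ARI computation (the function name adjusted_rand_index is ours; it mirrors the expanded formula above):
from math import comb

import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    # Contingency table: rows = ground-truth clusters, columns = predicted clusters.
    true_ids = sorted(set(labels_true))
    pred_ids = sorted(set(labels_pred))
    table = np.zeros((len(true_ids), len(pred_ids)), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        table[true_ids.index(t), pred_ids.index(p)] += 1

    n = len(labels_true)
    # "Index": pairs placed in the same cluster by both clusterings.
    index = sum(comb(int(n_ij), 2) for n_ij in table.flat)
    # Row and column pair counts feed the expected and maximum index.
    row_pairs = sum(comb(int(r), 2) for r in table.sum(axis=1))
    col_pairs = sum(comb(int(c), 2) for c in table.sum(axis=0))
    expected = row_pairs * col_pairs / comb(n, 2)
    max_index = (row_pairs + col_pairs) / 2
    # Note: degenerate inputs (everything in one cluster) would divide by zero here;
    # sklearn's implementation handles that edge case separately.
    return (index - expected) / (max_index - expected)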
Example
Let’s illustrate the Rand Index and Adjusted Rand Index with an example:
- Ground Truth Clustering: [1, 1, 0, 0, 1, 0, 0, 1]
- Predicted Clustering: [1, 0, 0, 0, 1, 1, 0, 1]
We need to calculate the number of pairs:
- a: Pairs that are in the same cluster in both clusterings.
- b: Pairs that are in different clusters in both clusterings.
- c: Pairs that are in the same cluster in the predicted clustering but different clusters in the ground truth.
- d: Pairs that are in different clusters in the predicted clustering but the same cluster in the ground truth.
For simplicity, we will use the sklearn library to calculate these metrics:
from sklearn.metrics import rand_score, adjusted_rand_score
# Ground truth and predicted clusterings
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
# Calculate Rand Index
ri = rand_score(y_true, y_pred)
print("Rand Index:", ri)
# Calculate Adjusted Rand Index
ari = adjusted_rand_score(y_true, y_pred)
print("Adjusted Rand Index:", ari)
Summary of Metrics
- Rand Index (RI):
- Measures the agreement between two clusterings by considering all pairs of samples.
- Ranges from 0 (no agreement) to 1 (perfect agreement).
- Adjusted Rand Index (ARI):
- Adjusts the Rand Index for chance, providing a more accurate measure of clustering quality.
- Ranges from -1 (worse than random) to 1 (perfect agreement), with 0 indicating random clustering.
Receiver Operating Characteristic (ROC) curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the performance of a binary classification model. It illustrates the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at various threshold settings.
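As a quick preview, here is a minimal sketch of computing an ROC curve with scikit-learn (the synthetic dataset and logistic-regression model are illustrative choices, not part of any fixed recipe):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary classification problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
# Predicted probabilities for the positive class drive the threshold sweep
y_scores = model.predict_proba(X_test)[:, 1]

# FPR and TPR at each candidate threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))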