Kickstart ML with Python snippets
Exploring Hierarchical Agglomerative Clustering (HAC) basics
Hierarchical Agglomerative Clustering (HAC) is a bottom-up hierarchical clustering algorithm that builds a hierarchy of clusters. It works iteratively: each data point starts in its own cluster, and the closest pairs of clusters are merged as one moves up the hierarchy.
Dendrogram:
- A tree-like diagram that records the sequences of merges or splits. It visually represents the hierarchy of clusters formed during the HAC process.
Linkage Criteria:
- Single Linkage: The distance between two clusters is defined as the shortest distance between points in the two clusters.
- Complete Linkage: The distance between two clusters is defined as the longest distance between points in the two clusters.
- Average Linkage: The distance between two clusters is defined as the average distance between all pairs of points in the two clusters.
- Ward’s Linkage: Minimizes the total within-cluster variance.
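The first three criteria are just different summaries of the pairwise distances between two clusters, which a small NumPy sketch makes concrete (the points here are made up for illustration; Ward’s linkage is different in kind, since it scores a merge by the increase in within-cluster variance rather than by a single pairwise distance):
import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up clusters of 2-D points
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 1.0]])

D = cdist(A, B)   # all pairwise distances between points of A and B
print(D.min())    # single linkage: shortest distance between the clusters
print(D.max())    # complete linkage: longest distance between the clusters
print(D.mean())   # average linkage: mean distance over all pairs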
Distance Metric:
- Common distance metrics include Euclidean, Manhattan, and Cosine distances; Euclidean is the most commonly used.
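In scipy, the metric is chosen where the distances are computed. The sketch below assumes X is an array with one sample per row (the synthetic X generated later in this post works); note that Ward’s method is only defined for Euclidean distance:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

d_euc = pdist(X, metric='euclidean')   # condensed pairwise distance vector
d_man = pdist(X, metric='cityblock')   # Manhattan distance
d_cos = pdist(X, metric='cosine')      # cosine distance

# Non-Ward linkage methods accept a metric argument directly
Z_avg = linkage(X, method='average', metric='cityblock')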
Steps in HAC Algorithm
Compute the Distance Matrix:
- Calculate the distance between every pair of data points.
Initialize Clusters:
- Start with each data point as its own cluster.
Merge Clusters:
- Iteratively merge the two closest clusters based on the chosen linkage criteria.
Repeat:
- Repeat the merging step until all points are merged into a single cluster or until a stopping criterion is met (e.g., a desired number of clusters).
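To internalize these steps before reaching for a library, here is a minimal from-scratch sketch using single linkage (the function is illustrative only and far slower than scipy’s implementation):
import numpy as np

def naive_single_linkage(X, n_clusters):
    # Step 1: compute the distance matrix for every pair of points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: initialize every point as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Steps 3-4: merge the two closest clusters until enough remain
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: shortest distance between any two members
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Example: three obvious groups on a line
X_demo = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
print(naive_single_linkage(X_demo, n_clusters=3))   # [[0, 1], [2, 3], [4]]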
Practical Example in Python
Let’s walk through an example using the scipy and seaborn libraries in Python.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
- Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
- Perform HAC:
# Compute the linkage matrix using Ward's method
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=45., leaf_font_size=15., show_contracted=True)
plt.title('Dendrogram')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.show()
- Form Flat Clusters:
# Determine the clusters
max_d = 7.5 # Maximum distance for cutting the dendrogram
clusters = fcluster(Z, max_d, criterion='distance')
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Agglomerative Clustering (HAC)')
plt.show()
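Since make_blobs also returned the ground-truth labels y, one optional sanity check is the adjusted Rand index, which scores how closely the recovered clusters match the true ones (1.0 is a perfect match):
from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score(y, clusters))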
Explanation
Data Generation:
- We use make_blobs to generate synthetic data with 4 centers (clusters) for visualizing and understanding HAC.
Linkage Matrix:
- We compute the linkage matrix with linkage(X, method='ward'). This matrix records which pair of clusters was merged at each step and at what distance (see the quick inspection snippet after this list).
Dendrogram:
- We plot the dendrogram with dendrogram(Z). This visualizes the hierarchy of clusters and helps in determining the number of clusters by inspecting the longest vertical lines that are not crossed by horizontal lines.
Forming Flat Clusters:
- We use fcluster(Z, max_d, criterion='distance') to cut the dendrogram at a specified distance (max_d) and form flat clusters.
- We visualize the clustered data by coloring each data point based on its cluster label.
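To see what the linkage matrix actually contains, you can print its first few rows; each row describes one merge:
# Columns: index of cluster 1, index of cluster 2, merge distance, size of new cluster
# Indices >= len(X) refer to clusters created by earlier merge steps
print(Z[:5])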
Practical Tips
Choosing the Linkage Method:
- Different linkage methods can produce different results. Experiment with single, complete, average, and Ward’s linkage to see which method works best for your data.
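One rough heuristic for comparing them (not a definitive test) is the cophenetic correlation coefficient, which measures how faithfully each hierarchy preserves the original pairwise distances:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

for method in ['single', 'complete', 'average', 'ward']:
    Z_m = linkage(X, method=method)
    c, _ = cophenet(Z_m, pdist(X))   # correlation between tree and raw distances
    print(method, round(c, 3))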
Determining the Number of Clusters:
- Use the dendrogram to choose the number of clusters: find the longest vertical lines that no horizontal merge line crosses, and cut the tree with a horizontal line through that region; the number of vertical lines the cut intersects is the number of clusters.
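If you already know how many clusters you want, fcluster can also cut the tree to a target count directly, instead of using a distance threshold:
# Form at most 4 flat clusters, matching the synthetic data above
clusters_k = fcluster(Z, t=4, criterion='maxclust')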
Scaling Features:
- Scale your features before clustering, especially if they have different units or scales. Use StandardScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Handling Large Datasets:
- HAC can be computationally expensive for large datasets, since it works from the full pairwise distance matrix: memory grows quadratically with the number of samples, and runtime grows faster still. Consider faster clustering algorithms like K-Means or MiniBatchKMeans for very large datasets.
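As a sketch of that fallback (the parameter values here are illustrative):
from sklearn.cluster import MiniBatchKMeans

# Processes the data in small batches, so it scales to very large sample counts
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, random_state=0)
labels = mbk.fit_predict(X)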