Kickstart ML with Python snippets
Exploring k-medoids (PAM) algorithm basic concepts
K-Medoids (Partitioning Around Medoids) is a clustering algorithm similar to K-Means but instead of using the mean of the points in a cluster (centroid), it uses the most centrally located data point (medoid) to represent the cluster. This makes it more robust to outliers and noise.
-
Medoid:
- The medoid is the most centrally located point in a cluster. It minimizes the sum of dissimilarities between itself and the other points in the cluster.
-
Dissimilarity Measure:
- Commonly used dissimilarity measures include Euclidean distance, Manhattan distance, etc.
-
Algorithm Steps:
- Initialization: Randomly select k points as the initial medoids.
- Assignment: Assign each data point to the nearest medoid.
- Update: For each cluster, select the point that minimizes the sum of dissimilarities within the cluster as the new medoid.
- Repeat: Repeat the assignment and update steps until the medoids do not change.
Steps in the PAM Algorithm
-
Initialization:
- Randomly select k data points as initial medoids.
-
Assignment:
- Assign each data point to the nearest medoid based on a dissimilarity measure.
-
Update:
- For each cluster, compute the total dissimilarity between each point in the cluster and all other points in the cluster.
- Select the point that minimizes this total dissimilarity as the new medoid.
-
Repeat:
- Repeat the assignment and update steps until there is no change in the medoids.
Practical Example in Python
Let's implement the k-medoids algorithm using the scikit-learn-extra
library,
which provides an implementation of k-medoids.
Step-by-Step Example
- Install the
scikit-learn-extra
library:
pip install scikit-learn-extra
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
- Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
- Apply K-Medoids:
# Apply K-Medoids
kmedoids = KMedoids(n_clusters=4, random_state=0)
kmedoids.fit(X)
# Get the cluster labels
labels = kmedoids.labels_
# Get the medoids
medoids = kmedoids.cluster_centers_
# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Medoids Clustering')
plt.show()
-
Data Generation:
- We use
make_blobs
to generate synthetic data with 4 centers (clusters) for visualization and understanding of K-Medoids.
- We use
-
Model Fitting:
- We initialize the K-Medoids model with
n_clusters=4
, meaning we want to partition the data into 4 clusters. - We fit the model to the data using
kmedoids.fit(X)
, which performs the K-Medoids algorithm and returns cluster labels for each data point.
- We initialize the K-Medoids model with
-
Cluster Labels and Medoids:
kmedoids.labels_
gives the cluster label for each data point.kmedoids.cluster_centers_
gives the coordinates of the medoids of the clusters.
-
Visualization:
- We plot the data points, coloring them by their cluster labels.
- The medoids are plotted as red ‘x’ marks, showing the central points of each cluster.
Practical Tips
-
Choosing the Number of Clusters (k):
- Use methods like the Elbow Method or Silhouette Score to determine the optimal number of clusters.
from sklearn.metrics import silhouette_score silhouette_avg = silhouette_score(X, labels)print(f'Silhouette Score:{silhouette_avg}')
-
Scaling Features:
- Scale your features before applying K-Medoids to ensure that all features contribute equally to the distance metric.
from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_scaled = scaler.fit_transform(X)
-
Initialization Sensitivity:
- K-Medoids can be sensitive to initial medoid selection. Using multiple initializations and selecting the best result can improve performance.
kmedoids = KMedoids(n_clusters=4, random_state=0, init='k-medoids++')
-
Handling Large Datasets:
- For very large datasets, consider using sampling techniques to reduce the size of the data before applying K-Medoids.