Kickstart ML with Python snippets

Exploring k-medoids (PAM) algorithm basic concepts

K-Medoids (Partitioning Around Medoids) is a clustering algorithm similar to K-Means but instead of using the mean of the points in a cluster (centroid), it uses the most centrally located data point (medoid) to represent the cluster. This makes it more robust to outliers and noise.

Medoid:
- The medoid is the most centrally located point in a cluster. It minimizes the sum of dissimilarities between itself and the other points in the cluster.
Dissimilarity Measure:
- Commonly used dissimilarity measures include Euclidean distance, Manhattan distance, etc.
Algorithm Steps:
- Initialization: Randomly select k points as the initial medoids.
- Assignment: Assign each data point to the nearest medoid.
- Update: For each cluster, select the point that minimizes the sum of dissimilarities within the cluster as the new medoid.
- Repeat: Repeat the assignment and update steps until the medoids do not change.

Steps in the PAM Algorithm

Initialization:
- Randomly select k data points as initial medoids.
Assignment:
- Assign each data point to the nearest medoid based on a dissimilarity measure.
Update:
- For each cluster, compute the total dissimilarity between each point in the cluster and all other points in the cluster.
- Select the point that minimizes this total dissimilarity as the new medoid.
Repeat:
- Repeat the assignment and update steps until there is no change in the medoids.

Practical Example in Python

Let's implement the k-medoids algorithm using the scikit-learn-extra library, which provides an implementation of k-medoids.

Step-by-Step Example

Install the scikit-learn-extra library:

pip install scikit-learn-extra

Import Libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids
import seaborn as sns

# Optional for better plot aesthetics
sns.set(style="whitegrid")

Generate Synthetic Data:

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()

Apply K-Medoids:

# Apply K-Medoids
kmedoids = KMedoids(n_clusters=4, random_state=0)
kmedoids.fit(X)

# Get the cluster labels
labels = kmedoids.labels_

# Get the medoids
medoids = kmedoids.cluster_centers_

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Medoids Clustering')
plt.show()

Data Generation:
- We use make_blobs to generate synthetic data with 4 centers (clusters) for visualization and understanding of K-Medoids.
Model Fitting:
- We initialize the K-Medoids model with n_clusters=4, meaning we want to partition the data into 4 clusters.
- We fit the model to the data using kmedoids.fit(X), which performs the K-Medoids algorithm and returns cluster labels for each data point.
Cluster Labels and Medoids:
- kmedoids.labels_ gives the cluster label for each data point.
- kmedoids.cluster_centers_ gives the coordinates of the medoids of the clusters.
Visualization:
- We plot the data points, coloring them by their cluster labels.
- The medoids are plotted as red ‘x’ marks, showing the central points of each cluster.

Practical Tips

Choosing the Number of Clusters (k):
- Use methods like the Elbow Method or Silhouette Score to determine the optimal number of clusters.
```
from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(X, labels)print(f'Silhouette Score:{silhouette_avg}')
```
Scaling Features:
- Scale your features before applying K-Medoids to ensure that all features contribute equally to the distance metric.
```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
Initialization Sensitivity:
- K-Medoids can be sensitive to initial medoid selection. Using multiple initializations and selecting the best result can improve performance.
```
kmedoids = KMedoids(n_clusters=4, random_state=0, init='k-medoids++')
```
Handling Large Datasets:
- For very large datasets, consider using sampling techniques to reduce the size of the data before applying K-Medoids.

Back to Kickstart ML with Python cookbook page