Kickstart ML with Python snippets
Exploring K-means algorithm basic concepts
K-Means Clustering is a type of unsupervised learning used to partition data into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean, which serves as the prototype of the cluster.
Centroids:
- The center of a cluster.
- Each cluster is represented by its centroid, which is the mean of all data points in the cluster.

Inertia:
- Also known as the within-cluster sum of squares.
- Measures the compactness of the clusters, calculated as the sum of squared distances between each point and its centroid.

K (Number of Clusters):
- The number of clusters to partition the data into.
- Needs to be specified before running the algorithm.
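To make centroids and inertia concrete, here is a small NumPy sketch; the three points are made-up values for illustration only:

```python
import numpy as np

# A hypothetical cluster of three 2-D points (illustrative values only)
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# Centroid: the per-feature mean of the points in the cluster
centroid = cluster_points.mean(axis=0)

# Inertia contribution of this cluster: sum of squared distances
# from each point to the centroid
inertia = np.sum((cluster_points - centroid) ** 2)

print(centroid)  # [3. 4.]
print(inertia)   # 16.0
```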
Steps in the K-Means Algorithm
Initialization:
- Select K initial centroids randomly from the dataset.

Assignment:
- Assign each data point to the nearest centroid, forming K clusters.

Update:
- Calculate the new centroids as the mean of all data points in each cluster.

Repeat:
- Repeat the assignment and update steps until the centroids no longer change (convergence) or a maximum number of iterations is reached.
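The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the name kmeans_sketch and the toy points are made up for the example, and no empty-cluster handling is included.

```python
import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids no longer change (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs: the algorithm should recover one cluster per blob
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans_sketch(X_demo, k=2)
```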
Practical Example in Python
Let’s walk through an example using the sklearn library in Python.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
- Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
- Apply K-Means Clustering:
# Set the number of clusters
k = 4

# Fit the K-Means model
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

# Get the cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Print the cluster centers
print(f"Cluster Centers:\n{centers}")
- Visualize the Clusters:
# Visualize the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()
- Evaluate the Model:
# Calculate inertia (sum of squared distances to the nearest centroid)
inertia = kmeans.inertia_
print(f"Inertia: {inertia}")
Data Generation:
- We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how K-Means works.
Model Fitting:
- We initialize the K-Means model with k=4, meaning we want to partition the data into 4 clusters.
- We fit the model to the data using kmeans.fit(X), which runs the K-Means algorithm.
Cluster Centers and Labels:
- kmeans.cluster_centers_ gives the coordinates of the centroids of the clusters.
- kmeans.labels_ gives the cluster label for each data point.
Visualization:
- We plot the data points, coloring them by their cluster label.
- The centroids are plotted as red ‘x’ marks, showing the center of each cluster.
Inertia:
- Inertia is calculated to measure how well the clusters have been formed. Lower inertia indicates more compact clusters.
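As a sanity check, the inertia reported by scikit-learn can be reproduced by summing squared distances by hand. This sketch reuses the same make_blobs parameters as the example above; n_init and random_state are pinned only to make the run reproducible:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the example above
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Recompute inertia manually: sum of squared distances from each point
# to the centroid of its assigned cluster
manual_inertia = sum(
    np.sum((X[kmeans.labels_ == j] - center) ** 2)
    for j, center in enumerate(kmeans.cluster_centers_)
)

print(np.isclose(manual_inertia, kmeans.inertia_))  # True
```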
Practical Tips and Tricks
Choosing K:
- Use the Elbow Method to determine the optimal number of clusters. Plot inertia for different values of K and look for an "elbow" point where the rate of decrease slows down.
# Elbow Method
inertia_values = []
k_values = range(1, 10)
for k in k_values:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

plt.plot(k_values, inertia_values, 'bx-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
Scaling Features:
- Scale your features before applying K-Means, especially if they have different units or scales. Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Initialization:
- K-Means is sensitive to the choice of initial centroids. Use the k-means++ initialization (the default in scikit-learn) to improve convergence.
kmeans = KMeans(n_clusters=4, init='k-means++')
Handling Large Datasets:
- For large datasets, consider using MiniBatchKMeans, which reduces computational cost by using mini-batches.
from sklearn.cluster import MiniBatchKMeans

mini_batch_kmeans = MiniBatchKMeans(n_clusters=4)
mini_batch_kmeans.fit(X)