Kickstart ML with Python snippets
Exploring K-means algorithm basic concepts
K-Means Clustering is a type of unsupervised learning used to partition data into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean, which serves as the prototype of the cluster.
Centroids:
- The center of a cluster.
- Each cluster is represented by its centroid, which is the mean of all data points in the cluster.

Inertia:
- Also known as the within-cluster sum of squares.
- Measures the compactness of the clusters, calculated as the sum of squared distances between each point and its centroid.

K (Number of Clusters):
- The number of clusters to partition the data into.
- Needs to be specified before running the algorithm.
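To make centroids and inertia concrete, here is a small NumPy sketch; the three points are made-up values for illustration only:

```python
import numpy as np

# A hypothetical cluster of three 2-D points (illustrative values only)
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# Centroid: the per-feature mean of the points in the cluster
centroid = cluster_points.mean(axis=0)

# Inertia contribution of this cluster: sum of squared distances
# from each point to the centroid
inertia = np.sum((cluster_points - centroid) ** 2)

print(centroid)  # [3. 4.]
print(inertia)   # 16.0
```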
Steps in the K-Means Algorithm
Initialization:
- Select K initial centroids randomly from the dataset.

Assignment:
- Assign each data point to the nearest centroid, forming K clusters.

Update:
- Calculate the new centroids as the mean of all data points in each cluster.

Repeat:
- Repeat the assignment and update steps until the centroids no longer change (convergence) or a maximum number of iterations is reached.
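The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the name kmeans_sketch and the toy points are made up for the example, and no empty-cluster handling is included.

```python
import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids no longer change (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs: the algorithm should recover one cluster per blob
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans_sketch(X_demo, k=2)
```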
Practical Example in Python
Let’s walk through an example using the sklearn library in Python.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
- Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
- Apply K-Means Clustering:
# Set the number of clusters
k = 4

# Fit the K-Means model
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

# Get the cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Print the cluster centers
print(f"Cluster Centers:\n{centers}")
- Visualize the Clusters:
# Visualize the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, alpha=0.75, marker='x')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()
- Evaluate the Model:
# Calculate inertia (sum of squared distances to the nearest centroid)
inertia = kmeans.inertia_
print(f"Inertia: {inertia}")
Data Generation:
- We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how K-Means works.
Model Fitting:
- We initialize the K-Means model with k=4, meaning we want to partition the data into 4 clusters.
- We fit the model to the data using kmeans.fit(X), which runs the K-Means algorithm.
Cluster Centers and Labels:
- kmeans.cluster_centers_ gives the coordinates of the centroids of the clusters.
- kmeans.labels_ gives the cluster label for each data point.
Visualization:
- We plot the data points, coloring them by their cluster label.
- The centroids are plotted as red ‘x’ marks, showing the center of each cluster.
Inertia:
- Inertia is calculated to measure how well the clusters have been formed. Lower inertia indicates more compact clusters.
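As a sanity check, the inertia reported by scikit-learn can be reproduced by summing squared distances by hand. This sketch reuses the same make_blobs parameters as the example above; n_init and random_state are pinned only to make the run reproducible:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the example above
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Recompute inertia manually: sum of squared distances from each point
# to the centroid of its assigned cluster
manual_inertia = sum(
    np.sum((X[kmeans.labels_ == j] - center) ** 2)
    for j, center in enumerate(kmeans.cluster_centers_)
)

print(np.isclose(manual_inertia, kmeans.inertia_))  # True
```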
Practical Tips and Tricks
Choosing K:
- Use the Elbow Method to determine the optimal number of clusters. Plot inertia for different values of K and look for an "elbow" point where the rate of decrease slows down.
# Elbow Method
inertia_values = []
k_values = range(1, 10)
for k in k_values:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

plt.plot(k_values, inertia_values, 'bx-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
Scaling Features:
- Scale your features before applying K-Means, especially if they have different units or scales. Use StandardScaler or MinMaxScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Initialization:
- K-Means is sensitive to the choice of initial centroids. Use the k-means++ initialization (the default in scikit-learn) to improve convergence.
kmeans = KMeans(n_clusters=4, init='k-means++')
Handling Large Datasets:
- For large datasets, consider using MiniBatchKMeans, which reduces computational cost by using mini-batches.
from sklearn.cluster import MiniBatchKMeans

mini_batch_kmeans = MiniBatchKMeans(n_clusters=4)
mini_batch_kmeans.fit(X)