Kickstart ML with Python snippets
Exploring Density-based (DBSCAN) ML algorithm basic concepts
DBSCAN is a density-based clustering algorithm that groups data points that are closely packed together, marking points that lie alone in low-density regions as outliers. It is useful for data containing clusters of varying shapes and sizes.
- Core Points: A point is a core point if it has at least `min_samples` points (including itself) within a given distance `eps`.
- Border Points: A point is a border point if it is not a core point, but it is within the `eps` distance of a core point.
- Noise Points: A point is a noise point if it is neither a core point nor a border point.
- Directly Density-Reachable: A point `p` is directly density-reachable from a point `q` if `p` is within the `eps` distance from `q` and `q` is a core point.
- Density-Reachable: A point `p` is density-reachable from a point `q` if there is a chain of points `p1, p2, ..., pn` such that `p1 = q` and `pn = p`, and each `pi+1` is directly density-reachable from `pi`.
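The definitions above translate almost directly into code. Here is a minimal, illustrative classifier (a sketch, not scikit-learn's implementation) that labels each point as core, border, or noise for a given `eps` and `min_samples`, using a brute-force distance matrix:

```python
import numpy as np

def classify_points(X, eps=0.5, min_samples=5):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    n = len(X)
    # Pairwise Euclidean distances (brute force; fine for small illustrative data)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself (its distance 0 <= eps)
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_samples

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append('core')
        elif (dists[i][is_core] <= eps).any():
            # Not a core point, but within eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels

# Tiny example: a tight run of points, one fringe point, one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.45, 0.0], [5.0, 5.0]])
print(classify_points(X, eps=0.3, min_samples=3))
# → ['core', 'core', 'core', 'border', 'noise']
```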
Steps in DBSCAN Algorithm
- Initialization: Select an arbitrary point from the dataset.
- Cluster Formation: If the selected point is a core point, form a cluster by finding all points density-reachable from it. If the selected point is a border point, it may be assigned to an existing cluster or marked as noise if it doesn't meet the density criteria.
- Repeat: Repeat the process until all points have been visited.
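The steps above can be sketched as a compact implementation. This is an illustrative version with a brute-force neighbor search, intended to make the visit/expand loop explicit rather than to compete with scikit-learn's optimized DBSCAN:

```python
import numpy as np

def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)             # -1 = noise (or not yet claimed)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):                  # Initialization: take the next unvisited point
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue                    # not core; stays noise unless claimed as border
        # Cluster Formation: expand outward from this core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # density-reachable point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])  # core point: keep expanding
        cluster += 1                    # Repeat with the next unvisited point
    return labels
```

Note how border points are claimed by a cluster (via `labels[j] = cluster`) but never expand the queue, matching the definitions in the previous section.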
Practical Example in Python
Let's walk through an example of using the DBSCAN algorithm with Python and the scikit-learn library.
Step-by-Step Example
- Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import seaborn as sns
# Optional for better plot aesthetics
sns.set(style="whitegrid")
- Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
- Apply DBSCAN:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering')
plt.show()
- Interpret Results:
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)

print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')
- Data Generation: We use `make_blobs` to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how DBSCAN works.
- Model Fitting: We initialize the DBSCAN model with `eps=0.3` (the maximum distance between two samples for them to be considered in the same neighborhood) and `min_samples=5` (the minimum number of samples in a neighborhood for a point to be considered a core point). We fit the model to the data using `dbscan.fit_predict(X)`, which runs the DBSCAN algorithm and returns a cluster label for each data point.
- Cluster Interpretation: Points with the same cluster label are grouped together. Points labeled `-1` are considered noise (outliers).
- Visualization: We plot the data points, coloring them by their cluster labels to visualize the clustering results. Noise points (if any) are usually colored differently to indicate they are outliers.
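To make the noise points stand out explicitly, one option is to plot them as a separate series. This sketch reuses the same synthetic-data pipeline as above so it runs on its own:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
clusters = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Split clustered points from noise (-1) and style them differently
noise_mask = clusters == -1
plt.scatter(X[~noise_mask, 0], X[~noise_mask, 1],
            c=clusters[~noise_mask], cmap='viridis', s=50, label='clustered')
plt.scatter(X[noise_mask, 0], X[noise_mask, 1],
            c='red', marker='x', s=50, label='noise')
plt.legend()
plt.title('DBSCAN Clustering (noise highlighted)')
plt.show()
```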
Practical Tips
- Choosing Parameters: The choice of `eps` and `min_samples` is crucial. Use domain knowledge or heuristics to set these parameters. The k-distance graph can help in choosing `eps`: plot the sorted distances of each point to its k-th nearest neighbor and look for a knee point.
from sklearn.neighbors import NearestNeighbors

neighbors = NearestNeighbors(n_neighbors=5)
neighbors_fit = neighbors.fit(X)
distances, indices = neighbors_fit.kneighbors(X)
distances = np.sort(distances[:, 4], axis=0)

plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('5th Nearest Neighbor Distance')
plt.title('K-Distance Graph')
plt.show()
- Handling Different Densities: DBSCAN can handle clusters of varying densities better than K-Means, but it may struggle if there are significant density variations within a single dataset. In such cases, consider using OPTICS (Ordering Points To Identify the Clustering Structure), which is an extension of DBSCAN.
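As a quick sketch of that alternative, scikit-learn ships OPTICS with a very similar interface. The data here is illustrative (two blobs with deliberately different spreads to mimic varying densities), and `xi=0.05` is just one reasonable starting value for the cluster-extraction steepness:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with different spreads, mimicking varying densities
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.0], random_state=0)

# OPTICS orders points by reachability instead of fixing a single eps
optics = OPTICS(min_samples=5, xi=0.05)
labels = optics.fit_predict(X)

print('clusters found:', len(set(labels)) - (1 if -1 in labels else 0))
```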
- Scaling Features: Scale your features before applying DBSCAN to ensure that all features contribute equally to the distance metric.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
- Interpretation of Noise: Noise points can provide valuable insights, indicating outliers or points that do not belong to any cluster. Handle these points based on the context of your application.
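One common way to act on the noise label, continuing the synthetic-data example from earlier (whether you drop, inspect, or re-cluster these points depends on your application):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
clusters = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Pull out the noise points for separate inspection
outliers = X[clusters == -1]
print(f'{len(outliers)} of {len(X)} points flagged as noise')

# Depending on context: drop them, review them manually,
# or re-run with looser parameters
X_clean = X[clusters != -1]
```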