Kickstart ML with Python snippets

Exploring the basic concepts of the density-based DBSCAN clustering algorithm

DBSCAN is a density-based clustering algorithm that groups together data points that are closely packed, marking points that lie alone in low-density regions as outliers (noise). It is useful for data containing clusters of varying shapes and sizes.

  1. Core Points:

    • A point is a core point if it has at least min_samples points (including itself) within a given distance eps.
  2. Border Points:

    • A point is a border point if it is not a core point, but it is within the eps distance of a core point.
  3. Noise Points:

    • A point is a noise point if it is neither a core point nor a border point.
  4. Directly Density-Reachable:

    • A point p is directly density-reachable from a point q if p is within the eps distance from q and q is a core point.
  5. Density-Reachable:

    • A point p is density-reachable from a point q if there is a chain of points p1, p2, ..., pn such that p1 = q and pn = p, and each pi+1 is directly density-reachable from pi.
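
To make these definitions concrete, here is a minimal sketch that labels every point in a dataset X as core, border, or noise for given eps and min_samples values. The helper name classify_points is ours (not part of scikit-learn); it simply applies the definitions above.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def classify_points(X, eps, min_samples):
    """Label each row of X as 'core', 'border', or 'noise'."""
    # Find all neighbors within eps; each point counts as its own
    # neighbor, matching the "including itself" convention above
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)

    is_core = np.array([len(n) >= min_samples for n in neighborhoods])
    labels = np.full(len(X), 'noise', dtype=object)
    labels[is_core] = 'core'

    # A border point is not core itself but has a core point within eps
    for i in np.where(~is_core)[0]:
        if is_core[neighborhoods[i]].any():
            labels[i] = 'border'
    return labels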

Steps in DBSCAN Algorithm

  1. Initialization:

    • Select an arbitrary point from the dataset.
  2. Cluster Formation:

    • If the selected point is a core point, form a cluster by finding all points density-reachable from it.
    • If the selected point is not a core point, it is provisionally marked as noise; it may later be assigned to a cluster as a border point if it falls within the eps distance of one of that cluster's core points.
  3. Repeat:

    • Repeat the process until all points have been visited.
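
Putting these steps together, the following is a compact reference sketch of the full algorithm. It is not the optimized implementation scikit-learn uses, just a direct translation of the steps above (the helper name dbscan_sketch is ours):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dbscan_sketch(X, eps, min_samples):
    # Precompute every point's eps-neighborhood
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)

    labels = np.full(len(X), -1)        # -1 = noise until claimed by a cluster
    visited = np.zeros(len(X), dtype=bool)
    cluster_id = 0

    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighborhoods[i]) < min_samples:
            continue                    # not a core point; stays noise for now
        # i is a core point: grow a new cluster from it
        labels[i] = cluster_id
        seeds = list(neighborhoods[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:         # claim unlabelled points for this cluster
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                if len(neighborhoods[j]) >= min_samples:
                    seeds.extend(neighborhoods[j])  # j is core: keep expanding
        cluster_id += 1
    return labels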

Practical Example in Python

Let's walk through an example of using the DBSCAN algorithm with Python and the scikit-learn library.

Step-by-Step Example

  1. Import Libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import seaborn as sns

# Optional for better plot aesthetics
sns.set(style="whitegrid")
  2. Generate Synthetic Data:
# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Data')
plt.show()
  3. Apply DBSCAN:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering')
plt.show()
  4. Interpret Results:
# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)

print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')

Explanation of the Example

  1. Data Generation:

    • We use make_blobs to generate synthetic data with 4 centers (clusters). This helps us visualize and understand how DBSCAN works.
  2. Model Fitting:

    • We initialize the DBSCAN model with eps=0.3 (the maximum distance between two samples for them to be considered part of the same neighborhood) and min_samples=5 (the minimum number of samples in a neighborhood for a point to be considered a core point).
    • We fit the model to the data using dbscan.fit_predict(X), which performs the DBSCAN algorithm and returns cluster labels for each data point.
  3. Cluster Interpretation:

    • Points with the same cluster label are grouped together.
    • Points labeled as -1 are considered noise (outliers).
  4. Visualization:

    • We plot the data points, coloring them by their cluster labels to visualize the clustering results.
    • Noise points (if any) are usually colored differently to indicate they are outliers, as shown in the sketch below.
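
As a sketch of that last point (reusing X, clusters, and dbscan from the example above), the snippet below draws noise points in a separate color and reports how many core points the fitted model found via its core_sample_indices_ attribute:

# Separate clustered points from noise (label -1)
noise_mask = clusters == -1

plt.scatter(X[~noise_mask, 0], X[~noise_mask, 1],
            c=clusters[~noise_mask], cmap='viridis', s=50)
plt.scatter(X[noise_mask, 0], X[noise_mask, 1],
            c='red', marker='x', s=50, label='Noise')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering with Noise Highlighted')
plt.show()

print(f'Number of core points: {len(dbscan.core_sample_indices_)}')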

Practical Tips

  1. Choosing Parameters:

    • The choice of eps and min_samples is crucial. Use domain knowledge or heuristics to set these parameters.
    • The k-distance graph can help in choosing eps: plot the sorted distances of each point to its k-th nearest neighbor and look for a knee point; the distance at the knee is a good candidate for eps.
    from sklearn.neighbors import NearestNeighbors
    
    neighbors = NearestNeighbors(n_neighbors=5)
    neighbors_fit = neighbors.fit(X)
    distances, indices = neighbors_fit.kneighbors(X)
    
    # Distance of each point to its 5th nearest neighbor (index 4), sorted
    distances = np.sort(distances[:, 4], axis=0)
    plt.plot(distances)
    plt.xlabel('Points')
    plt.ylabel('5th Nearest Neighbor Distance')
    plt.title('K-Distance Graph')
    plt.show()
  2. Handling Different Densities:

    • DBSCAN handles arbitrarily shaped clusters better than K-Means, but a single eps value can struggle when cluster densities vary significantly within one dataset. In such cases, consider OPTICS (Ordering Points To Identify the Clustering Structure), an extension of DBSCAN designed for varying densities.
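    • As a minimal sketch (reusing min_samples=5 from the example), OPTICS can be applied much like DBSCAN; it does not need a single fixed eps:
    from sklearn.cluster import OPTICS

    # OPTICS orders points by reachability distance instead of using one
    # fixed eps, so it copes better with clusters of varying density
    optics = OPTICS(min_samples=5)
    optics_labels = optics.fit_predict(X)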
  3. Scaling Features:

    • Scale your features before applying DBSCAN to ensure that all features contribute equally to the distance metric.
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
  4. Interpretation of Noise:

    • Noise points can provide valuable insights, indicating outliers or points that do not belong to any cluster. Handle these points based on the context of your application.
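    • For example, the noise points from the earlier example can be pulled out with a boolean mask for closer inspection:
    # Extract the points DBSCAN labelled as noise
    noise_points = X[clusters == -1]
    print(f'Found {len(noise_points)} noise points')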
