K-Means Cluster Evaluation with Silhouette Analysis
Clustering models require rigorous evaluation to ensure meaningful group separation. The silhouette score, ranging from −1 to 1, quantifies how well data points fit into their assigned clusters versus neighboring clusters.
Why This Matters
Silhouette analysis helps check whether a clustering that looks valid in theory also holds up on real data. While ideal models assume convex, well-separated clusters, real data often contains overlapping distributions or high-dimensional noise. A poorly chosen cluster count can give misleading results: with k = 6 on the penguins dataset, the average silhouette score drops to 0.392, reflecting weaker cohesion and separation.
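The effect of overlap on the score can be sketched on synthetic data. This is a minimal illustration, not the penguins analysis: the blob centers, the `cluster_std` values, and the helper name `mean_silhouette` are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def mean_silhouette(cluster_std, k=3, seed=0):
    """Fit K-means on synthetic blobs and return the average silhouette score."""
    # Assumed stand-in data: three Gaussian blobs around fixed centers
    centers = [[0, 0], [6, 0], [0, 6]]
    X, _ = make_blobs(n_samples=300, centers=centers,
                      cluster_std=cluster_std, random_state=seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)


tight = mean_silhouette(cluster_std=0.5)    # compact, well-separated blobs
overlap = mean_silhouette(cluster_std=3.0)  # heavily overlapping blobs
print(f"well separated: {tight:.3f}, overlapping: {overlap:.3f}")
```

The same number of clusters is fitted in both runs; only the spread of the data changes, so the gap between the two scores reflects how much cohesion and separation the data itself allows.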
Key Insights
- “Silhouette score formula: $ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $”, from MachineLearningMastery.com (2025)
- Silhouette analysis is well suited to iterative clustering methods such as K-means.
- “scikit-learn used by researchers and practitioners for silhouette computation and clustering”
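To make the formula concrete, the sketch below computes s(i) by hand and compares it with scikit-learn's `silhouette_samples`. Here a(i) is taken as the mean distance from point i to the other members of its own cluster and b(i) as the smallest mean distance to the members of any other cluster, using Euclidean distance. The six-point toy dataset and the helper name `manual_silhouette` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def manual_silhouette(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    # Pairwise Euclidean distances; the diagonal is exactly zero
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: s(i) = 0 for a singleton cluster
        # a(i): mean distance to the other members of i's own cluster
        a = D[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data (assumed): two small, clearly separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

manual = manual_silhouette(X, labels)
print(np.round(manual, 3))
print("matches scikit-learn:", np.allclose(manual, silhouette_samples(X, labels)))
```

A value near 1 means a point is much closer to its own cluster than to the nearest other one; a value near 0 puts it on a boundary, and a negative value suggests it may sit in the wrong cluster.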
Working Example
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import numpy as np
# Load and preprocess data
penguins = pd.read_csv('https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/penguins.csv')
penguins = penguins.dropna()
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Evaluate silhouette scores for k=2 to 6
range_n_clusters = list(range(2, 7))
silhouette_avgs = []
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)
    sil_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_avgs.append(sil_avg)
    print(f"For n_clusters = {n_clusters}, average silhouette_score = {sil_avg:.3f}")
# Visualize silhouette plots
fig, axes = plt.subplots(1, len(range_n_clusters), figsize=(25, 5), sharey=False)
for i, n_clusters in enumerate(range_n_clusters):
    ax = axes[i]
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    sil_vals = silhouette_samples(X_scaled, labels)
    sil_avg = silhouette_score(X_scaled, labels)
    y_lower = 10
    for j in range(n_clusters):
        # Sorted silhouette values of the points assigned to cluster j
        ith_sil_vals = sil_vals[labels == j]
        ith_sil_vals.sort()
        size_j = ith_sil_vals.shape[0]
        y_upper = y_lower + size_j
        color = plt.cm.nipy_spectral(float(j) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_sil_vals, facecolor=color, edgecolor=color, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size_j, str(j))
        y_lower = y_upper + 10  # leave a gap before the next cluster's band
    ax.set_title(f"Silhouette Plot for k = {n_clusters}")
    # Dashed red line marks the average silhouette score for this k
    ax.axvline(x=sil_avg, color="red", linestyle="--")
    ax.set_xlabel("Silhouette Coefficient")
    if i == 0:
        ax.set_ylabel("Cluster Label")
    ax.set_xlim([-0.1, 1])
    ax.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])
plt.tight_layout()
plt.show()
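The plots can be complemented with a numeric summary per k. The sketch below is one possible way to do that, not part of the original example: it runs on a synthetic stand-in for `X_scaled` (four well-separated blobs, an assumption), and the `summary` dictionary and its keys are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for X_scaled (assumption: four well-separated blobs)
X_demo, _ = make_blobs(n_samples=400,
                       centers=[[-8, -8], [-8, 8], [8, -8], [8, 8]],
                       cluster_std=1.0, random_state=42)

summary = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_vals = silhouette_samples(X_demo, labels)
    summary[k] = {
        # Mean of the per-sample values equals silhouette_score
        "mean": float(sil_vals.mean()),
        # Share of points that sit closer to another cluster than their own
        "negative_share": float((sil_vals < 0).mean()),
        # Average silhouette within each cluster, to spot weak clusters
        "per_cluster": [float(sil_vals[labels == j].mean()) for j in range(k)],
    }

best_k = max(summary, key=lambda k: summary[k]["mean"])
for k, stats in summary.items():
    print(f"k={k}: mean={stats['mean']:.3f}, "
          f"negative share={stats['negative_share']:.2%}")
print("highest average silhouette at k =", best_k)
```

The per-cluster averages and the share of negative values carry the same information as the band shapes in the silhouette plots, which makes them convenient for logging or automated comparison across runs.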
Practical Applications
- Use Case: Marketing segmentation using customer purchase data, where silhouette analysis helps identify optimal cluster counts for targeted campaigns.
- Pitfall: Over-reliance on silhouette scores without domain knowledge may misalign cluster interpretations (e.g., k = 2 in the penguins dataset vs. three biological species).
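The pitfall can be reproduced on synthetic data where the true grouping is known. In this sketch, three hypothetical groups are generated with two of them close together; the silhouette score is then compared with the adjusted Rand index (`adjusted_rand_score`), which is brought in here only as an external check against the known groups and is not part of the original analysis. The blob centers and variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical data: three known groups, two of which sit close together
X_demo, true_groups = make_blobs(n_samples=300,
                                 centers=[[0, 0], [4, 0], [20, 0]],
                                 cluster_std=1.0, random_state=0)

results = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    results[k] = {
        # Internal measure: uses only the geometry of the clustering
        "silhouette": silhouette_score(X_demo, labels),
        # External measure: agreement with the known groups
        "ari": adjusted_rand_score(true_groups, labels),
    }
    print(f"k={k}: silhouette={results[k]['silhouette']:.3f}, "
          f"ARI vs known groups={results[k]['ari']:.3f}")
```

On data shaped like this, the silhouette score tends to favor merging the two nearby groups, while agreement with the known grouping is better at k = 3. This mirrors the situation described for the penguins dataset and three species.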