Interactive K-means clustering with real-time visualization. See cluster assignments, centroids, the elbow method, silhouette scores, and generated Python sklearn code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Data
X = np.array([
    [1, 2],
    [1.5, 1.8],
    [1.2, 2.5],
    [5, 8],
    [6, 9],
    [5.5, 7.5],
    [9, 1],
    [10, 2],
    [9.5, 1.5],
    [1.8, 2.2],
    [5.2, 8.3],
    [9.2, 1.8],
    [1.3, 1.5],
    [5.8, 7.8],
    [10.2, 1.2],
    [0.8, 2.8],
    [6.2, 8.8],
    [9.8, 2.2]
])
# Optional: Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-Means Clustering (k=3)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(f"Centroids (scaled):\n{kmeans.cluster_centers_}")
print(f"Inertia: {kmeans.inertia_:.4f}")
print(f"Iterations: {kmeans.n_iter_}")
# Silhouette Score
sil = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil:.4f}")
# Cluster sizes
for i in range(3):
    print(f"Cluster {i}: {sum(labels == i)} points")
# Elbow Method
inertias = []
K_range = range(1, 9)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Elbow plot
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)
# Cluster visualization (first 2 dimensions)
colors = plt.cm.Set1(labels / max(labels.max(), 1))
axes[1].scatter(X_scaled[:, 0], X_scaled[:, 1], c=colors, s=50, alpha=0.7)
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='black', marker='X', s=200, edgecolors='white', linewidths=2)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('K-Means Clusters (k=3)')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Use the Elbow Method: plot inertia against k and look for an "elbow" where adding more clusters gives diminishing returns. Complement it with the Silhouette Score, where values closer to 1 indicate better-defined clusters. Domain knowledge is equally important for choosing k.
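The elbow loop above only tracks inertia; the silhouette score can be scanned over the same range of k and peaks at the best-separated clustering. A minimal sketch on synthetic blobs mirroring the data above (the blob centers and k range are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three tight blobs around roughly the same centers as the data above
X = np.vstack([rng.normal(c, 0.4, size=(10, 2)) for c in [(1, 2), (5, 8), (9, 1)]])

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

For well-separated blobs like these, the scan should peak at k=3, agreeing with the elbow plot.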
Inertia (also called WCSS - Within-Cluster Sum of Squares) measures the sum of squared distances between each data point and its assigned cluster centroid. Lower inertia means tighter clusters, but it always decreases with more clusters, which is why the elbow method is needed.
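Because the definition is so direct, it is easy to verify by hand: recompute the summed squared distances and compare against the fitted model's inertia_. A minimal sketch (the small data set is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [6, 9], [9, 1], [10, 2]], dtype=float)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Inertia by hand: squared distance of each point to its assigned centroid, summed
manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual, km.inertia_)  # the two values agree up to floating-point precision
```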
K-means++ is a smart initialization that spreads initial centroids apart. Instead of random placement, it selects the first centroid randomly, then subsequent centroids are chosen with probability proportional to their squared distance from the nearest existing centroid. This leads to better convergence.
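The selection rule is short enough to sketch directly. The kmeanspp_init function below is a simplified, hypothetical version; sklearn's actual implementation additionally samples several candidate points at each step and keeps the best one:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Pick k initial centroids: first uniformly at random, the rest
    with probability proportional to squared distance from the nearest
    already-chosen centroid (the k-means++ rule)."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.array([[1, 2], [1.2, 2.5], [5, 8], [5.5, 7.5], [9, 1], [9.5, 1.5]])
print(kmeanspp_init(X, 3, rng))
```

Points far from every existing centroid get high selection probability, which is what spreads the initial centroids apart.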
K-means struggles with: non-spherical clusters (elongated or irregular shapes), clusters of very different sizes, clusters of different densities, and high-dimensional data (curse of dimensionality). Consider DBSCAN, Gaussian Mixture Models, or spectral clustering for such cases.
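A concrete failure case is two interleaved half-moons: k-means cuts them with a straight boundary, while DBSCAN follows the density and recovers the true shapes. A sketch using sklearn's make_moons (the eps and min_samples values are illustrative choices for this noise level):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaved, non-spherical clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index vs. the true labels: 1.0 is a perfect recovery
print("k-means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```

DBSCAN scores far higher here because it groups by connectivity rather than by distance to a central point.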
Yes, scaling matters. K-means uses Euclidean distance, so features with larger scales dominate the distance calculation. Standardizing to zero mean and unit variance ensures all features contribute equally; use StandardScaler from sklearn.
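To see the effect, give one feature a meaningless but huge scale: unscaled k-means then clusters on that noise, while the standardized version recovers the true groups. A synthetic sketch (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# True groups differ only in feature 1; feature 2 is pure noise at a huge scale
y = np.repeat([0, 1], 50)
X = np.column_stack([y * 5 + rng.normal(0, 0.3, 100),
                     rng.normal(0, 1000, 100)])

raw = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X))

print("raw ARI:   ", adjusted_rand_score(y, raw))   # near 0: clustered the noise
print("scaled ARI:", adjusted_rand_score(y, scaled))  # near 1: found the real groups
```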