K-Nearest Neighbors classification with interactive visualization. See which neighbors vote for each class, explore accuracy vs k, and generate Python sklearn code.
Nearest neighbors of the test point [6.0, 3.0] (k = 3):

| # | Class | Distance | Point |
|---|---|---|---|
| 1 | Versicolor | 0.361 | [5.7, 2.8] |
| 2 | Virginica | 0.361 | [5.8, 2.7] |
| 3 | Virginica | 0.424 | [6.3, 3.3] |
Confusion matrix on the training data (rows: true class, columns: predicted class):

| | Pred Setosa | Pred Versicolor | Pred Virginica |
|---|---|---|---|
| True Setosa | 6 | 0 | 0 |
| True Versicolor | 0 | 0 | 6 |
| True Virginica | 0 | 5 | 1 |
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Data
X = np.array([
    [5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6], [5.4, 3.9], [4.6, 3.4],
    [7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3], [6.5, 2.8], [5.7, 2.8],
    [6.3, 3.3], [5.8, 2.7], [7.1, 3.0], [6.3, 2.5], [6.5, 3.0], [7.2, 3.6],
])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
class_names = ['Setosa', 'Versicolor', 'Virginica']

# Standardize features (KNN is distance-based, so scaling matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# KNN classifier (k=3, Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_scaled, y)

# Predict a new point
test_point = np.array([[6.0, 3.0]])
test_scaled = scaler.transform(test_point)
prediction = knn.predict(test_scaled)[0]
proba = knn.predict_proba(test_scaled)[0]
print(f"Prediction: {class_names[prediction]}")
print(f"Probabilities: {dict(zip(class_names, proba.round(3)))}")

# Cross-validation accuracy
cv_scores = cross_val_score(knn, X_scaled, y, cv=min(5, len(X)))
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Find the optimal k
k_range = range(1, min(16, len(X)))
cv_means = []
for k in k_range:
    knn_k = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(knn_k, X_scaled, y, cv=min(5, len(X)))
    cv_means.append(scores.mean())

plt.figure(figsize=(10, 5))
plt.plot(k_range, cv_means, 'bo-')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('KNN: Accuracy vs k')
plt.grid(True, alpha=0.3)
plt.xticks(list(k_range))
plt.show()

# Classification report
y_pred = knn.predict(X_scaled)
print("\nClassification Report:")
print(classification_report(y, y_pred, target_names=class_names))
```

**How do I choose k?**

Start with k ≈ sqrt(n), where n is the number of training samples, then use cross-validation or leave-one-out validation to compare candidate k values. Odd k values avoid ties in binary classification. Too small a k leads to overfitting (a noisy decision boundary); too large a k leads to underfitting (an overly smooth one).
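A quick sketch of that selection procedure on hypothetical two-cluster data (the data, seed, and candidate k values below are all illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Hypothetical toy data: two well-separated classes, 8 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(3, 1, (8, 2))])
y = np.array([0] * 8 + [1] * 8)

# Rule of thumb: start near sqrt(n)
k_start = round(np.sqrt(len(X)))  # sqrt(16) = 4

# Compare odd k values with leave-one-out cross-validation
accs = {}
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    accs[k] = cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
    print(f"k={k}: LOO accuracy = {accs[k]:.3f}")
```

Leave-one-out is practical here because the dataset is tiny; on larger datasets, 5- or 10-fold cross-validation is cheaper and usually sufficient.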
**Do I need to scale my features before using KNN?**

Yes. KNN relies on distance calculations, so features with larger scales dominate the metric. Always standardize or normalize your features (e.g., with StandardScaler) so all features contribute equally to the distance.
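A small illustration of why this matters, using hypothetical features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 spans ~5-9, feature 1 spans thousands
X = np.array([[5.0, 2000.0],
              [5.1, 5000.0],
              [9.0, 2100.0]])

# Raw Euclidean distances from the first point: the large-scale
# feature completely dominates both distances
d_raw = np.linalg.norm(X[1:] - X[0], axis=1)
print(d_raw)  # roughly [3000.0, 100.1]

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[1:] - X_std[0], axis=1)
print(d_std)  # both distances now on the same order of magnitude
```

Note that on the raw data the second point looks 30x farther away than the third, purely because of the units of feature 1; after scaling, the two distances are nearly equal.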
**What's the difference between Euclidean and Manhattan distance?**

Euclidean distance is the straight-line distance (L2 norm). Manhattan distance is the sum of absolute differences (L1 norm), like walking city blocks. Manhattan is more robust to outliers and often works better in high dimensions; Euclidean is the default for most applications.
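The two metrics side by side on a pair of toy vectors (in sklearn you would select the latter with `metric='manhattan'`):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean (L2): straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(3**2 + 4**2) = 5.0

# Manhattan (L1): sum of absolute differences ("city-block" distance)
manhattan = np.sum(np.abs(a - b))  # 3 + 4 = 7.0

print(euclidean, manhattan)
```

Manhattan distance is always at least as large as Euclidean for the same pair of points, and the gap grows with the number of dimensions.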
**What are the pros and cons of KNN?**

Pros: simple to understand, no training phase, and it naturally handles any number of classes. Cons: slow prediction (every query must compute distances to all training points), sensitivity to feature scaling, high memory use (it stores the entire training set), and poor performance in high dimensions (the curse of dimensionality).
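The "no training phase, all work at prediction time" trade-off is easiest to see in a minimal from-scratch sketch (hypothetical helper and toy data, not how sklearn implements it internally):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just storing X_train/y_train; all the work happens
    # here: compute the distance to every stored sample per query.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()   # majority vote

# Hypothetical toy data: two clusters
X_train = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.2, 5.1])))  # 1
```

This brute-force approach is O(n) per query; libraries mitigate it with KD-trees or ball trees, but the memory cost of storing all training data remains.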
**How are ties in voting broken?**

When multiple classes receive the same number of votes among the k neighbors, common strategies include using an odd k to prevent ties (binary classification), choosing the class of the nearest neighbor among the tied classes, or using weighted voting in which closer neighbors count more (weights='distance' in sklearn).
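A minimal demonstration of distance weighting on hypothetical 1-D data engineered so that k=2 produces a 1-vs-1 vote split:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: the test point's two nearest neighbors are
# 4.0 (class 0, distance 1) and 8.0 (class 1, distance 3)
X = np.array([[0.0], [4.0], [8.0], [12.0]])
y = np.array([0, 0, 1, 1])
test = np.array([[5.0]])

# Uniform voting: dead tie, 0.5 / 0.5
uniform = KNeighborsClassifier(n_neighbors=2).fit(X, y)
p_uniform = uniform.predict_proba(test)[0]
print(p_uniform)  # [0.5 0.5]

# Distance weighting: neighbors vote with weight 1/distance,
# so class 0 (weight 1/1) outweighs class 1 (weight 1/3)
weighted = KNeighborsClassifier(n_neighbors=2, weights='distance').fit(X, y)
p_weighted = weighted.predict_proba(test)[0]
print(p_weighted)  # [0.75 0.25]
```

With uniform weights the tie is resolved arbitrarily; with distance weighting the nearer neighbor decides, which is usually the behavior you want.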