K-Nearest Neighbors classification with interactive visualization. See which neighbors vote for each class, explore accuracy vs k, and generate Python sklearn code.
Nearest neighbors of the test point [6.0, 3.0] (k = 3):

| # | Class | Distance | Point |
|---|---|---|---|
| 1 | Versicolor | 0.361 | [5.7, 2.8] |
| 2 | Virginica | 0.361 | [5.8, 2.7] |
| 3 | Virginica | 0.424 | [6.3, 3.3] |
Confusion matrix on the training data (rows: true class, columns: predicted class):

| | Pred Setosa | Pred Versicolor | Pred Virginica |
|---|---|---|---|
| True Setosa | 6 | 0 | 0 |
| True Versicolor | 0 | 0 | 6 |
| True Virginica | 0 | 5 | 1 |
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Data
X = np.array([
    [5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6], [5.4, 3.9], [4.6, 3.4],
    [7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3], [6.5, 2.8], [5.7, 2.8],
    [6.3, 3.3], [5.8, 2.7], [7.1, 3.0], [6.3, 2.5], [6.5, 3.0], [7.2, 3.6],
])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
class_names = ['Setosa', 'Versicolor', 'Virginica']

# Standardize features (KNN is distance-based, so scaling matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# KNN classifier (k=3, Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_scaled, y)

# Predict a new point
test_point = np.array([[6.0, 3.0]])
test_scaled = scaler.transform(test_point)
prediction = knn.predict(test_scaled)[0]
proba = knn.predict_proba(test_scaled)[0]
print(f"Prediction: {class_names[prediction]}")
print(f"Probabilities: {dict(zip(class_names, proba.round(3)))}")

# Cross-validation accuracy
cv_scores = cross_val_score(knn, X_scaled, y, cv=min(5, len(X)))
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Find the optimal k
k_range = range(1, min(16, len(X)))
cv_means = []
for k in k_range:
    knn_k = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(knn_k, X_scaled, y, cv=min(5, len(X)))
    cv_means.append(scores.mean())

plt.figure(figsize=(10, 5))
plt.plot(k_range, cv_means, 'bo-')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('KNN: Accuracy vs k')
plt.grid(True, alpha=0.3)
plt.xticks(list(k_range))
plt.show()

# Classification report
y_pred = knn.predict(X_scaled)
print("\nClassification Report:")
print(classification_report(y, y_pred, target_names=class_names))
```

**How do I choose k?**

Start with k ≈ sqrt(n), where n is the number of training samples, then use cross-validation or leave-one-out validation to compare candidate k values. Odd k values avoid ties in binary classification. Too small a k leads to overfitting (a noisy decision boundary); too large a k leads to underfitting (an overly smooth one).
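A quick sketch of that selection procedure on hypothetical two-cluster data (the data, seed, and candidate k values below are all illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Hypothetical toy data: two well-separated classes, 8 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(3, 1, (8, 2))])
y = np.array([0] * 8 + [1] * 8)

# Rule of thumb: start near sqrt(n)
k_start = round(np.sqrt(len(X)))  # sqrt(16) = 4

# Compare odd k values with leave-one-out cross-validation
accs = {}
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    accs[k] = cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
    print(f"k={k}: LOO accuracy = {accs[k]:.3f}")
```

Leave-one-out is practical here because the dataset is tiny; on larger datasets, 5- or 10-fold cross-validation is cheaper and usually sufficient.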
**Do I need to scale my features before using KNN?**

Yes. KNN relies on distance calculations, so features with larger scales dominate the metric. Always standardize or normalize your features (e.g., with StandardScaler) so all features contribute equally to the distance.
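A small illustration of why this matters, using hypothetical features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 spans ~5-9, feature 1 spans thousands
X = np.array([[5.0, 2000.0],
              [5.1, 5000.0],
              [9.0, 2100.0]])

# Raw Euclidean distances from the first point: the large-scale
# feature completely dominates both distances
d_raw = np.linalg.norm(X[1:] - X[0], axis=1)
print(d_raw)  # roughly [3000.0, 100.1]

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std[1:] - X_std[0], axis=1)
print(d_std)  # both distances now on the same order of magnitude
```

Note that on the raw data the second point looks 30x farther away than the third, purely because of the units of feature 1; after scaling, the two distances are nearly equal.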
**What's the difference between Euclidean and Manhattan distance?**

Euclidean distance is the straight-line distance (L2 norm). Manhattan distance is the sum of absolute differences (L1 norm), like walking city blocks. Manhattan is more robust to outliers and often works better in high dimensions; Euclidean is the default for most applications.
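The two metrics side by side on a pair of toy vectors (in sklearn you would select the latter with `metric='manhattan'`):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean (L2): straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(3**2 + 4**2) = 5.0

# Manhattan (L1): sum of absolute differences ("city-block" distance)
manhattan = np.sum(np.abs(a - b))  # 3 + 4 = 7.0

print(euclidean, manhattan)
```

Manhattan distance is always at least as large as Euclidean for the same pair of points, and the gap grows with the number of dimensions.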
**What are the pros and cons of KNN?**

Pros: simple to understand, no training phase, and it naturally handles any number of classes. Cons: slow prediction (every query must compute distances to all training points), sensitivity to feature scaling, high memory use (it stores the entire training set), and poor performance in high dimensions (the curse of dimensionality).
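The "no training phase, all work at prediction time" trade-off is easiest to see in a minimal from-scratch sketch (hypothetical helper and toy data, not how sklearn implements it internally):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just storing X_train/y_train; all the work happens
    # here: compute the distance to every stored sample per query.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()   # majority vote

# Hypothetical toy data: two clusters
X_train = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.2, 5.1])))  # 1
```

This brute-force approach is O(n) per query; libraries mitigate it with KD-trees or ball trees, but the memory cost of storing all training data remains.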
**How are ties in voting broken?**

When multiple classes receive the same number of votes among the k neighbors, common strategies include using an odd k to prevent ties (binary classification), choosing the class of the nearest neighbor among the tied classes, or using weighted voting in which closer neighbors count more (weights='distance' in sklearn).
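A minimal demonstration of distance weighting on hypothetical 1-D data engineered so that k=2 produces a 1-vs-1 vote split:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: the test point's two nearest neighbors are
# 4.0 (class 0, distance 1) and 8.0 (class 1, distance 3)
X = np.array([[0.0], [4.0], [8.0], [12.0]])
y = np.array([0, 0, 1, 1])
test = np.array([[5.0]])

# Uniform voting: dead tie, 0.5 / 0.5
uniform = KNeighborsClassifier(n_neighbors=2).fit(X, y)
p_uniform = uniform.predict_proba(test)[0]
print(p_uniform)  # [0.5 0.5]

# Distance weighting: neighbors vote with weight 1/distance,
# so class 0 (weight 1/1) outweighs class 1 (weight 1/3)
weighted = KNeighborsClassifier(n_neighbors=2, weights='distance').fit(X, y)
p_weighted = weighted.predict_proba(test)[0]
print(p_weighted)  # [0.75 0.25]
```

With uniform weights the tie is resolved arbitrarily; with distance weighting the nearer neighbor decides, which is usually the behavior you want.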