Activation Function
ReLU, Sigmoid, Tanh
A math function applied to each neuron's output that adds nonlinearity. Without it, a neural network collapses into a single linear equation.
Multi-step AI process: plan → execute → evaluate → iterate until done. Not one-shot. The AI loops and self-corrects.
An LLM that can plan, use tools, and take multi-step actions autonomously. Think: chatbot that can actually DO things.
Making AI systems behave according to human values and intentions. The model should do what you mean, not just what you say.
A step-by-step procedure for solving a problem or performing a computation. In ML, algorithms learn patterns from data to make predictions.
A mechanism that lets the model decide which parts of the input are most relevant for each output position.
A network that compresses data into a small representation and reconstructs it. Learns efficient encodings.
The algorithm that computes gradients for every weight by applying the chain rule backwards through the network.
Number of training samples processed before updating weights. Balances speed (large) vs. generalization (small).
A decoding strategy that explores multiple candidate sequences in parallel, keeping the top-K most promising at each step. More thorough than greedy decoding.
BERT
Bidirectional Encoder Representations from Transformers
Encoder-only transformer that reads text in both directions. Best for understanding tasks (classification, NER), not generation.
Two sources of error in ML. High bias = underfitting (too simple). High variance = overfitting (too complex). Balance both.
BitNet
1.58-bit Models, Ternary Quantization
Extreme quantization where model weights are limited to just three values: -1, 0, and 1 (1.58 bits). Enables AI inference on CPUs and edge devices with minimal quality loss.
Prompting the model to show its reasoning step by step before giving the final answer. Dramatically improves accuracy.
Predicting which category an input belongs to. Binary = 2 classes. Multi-class = 3+ classes.
CLIP
Contrastive Language-Image Pre-training
Jointly trains image and text encoders so images and their descriptions map to nearby points in the same embedding space.
Grouping similar data points without labels. The algorithm discovers natural groupings automatically.
CNN
Convolutional Neural Network
A neural network that slides learnable filters across images to detect visual features like edges, textures, and objects.
A table showing actual vs. predicted classes. Reveals exactly where and how the model makes mistakes.
Constitutional AI
CAI, RLAIF
An alignment method (by Anthropic) where the model self-critiques its own outputs using a written set of rules ("constitution") instead of relying solely on human feedback.
Context Window
Context Length
Maximum number of tokens a model can process at once. GPT-4: 128K tokens. Claude: 200K tokens.
Measures how similar two vectors are by the angle between them. 1 = identical direction, 0 = unrelated, -1 = opposite.
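In code, cosine similarity is just the dot product divided by the two vector lengths. A minimal pure-Python sketch (function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    # Dot product over the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction -> 1, orthogonal -> 0, opposite -> -1.
same = cosine_similarity([1, 0], [2, 0])
orthogonal = cosine_similarity([1, 0], [0, 1])
```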
Cross-Validation
k-Fold CV
Split data into k folds, train on k-1, test on the remaining fold. Rotate and average. More reliable than a single split.
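The rotate-and-average scheme above can be sketched as an index generator (pure Python, illustrative only):

```python
def k_fold_indices(n, k):
    # Yield (train_indices, val_indices) pairs so that each fold
    # serves as the held-out validation set exactly once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size
```

Train a model on each `train`, score it on each `val`, then average the k scores for the final estimate.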
Creating new training samples by transforming existing ones (flip, rotate, crop images). More data for free.
When production data changes from what the model was trained on, causing accuracy to degrade over time.
Cleaning and transforming raw data before feeding it to a model. Handles missing values, scaling, encoding.
Makes predictions by learning if/else rules from data, forming a tree. Easy to read and interpret.
The rate of change of a function. Tells you which direction to adjust a parameter to reduce the loss.
Diffusion Model
Stable Diffusion, DALL-E
Generates images by learning to remove noise. Training: add noise to images. Inference: start with pure noise, denoise step by step.
Dimensionality Reduction
PCA, t-SNE, UMAP
Compressing data from many features to fewer features while keeping the important structure.
DPO
Direct Preference Optimization
A simpler alternative to RLHF. Directly trains the LLM on preference pairs without needing a separate reward model.
Randomly turns off neurons during training (set to 0). Forces the network to not rely on any single neuron.
DSPy
Declarative Self-improving Python
A framework that replaces manual prompt engineering with programming. You define what you want (signatures), and DSPy optimizes the prompts automatically.
Stop training when validation loss stops improving. Prevents overfitting by not training too long.
Converting tokens, words, or items into dense vectors of numbers that capture meaning. Similar items → nearby vectors.
Combining multiple models to get better predictions than any single model. Bagging, boosting, and stacking are the main strategies.
One complete pass through the entire training dataset. Most models train for 10-100+ epochs.
Evals
LLM-as-a-Judge, AI Evaluation
Using a strong LLM to grade the outputs of other models. Replaces traditional metrics like BLEU/ROUGE that fail to capture real-world quality.
Creating new input features from raw data to help the model learn better. Often more impactful than model choice.
Few-Shot Learning
In-Context Learning
Providing a few examples in the prompt so the model understands the task format. No training needed.
Taking a pre-trained model and training it further on your specific data. Faster and better than training from scratch.
A faster, memory-efficient implementation of attention. Same math, but optimized for GPU memory hierarchy.
GAN
Generative Adversarial Network
Two networks competing: a generator creates fake data, a discriminator tries to spot fakes. Both get better over time.
GGUF
GPT-Generated Unified Format
The standard file format for quantized LLMs used by llama.cpp and Ollama. Single file contains model weights, tokenizer, and metadata, ready to run locally.
GPT
Generative Pre-trained Transformer
Decoder-only transformer that generates text by predicting one token at a time. The architecture behind ChatGPT.
A vector of all partial derivatives that points in the direction of steepest increase. Go opposite to reduce the loss.
Simulating a larger batch size by accumulating gradients over multiple small forward passes before updating weights. Lets you train big models on small GPUs.
Gradient Boosting
XGBoost, LightGBM, CatBoost
Builds trees sequentially where each new tree corrects the errors of all previous trees combined. Dominates tabular data.
Trades compute for memory: recomputes activations during backward pass instead of storing them all. Saves ~60% memory.
The core optimization algorithm: compute the gradient of the loss, then take a small step in the opposite direction. Repeat.
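The whole loop fits in a few lines. A toy sketch minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step opposite the gradient
    return x

# Converges toward the minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Real training does exactly this, just over millions of weights at once with the gradient supplied by backpropagation.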
The simplest text generation strategy: always pick the single highest-probability token at each step. Fast but can miss better overall sequences.
Grouped-Query Attention
GQA
A memory-efficient attention variant where multiple query heads share the same key-value heads. Middle ground between fast Multi-Query and expressive Multi-Head attention.
Safety checks that prevent AI from producing harmful, biased, or off-topic outputs. Input and output filters.
When an LLM generates confident-sounding but factually wrong information. It makes stuff up.
Hugging Face
HF, Transformers Library
The GitHub of ML. Hub hosts 500K+ models and 100K+ datasets. Transformers library loads any model in 3 lines.
Settings YOU choose before training (learning rate, layers, batch size). The model cannot learn these, so you search for good values.
Image Segmentation
Semantic, Instance, SAM
Classifying every pixel in an image. Semantic = class per pixel. Instance = distinguish individual objects.
Using a trained model to make predictions on new data. Training is the learning phase; inference is the production phase where the model actually gets used.
k-Nearest Neighbors
k-NN, KNN
Classifies by finding the k closest data points and taking a majority vote. No training needed - just stores and compares.
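Because there is no training step, the entire algorithm is a sort plus a vote. A minimal sketch (pure Python, names illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (features, label) pairs. Classify query by
    # majority vote among the k nearest points (squared Euclidean).
    dist = lambda point: sum((a - b) ** 2 for a, b in zip(point, query))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

points = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
          ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_predict(points, (0.2, 0.2))  # lands in the "a" cluster
```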
Knowledge Distillation
Teacher-Student
Training a small fast student model to mimic a large accurate teacher model. Gets 90-99% of teacher quality at 10x speed.
Stores previously computed key and value tensors so the model does not recompute them for every new token.
LangChain
LangGraph, LangSmith
The most popular framework for building LLM apps. Provides chains, tools, memory, and agent abstractions.
A neural network trained on massive text data that generates human-like text by predicting the next token. GPT-4, Claude, Llama, Gemini are all LLMs.
Latent Space
Embedding Space
A compressed, learned representation where similar inputs map to nearby points. The hidden space inside a model.
A group of neurons that processes data together. Networks stack layers, and each transforms input and passes the result forward.
Learning Rate
lr, Step Size
How big of a step to take during gradient descent. Too high = overshoot. Too low = painfully slow.
Simplest ML algorithm. Fits a straight line through data: y = w*x + b. Learns weights by minimizing squared error.
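For one feature, the squared-error minimum has a closed form: w = cov(x, y) / var(x), b = mean(y) - w * mean(x). A minimal sketch:

```python
def fit_line(xs, ys):
    # Least-squares fit of y = w*x + b in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    return w, my - w * mx

# Data on the line y = 2x + 1 recovers w = 2, b = 1.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```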
Despite the name, it is for classification. Applies sigmoid to a linear equation to output probabilities between 0 and 1.
Fine-tunes only tiny adapter matrices instead of the full model. Trains 100x fewer parameters with similar results.
Loss Function
Cost Function, Objective
A function that measures how wrong the model predictions are. Training = minimizing the loss function.
LSTM
Long Short-Term Memory
An RNN variant with gates that control what to remember and forget. Mitigates the vanishing gradient problem.
MCP
Model Context Protocol
An open standard (by Anthropic) for connecting AI models to tools. Like USB for AI - plug in any tool server.
Mixed Precision Training
FP16, BF16, AMP
Use 16-bit floats for most computation, keep 32-bit master weights. Halves memory, 2x faster on modern GPUs.
Mixture of Experts
MoE, Sparse MoE
Architecture where only a subset of "expert" sub-networks activate per token. A router decides which 2 of 8+ experts handle each token, keeping compute constant.
DevOps for ML: CI/CD, versioning, monitoring, and lifecycle management for models, data, and code.
Deploying a trained model as an API that handles prediction requests. FastAPI for simple cases, vLLM for LLMs.
Multimodal AI
Vision-Language Models
Models that understand multiple data types (text, images, audio, video) in a single architecture.
Named Entity Recognition
NER
Automatically finding and labeling entities in text: people, companies, locations, dates, amounts.
Layers of connected nodes that learn patterns from data by adjusting weights during training. Input goes in, gets transformed, predictions come out.
Neuron
Node, Unit, Perceptron
A single unit that takes inputs, multiplies by weights, adds a bias, and applies an activation function to produce output.
Normalization
BatchNorm, LayerNorm
Normalizes activations between layers to stabilize and speed up training. Keeps values in a reasonable range.
Object Detection
YOLO, R-CNN
Finding AND locating objects in images. Outputs bounding boxes with class labels and confidence scores.
Run LLMs locally with one command. Downloads, quantizes, and serves models on your own machine.
ONNX
Open Neural Network Exchange
A standard format for ML models. Train in PyTorch, export to ONNX, run anywhere with ONNX Runtime (2-3x faster).
Optimizer
Adam, SGD, AdamW
The algorithm that updates weights using gradients. Adam is the default because it adapts the learning rate per-parameter.
When a model memorizes training data instead of learning general patterns. High training accuracy, low test accuracy.
Memory management for LLM inference that handles KV cache like an OS handles RAM, using virtual memory pages instead of contiguous blocks. Powers vLLM.
Measures how well a language model predicts text. Lower = better. A perplexity of 10 means the model is choosing among ~10 options.
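Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each actual token. A minimal sketch:

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model gave each true next token.
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Assigning every token probability 0.1 gives perplexity 10:
# the model is as uncertain as picking among 10 equally likely options.
ppl = perplexity([0.1, 0.1, 0.1])
```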
Precision & Recall
F1 Score
Precision: of all positive predictions, how many were correct? Recall: of all actual positives, how many did you find?
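Both metrics (and their harmonic mean, F1) come straight from the true-positive / false-positive / false-negative counts. A minimal sketch for binary labels:

```python
def precision_recall_f1(y_true, y_pred):
    # Counts for the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 2 correct positives, 1 false alarm, 1 miss -> all three are 2/3.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```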
Prompt Caching
Prefix Caching, Context Caching
Reusing the computed KV cache for identical prompt prefixes across requests. If 100 users share the same system prompt, compute it once and reuse.
Crafting the input text to an LLM to get the best output. How you ask matters as much as what you ask.
An attack where malicious text in user input overrides the system prompt, making the AI do unintended things.
A Python agent framework by the creators of Pydantic. Uses type-safe, structured outputs and dependency injection for building production LLM applications.
Pyodide
Python in WASM, CPython in Browser
A port of CPython to WebAssembly that runs Python directly in the browser. No server needed, and NumPy, pandas, scikit-learn all work client-side.
Quantization
INT8, INT4, GPTQ, AWQ
Reducing model precision from 32-bit to 16/8/4-bit. Cuts memory and speeds up inference with minimal quality loss.
RAG
Retrieval-Augmented Generation
Combine an LLM with document retrieval. Fetch relevant docs, stuff them in the prompt, generate grounded answers.
An ensemble of many decision trees, each trained on random data subsets. Prediction = majority vote or average.
An agent pattern: Think → Act → Observe → repeat. Grounds reasoning in real observations from tool use.
Reasoning Models
System 2 Thinking, o1/o3-style
Models that "think before answering" by generating internal chain-of-thought reasoning steps. Trade speed for accuracy on hard problems like math, code, and logic.
Recurrent Neural Network
RNN
A network with memory for sequential data. Passes a hidden state from one step to the next.
Deliberately trying to make an AI system fail, produce harmful outputs, or behave unexpectedly. Find problems before users do.
Predicting a continuous number instead of a category. Output is a value, not a class label.
Regularization
L1, L2, Weight Decay
Penalizes model complexity to prevent overfitting. Adds a penalty term to the loss function.
An agent learns by trial and error: take actions in an environment, receive rewards, maximize total reward over time.
RLHF
Reinforcement Learning from Human Feedback
Fine-tuning LLMs using human preferences. Humans rank outputs → train a reward model → use RL to optimize the LLM.
ROC-AUC
Receiver Operating Characteristic
Measures classification performance across all thresholds. AUC = area under the ROC curve. 1.0 = perfect, 0.5 = random.
RoPE
Rotary Position Embeddings
Encodes token position by rotating the embedding vector. Enables models to generalize to longer sequences than seen during training.
Attention where Q, K, V all come from the same sequence. Each token relates to every other token in the input.
Finding results by meaning instead of keyword matching. Embed queries and documents, find closest vectors.
Determining if text is positive, negative, or neutral. The "hello world" of NLP classification.
Sliding Window Attention
SWA, Local Attention
Each token only attends to a fixed window of nearby tokens instead of the entire sequence. Cuts memory from O(n²) to O(n*w) where w is the window size.
Converts raw scores into probabilities that sum to 1. Standard output for multi-class classification.
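A minimal sketch, using the standard max-subtraction trick so large scores do not overflow:

```python
import math

def softmax(scores):
    # Subtracting the max changes nothing mathematically
    # but keeps exp() from overflowing on big scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # sums to 1, biggest score wins
```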
Speculative Decoding
Assisted Generation
Speed trick: a tiny draft model proposes several tokens at once, then the big model verifies them in a single pass. Accepted tokens skip the slow generation step.
Training on labeled data where you provide inputs AND correct answers. The model learns to predict answers for new inputs.
SVM
Support Vector Machine
Finds the best boundary (hyperplane) that separates classes with the maximum margin between them.
The tendency of LLMs to agree with the user even when the user is wrong, prioritizing "being liked" over "being correct." A major alignment challenge in 2026.
Synthetic Data
LLM-Generated Training Data
Training data generated by AI models rather than collected from humans. A strong teacher model creates labeled examples to train smaller, specialized student models.
Controls randomness in text generation. 0 = always pick the most likely token. 1+ = more random and creative.
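Mechanically, temperature divides the logits before the softmax: values below 1 sharpen the distribution toward the top token, values above 1 flatten it. A minimal sketch (T = 0 itself is handled as a special case in practice, since it would divide by zero):

```python
import math

def sample_distribution(logits, temperature=1.0):
    # Scale logits by 1/T, then softmax (with max-subtraction
    # for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

cold = sample_distribution([1.0, 2.0, 3.0], temperature=0.5)  # peaky
hot = sample_distribution([1.0, 2.0, 3.0], temperature=2.0)   # flatter
```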
TF-IDF
Term Frequency-Inverse Document Frequency
Scores how important a word is to a document. High frequency in this doc + rare across all docs = high TF-IDF.
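The score is literally the two factors multiplied. A minimal sketch with documents as token lists (function name illustrative):

```python
import math

def tf_idf(term, doc, corpus):
    # tf: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # idf: log of (total docs / docs containing the term).
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [["cat", "sat", "mat"], ["dog", "sat", "mat"], ["cat", "ran", "mat"]]
score = tf_idf("cat", corpus[0], corpus)  # positive: "cat" is in 2 of 3 docs
```

A word appearing in every document gets idf = log(1) = 0, so it scores zero no matter how frequent it is.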
Tokenization
BPE, WordPiece
Splitting text into sub-word tokens that the model processes. "unhappiness" → ["un", "happi", "ness"].
LLMs calling external functions by generating structured JSON with function name and arguments. Text generator → action taker.
Only sample from tokens whose cumulative probability exceeds p. Dynamically adjusts how many tokens to consider.
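The filtering step can be sketched as: sort tokens by probability, keep the smallest prefix whose mass reaches p, renormalize, then sample from that nucleus (names illustrative):

```python
def top_p_filter(probs, p=0.9):
    # Keep the highest-probability tokens until their cumulative
    # mass reaches p, then renormalize the survivors.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {idx: prob / total for idx, prob in kept}

# With probs [0.5, 0.3, 0.15, 0.05] and p=0.8, only the top
# two tokens survive; everything else gets probability 0.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.8)
```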
Train/Val/Test Split
Holdout
Divide data into 3 parts: train on one, tune on another, evaluate once on the last. Prevents cheating.
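A minimal sketch of the split itself, assuming a shuffle-once-then-slice approach (fractions and seed are illustrative defaults):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    # Shuffle once with a fixed seed, then slice into three
    # disjoint parts: test, validation, and the rest for training.
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))  # 70 / 15 / 15
```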
Using knowledge from a model trained on one task to jumpstart learning on a different task. Why nobody trains from scratch.
The architecture behind all modern LLMs. Uses self-attention to process all tokens in parallel instead of one at a time.
An advanced quantization technique that dynamically adjusts precision per-layer based on sensitivity analysis. Preserves quality on critical layers while aggressively compressing others.
When a model is too simple to capture the patterns in data. Low accuracy on both training AND test data.
Training WITHOUT labels. The model finds hidden patterns, groups, or structure in data on its own.
Vector Database
Pinecone, Weaviate, Chroma
A database optimized for storing and searching embedding vectors. Find similar items by meaning, not keywords.
Applies the transformer architecture to images by splitting them into patches and treating each patch as a token.
Weights multiply inputs, biases shift the result. These are the learnable numbers that get adjusted during training.
A 2013 model that learns word vectors by predicting surrounding words. Showed that word meaning could be captured as math.
Zero-Shot Learning
Zero-Shot Classification
Using a model on tasks it was never specifically trained for, by describing the task in natural language.