Calculate Term Frequency-Inverse Document Frequency (TF-IDF) scores, which weight a term by how often it appears in a document (TF) and how rare it is across the corpus (IDF). Essential for text analysis, search engines, and NLP.
| Term | TF | DF | IDF | TF-IDF |
|---|---|---|---|---|
| machine | 0.07 | 1 | 1.92 | 0.14 |
| is | 0.07 | 1 | 1.92 | 0.14 |
| subset | 0.07 | 1 | 1.92 | 0.14 |
| of | 0.07 | 1 | 1.92 | 0.14 |
| artificial | 0.07 | 1 | 1.92 | 0.14 |
| intelligence | 0.07 | 1 | 1.92 | 0.14 |
| that | 0.07 | 1 | 1.92 | 0.14 |
| learning | 0.07 | 2 | 1.51 | 0.11 |
| enables | 0.07 | 2 | 1.51 | 0.11 |
| computers | 0.07 | 2 | 1.51 | 0.11 |
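The numbers in the table are consistent with scikit-learn's smoothed IDF over a 4-document corpus with a 14-word target document (the corpus itself isn't shown, so those sizes are inferred). A minimal sketch under that assumption:

```python
import math

def smoothed_idf(n_docs, df):
    # scikit-learn default (smooth_idf=True): log((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

N = 4         # assumed corpus size (not given in the table)
doc_len = 14  # assumed document length, since TF = 1/14 ≈ 0.07

for term, df in [("machine", 1), ("learning", 2)]:
    tf = 1 / doc_len
    idf = smoothed_idf(N, df)
    print(term, round(tf, 2), round(idf, 2), round(tf * idf, 2))
# machine 0.07 1.92 0.14
# learning 0.07 1.51 0.11
```

Both rows reproduce the table exactly, which suggests the scores were computed with sklearn-style smoothing.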
Words like "the", "is", "a" appear in almost every document, so they have high document frequency (DF). The IDF formula divides by DF, giving common words low IDF scores. This means even if "the" appears frequently in a document (high TF), its TF-IDF will be low because it doesn't help distinguish that document from others.
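The effect is easy to see with the raw IDF formula, log(N/DF), on made-up document frequencies:

```python
import math

N = 1000  # hypothetical corpus size
# "the" appears in nearly every document; "neural" is rare
for term, df in [("the", 990), ("neural", 12)]:
    idf = math.log(N / df)
    print(f"{term}: idf = {idf:.2f}")
# the: idf = 0.01
# neural: idf = 4.42
```

No matter how high TF gets for "the", multiplying by an IDF near zero keeps its TF-IDF near zero.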
BM25 is an improved version of TF-IDF used in modern search engines. It adds saturation (diminishing returns for term frequency) and document length normalization. While TF-IDF keeps increasing linearly with term count, BM25 plateaus. BM25 is the default in Elasticsearch and used in many RAG implementations.
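A sketch of the BM25 per-term score illustrating the saturation effect (k1 and b are the usual defaults; exact IDF handling varies by implementation):

```python
import math

def bm25_term(tf, idf, doc_len, avg_len, k1=1.5, b=0.75):
    # length-normalized saturation: the tf component is bounded by k1 + 1
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

idf = 2.0
for tf in [1, 5, 25, 125]:
    print(tf, round(bm25_term(tf, idf, doc_len=100, avg_len=100), 2))
# 1 2.0
# 5 3.85
# 25 4.72
# 125 4.94
```

Scores climb toward idf * (k1 + 1) = 5.0 but never reach it, whereas a raw-count TF-IDF score would grow 125x between the first and last row.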
TF-IDF is great for exact keyword matching and is fast and interpretable. Embeddings (from models like BERT or sentence-transformers) capture semantic meaning, so "dog" and "puppy" end up close together. Most modern systems use a hybrid approach: TF-IDF/BM25 for initial retrieval, then re-ranking with embeddings. For RAG, combining both often works best.
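One simple way to combine the two retrievers is to min-max normalize each score set and blend them; the weight and the normalization here are illustrative choices (reciprocal rank fusion is a common alternative):

```python
def fuse(sparse_scores, dense_scores, weight=0.5):
    """Blend two {doc_id: score} dicts after min-max normalization."""
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero if all scores equal
        return {d: (s - lo) / span for d, s in scores.items()}
    sp, de = minmax(sparse_scores), minmax(dense_scores)
    return {d: weight * sp[d] + (1 - weight) * de[d] for d in sp}

# hypothetical scores from BM25 and an embedding model
bm25 = {"doc1": 12.3, "doc2": 4.1, "doc3": 7.8}
dense = {"doc1": 0.62, "doc2": 0.88, "doc3": 0.35}
fused = fuse(bm25, dense)
print(max(fused, key=fused.get))  # doc1
```

Normalization matters because BM25 scores and cosine similarities live on different scales and can't be added directly.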
Use TfidfVectorizer: `from sklearn.feature_extraction.text import TfidfVectorizer; vectorizer = TfidfVectorizer(); tfidf_matrix = vectorizer.fit_transform(documents)`. This returns a sparse matrix where each row is a document and each column is a term. Use `vectorizer.get_feature_names_out()` to see the terms.
Without smoothing, a term appearing in all N documents would have IDF = log(N/N) = 0, zeroing its TF-IDF even if it matters within a document. sklearn's TfidfVectorizer smooths by default (`smooth_idf=True`) with two adjustments: it adds 1 to both the document count and each document frequency, as if one extra document contained every term (which also avoids division by zero for terms unseen at fit time), and then adds 1 to the result of the log, so a term appearing in every document gets IDF = 1 rather than 0.
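The edge case, numerically, comparing raw IDF with sklearn's smoothed variant for a term that appears in every one of N documents:

```python
import math

N = 4  # example corpus size

def raw_idf(df):
    return math.log(N / df)

def smoothed_idf(df):
    # sklearn smooth_idf=True: log((1 + N) / (1 + df)) + 1
    return math.log((1 + N) / (1 + df)) + 1

# term in all 4 documents: raw IDF collapses to 0, smoothed floor is 1
print(raw_idf(4), smoothed_idf(4))  # 0.0 1.0
```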