Calculate Term Frequency-Inverse Document Frequency (TF-IDF) scores, which weight a term by how often it appears in a document (TF) and how rare it is across the corpus (IDF). Essential for text analysis, search engines, and NLP.
| Term | TF | DF | IDF | TF-IDF |
|---|---|---|---|---|
| machine | 0.07 | 1 | 1.92 | 0.14 |
| is | 0.07 | 1 | 1.92 | 0.14 |
| subset | 0.07 | 1 | 1.92 | 0.14 |
| of | 0.07 | 1 | 1.92 | 0.14 |
| artificial | 0.07 | 1 | 1.92 | 0.14 |
| intelligence | 0.07 | 1 | 1.92 | 0.14 |
| that | 0.07 | 1 | 1.92 | 0.14 |
| learning | 0.07 | 2 | 1.51 | 0.11 |
| enables | 0.07 | 2 | 1.51 | 0.11 |
| computers | 0.07 | 2 | 1.51 | 0.11 |
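The numbers in the table are consistent with scikit-learn's smoothed IDF over a 4-document corpus with a 14-word target document (the corpus itself isn't shown, so those sizes are inferred). A minimal sketch under that assumption:

```python
import math

def smoothed_idf(n_docs, df):
    # scikit-learn default (smooth_idf=True): log((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

N = 4         # assumed corpus size (not given in the table)
doc_len = 14  # assumed document length, since TF = 1/14 ≈ 0.07

for term, df in [("machine", 1), ("learning", 2)]:
    tf = 1 / doc_len
    idf = smoothed_idf(N, df)
    print(term, round(tf, 2), round(idf, 2), round(tf * idf, 2))
# machine 0.07 1.92 0.14
# learning 0.07 1.51 0.11
```

Both rows reproduce the table exactly, which suggests the scores were computed with sklearn-style smoothing.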
Words like "the", "is", "a" appear in almost every document, so they have high document frequency (DF). The IDF formula divides by DF, giving common words low IDF scores. This means even if "the" appears frequently in a document (high TF), its TF-IDF will be low because it doesn't help distinguish that document from others.
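The effect is easy to see with the raw IDF formula, log(N/DF), on made-up document frequencies:

```python
import math

N = 1000  # hypothetical corpus size
# "the" appears in nearly every document; "neural" is rare
for term, df in [("the", 990), ("neural", 12)]:
    idf = math.log(N / df)
    print(f"{term}: idf = {idf:.2f}")
# the: idf = 0.01
# neural: idf = 4.42
```

No matter how high TF gets for "the", multiplying by an IDF near zero keeps its TF-IDF near zero.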
BM25 is an improved version of TF-IDF used in modern search engines. It adds saturation (diminishing returns for term frequency) and document length normalization. While TF-IDF keeps increasing linearly with term count, BM25 plateaus. BM25 is the default in Elasticsearch and used in many RAG implementations.
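A sketch of the BM25 per-term score illustrating the saturation effect (k1 and b are the usual defaults; exact IDF handling varies by implementation):

```python
import math

def bm25_term(tf, idf, doc_len, avg_len, k1=1.5, b=0.75):
    # length-normalized saturation: the tf component is bounded by k1 + 1
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

idf = 2.0
for tf in [1, 5, 25, 125]:
    print(tf, round(bm25_term(tf, idf, doc_len=100, avg_len=100), 2))
# 1 2.0
# 5 3.85
# 25 4.72
# 125 4.94
```

Scores climb toward idf * (k1 + 1) = 5.0 but never reach it, whereas a raw-count TF-IDF score would grow 125x between the first and last row.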
TF-IDF is great for exact keyword matching and is fast and interpretable. Embeddings (from models like BERT or sentence-transformers) capture semantic meaning, so "dog" and "puppy" end up close together. Most modern systems use a hybrid approach: TF-IDF/BM25 for initial retrieval, then re-ranking with embeddings. For RAG, combining both often works best.
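One simple way to combine the two retrievers is to min-max normalize each score set and blend them; the weight and the normalization here are illustrative choices (reciprocal rank fusion is a common alternative):

```python
def fuse(sparse_scores, dense_scores, weight=0.5):
    """Blend two {doc_id: score} dicts after min-max normalization."""
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero if all scores equal
        return {d: (s - lo) / span for d, s in scores.items()}
    sp, de = minmax(sparse_scores), minmax(dense_scores)
    return {d: weight * sp[d] + (1 - weight) * de[d] for d in sp}

# hypothetical scores from BM25 and an embedding model
bm25 = {"doc1": 12.3, "doc2": 4.1, "doc3": 7.8}
dense = {"doc1": 0.62, "doc2": 0.88, "doc3": 0.35}
fused = fuse(bm25, dense)
print(max(fused, key=fused.get))  # doc1
```

Normalization matters because BM25 scores and cosine similarities live on different scales and can't be added directly.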
Use TfidfVectorizer: `from sklearn.feature_extraction.text import TfidfVectorizer; vectorizer = TfidfVectorizer(); tfidf_matrix = vectorizer.fit_transform(documents)`. This returns a sparse matrix where each row is a document and each column is a term. Use `vectorizer.get_feature_names_out()` to see the terms.
Without smoothing, a term appearing in all N documents would have IDF = log(N/N) = 0, zeroing its TF-IDF even if it matters within a document. sklearn's TfidfVectorizer smooths by default (`smooth_idf=True`) with two adjustments: it adds 1 to both the document count and each document frequency, as if one extra document contained every term (which also avoids division by zero for terms unseen at fit time), and then adds 1 to the result of the log, so a term appearing in every document gets IDF = 1 rather than 0.
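The edge case, numerically, comparing raw IDF with sklearn's smoothed variant for a term that appears in every one of N documents:

```python
import math

N = 4  # example corpus size

def raw_idf(df):
    return math.log(N / df)

def smoothed_idf(df):
    # sklearn smooth_idf=True: log((1 + N) / (1 + df)) + 1
    return math.log((1 + N) / (1 + df)) + 1

# term in all 4 documents: raw IDF collapses to 0, smoothed floor is 1
print(raw_idf(4), smoothed_idf(4))  # 0.0 1.0
```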