Google's July 2025 Gemini-based embedder. 3072 dims, 2K context, Matryoshka to 1536/768, 100+ languages. Topped MTEB multilingual at launch. $0.15 per 1M tokens.
Nine sentences plotted in a 3D projection of the embedding space. Drag empty space to rotate, drag a point to move it, click to see cosine similarity. Move points closer and watch the similarity score climb in real time.
Values are illustrative, not from the actual model. Real scores depend on the sentences and the model's training data. The shape is what embedding models produce: semantic neighbors cluster, unrelated topics separate. We show three axes here, but real embeddings live in hundreds to thousands of dimensions.
Google's entry in the frontier-embedder race, built on Gemini rather than a from-scratch encoder. Launched at the top of the MTEB multilingual leaderboard with a 68.32 mean task score across 100+ languages. Matryoshka truncation at 3072 / 1536 / 768 is standard equipment, and the task_type parameter handles asymmetric retrieval without prompt engineering.
3072 default dims with Matryoshka cuts at 1536 and 768. 2,048-token input limit - shorter than OpenAI's 8K and Cohere's 128K. MTEB multilingual mean 68.32 (rank 1 at GA). 100+ language coverage. $0.15 per 1M input tokens on the Gemini API and Vertex AI.
Shortest context of the closed frontier embedders (2K vs OpenAI's 8K and Cohere's 128K) - long documents still need chunking. Priced above text-embedding-3-large ($0.15 vs $0.13 per 1M tokens) and Cohere embed-v4 ($0.12). Closed weights, Google Cloud lock-in. Newer ecosystem means fewer LangChain / LlamaIndex tutorials than OpenAI.
Multilingual RAG where cross-language retrieval quality matters more than context length, especially on Vertex AI / Google Cloud. Natural pairing with Gemini for LLM generation. For English-only or long-document workloads, text-embedding-3-large or Cohere embed-v4 remain stronger picks.
| Spec | Value |
|---|---|
| Released | July 2025 |
| Organization | Google DeepMind |
| License | Google API Terms of Service (API only, weights not released) |
| Backbone | Decoder LLM |
| Parameters | Not disclosed |
| Embedding dimensions | 3,072 · Matryoshka: 768 / 1536 / 3072 |
| Max context | 2,048 tokens |
| Pooling | Proprietary |
| Training objective | Contrastive + Matryoshka + Multilingual + Instruction tuning |
| MTEB (multilingual, mean) | 68.32 |
| Multilingual | Yes |
| Self-hosted | No (API only) |
| Cost | $0.15 per 1M tokens |
Unlike the BERT-era encoder stack most embedders still use, gemini-embedding-001 is built on the Gemini decoder-LLM family. Same architectural shift that E5-Mistral-7B made in open weights, now in a production-grade closed API with the full Gemini pretraining corpus behind it.
Ranked #1 on the MTEB multilingual leaderboard at general availability (July 2025) with a 68.32 mean task score across retrieval, classification, clustering, and STS. Held the top slot for months against every open and closed competitor.
3072 default, truncatable to 1536 and 768 without re-embedding. The cut-points match text-embedding-3-large and Voyage-3-large, which makes A/B testing across vendors possible without reshaping your vector DB.
Callers pass one of eight values: RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CODE_RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, or FACT_VERIFICATION. Same underlying model, different output distribution per task - so queries and indexed documents can use different embeddings without manual prompt prefixes.
Why: Shorter than every competing frontier API. Google's argument is that Matryoshka plus higher per-token quality matter more than raw context length for most retrieval workloads, and longer contexts dilute the pooled vector. Long documents still need external chunking.
Why: Leverages Gemini pretraining - most of the capex is already spent. Decoder-LLM backbones have shown retrieval-quality lifts in open research (E5-Mistral, NV-Embed), and Google can apply that playbook without training a separate encoder from scratch.
Why: Consistent with Google's Gemini distribution: API-first, enterprise contract for on-prem and compliance tiers. Weights stay internal; the product is inference. Keeps the moat around Gemini training data and infrastructure.
Embed documents with `task_type="RETRIEVAL_DOCUMENT"` and queries with `task_type="RETRIEVAL_QUERY"`. The Matryoshka `output_dimensionality` accepts any value from 128 to 3,072; Google recommends 768, 1,536, or 3,072 as the quality-preserving cuts. Passing the wrong task_type silently costs measurable retrieval quality - it is not a cosmetic parameter.
from google import genai
from google.genai import types
client = genai.Client()
docs = [
"gemini-embedding-001 topped MTEB multilingual at launch.",
"Matryoshka cuts: 3072 default, 1536, or 768 dim.",
"Task types include RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CLUSTERING.",
]
# Index documents with RETRIEVAL_DOCUMENT
doc_resp = client.models.embed_content(
model="gemini-embedding-001",
contents=docs,
config=types.EmbedContentConfig(
task_type="RETRIEVAL_DOCUMENT",
output_dimensionality=1536,
),
)
doc_vecs = [e.values for e in doc_resp.embeddings]
# Query with RETRIEVAL_QUERY
query_resp = client.models.embed_content(
model="gemini-embedding-001",
contents=["How many Matryoshka cuts does Gemini support?"],
config=types.EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=1536,
),
)

Eight task types are supported: RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CODE_RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, and FACT_VERIFICATION. The 2,048-token input limit is enforced silently - longer inputs are truncated, so chunk before embedding. Changing `output_dimensionality` does not affect latency (truncation is post-encode), but it directly affects vector DB storage.
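To rank the indexed documents against the query above, pull the query vector out of the response and score by cosine similarity. A minimal follow-up sketch using numpy, reusing `docs`, `doc_vecs`, and `query_resp` from the snippet above:
import numpy as np
# Extract the single query embedding and stack the document vectors
query_vec = np.array(query_resp.embeddings[0].values)
doc_matrix = np.array(doc_vecs)
# Cosine similarity = dot product / product of norms, then rank documents by score
scores = doc_matrix @ query_vec / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")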
p50 around 90-130 ms and p95 around 300-400 ms from the Gemini API endpoint. Vertex AI routes through Google Cloud regions and matches or beats the public API on latency for customers already on GCP. Quality and latency do not vary with output_dimensionality - the model always runs at full fidelity internally and truncates on the way out.
Rate limits depend on Google Cloud project quota. The default free tier is modest (around 1,500 requests per minute); paid projects can request higher limits. Batch up to 100 inputs per call on Vertex AI and on the public Gemini API.
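The 100-inputs-per-call cap translates directly into a batching loop. A minimal sketch, reusing the `client` and `types` objects from the earlier example; the batch size is the limit quoted above, so verify it against current quota docs before relying on it:
import itertools
def embed_corpus(client, texts, batch_size=100):
    # Send at most batch_size inputs per embed_content call
    vectors = []
    it = iter(texts)
    while batch := list(itertools.islice(it, batch_size)):
        resp = client.models.embed_content(
            model="gemini-embedding-001",
            contents=batch,
            config=types.EmbedContentConfig(
                task_type="RETRIEVAL_DOCUMENT",
                output_dimensionality=1536,
            ),
        )
        vectors.extend(e.values for e in resp.embeddings)
    return vectors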
$0.15 per 1M input tokens on both Gemini API and Vertex AI. A billion-token reindex is $150; English Wikipedia at roughly 4B tokens costs $600. No batch discount at the time of writing, so heavy-reindex workloads pay full freight.
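The math behind those figures is a one-liner; a small sketch for budgeting a reindex at the $0.15 rate quoted above:
def embedding_cost_usd(total_tokens, price_per_million_tokens=0.15):
    # Cost scales linearly with input tokens; there is no batch discount to model
    return total_tokens / 1_000_000 * price_per_million_tokens
print(embedding_cost_usd(1_000_000_000))  # 150.0 - one-billion-token reindex
print(embedding_cost_usd(4_000_000_000))  # 600.0 - English Wikipedia at ~4B tokens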
No self-hosting; calls go to Google Cloud. For SOC 2 Type 2, HIPAA, or FedRAMP workloads, use Vertex AI under an existing Google Cloud compliance posture. Vector DB sizing at 3,072 dim is 12,288 bytes per vector (float32) or 6,144 bytes (float16); truncating to 768 cuts those to 3,072 and 1,536 bytes respectively.
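Total index size follows from the same per-vector numbers; a sketch that covers raw vector storage only (index structures such as HNSW graphs add overhead on top):
def raw_vector_storage_gb(num_vectors, dims=3072, bytes_per_value=4):
    # float32 = 4 bytes per value, float16 = 2
    return num_vectors * dims * bytes_per_value / 1e9
print(raw_vector_storage_gb(10_000_000))          # ~122.9 GB at 3,072-dim float32
print(raw_vector_storage_gb(10_000_000, 768, 2))  # ~15.4 GB at 768-dim float16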
Native SDK is the `google-genai` Python library, with `@google/genai` for JavaScript/TypeScript and a separate Go SDK. Integrated in LangChain (`GoogleGenerativeAIEmbeddings` from `langchain_google_genai`), LlamaIndex (`GoogleGenAIEmbedding` from `llama-index-embeddings-google-genai` - the older `GeminiEmbedding` package was deprecated as of v0.4.2), and Haystack's Google adapter. Vertex AI Vector Search is the first-party vector DB; every major third-party vector DB (Pinecone, Weaviate, Qdrant, Milvus) works as long as it stores a fixed-dim float vector. The `task_type` parameter is forwarded through the SDK wrappers without manual handling on the application side.
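A minimal LangChain sketch, assuming the `GoogleGenerativeAIEmbeddings` wrapper accepts the model name and a `task_type` argument as shown - parameter names and accepted task-type casing vary between `langchain_google_genai` releases, so verify against the installed version:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
# Separate embedders for asymmetric retrieval; task_type is assumed to pass
# through to the underlying embed call - check your installed version's docs
doc_embedder = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
)
query_embedder = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="RETRIEVAL_QUERY",
)
doc_vectors = doc_embedder.embed_documents(["chunk one", "chunk two"])
query_vector = query_embedder.embed_query("what does gemini-embedding-001 cost?")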
gemini-embedding-001 topped the MTEB multilingual v2 leaderboard at general availability in July 2025 with a mean task score of 68.32 across 100+ languages. English-only MTEB numbers are competitive but not dominant - the model's design target is cross-language retrieval quality, not English-only classification or clustering. For monolingual English RAG, run your own eval; the multilingual leaderboard win does not always carry over. The 2,048-token limit also changes the shape of the eval, so benchmark on chunked documents that match your production chunker.
Gemini leads on multilingual (MTEB multilingual 68.32 at launch) and ships with an explicit task_type parameter. OpenAI wins on context length (8K vs 2K), ecosystem maturity, and pricing ($0.13 vs $0.15 per 1M tokens). For English-only, short-to-medium RAG, OpenAI is still the simpler default.
Both target multilingual enterprise retrieval. Cohere wins on context (128K vs 2K) and multimodality (text + images in one model). Gemini wins on MTEB multilingual leaderboard rank and integration with the rest of Google Cloud. For 128K context or image embeddings, pick Cohere; otherwise Gemini is very competitive.
Google's stance is that long pooled vectors lose discriminative power - past a few thousand tokens the average representation gets noisier. Chunking long documents externally and embedding each chunk is the recommended pattern, the same approach Ada-002 required.
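A minimal chunking sketch under that recommendation, using a word-count proxy for tokens (the real limit is 2,048 model tokens, so leave generous headroom or swap in a proper tokenizer):
def chunk_text(text, max_words=300, overlap=50):
    # Overlapping word windows keep each chunk comfortably under the 2,048-token limit
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]
Each chunk is then embedded with `task_type="RETRIEVAL_DOCUMENT"`, as in the example above.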
It shifts the output distribution for the task at hand. RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT produce complementary embeddings for asymmetric search. CLASSIFICATION and CLUSTERING produce embeddings tuned for those workloads. Using the wrong task_type measurably hurts quality on that task.
Yes. The model is Matryoshka-trained, so you can request 3072 (default), 1536, or 768 dims. Truncated vectors remain directly comparable via cosine similarity and match the cut-points used by OpenAI and Voyage, which simplifies A/B testing across vendors.
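If you already hold full 3,072-dim vectors, the lower cuts can be approximated client-side by truncating and re-normalizing - an illustrative sketch of Matryoshka truncation, assuming (as is typical for Matryoshka embeddings) that truncated vectors should be L2-normalized before cosine comparison; requesting `output_dimensionality` from the API remains the supported path:
import numpy as np
def truncate_and_normalize(vec, dims=768):
    # Matryoshka cut: keep the leading dims values, then L2-normalize
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)
# Illustrative stand-ins for two full 3,072-dim embeddings
rng = np.random.default_rng(0)
full_a, full_b = rng.normal(size=(2, 3072))
a, b = truncate_and_normalize(full_a), truncate_and_normalize(full_b)
print(float(a @ b))  # cosine similarity at 768 dims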
For multilingual workloads - likely yes, especially if you are already on Google Cloud. For English-only RAG with long-context requirements - probably not. Run your own eval set; MTEB leaderboard wins do not always survive contact with a production corpus.