Google's July 2025 Gemini-based embedder. 3072 dims, 2K context, Matryoshka to 1536/768, 100+ languages. Topped MTEB multilingual at launch. $0.15 per 1M tokens.
Nine sentences plotted in a 3D projection of the embedding space. Drag empty space to rotate, drag a point to move it, click to see cosine similarity. Move points closer and watch the similarity score climb in real time.
Values are illustrative, not from the actual model. Real scores depend on the sentences and the model's training data. The shape is what embedding models produce: semantic neighbors cluster, unrelated topics separate. We show three axes here, but real embeddings live in hundreds to thousands of dimensions.
Google's entry in the frontier-embedder race, built on Gemini rather than a from-scratch encoder. Launched at the top of the MTEB multilingual leaderboard with a 68.32 mean task score across 100+ languages. Matryoshka truncation at 3072 / 1536 / 768 is standard equipment, and the task_type parameter handles asymmetric retrieval without prompt engineering.
3072 default dims with Matryoshka cuts at 1536 and 768. 2,048-token input limit - shorter than OpenAI's 8K and Cohere's 128K. MTEB multilingual mean 68.32 (rank 1 at GA). 100+ language coverage. $0.15 per 1M input tokens on the Gemini API and Vertex AI.
Shortest context of the closed frontier embedders (2K vs OpenAI's 8K and Cohere's 128K) - long documents still need chunking. Priced above text-embedding-3-large ($0.15 vs $0.13 per 1M tokens) and Cohere embed-v4 ($0.12). Closed weights, Google Cloud lock-in. Newer ecosystem means fewer LangChain / LlamaIndex tutorials than OpenAI.
Multilingual RAG where cross-language retrieval quality matters more than context length, especially on Vertex AI / Google Cloud. Natural pairing with Gemini for LLM generation. For English-only or long-document workloads, text-embedding-3-large or Cohere embed-v4 remain stronger picks.
| Spec | Value |
|---|---|
| Released | July 2025 |
| Organization | Google DeepMind |
| License | Google API Terms of Service (API only, weights not released) |
| Backbone | Decoder LLM |
| Parameters | Not disclosed |
| Embedding dimensions | 3,072 · Matryoshka: 768 / 1536 / 3072 |
| Max context | 2,048 tokens |
| Pooling | Proprietary |
| Training objective | Contrastive + Matryoshka + Multilingual + Instruction tuning |
| MTEB (multilingual, mean) | 68.32 |
| Multilingual | Yes |
| Self-hosted | No (API only) |
| Cost | $0.15 per 1M tokens |
Unlike the BERT-era encoder stack most embedders still use, gemini-embedding-001 is built on the Gemini decoder-LLM family. Same architectural shift that E5-Mistral-7B made in open weights, now in a production-grade closed API with the full Gemini pretraining corpus behind it.
Ranked #1 on the MTEB multilingual leaderboard at general availability (July 2025) with a 68.32 mean task score across retrieval, classification, clustering, and STS. Held the top slot for months against every open and closed competitor.
3072 default, truncatable to 1536 and 768 without re-embedding. The cut-points match text-embedding-3-large and Voyage-3-large, which makes A/B testing across vendors possible without reshaping your vector DB.
Callers pass one of eight values: RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CODE_RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, or FACT_VERIFICATION. Same underlying model, different output distribution per task - so queries and indexed documents can use different embeddings without manual prompt prefixes.
Why: Shorter than every competing frontier API. Google's argument is that Matryoshka plus higher per-token quality matter more than raw context length for most retrieval workloads, and longer contexts dilute the pooled vector. Long documents still need external chunking.
Why: Leverages Gemini pretraining - most of the capex is already spent. Decoder-LLM backbones have shown retrieval-quality lifts in open research (E5-Mistral, NV-Embed), and Google can apply that playbook without training a separate encoder from scratch.
Why: Consistent with Google's Gemini distribution: API-first, enterprise contract for on-prem and compliance tiers. Weights stay internal; the product is inference. Keeps the moat around Gemini training data and infrastructure.
Embed documents with `task_type="RETRIEVAL_DOCUMENT"` and queries with `task_type="RETRIEVAL_QUERY"`. The Matryoshka `output_dimensionality` accepts any value from 128 to 3,072; Google recommends 768, 1,536, or 3,072 as the quality-preserving cuts. Passing the wrong task_type silently costs measurable retrieval quality - it is not a cosmetic parameter.
from google import genai
from google.genai import types
client = genai.Client()
docs = [
"gemini-embedding-001 topped MTEB multilingual at launch.",
"Matryoshka cuts: 3072 default, 1536, or 768 dim.",
"Task types include RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CLUSTERING.",
]
# Index documents with RETRIEVAL_DOCUMENT
doc_resp = client.models.embed_content(
model="gemini-embedding-001",
contents=docs,
config=types.EmbedContentConfig(
task_type="RETRIEVAL_DOCUMENT",
output_dimensionality=1536,
),
)
doc_vecs = [e.values for e in doc_resp.embeddings]
# Query with RETRIEVAL_QUERY
query_resp = client.models.embed_content(
model="gemini-embedding-001",
contents=["How many Matryoshka cuts does Gemini support?"],
config=types.EmbedContentConfig(
task_type="RETRIEVAL_QUERY",
output_dimensionality=1536,
),
)

Eight task types are supported: RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, CODE_RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, QUESTION_ANSWERING, and FACT_VERIFICATION. The 2,048-token input limit is enforced silently - longer inputs are truncated, so chunk before embedding. Changing `output_dimensionality` does not affect latency (truncation is post-encode), but it directly affects vector DB storage.
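To rank the indexed documents against the query above, pull the query vector out of the response and score by cosine similarity. A minimal follow-up sketch using numpy, reusing `docs`, `doc_vecs`, and `query_resp` from the snippet above:
import numpy as np
# Extract the single query embedding and stack the document vectors
query_vec = np.array(query_resp.embeddings[0].values)
doc_matrix = np.array(doc_vecs)
# Cosine similarity = dot product / product of norms, then rank documents by score
scores = doc_matrix @ query_vec / (
    np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")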
p50 around 90-130 ms and p95 around 300-400 ms from the Gemini API endpoint. Vertex AI routes through Google Cloud regions and matches or beats the public API on latency for customers already on GCP. Quality and latency do not vary with output_dimensionality - the model always runs at full fidelity internally and truncates on the way out.
Rate limits depend on Google Cloud project quota. The default free tier is modest (around 1,500 requests per minute); paid projects can request higher limits. Batch up to 100 inputs per call on Vertex AI and on the public Gemini API.
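The 100-inputs-per-call cap translates directly into a batching loop. A minimal sketch, reusing the `client` and `types` objects from the earlier example; the batch size is the limit quoted above, so verify it against current quota docs before relying on it:
import itertools
def embed_corpus(client, texts, batch_size=100):
    # Send at most batch_size inputs per embed_content call
    vectors = []
    it = iter(texts)
    while batch := list(itertools.islice(it, batch_size)):
        resp = client.models.embed_content(
            model="gemini-embedding-001",
            contents=batch,
            config=types.EmbedContentConfig(
                task_type="RETRIEVAL_DOCUMENT",
                output_dimensionality=1536,
            ),
        )
        vectors.extend(e.values for e in resp.embeddings)
    return vectors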
$0.15 per 1M input tokens on both Gemini API and Vertex AI. A billion-token reindex is $150; English Wikipedia at roughly 4B tokens costs $600. No batch discount at the time of writing, so heavy-reindex workloads pay full freight.
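The math behind those figures is a one-liner; a small sketch for budgeting a reindex at the $0.15 rate quoted above:
def embedding_cost_usd(total_tokens, price_per_million_tokens=0.15):
    # Cost scales linearly with input tokens; there is no batch discount to model
    return total_tokens / 1_000_000 * price_per_million_tokens
print(embedding_cost_usd(1_000_000_000))  # 150.0 - one-billion-token reindex
print(embedding_cost_usd(4_000_000_000))  # 600.0 - English Wikipedia at ~4B tokens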
No self-hosting; calls go to Google Cloud. For SOC 2 Type 2, HIPAA, or FedRAMP workloads, use Vertex AI under an existing Google Cloud compliance posture. Vector DB sizing at 3,072 dim is 12,288 bytes per vector (float32) or 6,144 bytes (float16); truncating to 768 cuts those to 3,072 and 1,536 bytes respectively.
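Total index size follows from the same per-vector numbers; a sketch that covers raw vector storage only (index structures such as HNSW graphs add overhead on top):
def raw_vector_storage_gb(num_vectors, dims=3072, bytes_per_value=4):
    # float32 = 4 bytes per value, float16 = 2
    return num_vectors * dims * bytes_per_value / 1e9
print(raw_vector_storage_gb(10_000_000))          # ~122.9 GB at 3,072-dim float32
print(raw_vector_storage_gb(10_000_000, 768, 2))  # ~15.4 GB at 768-dim float16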
Native SDK is the `google-genai` Python library, with `@google/genai` for JavaScript/TypeScript and a separate Go SDK. Integrated in LangChain (`GoogleGenerativeAIEmbeddings` from `langchain_google_genai`), LlamaIndex (`GoogleGenAIEmbedding` from `llama-index-embeddings-google-genai` - the older `GeminiEmbedding` package was deprecated as of v0.4.2), and Haystack's Google adapter. Vertex AI Vector Search is the first-party vector DB; every major third-party vector DB (Pinecone, Weaviate, Qdrant, Milvus) works as long as it stores a fixed-dim float vector. The `task_type` parameter is forwarded through the SDK wrappers without manual handling on the application side.
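A minimal LangChain sketch, assuming the `GoogleGenerativeAIEmbeddings` wrapper accepts the model name and a `task_type` argument as shown - parameter names and accepted task-type casing vary between `langchain_google_genai` releases, so verify against the installed version:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
# Separate embedders for asymmetric retrieval; task_type is assumed to pass
# through to the underlying embed call - check your installed version's docs
doc_embedder = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
)
query_embedder = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="RETRIEVAL_QUERY",
)
doc_vectors = doc_embedder.embed_documents(["chunk one", "chunk two"])
query_vector = query_embedder.embed_query("what does gemini-embedding-001 cost?")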
gemini-embedding-001 topped the MTEB multilingual v2 leaderboard at general availability in July 2025 with a mean task score of 68.32 across 100+ languages. English-only MTEB numbers are competitive but not dominant - the model's design target is cross-language retrieval quality, not English-only classification or clustering. For monolingual English RAG, run your own eval; the multilingual leaderboard win does not always carry over. The 2,048-token limit also changes the shape of the eval, so benchmark on chunked documents that match your production chunker.
Gemini leads on multilingual (MTEB multilingual 68.32 at launch) and ships with an explicit task_type parameter. OpenAI wins on context length (8K vs 2K), ecosystem maturity, and pricing ($0.13 vs $0.15 per 1M tokens). For English-only, short-to-medium RAG, OpenAI is still the simpler default.
Both target multilingual enterprise retrieval. Cohere wins on context (128K vs 2K) and multimodality (text + images in one model). Gemini wins on MTEB multilingual leaderboard rank and integration with the rest of Google Cloud. For 128K context or image embeddings, pick Cohere; otherwise Gemini is very competitive.
Google's stance is that long pooled vectors lose discriminative power - past a few thousand tokens the average representation gets noisier. Chunking long documents externally and embedding each chunk is the recommended pattern, the same approach Ada-002 required.
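A minimal chunking sketch under that recommendation, using a word-count proxy for tokens (the real limit is 2,048 model tokens, so leave generous headroom or swap in a proper tokenizer):
def chunk_text(text, max_words=300, overlap=50):
    # Overlapping word windows keep each chunk comfortably under the 2,048-token limit
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]
Each chunk is then embedded with `task_type="RETRIEVAL_DOCUMENT"`, as in the example above.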
It shifts the output distribution for the task at hand. RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT produce complementary embeddings for asymmetric search. CLASSIFICATION and CLUSTERING produce embeddings tuned for those workloads. Using the wrong task_type measurably hurts quality on that task.
Yes. The model is Matryoshka-trained, so you can request 3072 (default), 1536, or 768 dims. Truncated vectors remain directly comparable via cosine similarity and match the cut-points used by OpenAI and Voyage, which simplifies A/B testing across vendors.
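If you already hold full 3,072-dim vectors, the lower cuts can be approximated client-side by truncating and re-normalizing - an illustrative sketch of Matryoshka truncation, assuming (as is typical for Matryoshka embeddings) that truncated vectors should be L2-normalized before cosine comparison; requesting `output_dimensionality` from the API remains the supported path:
import numpy as np
def truncate_and_normalize(vec, dims=768):
    # Matryoshka cut: keep the leading dims values, then L2-normalize
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)
# Illustrative stand-ins for two full 3,072-dim embeddings
rng = np.random.default_rng(0)
full_a, full_b = rng.normal(size=(2, 3072))
a, b = truncate_and_normalize(full_a), truncate_and_normalize(full_b)
print(float(a @ b))  # cosine similarity at 768 dims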
For multilingual workloads - likely yes, especially if you are already on Google Cloud. For English-only RAG with long-context requirements - probably not. Run your own eval set; MTEB leaderboard wins do not always survive contact with a production corpus.