Cohere's April 2025 flagship. 1536 dims, 128K context, Matryoshka, multimodal (text + images), 100+ languages. API-only, $0.12 per 1M tokens.
Nine sentences plotted in a 3D projection of the embedding space. Drag empty space to rotate, drag a point to move it, click to see cosine similarity. Move points closer and watch the similarity score climb in real time.
Values are illustrative, not from the actual model. Real scores depend on the sentences and the model's training data. The shape is what embedding models produce: semantic neighbors cluster, unrelated topics separate. We show three axes here, but real embeddings live in hundreds to thousands of dimensions.
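For intuition, the similarity score in the demo is plain cosine similarity: the dot product of two vectors after normalization. A toy sketch with illustrative 3-dim vectors (made up for this page, not real model outputs):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized dot product: 1.0 = same direction, ~0.0 = unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim points like the ones in the demo (illustrative only).
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.8, 0.2, 0.3])
invoice = np.array([-0.1, 0.9, -0.4])

print(cosine_similarity(cat, kitten))   # high: semantic neighbors
print(cosine_similarity(cat, invoice))  # low: unrelated topics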
First frontier embedder to handle text and images in a single unified model, speak 100+ languages at production quality, and stretch to a 128K context window no open embedder has matched. Also the vendor most Fortune-500 teams actually trust for regulated workloads.
1536 dims with Matryoshka cuts at 256/512/1024/1536. 128K context - roughly 16x OpenAI's 8K and 250x BGE's 512. MTEB overall ~66. $0.12 per 1M tokens. 100+ language support with a small cross-language quality gap.
API-only on the public tier; on-prem is gated behind enterprise contracts. Quality degrades past ~32K of context, so very long documents still benefit from chunking. English-only retrieval is competitive but not clearly ahead of English-tuned specialists.
Your corpus has non-English text, embedded images, or documents longer than 8K. Regulated enterprise with compliance needs where the Cohere enterprise contract (SOC 2, HIPAA, on-prem) is worth more than open weights. Otherwise simpler options suffice.
| Released | April 2025 |
|---|---|
| Organization | Cohere |
| License | Cohere Terms of Use (proprietary, API only) |
| Backbone | Proprietary |
| Parameters | Not disclosed |
| Embedding dimensions | 1,536 · Matryoshka: 256 / 512 / 1024 / 1536 |
| Max context | 128,000 tokens |
| Pooling | Proprietary |
| Training objective | Contrastive + Matryoshka + Multimodal + Multilingual |
| MTEB (overall) | 65.80 |
| MTEB (retrieval) | 56.10 |
| Multilingual | Yes |
| Self-hosted | No (API only) |
| Cost | $0.12 per 1M tokens |
Unlike CLIP-style dual encoders, which train separate text and image towers into alignment, embed-v4 runs both modalities through one model into one representation space. Cross-modal retrieval (a text query finding images, or vice versa) works without model switching or a reranker.
Roughly 16x OpenAI text-embedding-3-large's 8K and 250x BGE's 512 (128,000 / 512). Enough for whole contracts, spec sheets, or book chapters in a single call. Per-token quality degrades past ~32K but remains usable across the whole window.
Trained on 100+ languages including many low-resource ones. Cross-language retrieval quality is closer to the English baseline than most multilingual generics achieve.
256, 512, 1024, and 1536-dim cuts. Store full-size and truncate for storage or speed wins without re-embedding. Table stakes for 2026 embedders; Cohere adopted it early.
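A minimal sketch of the store-full, truncate-later pattern (illustrative numpy; assumes the stored vectors are the full 1,536-dim floats):

import numpy as np

def matryoshka_truncate(vec: np.ndarray, dim: int) -> np.ndarray:
    # Matryoshka training packs information front-to-back, so the
    # first `dim` components form a usable embedding on their own.
    assert dim in (256, 512, 1024, 1536)
    cut = vec[:dim]
    # Re-normalize so cosine similarity stays well-behaved after the cut.
    return cut / np.linalg.norm(cut)

full = np.random.randn(1536)            # stand-in for a stored embedding
small = matryoshka_truncate(full, 256)  # 6x storage savings, no re-embed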
Why: One model is simpler to deploy and gives better cross-modal alignment than separate text and image encoders with a shared projection head. Cohere bet that the model complexity was worth the unified API.
Why: Built into training, not retrofitted. Quality degrades gracefully rather than cliff-dropping at a specific length. Long-form workloads (legal, scientific) benefit without needing to chunk.
Why: Cohere targets regulated enterprise: banks, pharma, government. Product value is the SLA, SOC 2 / HIPAA posture, and on-prem contract, not just the embedding quality.
Embed documents and a query with Cohere embed-v4.0 using the asymmetric `search_document` vs `search_query` input types. Matryoshka truncation happens via the `output_dimension` parameter; valid values are 256, 512, 1,024, and 1,536. For multimodal search, pass base64-encoded images into the same endpoint and they land in the same 1,536-dim space as text.
import cohere

co = cohere.ClientV2()

docs = [
    "Matryoshka embeddings support multiple cuts.",
    "128K context fits whole contracts in one call.",
]

# input_type controls the embedding's downstream task
doc_resp = co.embed(
    model="embed-v4.0",
    texts=docs,
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
)

query_resp = co.embed(
    model="embed-v4.0",
    texts=["How does 128K context help RAG?"],
    input_type="search_query",
    embedding_types=["float"],
    output_dimension=1024,
)

The `input_type` parameter routes the call to task-specific output distributions - skip it and retrieval quality drops by several points. `embedding_types` supports "float", "int8", "uint8", "binary", "ubinary", and "base64" for storage and transport trade-offs. For cross-modal retrieval (text query, image corpus), index images with `input_type="search_document"` and query with `input_type="search_query"` - cosine similarity across modalities is directly meaningful because both map into the shared space.
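A hedged sketch of the image-indexing side. It assumes the `images` parameter (a base64 data URL) that Cohere's embed endpoint documents for image inputs; the exact multimodal request shape can differ across SDK versions, so check yours. `product_photo.png` is a hypothetical file:

import base64
import cohere

co = cohere.ClientV2()

# Wrap a local image as a base64 data URL.
with open("product_photo.png", "rb") as f:  # hypothetical file
    b64 = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{b64}"

# Index the image into the same 1,536-dim space as text.
# NOTE: `images=` per Cohere's image-embedding docs; the multimodal
# request shape for embed-v4.0 may differ by SDK version.
img_resp = co.embed(
    model="embed-v4.0",
    images=[data_url],
    input_type="search_document",
    embedding_types=["float"],
)

The text query from the snippet above can then be scored against these image vectors with plain cosine similarity, no second model in the loop.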
p50 around 120 ms and p95 around 350 ms from Cohere's production endpoint. The AWS-backed deployment serves from us-east-1, eu-west-1, and an APAC region; route to the closest for a tighter round-trip. On-prem deployments via Cohere's Secure enterprise plan trade the managed endpoint's latency floor for data-residency control.
Up to 96 inputs per batch and 128,000 tokens per request. Cohere's published embed rate limit is 2,000 inputs per minute on both trial and production tiers (per the v2 rate-limits docs); higher throughput requires a custom enterprise contract. For a 128K-context request, expect 1-2 seconds end-to-end - this is where long-context embedding's cost shows up, and why most teams still chunk past roughly 32K tokens.
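A client-side batching loop under those limits (hypothetical `embed_corpus` helper; retry logic and pacing against the 2,000-inputs-per-minute cap are left out for brevity):

import cohere

co = cohere.ClientV2()
MAX_BATCH = 96  # documented per-request input cap

def embed_corpus(texts: list[str], dim: int = 1024) -> list[list[float]]:
    # Hypothetical helper: chunks the corpus into API-sized batches.
    out: list[list[float]] = []
    for i in range(0, len(texts), MAX_BATCH):
        resp = co.embed(
            model="embed-v4.0",
            texts=texts[i : i + MAX_BATCH],
            input_type="search_document",
            embedding_types=["float"],
            output_dimension=dim,
        )
        # `.float_` in recent Python SDKs; field name may vary by version.
        out.extend(resp.embeddings.float_)
    return out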
$0.12 per 1M input tokens, slightly cheaper than OpenAI text-embedding-3-large at $0.13. A billion-token reindex is $120. Image inputs are billed as a separate `images` unit in the API response (not folded into `input_tokens`), and per-1M-image-token rates run higher than text - third-party pricing aggregators report around $0.47 per 1M image tokens. The cost story vs CLIP-style stacks is still favorable because you avoid running a second inference server, but text and images are not priced identically.
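As a back-of-envelope check (a hypothetical helper; the image rate is the third-party figure above, not an official price):

def embed_cost_usd(text_tokens: int = 0, image_tokens: int = 0) -> float:
    # $0.12 per 1M text tokens; ~$0.47 per 1M image tokens is a
    # third-party-reported figure -- verify before budgeting.
    return text_tokens / 1e6 * 0.12 + image_tokens / 1e6 * 0.47

print(embed_cost_usd(text_tokens=1_000_000_000))  # 120.0 for a 1B-token reindex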
Cohere offers a public API, AWS Bedrock, Azure AI, and Oracle Cloud deployments. For regulated workloads, Cohere North (on-prem enterprise) runs the same model in your VPC with SOC 2 Type 2 and HIPAA postures. No open weights; the on-prem license is a commercial contract negotiated with Cohere sales.
First-party support in LangChain (CohereEmbeddings), LlamaIndex (CohereEmbedding), Haystack, and AWS Bedrock's knowledge-base workflows. The multimodal endpoint works with any vector DB that stores a fixed-dim float vector - images normalize into the same 1,536-dim space as text, so Pinecone, Weaviate, Qdrant, and Milvus need no schema changes for cross-modal search. Binary and int8 output types are supported end-to-end in Qdrant and Weaviate.
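A minimal sketch of the LangChain path named above, assuming the `langchain-cohere` package and that it accepts the embed-v4.0 model id:

from langchain_cohere import CohereEmbeddings

emb = CohereEmbeddings(model="embed-v4.0")
doc_vecs = emb.embed_documents(["Matryoshka cuts at 256/512/1024/1536."])
query_vec = emb.embed_query("What dimensions does embed-v4 support?")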
Cohere embed-v4 reports MTEB English overall around 65.8 and multilingual MTEB scores that top the multilingual leaderboard across 100+ languages. The cross-modal evaluation uses Cohere's own image-text retrieval benchmark - there is no MTEB equivalent for multimodal embedding, so the multimodal score is vendor-reported rather than third-party-audited. Validate on your own catalog before committing; cross-modal retrieval is especially corpus-dependent and MTEB-style averages do not tell the whole story.
embed-v4 wins on multilingual (100+ languages), multimodal (text + images), and context length (128K vs 8K). OpenAI wins on cost at low volume and ecosystem integration. For English-only, short-context RAG on the OpenAI stack, OpenAI is simpler; for anything else, embed-v4 is a serious candidate.
Yes, it is a headline capability. Embed text queries and images into the same vector space, then do cosine-similarity search. No separate CLIP model or cross-encoder reranker required for basic image-caption retrieval.
embed-v4 scores ~65-66 on MTEB English overall. Specialized English retrievers (BGE-en-ICL, NV-Embed-v2) outrank it on specific retrieval subsets, but embed-v4 leads on multilingual and multimodal MTEB tracks.
Legal contracts, technical manuals, research papers, SEC filings. Without 128K you have to chunk and re-rank, which adds complexity and often loses document-level signal. With 128K, embed once and rely on the LLM to localize.
Cohere offers private cloud and on-prem for regulated workloads, but it is an enterprise contract - not a self-serve API option. If compliance requires self-hosting today, open-weights alternatives (BGE, NV-Embed) are easier to start with.
No, it is the same endpoint. Pass text or a base64 image (or both), get a vector in the same 1536-dim space. Cross-modal similarity is comparable to within-modal for retrieval purposes.