Cohere's April 2025 flagship. 1536 dims, 128K context, Matryoshka, multimodal (text + images), 100+ languages. API-only, $0.12 per 1M tokens.
Nine sentences plotted in a 3D projection of the embedding space. Drag empty space to rotate, drag a point to move it, click to see cosine similarity. Move points closer and watch the similarity score climb in real time.
Values are illustrative, not from the actual model. Real scores depend on the sentences and the model's training data. The shape is what embedding models produce: semantic neighbors cluster, unrelated topics separate. We show three axes here, but real embeddings live in hundreds to thousands of dimensions.
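For intuition, the similarity score in the demo is plain cosine similarity: the dot product of two vectors after normalization. A toy sketch with illustrative 3-dim vectors (made up for this page, not real model outputs):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized dot product: 1.0 = same direction, ~0.0 = unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim points like the ones in the demo (illustrative only).
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.8, 0.2, 0.3])
invoice = np.array([-0.1, 0.9, -0.4])

print(cosine_similarity(cat, kitten))   # high: semantic neighbors
print(cosine_similarity(cat, invoice))  # low: unrelated topics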
First frontier embedder to handle text and images in a single unified model, speak 100+ languages at production quality, and stretch to a 128K context window no open embedder has matched. Also the vendor most Fortune-500 teams actually trust for regulated workloads.
1536 dims with Matryoshka cuts at 256/512/1024/1536. 128K context - roughly 16x OpenAI's 8K and 250x BGE's 512. MTEB overall ~66. $0.12 per 1M tokens. 100+ language support with a small cross-language quality gap.
API-only on the public tier; on-prem is gated behind enterprise contracts. Quality degrades past ~32K of context, so very long documents still benefit from chunking. English-only retrieval is competitive but not clearly ahead of English-tuned specialists.
Your corpus has non-English text, embedded images, or documents longer than 8K. Regulated enterprise with compliance needs where the Cohere enterprise contract (SOC 2, HIPAA, on-prem) is worth more than open weights. Otherwise simpler options suffice.
| Released | April 2025 |
|---|---|
| Organization | Cohere |
| License | Cohere Terms of Use (proprietary, API only) |
| Backbone | Proprietary |
| Parameters | Not disclosed |
| Embedding dimensions | 1,536 · Matryoshka: 256 / 512 / 1024 / 1536 |
| Max context | 128,000 tokens |
| Pooling | Proprietary |
| Training objective | Contrastive + Matryoshka + Multimodal + Multilingual |
| MTEB (overall) | 65.80 |
| MTEB (retrieval) | 56.10 |
| Multilingual | Yes |
| Self-hosted | No (API only) |
| Cost | $0.12 per 1M tokens |
Unlike CLIP-style dual encoders, which train separate text and image towers into alignment, embed-v4 runs both modalities through one model into one representation space. Cross-modal retrieval (a text query finding images, or vice versa) works without model switching or a reranker.
Roughly 16x OpenAI text-embedding-3-large's 8K and 250x BGE's 512 (128,000 / 512). Enough for whole contracts, spec sheets, or book chapters in a single call. Per-token quality degrades past ~32K but remains usable across the whole window.
Trained on 100+ languages including many low-resource ones. Cross-language retrieval quality is closer to the English baseline than most multilingual generics achieve.
256, 512, 1024, and 1536-dim cuts. Store full-size and truncate for storage or speed wins without re-embedding. Table stakes for 2026 embedders; Cohere adopted it early.
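A minimal sketch of the store-full, truncate-later pattern (illustrative numpy; assumes the stored vectors are the full 1,536-dim floats):

import numpy as np

def matryoshka_truncate(vec: np.ndarray, dim: int) -> np.ndarray:
    # Matryoshka training packs information front-to-back, so the
    # first `dim` components form a usable embedding on their own.
    assert dim in (256, 512, 1024, 1536)
    cut = vec[:dim]
    # Re-normalize so cosine similarity stays well-behaved after the cut.
    return cut / np.linalg.norm(cut)

full = np.random.randn(1536)            # stand-in for a stored embedding
small = matryoshka_truncate(full, 256)  # 6x storage savings, no re-embed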
Why: One model is simpler to deploy and gives better cross-modal alignment than separate text and image encoders with a shared projection head. Cohere bet that the model complexity was worth the unified API.
Why: Built into training, not retrofitted. Quality degrades gracefully rather than cliff-dropping at a specific length. Long-form workloads (legal, scientific) benefit without needing to chunk.
Why: Cohere targets regulated enterprise: banks, pharma, government. Product value is the SLA, SOC 2 / HIPAA posture, and on-prem contract, not just the embedding quality.
Embed documents and a query with Cohere embed-v4.0 using the asymmetric `search_document` vs `search_query` input types. Matryoshka truncation happens via the `output_dimension` parameter; valid values are 256, 512, 1,024, and 1,536. For multimodal search, pass base64-encoded images into the same endpoint and they land in the same 1,536-dim space as text.
import cohere

co = cohere.ClientV2()

docs = [
    "Matryoshka embeddings support multiple cuts.",
    "128K context fits whole contracts in one call.",
]

# input_type controls the embedding's downstream task
doc_resp = co.embed(
    model="embed-v4.0",
    texts=docs,
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
)

query_resp = co.embed(
    model="embed-v4.0",
    texts=["How does 128K context help RAG?"],
    input_type="search_query",
    embedding_types=["float"],
    output_dimension=1024,
)

The `input_type` parameter routes the call to task-specific output distributions - skip it and retrieval quality drops by several points. `embedding_types` supports "float", "int8", "uint8", "binary", "ubinary", and "base64" for storage and transport trade-offs. For cross-modal retrieval (text query, image corpus), index images with `input_type="search_document"` and query with `input_type="search_query"` - cosine similarity across modalities is directly meaningful because both map into the shared space.
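A hedged sketch of the image-indexing side. It assumes the `images` parameter (a base64 data URL) that Cohere's embed endpoint documents for image inputs; the exact multimodal request shape can differ across SDK versions, so check yours. `product_photo.png` is a hypothetical file:

import base64
import cohere

co = cohere.ClientV2()

# Wrap a local image as a base64 data URL.
with open("product_photo.png", "rb") as f:  # hypothetical file
    b64 = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/png;base64,{b64}"

# Index the image into the same 1,536-dim space as text.
# NOTE: `images=` per Cohere's image-embedding docs; the multimodal
# request shape for embed-v4.0 may differ by SDK version.
img_resp = co.embed(
    model="embed-v4.0",
    images=[data_url],
    input_type="search_document",
    embedding_types=["float"],
)

The text query from the snippet above can then be scored against these image vectors with plain cosine similarity, no second model in the loop.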
p50 around 120 ms and p95 around 350 ms from Cohere's production endpoint. The AWS-backed deployment serves from us-east-1, eu-west-1, and an APAC region; route to the closest for a tighter round-trip. On-prem deployments via Cohere's Secure enterprise plan trade the managed endpoint's latency floor for data-residency control.
Up to 96 inputs per batch and 128,000 tokens per request. Cohere's published embed rate limit is 2,000 inputs per minute on both trial and production tiers (per the v2 rate-limits docs); higher throughput requires a custom enterprise contract. For a 128K-context request, expect 1-2 seconds end-to-end - this is where long-context embedding's cost shows up, and why most teams still chunk past roughly 32K tokens.
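A client-side batching loop under those limits (hypothetical `embed_corpus` helper; retry logic and pacing against the 2,000-inputs-per-minute cap are left out for brevity):

import cohere

co = cohere.ClientV2()
MAX_BATCH = 96  # documented per-request input cap

def embed_corpus(texts: list[str], dim: int = 1024) -> list[list[float]]:
    # Hypothetical helper: chunks the corpus into API-sized batches.
    out: list[list[float]] = []
    for i in range(0, len(texts), MAX_BATCH):
        resp = co.embed(
            model="embed-v4.0",
            texts=texts[i : i + MAX_BATCH],
            input_type="search_document",
            embedding_types=["float"],
            output_dimension=dim,
        )
        # `.float_` in recent Python SDKs; field name may vary by version.
        out.extend(resp.embeddings.float_)
    return out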
$0.12 per 1M input tokens, slightly cheaper than OpenAI text-embedding-3-large at $0.13. A billion-token reindex is $120. Image inputs are billed as a separate `images` unit in the API response (not folded into `input_tokens`), and per-1M-image-token rates run higher than text - third-party pricing aggregators report around $0.47 per 1M image tokens. The cost story vs CLIP-style stacks is still favorable because you avoid running a second inference server, but text and images are not priced identically.
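As a back-of-envelope check (a hypothetical helper; the image rate is the third-party figure above, not an official price):

def embed_cost_usd(text_tokens: int = 0, image_tokens: int = 0) -> float:
    # $0.12 per 1M text tokens; ~$0.47 per 1M image tokens is a
    # third-party-reported figure -- verify before budgeting.
    return text_tokens / 1e6 * 0.12 + image_tokens / 1e6 * 0.47

print(embed_cost_usd(text_tokens=1_000_000_000))  # 120.0 for a 1B-token reindex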
Cohere offers a public API, AWS Bedrock, Azure AI, and Oracle Cloud deployments. For regulated workloads, Cohere North (on-prem enterprise) runs the same model in your VPC with SOC 2 Type 2 and HIPAA postures. No open weights; the on-prem license is a commercial contract negotiated with Cohere sales.
First-party support in LangChain (CohereEmbeddings), LlamaIndex (CohereEmbedding), Haystack, and AWS Bedrock's knowledge-base workflows. The multimodal endpoint works with any vector DB that stores a fixed-dim float vector - images normalize into the same 1,536-dim space as text, so Pinecone, Weaviate, Qdrant, and Milvus need no schema changes for cross-modal search. Binary and int8 output types are supported end-to-end in Qdrant and Weaviate.
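A minimal sketch of the LangChain path named above, assuming the `langchain-cohere` package and that it accepts the embed-v4.0 model id:

from langchain_cohere import CohereEmbeddings

emb = CohereEmbeddings(model="embed-v4.0")
doc_vecs = emb.embed_documents(["Matryoshka cuts at 256/512/1024/1536."])
query_vec = emb.embed_query("What dimensions does embed-v4 support?")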
Cohere embed-v4 reports MTEB English overall around 65.8 and multilingual MTEB scores that top the multilingual leaderboard across 100+ languages. The cross-modal evaluation uses Cohere's own image-text retrieval benchmark - there is no MTEB equivalent for multimodal embedding, so the multimodal score is vendor-reported rather than third-party-audited. Validate on your own catalog before committing; cross-modal retrieval is especially corpus-dependent and MTEB-style averages do not tell the whole story.
embed-v4 wins on multilingual (100+ languages), multimodal (text + images), and context length (128K vs 8K). OpenAI wins on cost at low volume and ecosystem integration. For English-only, short-context RAG on the OpenAI stack, OpenAI is simpler; for anything else, embed-v4 is a serious candidate.
Yes, it is a headline capability. Embed text queries and images into the same vector space, then do cosine-similarity search. No separate CLIP model or cross-encoder reranker required for basic image-caption retrieval.
embed-v4 scores ~65-66 on MTEB English overall. Specialized English retrievers (BGE-en-ICL, NV-Embed-v2) outrank it on specific retrieval subsets, but embed-v4 leads on multilingual and multimodal MTEB tracks.
Legal contracts, technical manuals, research papers, SEC filings. Without 128K you have to chunk and re-rank, which adds complexity and often loses document-level signal. With 128K, embed once and rely on the LLM to localize.
Cohere offers private cloud and on-prem for regulated workloads, but it is an enterprise contract - not a self-serve API option. If compliance requires self-hosting today, open-weights alternatives (BGE, NV-Embed) are easier to start with.
No, it is the same endpoint. Pass text or a base64 image (or both), get a vector in the same 1536-dim space. Cross-modal similarity is comparable to within-modal for retrieval purposes.