Gemini Embedding 2
Gemini Embedding 2 is Google's first natively multimodal embedding model, mapping text, images, video, audio, and documents into a single unified embedding space with support for interleaved multi-modal inputs and over 100 languages.
```typescript
import { embed } from 'ai';

const result = await embed({
  model: 'google/gemini-embedding-2',
  value: 'Sunny day at the beach',
});
```
What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
Because this model embeds multiple modalities into the same vector space, ensure your vector database and retrieval pipeline are configured to handle queries that may originate from a different modality than the indexed documents (for example, text queries against an image corpus).
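Once vectors come back from the model, cross-modal retrieval reduces to ordinary nearest-neighbor scoring, because all modalities live in one space. The sketch below ranks an already-embedded corpus against a query vector; `cosineSimilarity` and `rankByScore` are illustrative helpers, not part of any SDK:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A text-query vector can be scored directly against image or audio
// vectors indexed earlier, because all modalities share one space.
function rankByScore(
  query: number[],
  corpus: { id: string; vector: number[] }[],
): { id: string; score: number }[] {
  return corpus
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

The same comparison works regardless of which modality produced the query vector, which is the property that makes a single unified index possible.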
When to Use Gemini Embedding 2
Best For
Multimodal RAG pipelines:
Indexing corpora that contain a mix of documents, images, audio, and video, and retrieving across all modalities from a single vector store using unified semantic search
Cross-modal retrieval:
Enabling text queries to surface relevant images, video clips, or audio segments (and vice versa) by embedding all media into the same shared space
Rich document understanding:
Embedding PDFs with their visual layout, charts, and text together in a single request rather than extracting and embedding text separately
Audio search without transcription:
Building search systems over audio archives that skip the intermediate transcription step by directly embedding audio content
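The single-vector-store pattern described above can be sketched as a minimal in-memory index. `UnifiedIndex` and its methods are illustrative names only; a production system would use a dedicated vector database, but the key idea is the same: entries from every modality share one store and one similarity function.

```typescript
// Minimal in-memory unified index: entries from any modality live in one
// store and are searched with the same cosine-similarity comparison.
type Modality = 'text' | 'image' | 'audio' | 'video' | 'document';

interface Entry {
  id: string;
  modality: Modality;
  vector: number[];
}

class UnifiedIndex {
  private entries: Entry[] = [];

  add(entry: Entry): void {
    this.entries.push(entry);
  }

  // Query with a vector from any modality; results may span modalities.
  search(query: number[], topK = 3): Entry[] {
    const score = (v: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < v.length; i++) {
        dot += v[i] * query[i];
        na += v[i] * v[i];
        nb += query[i] * query[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return [...this.entries]
      .sort((a, b) => score(b.vector) - score(a.vector))
      .slice(0, topK);
  }
}
```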
Consider Alternatives When
Pure text workloads:
Your application is text-only and you do not need multimodal capabilities; gemini-embedding-001's simpler pricing may be more appropriate, provided its shorter input token limit fits your documents
No cross-modal retrieval:
Your queries and indexed documents always share a single modality, so a multimodal embedding space adds operational complexity without benefit
Generative output needed:
You need generated text rather than vector representations of inputs
Conclusion
Gemini Embedding 2 removes the architectural boundary between modalities in embedding pipelines, replacing parallel per-modality indexes with a single unified space that supports direct cross-modal retrieval and semantic comparison. For teams building the next generation of multimodal search, RAG, and data organization systems, it provides that multimodal foundation.
FAQ
What input modalities does Gemini Embedding 2 support?
Text (up to 8,192 tokens), images (up to six per request, PNG and JPEG), video (up to 120 seconds, MP4 and MOV), audio (natively, without intermediate transcription), and documents (PDFs up to six pages).
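For illustration, the per-request limits listed above can be checked client-side before sending a request. `withinLimits` and its field names are assumptions for this sketch, not part of any SDK:

```typescript
// Per-request input limits from the model's FAQ (illustrative helper):
// up to 6 images, 120 seconds of video, and 6 PDF pages per request.
interface RequestShape {
  imageCount?: number;
  videoSeconds?: number;
  pdfPages?: number;
}

function withinLimits(req: RequestShape): boolean {
  return (
    (req.imageCount ?? 0) <= 6 &&
    (req.videoSeconds ?? 0) <= 120 &&
    (req.pdfPages ?? 0) <= 6
  );
}
```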
Are embeddings from different modalities directly comparable?
Yes. Vectors produced from text, images, video, audio, and documents are directly comparable. A text query can retrieve semantically relevant images, or an audio clip can be compared to a PDF. No cross-modal alignment layers on top of separate per-modality models are needed.
Can a single request mix modalities?
Yes. The model natively understands interleaved input, so you can pass an image and its text caption together. It captures the relationships between modalities in a single embedding.
What is the text input token limit?
Gemini Embedding 2 supports up to 8,192 input tokens for text, four times the 2,048-token limit of gemini-embedding-001, making it better suited for embedding longer documents.
Does it support Matryoshka Representation Learning (MRL)?
Yes. Like gemini-embedding-001, it uses MRL to allow output dimensions to scale down from the default 3,072. Google recommends 3,072, 1,536, or 768 for highest-quality results.
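With MRL, a smaller embedding is obtained by keeping a prefix of the full vector and re-normalizing it. A minimal sketch of that post-processing step (`truncateEmbedding` is an illustrative helper, not an SDK function):

```typescript
// MRL embeddings can be shortened by truncating to a prefix of the
// desired dimension (e.g. 1,536 or 768 from the default 3,072) and
// re-normalizing so cosine comparisons remain well behaved.
function truncateEmbedding(vector: number[], dim: number): number[] {
  const prefix = vector.slice(0, dim);
  const norm = Math.sqrt(prefix.reduce((sum, v) => sum + v * v, 0));
  return prefix.map((v) => v / norm);
}
```

Storing smaller vectors trades a little retrieval quality for lower index size and faster search, which is the usual reason to step down from 3,072.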
Which frameworks and vector databases are supported?
Supported integrations include LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.
How is pricing determined?
Pricing appears on this page and updates as providers adjust their rates. AI Gateway routes traffic through the configured provider.