Gemini Embedding 2
Gemini Embedding 2 is Google's first natively multimodal embedding model, mapping text, images, video, audio, and documents into a single unified embedding space with support for interleaved multi-modal inputs and over 100 languages.
```typescript
import { embed } from 'ai';

const result = await embed({
  model: 'google/gemini-embedding-2',
  value: 'Sunny day at the beach',
});
```
What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
Because this model embeds multiple modalities into the same vector space, ensure your vector database and retrieval pipeline are configured to handle queries that may originate from a different modality than the indexed documents (for example, text queries against an image corpus).
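Once vectors come back from the model, cross-modal retrieval reduces to ordinary nearest-neighbor scoring, because all modalities live in one space. The sketch below ranks an already-embedded corpus against a query vector; `cosineSimilarity` and `rankByScore` are illustrative helpers, not part of any SDK:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A text-query vector can be scored directly against image or audio
// vectors indexed earlier, because all modalities share one space.
function rankByScore(
  query: number[],
  corpus: { id: string; vector: number[] }[],
): { id: string; score: number }[] {
  return corpus
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

The same comparison works regardless of which modality produced the query vector, which is the property that makes a single unified index possible.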
When to Use Gemini Embedding 2
Best For
Multimodal RAG pipelines:
Indexing corpora that contain a mix of documents, images, audio, and video, and retrieving across all modalities from a single vector store using unified semantic search
Cross-modal retrieval:
Enabling text queries to surface relevant images, video clips, or audio segments (and vice versa) by embedding all media into the same shared space
Rich document understanding:
Embedding PDFs with their visual layout, charts, and text together in a single request rather than extracting and embedding text separately
Audio search without transcription:
Building search systems over audio archives that skip the intermediate transcription step by directly embedding audio content
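The single-vector-store pattern described above can be sketched as a minimal in-memory index. `UnifiedIndex` and its methods are illustrative names only; a production system would use a dedicated vector database, but the key idea is the same: entries from every modality share one store and one similarity function.

```typescript
// Minimal in-memory unified index: entries from any modality live in one
// store and are searched with the same cosine-similarity comparison.
type Modality = 'text' | 'image' | 'audio' | 'video' | 'document';

interface Entry {
  id: string;
  modality: Modality;
  vector: number[];
}

class UnifiedIndex {
  private entries: Entry[] = [];

  add(entry: Entry): void {
    this.entries.push(entry);
  }

  // Query with a vector from any modality; results may span modalities.
  search(query: number[], topK = 3): Entry[] {
    const score = (v: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < v.length; i++) {
        dot += v[i] * query[i];
        na += v[i] * v[i];
        nb += query[i] * query[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return [...this.entries]
      .sort((a, b) => score(b.vector) - score(a.vector))
      .slice(0, topK);
  }
}
```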
Consider Alternatives When
Pure text workloads:
Your application is text-only and you do not need multimodal capabilities; gemini-embedding-001's simpler pricing may be more appropriate, provided its shorter input token limit fits your documents
No cross-modal retrieval:
Your queries and indexed documents always share a single modality, so a multimodal embedding space adds operational complexity without benefit
Generative output needed:
You need generated text rather than vector representations of inputs
Conclusion
Gemini Embedding 2 removes the architectural boundary between modalities in embedding pipelines, replacing parallel per-modality indexes with a single unified space that supports direct cross-modal retrieval and semantic comparison. For teams building the next generation of multimodal search, RAG, and data organization systems, it provides that multimodal foundation.
FAQ
What input modalities does Gemini Embedding 2 support?
Text (up to 8,192 tokens), images (up to six per request, PNG and JPEG), video (up to 120 seconds, MP4 and MOV), audio (natively, without intermediate transcription), and documents (PDFs up to six pages).
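For illustration, the per-request limits listed above can be checked client-side before sending a request. `withinLimits` and its field names are assumptions for this sketch, not part of any SDK:

```typescript
// Per-request input limits from the model's FAQ (illustrative helper):
// up to 6 images, 120 seconds of video, and 6 PDF pages per request.
interface RequestShape {
  imageCount?: number;
  videoSeconds?: number;
  pdfPages?: number;
}

function withinLimits(req: RequestShape): boolean {
  return (
    (req.imageCount ?? 0) <= 6 &&
    (req.videoSeconds ?? 0) <= 120 &&
    (req.pdfPages ?? 0) <= 6
  );
}
```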
Are embeddings from different modalities directly comparable?
Yes. Vectors produced from text, images, video, audio, and documents are directly comparable. A text query can retrieve semantically relevant images, or an audio clip can be compared to a PDF. No cross-modal alignment layers on top of separate per-modality models are needed.
Can a single request mix modalities?
Yes. The model natively understands interleaved input, so you can pass an image and its text caption together. It captures the relationships between modalities in a single embedding.
What is the text input token limit?
Gemini Embedding 2 supports up to 8,192 input tokens for text, four times the 2,048-token limit of gemini-embedding-001, making it better suited for embedding longer documents.
Does it support Matryoshka Representation Learning (MRL)?
Yes. Like gemini-embedding-001, it uses MRL to allow output dimensions to scale down from the default 3,072. Google recommends 3,072, 1,536, or 768 for highest-quality results.
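With MRL, a smaller embedding is obtained by keeping a prefix of the full vector and re-normalizing it. A minimal sketch of that post-processing step (`truncateEmbedding` is an illustrative helper, not an SDK function):

```typescript
// MRL embeddings can be shortened by truncating to a prefix of the
// desired dimension (e.g. 1,536 or 768 from the default 3,072) and
// re-normalizing so cosine comparisons remain well behaved.
function truncateEmbedding(vector: number[], dim: number): number[] {
  const prefix = vector.slice(0, dim);
  const norm = Math.sqrt(prefix.reduce((sum, v) => sum + v * v, 0));
  return prefix.map((v) => v / norm);
}
```

Storing smaller vectors trades a little retrieval quality for lower index size and faster search, which is the usual reason to step down from 3,072.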
Which frameworks and vector databases are supported?
Supported integrations include LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.
How is pricing determined?
Pricing appears on this page and updates as providers adjust their rates. AI Gateway routes traffic through the configured provider.