Google just released Gemini Embedding 2, a new AI model that lets developers search across five different types of content using a single unified system. Instead of needing separate tools to search text, images, video, audio, and PDF documents, this model processes all of them together in what engineers call a "unified embedding space." The model became available in public preview on March 10, 2026, through the Gemini API and Vertex AI, and early partners like Everlaw, a legal discovery platform, are already reporting measurable improvements in search accuracy across millions of records.

## What Makes This Different From Previous AI Search Tools?

Embedding models work by converting raw content into numerical vectors that capture meaning. When two pieces of content share similar meaning, their vectors sit close together in mathematical space, regardless of whether one is text and the other is a video. Previous Google embedding models handled only text; Gemini Embedding 2 removes that single-modality constraint entirely.

This matters because real-world data is almost never text-only. A legal discovery team might need to find a relevant video deposition, an image of a contract, and a written email in response to a single search query. Before Gemini Embedding 2, that would have required three separate search systems. Now one model handles all of it.

## What Are the Technical Specifications and Practical Limits?

Gemini Embedding 2 handles each input type with specific parameters.
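Before the per-modality limits, here is a rough sketch of how an application might package a single interleaved, mixed-modality request. The model name comes from the article; the payload field names and helper are assumptions for illustration, not a confirmed client API.

```python
# Hypothetical sketch: bundling mixed-modality parts into one embedding
# request. Field names ("content", "output_dimensionality") are assumptions.

ALLOWED_MODALITIES = {"text", "image", "video", "audio", "pdf"}

def build_embedding_request(parts, output_dim=3072):
    """Bundle mixed-modality parts into a single request payload."""
    for part in parts:
        if part["type"] not in ALLOWED_MODALITIES:
            raise ValueError(f"unsupported modality: {part['type']}")
    return {
        "model": "gemini-embedding-2-preview",
        "output_dimensionality": output_dim,  # 3072, 1536, or 768 recommended
        "content": parts,
    }

# An image plus a text query in one request ("interleaved input").
request = build_embedding_request([
    {"type": "image", "data": "contract_scan.png"},
    {"type": "text", "data": "termination clause near the signature block"},
])
```

The point of the sketch is the shape of the call: one request, several media types, one embedding space on the other side.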
Understanding these limits helps developers plan how to use the model effectively:

- Text: Up to 8,192 input tokens per request, covering long documents, code, and multilingual content across more than 100 languages
- Images: Maximum of 6 images per request, in PNG or JPEG format
- Video: Maximum of 128 seconds per request, in MP4 or MOV format, supporting the H264, H265, AV1, and VP9 codecs
- Audio: Maximum of 80 seconds per request, in MP3 or WAV format, ingested natively without intermediate text transcription
- Documents: PDF files up to 6 pages per request, with the model processing both the visual layout and the text content of each page

The model also supports what engineers call "interleaved input": you can pass multiple modalities together in a single request, such as an image combined with a text query, to capture relationships between different media types.

## How Do You Balance Quality Against Storage Costs?

Gemini Embedding 2 uses a technique called Matryoshka Representation Learning (MRL), which nests information so that smaller, truncated versions of an embedding still work accurately. By default, the model outputs 3,072-dimensional embeddings, but developers can choose smaller sizes to save storage space. The recommended output sizes are 3,072, 1,536, and 768 dimensions.

Benchmark testing shows that 768 dimensions delivers near-peak quality at roughly one-quarter the storage footprint of 3,072 dimensions, a trade-off that makes sense for most production deployments. Notably, the 1,536-dimension option scores marginally higher than the 2,048-dimension option on MTEB, a widely used benchmark for embedding quality, suggesting that more dimensions do not always mean better results.

## What Real-World Problems Does This Solve for Developers and Enterprises?

Google's official documentation identifies several primary use cases where Gemini Embedding 2 delivers measurable value.
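As a toy illustration of the single-index idea that underlies these use cases: the sketch below stores vectors for items of different modalities in one in-memory index and ranks all of them against a single query vector by cosine similarity. The vectors are hand-made 4-dimensional stand-ins; a real system would obtain them from the embedding API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# One index for every modality: the point of a unified embedding space.
# Toy vectors; in practice these come from the model.
index = {
    "deposition.mp4":  [0.9, 0.1, 0.0, 0.1],    # video
    "contract.png":    [0.1, 0.9, 0.1, 0.0],    # image
    "email_2143.txt":  [0.85, 0.15, 0.1, 0.0],  # text
}

def search(query_vec, index, top_k=2):
    """Rank every stored item, regardless of modality, against one query."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A text query surfaces the video and the email about the same topic.
print(search([0.9, 0.1, 0.05, 0.05], index))
# → ['deposition.mp4', 'email_2143.txt']
```

One query, one ranking function, three media types: no per-modality retrieval systems to maintain.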
These applications show how the multimodal approach changes what developers can build:

- Retrieval-Augmented Generation (RAG): A technique where embeddings enhance the quality of generated text by retrieving relevant information and incorporating it into model context. With multimodal support, a RAG pipeline can now retrieve images, audio, or video alongside text using a single unified index instead of managing multiple systems
- Semantic search and information retrieval: Cross-modal search allows a text query to surface relevant video, image, or audio results from the same vector index, eliminating the need for a separate retrieval system per media type
- Classification and clustering: Because all modalities map to the same space, cross-modal sentiment analysis, anomaly detection, and data organization become viable with one model rather than a stack of specialized models
- Document intelligence: PDFs are embedded directly, with the model processing both the visual layout and the text content of each page, preserving information that text-extraction pipelines often lose

Everlaw, an early access partner working in legal discovery, confirmed measurable improvements in precision and recall across millions of records in its workflows, adding image and video search capabilities on top of existing text-based systems.

## What Happens If You Switch From the Previous Text-Only Model?

Google's older embedding model, gemini-embedding-001, remains available for text-only use cases. However, developers need to understand an important constraint: the embedding spaces of the two models are incompatible. Teams upgrading from gemini-embedding-001 to Gemini Embedding 2 must re-embed all existing data before switching; directly comparing embeddings generated by one model with embeddings generated by the other will produce inaccurate results.

This incompatibility is a one-time cost. Once teams re-embed their data, they gain access to multimodal search capabilities that were impossible before.
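The migration can be pictured as a batch re-embed pass over the existing corpus, written into a fresh index so that old and new vectors are never mixed. A minimal sketch, where `embed_fn` stands in for a call to the new model (its signature here is a hypothetical placeholder, not a confirmed SDK):

```python
def reembed_corpus(docs, embed_fn, batch_size=100):
    """Re-embed every document with the new model before cutting over.

    `embed_fn` is a stand-in for a batched call to the new embedding
    model (hypothetical signature: list of contents -> list of vectors).
    Because the two models' embedding spaces are incompatible, results
    go into a brand-new index rather than being merged into the old one.
    """
    new_index = {}
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        vectors = embed_fn([d["content"] for d in batch])
        for doc, vec in zip(batch, vectors):
            new_index[doc["id"]] = vec
    return new_index

# Fake embedder for illustration only; a real one calls the API.
fake_embed = lambda contents: [[float(len(c))] for c in contents]
docs = [{"id": i, "content": "x" * i} for i in range(5)]
new_index = reembed_corpus(docs, fake_embed, batch_size=2)
```

Once the pass completes, the application switches queries to the new index in one step, which keeps the incompatible spaces from ever being compared.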
For organizations with large existing text-only indexes, the migration effort is real but manageable, and the payoff is significant.

## How Are Developers Already Using Multimodal Embeddings in Production?

Beyond traditional search and retrieval, developers are experimenting with multimodal embeddings in novel ways. Memoo, a project built for the Gemini Live Agent Challenge, demonstrates how multimodal vision and voice can power browser automation. The system uses Gemini 2.0 Flash to detect meaningful browser interactions based on what it sees on screen, combined with voice context from users.

This approach addresses a long-standing problem in web automation: traditional tools like Selenium and Playwright rely on fragile CSS selectors that break when website structures change. Memoo's multimodal approach provides an autonomous fallback when deterministic selectors fail, reducing maintenance overhead. The system also demonstrates real-time voice integration through the Gemini Live API, which carries bidirectional 16 kHz PCM audio, allowing the voice model to clarify ambiguous user steps during recording. Raw events are transformed into playbooks in which Gemini automatically identifies and parameterizes personally identifiable information such as names, emails, and IDs.

## What Does This Mean for the Broader AI Landscape?

Gemini Embedding 2 represents a shift in how AI systems understand the world. Humans perceive reality through multiple senses simultaneously, yet for years AI systems have processed text, images, and video separately. Gemini Embedding 2 closes that gap by placing all modalities in the same mathematical space, allowing AI to reason about relationships between different types of content the way humans do naturally.
The model is available now in public preview as gemini-embedding-2-preview via the Gemini API and Vertex AI, with integrations already in place for popular developer frameworks including LangChain, LlamaIndex, Haystack, Weaviate, QDrant, ChromaDB, Pinecone, and Vector Search.

For developers and enterprises building search systems, content discovery platforms, or AI agents that need to understand mixed-media data, this represents a significant step forward in capability and simplicity.