Gemini 3 Flash
Gemini 3 Flash delivers Gemini 3's pro-grade reasoning at flash-level latency and cost, using 30% fewer tokens than previous Gemini 2.5 models while outperforming them across most benchmarks.
import { streamText } from 'ai'

const result = streamText({
  model: 'google/gemini-3-flash',
  prompt: 'Why is the sky blue?',
})

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
Gemini 3 Flash supports configurable thinking levels (including 'high') via providerOptions, giving you direct control over how much reasoning compute the model applies per request.
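A minimal sketch of building those provider options, using the option names this page describes (thinkingLevel and includeThoughts under providerOptions.google); the exact accepted values and shapes may vary by AI SDK version:

```typescript
// Options shape described in this doc: providerOptions.google with
// thinkingLevel and includeThoughts. Treat the level values as illustrative.
type GeminiThinkingOptions = {
  google: {
    thinkingLevel: 'low' | 'high'
    includeThoughts: boolean
  }
}

// Hypothetical helper that assembles the options object per request.
function thinkingOptions(
  level: 'low' | 'high',
  includeThoughts = false,
): GeminiThinkingOptions {
  return { google: { thinkingLevel: level, includeThoughts } }
}

// Usage with the AI SDK (sketch):
// const result = streamText({
//   model: 'google/gemini-3-flash',
//   prompt: 'Why is the sky blue?',
//   providerOptions: thinkingOptions('high', true),
// })
```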
When to Use Gemini 3 Flash
Best For
Real-time chat and assistants:
Interfaces that require pro-level reasoning without high latency
High-volume agentic pipelines:
Per-token cost directly affects operating expenses
Step-by-step analysis:
Tasks where surfacing intermediate reasoning (includeThoughts) adds value

Throughput-bottlenecked apps:
Applications previously constrained by Gemini 2.5 Pro throughput limits
Cost-sensitive production workloads:
Production traffic where per-token cost matters but quality still has to stay benchmark-competitive
Consider Alternatives When
Maximum reasoning depth:
Your task requires the deepest reasoning regardless of cost or speed (consider google/gemini-3-pro-preview or google/gemini-3.1-pro-preview)

Native image generation needed:
You require image output alongside text (consider google/gemini-3-pro-image or google/gemini-3.1-flash-image-preview)

Budget and latency dominate:
Task quality requirements are low (consider google/gemini-3.1-flash-lite-preview)
Conclusion
Gemini 3 Flash resets expectations for what a speed-tier model can deliver, matching or exceeding previous-generation Pro quality at a fraction of the cost and latency. For teams that need scalable intelligence rather than raw capability, it represents a cost- and latency-efficient entry point into the Gemini 3 generation on AI Gateway.
FAQ
Gemini 3 Flash is built on the newer Gemini 3 architecture rather than Gemini 2.5. The generation change brings a substantial capability lift: Gemini 3 Flash surpasses Gemini 2.5 Pro on most benchmarks, so a speed-tier model in the 3 generation now exceeds the previous generation's flagship.
Yes. You can set thinkingLevel (e.g., 'high') and includeThoughts: true inside providerOptions.google when using the AI SDK. This gives you visibility into intermediate reasoning steps.
Yes. Use streamText from the AI SDK with model: 'google/gemini-3-flash' for streaming responses.
No. AI Gateway handles all provider authentication. You authenticate to AI Gateway using a Vercel API key or OIDC token and do not need to configure Google credentials separately.
Gemini 3 Pro targets the most challenging reasoning and agentic workflows. Gemini 3 Flash prioritizes speed and cost while still delivering pro-grade quality. The right tradeoff depends on your latency budget and task complexity.
Yes, Zero Data Retention is available for this model. ZDR on AI Gateway applies to direct gateway requests; BYOK flows aren't covered. See https://vercel.com/docs/ai-gateway/capabilities/zdr for configuration details.
Gemini 3 Flash uses 30% fewer tokens than previous Gemini 2.5 models. Combined with lower per-token pricing, this results in meaningful cost reductions at scale for applications processing large volumes of requests.
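The arithmetic behind that savings claim can be sketched as a small estimator. The per-million-token price below is a made-up placeholder, not a published rate; only the ~30% token reduction comes from this page:

```typescript
// Illustrative only: hypothetical price per million tokens, NOT a published rate.
const COST_PER_M_TOKENS = 1.0

// Estimate workload cost after applying the ~30% token reduction this doc
// attributes to Gemini 3 Flash relative to Gemini 2.5 models.
function estimateCost(baselineTokens: number, tokenReduction = 0.3): number {
  const effectiveTokens = baselineTokens * (1 - tokenReduction)
  return (effectiveTokens / 1_000_000) * COST_PER_M_TOKENS
}
```

At the placeholder price, a workload that would have consumed 10M tokens on a 2.5-generation model costs 7 units instead of 10 before any per-token price difference is even counted.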
Yes. The model's combination of reasoning capability, token efficiency, and low latency makes it well-suited for agents that execute multiple tool calls or reasoning steps in sequence within a budget-constrained environment.
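A budget-constrained sequential agent loop of the kind described above can be sketched independently of any SDK. The `callModel` signature here is a hypothetical stand-in for a gateway request (e.g., a generateText call against google/gemini-3-flash):

```typescript
// Hypothetical model-call signature; in practice this would wrap an AI SDK
// request to 'google/gemini-3-flash' and report its token usage.
type ModelCall = (prompt: string) => Promise<{ text: string; tokens: number }>

// Run reasoning/tool steps in sequence, stopping once the token budget is spent.
async function runAgent(
  callModel: ModelCall,
  steps: string[],
  tokenBudget: number,
): Promise<{ transcript: string[]; tokensUsed: number }> {
  const transcript: string[] = []
  let tokensUsed = 0
  for (const step of steps) {
    if (tokensUsed >= tokenBudget) break // budget exhausted: stop early
    const { text, tokens } = await callModel(step)
    transcript.push(text)
    tokensUsed += tokens
  }
  return { transcript, tokensUsed }
}
```

Lower per-step token usage stretches the same budget across more steps, which is why token efficiency compounds in multi-step agents.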