Gemini 2.5 Flash
Gemini 2.5 Flash is Google's first fully hybrid reasoning model. Developers can toggle thinking on or off and set thinking budgets to tune the balance between quality, cost, and latency, all on top of the fast, multimodal foundation of 2.0 Flash.
```typescript
import { streamText } from 'ai'

const result = streamText({
  model: 'google/gemini-2.5-flash',
  prompt: 'Why is the sky blue?',
})
```

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
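As a minimal sketch of the key-based path, the gateway credential can be resolved from the environment so provider keys never appear in code. The variable name `AI_GATEWAY_API_KEY` is an assumption here; confirm the exact name in the gateway documentation (OIDC-based auth needs no key at all).

```typescript
// Resolve the gateway API key from the environment.
// AI_GATEWAY_API_KEY is an assumed variable name, not confirmed
// by this page; check your gateway's docs for the exact name.
function resolveGatewayKey(
  env: Record<string, string | undefined> = process.env,
): string | undefined {
  return env.AI_GATEWAY_API_KEY
}
```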
Thinking budgets affect token consumption and latency, so evaluate provider rate limits and pricing tiers with thinking-enabled requests before committing to a provider variant at production scale.
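To make that evaluation concrete, a worst-case cost ceiling per request can be sketched from the pricing and the budget. All rates below are placeholders, not Google's actual prices; the assumption that thinking tokens bill at the output rate should be verified against your provider's pricing page.

```typescript
// Rough per-request cost ceiling at a given thinking budget.
// Prices here are placeholders for illustration, not actual rates.
interface Pricing {
  inputPerMTok: number   // USD per 1M input tokens
  outputPerMTok: number  // USD per 1M output tokens (assumes thinking bills as output)
}

function maxRequestCost(
  pricing: Pricing,
  inputTokens: number,
  maxOutputTokens: number,
  thinkingBudget: number,
): number {
  // Worst case: the model spends the entire thinking budget
  // plus the full visible output allowance.
  const billedOutput = maxOutputTokens + thinkingBudget
  return (
    (inputTokens / 1e6) * pricing.inputPerMTok +
    (billedOutput / 1e6) * pricing.outputPerMTok
  )
}

// Placeholder rates for illustration only.
const pricing: Pricing = { inputPerMTok: 0.3, outputPerMTok: 2.5 }

// Same request, thinking disabled vs. an 8192-token budget:
const noThinking = maxRequestCost(pricing, 2000, 1000, 0)
const withThinking = maxRequestCost(pricing, 2000, 1000, 8192)
```

Running this comparison across a provider's pricing tiers shows how quickly a generous thinking budget multiplies the per-request ceiling.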
When to Use Gemini 2.5 Flash
Best For
Workloads with mixed complexity:
Applications that serve both simple requests and hard reasoning problems benefit from the ability to set per-request thinking budgets rather than paying for full reasoning on every call
Reasoning-intensive pipelines:
Multi-step math, science, coding, or logic tasks where 2.0 Flash's speed was sufficient but accuracy needs improvement
Cost-conscious agentic applications:
Need chain-of-thought planning but cannot afford 2.5 Pro pricing across high request volumes
Coding and code transformation tasks:
Benefit from the reasoning capabilities introduced in the 2.5 generation, including agentic code applications
Multimodal reasoning:
Images, video, or audio inputs that require more nuanced analysis than pattern matching
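The mixed-complexity case above can be sketched as a small router that maps a coarse complexity estimate to a per-request thinking budget. The tier names and token numbers are illustrative, and the `google.thinkingConfig.thinkingBudget` option shape is an assumption about the AI SDK's Google provider options that should be checked against its documentation.

```typescript
// Route each request to a thinking budget based on a coarse
// complexity estimate. Tiers and budgets are illustrative only.
type Complexity = 'simple' | 'moderate' | 'hard'

function pickThinkingBudget(complexity: Complexity): number {
  switch (complexity) {
    case 'simple':   return 0     // thinking off: fast path, like 2.0 Flash
    case 'moderate': return 2048
    case 'hard':     return 8192
  }
}

// Assumed providerOptions shape for the AI SDK Google provider;
// pass this alongside model/prompt in streamText or generateText.
function gatewayOptions(complexity: Complexity) {
  return {
    google: {
      thinkingConfig: { thinkingBudget: pickThinkingBudget(complexity) },
    },
  }
}
```

A classifier upstream (even a cheap heuristic on prompt length or task type) decides the tier, so simple traffic never pays for reasoning it does not need.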
Consider Alternatives When
Deepest reasoning required:
Highly complex problems with no speed or cost constraint, where 2.5 Pro's stronger benchmark scores may justify the premium
Uniform high-volume inference:
Entirely low-complexity workloads where thinking overhead adds cost without benefit, making 2.5 Flash-Lite or 2.0 Flash-Lite more appropriate
Native image or audio output:
2.5 Flash outputs text only, so media generation needs a different model
Conclusion
Gemini 2.5 Flash introduces a new dimension of control to the Flash model family. You can dial reasoning depth from zero to a configured budget, matching compute expenditure to actual task complexity. It retains the efficiency that made Flash popular while unlocking the reasoning quality that previously required a heavier model.
FAQ
What does "hybrid reasoning" mean for Gemini 2.5 Flash?
It means the model operates in two modes: with thinking disabled (behaving like a fast response model comparable to 2.0 Flash) or with thinking enabled at a configurable budget, where it reasons through the problem before generating an answer.
How do thinking budgets work?
You set a per-request parameter that controls how much deliberation the model applies before responding. A higher budget allows more reasoning steps, improving accuracy on complex tasks at the cost of more tokens and higher latency. A lower budget favors speed and cost.
Is 2.5 Flash better than 2.0 Flash when thinking is disabled?
Yes. 2.5 Flash outperforms 2.0 Flash even with thinking disabled; the 2.5 base model is stronger regardless of thinking mode.
Does Gemini 2.5 Flash support Google Search and code execution?
Yes. Google Search and code execution are shared capabilities across all Gemini 2.5 models, including Gemini 2.5 Flash.
How large is the context window?
The context window is 1M tokens.
How does Gemini 2.5 Flash compare to 2.5 Pro?
Gemini 2.5 Flash sits at the Pareto frontier of cost and performance. It delivers strong reasoning at a lower cost than 2.5 Pro, which targets complex tasks with strong benchmark scores.
When was Gemini 2.5 Flash released?
It launched in preview on March 20, 2025. Google later promoted it to stable general availability alongside 2.5 Pro as part of the Gemini 2.5 family expansion.