MiMo V2 Flash
MiMo V2 Flash is Xiaomi's mixture-of-experts (MoE) reasoning model with 309B total parameters and 15B active per forward pass, using hybrid attention and multi-token prediction for inference efficiency. It supports a context window of 262.1K tokens at $0.1 per million input tokens and $0.3 per million output tokens.
```ts
import { streamText } from 'ai'

const result = streamText({
  model: 'xiaomi/mimo-v2-flash',
  prompt: 'Why is the sky blue?',
})
```

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
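As a sketch of the explicit setup, assuming the `@ai-sdk/gateway` package and an `AI_GATEWAY_API_KEY` environment variable (the variable name here is an assumption; with OIDC the key can be omitted):

```ts
import { createGateway } from '@ai-sdk/gateway'

// Hypothetical setup: reads the API key from the environment.
// With an OIDC token available, the apiKey option can be left out.
const gateway = createGateway({
  apiKey: process.env.AI_GATEWAY_API_KEY,
})

const model = gateway('xiaomi/mimo-v2-flash')
```

The same model string works when passed directly to `streamText`, as in the sample above.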
MoE routing plus sliding-window attention gives an unusually favorable cost-to-score tradeoff: low active compute per token against competitive benchmark results. Track real spend in AI Gateway at your volume; list pricing is $0.1 in and $0.3 out per million tokens.
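A back-of-envelope estimate at the listed rates (a sketch only; real spend should come from AI Gateway usage tracking):

```ts
// List prices for MiMo V2 Flash, USD per million tokens.
const INPUT_PER_M = 0.1
const OUTPUT_PER_M = 0.3

// Estimated cost in USD for a given request shape.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_PER_M +
         (outputTokens / 1_000_000) * OUTPUT_PER_M
}

// Example: 200K tokens of context in, 2K tokens of output.
console.log(estimateCostUsd(200_000, 2_000)) // ≈ 0.0206 USD
```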
When to Use MiMo V2 Flash
Best For
Software engineering, math, and long-context tasks:
You rely on the benchmark tables published at https://novita.ai/ (SWE-Bench Verified, LiveCodeBench, AIME-style math, long-context tests)
Long-context analysis:
Up to 262.1K tokens, including needle-in-haystack-style checks at long lengths
Cost-aware deployment:
The 15B active-parameter compute budget keeps per-token cost low relative to published benchmark results
Consider Alternatives When
Multimodal input required:
MiMo V2 Flash is text-in, text-out only
MoE hosting constraints:
You can't host or route to a large MoE stack, even with a small active count
Enterprise support terms:
You need support guarantees this SKU doesn't offer
Simple classification jobs:
A smaller model handles extraction at lower cost and fast enough
Conclusion
MiMo V2 Flash combines MoE routing, sliding-window attention, and MTP-style decoding for a text context window of 262.1K tokens at $0.1/$0.3 per million input/output tokens. See https://novita.ai/ for benchmark tables. Use AI Gateway for routing, retries, and usage tracking.
FAQ
What is a mixture-of-experts (MoE) model?
MoE routes tokens to expert blocks and only activates part of the weights each step. That keeps compute low while the full weight count still holds broad knowledge.
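The routing idea can be sketched in a few lines (a toy illustration of top-k expert selection, not MiMo's actual router):

```ts
// Toy MoE router: pick the top-k experts by gate score for one token.
// Only the selected experts run, so active compute stays small.
function topKExperts(gateScores: number[], k: number): number[] {
  return gateScores
    .map((score, expert) => ({ score, expert }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.expert)
}

// 8 experts, but only 2 activate for this token.
console.log(topKExperts([0.1, 0.7, 0.05, 0.3, 0.2, 0.9, 0.0, 0.15], 2)) // [5, 1]
```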
How does MiMo V2 Flash's hybrid attention work?
It mixes sliding-window and global attention on a fixed schedule with a 128-token window. The sliding-window layers keep KV caches much smaller than full attention, which helps on a context of 262.1K tokens.
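The cache saving is easy to approximate. A sketch, assuming a sliding-window layer caches at most its window of keys and values while a global layer caches the full context (the exact layer schedule is an assumption):

```ts
// KV entries cached per layer at a given context length.
// Omit windowLen for a global-attention layer.
function kvEntries(contextLen: number, windowLen?: number): number {
  return windowLen === undefined ? contextLen : Math.min(contextLen, windowLen)
}

const context = 262_144 // ~262.1K tokens
const slidingLayer = kvEntries(context, 128) // 128-token window
const globalLayer = kvEntries(context)       // full attention

console.log(globalLayer / slidingLayer) // 2048x fewer entries per sliding layer
```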
What does multi-token prediction (MTP) do?
It adds a small MTP block per layer so the stack can propose several future tokens and verify them in fewer full steps, which raises output tokens per second during inference.
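A toy draft-and-verify loop shows why this raises tokens per step (heavily simplified; real verification compares against the model's own logits):

```ts
// Toy verify step: accept the longest matching prefix of drafted tokens,
// plus the model's corrected token at the first mismatch.
function acceptedTokens(draft: string[], target: string[]): string[] {
  const out: string[] = []
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] === target[i]) out.push(draft[i])
    else return [...out, target[i]] // mismatch: take the full model's token
  }
  return out
}

// 3 of 4 drafted tokens match: one full step yields 4 tokens instead of 1.
console.log(acceptedTokens(['the', 'sky', 'is', 'red'], ['the', 'sky', 'is', 'blue']))
// ['the', 'sky', 'is', 'blue']
```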
How do I use MiMo V2 Flash through AI Gateway?
Add your API key in AI Gateway project settings, then use xiaomi/mimo-v2-flash in API calls. AI Gateway routes, retries, and fails over across novita, chutes, and xiaomi.
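As a configuration sketch, provider preference can be expressed per request; the `order` field under gateway provider options is an assumption based on AI Gateway's provider-routing settings:

```ts
import { streamText } from 'ai'

// Sketch: prefer novita, then fall back to chutes (provider IDs from above).
// The gateway `order` option is an assumption, not a confirmed API.
const result = streamText({
  model: 'xiaomi/mimo-v2-flash',
  prompt: 'Why is the sky blue?',
  providerOptions: {
    gateway: { order: ['novita', 'chutes'] },
  },
})
```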
How much does MiMo V2 Flash cost?
Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves MiMo V2 Flash.
How does MiMo V2 Flash compare to DeepSeek-V3?
DeepSeek-V3 activates more parameters per token from a larger total than MiMo V2 Flash. Compare published tables on each vendor's page; both are MoE stacks with different size and training choices.