Overview
Google released Gemini 3.1 Pro on February 19, 2026, rolling it out to developers, enterprises, and consumers as a preview. This is the first time Google has shipped a ".1" increment between major versions — past cycles used ".5" as the mid-year update. The naming change signals something deliberate: this is a targeted, focused upgrade to reasoning depth, not a broad architectural overhaul.
Google claims Gemini 3.1 Pro powers the "core intelligence" behind its recently upgraded Deep Think tool, meaning the improvements here trickle directly into real products. For developers and enterprises, the headline story is a massive leap in reasoning performance delivered at the exact same price as the previous model. For everyday users, it means the Gemini app gets meaningfully smarter at complex tasks — without any extra cost.
This article covers every key detail: what changed, what the benchmarks actually mean, how to access the model, how pricing compares, and where it wins and falls short against Claude and GPT.
Benchmark Performance: How Much Better Is Gemini 3.1 Pro?
The headline number is on ARC-AGI-2 — a benchmark designed to test a model's ability to solve entirely new logic patterns it has never seen during training.
Gemini 3.1 Pro achieved a verified score of 77.1% on ARC-AGI-2, more than double the reasoning performance of Gemini 3 Pro.
For context: Anthropic's Claude Opus 4.6 solves 68.8% of ARC-AGI-2 challenges, and OpenAI's GPT-5.2 scores 52.9%. Gemini 3.1 Pro leads all three.
Here is the full benchmark comparison across major models:
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 (reasoning) | 77.1% | 68.8% | 52.9% |
| GPQA Diamond (science) | 94.3% | — | — |
| Humanity's Last Exam | 44.4% | — | 34.5% |
| SWE-Bench Verified (coding) | 80.6% | — | — |
| LiveCodeBench Pro (Elo) | 2887 | — | — |
| APEX-Agents (agentic tasks) | 33.5% | — | — |
| MMMLU (multimodal) | 92.6% | — | — |
Source: Google DeepMind model card, February 2026. Third-party leaderboard results may differ.
Gemini 3.1 Pro holds the #1 position on at least 12 of 18 tracked benchmarks.
Where It Does Not Lead
The model does not win everywhere. Here is where rivals still hold an edge:
| Task | Leader | Gemini 3.1 Pro Score | Rival Score |
|---|---|---|---|
| Terminal-Bench 2.0 (coding) | GPT-5.3-Codex | 68.5% | 77.3% |
| SWE-Bench Pro Public | GPT-5.3-Codex | 54.2% | 56.8% |
| GDPval-AA Elo (expert tasks) | Claude Sonnet 4.6 | 1317 | 1633 |
| Arena text preference (Elo) | Claude Opus 4.6 | — | 1504 |
On Arena (formerly LM Arena), Claude Opus 4.6 leads Gemini in text tasks, edging it by four points at 1504. In coding categories, Opus 4.6, Opus 4.5, and GPT-5.2 High also rank ahead. Arena rankings rely on user voting, which can reward answers that appear correct even if they contain subtle errors.
What Is New in Gemini 3.1 Pro
Reasoning Engine
The performance gains on ARC-AGI-2 represent a refinement in how the model handles "thinking" tokens and long-horizon tasks, providing a more reliable foundation for developers building autonomous agents. The upgrade is not just about raw scores — it reduces the model's tendency to rely on pattern matching and increases its ability to reason from first principles on novel problems.
Larger Output Window
The output limit has been significantly upgraded to 65,000 tokens for long-form code and document generation. This means the model can now generate a full multi-module Python application, a long technical report, or a complex contract draft in a single response without hitting a ceiling.
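As a rough illustration, here is how a developer might request a long single response with the google-genai Python SDK. The model identifier below is a placeholder assumption; check Google AI Studio for the exact preview string.

```python
# Minimal sketch: requesting long-form output in one response using the
# google-genai Python SDK. The model ID is a placeholder assumption.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents="Generate a complete multi-module Python CLI app with tests.",
    config=types.GenerateContentConfig(
        max_output_tokens=65000,  # take advantage of the raised output ceiling
    ),
)
print(response.text)
```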
Improved File Handling
Key updates to file handling include a 100MB file limit (up from 20MB), direct YouTube URL support where the model watches the video via URL rather than requiring upload, and Cloud Storage bucket support for private database sources.
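A hedged sketch of the YouTube-by-URL flow, following the file_data/file_uri pattern the existing Gemini API uses for video URLs; the model identifier is a placeholder, and the exact part structure may differ in the preview.

```python
# Sketch: asking the model to analyze a YouTube video by URL, no upload needed.
# Uses the file_data/file_uri pattern from the existing Gemini API; the model
# ID is a placeholder assumption.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents=types.Content(parts=[
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Summarize the main arguments made in this video."),
    ]),
)
print(response.text)
```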
Thinking Level Control
Gemini 3.1 Pro supports three thinking levels — Low, Medium, and High — allowing developers to balance response quality, reasoning depth, latency, and cost. This is a meaningful cost-optimization lever that most competing models do not offer.
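A minimal sketch of per-request thinking control, assuming the level is exposed through the SDK's thinking configuration. The `thinking_level` field name and its string values are assumptions based on the announced Low/Medium/High tiers; earlier releases exposed a numeric thinking budget instead, so verify against the current SDK reference.

```python
# Sketch: trading latency and cost against reasoning depth per request.
# "thinking_level" and its values are assumptions based on the announced
# Low/Medium/High tiers; verify the field name in the current SDK reference.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents="Design a rollout plan for migrating a monolith to services.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_level="high",  # assumed values: "low" | "medium" | "high"
        ),
    ),
)
print(response.text)
```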
API Change Note for Developers
In the Interactions API v1beta, the field `total_reasoning_tokens` has been renamed to `total_thought_tokens`. Any existing code using the old field name will need to be updated.
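A small defensive sketch for the transition period; the response shape shown is illustrative, not the exact v1beta schema.

```python
# Sketch: tolerate both field names while clients roll over to the renamed
# field. The usage dict here is illustrative, not an exact Interactions API
# v1beta schema.
def thought_tokens(usage: dict) -> int:
    if "total_thought_tokens" in usage:      # new field name
        return usage["total_thought_tokens"]
    return usage.get("total_reasoning_tokens", 0)  # pre-rename fallback

usage = {"total_thought_tokens": 1842}
print(thought_tokens(usage))  # -> 1842
```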
Context Window and Multimodal Capabilities
| Specification | Detail |
|---|---|
| Input context window | 1,000,000 tokens |
| Output token limit | 64,000–65,000 tokens |
| Input modalities | Text, images, audio, video, PDFs, code |
| Output modalities | Text |
| Architecture base | Gemini 3 Pro |
The 1M-token context window enables use cases like full-repository code understanding, large document analysis, and multi-step agentic workflows. When paired with Google's agentic stack — Gemini API, Antigravity, and Vertex AI tools — you get a system that can hold large task graphs in memory while orchestrating multi-step workflows.
Pricing
Pricing for Gemini 3.1 Pro is unchanged from Gemini 3 Pro: $2 per million input tokens and $12 per million output tokens. The performance upgrade costs nothing extra for API users.
Here is how that stacks up against alternatives:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.2 (standard) | ~$3–5 | ~$15–20 |
Gemini 3.1 Pro offers a best-in-class price-performance ratio, leading most benchmarks at $2/$12 per 1M tokens, 2.5x cheaper than Claude Opus 4.6 on input and roughly half the price on output. Additionally, context caching offers up to 75% cost reduction on repeated contexts, which is particularly valuable for developers running long-context agentic applications.
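To make that concrete, here is a back-of-envelope cost sketch using the list prices above. It treats the 75% figure as the maximum caching discount on repeated input tokens only; actual billing may vary by platform.

```python
# Rough cost model: list prices from the table above, with the stated maximum
# 75% discount applied only to cached (repeated) input tokens.
def call_cost(input_tokens, output_tokens, cached_fraction=0.0,
              in_price=2.00, out_price=12.00, cache_discount=0.75):
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - cache_discount)) * in_price / 1e6
    return input_cost + output_tokens * out_price / 1e6

# An 800K-token repo context reused across calls, 20K-token answer each time:
print(f"${call_cost(800_000, 20_000):.2f}")                       # uncached
print(f"${call_cost(800_000, 20_000, cached_fraction=0.9):.2f}")  # 90% cached
```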
Where to Access Gemini 3.1 Pro
Google is rolling out Gemini 3.1 Pro across multiple platforms. Developers gain access via the Gemini API in Google AI Studio, Gemini CLI, the agent-based development platform Google Antigravity, and Android Studio. Enterprises can use the model via Vertex AI and Gemini Enterprise. Individual users can access it through the Gemini app and NotebookLM, with NotebookLM access limited to Pro and Ultra subscribers.
| Platform | Who It Is For |
|---|---|
| Gemini App | Consumers — select "Pro" in the model dropdown |
| Google AI Studio | Developers experimenting and prototyping |
| Gemini API | Developers building production applications |
| Vertex AI | Enterprise teams needing security and compliance |
| Antigravity IDE | Agentic coding and workflow automation |
| Gemini CLI | Terminal-based developer workflows |
| Android Studio | Mobile app development |
| NotebookLM | Research and document analysis (Pro/Ultra plans) |
In the Gemini app, 3.1 Pro is rolling out globally with higher limits for users with Google AI Pro and Ultra plans.
Real-World Use Cases
Complex Research and Data Synthesis
Gemini 3.1 Pro is designed for tasks where a simple answer is not enough, whether you need a visual explanation of a complex topic, a way to synthesize data into a single view, or a step-by-step plan for bringing an ambitious creative project to life. Upload entire research corpora, datasets, or multi-document collections and ask the model to draw cross-document conclusions.
Autonomous Software Engineering
The 80.6% score on SWE-Bench Verified places Gemini 3.1 Pro among the top-tier models for real-world software engineering tasks. Databricks CTO Hanlin Tang reported that the model achieved "best-in-class results" on OfficeQA, a benchmark for grounded reasoning across tabular and unstructured data.
3D and Creative Applications
Cartwheel co-founder Andrew Carr highlighted the model's "substantially improved understanding of 3D transformations," noting it resolved long-standing rotation order bugs in 3D animation pipelines.
No-Code and Vibe Coding
Hostinger Horizons' Head of Product observed that the model understands the "vibe" behind a prompt, translating intent into style-accurate code for non-developers. This makes it well-suited for tools that help non-engineers build functional applications.
Long-Document Processing for Business
Upload entire contracts, financial reports, or legal briefs (up to 1M tokens) and ask the model to identify specific clauses, flag inconsistencies, or produce comparative summaries. The expanded 65K output window means the model can return detailed analysis without cutting off.
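As an illustrative sketch, here is a contract-review call using the Files API; the model identifier and file name are placeholders, and the prompt wording is only an example.

```python
# Sketch: clause extraction over a large PDF. File name and model ID are
# placeholders; the output cap is raised so the analysis is not truncated.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

contract = client.files.upload(file="master_services_agreement.pdf")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview identifier
    contents=[
        contract,
        "List every indemnification and limitation-of-liability clause, "
        "flag inconsistencies between sections, and summarize renewal terms.",
    ],
    config=types.GenerateContentConfig(max_output_tokens=65000),
)
print(response.text)
```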
Agentic Task Automation
Gemini 3.1 Pro also posted gains on APEX-Agents, a benchmark that measures performance in agentic workflows where AI systems execute multi-step tasks, nearly doubling its earlier score. For developers building AI agents that browse the web, interact with databases, or coordinate multi-tool pipelines, this is a meaningful capability upgrade.
Gemini 3.1 Pro vs. the Competition: When to Use Which
| Use Case | Best Choice | Reason |
|---|---|---|
| Novel reasoning & logic | Gemini 3.1 Pro | 77.1% ARC-AGI-2, highest score |
| Agentic workflow orchestration | Gemini 3.1 Pro | APEX-Agents leader, tool coordination |
| Cost-sensitive production apps | Gemini 3.1 Pro | $2 input vs. $5 for Opus 4.6 |
| Graduate-level science tasks | Gemini 3.1 Pro | 94.3% GPQA Diamond |
| Terminal-based coding | GPT-5.3-Codex | 77.3% vs 68.5% on Terminal-Bench |
| Expert human-preference tasks | Claude Sonnet 4.6 | Leads GDPval-AA at 1633 Elo |
| Conversational quality (Arena) | Claude Opus 4.6 | Leads Arena text leaderboard |
Context: The Gemini 3 Series Timeline
Understanding where 3.1 Pro fits requires knowing how quickly this generation has moved:
| Date | Release |
|---|---|
| November 2025 | Gemini 3 Pro launched in preview |
| December 2025 | Gemini 3 Flash released |
| December 4, 2025 | Gemini 3 Deep Think rolled out to Ultra subscribers |
| February 12, 2026 | Major update to Gemini 3 Deep Think |
| February 19, 2026 | Gemini 3.1 Pro launched in preview |
This ".1" increment is a first for Google — past generations used ".5" as the mid-year model update. The ".1" naming signals a tighter, more targeted improvement cycle. Google went from Gemini 3 Pro to 3.1 Pro in roughly three months, a pace that reflects intense competitive pressure from OpenAI and Anthropic.
Limitations and What to Watch For
Gemini 3.1 Pro is currently in preview, not general availability. That matters for production deployments — preview models can change without full stability guarantees.
Because the model is tuned to be helpful, it may occasionally guess when information is missing or favor a satisfying answer over strict adherence to instructions. This behavior can be mitigated with prompting.
For information-dense or complicated graphs, tables, or charts, the model can sometimes extract information incorrectly or misinterpret the provided resources. Presenting key information as plainly as possible improves the odds of getting the output you want.
On certain coding benchmarks — particularly terminal-based and SWE-Bench Pro tasks — GPT-5.3-Codex still outperforms it. For expert task preference (as measured by human evaluators), Claude Sonnet 4.6 holds the lead. No single model wins everything.
Conclusion
Gemini 3.1 Pro is the most significant mid-cycle update Google has shipped for any Gemini generation. The jump from 31.1% to 77.1% on ARC-AGI-2 is not incremental — it represents a qualitatively different level of reasoning capability. Combined with an unchanged price, a 1M-token context window, a new 65K output limit, and distribution across every major Google platform, it makes a strong case as the default choice for cost-conscious developers building reasoning-heavy or agentic applications.
It is not the best model for every task. For expert human-preference tasks and certain coding benchmarks, Claude and GPT-5.3-Codex still lead. But on the reasoning and agentic dimensions that matter most for the next wave of AI applications, Gemini 3.1 Pro has set a new bar — and done it without raising the price.
