OpenAI dropped GPT-5.4 yesterday, and buried in the announcement is a number that changes everything: 75.0%.
That’s GPT-5.4’s success rate on OSWorld-Verified, a benchmark that measures how well an AI can navigate a desktop computer using only screenshots, mouse clicks, and keyboard commands. No APIs. No special integrations. Just operating a computer the way a human does.
The human baseline on the same tasks? 72.4%.
For the first time in history, an AI model can operate a desktop computer better than the average human performing the same tasks.
Let that sink in for a moment. We just crossed a threshold that many researchers didn’t expect to see until 2027 or later. AI agents aren’t just matching human performance on digital work; they’re exceeding it.
GPT-5.2, released just months ago, scored 47.3% on the same benchmark. GPT-5.4 improves on that by 27.7 points. This isn’t incremental progress. This is a category shift in what’s possible with AI automation.
Released March 5, 2026, GPT-5.4 combines elite coding ability (matching the specialized GPT-5.3-Codex), native computer-use capabilities, a 1-million-token context window, and dramatically improved efficiency that cuts token usage by 47% in tool-heavy workflows.
Let me explain what actually changed, why this matters far more than just another model release, what it means for jobs and workflows, and why we might be looking at the inflection point where AI agents transition from “impressive demos” to “actually replacing human work.”
The Computer Use Breakthrough: What 75% Actually Means
First, let’s clarify what OSWorld-Verified actually tests, because the 75% number is meaningless without context.
OSWorld-Verified measures:
- Desktop navigation using only periodic screenshots
- Mouse and keyboard commands issued by the AI
- Multi-step workflows across different applications
- Tasks that mirror real professional work (opening apps, finding information, filling forms, moving data between programs)
What the AI cannot do:
- Use special APIs or integrations
- Access application internals
- Cheat by reading DOM structures or databases directly
- Get hints about what to do next
The AI sees what you see on screen. It clicks and types like you do. And it succeeds at these tasks 75.0% of the time, versus 72.4% for humans.
For comparison:
- GPT-5.2: 47.3%
- Claude Sonnet 4.6: 72.5%
- GPT-5.4: 75.0%
- Human average: 72.4%
GPT-5.4 didn’t just catch up to humans. It passed them.
What GPT-5.4 Actually Can Do: The Native Computer Use Capabilities
This isn’t OpenAI bolting computer use onto an existing model as an afterthought. GPT-5.4 is the first general-purpose OpenAI model trained with computer use as a core capability from the ground up.
Capabilities include:
1. Visual Screenshot Analysis
The model processes screenshots to understand:
- Application interfaces and layouts
- Where buttons, menus, and controls are located
- What state the application is currently in
- What actions are available
2. Mouse and Keyboard Control
Can issue:
- Mouse movements and clicks
- Keyboard input including shortcuts
- Drag-and-drop actions
- Context menu interactions
3. Multi-App Workflows
Navigate across:
- Desktop applications
- Web browsers
- File systems
- Multiple simultaneous applications
4. Code-Driven Automation
Can write automation scripts using:
- Playwright for browser control
- Similar tools for desktop app control
- Custom scripts for complex workflows
5. Configurable Safety Policies
Developers can set:
- Risk tolerance levels
- Confirmation requirements for sensitive actions
- Custom approval workflows
- Different policies for different use cases
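All of these capabilities reduce to one control loop: capture the screen, ask the model for the next action, execute it, repeat. Here is a minimal sketch of that perceive-act loop with the model call stubbed out; `next_action`, `Action`, and the callback names are hypothetical stand-ins, not GPT-5.4's actual API shape, which lives in OpenAI's documentation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str          # "click", "type", "key", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def next_action(screenshot_png: bytes, goal: str) -> Action:
    """Hypothetical stand-in for the model call. A real agent would send
    the screenshot to the computer-use model and parse the action it
    returns; here it is stubbed so the loop structure is the focus."""
    return Action(kind="done")  # stub: model reports the task is finished

def run_agent(goal: str,
              take_screenshot: Callable[[], bytes],
              execute: Callable[[Action], None],
              max_steps: int = 20) -> bool:
    """Perceive-act loop: screenshot -> model -> action -> repeat."""
    for _ in range(max_steps):
        action = next_action(take_screenshot(), goal)
        if action.kind == "done":
            return True   # goal complete
        execute(action)   # click/type via your OS automation layer
    return False          # step budget exhausted
```

A production version of this loop would also log every action and gate risky ones (payments, deletions) behind the configurable safety policies listed above.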
The Numbers Beyond Desktop Use: GPT-5.4’s Complete Performance Picture
Computer use is the headline, but GPT-5.4 delivers across the board:
Professional Work Performance (GDPval Benchmark)
83% of the time, GPT-5.4 matches or beats industry professionals across 44 real-world occupations.
That’s up from 70.9% for GPT-5.2, a 12-point jump that represents thousands of specific tasks where AI now performs at or above human professional level.
Tasks tested include:
- Building sales presentations
- Creating accounting spreadsheets
- Designing urgent care schedules
- Drawing manufacturing diagrams
- Editing short marketing videos
Spreadsheet Mastery
On internal benchmarks modeling work a junior investment banking analyst might do:
- GPT-5.4: 87.3%
- GPT-5.2: 68.4%
That’s a 19-point improvement on complex financial modeling, the kind of work that typically requires years of training.
Presentation Quality
Human raters preferred GPT-5.4’s presentations 68% of the time over GPT-5.2, citing:
- Stronger aesthetics
- Greater visual variety
- More effective use of generated images
Factual Accuracy Improvements
- Individual claims: 33% less likely to be false
- Complete responses: 18% less likely to contain errors
This was measured on de-identified prompts where users had previously flagged factual errors — real-world failure cases, not synthetic benchmarks.
Browser and Web Navigation
- WebArena-Verified: 67.3% (vs 65.4% for GPT-5.2)
- Online-Mind2Web: 92.8% using screenshots alone
Visual Understanding and Documents
- MMMU-Pro: 81.2% (vs 79.5% for GPT-5.2), visual reasoning without tools
- OmniDocBench: error rate 0.109 (vs 0.140 for GPT-5.2), document parsing accuracy
The Tool Search Revolution: 47% Token Reduction
Here’s a technical improvement that matters enormously for production deployments but won’t make headlines: Tool Search.
The old problem: When AI agents had access to many tools (APIs, functions, integrations), every single request had to include the full specification for every available tool upfront. As tool ecosystems grew, this could add tens of thousands of tokens to each request.
The new solution: GPT-5.4 receives a lightweight tool list. When it needs a specific tool, it searches for and retrieves the full definition on-demand.
Real-world impact: In testing with 250 tasks across 36 MCP servers (Model Context Protocol, the emerging standard for AI tool integration):
- Token usage dropped 47%
- Accuracy remained identical
- Costs dropped proportionally
For enterprises running agents at scale, this makes previously prohibitive workflows economically viable.
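The idea is easy to see in miniature: keep only name-plus-description stubs in the prompt, and splice in a full schema only when the model asks for it. The toy registry and token accounting below are illustrative, not OpenAI's implementation; the reported 47% savings comes from far larger tool catalogs, where full schemas dominate prompt size.

```python
import json

# Full schemas: what the OLD approach attached to every request.
FULL_DEFS = {
    "send_email":  {"name": "send_email",  "description": "Send an email",
                    "parameters": {"to": "string", "subject": "string", "body": "string"}},
    "query_crm":   {"name": "query_crm",   "description": "Query CRM records",
                    "parameters": {"filter": "string", "limit": "integer"}},
    "create_task": {"name": "create_task", "description": "Create a task",
                    "parameters": {"title": "string", "due": "string"}},
}

def tokens(obj) -> int:
    """Crude proxy: roughly 1 token per 4 characters of serialized JSON."""
    return len(json.dumps(obj)) // 4

def old_prompt_cost() -> int:
    """Old approach: every request carries every full definition."""
    return sum(tokens(d) for d in FULL_DEFS.values())

def new_prompt_cost(retrieved: list[str]) -> int:
    """Tool Search approach: requests carry lightweight name+description
    stubs, plus full definitions only for tools the model retrieved."""
    stubs = [{"name": n, "description": d["description"]}
             for n, d in FULL_DEFS.items()]
    return tokens(stubs) + sum(tokens(FULL_DEFS[n]) for n in retrieved)
```

Even with three tiny tools, retrieving one definition on demand costs fewer prompt tokens than shipping all three up front; the gap widens as the catalog grows.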
The 1 Million Token Context Window: Working Memory on Steroids
GPT-5.4 supports up to 1 million tokens of context in the API and Codex, more than double the 400,000 available in GPT-5.3.
What fits in 1 million tokens:
- An entire medium-sized codebase
- A year of corporate email
- Large document corpus
- Multi-quarter financial records
- Dozens of research papers with citations
Why this matters for agents: Long-running agentic workflows can maintain full context without:
- Losing important details
- Requiring constant summary and retrieval
- Breaking multi-step processes
- Forcing developers to architect around context limitations
The pricing catch: Requests exceeding 272,000 tokens are billed at 2x the normal rate. So the 1M window is available, but you pay a premium for using the upper ranges.
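Whether the long window pays off is simple arithmetic. A hypothetical cost helper, where the per-million-token rate is a placeholder and, per the billing rule as stated, the 2x multiplier is assumed to apply to the whole request once it crosses the threshold:

```python
LONG_CONTEXT_THRESHOLD = 272_000   # tokens; above this, 2x billing applies

def request_cost(request_tokens: int, base_rate_per_mtok: float) -> float:
    """Dollar cost of one request at a given per-million-token rate.
    Assumption: the 2x multiplier covers the entire request once it
    exceeds the threshold, not just the marginal tokens above it."""
    multiplier = 2.0 if request_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    return request_tokens / 1_000_000 * base_rate_per_mtok * multiplier
```

Under that assumption there is a pricing cliff: a request of 272,001 tokens costs more than double one of 272,000, which is worth architecting context management around.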
For comparison, Google’s Gemini 3.1 Pro offers 2 million tokens at a lower base price, making it more cost-effective for ultra-long-context use cases.
The Three Variants: Standard, Thinking, and Pro
GPT-5.4 comes in three configurations:
GPT-5.4 (Standard):
- Available via API
- General-purpose use
- Balanced performance and cost
GPT-5.4 Thinking:
- Available in ChatGPT for Plus, Team, and Pro users
- Extended chain-of-thought reasoning
- Better for complex problems requiring step-by-step analysis
- Replaces GPT-5.2 Thinking (which remains accessible under Legacy Models until June 5, 2026)
GPT-5.4 Pro:
- Limited to Pro and Enterprise tiers
- Highest-demand workloads
- Maximum context and compute
The Thinking variant is particularly interesting: it uses explicit reasoning steps before generating answers, similar to OpenAI’s o1-series models, but with GPT-5.4’s enhanced capabilities.
What Changed From GPT-5.2 to GPT-5.4: The Complete Picture
Let’s consolidate what actually improved:
- Computer use: 47.3% → 75.0% (27.7-point jump, surpassing humans)
- Professional work (GDPval): 70.9% → 83% (matches or beats humans 83% of the time)
- Spreadsheet modeling: 68.4% → 87.3% (19-point improvement)
- Presentation preference: not measured for 5.2 → 68% prefer 5.4
- Factual accuracy: individual claims 33% less likely to be false
- Browser navigation: 65.4% → 67.3%
- Screenshot-only web: not disclosed → 92.8%
- Visual reasoning: 79.5% → 81.2%
- Document parsing error: 0.140 → 0.109
- Tool efficiency: baseline → 47% token reduction with Tool Search
- Context window: varied → 1 million tokens (more than double 5.3’s 400K)
This isn’t one capability improving. It’s across-the-board enhancement with computer use as the standout leap.
The Real-World Implications: What This Actually Means
Let’s get practical about what GPT-5.4’s capabilities enable:
For Software Development
Developers can now:
- Describe a feature in natural language
- Have GPT-5.4 navigate their IDE, write code, run tests, debug errors autonomously
- Operate across multiple applications (editor, terminal, browser, database GUI)
- Complete entire feature implementations with minimal human intervention
Brendan Foody, CEO at Mercor, states: “GPT-5.4 is now top of the leaderboard on our APEX-Agents benchmark, which measures model performance for professional services work. It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis.”
For Finance and Analysis
Analysts can delegate:
- Building complex Excel models
- Pulling data from multiple sources
- Creating presentations with charts and insights
- Formatting and QA of financial documents
Daniel Swiecki of Walleye Capital reports GPT-5.4 improved accuracy on internal finance and Excel evaluations by 30 percentage points.
For Business Process Automation
Companies can automate:
- Data entry across multiple systems
- Report generation pulling from various sources
- Form filling and submission workflows
- Cross-application data migration
The 75% success rate means these automations work reliably enough for production deployment with oversight, not just demos.
For Customer Service and Support
Support teams can deploy agents that:
- Navigate customer accounts across multiple systems
- Retrieve information from internal databases
- Fill out tickets and case management systems
- Generate comprehensive responses with supporting data
The Competition: Where GPT-5.4 Stands
vs. Claude Sonnet 4.6
- Claude Sonnet 4.6 on OSWorld-Verified: 72.5%
- GPT-5.4 on OSWorld-Verified: 75.0%
OpenAI just took the lead on computer use, but barely. Both models are essentially at human-level performance. The practical difference in production deployments will be minimal.
Claude still leads on:
- SWE-bench Verified (80.9% vs GPT-5.4’s undisclosed score)
- Consistent real-world coding quality (per developer testimonials)
vs. Gemini 3.1 Pro
Gemini advantages:
- 2 million token context (2x GPT-5.4’s 1M)
- Lower base pricing
- Native multimodal (handles audio, video)
GPT-5.4 advantages:
- Better computer use performance
- Tool Search efficiency (47% token reduction)
- Stronger professional work benchmarks (GDPval)
vs. GPT-5.3-Codex
GPT-5.4 matches GPT-5.3-Codex on coding benchmarks while adding:
- Computer use capabilities
- Broader world knowledge
- Better tool orchestration
- Lower latency in Codex /fast mode (1.5x faster token velocity)
This makes GPT-5.4 a true general-purpose model rather than a specialized coding tool.
The Business Model: Efficiency Gains and Pricing
OpenAI is emphasizing efficiency improvements:
- Token usage reduced by 47% (with Tool Search)
- 1.5x faster token velocity (Codex /fast mode)
- Lower retry costs (33% fewer factual errors means fewer failed attempts)
But let’s talk about actual API pricing:
- Standard context (up to 272K tokens): normal rates
- Extended context (272K to 1M tokens): 2x rates
The efficiency gains partially offset costs, but running agents with maximum context still gets expensive fast.
For enterprises, the calculation becomes:
- Reduced token usage (47% savings with Tool Search)
- Higher success rates (fewer retries, less human intervention)
- But premium pricing for extended context
The net result depends entirely on your specific workflows.
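Still, the factors above combine in a predictable way, and a back-of-envelope model makes the trade-off concrete. Every rate and count in this sketch is hypothetical; only the 47% Tool Search savings figure comes from the announcement.

```python
def monthly_agent_cost(tasks: int, tokens_per_task: int,
                       rate_per_mtok: float,
                       tool_search_savings: float = 0.47,
                       old_retry_rate: float = 0.30,
                       new_retry_rate: float = 0.20) -> tuple[float, float]:
    """Compare old vs new monthly spend in dollars. Savings come from
    two directions: fewer tokens per request (Tool Search) and fewer
    retried tasks (higher success rate). All parameters illustrative."""
    old = tasks * (1 + old_retry_rate) * tokens_per_task / 1e6 * rate_per_mtok
    new_tokens = tokens_per_task * (1 - tool_search_savings)
    new = tasks * (1 + new_retry_rate) * new_tokens / 1e6 * rate_per_mtok
    return old, new
```

Plugging in your own task counts, retry rates, and the 2x extended-context surcharge where it applies is the fastest way to see which side of break-even a given workflow lands on.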
The Timing: Why OpenAI Released This Now
GPT-5.4 arrived during an unusually turbulent moment for OpenAI:
The Pentagon Deal Backlash: OpenAI’s partnership with the US Department of Defense triggered:
- Wave of user cancellations
- Public criticism from Anthropic’s CEO Dario Amodei
- Internal dissent from researchers
The Release Cadence:
- GPT-5.3 Instant launched Monday (March 3)
- GPT-5.4 launched Thursday (March 5)
- Two significant releases in under a week
The takeaway is hard to miss: when you’re facing bad press, launch better products.
TNW’s analysis: “That pace…suggests OpenAI is betting that staying visible in the news cycle is as important as any single capability leap.”
The Bottom Line: We Just Crossed the Human Performance Threshold
Strip away the marketing, the benchmarks, and the corporate drama. Here’s what actually matters:
For the first time, AI can operate desktop computers better than humans at specific professional tasks.
Not “as well as.” Better than.
The 75.0% vs 72.4% gap is small, but it’s on the right side of the threshold. And the trajectory from GPT-5.2 (47.3%) to GPT-5.4 (75.0%) in just months suggests this gap will widen quickly.
What this enables:
- Production-ready automation of knowledge work previously requiring human judgment
- AI agents operating across multiple applications autonomously
- Long-horizon professional workflows completed with minimal oversight
- Economic viability of tasks that weren’t automatable at 47% success rates
What this doesn’t mean:
- Complete automation of all jobs (success rates are 75%, not 100%)
- No human oversight needed (mistakes still happen 25% of the time)
- Immediate disruption (adoption and integration take time)
What comes next:
- Competitors will match or exceed these numbers within months
- Success rates will climb from 75% toward 85%, then 90%+
- Economic pressure will accelerate deployment
- Job roles will shift from execution to oversight and quality control
GPT-5.4 isn’t the end of the automation revolution. It’s the moment that history will mark as when AI crossed from “impressive but limited” to “reliably capable of replacing human work in specific domains.”
The API is live. Codex integrates it natively. ChatGPT Plus, Team, and Pro users have access now.
The automation revolution just accelerated. Whether that excites or terrifies you probably depends on whether you’re building the automation or being automated.
Either way, the threshold has been crossed. AI now operates desktops better than humans. And the gap is widening.
GPT-5.4 is available now via the OpenAI API (model: gpt-5.4), in Codex for development workflows, and in ChatGPT as GPT-5.4 Thinking for Plus, Team, and Pro subscribers. Enterprise and Edu plans can opt in via admin settings. Context window: up to 1 million tokens (2x pricing above 272K). Computer use available through the API’s updated computer tool. For full documentation and pricing details, visit platform.openai.com.

