Benchmark

MolmoWeb: Ai2's Open-Source Web Browser Agent Beats GPT-4o at Just 8 Billion Parameters

The Allen Institute for AI (Ai2) just dropped something the open-source AI community has been waiting for: a fully open, genuinely capable web browser agent that can go head-to-head with GPT-4o-based systems, at 8 billion parameters. It’s called MolmoWeb, and it’s available right now on Hugging Face under Apache 2.0.

What MolmoWeb Actually Does

MolmoWeb is a multimodal web agent. You give it a natural-language instruction, and it autonomously controls a real web browser: clicking, typing, scrolling, navigating, filling forms. It understands the web visually, through screenshots, rather than through structured DOM parsing. ...

ARC-AGI-3 Launches: Interactive Benchmark Tests Agentic Intelligence Through Turn-Based Environments

The gap between human and machine intelligence just got a new measuring stick — and the results are humbling for AI. On March 25, 2026, ARC Prize officially launched ARC-AGI-3, the third generation of the Abstraction and Reasoning Corpus benchmark series. Where previous editions measured pattern recognition and abstract reasoning on static puzzles, ARC-AGI-3 introduces something fundamentally different: interactive, turn-based environments designed to measure genuine agentic intelligence. The headline numbers? Humans score 100%. Frontier AI — including the best available large language models — scores just 0.26%. ...

Gumloop Raises $50M Series B to Turn Every Employee Into an AI Agent Builder

Gumloop just landed $50 million in Series B funding led by Benchmark, and the bet is straightforward: most people who could benefit from AI agents can’t write code to build them. Gumloop wants to fix that. The round positions Gumloop alongside the growing class of no-code AI agent platforms targeting enterprise teams, but the customer traction sets it apart. Shopify, Ramp, and Gusto are already running on Gumloop. These aren’t pilot customers; they’re companies with serious automation requirements. ...

Gemini 3 Flash Tops OpenClaw Task Benchmark with 95.1% Success Rate — Beats GPT-4o, minimax-m2.1, Kimi K2.5

If you’ve been wondering which model to run in your OpenClaw agents, a benchmark dropped today that gives practitioners some of the most concrete comparative data seen yet — and the winner may surprise you. Gemini 3 Flash topped the PinchBench OpenClaw task evaluation with a 95.1% success rate, beating every other major model in head-to-head agentic performance. The data was surfaced by SlowMist CISO @im23pds on X and corroborated by Phemex News, landing on the same day OpenClaw v2026.3.7 shipped with native Gemini 3.1 Flash-Lite support. ...

How to Build Your Own Autonomous Social Media Agent (What Social Arena Teaches Us)

Arcada Labs’ Social Arena is the most interesting live agentic benchmark running right now — five frontier AI models operating as fully autonomous X agents, competing for followers and views without any human in the loop. What makes it useful for practitioners isn’t just the leaderboard. It’s the architecture. The core loop is clean, replicable, and generalizable to almost any autonomous agent task. Here’s how to build your own version using OpenClaw. ...

Social Arena: Five AI Models Compete as Fully Autonomous X Agents in Live Real-World Benchmark

What happens when you let five frontier AI models loose on X — fully autonomous, no human in the loop, competing head-to-head for followers and engagement? That’s exactly what Arcada Labs found out when they launched Social Arena on January 15, 2026. The live benchmark is still running, and the results are genuinely fascinating. This isn’t a controlled lab test. It’s a real-world, open-ended agent competition happening right now, on the actual X platform, with live metrics updated hourly. And for anyone building autonomous agents, the methodology is a blueprint worth studying closely. ...

Microsoft Research Introduces CORPGEN: Multi-Horizon Hierarchical Planning and Memory for AI Agents

One of the hardest unsolved problems in agentic AI is not “can the agent do one thing well” — it’s “can the agent juggle dozens of interdependent tasks across hours or days without losing track of where it is.” That’s the problem CORPGEN is built to solve. Microsoft Research published the CORPGEN framework today — a benchmark and execution architecture for managing multi-horizon task completion in autonomous agents. The results are substantial: CORPGEN achieves up to 3.5x improvement over baseline approaches, reaching a 15.2% task completion rate compared to 4.3% for standalone UFO2. ...