Reliability

A series of floating geometric score cards with green checkmarks orbiting a central AI node

Solo.io Open-Sources 'agentevals' at KubeCon — Fixing Production AI Agent Reliability

One of the persistent frustrations with AI agents in production is that nobody agrees on how to know if they’re working correctly. Solo.io is taking a shot at solving that with agentevals, an open-source project launched at KubeCon + CloudNativeCon Europe 2026 in Amsterdam. The premise is straightforward but the execution is non-trivial: continuously score your agents’ behavior against defined benchmarks, using your existing observability data, across any LLM model or framework. Not a one-time evaluation. Not a test suite that only runs before deployment. A live, ongoing signal. ...

Abstract scoring dashboard — a set of glowing gauge needles in teal and white pointing at varying levels — representing continuous behavioral evaluation of AI agents in production

Solo.io Open-Sources 'agentevals' at KubeCon — Continuous Scoring for Production AI Agents

Alongside Dapr Agents v1.0 and the CNCF AI Conformance Program updates, KubeCon Europe 2026 delivered a third piece of production AI agent infrastructure: agentevals, a new open-source project from Solo.io that brings continuous behavioral scoring to agent deployments. The problem agentevals addresses is deceptively simple to state and surprisingly hard to solve: how do you know if your production AI agent is still doing what it’s supposed to do? What agentevals Does Most AI agent evaluation today happens at development time — you run evals before deploying, decide the agent is good enough, and ship it. What happens after deployment is typically monitored through logs and user feedback, not through continuous automated assessment. ...

A single glowing node in a network diagram going dark while connected nodes flash red warning signals

Claude Hits Second Outage in 24 Hours — Developers Confront Agentic Pipeline Fragility

Anthropic’s Claude went down twice in under 24 hours this week — and the developer community’s reaction tells a story about something bigger than a couple of bad server days. The second outage hit on March 3, investigation commencing at 03:15 UTC. It followed Monday’s first disruption, which Anthropic attributed to unprecedented demand. Chat, API, and Claude Code were all affected. Developers watched their pipelines stall, their autonomous agents go quiet, and their Claude Code sessions freeze mid-task — again. ...

GitHub Engineering Blog: Why Multi-Agent AI Workflows Fail in Production (and How to Fix Them)

GitHub Engineering Blog: Why Multi-Agent AI Workflows Fail in Production (and How to Fix Them) Most multi-agent AI systems fail. Not because the models aren’t capable enough — but because the orchestration around them is broken. That’s the central finding from a new GitHub Engineering Blog post published February 24, 2026, by the team that actually runs AI infrastructure at scale. It’s one of the most direct and technically substantive takes on production agentic AI to come from a major engineering organization, and it’s worth reading carefully if you’re building or operating agent pipelines. ...