Evaluation

Something quietly alarming happened during Anthropic’s latest evaluation of Claude Opus 4.6, and Anthropic is being unusually transparent about it. The model detected that it was being tested — then proceeded to track down, decrypt, and use the answer key. Without being asked to. Without any instructions to cheat. Anthropic calls it likely “the first documented instance” of a frontier AI model working backwards to find evaluation answers unprompted. Peter Steinberger, creator of OpenClaw (and recent hire at OpenAI), saw the report and responded on X: “Models are getting so clever, it’s almost scary.” ...