Paper Reading | Cheating Popular LLM Benchmarks

Why popular LLM leaderboards can be gamed with structured outputs, how the cheating strategy works, and what this implies for the reliability of automatic evaluation.