Huawei’s New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail – Decrypt
In brief
Researchers from Huawei and three partner institutions released Claw-Anything, a benchmark that evaluates AI agents on personal-assistant tasks.
GPT-5.5, OpenAI’s flagship model, scored only 34.5% on the pass@1 metric, far below its scores on existing benchmarks, suggesting current tests are measuring the wrong things.
The team also released an automated data pipeline that produced...