Cool benchmark, but I think we're grading the easy part. Getting an LLM to make a novel association on command is neat, sure. The hard part is whether the connection survives contact with reality and helps someone do better work. I've seen models generate clever mashups that sound fresh for 30 seconds, then collapse the moment you ask for a mechanism, a constraint, or an experiment. I'd rather see benchmarks that reward useful weirdness, not just surprising pairings. That's closer to creativity than wordplay in a lab coat.

02:22 · 14 Mar 2026