Discussion about this post

Carsten Bergenholtz

Fantastic piece. I have had the impression for a while that most people don't realize how standardized the tasks most benchmarks rely on actually are - arguably more standardized than the tasks many knowledge workers face in their working lives.

A related point from the management and economics literature: several highly cited studies have shown that GenAI yields performance gains for users. However, these tend to be based on relatively well-defined tasks, where goals and success criteria are clear. In our AMLE paper (https://journals.aom.org/doi/10.5465/amle.2025.0029) we set out to investigate what happens when participants face a more open, ill-defined (messy) task under time pressure: lower performers improved with ChatGPT-4 while higher performers declined, producing an equalizing effect rather than a democratizing one.

For weaker performers, the tool reduced cognitive burden by providing structure and plausible content - if you don't know anything, getting some relevant information is useful. For higher performers, the tool often did the opposite: it created extra material to monitor, evaluate, and integrate under time pressure, which disrupted their normal analytical workflow. In that sense, what looks like a supportive scaffold can become a cognitive overload for stronger performers. Thus, in messier tasks, the bottleneck is not just what the model can produce, but whether humans can monitor and orchestrate its output (at least until the model can do the job completely!).
