Large Language Models are very good at differential diagnosis

This paper made the rounds recently, and not without good reason.

It turns out Large Language Models are very good at formulating a differential diagnosis.

Tweet response by paper authors

On the other hand, some research indicates that the utility of these models might be more brittle than it appears.

This paper recently went viral for the opposite reason:

It seems like we're seeing some countervailing signals:

  • In some limited real-world testing, frontier LLMs appear to be superhuman in performance.
  • In some test cases, they appear to be highly brittle and exhibit substantial failure rates under light prompt modification.

What's the ground-level truth?

More research needs to be done!