Essential insights from Hacker News discussions

All Souls exam questions and the limits of machine reasoning

This Hacker News discussion centers on whether Large Language Models (LLMs) can produce creative and insightful writing, comparing their output to human performance on challenging intellectual tasks; it also touches on the human inclination towards tradition and ritual.

LLMs and the Challenge of Originality/Individuality

A central theme is the perceived lack of originality and individual voice in LLM-generated content. Several users suggest that LLMs, because they aggregate vast amounts of data, tend to produce output that sounds "average" or bland.

SamBam posits that to be interesting, writing needs to be from an individual's standpoint, and LLMs, by amalgamating all text, inevitably sound like an average: "because you can't amalgamate all the text in the world and not sound like an average."

wjnc offers a counterpoint, suggesting that "People are average on average" and that testing LLMs against "super human tests" might be unfair: "OP is measuring LLM succes based on a super human test which most of us would likely fail." They also suggest that "Creativity is just longer context and opinionated prompting."

dmurray echoes the sentiment that LLMs are often bland because "we value blandness," noting that corporations pay for inoffensive corporate styles, not for the originality of figures like Montaigne or Swift.

  • "The chat bots we have today are bland because we value blandness. Customers are willing to pay for the inoffensive corporate style that can replace 90% of their employees at writing." - dmurray

Conversely, a specific prompt might elicit better results, as suggested by dmurray: "If it was told to write an idiosyncratic, opinionated essay, and perhaps given a suitable source material - "you are Harry Potter" but someone less well known but still with a million words of backstory - couldn't it do it?"

The All Souls Exam as a Benchmark for Intellectual Prowess

The All Souls College exams, particularly their essay prompts, are frequently mentioned as a high bar for human intellectual and writing ability. The difficulty of these exams serves as a point of comparison for LLM capabilities.

hydrogen7800 expresses a desire to read essays from test-takers: "I would love to read some real essays written by test takers. Any pointers?"

decimalenough clarifies that these essays are not publicly available: "They're written in pencil and not returned, so nobody (except All Souls staff) has access to them."

lordnacho, who attended the Mallard Song ritual, also comments on the LLM's performance on the All Souls test: "As for LLMs on the All Souls test, it's predictable that it mostly whiffs. After all it takes in a diet of Reddit+Wikipedia+etc, none of which is the kind of writing they are looking for." They elaborate on why Wikipedia and Reddit are insufficient training data for the nuanced, cross-disciplinary thinking required: "Wikipedia is a great reference work, but it tends to not have any of the kinds of connections you're supposed to make in these essays," adding that "Reddit is a lot of crappy comments."

dash2 tests ChatGPT 5 with a prompt requiring the "gluing together" of ideas, specifically asking "Why should cultural historians care about ice cores?". They find the LLM does a "pretty good job summarizing an abstruse, but known, subfield of frontier research" but "clearly lacks 'depth', in the sense of deep thinking about the why and how of this." They note the output is in bullet points, not an essay, and even a request for a 1000-word essay is "an essay in form, but secretly a bunch of bullet points." Comparing it to a Guardian article, dash2 observes that the LLM "does a good job at explaining why they should, though not in a deep essayistic style."

dmurray agrees that the LLM's sample answers to the "Water" prompt might place it in the top 10% of literate adults, but that it would likely be the "worst candidate in the All Souls exams, because those obviously select for people who are interested in writing essays of this sort."

andy99 summarizes a key point regarding LLMs: "LLMs suck at natural writing, particularly long form. Or more abstractly they don't have complex original ideas, so can't do anything that requires this."

The Human Affinity for Tradition and Ritual

A secondary but recurring theme is the human tendency towards tradition and ritual, particularly in a British context. The discussion touches on the seemingly nonsensical nature of some long-standing traditions.

The mention of All Souls College's "Mallard Song" ritual, occurring once a century, sparks commentary on British traditions.

andyjohnson0 notes the prevalence of such rituals in the UK: "You can't walk for more than five minutes in the UK without tripping over some nonsense like this. History is very important, and traditon has its place, but really? As a brit I find it all kind of tediously performative sometimes."

xg15 relates this to Terry Pratchett's fictional "ritual of the Other Jacket."

andyjohnson0 then brings up the King's Remembrancer and the Quit Rent Ceremony as further examples of enduring, albeit strange, traditions: "See also; the King's Remembrancer and the Quit Rent Ceremony and the Trial of the Pyx." They express wonder at the longevity of these practices: "It is truly strange how my country can create a political and cultural operating system that allows this stuff to just go on and on for almost 800 years, right up to now."

xg15 admires the "stamina for that" regarding the King's Remembrancer ceremony where a jury counts gold coins.

lordnacho attended the latest Mallard Song, describing it as "a bunch of weirdos in a courtyard," but acknowledges the uniqueness of a once-in-a-century event and the inherent change rituals undergo over time: "It looked like a bunch of weirdos in a courtyard to me, but it was a literally once-in-a-century event, and I was living less than a minute away, so why not? I don't think I've ever heard of a scheduled ritual that has a longer period."

Evolving Benchmarks of Intelligence and the Turing Test

The discussion also touches on the shifting goalposts for evaluating intelligence, particularly in the context of LLMs and the Turing Test.

munchler observes a trend of moving away from the Turing Test: "A few years ago, the Turing Test was universally seen as sufficient for identifying intelligence. Now we’re scouring the planet for obscure tests to make us feel superior again." They suggest that the goalposts have been moved, even if the Turing Test's adequacy was always debatable.

altruios considers the limitations of the Turing Test in light of LLM hallucinations and suggests an "infinite" Turing Test where LLMs eventually degrade. They reflect on the nature of testing, comparing it to human limitations in perceiving things like UV or infrared light: "This ignores such ideas of probing the LLM's weak spots. Since they do not 'see' their input as characters, and instead as tokens... But the above approach is not in the spirit of the Turing test, as that only points out a blind spot in their perception, like how a human would have to guess a bit at what things would look like if UV and infrared were added to our visual field..."