Essential insights from Hacker News discussions

Chemical knowledge and reasoning of large language models vs. chemist expertise

Here's a breakdown of the key themes in the Hacker News discussion, supported by direct quotes:

Skepticism towards the Practical Application of AI in Industrial Settings

Several comments express skepticism about the real-world applicability of AI, particularly in industrial settings, emphasizing the importance of hands-on experience and current knowledge.

  • fuzzfactor states: "If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)" and "Same thing with the industrial environment, some people have just been away from it for too long regardless of how much familiarity they once had. You need to brush up, sometimes the same plant is like a whole different world if you haven't been back in a while." This highlights a concern that academic or outdated knowledge may not translate well to practical solutions.
  • mistrial9 injects a note of potential corporate apprehension, saying "BASF Group - will they speak in public? probably not, given what is at stake IMHO". This suggests a potential reluctance from industry leaders to publicly discuss their experiences and stakes in AI, possibly due to competitive pressures or concerns about revealing proprietary information.

LLMs: Breadth vs. Depth of Knowledge

A recurring theme revolves around the breadth and depth of knowledge possessed by LLMs compared to human experts, with some acknowledging the impressive breadth but questioning the depth of understanding.

  • calibas argues: "I'm sure an LLM knows more about computer science than a human programmer... It can be trained with huge amounts of generalized knowledge far faster than a human can learn." They further illustrate this by saying, "Do you know every common programing language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages."
  • However, calibas immediately caveats: "It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment."
  • esafak counters this by stating: "But the LLM can already connect things that you can not, by virtue of its breadth. Some may disagree, but I think it will soon go deeper too."
  • yMEyUyNE1 questions whether LLMs truly understand the information they process: "But do they understand it? I mean, A child used swear words, but does it understand the meaning of the swear words. In other comment, somebodies OH also mentioned about artistic abilities and utility of the words spoken." This analogy emphasizes the difference between knowing something and truly understanding its implications and meaning.
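For readers unfamiliar with the esoteric languages calibas names, FRACTRAN is a good illustration of why they appear in LLM training data discussions: a program is just a list of fractions, and the entire machine state is one positive integer. A minimal interpreter sketch (the `fractran` helper and the one-fraction adder example are illustrative, not from the discussion):

```python
from fractions import Fraction

def fractran(program, n, max_steps=10_000):
    """Run a FRACTRAN program: the state is a single positive integer n.
    Each step multiplies n by the first fraction that yields an integer;
    the program halts when no fraction applies (or max_steps is hit)."""
    for _ in range(max_steps):
        for f in program:
            m = n * f
            if m.denominator == 1:   # n * f is an integer: take the step
                n = int(m)
                break
        else:
            break                    # no fraction applied: halt
    return n

# The one-fraction program [3/2] adds exponents: 2^a * 3^b -> 3^(a+b).
# Starting from 2^3 * 3^4, it halts at 3^7.
result = fractran([Fraction(3, 2)], 2**3 * 3**4)
# result == 3**7 == 2187
```

The point calibas is making is exactly this kind of encoding trick: writing even simple arithmetic in such a language requires mapping the problem onto prime-factor exponents, which is the sort of obscure knowledge an LLM can recall but few human programmers bother to learn.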

The Importance of Practical Application and Tool Usage

Many comments emphasize that the real value comes from understanding a tool's limitations and using it efficiently.

  • mumbisChungo summarizes by saying "It's impressive until you realize its limitations. Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations." There's an understanding that the real power of LLMs lies in using them as tools.
  • anthk points out "So impressive that every complex SUBLEQ code I've tried with an LLM failed really fast.", showing that an LLM's ability degrades quickly as task complexity increases.
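SUBLEQ makes a good stress test precisely because it is a one-instruction language: every program is a flat list of numbers, and all control flow and arithmetic must be built from a single "subtract and branch if ≤ 0" operation. A minimal interpreter sketch, assuming the common convention (each instruction is three cells a, b, c; execute mem[b] -= mem[a] and jump to c if the result is ≤ 0; a negative jump target halts) — the example program is illustrative, not from the discussion:

```python
def subleq(mem, pc=0):
    """One-instruction machine: each instruction is three cells (a, b, c).
    Executes mem[b] -= mem[a]; jumps to c if the result is <= 0,
    otherwise falls through to the next instruction. A jump target
    outside memory (e.g. -1) halts the machine."""
    while 0 <= pc < len(mem):
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3

# Adds mem[9] into mem[10] using the scratch cell mem[11] (Z):
#   Z -= a   (Z becomes -a)
#   b -= Z   (i.e. b += a)
#   Z -= Z   (clear Z; result is 0, so jump to -1: halt)
program = [9, 11, 3,   11, 10, 6,   11, 11, -1,   5, 7, 0]
subleq(program)
# program[10] is now 12 (5 + 7)
```

Even this three-instruction addition routine requires juggling absolute memory addresses and branch targets by hand, which hints at why "complex SUBLEQ code" degrades an LLM's output so quickly.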

The Validity and Timeliness of Academic Research on AI

There is discussion about the relevance and timeliness of academic research in the fast-evolving field of AI, particularly concerning the publication delays in peer-reviewed journals.

  • 6LLvveMx2koXfwn observes the dates associated with a specific paper: "Received 01 April 2024, Accepted 26 March 2025, Published 20 May 2025" and comments that this "shows the built in obsolescence of the peer review journal article model in such a fast moving field."
  • Jimmc414 suggests an alternative: "shows the value of preprint servers like arxiv.org and chemrxiv.org".
  • eesmith defends the value of the research, stating: "the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluated models in general." However, they acknowledge that the models tested are already "multiple generations behind high-end versions" based on bufferoverflow's comment.
  • eesmith emphasizes that the value lies in the benchmark itself: "That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks."
  • However, rotis comments "Yes, this paper and many others will be forgotten as soon as they leave the front page. Afterwards noone refers to articles like these here. People just talk about anecdotes and personal experiences. Not that I think this is bad."

Concerns about the Human Comparison in the Benchmark

Several commenters raise concerns about the validity of comparing LLM performance to human performance in the discussed benchmark, particularly regarding the expertise and experience of the human participants.

  • pu_pe critiques the selection of human experts, stating: "Nice benchmark but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise to a sample of programmers with less than 5 years of experience."
  • pu_pe further questions the fairness of the comparison, saying: "If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems." In other words, the human participants appear to have been quizzed outside their areas of expertise.
  • KSteffensen suggests that the degree might not indicate expertise: "I'll get some downvotes for this but PhD vs master's degree difference is mostly work experience, an element of workload hazing and snobbery. Somebody with a masters degree and 5 years of work experience will likely know more than a freshly graduated PhD"
  • eesmith responds to that idea: "Sure, but all we know is that these "13 have a master's degree (and are currently enroled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry." How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?"