In an essay from 2023, “What is the Science of Linguistics a Science of,” Ermanno Bencivenga examines a conflict within the linguistics community between those who practice an empirical method, using data and examinations of existing language as it is deployed in the world to come to conclusions about the nature of language, and those who practice a generative method, attempting to parse the nature of language through abstraction and rational thought. This can be construed as a recurrence of the philosophical debate between rationalism and empiricism as distinct epistemologies, but Bencivenga perceives it more through a lens of values versus observations. He provides an anecdote about a discussion of advertising he has with successive classes of students, in which he asks them rhetorical questions about logical fallacies inherent in most advertisements. He notes that in recent classes his students, rather than engaging with the premise, ask “what’s wrong with it, if it works?” This is an empirical statement, but Bencivenga is contrasting it not with a rationalistic one, but with a value-based one.
The essay concludes with an argument that the generative approach to linguistics should be defended not because it produces more valid results, or because it provides superior insight into the questions at the heart of linguistics as a whole discipline, but because the process and practice of generative linguistics may help to “[cultivate]…the place of values in a world of facts.” This is unsatisfying, as it both fails to answer the essay’s titular question and rests on an unsupported assumption that a small community of generative linguists making grammatical and syntactical judgements will somehow translate into a generalized increase in the overall culture’s ability to appreciate and develop value judgements in other contexts. Though its conclusion does not satisfy the central argument around which the essay purportedly revolves, it is nonetheless insightful for the problem it incidentally diagnoses as “the overwhelming success of empirical disciplines in the modern world [making] people, and increasingly young people, less and less sensitive to the distinctiveness of values.” The distinctiveness of values, that is, from what simply “works.”
When empiricism mixes with morality, the result is usually either utilitarianism, which still requires a value judgement as its guiding assumption, or moral relativism; I have addressed the dangers I perceive in both in previous essays. The argument Bencivenga identifies in linguistics is a scaled model of a larger debate which underlies many of the core arguments, tensions, and conflicts with which modern culture is presently engaged. It is a debate around ways of thinking about the world, or perhaps, less generously, a debate about whether to think about the world. When Bencivenga identifies empiricism’s “overwhelming success” in explaining and enabling our interactions with the world, he is addressing the way its application through the scientific method enables the systematization of understanding and knowledge acquisition. Implementing systems to increase knowledge allows us to spend less time thinking about knowledge acquisition, and more time doing knowledge acquisition.
Today’s large language models further extend these notions. Despite being called “generative,” and sometimes being used for what Bencivenga would call generative linguistics, LLMs contribute to, and function based upon, the empirical worldview. The entire premise behind their functionality is training on massive datasets to derive a “solution” – in other words, to reach solutions empirically, whether the solution is an ability to respond to natural language queries or to parse the contents of an image. Sean Trott wrote a piece called “How could we know if Large Language Models understand language” in his Substack publication, The Counterfactual, in which he explores a seemingly abstract but highly consequential question riling the broader artificial intelligence community (and, to a lesser extent, that of linguistics). It is a thorny question precisely because it seems to defy empirical analysis. The empirical answer, in a majority of cases for modern LLMs, is the same one Bencivenga’s students gave: it doesn’t matter, so long as it works. When these tools make a mistake in their natural language processing, in their ability to identify things in an image, or in the numerous other tasks for which they are leveraged, though, the question becomes salient, not because of the failures themselves, but because of the way in which those failures transpire. LLMs do not make mistakes in the ways humans do, and that is what raises the question of what, if anything, these tools can really be said to “understand.”
Some “AI” tools absolutely have “understanding,” usually of some specific set of interactions. These tend to be machine learning tools, not general purpose LLMs, and they are often tailored to and deployed for a specific purpose, like the massively successful AlphaFold for protein folding. Their understanding is programmed in, deliberately developed, and in some ways can be said to exceed a human’s understanding of the same narrowly tailored topic, in scope if not in depth. General purpose LLMs give an appearance of understanding language. If you type a natural language query into one of these tools, it will, in most instances, respond productively to your query, which in human-to-human conversation we associate with understanding the question. When they fail to respond productively to your query, though, it is not in the way a human fails when they have not understood what you asked, and not merely because they are massively incentivized to never admit a lack of knowledge or an inability to answer accurately. LLMs’ failure modes frequently expose what seems to be a complete lack of understanding of the language and concepts with which they are otherwise so facile.
Does this mean they don’t understand language? It seems to, but that is in part because we understand language in a human way, and so we expect LLMs to understand language in a human way, too. LLMs aren’t trained to understand language in a human way – they are trained to perform extremely complex and robust pattern recognition and regurgitation tasks. Their failure modes reflect this, but that does not necessarily mean they do not understand. It depends on what we mean when we ask if they understand. They certainly do not understand in a human way, but that does not prohibit other forms of understanding, and if they reach the same results most of the time, some AI researchers argue that insisting they still don’t understand, simply because they don’t understand in the human way, is moving the goalposts on what is meant by understanding.
The debate continues because it is more than a matter of mere semantics: it is a matter of values. The word “understanding” is being used to examine a broader, more philosophical question of motivation, of trying to parse why LLMs respond to queries in the ways they do; but today’s LLMs, and the foreseeable future’s worth of tomorrow’s, don’t have motivation in the human sense. They are “motivated” to respond to queries in certain ways because that is what they are programmed to do, just as this text editor I’m using is “motivated” to display text when I provide certain electrical impulses through the keyboard. In this way, LLMs become a kind of uncomfortable mirror in which we can examine the culmination of the empirical thinking that declares it doesn’t matter, so long as it works.
