Artificial intelligence has learned to speak fluently. It writes sentences, answers questions, summarizes documents, and sometimes converses in a tone that feels strikingly human. But “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, a landmark paper by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (the byline under which Margaret Mitchell appears on the paper), challenges the excitement surrounding this fluency with a deeply uncomfortable question: Do large language models truly understand language, or are they merely giant parrots that probabilistically stitch together expressions found in massive text datasets?
Since its publication at the 2021 ACM Conference on Fairness, Accountability, and Transparency, the paper has become one of the defining texts in AI ethics. Its central argument is clear. The belief that bigger language models automatically produce better outcomes may be technically appealing, but scale also amplifies environmental costs, data bias, social discrimination, distorted research priorities, and the risks of human-like imitation. The authors do not treat the growth of large language models as an unquestionable sign of progress. Instead, they argue that the more important questions are not simply how large these models can become, but why they are being built, who benefits from them, and who may be harmed.
The first major issue raised by the paper is environmental cost. Large language models require enormous computational resources and energy. As model size and training data increase, so do carbon emissions and electricity consumption. The authors argue that this is not merely a cost borne by technology companies. The consequences of climate change often fall first and hardest on vulnerable regions and marginalized communities that benefit least from these technologies. In other words, the benefits of large English-centered language models may be concentrated among a relatively small group of corporations and users, while their environmental burdens are distributed unequally across society.
The second issue concerns training data. Large language models are trained on vast amounts of text collected from the internet. Yet the internet is not a neutral space. The people who can speak online, remain visible online, and avoid exclusion under platform rules are not evenly distributed across society. Sources such as Reddit, Wikipedia, and Common Crawl may appear broad and diverse on the surface, but they can overrepresent the perspectives of particular language communities, social classes, genders, races, and regions.
This is where the paper makes one of its most important interventions: size does not guarantee diversity. More data does not automatically produce a fairer model. On the contrary, if biased data accumulates at scale, bias may also be learned at scale. Hate speech, sexist language, racist expressions, negative portrayals of disabled people, and distorted representations of LGBTQ+ communities may not disappear inside massive datasets. They may instead be absorbed as linguistic patterns. The problem is that the model does not understand or judge these patterns. It simply reproduces what appears statistically plausible.
The third issue is the meaning of “understanding.” The paper urges caution toward claims that language models perform genuine natural language understanding. Language is not merely a sequence of symbols. It is connected to meaning, context, intention, social relations, and lived experience. But language models learn only from textual form. They do not directly experience the speaker’s intention, the listener’s state of mind, the surrounding social context, or the structure of the real world. Therefore, a fluent answer should not be mistaken for evidence of understanding.
This is where the paper introduces its most influential metaphor: the “stochastic parrot.” Like a parrot that can imitate human speech without necessarily understanding its meaning, a large language model generates language by probabilistically combining forms learned from vast datasets. The phrase has since become one of the most widely cited concepts in debates over generative AI.
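The mechanism behind the metaphor can be illustrated with a toy example. The sketch below is not the paper's method and is far simpler than a real large language model: it is a minimal bigram sampler, assuming a tiny made-up corpus, that picks each next word in proportion to how often it followed the previous word. The names `train_bigrams`, `parrot`, and `corpus` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def parrot(follows, start, length, rng):
    """Generate up to `length` words, sampling each next word by its count."""
    output = [start]
    for _ in range(length - 1):
        options = follows.get(output[-1])
        if not options:
            # The last word never had a successor in the corpus, so stop.
            break
        candidates = list(options.keys())
        weights = list(options.values())
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(output)


# Hypothetical toy corpus, used only for illustration.
corpus = "the parrot repeats the phrase and the parrot repeats the sound"
model = train_bigrams(corpus)
sample = parrot(model, "the", 8, random.Random(0))
print(sample)
```

Every word pair in the output already occurs in the corpus. The sampler recombines observed forms according to their frequency and has no access to what any of the words mean, which is the point the metaphor is making, though at a vastly smaller scale.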
The paper’s concern is not simply that language models may produce incorrect answers. The greater danger is that they can be wrong in highly persuasive ways. Humans tend to interpret fluent language as evidence of meaning and intention. As a result, the more natural and convincing a model’s output appears, the more likely users are to mistake it for the result of genuine understanding or judgment. Under these conditions, biased information, hate speech, misinformation, extremist messages, and distorted social stereotypes can spread in the form of polished and credible language.
The fourth issue is social harm. The authors argue that large language models can reinforce dominant social perspectives. Racism, sexism, ableism, and exclusionary assumptions already embedded in training data may be reproduced through model outputs. These outputs are not merely technical errors. For some people, they become experiences of insult, exclusion, or discrimination. For others, they help normalize harmful stereotypes. As language models become integrated into search engines, automated response systems, translation tools, hiring platforms, education software, and customer service systems, these harms may become increasingly structural.
The paper calls for a more cautious development culture. Instead of endlessly expanding datasets and model sizes, the authors argue that researchers and companies should document what data is collected, why it is collected, and what limitations or biases it contains. They also call for pre-development risk assessments that consider potential harms to affected communities. Environmental costs and energy use should be reported, and models should be evaluated not only by performance benchmarks, but also by efficiency, accountability, and social impact.
The strength of the paper lies in the fact that it is not a simple rejection of language technology. The authors do not deny the value of progress in natural language processing. Rather, they challenge the linear assumption that larger models, more data, and higher benchmark scores necessarily lead to better social outcomes. The paper asks AI research to move beyond performance competition and to place responsibility, sustainability, and social context at the center of technological development.
There are, of course, contested aspects of the paper. Some researchers have argued that its critique underestimated the potential of large language models. Indeed, the generative AI systems that emerged after the paper have demonstrated practical value in translation, summarization, coding, writing, education, and creative work. Yet the more widely these systems are deployed, the more relevant the paper’s warnings appear. As large language models enter everyday life, questions about data bias, misinformation, copyright, privacy, environmental cost, and accountability are no longer peripheral concerns. They are central to the future of AI governance.
Ultimately, “On the Dangers of Stochastic Parrots” reads less like a narrow technical critique and more like an early warning for the generative AI era. It asks what society may overlook when machines become fluent. When a system speaks like a human, do we too easily assign it understanding, intention, and responsibility? In the race to build ever larger models, are we failing to ask whose language is being learned, whose worldview is being amplified, and whose reality is being erased?
The paper’s most enduring message is clear. The size of an AI model is not the same as wisdom. The quantity of data is not the same as diversity. A fluent sentence is not the same as understanding. As generative AI becomes a core layer of social infrastructure, this paper continues to pose a question that remains urgent: Do we want larger language models, or do we want more responsible language technologies?