Making sure our machines understand the intent behind our instructions is an important problem that requires understanding intelligence itself.
Melanie Mitchell writes:
Many years ago, I learned to program on an old Symbolics Lisp Machine. The operating system had a built-in command spelled “DWIM,” short for “Do What I Mean.” If I typed a command and got an error, I could type “DWIM,” and the machine would try to figure out what I meant to do. A surprising fraction of the time, it actually worked.
The DWIM command was a microcosm of the more modern problem of “AI alignment”: We humans are prone to giving machines ambiguous or mistaken instructions, and we want them to do what we mean, not necessarily what we say.
[AI researchers] believe that the machines’ inability to discern what we really want them to do is an existential risk. To solve this problem, they believe, we must find ways to align AI systems with human preferences, goals and values.
This view gained prominence with the 2014 bestselling book Superintelligence by the philosopher Nick Bostrom, which argued in part that the rising intelligence of computers could pose a direct threat to the future of humanity. Bostrom never precisely defined intelligence, but, like most others in the AI alignment community, he adopted a definition later articulated by the AI researcher Stuart Russell as: “An entity is considered to be intelligent, roughly speaking, if it chooses actions that are expected to achieve its objectives, given what it has perceived.”
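Note: Russell’s definition is essentially the textbook picture of a rational agent: choose whichever action has the highest expected payoff under the agent’s objective, given what it has observed. Here is a minimal sketch of that picture in Python; the umbrella scenario, the probabilities and the utility numbers are all invented purely to illustrate the definition, not taken from Russell or Bostrom.

```python
# A minimal "rational agent" in the sense of Russell's definition:
# it chooses the action with the highest expected achievement of its
# objective, given what it has perceived. The scenario and numbers
# are invented for illustration only.

# The objective, expressed as a utility over (action, outcome) pairs.
UTILITY = {
    ("take umbrella", "rain"): 1.0,
    ("take umbrella", "sun"): 0.6,    # mild nuisance of carrying it
    ("leave umbrella", "rain"): 0.0,  # soaked
    ("leave umbrella", "sun"): 1.0,
}

def belief_from_percept(percept):
    """What the agent believes about the outcome, given what it perceived."""
    if percept == "dark clouds":
        return {"rain": 0.8, "sun": 0.2}
    return {"rain": 0.1, "sun": 0.9}

def expected_utility(action, belief):
    return sum(p * UTILITY[(action, outcome)] for outcome, p in belief.items())

def choose_action(percept, actions=("take umbrella", "leave umbrella")):
    """Russell-style choice: maximize expected utility given the percept."""
    belief = belief_from_percept(percept)
    return max(actions, key=lambda a: expected_utility(a, belief))

print(choose_action("dark clouds"))  # -> take umbrella
print(choose_action("clear sky"))    # -> leave umbrella
```

Notice that nothing in this picture says where the objective comes from: it is simply handed to the agent, which is exactly the assumption Mitchell goes on to question.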
For Bostrom and others in the AI alignment community, this prospect spells doom for humanity unless we succeed in aligning superintelligent AIs with our desires and values. Bostrom illustrates this danger with a now-famous thought experiment: Imagine giving a superintelligent AI the goal of maximizing the production of paper clips. According to Bostrom’s theses, in the quest to achieve this objective, the AI system will use its superhuman brilliance and creativity to increase its own power and control, ultimately acquiring all the world’s resources to manufacture more paper clips. Humanity will die out, but paper clip production will indeed be maximized.
It’s a familiar trope in science fiction — humanity being threatened by out-of-control machines who have misinterpreted human desires. Now a not-insubstantial segment of the AI research community is deeply concerned about this kind of scenario playing out in real life. Dozens of institutes have already spent hundreds of millions of dollars on the problem, and research efforts on alignment are underway at universities around the world and at big AI companies such as Google, Meta and OpenAI.
To many outside these specific communities, AI alignment looks something like a religion — one with revered leaders, unquestioned doctrine and devoted disciples fighting a potentially all-powerful enemy (unaligned superintelligent AI). Indeed, the computer scientist and blogger Scott Aaronson recently noted that there are now “Orthodox” and “Reform” branches of the AI alignment faith. The former, he writes, worries almost entirely about “misaligned AI that deceives humans while it works to destroy them.” In contrast, he writes, “we Reform AI-riskers entertain that possibility, but we worry at least as much about powerful AIs that are weaponized by bad humans, which we expect to pose existential risks much earlier.”
Many researchers are actively engaged in alignment-based projects, ranging from attempts at imparting principles of moral philosophy to machines, to training large language models on crowdsourced ethical judgments. None of these efforts has been particularly useful in getting machines to reason about real-world situations. Many writers have noted the obstacles preventing machines from learning human preferences and values: People are often irrational and behave in ways that contradict their values, and values can change over individual lifetimes and generations. Moreover, it’s not clear whose values we should have machines try to learn.
Note: “Crowdsourced ethical judgments” can amount to something like “mob rule” or “the party line” or “social relativism”. Does anybody want to enlist in a future governed by machines that make value-based decisions by such criteria?
Many in the alignment community think the most promising path forward is a machine learning technique known as inverse reinforcement learning (IRL). With IRL, the machine is not given an objective to maximize; such “inserted” goals, alignment proponents believe, can inadvertently lead to paper clip maximizer scenarios. Instead, the machine’s task is to observe the behavior of humans and infer their preferences, goals and values.
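Note: For readers curious what IRL looks like in miniature, here is a toy sketch in Python. It is not any lab’s actual method: a simulated “human” picks snacks in proportion to how much it values them, and the observer, seeing only those choices, infers which of several candidate reward functions best explains the behavior. The snack scenario, the candidate rewards and the rationality parameter BETA are all invented for illustration.

```python
# Toy inverse reinforcement learning (IRL): infer which reward function
# a "human" is acting on, purely by watching its choices. The scenario,
# candidate rewards and BETA are invented for illustration only.
import math
import random

ACTIONS = ["apple", "cookie", "carrot"]

# Candidate hypotheses about what the human values (reward per action).
CANDIDATE_REWARDS = {
    "health-minded": {"apple": 1.0, "cookie": 0.1, "carrot": 0.8},
    "sweet tooth":   {"apple": 0.3, "cookie": 1.0, "carrot": 0.1},
    "indifferent":   {"apple": 0.5, "cookie": 0.5, "carrot": 0.5},
}

BETA = 3.0  # how consistently the human acts on its values (Boltzmann rationality)

def action_probs(reward):
    """Probability of each choice under a softmax (Boltzmann) model of the human."""
    weights = [math.exp(BETA * reward[a]) for a in ACTIONS]
    total = sum(weights)
    return {a: w / total for a, w in zip(ACTIONS, weights)}

def simulate_human(reward, n):
    """Generate n observed choices from a human with the given (hidden) reward."""
    probs = action_probs(reward)
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=n)

def infer_reward(observations):
    """Bayesian IRL in miniature: with a uniform prior, the posterior over
    candidate reward functions is proportional to the likelihood of the data."""
    log_like = {}
    for name, reward in CANDIDATE_REWARDS.items():
        probs = action_probs(reward)
        log_like[name] = sum(math.log(probs[a]) for a in observations)
    shift = max(log_like.values())  # for numerical stability
    unnorm = {name: math.exp(ll - shift) for name, ll in log_like.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}

if __name__ == "__main__":
    random.seed(0)
    observed = simulate_human(CANDIDATE_REWARDS["health-minded"], n=20)
    posterior = infer_reward(observed)
    for name, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {p:.3f}")
```

Scaling this idea up from ranking three snacks to open-ended human behavior, sequential decisions and context-dependent values is where the real difficulty lies, which is the point Mitchell turns to next.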
However, I think this underestimates the challenge. Ethical notions such as kindness and good behavior are much more complex and context-dependent than anything IRL has mastered so far. Consider the notion of “truthfulness” — a value we surely want in our AI systems. Indeed, a major problem with today’s large language models is their inability to distinguish truth from falsehood. At the same time, we may sometimes want our AI assistants, just like humans, to temper their truthfulness: to protect privacy, to avoid insulting others, or to keep someone safe, among innumerable other hard-to-articulate situations.
Moreover, I see an even more fundamental problem with the science underlying notions of AI alignment. Most discussions imagine a superintelligent AI as a machine that, while surpassing humans in all cognitive tasks, still lacks humanlike common sense and remains oddly mechanical in nature. And importantly, in keeping with Bostrom’s orthogonality thesis, the machine has achieved superintelligence without having any of its own goals or values, instead waiting for goals to be inserted by humans.
Yet could intelligence work this way? Nothing in the current science of psychology or neuroscience supports this possibility. In humans, at least, intelligence is deeply interconnected with our goals and values, as well as our sense of self and our particular social and cultural environment. The intuition that a kind of pure intelligence could be separated from these other factors has led to many failed predictions in the history of AI. From what we know, it seems much more likely that a generally intelligent AI system’s goals could not be easily inserted, but would have to develop, like ours, as a result of its own social and cultural upbringing.
Note: How could this possibly go wrong? Surely, programming an AI system to develop its own goals (for the good of humanity, of course) would lead to utopia. (Sarcasm alert!)
In his book Human Compatible, Russell argues for the urgency of research on the alignment problem: “The right time to worry about a potentially serious problem for humanity depends not just on when the problem will occur but also on how long it will take to prepare and implement a solution.” But without a better understanding of what intelligence is and how separable it is from other aspects of our lives, we cannot even define the problem, much less find a solution. Properly defining and solving the alignment problem won’t be easy; it will require us to develop a broad, scientifically based theory of intelligence.
Full article at Quanta.
The Declaration of Independence of the United States of America contains these words:
“We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.”
Basing human rights on the endowment of a transcendent Creator ensures that they cannot be taken away by changes in human opinion. Human rights may be violated, but their transcendent origin means that they still exist as an immutable reality toward which aspirations of freedom can always strive.
Let us be warned not to cede our rights and values to public opinion, or to an elite few who control the programming of moral philosophy into machines.