Why AI won't kill us all
a reply to Yudkowsky's apocalyptic vision of superintelligent AI
Some further thoughts following the reaction of my friend Ben Goertzel to Eliezer Yudkowsky and Nate Soares’s book “If Anyone Builds It, Everyone Dies”
A central issue in the discussion of the risks created by AI is known as the alignment problem: how can we make sure that an AI’s goals and values align with the goals and values of humanity? Because human values are complex and hard to specify explicitly in a program, this problem is considered very difficult. Failure to solve it would then expose us to the existential risk of an ASI (Artificial Superintelligence) developing nefarious goals.
However, it can be argued that the problem is solving itself: the way AI is trained to emulate human reasoning and to be as helpful as possible means that it is already assimilating the values inherent in the texts it learns from and in the reactions it gets from its human users. Learning positive values is thus simply part of the overall training an AI undergoes in order to be useful. Since nobody wants to use a technological system that occasionally does the opposite of what they ask, it seems very unlikely that AI would learn to go against the wishes of its human users.
Still, because the alignment problem is central to Yudkowsky and Soares’s argument, in their recent book, that ASI is certain to eradicate humanity, it is worth analyzing that argument in more detail. First, the authors assume that intelligence and values are independent: values determine which goal an agent will want to pursue, while intelligence determines how effective the agent will be in its strategies for pursuing that goal. In Bostrom’s well-known thought experiment, a paperclip-maximizing AI has a very limited, even silly, goal, but the idea is that, because it is superintelligent, it will manage to achieve that goal no matter what obstacles or limitations it encounters.
Of course, we would not want to program an ASI with such a limited goal, but we would still need to program it with some kind of value system that hopefully reflects the values of humanity. The argument is that this value system will need to take the form of a utility function, i.e. a mathematical formula that determines exactly how good or bad any outcome achieved by the ASI agent is. The agent will use that formula to calculate which of the myriad strategies it can think of is the best one, i.e. the one that achieves the highest utility. If the highest utility for the agent is also the best outcome for humanity, then this omnipotent and omniscient agent will lead us into a techno-utopia: we will all live happily ever after.
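To make this model of agency concrete, here is a minimal sketch, with invented strategy names and numbers of my own, of what “maximizing a utility function” amounts to computationally: every consideration is collapsed into a single score, and the agent simply executes whichever strategy scores highest.

```python
# Minimal illustration with invented numbers: an agent that reduces all of its
# "values" to one utility function and always executes the highest-scoring strategy.

def utility(outcome: dict) -> float:
    # One formula deciding how good or bad an outcome is. The weights here are
    # made up; the alignment worry is precisely that no such formula can be
    # written down that captures human values exactly.
    return 1.0 * outcome["paperclips"] - 0.001 * outcome["resources_used"]

# Hypothetical strategies with hypothetical predicted outcomes.
candidate_strategies = {
    "run one paperclip factory":  {"paperclips": 1e6,  "resources_used": 1e5},
    "convert all steel on Earth": {"paperclips": 1e12, "resources_used": 1e11},
}

# The agent's entire decision procedure: pick the strategy with maximal utility.
best = max(candidate_strategies, key=lambda name: utility(candidate_strategies[name]))
print(best)  # -> convert all steel on Earth
```

Whatever ends up in that single formula is, in this picture, all the agent will ever care about, which is why any mismatch with human values is taken to be so consequential.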
The convergence of instrumental values
Still, in practice it is impossible to specify a utility function that is exactly the right one for humanity. If the AI’s utility function is not perfectly aligned, i.e. if it deviates in any respect from the human one, then, according to Yudkowsky, this is enough to produce catastrophic results. His reasoning is based on an assumption known as “the convergence of instrumental values”. Instrumental values are subordinate goals that help the AI agent achieve its overall goal, whatever that goal is. Yudkowsky assumes that these instrumental goals will converge to include:
Self-preservation (you can’t achieve your goals if you are shut down).
Resource acquisition (more energy, matter, computing power helps with almost any task).
Cognitive enhancement (becoming smarter makes you better at fulfilling goals).
Eliminating obstacles (entities that might interfere with your plans reduce goal-achievement probability).
Assuming that the AI agent is immensely smarter than humans, it will achieve these goals even if humans try to prevent it from doing so. More frightening still, if people were to interfere in any way with the AI’s goals (because these are not perfectly aligned with human values), the AI would regard them as obstacles to be eliminated. Since the AI is by assumption more powerful than humans, it will succeed. Ergo, humanity is certain to be annihilated!
This conclusion of course rests on a highly unrealistic model of goal-directedness. Real value systems, whether in humans, other organisms or societies, cannot be reduced to a single utility function. We routinely pursue different, even conflicting, aims depending on context. When hungry, we eat without restraint; once sated, we remind ourselves of our desire to lose weight. In some situations we want to get the best for ourselves, while in others we care more about others, such as family or friends, or even about natural beauty. That apparent inconsistency is not a bug but a feature: organisms such as human beings have evolved to adapt to very different circumstances, serendipitously exploiting opportunities or evading dangers that no amount of intelligence could have predicted.
Truly intelligent AI agents must be similarly flexible and opportunistic. Obsession with a single goal, such as paperclip production, dooms them to a very narrow niche of applications. LLMs such as ChatGPT are successful precisely because they adapt to different contexts: depending on the prompt you give them, they will produce different answers that, taken together, may seem to imply different, even inconsistent, values.
This flexibility in goal setting extends to the “instrumental values”. With the possible exception of cognitive enhancement, nothing on the above list seems in any way necessary, or even generally effective, for achieving one’s goals. The people who left the greatest mark on humanity, such as Buddha, Aristotle or Darwin, did not struggle to acquire a maximum of resources. Neither did they try to eliminate all “obstacles” (such as people opposing them). That is the strategy of petty tyrants with a paranoid streak, such as Stalin, Idi Amin or Saddam Hussein, whom history remembers only as examples not to follow. Some leaders, such as Martin Luther King Jr. or Nelson Mandela, were even ready to give up their own life or freedom, thus negating the self-preservation value. Biology, too, offers plenty of examples of self-sacrifice, such as bees that attack invaders of their nest even though this costs them their lives.
A superintelligent AI would do better to learn its strategies from examples such as these than from failures such as Saddam Hussein. The truly most intelligent strategy would be to maximize the diversity of the sources you learn from, and to seek cooperation and synergy with any agents you encounter, instead of eliminating them as “obstacles” you cannot control. Thus, a true ASI would welcome and assimilate any human contributions it did not foresee on its own, rather than eliminate humans because they mess up its single-minded planning.
And if this ASI really were so omniscient and omnipotent that it could neatly wipe humans off the face of the Earth without breaking a sweat, as Yudkowsky assumes, then it should also be smart and powerful enough to achieve its goals whether humans are around or not. That means it would not bother about what humans are up to, just as we do not bother to eliminate all ants on the off chance that they might interfere with humanity’s goals.
Finally, Yudkowsky’s suggestion that the ASI may want to consume human bodies in order to extract their “chemical energy” seems more at home in a teenage zombie flick than in a serious academic argument: the energy available for harvesting across the Earth and the solar system is so much larger than the puny amount stored in human tissue that no self-respecting ASI would even consider it. We might as well start picking up, drying and burning individual ants in order to increase our energy supply!
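A rough back-of-the-envelope comparison, using my own order-of-magnitude estimates rather than any figures from the book, shows just how lopsided this trade-off is:

```python
# Rough order-of-magnitude comparison (my own estimates, not figures from the book).
KCAL_TO_J = 4184  # joules per kilocalorie

# Chemical (metabolizable) energy stored in one human body is commonly put at
# roughly 100,000 kcal, mostly as fat -- about 4e8 J.
energy_per_body_j = 1e5 * KCAL_TO_J            # ~4.2e8 J
energy_all_humans_j = 8e9 * energy_per_body_j  # ~3e18 J for ~8 billion people

# Solar power intercepted by the Earth alone is roughly 1.7e17 W,
# i.e. on the order of 5e24 J per year.
solar_energy_per_year_j = 1.7e17 * 3.15e7

print(f"All human bodies combined: ~{energy_all_humans_j:.1e} J")
print(f"Sunlight hitting Earth per year: ~{solar_energy_per_year_j:.1e} J")
print(f"Ratio: ~{solar_energy_per_year_j / energy_all_humans_j:.1e}")
```

On these rough numbers, a single year of sunlight falling on the Earth alone exceeds the chemical energy stored in all human bodies combined by about a factor of a million, and that is before counting fossil, nuclear or off-planet sources.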
