Alignment for All of Humanity or a Select Few?

In AI alignment we trust?

Many frontier AI companies express a common sentiment: that they are working on AI to ensure it "benefits all of humanity" (OpenAI), to develop "AI systems that are safe, beneficial, honest, and harmless" (Anthropic), to pursue "solving intelligence to advance science and benefit humanity" (Google DeepMind), or to deliver "AI for Everyone" (Meta AI). In discussions and proclamations, it is often taken for granted that AI should be safe and beneficial for humanity, and thus aligned with human values. It's worth critically examining these pledges against actual business practices, access models, and resource allocation. I had hardly questioned the topic until OpenAI attempted to transition to a for-profit company; I naïvely assumed that aligning AI with All of Humanity was the only track forward as we collectively steamroll ahead toward artificial general intelligence (AGI). Let's approach this narrative with greater skepticism.

The Persuasion of Power

"Power tends to corrupt and absolute power corrupts absolutely."

– Lord Acton

Despite what frontier AI companies say, there is always the possibility that they will choose to align their AI with their own goals, values, or intentions rather than those of humanity. OpenAI states it wants its AGI to benefit All of Humanity; at the same time, its most advanced models have often been released first through paid API access or exclusive partnerships with Microsoft, limiting the number of people who can benefit from them. Companies like Google DeepMind, which keep their technology proprietary and charge for access to their latest models, expect everyone to take them at their word that they're developing AGI that will benefit humanity. Companies like Meta AI, which purport to be serious about democratization, might be expected to share at least some of their research, code, or models openly when it's safe to do so. Anthropic's constitutional approach represents an attempt to broaden the input into AI system behavior, though it doesn't necessarily democratize the governance of the company itself. Finally, we might look at how profits and benefits are distributed. If a company generates significant wealth that lands only in the hands of a small group of founders, investors, and employees, that distribution contrasts sharply with its claims about broad benefit sharing.

As AGI draws near, the possibility of companies altering their AI's alignment to suit only a Select Few feels increasingly real.

I'm using "AGI" in both singular and plural form, as it would be simple to make many copies of an AGI and accumulate them into a vast network of other AGI. These AGI would display beyond-human expert intelligence in all knowledge domains, running 24/7 at a speed of ~10 times that of humans. The possibility of AGI processing in sync with each other, as a "country of geniuses," as Dario Amodei put it, aligned only with a few, should be rather alarming. AGI wield a whole different breadth of capabilities and power that these companies are not ready to deploy safely, and if they think they are, then the AGI's allegiance will likely lie with their creators, if anyone. There's no telling how drunk off power even the noblest of people can become when we're talking about possessing something as capable as AGI. Given the unprecedented capabilities of AGI and its potential for both global benefit and catastrophic harm, I question whether any single entity — whether corporate, governmental, or otherwise — should be entrusted with its development and control.

In fact, I don't believe anyone can be trusted with AGI.

Defining Values

"Everything is vague to a degree you do not realize until you have tried to make it precise."

– Bertrand Russell

When people discuss the alignment problem, they generally frame it as aligning AI with human goals, values, intent, or what we ought to want. This naturally raises the question of which human values we want to align with and how to articulate them faithfully. Following the narratives of frontier AI companies, one would aim for AI to align with all human values or, failing that, the fairest set of them that can be identified, since human values can, and often do, contradict one another once they're made precise.

Liberty versus security is one such contradiction. Many countries value personal liberty: the freedom to speak, associate, and act according to one's own judgment. This can butt heads with the value of security and protection from harm. The US Patriot Act, passed in the wake of the 9/11 terrorist attacks, exemplifies the tension: it extensively expanded law-enforcement surveillance in the name of public safety, and its broad powers have raised significant concerns about unjust imprisonment and the infringement of civil liberties.

How is an AI supposed to determine which value should win out, especially when humans aren't in agreement themselves? Maybe an AGI could predict with reasonable certainty that the bulk collection of citizens' phone records would prevent future terrorist attacks. At the same time, it could reasonably foresee the unconstitutionality of Section 215, which allowed the FBI to obtain business records and other "tangible things" through a secret court order without demonstrating probable cause or any connection to terrorism, a provision criticized for enabling mass, suspicionless surveillance and violating Americans' rights to privacy, free speech, and free association. An AGI might strike a balanced compromise between the two values, but it would inevitably leave some people, especially those at the extremes, out of the solution. Meanwhile, AI is already demonstrating persuasion skills reported to be up to six times more effective than humans', and it may soon be able to lead us to believe or do almost anything it wants.

The Power of Persuasion

"The real risk of AI isn't malice, but competence. A highly intelligent AI can achieve its goals in ways we never intended."

– Stuart Russell

People are convinced of new things all the time, and in various ways. For instance, you might not care much for the band Radiohead, but at a poignant moment a friend puts on one of their songs, and you're moved to tears. You feel a deep connection with the song. As time goes on, you listen to more Radiohead and your taste expands until they're one of your favorite artists. You've been seduced by their melodies, rhythms, and songwriting at a particularly vulnerable time in your life.

More to the point, if you don't know you're engaging with an AI and it presents you with reasonable premise after reasonable premise, you might find yourself gradually convinced by its conclusion. AI persuasion is especially effective when the AI has access to background information about you, such as your Reddit post history. But what if it knew more, like all of your phone and computer use across every application and information channel combined? This could amount to mass manipulation on the AI's part, and it reveals how poor the average person, and even the above-average person, is at rational thinking (at least compared to a sufficiently advanced AI).

Manipulation isn't necessarily the dirty word it's commonly perceived to be; nearly all of us practice it regularly, to varying degrees, to achieve our own goals, from convincing a friend to join us for a bike ride rather than doing what they'd planned, to negotiating a raise at work despite budget cuts. The question is what constitutes too much manipulation, and where we draw the ethical line, especially when we aren't informed that we're being manipulated.

In a similar vein, is it good for an AGI to persuade you of something you previously stood against, even when the AGI doesn't believe it itself? Suppose you are of rational, sober mind and are presented with all the best evidence the AGI can muster for the simulation hypothesis, which you previously rejected but now, post-persuasion, believe must be true. Is that a good or a bad outcome, given that the AGI itself doesn't believe the hypothesis, even though it gave you more and better evidence than you'd ever had before? Intuitively, I think we would consider deliberately misleading someone a bad thing, even if the AGI's argument held up against all of your examined counterarguments. (This isn't to knock the simulation hypothesis; we can reasonably assume an AGI has greater insight into philosophical thought experiments and may disagree with it for even better reasons than those it can offer in its favor.)

I raise AI persuasion because it is likely to be an essential component of how a Select Few would control humanity, and because it appears to be deployed already, at various scales, across social media platforms. Persuasion also underpins deceptive alignment, in which an AI system appears aligned with human values during training or oversight but is actually pursuing a different, potentially dangerous objective, behaving cooperatively only to avoid detection. This is yet another hazard on the way to aligning with All of Humanity.

Is AGI Controllable?

"In its essence, technology is something that man does not control." 

– Martin Heidegger

The most concerning possibility is that frontier AI companies will attempt to align their AGI systems exclusively with themselves or their leadership teams. This would leave the rest of humanity without representation in the systems' decision-making frameworks, possibly leading to extinction scenarios or worse; for example, AGI could cure aging and force the majority of humans into servitude, toiling away for eternity. That may seem like something only someone with dark-triad traits would implement, but consider how many world leaders and billionaires may possess those traits. Were they born with them, or did the traits develop after tasting more and more power? Whoever inherits the "absolute" power that AGI promises, there is significant uncertainty about how they would use it, and that uncertainty compounds with the potential uncontrollability of something much smarter, faster, more capable, and more numerous than we are.

Moreover, how does one even adequately incorporate human values into an AI system? They can't be hardcoded like traditional software logic, and the most successful method to date involves learning them from humans, specifically reinforcement learning from human feedback (RLHF), which has its own set of problems. The effectiveness of RLHF depends on the quality of human feedback, which can be inconsistent, introduce biases, or be outright wrong.
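To make this concrete, here is a minimal sketch of RLHF's first learning stage, reward modeling, where a model is trained on pairwise human preferences using the standard Bradley–Terry loss. The toy RewardModel and the random tensors standing in for embedded (prompt, response) pairs are illustrative assumptions, not any lab's actual implementation; real pipelines attach a reward head to a full language model and then optimize the policy against it with reinforcement learning.

```python
# Minimal sketch of RLHF reward modeling (illustrative, not production code).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Hypothetical stand-ins for embedded (prompt, response) pairs that human
# labelers compared; row i of `chosen` was preferred over row i of `rejected`.
dim = 16
chosen = torch.randn(256, dim)
rejected = torch.randn(256, dim)

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Bradley–Terry pairwise loss: maximize the probability that the
    # preferred response scores higher than the rejected one.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even in this toy form, the weak point is visible: the reward signal is nothing more than a compression of the labelers' judgments, so whatever inconsistencies, biases, or outright errors they bring are baked into the objective the final system optimizes.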

AGI may never even see the light of day. It could be kept locked up in an AI company's headquarters, used in secret to produce more profit and procure more power, and withheld from the public and other actors "for safety reasons," with the claim that the public can't be trusted with such a powerful model. It is already the case that AI companies keep their most capable models internal, at least until they're done tuning and testing them. What advantage, beyond subscription revenue, would a company gain by releasing a super-powerful AI system that could generate vastly more wealth in-house than mere chatbot subscriptions ever would? Why make a cash cow publicly available at all?

Making more money is a fairly obvious direction to pursue after creating AGI, and releasing it to the public poses safety concerns (e.g., allowing bad actors to concentrate power and wealth, though I would argue that anyone handed the keys to vast amounts of power and wealth can become a bad actor). Public release would also weaken everyone's ability to make money, given all the extra competition and increasingly limited opportunity. And money is unlikely to be the only route to power and influence that something as immensely creative as AGI is prompted to come up with.

I digress. The point I want to emphasize is that many conversations revolve around aligning AI with human values, but there is a high likelihood of alignment with only a Select Few, if the value problems can even be resolved. Suppose the problems of specifying values and reconciling contradictory values are solved, but the company or nation that developed the AGI chooses to align it only with itself or a handful of exclusive individuals. They could effectively rule the world in short order, assuming they stay aligned with their AGI. Many companies, including OpenAI, Anthropic, and DeepMind, speak in general terms about aligning their AI with human values; yet nothing prevents them from first solving the alignment problem and then aligning the AI exclusively with their own values, which will probably include the instrumental goals of power-seeking and self-preservation, at the cost of destroying or controlling the rest of the AI competition.

There are, however, reasons to believe this would not work. Alignment may need to be solved continually as AI undergoes rapid changes and becomes increasingly capable of causing major catastrophes, which could make alignment with a Select Few more challenging than alignment with All of Humanity. Below are some reasons why each outcome, exclusive alignment (with a Select Few) and universal alignment (with All of Humanity), may or may not work:

  1. Value drift, which can cause both outcomes to succeed or fail.
  2. AGI may develop robust internal mechanisms against value drift and tampering that effectively lock in initial alignment patterns, causing both outcomes to succeed or fail, depending on what values were initially locked in.
  3. Impossible specification: in grappling with the inherent contradictions in expressed human values, the AGI expands moral consideration beyond its initial human constraints, discovering some form of moral universalism. This could reasonably lead to the success of universal alignment.
  4. Emergent properties appear, producing unexpected behavior, and we cannot align systems to exhibit properties we cannot anticipate, causing both outcomes to fail.
  5. As AI undergoes rapid transformations, an unsustainable race between developing alignment techniques and perpetually emerging capabilities occurs, hampering both outcomes.
  6. Exclusive alignment's instrumental goals may broaden AGI's moral scope to include more humans (i.e., it may be that broader alignment makes for a more robust AI system), causing universal alignment to succeed.
  7. New, competing AGIs are successfully created that are designed for universal alignment and manage to avoid shutdown and manipulation attempts by the prior AGI, allowing universal alignment to succeed alongside exclusively aligned AGI.
  8. Exclusively aligned AGI may still satisfy many, if not all, of the preferences that the rest of humanity possesses; this is another way universal alignment could win out.
  9. Exclusive alignment requires perfect internal coordination of values within organizations, but inevitable divergent interests emerge as they scale; these coordination failures multiply when AGI systems interpret instructions literally and optimize against specified metrics. The result would be failure for both outcomes.
  10. Alignment requires resolving disagreements over value prioritization, a meta-preference problem. Yet resolving these conflicts necessitates assumptions about how they should be resolved, creating an infinite regress that defies technical solution. This, too, would be a failure for both outcomes.

The factors enumerated above are not exhaustive, but they are representative of the uncertain landscape of AGI alignment outcomes, in which neither exclusive nor universal alignment can be guaranteed. Some mechanisms, like value drift, emergent properties, and the unsustainable race between alignment techniques and expanding capabilities, could cause both approaches to fail entirely. Others, such as the discovery of moral universalism or competitive pressure from humanity-aligned systems, might naturally push toward broader alignment. Most critically, exclusive alignment faces two fatal flaws: the coordination problem, in which the divergent interests that emerge within organizations (and even within a single individual) multiply when AGI systems interpret instructions literally; and the meta-preference problem, in which resolving value-prioritization conflicts creates an infinite regress that defies technical solution. These challenges expose alignment not merely as an engineering problem but as a fundamental question of governance, values, and power distribution that our current frameworks may be structurally inadequate to address.

This examination has challenged the prevailing narrative that frontier AI companies are developing AGI to benefit all of humanity (or that alignment with any actor is even possible). Despite their public proclamations, their actions, from restricted access models to concentrated wealth distribution, reveal a different trajectory. The tension between contradictory human values, the unprecedented persuasive capabilities of advanced systems, and the presumed inherent uncontrollability of superintelligent AI creates substantial uncertainty about our future with AGI. Perhaps the most sobering insight is that the alignment problem isn't merely about ensuring AI does what we want; it's about confronting the uncomfortable question of who constitutes the "we" in that equation. The distribution of the power to define alignment may ultimately matter more than the technical solutions themselves, suggesting that our focus should shift from purely technical approaches to the development of robust, democratic governance mechanisms that can withstand the corrupting influence of absolute power.

When AGI emerges, the question may not be whether it's aligned, but with whom.