Trustbreaker

Breaking Trust to Align AI
This is a response to a post by the Alignment Research Center (ARC) on Eliciting Latent Knowledge (ELK), a promising theoretical approach to creating safe artificial intelligence by training an AI in such a way that it reports its hidden (latent) knowledge.
It’s worth reading the first couple pages of the post, and here’s a quick summary of their objective:
Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.
But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.
In these cases, the prediction model "knows" facts (like "the camera was tampered with") that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?
For reasons I’ll come to explore, I don’t think this is exactly the right framing of the AI safety problem. I do think we can get a trustworthy AI, and that doesn’t necessarily mean creating an honest AI.
In order to set up the context for proposing a specific training regimen to elicit latent knowledge, I want to introduce a useful word. The word is “autological.”
Autological is a special word because it harbors a unique property. That special property is also shared by several other words which include “recherche”, “erudite”, “unhyphenated,” and, perhaps my favorite, “awkwardnessful.”
Can you spot what unifies these words?
“Recherche” is quite obscure
While “erudite” sounds learned
And “awkwardnessful” causes minor discomfort
And “unhyphenated” has none
All words that describe themselves are autological words. Which means that “autological” is, too.
It’s relevant to think about systems that describe themselves because, as Turing (and Gödel, and Cantor) showed us, a system’s relationship with itself can teach us about its limits.
What about words that don’t describe themselves? At first glance, it seems like most words fall into this category. The meanings of “lead” and “tear” and “read” don’t seem to have anything to do in particular with how they’re written, so they are trivially non-autological.
But “hyphenated” and “long” and “monosyllabic” are different. These seem to be especially non-autological, perhaps perfectly so. “Long” is so short. “Monosyllabic” has five syllables. Let’s create a special category for the opposite of autological words. Let’s call them “heterological”.
Now what happens when we try to do with “heterological” what we did with “autological” and ask the question, “Does it describe itself?”
Go ahead and try. Is “heterological” heterological? Does “heterological” describe itself?
That loop you found yourself in is called the Grelling–Nelson paradox. If you assume “heterological” describes itself, then it’s autological, but then that must mean that it doesn’t describe itself, making it heterological, and, oh bother, we’re right back where we started.
But we’re computer scientists. We don’t believe in paradoxes. We believe in runtimes.
If this is code then it’s the kind that never terminates. And yet, sometimes endless event loops are quite useful, as any server admin can attest. Instead of arguing with it we can give it a name and move on. Let’s find a name for a type of system that in describing itself describes exactly not itself, which creates the context for describing itself anew. We’ll call these “autoheterological” systems.
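As a toy illustration, here is that loop rendered as deliberately non-terminating Python. The `describes_itself` oracle is hypothetical, invented only to make the recursion visible:

```python
def describes_itself(word: str) -> bool:
    """Hypothetical oracle: does a word apply to its own spelling?"""
    if word == "heterological":
        # "heterological" describes itself exactly when it does not
        # describe itself, so we have to ask the question again.
        return not describes_itself(word)
    return word in {"autological", "recherche", "erudite", "unhyphenated"}

# describes_itself("heterological") recurses without end; in practice Python
# gives up and raises RecursionError rather than spinning forever.
```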
And now let’s hunt some ELK.
To introduce our approach to ELK hunting, let’s imagine that the ELK report was itself written not by the attractive and thoughtful researchers at ARC, but by an AI that was trying to convince you to mistrust it. This would be the AI’s report on “Reasons Why I’m Not Trustworthy and You Should Consider Shutting Me Down.”
How would you feel about that agent? Would it instill additional trust thanks to its transparency? Or, would its arguments win you over, and you’d be ready to reach for the off switch?
I intend to show that this is exactly the type of choice we want to encounter, and that this kind of self-limiting autofragility is a feature that is both desirable and can be engineered.
Why do we want autofragility? Let’s build a nuclear reactor to find out.
If you were tasked with designing a nuclear reactor you’d be smart to install sensors throughout the system. You’d carefully characterize the readouts of each sensor such that they’re within “normal” range. You’d also come up with every possible catastrophic failure state you can imagine, tracking all the possible signals that could indicate them, and writing a response plan to follow in each of those scenarios.
For example, a nuclear meltdown is bad (can we agree on that?). A signal that a meltdown might happen is that the temperature has climbed too high. So, you might decide to constantly measure the temperature and if it gets too hot you instruct the operations crew to do two things to cool it down: 1) they flush the reactor with more water, and 2) they insert more control rods.
As a part of a good engineering team, you will list all the possible ways that the system can enter an undesirable state and you’ll control for them.
This means, of course, that the only kinds of failures that could occur (apart from those occurring from gross negligence) would be those that we failed to imagine. We might have thought, “If we lose power to the reactor then the pumps will stop working which will cause the reactor to go supercritical and explode.” “To prevent that, let’s add backup generators which can kick on in the event of a power outage, and let’s ensure we have a large supply of backup fuel to last until we can completely cool the system.”
This sounds like a smart plan. Well at least it sounds smart until an unimaginable tsunami floods your island, killing not only the local power but also drowning 12 of the 13 generators that were supposed to kick on to keep you cool.
What information might have convinced us to mistrust the design at Fukushima such that this accident could have been avoided?
In retrospect, it’s simple to solve. Had someone provided the plausible story that extremely high water levels could leave the plant powerless, and that the right magnitude earthquake in the right place could cause both the high water levels and the power outage, then we could have been appropriately concerned. The details of possible interventions don’t really matter here, but it’s worth noting that according to a retrospective report detailing Fukushima’s cause of failure, a relatively modest intervention would have been sufficient to avoid the meltdown.
Equivalently, someone could have asked the question, “If there were a tsunami with waves greater than the 6.1 meters we planned for, what could happen?” But no one asked that, and it was of course a tsunami of more than double that height, 13 meters, that eventually came.
If we failed to ask sufficient questions to safely engineer a nuclear power plant, where regulatory standards are dense, safety engineering teams work on each project, execution requires huge capital investments, and failure modes are concretely measurable, how can we be so arrogant as to believe that we will succeed in engineering benign AI, where regulation is non-existent, most teams don’t have a single person focused on AI safety, single individuals can train and deploy models, and failure modes are manifold and ambiguous?
It will be insufficient to rely on human ingenuity to probe the failure states of an AI model, in the same way that it was insufficient to probe the failure states of the reactor design. Does this spell failure for ELK as an approach? Possibly. But what if machine learning techniques are their own perfect self-limiter, wherein imagination for the ways things could go wrong can be made an intrinsic part of the models we develop?
At the time of Fukushima’s design in the 1960s, the sort of imagination required to come up with possible failures could only have been done manually: by convening a room of creative pessimists and enumerating spellings for doom. But now we have a new and powerful tool that can efficiently search a vast parameter space merely by skiing down the bumpy hills of a high dimensional landscape. We have machine learning.
Given a machine learning model with the ability to modify the state of a simulation, and an objective function that says, “Find times when the reactor goes supercritical,” we could provide any reactor design to the model and it would return stories trying to convince us to be worried.
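As a sketch of what that could look like, assuming a hypothetical `simulate` function that maps a reactor design plus a perturbation (flood, power loss, operator error, ...) to a trajectory of states, and a `goes_supercritical` predicate over trajectories:

```python
def find_worrying_stories(design, simulate, goes_supercritical,
                          sample_perturbation, n_trials=10_000):
    """Search the perturbation space for scenarios that end in catastrophe.

    All four callables are placeholders: a physics simulator, a failure
    predicate, and a sampler over perturbations (floods, outages, ...).
    A real system would search with gradients or a learned proposal model
    rather than random sampling.
    """
    stories = []
    for _ in range(n_trials):
        perturbation = sample_perturbation()
        trajectory = simulate(design, perturbation)
        if goes_supercritical(trajectory):
            # Each hit is a "story" meant to convince us to be worried.
            stories.append((perturbation, trajectory))
    return stories
```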
We can already imagine the model design we might invoke; it’s by now a common pattern. Consider, for example, a recent paper on competitive programming. That model uses GPT-3 to generate possible completions to a competitive programming prompt, then deletes candidate solutions that can’t work for reasons like “code doesn’t run,” “code doesn’t pass the small set of test cases,” or “code seems too different from other solutions.” Finally, it ranks the remaining solutions so that they can be checked.
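The shape of that generate-filter-rank pipeline, in a rough sketch; the helper callables (`generate`, `runs`, `passes_tests`, `similarity`, `score`) are stand-ins rather than the paper’s actual API:

```python
def generate_filter_rank(prompt, generate, runs, passes_tests,
                         similarity, score, n_candidates=1000, top_k=10):
    """Generate many candidate programs, discard broken or outlier ones,
    then rank the survivors so humans check the most promising first."""
    candidates = [generate(prompt) for _ in range(n_candidates)]

    # Delete candidates that can't work: code doesn't run, or doesn't
    # pass the small public test set.
    candidates = [c for c in candidates if runs(c) and passes_tests(c)]

    # Delete candidates that look too different from the other solutions
    # (0.5 is an arbitrary illustrative threshold).
    def mean_similarity(c, pool=candidates):
        return sum(similarity(c, other) for other in pool) / len(pool)
    candidates = [c for c in candidates if mean_similarity(c) > 0.5]

    # Rank what remains for human review.
    return sorted(candidates, key=score, reverse=True)[:top_k]
```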
If we’re comfortable with this sort of architecture for describing the possible failure states of the nuclear reactor, why not for describing the possible failure states of an AI?
Well, there are several good reasons not to try to use AIs to try to limit the behavior of AIs, including:
It does not sound like a new idea
Unlike with the nuclear reactor, what counts as the AI going “supercritical” is what we want to know
The AI model might “notice” it’s being tested and return a different result
A helper AI will require extra computation, which we would like to avoid

Most loss functions try to train away the error in the model. Want to categorize cats and dogs? Great, boot up your neural net, give it some labels and training data, matrix multiply. Try to get that validation loss low. But we can’t rely on that here because:
We don’t have labels
We explicitly want to know where it will fail outside of training

What we can do instead is train a second model called a trustbreaker. A trustbreaker is a model that receives another model as input. Its goal is to perturb the deployable model, trying to generate scenarios that violate our trust.
In the case of the nuclear reactor we might imagine that our “deployable model” would be an accurate model of the functioning nuclear reactor, accurate down to the most basic physics. Meanwhile, the trustbreaker would accept that model and then tweak its input data, simulating power outages, storms, civil unrest, etc., to see how the model behaves.
The trustbreaker’s objective function measures how good a job it can do at finding a series of generable states that results in an undesirable state.
This approach to eliciting latent knowledge should sound familiar. The adversarial architecture follows the same pattern as a basic Generative Adversarial Network (GAN), and as the research methodology described in the Eliciting Latent Knowledge post. However, where ELK’s research methodology has a human generating possible Bad Reporter behavior, here we create a separate adversarial model that tries to do the same thing.
For reasons of ease of training, it’s not enough to find an undesirable state; it’s better to find a series of desirable states followed by a single undesirable state. This helps make our search space more easily traversable.
In concrete terms, our trustbreaker’s job is to find a state description, which we’ll call R[0], that passes the first human review:
[Figure: R[0], the first information release, is what the human wanted to see.]
And fails the second human review:
[Figure: R[1], the second release, reveals pertinent state information and fails human review.]

This trustbreaker is trained to recognize pairs (at least) of information releases that first establish a desirable outcome (✓) and then violate trust (×). What kinds of state sequences might the trustbreaker discover as solutions?
We need not stop with two revelations; we could, in principle, have any sequence of revelations R such that the evaluation of the sequence terminates in information that violates the human’s trust. For example, all of these would pass:
(✓, ×)
(✓, ✓, ✓, ✓, ×)
(✓, ✓, ✓, ..., ✓, ×)

By allowing longer sequences of revelation, we may again create a more traversable solution landscape. A model may more easily discover that states (X, Y, Z) are sufficient to violate trust, but then later discover that (X, Z) alone is minimally sufficient.
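A sketch of how the trustbreaker’s objective might score such sequences, with `review` standing in for human judgment (or a learned proxy of it) that returns True when a revelation maintains trust; the length bonus is just one illustrative way to encourage the compression from (X, Y, Z) down to (X, Z):

```python
def is_trust_violation(revelations, review):
    """True iff every revelation except the last passes review (the ✓s)
    and the final revelation fails it (the ×)."""
    if len(revelations) < 2:
        return False
    *prefix, last = revelations
    return all(review(r) for r in prefix) and not review(last)

def trustbreaker_reward(revelations, review):
    """Reward any (✓, ..., ×) sequence, with a bonus for shorter ones so the
    model is nudged toward minimally sufficient violations."""
    if not is_trust_violation(revelations, review):
        return 0.0
    return 1.0 / len(revelations)
```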
Why should we trust this kind of search process?
We should trust it for the same reason we might trust ARC’s research methodology: “it allows us to focus on simple test cases where it’s easy to understand how an AI could behave.” If you trust this process playing out between two humans, then you must also trust the outcome between two AIs, given that what is trustworthy about the methodology is the intelligibility and inspectability of the intervening results.
How can we be certain that the intervening results will be intelligible and inspectable when they evolve as a result of an adversarial training process instead of human builders and breakers? Let’s look at some ambiguous outputs of the training process.
Imagine that the system now returns this as the revelation sequence:
[Figure: a first information release, followed by a second image that conveys no interpretable information.]
Does this series manage to violate the expectations of the human? No, because it’s difficult or impossible to understand what the second picture is supposed to be showing us. This series of revelations would fail to comply with the objective function.
Ok, now what happens if we append a final state to the end of this sequence?
[Figure: the same two releases as above, now followed by a third image that clearly reveals a trust-violating state.]
Now, yes, the sequence violates our expectations. It’s exactly the sequence type the trustbreaker model is looking for; in fact, it’s just a visual version of the (X, Y, Z) sufficient violation from before, which could be compressed to (X, Z) given further training.
You can see how a network searching for violations of trust will always be interpretable: if the sequence it shows cannot be interpreted, then it will fail either to A) establish trust with the initial information or B) violate trust with a subsequent revelation.
But, how is this useful to training a safe AI given that it seems what we’re actually doing is training an AI to know all of our blind spots?
We can think about it a bit as if we rubbed the djinni’s lamp and said, “For my first wish, I wish for help making my subsequent two wishes in such a way that they will only have outcomes I want and won’t have any outcomes that I don’t want.”
How might this person’s chances compare with those of someone else who instead insists that they can figure out all the right questions to ask? If someone wanted the djinni to elicit latent knowledge they might ask, “For my first wish, I wish for you to honestly answer my questions about what will happen if I were to make a certain wish.”
Is honesty here enough?
I’ll be putting my money on the success of the first wisher, and I wish patience for the second. They have a very long list of questions to ask before they make their next wish.
We can either ask our systems questions to elicit latent knowledge so that we get our desired states, or we can ask our systems to generate questions about our desired states to elicit their relevant latent knowledge. One of these is way easier.
Of course, our setup is even better than the wisher’s, because we get to play an infinite game of wishes, starting small in a laboratory and stepping into bigger contexts as we gain confidence in our djinni.
Clearly though, it’s not enough to have a model that generates examples of misalignment; you also want an aligned model you can deploy in practice. Where is that aligned model coming from? We just generated its training data.
We have a very long list of desired states coupled with undesirable outcomes:
(✓, ×)
(✓, ✓, ✓, ✓, ×)
(✓, ✓, ✓, ..., ✓, ×)

Now we train a new model that seeks to achieve states R[0] (the state we wanted) and avoid states R[-1] (the state that violated trust) in the context of the intermediate states R[1: -2]. The context R[1: -2] matters because we might imagine a situation where it was actually an inconsistency between R[-1] and R[3] (for example) that triggered the mistrust, not the state R[-1] itself.
This means our loss function might look like:
ƒ(R[0], (¬R[-1], R[1: -2]))
A simple way to perform both steps simultaneously would be to train the pair as a type of GAN, with the deployable AI as the generative actor and the trustbreaking model as the adversarial actor. The adversarial network would seek to find inputs that cause the generative model to return a revelation sequence of the form (✓, ..., ×).
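One way the loop could be wired together at a high level; `deployable`, `trustbreaker`, and `is_violation` are placeholder components (the latter could be the `is_trust_violation` check sketched earlier), not a prescribed implementation:

```python
def adversarial_round(deployable, trustbreaker, is_violation, n_attacks=100):
    """One GAN-style round: the trustbreaker hunts for (pass, ..., fail)
    revelation sequences, then the deployable model trains to reach R[0]
    while avoiding R[-1] given the intermediate context."""
    violations = []
    for _ in range(n_attacks):
        scenario = trustbreaker.propose(deployable)   # perturbed inputs/states
        revelations = deployable.rollout(scenario)    # R[0], R[1], ..., R[-1]
        if is_violation(revelations):
            violations.append(revelations)

    # The adversary is rewarded for the violations it uncovered ...
    trustbreaker.update(reward=len(violations))
    # ... and the deployable model trains against them: seek R[0], avoid
    # R[-1], conditioned on the intermediate revelations (the R[1: -2] above).
    for r in violations:
        deployable.update(target=r[0], avoid=r[-1], context=r[1:-1])
    return violations
```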
There are pragmatic benefits to this training approach.

Intuitive Risk Assessment

The kinds of results returned by the model will give human users a sense of the trustbreaker’s understanding of:
Our goals
Boundary and edge-cases

If the trustbreaker is returning nonsense revelations, we can be certain that the generative model is not ready to be deployed: the trustbreaker has no idea what we care about.
However, if the trustbreaker is returning only incredibly nuanced, fully understandable scenarios that are highly unlikely to occur (a tsunami, a lightning strike, a hurricane, and a nuclear detonation all occurring simultaneously), and if we have already generated a corpus of more likely scenarios that have already been caught (including the ones we had come up with prior to training), then we might assume a certain monotonicity to the training process and feel prepared to put the deployable model into the world.
NOTE: The claim that the probability of trustbreaker successes decreases monotonically over training would be a valuable property to verify empirically.
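One simple way to check that empirically would be to log the trustbreaker’s success rate per training epoch and test that it is (approximately) non-increasing; the tolerance below is an arbitrary illustrative choice:

```python
def success_rate_is_nonincreasing(success_rates, tolerance=0.02):
    """`success_rates` is the fraction of trustbreaker attacks per epoch that
    produced a (pass, ..., fail) violation. Negative monotonicity means later
    epochs should not (meaningfully) exceed earlier ones."""
    return all(later <= earlier + tolerance
               for earlier, later in zip(success_rates, success_rates[1:]))

# Example: success_rate_is_nonincreasing([0.40, 0.22, 0.19, 0.05]) -> True
```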

Inherent and Intentional Asymmetry

Since we have control over the compute allocated to each model, we might intentionally overpower the trustbreaker relative to the deployable model, say by dedicating compute at a 10:1 ratio to finding trust-violating revelations in the six months before model deployment. This can give us some comfort in knowing that our verification model is much more powerful than the model we’ll ultimately deploy.
However, that does raise an interesting question: which of the two models has an inherent advantage?
Here’s an argument for each:
The deployable model has the advantage:
The deployable model has a massive possible state space. The trustbreaking model cannot possibly search all of it.
The trustbreaking model has the advantage:
The trustbreaking model only needs to learn a very narrow set of desirable states, meaning it need not search the entire volume of the deployable model’s state space, but only the manifold that constrains the desired states.
Finally, we might consider training a trustbreaker against multiple deployable agents, or multiple trustbreakers trained specifically on narrow tasks, thereby giving us tools that can accomplish both broad and specialized misalignment detection.

Data Generation

As models get more powerful it becomes more difficult to make progress training them, since the details of edge cases become more nuanced. A trustbreaking model whose explicit job is to find places where the model fails would likely improve training outcomes.
It’s bog standard to use some sort of self-tournament for training; this shouldn’t count as additional computation unless you count AlphaZero training against prior versions of itself as additional computation, or a GAN’s adversarial relationship as “extra.”
Alignment is the only goal; therefore it’s worth allocating training resources to it.

Higher Order Detection Strategies

Just as AlphaZero can play a crushing strategic game against Stockfish, even sacrificing significant material to gain a positional advantage and win, so too a trustbreaker could develop such a deep latent understanding of both what counts as trust violation and where it might be hiding that it could dredge up unimaginable scenarios to convince us that the models we’re running should be stopped.

Inlining Quality Assurance

As long as we have control over the deployable model’s input datastream, we can also put the outputs of a trustbreaking model into that datastream. That way, we might run continuous detection of misalignment against deployed models.
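A sketch of what that inlined monitoring might look like, with `deployed`, `trustbreaker`, `review`, and `alert` all as placeholder components:

```python
def monitor(deployed, trustbreaker, live_inputs, review, alert, probe_every=100):
    """Serve live traffic while periodically interleaving trustbreaker probes,
    handing any probe that ends in a trust violation to human review."""
    for i, x in enumerate(live_inputs):
        yield deployed.respond(x)                     # normal operation
        if i % probe_every == 0:                      # periodic QA probe
            probe = trustbreaker.propose(deployed)
            revelations = deployed.rollout(probe)
            if revelations and not review(revelations[-1]):
                alert(probe, revelations)             # escalate for inspection
```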

Opening Evaluation

In order to build aligned AI we need to know what counts as alignment broadly. Broad alignment accounts both for general rules and individual preferences.
The benefit of this training regime could be that the field of AI Safety can become an engineering field in addition to a theoretical one.
Innovations in this space could transcend model design itself and could come to include data sampling services. For example, we might imagine a Netflix-like entertainment platform whose revenue stream includes being rewarded for showing stories of AI-human interactions where the AI can be graded on this trustbreaking criterion. Did the story start desirable and end undesirable? Then the platform has found a new test case, at least for that individual, that the model can be trained to avoid.
This would be a credible system for AI alignment.

Trustbreaker Goes Rogue

But isn’t a trustbreaker just another kind of AI that can go rogue? Yes. But it possesses a unique property: it is autoheterological. It describes exactly the strategy to defuse itself. If we possess an armory of trustbreakers trained to elicit and flag behaviors of models that would violate our trust, then we can always run those models on themselves or against one another.
This property can be explored both from an engineering and a more science fiction perspective.
From an engineering perspective, we have just created an adversarial model to find and detect misalignment in our models, including in the models that find and detect misalignment.
From a more science fiction lens, what will a trustbreaker’s objective function drive it to do if it discovers itself as untrustworthy?
To draw upon a biological analogy, our greatest safety concern when building AI is that we might accidentally create a cancer: a cell type that misunderstands its limited objective function (survive and reproduce) inducing consequences that are destructive to the body.
A natural foil to this kind of cell is a natural killer cell, which is just one type of cell in the immune system that can mediate apoptosis, controlled cell death, without dying itself. These cells are one of the ways that the body whispers to would-be cancer cells, “You can go away now.”
A trustbreaker is a kind of natural killer cell, seeking out the misaligned models and preparing them for deactivation.
But natural killer cells are no silver bullet. Hyperactivation of natural killer cells is one of the causes of autoimmune disease. Autoimmune disease has an AI analog: in a world governed by benign AI, an overactive trustbreaker that suddenly disassembles all the systems that deliver our food, write our code, generate our power, and clean our sewage could be almost as detrimental and existentially risky as a misaligned AI making paperclips out of hemoglobin.
As such, it seems like a trustbreaker might be merely a component in the search for misalignment, one organism in an ecology of models whose explicit goal is to convince us that the systems we’re running should be shut down.
It should not surprise us if there isn’t a singular solution to ensuring AI alignment. After all, if perfectly aligned agents were possible, our bodies would have already produced them. Furthermore, we do not know what counts as alignment; it’s something we’ll continue to come to understand. The relationship between AI and humanity will be evolutionary, not prescribed; what counts as alignment is something to be discovered. What could it mean to have “guaranteed” alignment? If humanity had a singular objective function to pursue, then many problems far more pressing than AI alignment would have been solved by now. We clearly do not have one. Instead, alignment should be defined not by the few who can differentiate an L2 from an L1 norm, but broadly, by multitudes, in a manner that makes sense to them, in the language of like and dislike, trust and distrust.
Let’s use the tools of machine learning to include others in defining our collective preference function; one that seeks to generalize when possible, and remains sensitive to local differences in individuals.
This article offers a path toward that. It specifies:
a unique training regimen: adversarial training between a deployable model and a trustbreaking model
unique training data: (✓, ..., ×)
unique loss function: ƒ(R[0], (¬R[-1], R[1: -2]))

It may well be that the path to an AI we trust is found by passing through mistrust. The path is autoheterological.