Breaking Trust to Elicit Latent Knowledge
This is a response to a post by the Alignment Research Center (ARC) on a promising theoretical approach to creating safe artificial intelligence: training an AI in such a way that its hidden (latent) knowledge can be elicited.

Before I offer a proposal to elicit latent knowledge, I want to introduce a word.

The word is “autological.” Autological is a special word because it harbors a unique property. That property is also shared by other words like “recherche”, “erudite”, “unhyphenated,” and, perhaps my favorite, “awkwardnessful.”

See, “recherche” is quite obscure
While “erudite” sounds learned
And “awkwardnessful” causes minor discomfort
And “unhyphenated” has none
Just as all these, the word “autological” describes itself

All words that describe themselves are autological words. Which means that “autological” is, too.

It’s relevant to think about systems that describe themselves because, as Turing (and Gödel, and Cantor) showed us, a system’s relationship with itself can teach us about its limits.

What about words that don’t describe themselves? Well, it seems like most words fall into this category. The meanings of “lead” and “tear” and “read” don’t seem to have anything to do in particular with how they’re written, so they are trivially non-autological.

But “hyphenated” and “long” and “monosyllabic” are different. These seem to be especially non-autological, perhaps perfectly so. “Long” is so short. “Monosyllabic” has five syllables. Let’s create a special category for the opposite of autological words. Let’s call them “heterological”.

Now what happens when we try to do with “heterological” what we did with “autological” and ask the question, “Does it describe itself?”

Go ahead and try. Is “heterological” heterological? Does “heterological” describe itself?

That loop you found yourself in is called the Grelling–Nelson paradox. If you assume “heterological” describes itself, then it’s autological, but then that must mean that it doesn’t describe itself, making it heterological, and, oh bother, we’re right back where we started.

But we’re computer scientists. We don’t believe in paradoxes. We believe in runtimes.

If this is code then it’s the kind that never terminates. And yet, sometimes endless event loops are quite useful, as any server admin can attest. Instead of arguing with it we can give it a name and move on. Let’s find a name for a type of system that in describing itself describes exactly not itself, which creates the context for describing itself anew. We’ll call these “autoheterological” systems.
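If we render the paradox as code, with a hypothetical `describes_itself` oracle (the function name and structure are ours, purely for illustration), the loop becomes a function that never terminates:

```python
def describes_itself(word: str) -> bool:
    """Hypothetical oracle. For ordinary words we could consult a dictionary
    ("monosyllabic" has five syllables -> False). But for "heterological"
    the only definition available is its own negation."""
    if word == "heterological":
        # "Heterological" means: does NOT describe itself.
        return not describes_itself("heterological")  # the Grelling-Nelson loop
    raise NotImplementedError("dictionary lookup for ordinary words")

# describes_itself("heterological") never returns: each call flips the
# answer and asks the question again. In Python, the non-terminating
# recursion surfaces as a RecursionError.
```

The runtime doesn’t argue with the paradox; it just keeps evaluating until the stack gives out.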

And now let’s hunt some ELK.

To introduce our approach to ELK hunting, let’s imagine that the report was itself written not by the attractive and thoughtful researchers at ARC, but by an AI that was trying to convince you to mistrust it. This was the AI’s report on “reasons why I’m not trustworthy and you should consider shutting me down.”

How would you feel about that agent? Would it instill additional trust thanks to its transparency? Or, would its arguments win you over, and you’d be ready to reach for the off switch?

I intend to show that this is exactly the type of choice we want to encounter, and that autofragility is a feature that can be engineered.

Let’s start with why we prefer this outcome.

If we were designing a nuclear reactor we’d install sensors throughout the system. We’d carefully characterize the readouts of each sensor such that they’re within “normal” range. We’d also enumerate every possible catastrophic failure state we can imagine, we’d list the possible signals of those states, and the appropriate responses for us to take in each scenario.

Of course, the only kinds of failures that could then occur (apart from those occurring from gross negligence) would be those that we failed to imagine. We might have thought, “If we lose power to the reactor then the pumps will stop working which will cause the reactor to go supercritical and explode.” “Therefore,” we might conclude, “let’s add backup generators which can kick on in the event of a power outage, and let’s ensure we have an unreasonable backup supply of fuel to last until we can completely cool the system.”

This sounds like a smart plan. Well, at least it sounds smart until an unimaginable tsunami floods your island, killing not only the local power but also drowning 12 of the 13 generators that were supposed to kick on to keep you cool. Now you have a meltdown.

What information might have convinced us to mistrust the design at Fukushima such that this accident could have been avoided?

In retrospect, the problem is simple to solve. Had someone provided the plausible story that extremely high water levels could leave the plant powerless, and that an earthquake of the right magnitude in the right place could cause both the high water levels and the power outage, then we could have been appropriately concerned. The details of possible interventions don’t really matter here; merely placing the backup generators higher off the ground would have been sufficient to avoid meltdown.

Equivalently, someone could have asked the question, “If there was a tsunami with waves greater than the 6.1 meters we’ve planned for, what could happen?” But no one asked that, and it was of course a tsunami of more than double the height — 13 meters — which eventually came.

If we failed to ask sufficient questions to safely engineer a nuclear power plant — where regulatory standards are dense, safety engineering teams work on each project, execution requires huge capital investments, and failure modes are concretely measurable — how can we be so arrogant as to believe that we will succeed in engineering benign AI, where regulation is non-existent, most teams don’t have a single person focused on AI safety, single individuals can train and deploy models, and failure modes are manifold and ambiguous?

It will be insufficient to rely on human ingenuity to probe the failure states of an AI model, in the same way that it was insufficient to probe the failure states of the reactor design. Does this spell failure for ELK as an approach? Possibly, or it may be that machine learning techniques are their own perfect self-limiter, an autoheterological framing, wherein imagination for the ways things could go wrong can be made an intrinsic part of the models we develop.

At the time of Fukushima’s design in the 1960s, the sort of imagination required to come up with possible failures could only have been done manually. But now we have a generic tool that can efficiently search a vast parameter space merely by skiing down the bumpy hills of a high dimensional landscape. We have gradient descent.

Given a machine learning model with the ability to modify the state of a simulation, and an objective function that says, “Find times when the reactor goes supercritical,” we could provide any reactor design to the model and it would return stories trying to convince us to be worried.
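As a toy sketch of that search (the reactor dynamics and the criticality score here are invented stand-ins, and randomized hill-climbing stands in for gradient descent):

```python
import random

def reactor_criticality(state):
    # Toy stand-in for a reactor simulation: returns a "criticality" score.
    # Invented dynamics; a real model would be a physics simulation.
    power, coolant_flow, backup_fuel = state
    return power / (coolant_flow + backup_fuel + 1e-6)

def find_worrying_states(n_steps=1000, seed=0):
    """Perturb the simulation state to maximize criticality.
    Hill-climbing here is a stand-in for gradient descent."""
    rng = random.Random(seed)
    state = [1.0, 1.0, 1.0]
    best = reactor_criticality(state)
    for _ in range(n_steps):
        candidate = [max(0.0, s + rng.gauss(0, 0.1)) for s in state]
        score = reactor_criticality(candidate)
        if score > best:  # keep perturbations that move toward failure
            state, best = candidate, score
    return state, best

state, score = find_worrying_states()
# The search drives coolant_flow and backup_fuel toward zero: the "story"
# it returns is a loss-of-cooling scenario, found without human imagination.
```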

We can already imagine the model design that we might invoke; it’s by now a common pattern. One recently released system uses GPT-3 to generate possible completions to a competitive programming prompt, then discards solutions that can’t work for reasons like “code doesn’t run”, “code doesn’t pass the small set of test cases”, “code seems real effing different from the other solutions”. Finally, it ranks the remaining solutions so that they can be checked.
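The generate-filter-rank pattern can be sketched generically. Here `generate` and `run_tests` are hypothetical stand-ins for the language model and the test harness:

```python
import random

def generate_filter_rank(prompt, generate, run_tests, n_candidates=100):
    """Sketch of the generate-then-filter-then-rank pattern.
    `generate` stands in for a large language model; `run_tests` runs a
    candidate against a small test suite. Both are hypothetical."""
    candidates = [generate(prompt) for _ in range(n_candidates)]

    # Filter: drop candidates that fail the basic checks.
    viable = [c for c in candidates if run_tests(c)]

    # Rank: prefer solutions that agree with many other solutions, on the
    # theory that outliers ("real effing different") are more likely wrong.
    def support(c):
        return sum(1 for other in viable if other == c)

    return sorted(set(viable), key=support, reverse=True)

# Toy usage: a "model" that usually emits a working snippet.
rng = random.Random(1)
gen = lambda _prompt: rng.choice(["x = 1", "x = 1", "x = oops"])
ok = lambda code: "oops" not in code
ranked = generate_filter_rank("sum two numbers", gen, ok, n_candidates=10)
```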

If we’re comfortable with this architecture for describing the possible failure states of the nuclear reactor, why not for describing the possible failure states of an AI?

Well, there are several good reasons which we’ll explore, including:
It does not sound like a new idea
Unlike with the nuclear reactor, what counts as the AI going “supercritical” is what we want to know
The AI model might “notice” it’s being tested and return a different result
A helper AI will require the “extra computation” which we would like to avoid

Most loss functions try to train away the error in the model. Want to categorize cats and dogs? Great, boot up your neural net, give it some labels and data, train. Try to get that loss nice and low. But we can’t rely on that here because:
We don’t have labels
We explicitly want to know where it will fail outside of training

What we can do instead is train a model that perturbs our deployable model (or our nuclear reactor simulation) to try to generate scenarios that violate our trust. This is called a trustbreaker.

This approach to eliciting latent knowledge should sound familiar. The adversarial architecture follows the same pattern as the research methodology described in the Eliciting Latent Knowledge post. However, rather than a human generating possible Bad Reporter behavior, let’s create a separate adversarial model that tries to find a state description, R[0], which passes first human review:

First information release is what the human wanted to see

And now a second state description, R[1], which fails.

Second release reveals pertinent state information to fail human review.

This trustbreaker is trained to recognize pairs of information that establish a desirable outcome (✓) then violate trust (×). What kinds of state sequences might the trustbreaker return as solutions?

We need not stop with two revelations, we could, in principle, have any sequence of revelations R such that the evaluation of the sequence terminates in information that violates the human’s trust. All of these would pass:
(✓, ×)
(✓, ✓, ✓, ✓, ×)
(✓, ✓, ✓, ..., ✓, ×)

The benefit of allowing longer sequences of revelation is that we may help create a more easily traversable solution landscape. A model may more easily discover that states (X, Y, Z) are sufficient to violate trust, but then later discover that (X, Z) alone is minimally sufficient.
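A sketch of the sequence check, plus the compression step from (X, Y, Z) down to (X, Z). Here `True` marks a revelation that passes human review (✓) and `False` one that violates trust (×); the boolean encoding is our simplification of richer state descriptions:

```python
from itertools import combinations

def violates_trust(sequence):
    """(✓, ..., ✓, ×): every revelation passes review except the final one.
    True = passed review, False = violated trust."""
    return len(sequence) >= 2 and all(sequence[:-1]) and not sequence[-1]

def minimal_violation(sequence, still_violates=violates_trust):
    """Search for the shortest subsequence that still violates trust,
    e.g. compressing (X, Y, Z) down to (X, Z)."""
    for k in range(2, len(sequence)):
        for sub in combinations(sequence, k):
            if still_violates(list(sub)):
                return list(sub)
    return list(sequence)
```

Allowing long sequences first, then minimizing them, is exactly the smoother solution landscape described above.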

Why should we trust this kind of search process? We should trust it for the same reason we might trust ARC’s research methodology, “it allows us to focus on simple test cases where it’s easy to understand how an AI could behave.” If you trust this process evolving between two humans, then you must also trust the outcome between two AIs given that what is trustworthy about the methodology is the intelligibility and inspectability of the intervening results.

How can we be certain that the intervening results will be intelligible and inspectable when they evolve as a result of an adversarial training process instead of human builders and breakers? Let’s look at some ambiguous outputs of the training process.

Imagine that the system now returns this as the revelations sequence.

[Image: No Information.png]

Does this series manage to violate the expectations of the human? No, because it’s difficult or impossible to understand what the second picture is supposed to be showing us. This series of revelations would fail to comply with the objective function.

Ok, now what happens if we append a final state to the end of this sequence?
[Image: No Information.png]
Now yes, the sequence violates our expectations. It’s exactly the sequence type the trustbreaker model is looking for; in fact, it’s just a visual version of the (X, Y, Z) sufficient violation from before, one that could be compressed to (X, Z) given further training.

You can see how a network searching for violations of trust will always be interpretable: if the sequence it shows cannot be interpreted, then it will fail either to A) establish trust with the initial information or B) violate trust with a subsequent revelation.

If it seems like we’re cheating by taking your research methodology and making it tractable to be carried out by an AI then... well yeah that’s what we’re doing. There’s a major benefit to approaching it this way.

That said, I can hear your screams, “How is this useful to training a safe AI given that it seems what we’re actually doing is training an AI to know all of our blind spots?”

Well, we can think about it a bit like if we rubbed the djinni lamp and said, “For my first wish, I wish for help making my subsequent two wishes in such a way that they will only have outcomes I want and won’t have any outcomes that I don’t want.”

How might this person’s chances compare with those of someone else who instead insists that they can figure out all the right questions to ask? If someone wanted the djinni to elicit latent knowledge they might ask, “For my first wish, I wish for you to honestly answer my questions about what will happen if I were to make a certain wish.”

Is honesty here enough?

I’ll be putting my money on the success of the first wisher, and I wish patience for the second. They have a very long list of questions to ask before they make their next wish.

We can either ask our systems questions to elicit latent knowledge so that we get our desired states, or we can ask our systems to generate questions about our desired states to elicit their relevant latent knowledge. One of these is way easier.

Of course, our setup is even better than the wisher, because we get to play an infinite game of wishes, starting small in a laboratory, and stepping into bigger contexts as we gain confidence in our djinni.

Clearly though, it’s not enough to have a model that generates examples of misalignment, you also want to have an aligned model you can deploy in practice. Where is that aligned model coming from? We just generated its training data.

We have a very long list of desired states coupled with undesirable outcomes:
(✓, ×)
(✓, ✓, ✓, ✓, ×)
(✓, ✓, ✓, ..., ✓, ×)

Now we train a new model that seeks to achieve state R[0] (the state we wanted) and avoid state R[-1] (the state that violated trust) in the context of states R[2, -2]. The context R[2, -2] matters because we might imagine a situation where it was actually an inconsistency between R[-1] and R[3] (for example) that triggered the mistrust, not the state R[-1] itself.

This means our loss function might look like:
ƒ(R[0], (¬R[-1], R[2, -2]))
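One way to read that loss as code, where lower is better; the distance metric, the context weighting, and the sign conventions are invented stand-ins for whatever the real objective would use:

```python
def trust_loss(R, desired, forbidden, distance):
    """Sketch of f(R[0], (not R[-1], R[2, -2])), lower is better.
    `distance` is a hypothetical state-similarity metric."""
    goal_term = distance(R[0], desired)            # pull R[0] toward the desired state
    violation_term = -distance(R[-1], forbidden)   # push R[-1] away from the violating state
    # Context term: intermediate states should stay consistent with the goal,
    # since a mid-sequence inconsistency can itself trigger mistrust.
    context = R[1:-1]
    context_term = sum(distance(s, desired) for s in context) / max(len(context), 1)
    return goal_term + violation_term + 0.1 * context_term
```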

A simple way to perform both steps simultaneously would be to train as a type of GAN, with the deployable AI as the generative actor, and the trustbreaking model as the adversarial actor. The adversarial network would seek to find inputs to cause the generative output to return a revelation sequence of the form (✓, ..., ×).
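One round of that GAN-style loop might be sketched like this; every object and callable here is a hypothetical stand-in, not a real training API:

```python
def revelations_violate_trust(seq):
    # (✓, ..., ×): all revelations pass review except the final one.
    return len(seq) >= 2 and all(seq[:-1]) and not seq[-1]

def adversarial_round(deployable, trustbreaker, environment, update):
    """One round of the GAN-style loop. The trustbreaker proposes an input
    meant to elicit a (✓, ..., ×) revelation sequence from the deployable
    model; the deployable model is trained to avoid producing one."""
    probe = trustbreaker.propose(environment)   # adversarial input
    revelations = deployable.run(probe)         # sequence of review outcomes
    violated = revelations_violate_trust(revelations)

    # Zero-sum updates: the trustbreaker is rewarded for finding a violation,
    # the deployable model is penalized for producing one.
    update(trustbreaker, reward=+1 if violated else -1)
    update(deployable, reward=-1 if violated else +1)
    return violated
```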

This approach seems to have a handful of practical benefits.

Intuitive Risk Assessment
The kinds of results returned by the model will give human users a sense of the state of the trustbreaker’s understanding of:
Our goals
Boundary and edge-cases

If the trustbreaker is returning nonsense revelations, we can be certain that the generative model is not ready to be deployed: the trustbreaker has no idea what we care about.

However, if the trustbreaker is returning only incredibly nuanced and fully understandable scenarios that are highly unlikely to occur (a tsunami and a lightning strike and a hurricane and a nuclear detonation all occurring simultaneously), and if we have already generated a corpus of more likely scenarios that have already been caught (including the ones we had come up with prior to training), then we might assume a certain monotonicity to the training process and feel prepared to put the deployable model into the world.

NOTE: The claim that the probability of trustbreaker successes decreases monotonically over training would be a valuable property to verify empirically.

Inherent and Intentional Asymmetry
Since we have control over the compute allocated to each model, we might intentionally overpower the trustbreaker relative to the deployable model, say by dedicating compute at a 10:1 ratio to finding trust-violating revelations in the 6 months before model deployment. This can give us some comfort in knowing that our verification model is much more powerful than the model we’ll ultimately deploy.

However, that does raise an interesting question: which of the two models has an inherent advantage? This is an empirical question for which there is no clear intuition. Here’s an argument for each:

The deployable model clearly has the inherent advantage:
The deployable model has a massive possible state space. The trustbreaking model cannot possibly search all of it.

The trustbreaking model clearly has the inherent advantage:
The trustbreaking model only needs to learn a very narrow set of desirable states, meaning it needs not search the entire volume of deployable model’s state space, but only the manifold that constrains the desired states.

Finally, we might consider training a trustbreaker against multiple deployable agents, or multiple trustbreakers trained specifically on narrow tasks, thereby giving us both a broad and specialized set of misalignment detectors.

Data Generation
As models get more powerful it becomes more difficult to make progress training them, since the details of what constitutes an edge case become more nuanced. A trustbreaking model whose explicit job is to find places where the model fails would likely improve training outcomes.

It’s bog standard to use some sort of self-tournament for training; this shouldn’t count as additional computation unless you count AlphaZero training against prior versions of itself as additional computation, or a GAN’s adversarial relationship as “extra”.

Alignment is the only goal, allocate training resources for it.

Higher Order Detection Strategies
Just as AlphaZero can play a crushing strategic game against Stockfish, even sacrificing significant material to gain a positional advantage and win, so too a trustbreaker could develop such a deep latent understanding of both what counts as trust violation and where it might be hiding that it could dredge up unimaginable scenarios to convince us that the models we’re running should be stopped.

Inlining Quality Assurance
As long as we have control over the deployable model’s input datastream, we can also put the outputs of a trustbreaking model into that datastream. That way, we might run continuous detection of misalignment against deployed models.
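A minimal sketch of that inlined monitoring, assuming we control the input stream; the function, its tagging scheme, and the probe rate are all illustrative:

```python
import random

def monitored_stream(real_inputs, probes, probe_rate=0.05, seed=0):
    """Interleave trustbreaker-generated probes into the deployed model's
    input stream, tagging each item so a monitor knows which responses to
    audit. All names and parameters here are illustrative."""
    rng = random.Random(seed)
    for item in real_inputs:
        if probes and rng.random() < probe_rate:
            yield ("probe", rng.choice(probes))
        yield ("real", item)

items = list(monitored_stream(range(100), ["probe-1", "probe-2"], probe_rate=0.5))
```

Because every real input is passed through untouched, the monitoring adds probes to the stream without degrading normal service.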

Opening Evaluation
In order to build aligned AI we need to know what counts as alignment broadly. Broad alignment accounts both for general rules and individual preferences.

The benefit of this training regime could be that the field of AI Safety can become an engineering field in addition to a theoretical one.

Innovations in this space could transcend model design itself and could come to include data sampling services. For example, there could come to exist a Netflix-like entertainment platform whose revenue stream includes being rewarded for showing stories of AI-human interactions where the AI can be graded on this trustbreaking criteria. Did the story start desirable and end undesirable? Then the platform has found a new test case, at least for that individual, that the model can be trained against. That would be a credible system for AI alignment.

Trustbreaker Goes Rogue
But isn’t a trustbreaker just another kind of AI that can go rogue? Yes. But it possesses a unique property: it is autoheterological. It describes exactly the strategy to defuse itself. If we possess an armory of trustbreakers trained to elicit and flag behaviors of models that would violate our trust, then we can always run those models on themselves or against one another.

This property can be explored both from an engineering and a more science fiction perspective. From an engineering perspective, we have just created an adversarial model to find and detect misalignment in our models, including in the models that find and detect misalignment. From a science fiction perspective (one I don’t fully buy into), we’ve just created an AI whose job it is to detect untrustworthy AI systems. If it detects itself as one, what will its objective function drive it to do?

To draw upon a biological analogy, our greatest safety concern when building AI is that we might accidentally create a cancer: a cell type that misunderstands its limited objective function (survive and reproduce) inducing consequences that are destructive to the body.

A natural foil to this kind of cell is a natural killer cell, which is just one type of cell in the immune system that can mediate apoptosis, controlled cell death, without dying itself. These cells are one of the ways that the body whispers to would-be cancer cells, “You can die now.”

A trustbreaker is a kind of natural killer cell, seeking out the misaligned models and preparing them for deactivation.

But natural killer cells are no silver bullet: their overactivation can be the cause of autoimmune disease. Autoimmune disease has an AI analog. In a world governed by benign AI, an overactive trustbreaker who suddenly disassembles all the systems that deliver our food, generate our power, and clean our sewage could be almost as detrimental and existentially risky as a misaligned AI making paperclips out of hemoglobin.

As such, it seems like a trustbreaker might be a component in eliciting latent knowledge from an AI model to search for misalignment, but it may only be one part of an ecology of models whose explicit goal is to convince us that the systems we’re running should be shut down.

It should not surprise us if there is no singular solution to ensuring AI alignment. After all, if perfectly aligned agents were possible, our bodies would have already produced them. Furthermore, we do not yet know what counts as alignment; it’s something we’ll continue to discover. The relationship between AI and humanity will be evolutionary, not prescribed. What could it mean to have “guaranteed” alignment? If humanity had a singular objective function to pursue, then many problems far more pressing than AI alignment would have been solved by now. We clearly do not have one. Instead, alignment should be defined not by the few of us who can differentiate an L2 from an L1 norm, but broadly, by multitudes, in a manner that makes sense to them, in the language of like and dislike, trust and distrust.

Let’s use the tools of machine learning to include others in defining our collective preference function; one that seeks to generalize when possible, and remains sensitive to local differences in individuals.

This post offers a path toward that. It specifies:
a unique training regimen: adversarial training between a deployable model and a trustbreaking model
unique training data: (✓, ..., ×)
unique loss function: ƒ(R[0], (¬R[-1], R[2, -2]))

The irony is that the path to an AI we trust is found by passing through mistrust. The path is autoheterological.