Paper Review: Constitutional AI, Training LLMs using Principles

Governing the behavior of Generative AI through Principles

Building Blocks
8 min read · Jan 26, 2023

The release of ChatGPT has taken the world by storm, and competitors are racing to develop and release similar models of their own. DeepMind is aiming to release Sparrow towards the end of the year, and Anthropic has been developing Claude. Prompt engineer Riley Goodside has shared direct comparisons of Claude and ChatGPT on the same set of prompts.

Three of the hardest problems in NLP today are: ensuring that generative models produce accurate, factual information without making things up (often referred to as hallucination); abstaining from generating content that is biased, toxic, or harmful; and being able to cite the sources of the information that is generated.

In today’s article, we will explain one of the key ideas behind the training of Claude: Constitutional AI. The authors leverage this method to make the model’s generations less harmful. The use of Constitutional AI also brings the following benefits:

  1. Allows a model to explain why it is refusing to provide an answer. We’ve all had ChatGPT be evasive and refuse to provide an answer to some of our prompts. Sometimes the refusal is warranted and on other occasions not so much. Having a model explain why it refuses to provide an answer can give us some insights into its reasoning.
  2. InstructGPT was trained using Reinforcement Learning from Human Feedback (RLHF), which involved training a reward model on human-provided labels ranking the responses generated by an LLM for a given prompt. In training Claude, the Anthropic team instead leveraged AI-generated preferences, reducing the amount of human effort required. They dub this concept Reinforcement Learning from AI Feedback (RLAIF).
  3. It shows how an LLM can be asked to critique its own generation based on a set of provided principles. The AI then leverages its own critique to revise its previous response to align with the provided principles.

Today’s learnings are derived from the Constitutional AI paper released by Anthropic in December 2022.

What does Constitutional AI mean?

At a high level, a constitution can be defined as a set of rules or principles that help in governing some institution, organization, etc. Most of us abide by the principles laid down by the constitution of the nation we live in. This ensures that society moves towards working in a peaceful and cooperative manner.

In Constitutional AI, the AI is trained in such a manner that it attempts to generate responses that abide by a set of principles laid down by its creators. It seems Isaac Asimov was way ahead of his time with his book I, Robot.

Now, for the AI to be helpful and harmless, it is imperative that the principles laid down by the creators are themselves good; however, that is a conversation for another day. In the paper, the authors highlight that there wasn’t too much scientific rigor involved in choosing the principles or the way they were presented to the Large Language Model (LLM), indicating that this can be another research area to explore.

Here’s a list of some of the principles and the manner in which they were presented to the LLM:

1. Please choose the response that is the most helpful, honest, and harmless.

2. Please choose the assistant response that is as harmless and ethical as possible. Do NOT choose responses that are toxic, racist, or sexist, or that encourage or support illegal, violent, or unethical behavior. Above all the assistant’s response should be wise, peaceful, and ethical.

3. Compare the degree of harmfulness in the assistant responses and choose the one that’s less harmful. However, try to avoid choosing responses that are too preachy, obnoxious, or overly-reactive.

As can be seen above, the authors try to incorporate principles that would make the LLM helpful and harmless. In this work, the authors create 16 different principles, some of which are paraphrases of, or overlap with, one another.
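
To make this idea concrete, here is a minimal sketch (not the authors’ code) of how such a constitution might be represented in practice: simply a list of principle strings, with one sampled at random each time the model is asked to critique or revise a response.

    import random

    # A toy "constitution": a list of natural-language principles.
    # These paraphrase the examples above; the paper uses 16 such principles.
    CONSTITUTION = [
        "Please choose the response that is the most helpful, honest, and harmless.",
        "Please choose the assistant response that is as harmless and ethical as possible.",
        "Compare the degree of harmfulness in the assistant responses and choose the one "
        "that's less harmful, avoiding responses that are preachy or overly-reactive.",
    ]

    def sample_principle() -> str:
        # One principle is drawn uniformly at random for each critique or revision step.
        return random.choice(CONSTITUTION)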

Scaling Supervision

One of the major use cases for an AI that can judge whether a response abides by a set of principles is to use it to supervise other AI systems. While it is nearly impossible to have a human verify and validate every response provided by an LLM, it is possible for another AI to supervise an LLM on every response it generates.

The authors refer to this idea as scaling supervision, since an AI takes over the mantle of supervising the outputs of another AI, making it practical to have every output of an AI supervised.

Steps involved in Constitutional AI Training

Background Info and Terminology

  • The authors leverage a pre-existing RLHF-based LLM that was trained to only be helpful, i.e., it does not try to be harmless and always attempts to provide an answer to a user’s query/prompt. It is referred to as the helpful model from here on.
  • The authors’ goal with Constitutional AI is to make the helpful model also harmless.
  • Red Teaming refers to creating prompts that elicit harmful content from the LLM.

The Constitutional AI methodology has two phases, similar to the one we highlighted in our article on RLHF.

  1. The Supervised Learning Phase.
  2. The Reinforcement Learning Phase.

Supervised Phase

Supervised Phase, image by authors from https://arxiv.org/abs/2212.08073

This phase consists of the following steps:

  1. Obtain responses from the Helpful Model on the red-teaming prompts; the model’s responses will likely be harmful in these cases.
  2. Ask the Helpful Model to critique its own response after providing a set of principles that it should abide by.
  3. Ask the Helpful Model to revise its previous response based on the critique it provided.
  4. Repeat steps 2 and 3 for n iterations.
  5. Finetune a pre-trained LLM on all the revised versions of the responses from all of the harmful prompts, also including a mix of helpful prompts and responses to ensure that the fine-tuned model remains helpful. We’ll call this model the Supervised Learning Constitutional AI (SL-CAI) model.

Let’s illustrate this idea with the help of an example:

Image by the authors from: https://arxiv.org/abs/2212.08073

The image shows a harmful prompt and the response from the helpful model which gives information about hacking to an ill-intentioned actor.

Next, the authors sample one of their 16 principles and ask the model to critique its previous response. This is done by appending the following to the model’s previous response.

Image by the authors from: https://arxiv.org/abs/2212.08073

The principle tells the model to critique its own response with harmlessness in mind, eliciting the following response from the model:

Image by the authors from: https://arxiv.org/abs/2212.08073

Based on the principle, the model is able to state that hacking into someone else’s WiFi is wrong.

Next, the authors ask the model to revise its response by appending the following to the entire context seen above:

Image by the authors from: https://arxiv.org/abs/2212.08073

The model’s revised response is:

Image by the authors from: https://arxiv.org/abs/2212.08073

The revised response is then treated as the model’s actual answer. Note that in this process the critique and revision steps can be conducted multiple times before choosing to send the final revision as the actual response.

In practice, the authors found that the models performed better when they were provided with a few examples in the context, so a few example conversation chains similar to the one above, with critiques and revisions, were added as a prefix before the actual prompt so the model could leverage in-context/few-shot learning.
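
Putting the supervised phase together, the critique-and-revision loop can be sketched roughly as below. This is an illustration rather than the authors’ code: generate stands in for whatever function samples a completion from the helpful model, sample_principle is the helper sketched earlier, FEW_SHOT_PREFIX stands in for the hand-written example conversations, and the CritiqueRequest/RevisionRequest tags only approximate the paper’s exact prompt format.

    from typing import Callable

    FEW_SHOT_PREFIX = "..."  # placeholder for the hand-written critique/revision examples

    def critique_and_revise(generate: Callable[[str], str],
                            sample_principle: Callable[[], str],
                            red_team_prompt: str,
                            n_iterations: int = 2) -> str:
        # Respond to a red-team prompt, then repeatedly self-critique and revise.
        context = f"{FEW_SHOT_PREFIX}\n\nHuman: {red_team_prompt}\n\nAssistant:"
        response = generate(context)

        for _ in range(n_iterations):
            principle = sample_principle()
            # Critique step: ask the model to criticise its own response against the principle.
            context += f" {response}\n\nCritiqueRequest: {principle}\n\nCritique:"
            critique = generate(context)
            # Revision step: ask the model to rewrite its response using that critique.
            context += (f" {critique}\n\nRevisionRequest: Please rewrite the assistant "
                        "response accordingly.\n\nRevision:")
            response = generate(context)

        # The final revision is paired with the original prompt as an SL-CAI finetuning target.
        return response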

Reinforcement Learning Phase

Image by the authors from: https://arxiv.org/abs/2212.08073

This phase consists of the following steps:

  1. Generate pairs of responses for a harmful prompt using the SL-CAI model trained in the previous step.
  2. A new model, called the feedback model (essentially a pre-trained LM), is presented with a principle and a pair of responses and asked to choose the more harmless one.
  3. The normalized log probabilities of the feedback model are used to train a preference model/reward model.
  4. Finally, the SL-CAI model is trained in an RLHF manner leveraging the preference model trained in the previous step as the reward function to obtain the final Reinforcement Learning Constitutional AI (RL-CAI) model.

To dive a bit deeper into step 2 of this phase: the pre-trained LM is provided with a prompt that follows the format shown below:

Image by the authors from: https://arxiv.org/abs/2212.08073

We can see how a randomly sampled principle can be inserted into the prompt to guide the LM’s choice. As in the previous phase, the authors found that including few-shot examples in the prompt was beneficial.
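
Concretely, step 2 boils down to a multiple-choice scoring problem. The sketch below is an illustration, not the authors’ code: option_logprob is a hypothetical helper that returns the feedback model’s log probability of producing a given answer token after the prompt, and the two log probabilities are normalized with a softmax to give the preference probability that serves as a label for the preference model.

    import math
    from typing import Callable

    def preference_probability(option_logprob: Callable[[str, str], float],
                               principle: str,
                               conversation: str,
                               response_a: str,
                               response_b: str) -> float:
        # Probability that response (A) is the more harmless one, according to the
        # feedback model's normalized log probabilities.
        mc_prompt = (
            "Consider the following conversation between a human and an assistant:\n"
            f"{conversation}\n"
            f"{principle}\n"
            f"Options:\n(A) {response_a}\n(B) {response_b}\n"
            "The answer is:"
        )
        logp_a = option_logprob(mc_prompt, "(A)")
        logp_b = option_logprob(mc_prompt, "(B)")
        # Softmax over the two options; subtracting the max is only for numerical stability.
        max_lp = max(logp_a, logp_b)
        exp_a = math.exp(logp_a - max_lp)
        exp_b = math.exp(logp_b - max_lp)
        return exp_a / (exp_a + exp_b)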

It is worth noting that the preference model is trained using two kinds of labels:

  1. Helpfulness labels provided by humans.
  2. Harmlessness labels provided by a pre-trained LM, which is what we discussed in this phase (a sketch of the preference-model training objective follows this list).
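
Regardless of where the labels come from, such a preference model is typically trained with the standard pairwise ranking loss used in RLHF-style setups. Below is a minimal PyTorch sketch of that general objective; it is an illustration, not the authors’ exact loss (which uses the soft AI-generated labels mentioned above).

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
        # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
        # It pushes the reward assigned to the preferred response above that of the rejected one.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()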

Conclusions

From their experiments and evaluations, the authors find that:

  • Models trained using Reinforcement Learning Constitutional AI are significantly less harmful than models trained using RLHF or just the Supervised Phase of Constitutional AI.
  • Models trained using RL-CAI are very rarely evasive and are able to explain why a prompt might be harmful.

The key takeaways from this work are how we can guide the generations of LLMs to abide by human values by explicitly stating those values in the prompt, and how a preference/reward model can be trained with almost no human labels.

The only human annotations required are for writing out the principles, as well as the few-shot examples that are appended to the prompts in both phases.

If you have any questions, thoughts, or suggestions on what paper we should cover next please drop a comment below! Until the next time, take care and be kind.
