CLIP Skip with the Diffusers Library

Skipping the Top Layers of the CLIP Text Encoder for More Aesthetically Pleasing Images

Y. Natsume
3 min read · Aug 6, 2023

While browsing diffusion models on model-sharing sites such as CivitAI, I noticed that many of the generated images use a parameter of clip skip = 2, which immediately piqued my interest.

What in the world is clip skip? Why do people use it? Can it be used with the diffusers library?

In this short article we will explore what clip skip does, and how to implement it with the diffusers package.

CLIP

First of all, we need to know what CLIP is. CLIP, which stands for Contrastive Language-Image Pre-training, is a multi-modal model trained on 400 million (image, text) pairs. During the training process, a text encoder and an image encoder are jointly trained to predict which caption goes with which image, as shown in the diagram below.

Diagram of training the CLIP text and image encoders. Figure taken from https://arxiv.org/pdf/2103.00020.pdf.

The trained CLIP text encoder can then be used for other downstream tasks, such as encoding text prompts for input into diffusion models!

What Effects does CLIP Skip Have?

The community discovered that skipping the last few layers of the CLIP text encoder results in more aesthetically pleasing images. For example, with clip skip = 2 the top layer is skipped and the encoded text features from the second layer from the top are used instead. (With clip skip = 1, all layers of the text encoder are used without skipping.)
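To make this concrete, here is a minimal sketch of the difference between the two settings, written against the Hugging Face transformers CLIP text model; the prompt below is just a placeholder, not one of the prompts used later in the article.

import torch
import transformers

# Load the CLIP tokenizer and text encoder from the Stable Diffusion 1.5 repository.
tokenizer = transformers.CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
text_encoder = transformers.CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)

tokens = tokenizer("a portrait photo", return_tensors="pt")

with torch.no_grad():
    output = text_encoder(**tokens, output_hidden_states=True)

# clip skip = 1: the text features from the final layer.
features_clip_skip_1 = output.last_hidden_state

# clip skip = 2: the hidden states from the second layer from the top,
# passed through the final layer norm instead of the last transformer layer.
features_clip_skip_2 = text_encoder.text_model.final_layer_norm(
    output.hidden_states[-2]
)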

For example, the two images below were generated using the prompts and negative prompts for this image on CivitAI with the k-Main model: the one on the left with clip skip = 1 and the one on the right with clip skip = 2. By skipping the last layer of the CLIP text encoder, the one on the right has a distinctively more East Asian looking face than the one on the left.

Comparison of using clip skip = 1 and clip skip = 2. The image generated using clip skip = 2 has a distinctively more East Asian face. Image created by the author.

Using CLIP Skip with Diffusers

By default, the diffusers stable diffusion pipeline uses all layers of the CLIP text encoder. We have to explicitly tell the pipeline how many layers of the text encoder to use (and therefore how many to skip).

Here we present a modification of a solution proposed by Patrick von Platen on GitHub to use clip skip with diffusers:

import torch
import transformers
import diffusers

# Follow the convention that clip_skip = 2 means skipping the last
# layer of the CLIP text encoder.
# clip_skip = 1 will use all layers of the CLIP text encoder.
clip_skip = 2

# Precision to load the text encoder and pipeline weights in.
torch_dtype = torch.float16

# Path or Hub id of the custom model to load (placeholder; replace
# with the model you want to use).
model_path = "runwayml/stable-diffusion-v1-5"

# Load the CLIP text encoder from the stable diffusion 1.5 repository,
# keeping only the first 12 - (clip_skip - 1) of its 12 hidden layers.
text_encoder = transformers.CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="text_encoder",
    num_hidden_layers=12 - (clip_skip - 1),
    torch_dtype=torch_dtype,
)

# Load the stable diffusion pipeline for the custom model, passing in
# the truncated text encoder with CLIP skip.
# This diffusion pipeline with CLIP skipping should generate more
# aesthetically pleasing images!
pipe = diffusers.DiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    safety_checker=None,
    text_encoder=text_encoder,
)

By instantiating a new CLIPTextModel, we can specify the number of hidden layers of the CLIP text encoder to load. Since the Stable Diffusion 1.5 text encoder has 12 hidden layers, we keep 12 - (clip_skip - 1) of them, following the convention that clip skip = 2 means skipping the last layer, while clip skip = 1 means using all layers of the text encoder.
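As an optional sanity check (assuming the text_encoder created above), you can verify that the truncated encoder really contains fewer transformer layers:

# 11 when clip_skip = 2, 12 when clip_skip = 1.
print(text_encoder.config.num_hidden_layers)

# The number of transformer blocks actually loaded should match.
print(len(text_encoder.text_model.encoder.layers))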

The instantiated text encoder can then be inserted into the diffusion pipeline to generate more aesthetically pleasing images!
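For completeness, here is a minimal usage sketch of the resulting pipeline; the prompt, negative prompt, seed, and sampler settings are placeholders rather than the exact settings used for the comparison images above.

# Move the pipeline to the GPU.
pipe = pipe.to("cuda")

# Fix the random seed so that runs are reproducible.
generator = torch.Generator(device="cuda").manual_seed(0)

image = pipe(
    prompt="portrait photo of a woman, detailed face, soft lighting",
    negative_prompt="lowres, bad anatomy, blurry",
    num_inference_steps=30,
    guidance_scale=7.0,
    generator=generator,
).images[0]

image.save("clip_skip_2.png")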

References

  1. https://openai.com/research/clip
  2. https://arxiv.org/pdf/2103.00020.pdf
  3. https://huggingface.co/docs/transformers/model_doc/clip
  4. https://github.com/huggingface/diffusers/issues/3212
