Book: Alice’s Adventures in a differentiable wonderland

Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming.

This primer is an introduction to this fascinating field, imagined for someone who, like Alice, has just ventured into this strange differentiable wonderland. I give an overview of the basics of optimizing a function via automatic differentiation, along with a selection of the most common designs for handling sequences, graphs, texts, and audio. The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and to leave the reader able to understand some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.
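As a small taste of what "optimizing a function via automatic differentiation" looks like in code, here is a minimal sketch (not taken from the book) using JAX: a toy linear model and squared-error loss, with the gradient obtained automatically and used for a few steps of gradient descent. The data and step size are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

# A toy model: a single linear primitive followed by a squared-error loss.
def loss(w, x, y):
    pred = x @ w                      # differentiable primitive: matrix-vector product
    return jnp.mean((pred - y) ** 2)  # differentiable primitive: mean squared error

# jax.grad builds the gradient of the loss w.r.t. its first argument (w).
grad_loss = jax.grad(loss)

# Made-up data, just to make the example runnable.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 3))
y = jnp.ones(32)
w = jnp.zeros(3)

# A few steps of plain gradient descent on the loss.
for _ in range(100):
    w = w - 0.1 * grad_loss(w, x, y)
```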

Download the book

There are two versions of the PDF available for feedback and beta reading: a static draft on arXiv, and the most up-to-date version on this website. I also provide an errata list describing the differences between the two. I thank the many people who have been providing comments: I periodically update the draft here and on arXiv to fix errors and typos, and to add content where needed.

Table of contents

  1. Chapter 1: Foreword and introduction
  2. Chapter 2: Mathematical preliminaries
  3. Chapter 3: Datasets and losses
  4. Chapter 4: Linear models
  5. Chapter 5: Fully-connected layers
  6. Chapter 6: Automatic differentiation
  7. Chapter 7: Convolutional layers
  8. Chapter 8: Convolutions beyond images
  9. Chapter 9: Scaling up models
  10. Chapter 10: Transformer models
  11. Chapter 11: Transformers in practice
  12. Chapter 12: Graph layers
  13. Chapter 13: Recurrent layers
  14. Appendix A: Probability theory
  15. Appendix B: Universal approximation in 1D

Additional chapters

I will publish additional chapters here on more advanced material that I could not fit into the first volume. Eventually, I hope these will become part of a second volume. More probably, they will languish here forever.

  1. Model re-use (including parameter-efficient fine-tuning and model merging).
  2. Density estimation and generative modelling.
  3. Conditional computation (mixture-of-experts, early exits).
  4. Metric and self-supervised learning.
  5. Debugging and understanding models.