Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Machine Learning Research at Apple
FEBRUARY 9, 2025
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations.
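As a minimal illustration (a sketch, not the paper's implementation), the snippet below contrasts standard softmax attention with a sigmoid variant in which each scaled dot-product score is passed through an elementwise sigmoid instead of a row-wise softmax. The function names and the bias term `b` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: row-wise softmax over scaled dot-product scores."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # each row sums to 1
    return weights @ V                                  # weighted sum of values

def sigmoid_attention(Q, K, V, b=0.0):
    """Sigmoid variant: elementwise sigmoid of the scores, so attention weights
    are computed independently per entry and rows need not sum to 1.
    The additive bias b is an illustrative knob, not a prescribed setting."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + b
    weights = 1.0 / (1.0 + np.exp(-scores))             # elementwise sigmoid
    return weights @ V

# Tiny usage example with random queries, keys, and values.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape)   # (4, 8)
print(sigmoid_attention(Q, K, V).shape)   # (4, 8)
```

The key design difference shown here is normalization: softmax couples the weights within each row, while the sigmoid applies independently to every query-key score.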