The Attention Sink Phenomenon in Transformers

LLM

Author: Imad Dabbura

Published: September 3, 2025

In many large language models (e.g., GPT, LLaMA), certain attention heads consistently direct a disproportionately large share of their attention to the first token of the sequence, usually the BOS token, even though that token carries little semantic content. This behavior is known as the attention sink phenomenon.

Example:

When generating a sentence like

“The cat sat on the mat.”

some heads assign large attention weights to the BOS token rather than to semantically related tokens like “cat” or “mat.”
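
One quick way to see this is to inspect the attention weights directly. Below is a minimal sketch, assuming the Hugging Face Transformers library and GPT-2 as a stand-in small model (GPT-2 does not prepend a BOS token, so the sink appears on the first token of the prompt instead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM that returns attentions will do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# GPT-2 does not add a BOS token, so the sink shows up on the first real token.
inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len).
for layer_idx, attn in enumerate(out.attentions):
    # For each head, average the attention paid to position 0 over all query positions,
    # skipping the first query, which can only attend to itself under causal masking.
    sink_mass = attn[0, :, 1:, 0].mean(dim=-1)  # shape: (num_heads,)
    print(f"layer {layer_idx:2d}: max per-head attention to token 0 = {sink_mass.max().item():.2f}")
```

Running something like this typically shows that some heads, especially beyond the first couple of layers, place a large share of their attention mass on token 0.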

A few possible explanations have been proposed:

- Softmax forces each head’s attention weights to sum to 1, so when a head finds nothing relevant to attend to, the leftover probability mass has to go somewhere; the always-present first token is a convenient place to park it.
- Under causal masking, the first token is the only position that every query can see, so it becomes a stable, always-available target.
- Its key and value vectors can end up acting as a learned bias or “no-op” that heads fall back on.

The phenomenon has major practical impacts:

- Streaming and long-context inference: evicting the first few (sink) tokens from a bounded KV cache sharply degrades generation quality, which is why StreamingLLM-style methods keep them pinned in the cache (see the sketch after this list).
- Interpretability: raw attention maps can be misleading, since much of the attention mass reflects the sink rather than semantic relationships.
- Efficiency techniques such as sparse attention, KV-cache compression, and quantization need to treat sink positions carefully, because they carry unusually large attention mass and activations.
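
A toy sketch of sink-aware KV-cache trimming, in the spirit of StreamingLLM-style approaches: when a bounded cache overflows, keep the first few (sink) tokens plus the most recent window instead of using a plain sliding window. The function name and cache layout below are illustrative, not a real library API.

```python
import torch

def trim_kv_cache(keys: torch.Tensor,
                  values: torch.Tensor,
                  num_sink: int = 4,
                  window: int = 1020) -> tuple[torch.Tensor, torch.Tensor]:
    """Sink-aware eviction. keys/values: (batch, num_heads, seq_len, head_dim)."""
    seq_len = keys.shape[2]
    if seq_len <= num_sink + window:
        return keys, values  # still within budget, nothing to evict
    # Keep sink positions [0, num_sink) plus the most recent `window` positions.
    idx = torch.cat([
        torch.arange(num_sink),
        torch.arange(seq_len - window, seq_len),
    ])
    return keys[:, :, idx, :], values[:, :, idx, :]

# Example: a cache of 2048 positions is trimmed to 4 sink tokens + 1020 recent tokens.
k = torch.randn(1, 12, 2048, 64)
v = torch.randn(1, 12, 2048, 64)
k2, v2 = trim_kv_cache(k, v)
print(k2.shape)  # torch.Size([1, 12, 1024, 64])
```

A real implementation also has to handle positional encodings across the evicted gap; that bookkeeping is omitted here.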