Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

A simplified Mixture-of-Experts design that routes each token to a single expert, enabling scaling to trillion-parameter models while keeping the computational cost per token fixed.
Foundational Models

Author: Imad Dabbura

Published: May 2, 2025

#nlp #llm
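To make the routing idea in the summary concrete, here is a minimal sketch of Switch-style top-1 routing in PyTorch. The module name `SwitchFFN` and the dimension arguments are illustrative, not taken from the paper's codebase: a router produces a probability over experts for each token, and the single highest-probability expert processes that token, with its output scaled by the gate probability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Minimal sketch of a Switch Transformer feed-forward layer (top-1 routing)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a stream of tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)  # (n_tokens, n_experts)
        gate, expert_idx = probs.max(dim=-1)            # top-1: one expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                      # tokens routed to expert i
            if mask.any():
                # Scale the expert's output by its gate probability, so the
                # router still receives gradient despite the hard argmax choice.
                out[mask] = gate[mask].unsqueeze(1) * expert(tokens[mask])
        return out.reshape_as(x)
```

A full implementation would also enforce an expert capacity limit and add the auxiliary load-balancing loss, both central to the paper; they are omitted here for brevity.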
