Mixture of Experts - Part 2
The announcement of Mixtral 8x7B and DeepMind's Mixture-of-Depths, among others, has once again made the Mixture of Experts (MoE) architecture a popular choice for transformers in the NLP community. Continuing from our previous blog post on fine-tuning, we discuss the popular approach of scaling LLMs via sparse experts, referred to as sparse MoE.