A Deep Dive into the Mixture of Experts Model

Introduction:
The Mixture of Experts (MoE) architecture has become a focal point of the open-source AI community since the release of Mixtral 8x7B. In this blog post, we will explore the fundamental architecture of MoEs, how they are trained, and the trade-offs to keep in mind when using them in practice. Let's dive in!

Overview:
MoEs offer several advantages over dense models: they pre-train faster, and they serve inference faster than a dense model with the same number of parameters. However, they also have high memory requirements, because every expert must be loaded into memory. And while fine-tuning MoEs has historically been difficult, recent research on MoE instruction tuning looks promising.

What is the Mixture of Experts (MoE) Model?
Model size is one of the most important levers for improving model quality. Given a fixed compute budget, training a larger model for fewer steps is often more effective than training a smaller model for more steps. MoE models can be pre-trained at a significantly lower computational cost than dense models, which means you can scale up the model or the dataset substantially within the same compute budget. In particular, during pre-training an MoE can reach the same quality as an equivalently sized dense model in much less time.

So, what exactly is MoE? In the context of Transformer models, MoE consists of two main components:

  1. Sparse MoE Layer: This layer replaces the traditional dense feed-forward network (FFN) layer. The MoE layer consists of several “experts” (e.g., 8 experts), each representing an independent neural network. These experts are often FFNs, but they can also be more complex networks or even MoEs themselves, forming a hierarchical MoE structure.
  2. Gate Network or Router: This network determines which tokens are sent to which expert. In the illustration from the Switch Transformers paper referenced below, for example, the token "More" is routed to the second expert, while the token "Parameters" is routed to the first expert. Note that a token can be routed to more than one expert. Deciding how to assign tokens to experts efficiently is one of the key design questions when working with MoEs. The router is made up of learnable parameters and is pre-trained together with the rest of the model. (A minimal code sketch of such a layer appears below.)

The Switch layer shown in the Switch Transformers paper is an example of such an MoE layer.
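To make these two components concrete, here is a minimal sketch of a sparse MoE layer with a top-2 router, written in PyTorch. The dimensions, expert count, and class names are illustrative choices for this post, not the configuration of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """An ordinary FFN block; each expert in the MoE layer is one of these."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class SparseMoELayer(nn.Module):
    """Drop-in replacement for the dense FFN: a learned router sends each token to k experts."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # the gate network: learnable, trained with the model
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])            # (n_tokens, d_model)
        logits = self.router(tokens)                   # (n_tokens, n_experts)
        weights, chosen = logits.topk(self.k, dim=-1)  # each token picks its top-k experts
        weights = F.softmax(weights, dim=-1)           # mixing weights over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_idx.numel() == 0:
                continue  # this expert received no tokens in this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)


# Example: run a batch of 4 sequences of 16 tokens through the layer.
layer = SparseMoELayer()
y = layer(torch.randn(4, 16, 512))
print(y.shape)  # torch.Size([4, 16, 512])
```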

Advantages and Challenges:
While MoEs offer advantages such as efficient pre-training and faster inference compared to dense models, they also present some challenges:

  1. Training: MoEs show high computational efficiency during the pre-training phase but can struggle to adapt to new scenarios during fine-tuning, often leading to overfitting.
  2. Inference: Although an MoE may contain a very large number of parameters, only a fraction of them are used for any given token, so inference is faster than with a dense model of the same total parameter count. The flip side is that all of the parameters still have to be loaded into memory, which demands substantial VRAM. For an MoE like Mixtral 8x7B, for example, we need enough VRAM to hold a 47B-parameter dense model (not 8 x 7B = 56B), because only the FFN layers are duplicated per expert while the rest of the model's parameters are shared. Likewise, since each token is routed to only two experts, the inference cost (measured in FLOPs) is comparable to that of a roughly 12B dense model (rather than 14B): it performs 2 x 7B worth of matrix multiplications, but the shared layers are computed only once (more on this later). A back-of-the-envelope calculation of these numbers follows this list.
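The parameter accounting above can be reproduced with a quick calculation. The split used below (roughly 1.3B shared parameters and roughly 5.7B per expert) is derived from the figures quoted above (each "7B" expert model shares its non-FFN parameters, and the total is about 47B); it is an approximation for illustration, not the exact published configuration.

```python
def moe_param_counts(shared_b: float, per_expert_b: float, n_experts: int, top_k: int):
    """Return (parameters that must sit in memory, parameters active per token), in billions."""
    total = shared_b + n_experts * per_expert_b   # every expert must be loaded into VRAM
    active = shared_b + top_k * per_expert_b      # but only top_k experts run for each token
    return total, active


shared_b = 1.3      # attention, embeddings, norms, routers: shared by all experts (approximate)
per_expert_b = 5.7  # one expert's FFN stack across all layers (approximate)

total, active = moe_param_counts(shared_b, per_expert_b, n_experts=8, top_k=2)
print(f"memory footprint ≈ {total:.0f}B parameters")   # ≈ 47B, not 8 x 7B = 56B
print(f"active per token ≈ {active:.0f}B parameters")  # ≈ 13B, in the ballpark of a ~12B dense model
```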

MoEs: A Brief History:
The concept of MoEs first appeared in the 1991 paper "Adaptive Mixture of Local Experts." The idea, related to ensemble methods, is to operate a system of separate networks in which each network, or "expert," specializes in a different region of the input space and handles a different subset of the training cases. A gate network decides which experts to use, and the experts and the gate network are trained jointly.

Between 2010 and 2015, two different research areas further contributed to the development of MoEs:

  1. Experts as Components: In the traditional MoE setup, the whole system consists of a gate network and multiple experts, and MoEs have been explored as complete models in methods such as Support Vector Machines (SVMs) and Gaussian Processes. The work of Eigen, Ranzato, and Sutskever instead explored MoEs as components of deeper networks, which makes it possible to build models that are both large and efficient.
  2. Conditional Computation: In a traditional network, every layer processes every input. During this period, Yoshua Bengio explored methods for dynamically activating or deactivating parts of the network depending on the input token.

These studies paved the way for MoEs in Natural Language Processing (NLP). In particular, Shazeer et al. applied sparsely gated MoE layers to large LSTM-based language models, where routing each token to only a few experts keeps inference fast even as the total parameter count grows enormous.

Summary:
To recap: MoEs replace the dense feed-forward network (FFN) layer of a Transformer with a sparse MoE layer made up of multiple experts, each an independent neural network, together with a gate network (router) that assigns tokens to experts; routing tokens efficiently is one of the central design questions of the technique. MoEs offer efficient pre-training and fast inference, but they are harder to fine-tune and need a lot of memory, since every expert has to be resident in VRAM. The idea has a long history, going back to the 1991 paper "Adaptive Mixture of Local Experts" and continuing through later work on experts as components of deeper networks and on conditional computation.

We hope this overview gives AI practitioners and researchers a solid starting point for exploring MoEs in their own work.
