LLM Models
[Mixtral] Mixtral of Experts🔗
Arxiv: https://arxiv.org/abs/2401.04088 8 Jan 2024 Mistral AI
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).
- G(x) denotes the n-dimensional output of the gating network (router), and E_i is the i-th expert network.
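To make the notation concrete: the layer output is the gating-weighted sum y = Σ_i G(x)_i · E_i(x), where only the top-k experts per token receive nonzero weight. Below is a minimal PyTorch sketch of such a top-2 sparse MoE layer; the class and parameter names (`SparseMoELayer`, `num_experts`, `top_k`) are mine, and plain MLP experts stand in for the paper's SwiGLU feedforward blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k sparse MoE feedforward layer (not the reference implementation)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for every token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Experts: simple 2-layer MLPs as stand-ins for the paper's SwiGLU blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                  # (tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        # G(x): softmax over only the selected experts' logits.
        weights = F.softmax(topk_logits, dim=-1)               # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

With 8 experts and top-2 routing, each token only activates 2 of the 8 feedforward blocks, so only a fraction of the total parameters is used per token.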
Consecutive tokens are often assigned to the same experts. In fact, we observe some degree of positional locality on The Pile dataset: Table 5 reports the proportion of consecutive tokens that receive the same expert assignments, per domain and layer. The figures, however, do not show this clearly.
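A rough sketch of how such a repetition proportion could be computed, assuming per-token first-choice expert indices have already been extracted for one layer (the helper name and the exact metric definition are my assumptions; Table 5's methodology may differ):

```python
import torch

def repeated_assignment_rate(expert_ids: torch.Tensor) -> float:
    """Fraction of tokens routed to the same (first-choice) expert as the previous token.

    expert_ids: 1-D integer tensor of per-token expert indices for a single layer.
    """
    return (expert_ids[1:] == expert_ids[:-1]).float().mean().item()

# Example: uniform random routing over 8 experts gives a baseline rate of about 1/8 = 12.5%.
ids = torch.randint(0, 8, (10_000,))
print(repeated_assignment_rate(ids))  # ~0.125
```

Rates noticeably above that 1/8 random-routing baseline are what the positional-locality observation refers to.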
[Gemini] A Family of Highly Capable Multimodal Models🔗
Arxiv: https://arxiv.org/abs/2312.11805 19 Dec 2023 Google
The reasoning capabilities of large language models show promise toward building generalist agents that can tackle more complex multi-step problems.