LLM Optimization

    [LLM-in-a-Flash] Efficient LLM Inference with Limited Memory🔗

    Arxiv: https://arxiv.org/abs/2312.11514 12 Dec 2023 Apple

    First, "windowing'" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory.

    [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding🔗

    Arxiv: https://arxiv.org/abs/2104.09864 20 Apr 2021 Zhuiyi Technology Co.

    We investigated the existing approaches to relative position encoding and found that they are mostly built on the idea of decomposing the addition of position encoding into the context representations. We introduce a novel method, Rotary Position Embedding (RoPE), to incorporate positional information into the learning process of pre-trained language models (PLMs). The key idea is to encode relative position by multiplying the context representations with a rotation matrix that has a clear theoretical interpretation.
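
    A minimal NumPy sketch of the rotation (function and variable names are illustrative): each pair of channels at position m is rotated by an angle m·θ_i with θ_i = 10000^(-2i/d), so the dot product between a rotated query at position m and a rotated key at position n depends only on the relative offset m - n.

    ```python
    import numpy as np

    def rope(x, positions, base=10000.0):
        """Apply rotary position embedding. x: (seq_len, d) with d even."""
        seq_len, d = x.shape
        theta = base ** (-np.arange(0, d, 2) / d)        # (d/2,) per-pair frequencies
        angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]                  # split channels into 2-D pairs
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin               # rotate each pair by its angle
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Rotate queries and keys before the attention dot product.
    q, k = np.random.randn(8, 64), np.random.randn(8, 64)
    pos = np.arange(8)
    scores = rope(q, pos) @ rope(k, pos).T               # depends on relative positions
    ```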

    [LoRA] Low-Rank Adaptation of LLM🔗

    Arxiv: https://arxiv.org/abs/2106.09685 17 Jun 2021 Microsoft

    Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as the original model.

    Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency by extending model depth or reducing the model's usable sequence length. More importantly, these methods often fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
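
    A minimal sketch of the LoRA idea (class and parameter names are illustrative, not the released library's API): the pre-trained weight W stays frozen, the update is parameterized as a rank-r product B·A, and the product can be merged back into W for deployment, so, unlike adapter layers, no extra inference latency is introduced.

    ```python
    import numpy as np

    class LoRALinear:
        def __init__(self, W, r=8, alpha=16):
            d_out, d_in = W.shape
            self.W = W                                # frozen pre-trained weight
            self.A = 0.01 * np.random.randn(r, d_in)  # trainable, small random init
            self.B = np.zeros((d_out, r))             # trainable, zero init => no change at start
            self.scale = alpha / r

        def forward(self, x):
            # x: (batch, d_in). In LoRA only A and B are updated; W stays frozen.
            return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

        def merged_weight(self):
            # Fold the low-rank update into W: a single dense matmul at inference.
            return self.W + self.scale * self.B @ self.A
    ```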

    [Speculative] Fast Inference from Transformers via Speculative Decoding🔗

    Arxiv: https://arxiv.org/abs/2211.17192 30 Nov 2022 Google

    A key observation, that some inference steps are "harder" and some are "easier", is also a key motivator for our work. We additionally observe that inference from large models is often not bottlenecked on arithmetic operations, but rather on memory bandwidth and communication, so additional computation resources might be available.

    Therefore we suggest increasing concurrency as a complementary approach to using an adaptive amount of computation. Specifically, we are able to accelerate inference without changing the model architectures, without changing the training-procedures or needing to re-train the models, and without changing the model output distribution. This is accomplished via speculative execution.

    Figure: speculative decoding
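
    A hedged sketch of one speculative step, assuming draft_probs and target_probs return next-token distributions for a given prefix (in a real system the target model scores all draft positions in one parallel forward pass): the draft model proposes gamma tokens cheaply, each is accepted with probability min(1, p/q), and on the first rejection a token is resampled from the residual distribution max(0, p - q), which keeps the output distribution identical to sampling from the target model alone.

    ```python
    import numpy as np

    def speculative_step(prefix, draft_probs, target_probs, gamma=4, rng=np.random.default_rng()):
        # 1. Draft model proposes gamma tokens autoregressively (cheap).
        draft_tokens, q_list, ctx = [], [], list(prefix)
        for _ in range(gamma):
            q = draft_probs(ctx)
            t = rng.choice(len(q), p=q)
            draft_tokens.append(t); q_list.append(q); ctx.append(t)

        # 2. Target model scores all gamma+1 positions (conceptually one parallel pass).
        p_list = [target_probs(list(prefix) + draft_tokens[:i]) for i in range(gamma + 1)]

        # 3. Accept each draft token with prob min(1, p/q); on rejection, resample
        #    from the residual distribution and stop.
        accepted = []
        for i, t in enumerate(draft_tokens):
            p, q = p_list[i], q_list[i]
            if rng.random() < min(1.0, p[t] / q[t]):
                accepted.append(t)
            else:
                residual = np.maximum(p - q, 0.0)
                accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
                return accepted
        # 4. All drafts accepted: take one bonus token from the target distribution.
        accepted.append(rng.choice(len(p_list[-1]), p=p_list[-1]))
        return accepted
    ```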

    [GQA] Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints🔗

    Arxiv: https://arxiv.org/abs/2305.13245 22 May 2023 Google

    Figure: GQA (grouped-query attention)
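
    A minimal sketch of the checkpoint conversion the paper describes (function and argument names are illustrative): the key/value projection heads of an existing multi-head model are mean-pooled within each group, so that every group of query heads shares one key head and one value head; the paper then uptrains the converted model briefly, which is not shown here.

    ```python
    import numpy as np

    def convert_mha_to_gqa(k_heads, v_heads, num_groups):
        """k_heads, v_heads: (num_heads, head_dim, d_model) projection weights."""
        num_heads = k_heads.shape[0]
        assert num_heads % num_groups == 0
        group_size = num_heads // num_groups
        # Mean-pool the K and V heads inside each group -> one shared head per group.
        k_gqa = k_heads.reshape(num_groups, group_size, *k_heads.shape[1:]).mean(axis=1)
        v_gqa = v_heads.reshape(num_groups, group_size, *v_heads.shape[1:]).mean(axis=1)
        return k_gqa, v_gqa   # (num_groups, head_dim, d_model)
    ```

    With num_groups equal to the head count this reduces to ordinary multi-head attention, and with num_groups = 1 it reduces to multi-query attention, so GQA interpolates between the two.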

    [Multi-Heads Sharing] Fast Transformer Decoding: One Write-Head is All You Need🔗

    Arxiv: https://arxiv.org/abs/1911.02150 6 Nov 2019 Google

    Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.

    We introduce multi-query Attention as a variation of multi-head attention as described in [Vaswani et al., 2017]. Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs. Multi-query attention is identical except that the different heads share a single set of keys and values.
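
    A minimal NumPy sketch of multi-query attention (the weight shapes are assumptions for illustration): each head keeps its own query projection, but all heads attend over a single shared key tensor and value tensor, so the K/V cache that must be reloaded at every decoding step shrinks by roughly a factor of the head count.

    ```python
    import numpy as np

    def multi_query_attention(x, Wq, Wk, Wv, num_heads):
        """x: (seq, d_model); Wq: (d_model, num_heads*head_dim); Wk, Wv: (d_model, head_dim)."""
        seq, d_model = x.shape
        head_dim = Wk.shape[1]
        q = (x @ Wq).reshape(seq, num_heads, head_dim)       # per-head queries
        k = x @ Wk                                           # single shared key tensor
        v = x @ Wv                                           # single shared value tensor
        scores = np.einsum('shd,td->hst', q, k) / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # softmax over key positions
        out = np.einsum('hst,td->shd', weights, v)           # every head reads the same K/V
        return out.reshape(seq, num_heads * head_dim)
    ```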

    [MoE] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer🔗

    Arxiv: https://arxiv.org/abs/1701.06538 23 Jan 2017 Google

    The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example.

    We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
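
    A minimal sketch of a sparsely-gated MoE layer (names are illustrative; the noisy gating and load-balancing losses from the paper are omitted): the gating network scores all experts, only the top-k experts are evaluated for each example, and their outputs are mixed with the renormalized gate weights.

    ```python
    import numpy as np

    def moe_layer(x, gate_W, experts, k=2):
        """x: (batch, d); gate_W: (d, num_experts); experts: list of callables mapping (d,) -> (d,)."""
        logits = x @ gate_W                                  # (batch, num_experts) gating scores
        topk = np.argsort(logits, axis=-1)[:, -k:]           # indices of the k highest-scoring experts
        out = np.zeros_like(x)
        for b in range(x.shape[0]):
            chosen = logits[b, topk[b]]
            gates = np.exp(chosen - chosen.max())
            gates /= gates.sum()                             # softmax over the selected experts only
            for g, e in zip(gates, topk[b]):
                out[b] += g * experts[e](x[b])               # only k experts run per example
        return out
    ```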

    [MoE] Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for LLM🔗

    Arxiv: https://arxiv.org/abs/2305.14705 24 May 2023 Google

    Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost.