LongLLaMA

LongLLaMA is built on the foundation of OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method. LongLLaMA Code stands upon the base of Meta's Code Llama.

The team released a smaller 3B base variant (not instruction-tuned) of the LongLLaMA model under a permissive license (Apache 2.0), together with inference code supporting longer contexts on Hugging Face. The released weights can serve as a drop-in replacement for LLaMA in existing implementations, for contexts up to 2048 tokens. They also published evaluation results and comparisons against the original OpenLLaMA models.


LongLLaMA originates from OpenLLaMA and is further fine-tuned with the FoT method, with three of its attention layers used for context extension. Notably, LongLLaMA can extrapolate far beyond the context length it was trained on.

For instance, in passkey retrieval it handles inputs of up to 256k tokens. Meanwhile, LongLLaMA Code emerges from the Code Llama model, also fine-tuned with the FoT method. FoT is a straightforward technique that lets language models manage contexts potentially spanning millions of tokens despite being trained on much shorter inputs. It allows a select subset of attention layers to access a memory cache of (key, value) pairs from earlier context windows, thereby extending the effective context length.
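The core idea can be sketched in a few lines: in a memory layer, queries attend over the concatenation of the local (key, value) pairs and a much larger cache of pairs from earlier windows. The NumPy sketch below is illustrative only, not the actual LongLLaMA implementation; all names and shapes are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention(query, local_k, local_v, mem_k, mem_v):
    """FoT-style attention sketch: queries attend jointly to the local
    keys/values and a memory cache of (key, value) pairs."""
    d = query.shape[-1]
    keys = np.concatenate([mem_k, local_k], axis=0)    # (M + L, d)
    values = np.concatenate([mem_v, local_v], axis=0)  # (M + L, d)
    scores = query @ keys.T / np.sqrt(d)               # (Q, M + L)
    return softmax(scores) @ values                    # (Q, d)

# Toy shapes: the memory cache holds far more pairs than the local window.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(2, d))
local_k, local_v = rng.normal(size=(16, d)), rng.normal(size=(16, d))
mem_k, mem_v = rng.normal(size=(1024, d)), rng.normal(size=(1024, d))
out = memory_attention(q, local_k, local_v, mem_k, mem_v)
print(out.shape)  # (2, 8)
```

Because only selected layers attend to memory, the extra cost grows with the cache size in those layers alone, not across the whole network.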

|                     | LongLLaMA-3B | LongLLaMA-3Bv1.1 | LongLLaMA-Code 7B |
|---------------------|--------------|------------------|-------------------|
| Source model        | OpenLLaMA-3B | OpenLLaMA-3Bv2   | CodeLLaMA-7b-hf   |
| Source model tokens | 1T           | 1T               | 2T + 0.5T         |
| Fine-tuning tokens  | 10B          | 5B               | 35B               |
| Memory layers       | 6, 12, 18    | 6, 12, 18        | 8, 16, 24         |
Getting started with LongLLaMA


pip install --upgrade pip
pip install transformers==4.33.2 sentencepiece accelerate

LongLLaMA additional configuration

LongLLaMA has several other parameters:

  • mem_layers: This parameter dictates which layers have memory. It should either be an empty list or a list comprising all the memory layers mentioned in the checkpoint’s description.
  • mem_dtype: This parameter provides the flexibility to modify the memory cache type.
  • mem_attention_grouping: This parameter trades processing speed against memory consumption. When set to (4, 2048), the memory layers process at most 4 × 2048 queries at once (4 heads, with 2048 queries per head).
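Assuming the LongLLaMA checkpoint published on Hugging Face (the `syzymon/long_llama_3b` identifier below is an assumption), these parameters are passed as keyword arguments to `from_pretrained`. A minimal sketch:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: checkpoint id and parameter values are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b",
    torch_dtype=torch.float32,
    trust_remote_code=True,          # LongLLaMA ships custom model code
    mem_layers=[],                   # empty list, or all memory layers from the checkpoint card
    mem_dtype="bfloat16",            # dtype of the memory cache
    mem_attention_grouping=(4, 2048) # cap simultaneous memory-layer queries
)
```

Setting `mem_layers=[]` disables the memory mechanism entirely, which is useful for comparing against the plain short-context behavior.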

Drop-in use with LLaMA code

The LongLLaMA checkpoints can also serve as direct substitutes for LLaMA checkpoints in Hugging Face's LLaMA implementation. When used this way, however, they are limited to the original context length of 2048 tokens.
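Concretely, drop-in use means loading the weights with the stock LLaMA classes and no custom code. A sketch, again assuming the `syzymon/long_llama_3b` checkpoint id:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Drop-in sketch: no trust_remote_code, no memory layers,
# so the context is capped at the original 2048 tokens.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b")
model = LlamaForCausalLM.from_pretrained(
    "syzymon/long_llama_3b", torch_dtype=torch.float32
)
```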
