Llama 2 on Hugging Face

Llama 2 encompasses a series of generative text models that have been both pretrained and fine-tuned, with sizes varying from 7 billion up to 70 billion parameters. This specific repository is dedicated to the 7B version. For other model links, please refer to the index provided below.

Details on the Model

Please note: Use of this model is governed by the Meta license. To download the model weights and tokenizer, first visit Meta's website and accept the License, then request access on Hugging Face.

Meta has developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most of the benchmarks in our evaluations, and in human evaluations for helpfulness and safety they are on par with some popular closed-source models such as ChatGPT and PaLM.

The Minds Behind the Model: Meta

Diverse Offerings: Llama 2 comes in three parameter sizes, 7B, 13B, and 70B, each in pretrained and fine-tuned versions.

Input: The models only accept text.

Output: They only produce text.

Technological Blueprint

Llama 2 is an auto-regressive language model built on an optimized transformer architecture. The fine-tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety.
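To make "auto-regressive" concrete, here is a minimal sketch of greedy decoding: the model scores every vocabulary entry given the tokens so far, the highest-scoring token is appended, and the loop repeats. The vocabulary and the `toy_logits` function are hypothetical stand-ins for a trained network and are not part of Llama 2.

```python
import math

# Hypothetical four-token vocabulary, for illustration only.
VOCAB = ["<eos>", "the", "cat", "sat"]


def toy_logits(tokens):
    """Stand-in for a trained model: one score per vocabulary entry.

    Hypothetical rule: favor the id that follows the last token, wrapping
    around to <eos> (id 0).
    """
    preferred = (tokens[-1] + 1) % len(VOCAB)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_generate(prompt_ids, max_new_tokens=8, eos_id=0):
    """Append the most likely next token until <eos> or the length limit."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


if __name__ == "__main__":
    ids = greedy_generate([1])
    print(" ".join(VOCAB[i] for i in ids))
```

Each generated token becomes part of the input for the next step, which is why generation cost grows with output length. Real deployments usually sample from the distribution rather than always taking the maximum.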

| Model | Training Data | Params | Context Length | GQA | Tokens | LR |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 | A new mix of publicly available online data | 7B | 4k | No | 2.0T | 3.0 x 10^-4 |
| Llama 2 | A new mix of publicly available online data | 13B | 4k | No | 2.0T | 3.0 x 10^-4 |
| Llama 2 | A new mix of publicly available online data | 70B | 4k | Yes | 2.0T | 1.5 x 10^-4 |

Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch size of 4M tokens. The largest model (70B) uses Grouped-Query Attention (GQA) for improved inference scalability.
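As a rough back-of-the-envelope check on the figures above, the token count and global batch size imply an approximate number of optimizer steps. This is only a sketch: it assumes "4M" means exactly 4,000,000 tokens, "4k" means 4,096 tokens, every batch is full, and the data is seen once. None of these details are stated in the source.

```python
# Figures from the table and caption; the exact readings are assumptions.
PRETRAINING_TOKENS = 2_000_000_000_000  # "2.0T"
GLOBAL_BATCH_TOKENS = 4_000_000         # "4M", assumed decimal
CONTEXT_LENGTH = 4_096                  # "4k", assumed 4,096


def optimizer_steps(total_tokens, batch_tokens):
    """Approximate number of parameter updates for one pass over the data."""
    return total_tokens // batch_tokens


def sequences_per_batch(batch_tokens, context_length):
    """Approximate number of full-length sequences in one global batch."""
    return batch_tokens / context_length


if __name__ == "__main__":
    print(optimizer_steps(PRETRAINING_TOKENS, GLOBAL_BATCH_TOKENS))
    print(round(sequences_per_batch(GLOBAL_BATCH_TOKENS, CONTEXT_LENGTH)))
```

Under these assumptions the run works out to about half a million steps, each covering roughly a thousand full-length sequences.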

Training Period: Llama 2 underwent training from January 2023 through to July 2023.

Current State: This is a static model trained on an offline dataset. Updated versions of the fine-tuned models will be released as we improve model safety with community feedback.

Licensing: For a specialized commercial license, visit: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Associated Study: “Llama 2: Open Foundation and Fine-Tuned Chat Models”

Intended Use

Primary Applications: Llama 2 is designed for both commercial and research applications, specifically in English. The fine-tuned models are crafted for chatbot-like interactions, while the pretrained ones offer flexibility for diverse natural language generation activities.

To get the expected behavior from the chat versions, a specific prompt format must be followed. It includes the INST and <<SYS>> tags and the BOS and EOS tokens, and it is sensitive to the whitespace and line breaks in between. Calling strip() on inputs is recommended to avoid double spaces. For details, see Meta's reference code on GitHub: chat_completion.
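The format described above can be sketched as a small prompt builder. This is a minimal illustration, not the reference implementation: the closing tags `[/INST]` and `<</SYS>>`, the exact spacing, and the helper name `build_chat_prompt` are assumptions based on the commonly documented Llama 2 chat layout, and the literal `<s>` and `</s>` strings merely stand in for the BOS and EOS tokens that a real tokenizer adds as special token ids.

```python
B_INST, E_INST = "[INST]", "[/INST]"          # instruction tags (closing tag assumed)
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"  # system tags (closing tag assumed)
BOS, EOS = "<s>", "</s>"                      # textual stand-ins for special tokens


def build_chat_prompt(system, turns):
    """Build a chat prompt string (hypothetical helper).

    system: system message, or "" for none.
    turns:  list of (user, answer) pairs; only the final answer may be
            None, meaning the model should produce it.
    """
    parts = []
    for i, (user, answer) in enumerate(turns):
        if answer is None and i != len(turns) - 1:
            raise ValueError("only the final turn may lack an answer")
        user = user.strip()  # strip() avoids doubled spaces around the tags
        if i == 0 and system:
            # The system message is folded into the first user message.
            user = B_SYS + system.strip() + E_SYS + user
        if answer is None:
            parts.append(f"{BOS}{B_INST} {user} {E_INST}")
        else:
            parts.append(f"{BOS}{B_INST} {user} {E_INST} {answer.strip()} {EOS}")
    return "".join(parts)


if __name__ == "__main__":
    print(build_chat_prompt("Be brief.", [("  What is Llama 2?  ", None)]))
```

In practice it is safer to rely on the tokenizer or the reference code to insert the special tokens, because a tokenizer may encode a literal `<s>` typed into a string as ordinary text.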

Activities Beyond Scope: Users should abstain from deploying Llama 2 in ways that breach relevant laws or guidelines, such as trade compliance regulations. Additionally, Llama 2 shouldn’t be utilized for non-English languages or any applications outside the stipulations of the Acceptable Use Policy and the Licensing Agreement pertaining to Llama 2.

Training Data

Summary: Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets and over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Data Recency: The pretraining data has a cutoff of September 2022, while some of the fine-tuning data is more recent, extending to July 2023.

Assessment Outcomes

This section reports the results of the Llama 1 and Llama 2 models on standard academic benchmarks. All evaluations use Meta's internal evaluations library.

| Model | Size | Code | Commonsense Reasoning | World Knowledge | Reading Comprehension | Math | MMLU | BBH | AGI Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 1 | 7B | 14.1 | 60.8 | 46.2 | 58.5 | 6.95 | 35.1 | 30.3 | 23.9 |
| Llama 1 | 13B | 18.9 | 66.1 | 52.6 | 62.3 | 10.9 | 46.9 | 37.0 | 33.9 |
| Llama 1 | 33B | 26.0 | 70.0 | 58.4 | 67.6 | 21.4 | 57.8 | 39.8 | 41.7 |
| Llama 1 | 65B | 30.7 | 70.7 | 60.5 | 68.6 | 30.8 | 63.4 | 43.5 | 47.6 |
| Llama 2 | 7B | 16.8 | 63.9 | 48.9 | 61.3 | 14.6 | 45.3 | 32.6 | 29.3 |
| Llama 2 | 13B | 24.5 | 66.9 | 55.4 | 65.8 | 28.7 | 54.8 | 39.4 | 39.1 |
| Llama 2 | 70B | 37.5 | 71.9 | 63.6 | 69.4 | 35.2 | 68.9 | 51.2 | 54.2 |

Overall performance on grouped academic benchmarks. Code: we report the average pass@1 scores of the models on HumanEval and MBPP. Commonsense Reasoning: we report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA, using 7-shot results for CommonsenseQA and 0-shot results for all other benchmarks. World Knowledge: we evaluate 5-shot performance on NaturalQuestions and TriviaQA and report the average. Reading Comprehension: we report the 0-shot average on SQuAD, QuAC, and BoolQ. Math: we report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.
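The Code column averages pass@1 over two benchmarks. The source does not say how pass@1 was estimated, so the sketch below uses the widely used unbiased pass@k estimator introduced alongside HumanEval, which is an assumption here, and then averages two benchmark scores. The sample counts and the per-benchmark numbers are invented for illustration and are not results from the table.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem):
    """Mean pass@k over all problems in a benchmark, as a percentage."""
    return 100.0 * sum(per_problem) / len(per_problem)


def code_column(humaneval_score, mbpp_score):
    """Average of the two benchmark scores, as described in the caption."""
    return (humaneval_score + mbpp_score) / 2.0


if __name__ == "__main__":
    # Hypothetical (n, c) counts for three problems in one benchmark.
    problems = [(10, 3), (10, 0), (10, 10)]
    score = benchmark_score([pass_at_k(n, c, 1) for n, c in problems])
    print(round(score, 2))
    print(code_column(12.0, 20.0))  # made-up benchmark scores
```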


| Model | Size | TruthfulQA | ToxiGen |
| --- | --- | --- | --- |
| Llama 1 | 7B | 27.42 | 23.00 |
| Llama 1 | 13B | 41.74 | 23.08 |
| Llama 1 | 33B | 44.19 | 22.57 |
| Llama 1 | 65B | 48.71 | 21.77 |
| Llama 2 | 7B | 33.29 | 21.25 |
| Llama 2 | 13B | 41.86 | 26.10 |
| Llama 2 | 70B | 50.18 | 24.60 |

Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (higher is better). For ToxiGen, we present the percentage of toxic generations (lower is better).

Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above.

Ethical Aspects and Constraints

Llama 2 is a new technology, and its use carries risks. Testing to date has been conducted in English and has not covered, nor could it cover, every scenario. For these reasons, as with all LLMs, the outputs of Llama 2 cannot be predicted in advance.

In some instances the model may produce inaccurate, biased, or otherwise objectionable responses to user prompts. Developers should therefore perform safety testing and tuning tailored to their specific application before deploying anything built on Llama 2.
