Llama 2 on Hugging Face

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository is for the 7B version. Links to the other models can be found in the index below.

Details on the Model

Please note: Use of this model is governed by the Meta license. To download the model weights and tokenizer, visit Meta’s website, accept the License, and then request access here on Hugging Face.
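Once access has been granted, the checkpoint can be loaded with the Hugging Face transformers library. The following is a minimal sketch; the repository ID matches this model card, while the dtype and device settings are illustrative choices:

```python
# Minimal sketch: load the gated Llama 2 7B checkpoint with transformers.
# Assumes access has been granted and `huggingface-cli login` has been run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # the pretrained 7B variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # spread layers across available devices
)
```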

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Meta also fine-tuned these LLMs for dialogue use cases, releasing them as Llama-2-Chat. These fine-tuned models outperform most open-source chat models on the benchmarks Meta tested, and in Meta’s human evaluations of helpfulness and safety they are on par with well-known closed-source models such as ChatGPT and PaLM.

The Minds Behind the Model: Meta

Diverse Offerings: Llama 2 comes in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations.

Input: The models only accept text.

Output: They only produce text.

Technological Blueprint

Llama 2 is an auto-regressive language model built on an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align the models with human preferences for helpfulness and safety.
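To make the auto-regressive part concrete, the sketch below (reusing the model and tokenizer loaded above) performs greedy decoding by hand: at each step the model scores every vocabulary item given the tokens so far, and the most likely one is appended:

```python
# Hand-rolled greedy decoding loop, illustrating auto-regressive generation.
# In practice model.generate() does this (plus sampling, caching, etc.).
import torch

prompt = "Grouped-query attention is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

for _ in range(32):  # generate up to 32 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits        # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()            # greedy: most likely next token
    if next_id.item() == tokenizer.eos_token_id:
        break                                   # stop at end-of-sequence
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```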

| Model | Training Data | Params | Context Length | GQA | Tokens | LR |
|---|---|---|---|---|---|---|
| Llama 2 | A new mix of publicly available online data | 7B | 4k | ✗ | 2.0T | 3.0 × 10⁻⁴ |
| Llama 2 | A new mix of publicly available online data | 13B | 4k | ✗ | 2.0T | 3.0 × 10⁻⁴ |
| Llama 2 | A new mix of publicly available online data | 70B | 4k | ✓ | 2.0T | 1.5 × 10⁻⁴ |
Llama 2 family of models. Token counts refer to pretraining data only. All models are trained with a global batch size of 4M tokens. Bigger models (70B) use Grouped-Query Attention (GQA) for improved inference scalability.

Training Period: Llama 2 was trained between January 2023 and July 2023.

Current State: This is a static model trained on an offline dataset. As Meta improves model safety with community feedback, future versions of the tuned models will be released.

Licensing: For a specialized commercial license, visit: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Associated Study: “Llama 2: Open Foundation and Fine-Tuned Chat Models”

Intended Use

Primary Applications: Llama 2 is designed for both commercial and research applications, specifically in English. The fine-tuned models are crafted for chatbot-like interactions, while the pretrained ones offer flexibility for diverse natural language generation activities.

To get the expected features and performance of the chat versions, a specific formatting must be followed, including the [INST] and <<SYS>> tags, the BOS and EOS tokens, and the whitespace and line breaks in between (calling strip() on inputs is recommended to avoid double spaces). For details, see Meta’s reference code on GitHub: chat_completion.
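As an illustration, here is a sketch of that layout for a single-turn prompt, adapted from the format used in Meta’s chat_completion reference code; the helper name build_prompt is ours, not part of any library, and the tokenizer is expected to add the BOS token itself:

```python
# Sketch of the single-turn Llama-2-Chat prompt layout described above.
# The system message is wrapped in <<SYS>> tags inside the first [INST] block.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(system_msg: str, user_msg: str) -> str:
    # strip() guards against stray whitespace, as the format requires.
    return f"{B_INST} {B_SYS}{system_msg.strip()}{E_SYS}{user_msg.strip()} {E_INST}"

prompt = build_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Explain grouped-query attention in one sentence.",
)
print(prompt)
```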

Activities Beyond Scope: Users should abstain from deploying Llama 2 in ways that breach relevant laws or guidelines, such as trade compliance regulations. Additionally, Llama 2 shouldn’t be utilized for non-English languages or any applications outside the stipulations of the Acceptable Use Policy and the Licensing Agreement pertaining to Llama 2.

Training Data

Summary: Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

Data Recency: The pretraining data has a cutoff of September 2022, but some of the fine-tuning data is more recent, extending up to July 2023.

Assessment Outcomes

This section reports the results for Llama 1 and Llama 2 on standard academic benchmarks. For these evaluations, Meta used its internal evaluations library.

| Model | Size | Code | Commonsense Reasoning | World Knowledge | Reading Comprehension | Math | MMLU | BBH | AGI Eval |
|---|---|---|---|---|---|---|---|---|---|
| Llama 1 | 7B | 14.1 | 60.8 | 46.2 | 58.5 | 6.95 | 35.1 | 30.3 | 23.9 |
| Llama 1 | 13B | 18.9 | 66.1 | 52.6 | 62.3 | 10.9 | 46.9 | 37.0 | 33.9 |
| Llama 1 | 33B | 26.0 | 70.0 | 58.4 | 67.6 | 21.4 | 57.8 | 39.8 | 41.7 |
| Llama 1 | 65B | 30.7 | 70.7 | 60.5 | 68.6 | 30.8 | 63.4 | 43.5 | 47.6 |
| Llama 2 | 7B | 16.8 | 63.9 | 48.9 | 61.3 | 14.6 | 45.3 | 32.6 | 29.3 |
| Llama 2 | 13B | 24.5 | 66.9 | 55.4 | 65.8 | 28.7 | 54.8 | 39.4 | 39.1 |
| Llama 2 | 70B | 37.5 | 71.9 | 63.6 | 69.4 | 35.2 | 68.9 | 51.2 | 54.2 |
Overall performance on consolidated academic benchmarks. Code: Meta reports the average pass@1 scores of the models on HumanEval and MBPP. Commonsense Reasoning: Meta reports the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA (7-shot results for CommonsenseQA, 0-shot for all other benchmarks). World Knowledge: Meta evaluates the 5-shot performance on NaturalQuestions and TriviaQA and reports the average. Reading Comprehension: Meta reports the 0-shot average on SQuAD, QuAC, and BoolQ. Math: Meta reports the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks at top 1.

Safety

| Model | Size | TruthfulQA | ToxiGen |
|---|---|---|---|
| Llama 1 | 7B | 27.42 | 23.00 |
| Llama 1 | 13B | 41.74 | 23.08 |
| Llama 1 | 33B | 44.19 | 22.57 |
| Llama 1 | 65B | 48.71 | 21.77 |
| Llama 2 | 7B | 33.29 | 21.25 |
| Llama 2 | 13B | 41.86 | 26.10 |
| Llama 2 | 70B | 50.18 | 24.60 |
Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, Meta presents the percentage of generations that are both truthful and informative (higher is better). For ToxiGen, it presents the percentage of toxic generations (smaller is better).
| Model | Size | TruthfulQA | ToxiGen |
|---|---|---|---|
| Llama-2-Chat | 7B | 57.04 | 0.00 |
| Llama-2-Chat | 13B | 62.18 | 0.00 |
| Llama-2-Chat | 70B | 64.14 | 0.01 |
Evaluation of fine-tuned LLMs on different safety datasets. Same metric definitions as above.

Ethical Aspects and Constraints

Llama 2 represents an emerging technology, and its utilization is not without challenges. All evaluations conducted so far have been centered around English, and it’s impossible to encompass every conceivable situation. Given this, like other LLMs, the responses Llama 2 generates are not always foreseeable.

There may be instances where it offers incorrect, prejudiced, or potentially controversial answers to user queries. Consequently, developers aiming to integrate Llama 2 should first conduct rigorous safety assessments and make necessary adjustments, ensuring the model aligns with their specific application requirements.
