Meta introduced the Llama 2 series of large language models, with sizes ranging from 7 billion to 70 billion parameters. The fine-tuned models, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat outperforms open-source chat models on most of the benchmarks tested, and in human evaluations of helpfulness and safety it is on par with well-known proprietary models such as ChatGPT and PaLM.
Llama 2 comes in three parameter sizes (7B, 13B, and 70B) and in both pretrained and fine-tuned variants. The models accept only text as input and generate only text as output.
Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions apply supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align the models with human preferences for helpfulness and safety.
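To make the text-in/text-out, auto-regressive interface concrete, here is a minimal generation sketch using the Hugging Face transformers library with a pretrained checkpoint; the hub ID, dtype, and sampling settings are illustrative assumptions rather than part of this card.

```python
# Minimal text-in/text-out sketch (assumes the Hugging Face `transformers` and
# `accelerate` libraries and the hub ID "meta-llama/Llama-2-7b-hf", a gated repo).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Auto-regressive generation: the model repeatedly predicts the next token.
prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```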
Model | Training Data | Params | Context Length | GQA | Tokens | LR |
---|---|---|---|---|---|---|
Llama 2 | A new mix of publicly available online data | 7B | 4k | ✗ | 2.0T | 3.0 × 10⁻⁴ |
Llama 2 | A new mix of publicly available online data | 13B | 4k | ✗ | 2.0T | 3.0 × 10⁻⁴ |
Llama 2 | A new mix of publicly available online data | 70B | 4k | ✔ | 2.0T | 1.5 × 10⁻⁴ |
Token counts refer to pretraining data only. All models in this series were trained with a global batch size of 4M tokens. The 70B model uses Grouped-Query Attention (GQA) for improved inference scalability.
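Grouped-Query Attention shares each key/value head across a group of query heads, which shrinks the key/value cache that dominates memory during long-context inference. The PyTorch sketch below shows only the grouping step; the head counts are illustrative and do not reflect the 70B model's actual configuration.

```python
# Grouped-Query Attention sketch: n_kv_heads < n_heads, and each K/V head is
# shared by (n_heads // n_kv_heads) query heads. Head counts are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_heads, n_kv_heads = 8, 2            # 4 query heads share each K/V head
group = n_heads // n_kv_heads

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so every query head attends over its group's shared key/value head.
k = k.repeat_interleave(group, dim=1)  # (batch, n_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
attn = F.softmax(scores, dim=-1) @ v   # (batch, n_heads, seq_len, head_dim)
print(attn.shape)
```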
Timeline
Llama 2 underwent training from January to July 2023.
Current Version
This is a static model trained on an offline dataset. As Meta improves model safety with community feedback, future tuned versions will be released.
Licensing
Those interested in a bespoke commercial license can visit: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
Detailed Insights
For an in-depth understanding, refer to the research paper “Llama 2: Open Foundation and Fine-Tuned Chat Models,” available here.
Feedback & Queries
Directions on offering feedback or raising queries about the model are available in the model’s README documentation.
Intended Use
Purpose
Llama 2 is intended for commercial and research use in English. The tuned models are intended for assistant-like chat, while the pretrained models can be adapted for a variety of natural language generation tasks.
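The chat-tuned variants expect prompts wrapped in a specific template. The sketch below assembles a single-turn prompt using the [INST] and <<SYS>> markers from Meta's published reference chat format; the system prompt and user message are only examples.

```python
# Single-turn Llama-2-Chat prompt sketch. The [INST] / <<SYS>> markers follow
# Meta's reference chat format; the system prompt here is illustrative only.
def build_chat_prompt(system_prompt: str, user_message: str) -> str:
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_chat_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Summarize the key differences between Llama 1 and Llama 2.",
)
print(prompt)
```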
Limitations
- Use in any manner that violates applicable laws or regulations, including trade compliance laws.
- Applications in languages other than English.
- Any applications which contradict the terms outlined in the Acceptable Use Policy and Licensing Agreement for Llama 2 are considered inappropriate.
Hardware and Software Information
Training Details
Llama 2 was pretrained using custom training libraries on Meta’s Research Super Cluster and production clusters. Fine-tuning, annotation, and evaluation were performed on third-party cloud compute.
Environmental Impact
Pretraining required a cumulative 3.3M GPU hours of computation on A100-80GB hardware, with a Thermal Design Power (TDP) of 350-400W. Estimated total emissions were approximately 539 tCO2eq, all of which were offset by Meta’s sustainability program.
Model | Time (GPU hours) | Power Consumption (W) | Carbon Emitted (tCO2eq) |
---|---|---|---|
Llama 2 7B | 184320 | 400 | 31.22 |
Llama 2 13B | 368640 | 400 | 62.44 |
Llama 2 70B | 1720320 | 400 | 291.42 |
Total | 3311616 | | 539.00 |
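The figures above follow directly from GPU hours and TDP once a power usage effectiveness (PUE) and a grid carbon intensity are assumed. The sketch below reproduces the table using a PUE of 1.1 and 0.385 kgCO2eq/kWh; neither constant appears in this card, so treat it as an illustration of the arithmetic rather than Meta's exact accounting.

```python
# Reproduce the carbon figures above from GPU hours and TDP.
# PUE (1.1) and carbon intensity (0.385 kgCO2eq/kWh) are assumed constants
# chosen to match the published totals; they are not stated in this card.
PUE = 1.1
CARBON_INTENSITY = 0.385  # kgCO2eq per kWh (assumed)

def tco2eq(gpu_hours: float, tdp_watts: float) -> float:
    energy_kwh = gpu_hours * tdp_watts / 1000.0
    return energy_kwh * PUE * CARBON_INTENSITY / 1000.0  # kg -> tonnes

for name, hours in [("7B", 184_320), ("13B", 368_640), ("70B", 1_720_320)]:
    print(f"Llama 2 {name}: {tco2eq(hours, 400):.2f} tCO2eq")
# -> roughly 31.22, 62.44, and 291.42 tCO2eq
```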
Training Data
Summary
Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.
Recency of Data
The pretraining data has a cutoff of September 2022, but some of the fine-tuning data is more recent, up to July 2023.
Evaluation Results
In this section, we report results for the Llama 1 and Llama 2 models on standard academic benchmarks. All evaluations use our internal evaluation library.
Model | Size | Code | Commonsense Reasoning | World Knowledge | Reading Comprehension | Math | MMLU | BBH | AGI Eval |
---|---|---|---|---|---|---|---|---|---|
Llama 1 | 7B | 14.1 | 60.8 | 46.2 | 58.5 | 6.95 | 35.1 | 30.3 | 23.9 |
Llama 1 | 13B | 18.9 | 66.1 | 52.6 | 62.3 | 10.9 | 46.9 | 37.0 | 33.9 |
Llama 1 | 33B | 26.0 | 70.0 | 58.4 | 67.6 | 21.4 | 57.8 | 39.8 | 41.7 |
Llama 1 | 65B | 30.7 | 70.7 | 60.5 | 68.6 | 30.8 | 63.4 | 43.5 | 47.6 |
Llama 2 | 7B | 16.8 | 63.9 | 48.9 | 61.3 | 14.6 | 45.3 | 32.6 | 29.3 |
Llama 2 | 13B | 24.5 | 66.9 | 55.4 | 65.8 | 28.7 | 54.8 | 39.4 | 39.1 |
Llama 2 | 70B | 37.5 | 71.9 | 63.6 | 69.4 | 35.2 | 68.9 | 51.2 | 54.2 |
Ethical Considerations and Limitations
Llama 2 is a new technology that carries risks with use. Testing to date has been conducted primarily in English and cannot cover every scenario. As with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may at times produce responses that are inaccurate, biased, or otherwise objectionable. Before deploying any application of Llama 2, developers should therefore perform safety testing and tuning tailored to their specific use case.
For a deeper understanding of responsible usage, please refer to the Responsible Use Guide on https://ai.meta.com/llama/responsible-use-guide/.