Meta has unveiled Llama 2, a model trained on 2 trillion tokens with a default context length of 4,096 tokens. The Llama 2-Chat models, fine-tuned using over a million human annotations, are optimized for chat applications. In this article, we explain how to run Llama 2 locally using Ollama.
Training for Llama 2 spanned from January 2023 to July 2023.
Dive into Llama 2
In the examples below, we use the default Llama 2 chat model, which has 7B parameters and functions as a chat/instruct model.
API Guide
First, initiate the Ollama server:
```shell
ollama serve
```
To use the model:
```shell
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
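Note that /api/generate streams its reply as a series of JSON objects by default. If you would rather receive one complete JSON response, the API accepts a stream flag; a minimal sketch:

```shell
# Request a single JSON response instead of the default token-by-token stream
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```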
Command-Line Interface (CLI)
- First, download Ollama.
- Launch the terminal and input: ollama run llama2
Tip: If ‘ollama run’ detects that the model hasn’t been downloaded yet, it will run ‘ollama pull’ automatically. To download the model without starting a session, use: ollama pull llama2.
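Putting the two commands together, a first session might look like the sketch below (the download size depends on which variant you pull):

```shell
# Fetch the default model without starting a session
ollama pull llama2

# Then open an interactive chat prompt with it
ollama run llama2
```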
Memory Considerations
- 7b models: Minimum 8GB RAM
- 13b models: Minimum 16GB RAM
- 70b models: Minimum 64GB RAM
- If a higher-precision quantization is too demanding for your machine, fall back to a q4 variant or close other memory-intensive applications (see the example after this list).
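For example, on a machine with limited RAM you can pin the run to the 4-bit 7B chat build explicitly, using one of the q4_0 tags listed later in this article:

```shell
# Explicitly run the 4-bit quantized 7B chat model (fits in roughly 8GB RAM)
ollama run llama2:7b-chat-q4_0
```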
Variants of the Model
Ollama offers several Llama 2 model variants, quantized and packaged from the official Meta models for efficient local use.
- Chat Variant: Tailored for chat or dialogue purposes. This is the Ollama default and is indicated with the -chat tag in the tags section. Usage: ollama run llama2
- Pre-trained Variant: This is the basic version without any chat-specific fine-tuning and is marked with the -text tag. Usage: ollama run llama2:text
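For instance, the pre-trained variant is suited to raw text completion; the CLI also accepts a one-shot prompt as an argument:

```shell
# One-shot completion with the pre-trained (-text) variant:
# the model continues the prompt rather than answering it as a chat turn
ollama run llama2:text "The three primary colors are"
```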
Ollama uses 4-bit quantization by default. To experiment with other quantization levels, try the other available tags. The digit following ‘q’ indicates the quantization bit count; for example, q4 denotes 4-bit quantization. A higher number means greater model precision, at the cost of slower inference and higher memory usage.
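To check which tags you have downloaded and how much disk space each occupies, Ollama ships a list command. And assuming the library publishes a higher-precision tag such as 7b-chat-q8_0 for this model (check the tags page to confirm), you could trade memory and speed for precision like so:

```shell
# Show locally available models with their tags and on-disk sizes
ollama list

# Hypothetical higher-precision pull: q8_0 means 8-bit quantization,
# larger and slower than the default q4_0 but more precise
ollama pull llama2:7b-chat-q8_0
```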
| Aliases |
|---|
| latest, 7b, 7b-chat, 7b-chat-q4_0, chat |
| 7b-text, 7b-text-q4_0, text |
| 13b, 13b-chat, 13b-chat-q4_0 |
| 13b-text, 13b-text-q4_0 |
| 70b, 70b-chat, 70b-chat-q4_0 |
| 70b-text, 70b-text-q4_0 |
Model Sources
Sources for the chat models on Ollama:
- 7b parameters source: The Bloke
- 7b parameters original source: Meta
- 13b parameters source: The Bloke
- 13b parameters original source: Meta
- 70b parameters source: The Bloke
- 70b parameters original source: Meta