This article provides examples to help you get started quickly with fine-tuning Llama 2 for domain adaptation and with running inference on the fine-tuned models. For convenience, the examples use the Hugging Face converted versions of the models. This is a high-level guide.
Llama 2 is a new technology that, while promising, carries potential risks, and the evaluations performed to date cannot cover every scenario. The research paper provides a more in-depth look at this; for model download instructions, refer to the Llama 2 repository.
To adapt Llama 2 models to your domain-specific needs, recipes are provided for PEFT, FSDP, and PEFT+FSDP, along with several example datasets. For a more in-depth overview, refer to the LLM Fine-tuning section.
Single & Multi GPU Llama 2 Fine-tuning Approaches
If you want to dive right into single or multi GPU fine-tuning, the examples below can be run on a single GPU such as an A10, T4, V100, or A100. Be sure to adjust the parameters in the examples and the recipes that follow to achieve the desired results for your specific model, technique, dataset, and objective.
Points to Remember:
- To change the dataset used in the following commands, pass the dataset argument, choosing from grammar_dataset, alpaca_dataset, or samsum_dataset (see the example after this list). Dataset.md describes these datasets and explains how to integrate custom datasets. When using grammar_dataset or alpaca_dataset, be sure to follow the recommended setup guidelines.
- By default, the dataset and the LoRA configuration are set to samsum_dataset.
- Make sure the correct model path is set in the training configuration.
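For example, assuming the argument is passed as --dataset, fine-tuning on the grammar dataset with the single GPU command shown later in this guide would look like the following (the model and output paths are placeholders):
python llama_finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_to_model_directory/7B --output_dir Path/destination_to_store/PEFT/model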
Single GPU Setup
On machines equipped with multiple GPUs:
export CUDA_VISIBLE_DEVICES=0
Then run:
python llama_finetuning.py --use_peft --peft_method lora --quantization --model_name /path_to_model_directory/7B --output_dir Path/destination_to_store/PEFT/model
If you are working on a system with several GPUs, make sure only a single one is visible by setting export CUDA_VISIBLE_DEVICES=GPU:id, as shown above. Also adjust the settings in training.py and review the other configuration files as needed.
Multiple GPUs (Single Node Configuration)
Make sure to use PyTorch nightlies when working with PEFT+FSDP. Note that FSDP does not currently support int8 quantization from bitsandbytes.
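A nightly build can be installed with pip; the index URL below assumes a CUDA 11.8 environment and should be adjusted to match your setup:
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118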
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_to_model_directory/7B --pure_bf16 --output_dir Path/destination_to_store/PEFT/model
Also, to speed up Llama 2 fine-tuning, add the use_fast_kernels flag. This enables either Flash Attention or the Xformers memory-efficient kernels, depending on the hardware in use, as in the example below.
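For instance, appended to the multi-GPU PEFT+FSDP command above (paths remain placeholders):
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_to_model_directory/7B --pure_bf16 --output_dir Path/destination_to_store/PEFT/model --use_fast_kernels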
Exclusively Using FSDP for Fine-tuning
To run full-parameter fine-tuning without any PEFT methods:
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --model_name /path_to_model_directory/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels
FSDP Fine-tuning on the Llama 2 70B Model
To fine-tune the large 70B model, the low_cpu_fsdp mode can be enabled as follows. This loads the model on rank 0 only before moving it onto the devices during FSDP setup, which can yield substantial CPU memory savings, especially with larger models.
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_to_model_directory/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
Multi GPU Across Multiple Nodes
sbatch multi_node.slurm
Before running, adjust the total number of nodes and GPUs per node in the script, as sketched below.
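As a rough sketch only, not the exact script shipped with the recipes, a minimal multi_node.slurm could look like the following; the node count, GPU count, and rendezvous port are illustrative values to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=llama2-finetune
#SBATCH --nodes=2                 # total number of nodes
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --gres=gpu:8              # GPUs per node

# srun starts one torchrun per node; torchrun spawns one process per GPU
srun torchrun --nnodes 2 --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID --rdzv_backend c10d --rdzv_endpoint $(hostname):29500 \
    llama_finetuning.py --enable_fsdp --use_peft --peft_method lora \
    --model_name /path_to_model_directory/70B --pure_bf16 \
    --output_dir Path/destination_to_store/PEFT/model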
For a deeper understanding of fine-tuning methodologies, visit this guide.