This article provides examples to help you get started quickly with fine-tuning Llama 2 for domain adaptation and with running inference on the fine-tuned models. For convenience, the examples use the Hugging Face converted versions of the models. This is a high-level guide.
Llama 2 is a new technology that, while promising, carries potential risks, and the evaluations performed to date cannot cover every scenario. The research paper provides a more in-depth look at this; for model download instructions, refer to the Llama 2 repository.
To adapt Llama 2 models to your domain-specific needs, recipes are provided for PEFT, FSDP, and PEFT+FSDP, along with several example datasets. For a more in-depth overview, refer to the LLM Fine-tuning section.
Single & Multi GPU Llama 2 Fine-tuning Approaches
If you want to dive right into single or multi GPU fine-tuning, the examples below can be run on a single GPU such as an A10, T4, V100, or A100. Be sure to adjust the parameters in the examples and the recipes that follow to achieve the desired results for your specific model, technique, dataset, and objective.
Points to Remember:
- To change the dataset used in the following commands, pass the dataset argument, choosing from grammar_dataset, alpaca_dataset, or samsum_dataset (see the example after this list). Dataset.md describes these datasets and explains how to integrate custom datasets. When using grammar_dataset or alpaca_dataset, be sure to follow the recommended setup guidelines.
- By default, the dataset and the LoRA configuration are set to samsum_dataset.
- Make sure the correct model path is set in the training configuration.
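For example, assuming the argument is passed as --dataset, fine-tuning on the grammar dataset with the single GPU command shown later in this guide would look like the following (the model and output paths are placeholders):
python llama_finetuning.py --use_peft --peft_method lora --quantization --dataset grammar_dataset --model_name /path_to_model_directory/7B --output_dir Path/destination_to_store/PEFT/model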
Single GPU Setup
On machines equipped with multiple GPUs:
export CUDA_VISIBLE_DEVICES=0
Then run:
python llama_finetuning.py --use_peft --peft_method lora --quantization --model_name /path_to_model_directory/7B --output_dir Path/destination_to_store/PEFT/model
If you are working on a system with several GPUs, make sure only a single one is visible by setting export CUDA_VISIBLE_DEVICES=GPU:id, as shown above. Also adjust the settings in training.py and review the other configuration files as needed.
Multiple GPUs (Single Node Configuration)
Make sure to use PyTorch nightlies when working with PEFT+FSDP. Note that FSDP does not currently support int8 quantization from bitsandbytes.
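A nightly build can be installed with pip; the index URL below assumes a CUDA 11.8 environment and should be adjusted to match your setup:
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118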
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_to_model_directory/7B --pure_bf16 --output_dir Path/destination_to_store/PEFT/model
Also, to speed up Llama 2 fine-tuning, add the use_fast_kernels flag. This enables either Flash Attention or the Xformers memory-efficient kernels, depending on the hardware in use, as in the example below.
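For instance, appended to the multi-GPU PEFT+FSDP command above (paths remain placeholders):
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_to_model_directory/7B --pure_bf16 --output_dir Path/destination_to_store/PEFT/model --use_fast_kernels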
Exclusively Using FSDP for Fine-tuning
To run full-parameter fine-tuning without any PEFT methods:
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --model_name /path_to_model_directory/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --use_fast_kernels
FSDP Fine-tuning on the Llama 2 70B Model
To fine-tune the large 70B model, the low_cpu_fsdp mode can be enabled as follows. This loads the model on rank 0 only before moving it onto the devices during FSDP setup, which can yield substantial CPU memory savings, especially with larger models.
torchrun --nnodes 1 --nproc_per_node 8 llama_finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_to_model_directory/70B --batch_size_training 1 --micro_batch_size 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
Multi GPU Across Multiple Nodes
sbatch multi_node.slurm
Before running, adjust the total number of nodes and GPUs per node in the script, as sketched below.
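As a rough sketch only, not the exact script shipped with the recipes, a minimal multi_node.slurm could look like the following; the node count, GPU count, and rendezvous port are illustrative values to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=llama2-finetune
#SBATCH --nodes=2                 # total number of nodes
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --gres=gpu:8              # GPUs per node

# srun starts one torchrun per node; torchrun spawns one process per GPU
srun torchrun --nnodes 2 --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID --rdzv_backend c10d --rdzv_endpoint $(hostname):29500 \
    llama_finetuning.py --enable_fsdp --use_peft --peft_method lora \
    --model_name /path_to_model_directory/70B --pure_bf16 \
    --output_dir Path/destination_to_store/PEFT/model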
For a deeper understanding of fine-tuning methodologies, visit this guide.