Llama-4

Ahmad Al-Dahle shared a glimpse into Meta’s massive AI project—training Llama 4 on a cluster with over 100,000 H100 GPUs! This scale is pushing AI boundaries and advancing both product capabilities and open-source contributions.

https://twitter.com/Ahmad_Al_Dahle/status/1851822285377933809

Great to visit one of our data centers where we’re training Llama 4 models on a cluster bigger than 100K H100’s! So proud of the incredible work we’re doing to advance our products, the AI field and the open source community. We’re hiring top researchers to work on reasoning, code generation, and more – come join us!

Meta is hiring top researchers in areas like reasoning, code generation, and more.

Meta’s Llama-4 will give enterprises full customization for the best cost/performance, ownership of their own data, strategic independence, and stability with no hidden changes in production. It will be the primary choice over the long term, and Meta is just getting started.

Llama on device and small models

On-device and small models are a really important part of the Llama herd, so the Meta AI team is introducing quantized versions with significantly increased speed. These models deliver a 2-3x speedup.

Lightweight Llama-4 models coming soon

Ahmad mentioned: We utilize Quantization-Aware Training (QAT) to emulate quantization effects during the training of Llama 4 models, enabling their optimization for low-precision deployment. To initiate QAT, we start with BF16 Llama 4 model checkpoints obtained after supervised fine-tuning (SFT) and conduct an additional full SFT training cycle with QAT enabled. Next, we freeze the QAT model’s backbone and apply another SFT round using low-rank adaptation (LoRA) adapters on all transformer block layers. During this phase, the weights and activations of the LoRA adapters are kept in BF16 format. Since our approach mirrors the principles of QLoRA (quantization followed by LoRA adapters), we refer to it as QLoRA throughout this post.
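
As a rough illustration of the QAT step, here is a minimal sketch using torchao’s QAT quantizer. This is not Meta’s internal training code: the checkpoint path is a placeholder, the training loop is elided, and the quantizer class/module path may differ across torchao versions.

```python
# Minimal QAT sketch with torchao (assumed API; module path may vary by version).
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Start from a BF16 checkpoint obtained after supervised fine-tuning (SFT).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/bf16-sft-checkpoint",   # hypothetical checkpoint
    torch_dtype=torch.bfloat16,
)

# prepare() swaps in "fake quantization" ops so training sees quantization error.
quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)

# ... run the additional full SFT cycle here with QAT enabled ...

# convert() replaces the fake-quantized modules with real low-precision kernels.
model = quantizer.convert(model)
```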

Finally, we fine-tune the entire model (both backbone and LoRA adapters) with direct preference optimization (DPO). This results in a highly efficient model with accuracy close to the BF16 model, while offering speed and memory efficiency similar to other quantization methods.
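
For readers who want to experiment with a DPO stage like this themselves, below is a minimal sketch using the open-source TRL library. This is not Meta’s pipeline; the model and dataset paths are placeholders, and some argument names (e.g. processing_class vs. tokenizer) vary across TRL releases.

```python
# Minimal DPO sketch with Hugging Face TRL (not Meta's internal pipeline).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("path/to/qat-lora-sft-model")  # hypothetical
tokenizer = AutoTokenizer.from_pretrained("path/to/qat-lora-sft-model")

# Preference data with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("path/to/preference-dataset", split="train")  # hypothetical

args = DPOConfig(output_dir="llama-dpo", beta=0.1)  # beta scales the preference margin
trainer = DPOTrainer(
    model=model,                 # a frozen reference copy is created automatically
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # "tokenizer=" in older TRL releases
)
trainer.train()
```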

We employed torchao APIs to implement QAT. Developers can use the resulting QAT model as a foundation and further fine-tune Llama with LoRA for specialized applications, reducing both time and computational resources.
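
To illustrate the “QAT base + LoRA fine-tune” workflow described above, here is a minimal sketch with Hugging Face PEFT. The target module names assume a standard Llama-style transformer block, and the rank/alpha values are illustrative, not the official recipe.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face PEFT (illustrative settings).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/qat-base-model")  # hypothetical

lora_config = LoraConfig(
    r=16,              # adapter rank (illustrative)
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    target_modules=[   # standard Llama-style attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable

# ... run task-specific fine-tuning on `model` as usual ...
```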

Read more about the Llama model family on our blog.

