Llama 2 in DirectML

At this year’s Inspire event, Microsoft announced that developers would be able to run Llama 2 on Windows using DirectML and the ONNX Runtime. That work has now shipped: a sample showcasing Llama 2 7B is available in the GitHub – Microsoft Olive Repository.

The sample begins with an optimization stage using Olive, a tool for optimizing ONNX models. Olive applies ONNX Runtime’s graph-fusion optimizations together with a model architecture tuned for DirectML, improving inference speed by up to tenfold. After optimization, Llama 2 7B runs fast enough for real-time conversation across a range of hardware.

The sample also includes a user-friendly interface for trying out the optimized model. The team thanks the hardware partners whose collaboration made this possible; their documentation has more detail on how DirectML accelerates Llama 2 on partner hardware.
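As a rough sketch of how an Olive-optimized model might be consumed, the snippet below selects the DirectML execution provider when it is available and falls back to CPU otherwise. The model file name is hypothetical, and running the commented-out session creation requires the onnxruntime-directml package on Windows; the actual sample in the Olive repository handles this setup itself.

```python
def pick_providers(available):
    """Prefer the DirectML execution provider, falling back to CPU.

    `available` is the list returned by onnxruntime.get_available_providers().
    """
    preferred = ["DmlExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


# Usage sketch (requires onnxruntime-directml; file name is hypothetical):
# import onnxruntime as ort
# providers = pick_providers(ort.get_available_providers())
# session = ort.InferenceSession("llama2_7b_optimized.onnx", providers=providers)
```

The helper keeps the provider list ordered, so ONNX Runtime tries DirectML first and only falls back to CPU if the DirectML provider cannot be created.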

Getting started

Requesting Access to Llama 2

Before running the Olive optimization in the sample, request access to the Llama 2 weights from Meta.

Driver Recommendations

For optimal performance, updating to the latest drivers is recommended. AMD has introduced optimized graphics drivers for AMD RDNA™ 3 devices, including the AMD Radeon™ RX 7900 Series graphics cards, available from Adrenalin Edition™ 23.11.1 onwards (AMD Support).

Intel has also released updated drivers for Intel Arc A-Series graphics cards; download the latest version from Intel’s driver page.

NVIDIA users with GeForce RTX 20, 30, and 40 Series GPUs can experience these enhancements with the GeForce Game Ready Driver 546.01.
