Run Llama 2 on Groq

Groq is insanely fast, and we’re excited to feature an official integration with LlamaIndex. The @GroqInc LPU is specially designed for LLM generation and currently supports Llama 2 and Mixtral models.

About Groq

Welcome to Groq! 🚀 At Groq, we are proud to introduce the world’s first Language Processing Unit™, or LPU. The LPU has a deterministic, single-core streaming architecture that sets a new benchmark for GenAI inference speed, delivering predictable and repeatable performance for any given workload.

Our software is designed to give developers like you the tools to create powerful, innovative AI applications. With Groq as your engine, you can:

  • Experience unparalleled low latency and performance for real-time AI and HPC inferences 🔥
  • Precisely predict the performance and computation time for any task 🔮
  • Leverage our advanced technology to gain a competitive edge 💪

Interested in more Groq? Visit our website for additional resources and join our Discord community to engage with our developers!

Setup

If you’re accessing this Notebook via Colab, you might need to install LlamaIndex 🦙.

To install, run the following commands:

%pip install llama-index-llms-groq
!pip install llama-index

Then import the Groq LLM class:

from llama_index.llms.groq import Groq

Note: if you import the package without PyTorch, TensorFlow >= 2.0, or Flax installed, you may see a warning that only tokenizers, configurations, and file/data utilities are available. This is expected and does not affect the Groq integration, which calls a hosted API rather than loading models locally.

For API access, create an API key through the Groq console and assign it to the GROQ_API_KEY environment variable:

export GROQ_API_KEY=<your_api_key>
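
If you’re working in a notebook, you can also set the variable from Python before constructing the client; when GROQ_API_KEY is set, the integration picks it up automatically and no api_key argument is needed. A minimal sketch (the placeholder key and the model name are only examples):

import os
from llama_index.llms.groq import Groq

# Set the key in-process if you did not export it in your shell.
# Keep real keys out of source control.
os.environ["GROQ_API_KEY"] = "<your_api_key>"

# With GROQ_API_KEY set in the environment, api_key can be omitted.
llm = Groq(model="mixtral-8x7b-32768")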

Alternatively, you can directly provide your API key when initializing the LLM:

llm = Groq(model="mixtral-8x7b-32768", api_key="your_api_key")

You can find a list of available models in the Groq console documentation.
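
Since this post focuses on Llama 2, here is the same initialization pointed at Groq’s hosted Llama 2 model. The model ID below is the name in use at the time of writing and may change, so check the Groq console docs for the current list:

from llama_index.llms.groq import Groq

# "llama2-70b-4096" was Groq's hosted Llama 2 70B model at the time of writing;
# consult the Groq console docs if the ID has changed.
llm = Groq(model="llama2-70b-4096", api_key="your_api_key")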

Here’s how to use the LLM for a query:

response = llm.complete("Explain the importance of low latency LLMs")
print(response)

Low latency in Large Language Models (LLMs) is crucial for applications requiring swift processing and responses to inputs. Latency, the delay between a request and response, is paramount in real-time or sensitive contexts to ensure seamless user experiences and avoid lag.

For instance, conversational agents or chatbots need to offer prompt replies to maintain engagement and satisfaction. Similarly, real-time language translation or speech recognition demands low latency to provide accurate, timely feedback.

Moreover, low latency LLMs can foster new applications needing quick language input processing, like in autonomous vehicles for voice control, keeping drivers focused on driving.

In essence, low latency LLMs enhance user experience, enable real-time processing, and open up possibilities for innovative applications requiring immediate language input handling.

Call chat with a list of messages

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)

Arr, I be known as Captain Redbeard, the fiercest pirate on the seven seas! But ye can call me Cap’n Redbeard for short. I’m a fearsome pirate with a love for treasure and adventure, and I’m always ready for a good time! Whether I’m swabbin’ the deck or swiggin’ grog, I’m always up for a bit of fun. So hoist the Jolly Roger and let’s set sail for adventure, me hearties!

Streaming

To receive the output incrementally as it is generated, use the stream_complete endpoint:

response = llm.stream_complete("Explain the importance of low latency LLMs")
for r in response:
    print(r.delta, end="")

Each iteration yields a delta containing the newly generated text, so the response is printed as it arrives rather than after the full completion finishes. This is where low latency pays off: users see output immediately, which matters for real-time and interactive applications.

For streaming chat responses:

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")

As with stream_complete, each delta is printed as it arrives, so the pirate’s reply (in this run, a character calling itself Captain Candybeard) unfolds token by token in a natural, conversational manner instead of appearing all at once.
