LlamaCloud

Access control over data is a core requirement for any enterprise building LLM applications, and LlamaCloud makes it easy to set up. LlamaCloud lets you natively index ACLs through our data connectors; for instance, we load user- and org-level permissions directly as metadata from SharePoint.

It’s also easy to inject custom metadata through source documents, or to programmatically model each user as a separate index within the same project, with either approach natively integrated with the downstream vector store.

LlamaCloud Client SDK

This tutorial shows you how to use LlamaCloud to build RAG workflows that can support user-level data. For instance, you may want to build a chatbot where each user can upload their own files. Each user should only be able to ask questions over the files they’ve uploaded (and optionally organizational data), but not the files of other users.

We show two approaches to do this:

  1. [Preferred] Create a separate index per user
  2. Create a single index, separate users by metadata.

# allow async SDK calls to run inside the notebook's event loop
import nest_asyncio
nest_asyncio.apply()

!pip install llama-index
!pip install llama-cloud

Setup

Here we setup our environment variables, data, and the client SDK.

import os

os.environ["LLAMA_CLOUD_BASE_URL"] = "https://api.cloud.llamaindex.ai"
os.environ["LLAMA_CLOUD_API_KEY"] = "<LLAMA_CLOUD_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"

Load Data

We download three arXiv papers and pretend that each paper file corresponds to a separate user upload.

urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "data/metagpt.pdf",
    "data/longlora.pdf",
    "data/selfrag.pdf",
]
# make sure the data directory exists, then download each paper
!mkdir -p data
for url, paper in zip(urls, papers):
    !wget "{url}" -O "{paper}"

Define Project Configuration

Specify your project name and project id below.

The pipeline_id and pipeline_name will be programmatically created below.

project_id = "<project_id>"
project_name = "<project_name>"

Setup LlamaCloud Client SDK

Here we define the client, which gives us access to low-level client operations.

from llama_cloud.client import LlamaCloud

client = LlamaCloud(
    token=os.environ["LLAMA_CLOUD_API_KEY"],
    base_url=os.environ["LLAMA_CLOUD_BASE_URL"]
)
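
If you only have the project name, you can also look up the project id via the client. The snippet below is a sketch rather than part of the original walkthrough: it assumes your installed llama-cloud SDK exposes a projects.list_projects() method, so check the exact method name in your version before uncommenting.

# OPTIONAL (sketch, assumes list_projects exists in your SDK version):
# projects = client.projects.list_projects()
# project_id = next(p.id for p in projects if p.name == project_name)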

Inserting Documents For Each User

We use the upload_file capability in the SDK to upload files to LlamaCloud.

We show two ways of supporting per-user data in LlamaCloud (note: they are mutually exclusive).

  1. Create a separate index for each user.
  2. Add all files to the same index, separate them by metadata.
# pretend each user corresponds to a separate paper upload
users = ["jerry", "bob", "alice"]
# upload each file to LlamaCloud (we attach files to pipelines further below)
files = []
for paper in papers:
    with open(paper, 'rb') as f:
        file = client.files.upload_file(upload_file=f, project_id=project_id)
        files.append(file)
import os

from llama_cloud.types import CloudQdrantVectorStore

# NOTE: fill these in if you want to explicitly specify a Qdrant data sink
QDRANT_API_KEY = "<QDRANT_API_KEY>"
QDRANT_URL = "<QDRANT_URL>"

def get_qdrant_sink(collection_name):
    """Create a Qdrant data sink in LlamaCloud for the given collection."""
    ds = {
        'name': 'qdrant',
        'sink_type': 'QDRANT',
        'component': CloudQdrantVectorStore(
            collection_name=collection_name,
            url=QDRANT_URL,
            api_key=QDRANT_API_KEY
        )
    }
    data_sink = client.data_sinks.create_data_sink(request=ds)
    return data_sink
import os

def create_pipeline(index_name, project_id, data_sink=None, transformations=None):
    """Create pipeline."""
    if transformations is None:
        transformations = [
            {
                'configurable_transformation_type': 'SENTENCE_AWARE_NODE_PARSER',
                'component': {
                    'chunk_size': 1024,
                    'chunk_overlap': 20,
                }
            },
            {
                'configurable_transformation_type': 'OPENAI_EMBEDDING',
                'component': {
                    'model_name': 'text-embedding-ada-002',
                    'api_key': os.environ["OPENAI_API_KEY"],
                }
            }
        ]
    pipeline_req = {
        'name': index_name,
        'configured_transformations': transformations,
        'data_sink': data_sink
    }
    pipeline = client.pipelines.upsert_pipeline(project_id=project_id, request=pipeline_req)
    return pipeline

Option 1: Create separate index for each user

We first configure a Qdrant data sink, and then we configure our transformations.

We then create a separate pipeline per user and store the pipelines in an in-memory dict (note: you will likely want to persist the pipeline ids per user; a minimal sketch of this follows the next cell).

from llama_cloud.types import CloudQdrantVectorStore

# create one pipeline (i.e. one index) per user
user_pipeline_dict = {}
for user, paper in zip(users, papers):

    # uncomment if you want to use a managed vector store
    # data_sink = get_qdrant_sink(f"collection_{user}")
    data_sink = None
    pipeline = create_pipeline(f"{user}_index", project_id, data_sink=data_sink)
    user_pipeline_dict[user] = pipeline
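
Since user_pipeline_dict lives only in memory, a simple option is to persist the user-to-pipeline-id mapping (to a file or your application database) so it can be reloaded on restart. This is a minimal sketch with a hypothetical file name, not a LlamaCloud feature:

import json

# persist the user -> pipeline id mapping (hypothetical file name)
user_pipeline_ids = {user: pipeline.id for user, pipeline in user_pipeline_dict.items()}
with open("user_pipelines.json", "w") as f:
    json.dump(user_pipeline_ids, f)

# on restart, reload the mapping instead of recreating the pipelines
# with open("user_pipelines.json") as f:
#     user_pipeline_ids = json.load(f)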

Attach a file to each pipeline

for file, (user, pipeline) in zip(files, user_pipeline_dict.items()):
    print(f"Adding file {file.name} for user {user}")
    pipeline_files = client.pipelines.add_files_to_pipeline(pipeline.id, request=[{'file_id': file.id}])
Adding file metagpt.pdf for user jerry
Adding file longlora.pdf for user bob
Adding file selfrag.pdf for user alice
pipeline_docs = client.pipelines.list_pipeline_documents(user_pipeline_dict["bob"].id)
len(pipeline_docs)
1

Try querying your pipelines

Here we use our framework integration to perform retrieval against our pipelines/indexes.

Since each user corresponds to a separate pipeline, we need to specify that user's index before getting a retriever.

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
from llama_index.llms.openai import OpenAI
import os

# change this user to whatever you prefer
USER = "bob"

llm = OpenAI(model="gpt-4o")
index = LlamaCloudIndex(
  name=user_pipeline_dict[USER].name, 
  project_name=project_name,
  api_key=os.getenv("LLAMA_CLOUD_API_KEY")
)
query_engine = index.as_query_engine(rerank_top_n=2, llm=llm)
response = query_engine.query("Tell me about the abstract of this paper")
print(str(response))
The abstract of the paper presents LongLoRA, an efficient fine-tuning approach designed to extend the context sizes of pre-trained large language models (LLMs) with limited computational cost. It addresses the high computational expense typically associated with training LLMs on long context sizes. The approach speeds up context extension through two main strategies: using sparse local attention during fine-tuning and revisiting parameter-efficient fine-tuning regimes for context expansion. LongLoRA combines these strategies to achieve significant computation savings while maintaining performance. It demonstrates strong empirical results on various tasks with Llama2 models and extends their context sizes significantly without altering their original architectures. The paper also mentions the availability of their code, models, dataset, and demo online.
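
In an application you would typically wrap this routing in a small helper that maps the requesting user to their index. Below is a minimal sketch that reuses user_pipeline_dict, project_name, and llm from above; the helper name is ours, not part of the SDK.

def get_query_engine_for_user(user: str):
    """Build a query engine scoped to a single user's index (Option 1)."""
    index = LlamaCloudIndex(
        name=user_pipeline_dict[user].name,
        project_name=project_name,
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
    )
    return index.as_query_engine(rerank_top_n=2, llm=llm)

# e.g. answer a question using only alice's index
# response = get_query_engine_for_user("alice").query("Tell me about the abstract of this paper")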

Option 2: Add all files to the same index, separate by metadata

The other option is to create a single index, and then add all files to the same index (separated by metadata).

You have two options to create an index:

  1. Through the UI. Make sure to note down the pipeline id.
  2. Through the API.

Creating Pipeline

## Option 1: Through UI. Note down the pipeline_id below
# pipeline_id = "<pipeline_id>"
# pipeline_name = "<pipeline_name>"

## Option 2: Programmatically
data_sink = None
# data_sink = get_qdrant_sink("users_collection")  # uncomment to use a managed vector store
pipeline = create_pipeline("users_index", project_id, data_sink=data_sink)
pipeline_id = pipeline.id
pipeline_name = pipeline.name

Add each file to the pipeline and attach user metadata

for file, user in zip(files, users):
    pipeline_files = client.pipelines.add_files_to_pipeline(pipeline_id, request=[{'file_id': file.id}])
    # update metadata with user info
    pipeline_files = client.pipelines.update_pipeline_file(
        pipeline_id=pipeline_id, file_id=file.id, custom_metadata={ "user": user }
    ) 
pipeline_docs = client.pipelines.list_pipeline_documents(pipeline_id)
len(pipeline_docs)
3

Try querying your pipeline

Here we use our framework integration to perform retrieval against the LlamaCloud API. We can query the pipeline with metadata filters to restrict results to a specific user's data.

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
from llama_index.llms.openai import OpenAI
import os

llm = OpenAI(model="gpt-4o")
index = LlamaCloudIndex(
  name=pipeline_name, 
  project_name=project_name,
  api_key=os.getenv("LLAMA_CLOUD_API_KEY")
)
# Try filtering for a specific user
from llama_index.core.vector_stores import MetadataFilters
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response.notebook_utils import display_source_node

# specify the user you want to filter by here (pass an empty list to search across all users)
filters = MetadataFilters.from_dicts(
    [{"key": "user", "operator": "==", "value": "bob"}]
)
retriever = index.as_retriever(
    rerank_top_n=2, 
    filters=filters
)
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
response = query_engine.query("Tell me about the abstract of this paper")
print(str(response))

for n in response.source_nodes:
    display_source_node(n, source_length=1000, show_source_metadata=True)
The abstract of the paper introduces LongLoRA, an efficient fine-tuning approach designed to extend the context sizes of pre-trained large language models (LLMs) with minimal computational cost. Training LLMs with long context sizes typically requires significant computational resources. LongLoRA addresses this by using sparse local attention during fine-tuning, which reduces computation while maintaining performance similar to dense global attention. Additionally, it combines an improved LoRA with shifted sparse attention (S2-Attn) to achieve context extension efficiently. LongLoRA has demonstrated strong empirical results on various tasks using Llama2 models, extending their context sizes significantly while retaining their original architectures and compatibility with existing techniques. The paper also mentions the availability of their code, models, dataset, and demo on GitHub.

Node ID: fc5bd4bc-11c5-424f-80bd-8b1daf253551
Similarity: 0.89947516
Text: ABSTRACT

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16× computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention (S2-Attn) effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other han…
Metadata: {'user': 'bob', 'file_size': '1168720', 'last_modified_at': '2024-07-06T18:19:04', 'file_path': 'longlora.pdf', 'file_name': 'longlora.pdf', 'pipeline_id': '2532c4c9-f216-445d-8766-e7dfa22397d4'}

Node ID: 011b9560-abdb-4bdc-bafb-9102f7e8d1de
Similarity: 0.85524774
Text: We set the per-device batch size as 1 and gradient accumulation steps as 8, which means that the global batch size equals 64, using 8 GPUs. We train our models for 1000 steps.

Datasets

We use the Redpajama (Computer, 2023) dataset for training. We evaluate the long-sequence language modeling performance of our fine-tuned models on the book corpus dataset PG19 (Rae et al., 2020) and the cleaned Arxiv Math proof-pile dataset (Azerbayev et al., 2022). We use the test split of PG19 (Rae et al., 2020), consisting of 100 documents. For the proof-pile dataset, we also use the test split of it for evaluation. We follow Position Interpolation (Chen et al., 2023) for proof-pile data processing. We evaluate perplexity by using a sliding window approach with S = 256, following (Press et al., 2022).

4.2 MAIN RESULTS

Long-sequence Language Modeling.

In Table 3, we report the perplexity for our models and baseline on proof-pile (Azerbayev et al., 2022) and PG19 datasets. Under certain traini…
Metadata: {'user': 'bob', 'file_size': '1168720', 'last_modified_at': '2024-07-06T18:19:04', 'file_path': 'longlora.pdf', 'file_name': 'longlora.pdf', 'pipeline_id': '2532c4c9-f216-445d-8766-e7dfa22397d4'}
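
As with Option 1, a production application would build the metadata filter from the authenticated user rather than hard-coding it. Below is a minimal sketch that reuses the index and llm defined above; the helper name is ours, not part of the SDK.

def get_user_query_engine(user: str):
    """Build a query engine restricted to a single user's documents (Option 2)."""
    filters = MetadataFilters.from_dicts(
        [{"key": "user", "operator": "==", "value": user}]
    )
    retriever = index.as_retriever(rerank_top_n=2, filters=filters)
    return RetrieverQueryEngine.from_args(retriever, llm=llm)

# e.g. answer a question using only alice's documents
# response = get_user_query_engine("alice").query("Tell me about the abstract of this paper")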
