WordLlama

WordLlama is a lightweight NLP and word-embedding utility that repurposes components from large language models (LLMs) to produce efficient, compact word representations in the spirit of GloVe, Word2Vec, or FastText. It starts by extracting the token embedding codebook from a state-of-the-art LLM (e.g., LLaMA3 70B) and then trains a small, context-free model within a general-purpose embedding framework.

WordLlama outperforms other word models such as GloVe 300d across MTEB benchmarks while being significantly smaller: the default 256-dimension model is 16MB, versus over 2GB for GloVe.

Key Features

Key features of WordLlama include:

  • Matryoshka Representations: Flexible truncation of the embedding dimensions as needed.
  • Low Resource Requirements: Simple token lookup with average pooling allows it to run efficiently on CPUs.
  • Binarization: Models trained with the straight-through estimator can be packed into small integer arrays, enabling faster Hamming distance calculations (coming soon); see the sketch after this list.
  • Numpy-only Inference: Lightweight and simple implementation.
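
To make the binarized path concrete, here is a plain-NumPy sketch (not the WordLlama API) of sign binarization, bit packing, and Hamming distance comparison:

```python
import numpy as np

# Illustration only: sign-binarize embeddings, pack them into bytes,
# and compare packed vectors with Hamming distance.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 512))      # four 512-d embeddings

bits = (emb > 0).astype(np.uint8)        # sign binarization (STE is a training-time trick)
packed = np.packbits(bits, axis=1)       # 512 bits -> 64 bytes per vector

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count differing bits between two packed bit vectors."""
    return int(np.unpackbits(a ^ b).sum())

print(hamming(packed[0], packed[1]))     # smaller distance = more similar
```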

WordLlama uses Matryoshka representation learning to offer flexibility. The largest model (1024 dimensions) can be truncated to 64, 128, 256, or 512 dimensions. For binary embedding models, straight-through estimators are employed during training. Dense embeddings perform well with 256 dimensions, while binary embeddings see validation accuracy nearing saturation at 512 dimensions (64 bytes packed).
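
A minimal usage sketch, assuming the `WordLlama.load` entry point accepts a `trunc_dim` argument as described in the project documentation:

```python
from wordllama import WordLlama

# Load the default model truncated to 256 dimensions
# (trunc_dim is assumed to accept any of 64, 128, 256, 512, or 1024).
wl = WordLlama.load(trunc_dim=256)

embeddings = wl.embed(["the quick brown fox", "the lazy dog"])
print(embeddings.shape)  # (2, 256)
```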

The final weights are saved after weighting, projection, and truncation of the entire tokenizer vocabulary, reducing the model to a single embedding matrix (nn.Embedding) that is far smaller than the gigabyte-scale LLM codebooks. The original tokenizer preprocesses text into tokens, and the reduced-size embeddings are average pooled, requiring minimal computation. Resulting model sizes range from 16MB to 250MB for the 128k LLaMA3 vocabulary.
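
Conceptually, inference reduces to an embedding lookup followed by a mean over tokens. A self-contained NumPy sketch (vocabulary size, weights, and token ids below are toy values, not the real model):

```python
import numpy as np

# Toy embedding matrix; the real model uses the ~128k LLaMA3 vocabulary.
vocab_size, dim = 1_000, 256
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, dim), dtype=np.float32)

def embed(token_ids: list[int]) -> np.ndarray:
    """Look up each token's vector and average-pool into one embedding."""
    vectors = embedding_matrix[token_ids]  # (num_tokens, dim)
    return vectors.mean(axis=0)            # (dim,)

sentence_embedding = embed([17, 42, 7])    # hypothetical token ids
print(sentence_embedding.shape)            # (256,)
```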

WordLlama is ideal for lightweight NLP tasks such as training sklearn classifiers, performing semantic matching, fuzzy deduplication, ranking, and clustering. It’s well-suited for creating LLM output evaluators or multi-hop workflows, and with its compact and portable design, you can train models on consumer GPUs in just a few hours. Its speed and versatility make it a great “Swiss Army knife” for exploratory analysis and utility applications.
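
For example, a lightweight sklearn classifier over WordLlama embeddings takes only a few lines (toy data below; `wl.embed` is assumed to return a NumPy array, as in the sketch above):

```python
from sklearn.linear_model import LogisticRegression
from wordllama import WordLlama

wl = WordLlama.load()

# Toy sentiment data for illustration.
texts = ["great product, would buy again", "terrible, waste of money",
         "loved it", "awful experience"]
labels = [1, 0, 1, 0]

clf = LogisticRegression().fit(wl.embed(texts), labels)
print(clf.predict(wl.embed(["really enjoyed this"])))  # e.g., [1]
```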

MTEB Results (l2_supercat)

| Metric              | WL64  | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---------------------|-------|-------|-----------|-------|--------|------------|----------|------------------|
| Clustering          | 30.27 | 32.20 | 33.25     | 33.40 | 33.62  | 27.73      | 26.57    | 42.35            |
| Reranking           | 50.38 | 51.52 | 52.03     | 52.32 | 52.39  | 43.29      | 44.75    | 58.04            |
| Classification      | 53.14 | 56.25 | 58.21     | 59.13 | 59.50  | 57.29      | 57.65    | 63.05            |
| Pair Classification | 75.80 | 77.59 | 78.22     | 78.50 | 78.60  | 70.92      | 72.94    | 82.37            |
| STS                 | 66.24 | 67.53 | 67.91     | 68.22 | 68.27  | 61.85      | 62.46    | 78.90            |
| CQA DupStack        | 18.76 | 22.54 | 24.12     | 24.59 | 24.83  | 15.47      | 16.79    | 41.32            |
| SummEval            | 30.79 | 29.99 | 30.99     | 29.56 | 29.39  | 28.87      | 30.49    | 30.81            |

License

This project is licensed under the MIT License.

