Llemma LLM

Llemma is a language model for mathematics. It was initialized with the weights of Code Llama 7B and further trained on Proof-Pile-2 for 200B tokens. A 34B-parameter variant, Llemma 34B, is also available.

Performance Insights

Llemma models excel at sequential mathematical reasoning and at using computational tools, such as Python interpreters and formal theorem provers.

Sequential Mathematical Analysis

On tasks requiring sequential mathematical reasoning, the Llemma models outperform Llama 2 and Code Llama, and they also surpass Minerva when compared at similar model sizes.

| Model      | Size | GSM8k | OCW   | MMLU-STEM | SAT   | MATH  |
|------------|------|-------|-------|-----------|-------|-------|
| Llama 2    | 7B   | 11.8% | 3.7%  | 29.9%     | 25%   | 3.2%  |
| Code Llama | 7B   | 10.5% | 4.4%  | 25.1%     | 9.4%  | 4.5%  |
| LLEMMA     | 7B   | 36.4% | 7.7%  | 37.7%     | 53.1% | 18.0% |
| Minerva    | 8B   | 16.2% | 7.7%  | 35.6%     | n/a   | 14.1% |
| Code Llama | 34B  | 29.6% | 7.0%  | 40.5%     | 40.6% | 12.2% |
| LLEMMA     | 34B  | 51.5% | 11.8% | 49.0%     | 71.9% | 25.0% |
| Minerva    | 62B  | 52.4% | 12.0% | 53.9%     | n/a   | 27.6% |
| Minerva    | 540B | 58.8% | 17.6% | 63.9%     | n/a   | 33.6% |

Performance improves further with majority voting, where the most common answer among many sampled solutions is taken as the final answer:

| Model   | Size | GSM8k maj@100 | OCW maj@100 | MMLU-STEM maj@16 | SAT maj@16 | MATH maj@256 |
|---------|------|---------------|-------------|------------------|------------|--------------|
| LLEMMA  | 7B   | 54.0%         | 14.3%       | 49.9%            | 78.1%      | 33.5%        |
| Minerva | 8B   | 28.4%         | 12.5%       | 43.4%            | n/a        | 25.4%        |
| LLEMMA  | 34B  | 69.3%         | 18.4%       | 59.7%            | 81.3%      | 43.1%        |
| Minerva | 62B  | 68.5%         | 23.5%       | 63.5%            | n/a        | 43.4%        |
| Minerva | 540B | 78.5%         | 30.8%       | 75.0%            | n/a        | 50.3%        |
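The maj@k aggregation above can be sketched in a few lines: sample k solutions, extract each final answer, and return the most frequent one. This is a minimal illustration, not the paper's evaluation harness; the sample answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among k sampled solutions (maj@k).

    `answers` holds the final answer extracted from each sampled solution;
    None marks a sample where no answer could be parsed.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from 8 sampled chain-of-thought solutions:
samples = ["12", "12", "15", "12", None, "12", "8", "12"]
print(majority_vote(samples))  # prints 12, the most frequent answer
```

Because correct answers tend to recur across samples while errors scatter, voting over more samples (maj@100, maj@256) lifts accuracy well above single-sample decoding.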

Tool Use and Theorem Proving

In addition to chain-of-thought reasoning, Llemma has strong capabilities in computational mathematics tasks. For the tool use and formal theorem proving evaluations, see the paper.
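A tool-use evaluation of this kind typically has the model emit a short Python program, executes it, and checks the result against the reference answer. The sketch below, with a hypothetical model completion, shows the execution step only; a real harness would sandbox and time-limit the generated code.

```python
def run_generated_code(code):
    """Execute model-generated Python in a fresh namespace and return its `answer` variable."""
    namespace = {}
    exec(code, namespace)  # a real evaluation harness would sandbox and time-limit this
    return namespace.get("answer")

# Hypothetical model completion for "compute the sum of the squares of 1 through 10":
generated = "answer = sum(n**2 for n in range(1, 11))"
print(run_generated_code(generated))  # prints 385
```

Scoring is then a simple comparison of the returned value against the gold answer, which sidesteps the arithmetic slips that pure chain-of-thought reasoning is prone to.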

Citation

@misc{azerbayev2023llemma,
title={Llemma: An Open Language Model For Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
year={2023},
eprint={2310.10631},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
