Memorization or Generation of Big Code Models Leaderboard

Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we compare the performance of base code generation models on the HumanEval and HumanEval-ET benchmarks. We also measure the Memorization-Generalization Index (MGI) and report it for each model. The leaderboard covers both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.

For each model, the leaderboard table reports the MGI and Pass@1 at temperature 0 and temperature 0.8, each on both HumanEval and HumanEval-ET.

Notes


Benchmarking and Prompts

For all models (except for the Starcoder family), we used the original benchmark prompts from HumanEval and added a `<bos>` token before the provided prompt. The maximum generation length was set to the length of the original prompt plus 300 tokens.
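As a rough illustration, a generation setup matching this description might look like the sketch below. The model id is a placeholder and loading HumanEval via the `human-eval` package is an assumption; this is not the exact evaluation script.

```python
# Minimal sketch of the prompt setup described above (placeholder model id,
# not the leaderboard's actual script).
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems  # pip install human-eval

model_name = "your-org/your-code-model"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

problems = read_problems()
prompt = problems["HumanEval/0"]["prompt"]  # original benchmark prompt

# Prepend the <bos> token to the provided prompt (assumes the tokenizer defines one).
input_ids = tokenizer(tokenizer.bos_token + prompt, return_tensors="pt").input_ids

# Maximum generation length = original prompt length + 300 tokens.
output = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 300,
    do_sample=False,  # temperature 0 (greedy decoding)
)
completion = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```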

For the StarCoder-family models (such as StarCoder2-7B and StarCoder2-15B), we used the official bigcode-evaluation-harness for generation. More details can be found in the bigcode-evaluation-harness repository.

Evaluation Parameters

For all models, we generated 1 sample at temperature 0 (greedy decoding) and 50 samples at temperature 0.8, which are used for the subsequent result calculations.
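Concretely, the two decoding settings could be configured as in the sketch below, which reuses `model`, `tokenizer`, and `input_ids` from the snippet in the previous section. The argument names follow the 🤗 Transformers `generate` API; this is an illustration, not the exact script.

```python
# Temperature 0: a single greedy sample per problem.
greedy_outputs = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 300,
    do_sample=False,
)

# Temperature 0.8: 50 samples per problem.
sampled_outputs = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 300,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=50,
    pad_token_id=tokenizer.eos_token_id,  # avoid a warning when no pad token is set
)
```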

Performance Metrics

We report Pass@1 at temperature 0 and temperature 0.8 on HumanEval and HumanEval-ET, together with the Memorization-Generalization Index (MGI).
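For reference, Pass@k is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below shows it for the 50-sample, temperature-0.8 setting; whether the leaderboard uses exactly this estimator is an assumption.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = total samples, c = samples passing all tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 50 samples for one problem, 12 of them pass the unit tests.
print(pass_at_k(n=50, c=12, k=1))  # 0.24, i.e. c / n when k = 1
```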

How to submit models/results to the leaderboard?

We welcome the community to submit evaluation results of new models. These results will be added as non-verified; however, the authors are required to upload their generations so that other members can check them.

To submit your results, create a Pull Request in the community tab and add them under the `community_results` folder in the repository.

The title of the PR should be `[Community Submission] Model: org/model, Username: your_username`; replace `org` and `model` with those corresponding to the model you evaluated.

Context

In addition to the Memorization or Generation of Big Code Models Leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: