This release includes 3 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: MMLU dataset migration to cais/mmlu, default generation limits increased, and vLLM/SGLang temperature defaults set to 0.0.
Full changelog
lm-eval v0.4.9 Release Notes
Key Improvements
- Enhanced Backend Support:
  - SGLang Generate API by @baberabb in #2997
  - vLLM enhancements: Added support for `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb
  - Chat template improvements: Extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
- Multimodal Capabilities:
  - Audio modality support for Qwen2 Audio models by @artemorloff in #2689
  - Image processing improvements: Added resize images support (#2958) and enabled multimodal API usage (#2981) by @artemorloff and @baberabb
  - ChartQA multimodal task implementation by @baberabb in #2544
- Performance & Reliability:
  - Quantization support added via `quantization_config` by @jerryzh168 in #2842
  - Memory optimization: Use `yaml.CLoader` for faster YAML loading by @giuliolovisotto in #2777
  - Bug fixes: Resolved MMLU generative metric aggregation (#2761) and context length handling issues (#2972)
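The `yaml.CLoader` item above depends on PyYAML's optional libyaml bindings, which are not present in every install. A minimal sketch of the usual fallback pattern follows. It is an illustration rather than the harness's actual code, and `pick_yaml_loader` is a hypothetical helper name. The module is passed in as an argument so the selection logic can be exercised without libyaml installed.

```python
from types import SimpleNamespace


def pick_yaml_loader(yaml_module):
    """Return the C-accelerated loader if the module exposes one, else the pure-Python one."""
    # PyYAML only defines `CLoader` when it was built against libyaml.
    return getattr(yaml_module, "CLoader", yaml_module.Loader)


# Stand-ins for a PyYAML build with and without libyaml (hypothetical objects, for illustration).
with_libyaml = SimpleNamespace(Loader="python-loader", CLoader="c-loader")
without_libyaml = SimpleNamespace(Loader="python-loader")

fast = pick_yaml_loader(with_libyaml)
slow = pick_yaml_loader(without_libyaml)
```

With the real library this would be used as `yaml.load(text, Loader=pick_yaml_loader(yaml))`. Both `Loader` and `CLoader` can construct arbitrary Python objects, so the pattern is only appropriate for trusted files such as a project's own task configs.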
New Benchmarks & Tasks
Code Evaluation
- HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
- MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995
Language Modeling
- C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889
Long Context Benchmarks
- RULER and Longbench - Long-context evaluation suites added by @baberabb in #2629
Mathematical & Reasoning
- GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
- MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
- JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865
Llama Reference Implementations
- Added task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829
Multilingual Expansion
Asian Languages:
- Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
- MMLU-ProX extended evaluation by @heli-qi in #2811
- KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000
European Languages:
- NorEval - Comprehensive Norwegian benchmark by @vmkhlv in #2919
African Languages:
- AfroBench - Multi-African language evaluation by @JessicaOjo in #2825
- Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521
Arabic Languages:
- Arab Culture task for cultural understanding by @bodasadallah in #3006
Domain-Specific Benchmarks
- CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
- ACPBench & ACPBench Hard - Reasoning about action, change, and planning by @harshakokel in #2807, #2980
- INCLUDE tasks - Multilingual evaluation with regional knowledge by @agromanou in #2769
- Cocoteros VA dataset by @sgs97ua in #2787
Social & Bias Evaluation
- Various social bias tasks for fairness assessment by @oskarvanderwal in #1185
Technical Enhancements
- Fine-grained evaluation: Added `--examples` argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
- Improved tokenization: Better handling of `add_bos_token` initialization by @baberabb in #2781
- Memory management: Enhanced softmax computations with `softmax_dtype` argument for `HFLM` by @Avelina9X in #2921
Critical Bug Fixes
- Collating Queries Fix - Resolved error with different continuation lengths that was causing evaluation failures by @ameyagodbole in #2987
- Mutual Information Metric - Fixed acc_mutual_info calculation bug that affected metric accuracy by @baberabb in #3035
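For readers unfamiliar with the metric in the second fix: a mutual-information accuracy typically scores each answer choice by how much the context raises its log-likelihood, rather than by the raw log-likelihood. The sketch below assumes that common formulation (conditional minus unconditional log-likelihood, then argmax). It is not taken from the harness, and the function name and sample numbers are made up for illustration.

```python
def mutual_info_accuracy(cond_lls, uncond_lls, gold):
    """Return 1.0 if the choice with the largest log-likelihood gain is the gold choice."""
    if len(cond_lls) != len(uncond_lls):
        raise ValueError("need one unconditional score per choice")
    # Gain = log P(choice | context) - log P(choice); higher means the context helped more.
    gains = [c - u for c, u in zip(cond_lls, uncond_lls)]
    predicted = max(range(len(gains)), key=gains.__getitem__)
    return 1.0 if predicted == gold else 0.0


# Illustrative numbers: choice 0 is likelier overall, but choice 1 gains more from the context.
cond = [-2.0, -3.0]
uncond = [-1.0, -5.0]
plain_pick = max(range(len(cond)), key=cond.__getitem__)
mi_score = mutual_info_accuracy(cond, uncond, gold=1)
```

In this toy case plain accuracy and the mutual-information variant pick different choices, which is why an error in the calculation can shift reported scores.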
Breaking Changes & Important Updates
- MMLU dataset migration: Switched to `cais/mmlu` dataset source by @baberabb in #2918
- Default parameter updates: Increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by @dazipe in #2824
- Temperature defaults: Set default temperature to 0.0 for vLLM and SGLang backends by @baberabb in #2819
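Because these defaults changed, scores produced before and after the upgrade may differ unless generation settings are pinned. One way to keep a run reproducible is to pass the values explicitly instead of relying on defaults. The sketch below only assembles an `lm_eval` command line from the `--model`, `--model_args`, `--tasks`, and `--gen_kwargs` options. The model id and the `build_eval_command` helper are placeholders, and whether a given task honors each setting should be checked against its config.

```python
import shlex


def build_eval_command(model, pretrained, tasks, gen_kwargs):
    """Assemble an lm_eval argv list with generation settings stated explicitly."""
    # --gen_kwargs takes a comma-separated key=value string.
    gen = ",".join(f"{k}={v}" for k, v in gen_kwargs.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--gen_kwargs", gen,
    ]


argv = build_eval_command(
    model="vllm",
    pretrained="my-org/my-model",  # placeholder model id, not a real checkpoint
    tasks=["mmlu_pro"],
    gen_kwargs={"temperature": 0.0, "max_gen_toks": 2048},
)
command_line = shlex.join(argv)
```

Passing `argv` to `subprocess.run(argv, check=True)` would launch the evaluation, assuming `lm_eval` is installed and the model id is replaced with a real one.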
We extend our heartfelt thanks to all contributors who made this release possible, including the 46 first-time contributors listed below, who brought fresh perspectives and valuable improvements to the evaluation harness.
What's Changed
- fix mmlu (generative) metric aggregation by @wangcho2k in https://github.com/EleutherAI/lm-evaluation-harness/pull/2761
- Bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2762
- fix verbosity typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2765
- docs: Fix typos in README.md by @ruivieira in https://github.com/EleutherAI/lm-evaluation-harness/pull/2778
- initialize tokenizer with `add_bos_token` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2781
- improvement: Use yaml.CLoader to load yaml files when available. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2777
- Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in https://github.com/EleutherAI/lm-evaluation-harness/pull/2773
- Fix for mc2 calculation by @kdymkiewicz in https://github.com/EleutherAI/lm-evaluation-harness/pull/2768
- New healthcare benchmark: careqa by @PabloAgustin in https://github.com/EleutherAI/lm-evaluation-harness/pull/2714
- Capture gen_kwargs from CLI in squad_completion by @ksurya in https://github.com/EleutherAI/lm-evaluation-harness/pull/2727
- humaneval instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2650
- Update evaluator.py by @zhuzeyuan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2786
- change piqa dataset path (uses parquet rather than dataset script) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2790
- use verify_certificate flag in batch requests by @daniel-salib in https://github.com/EleutherAI/lm-evaluation-harness/pull/2785
- add audio modality (qwen2 audio only) by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2689
- Add various social bias tasks by @oskarvanderwal in https://github.com/EleutherAI/lm-evaluation-harness/pull/1185
- update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2799
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/2802
- Add INCLUDE tasks by @agromanou in https://github.com/EleutherAI/lm-evaluation-harness/pull/2769
- Add support for token-based auth for watsonx models by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2796
- add version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2808
- Add cocoteros_va dataset by @sgs97ua in https://github.com/EleutherAI/lm-evaluation-harness/pull/2787
- Add MastermindEval by @whoisjones in https://github.com/EleutherAI/lm-evaluation-harness/pull/2788
- Add loncxt tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2629
- [hf-multimodal] pass kwargs to self.processor by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2667
- [MM] Chartqa by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2544
- Allow writing config to wandb by @ksurya in https://github.com/EleutherAI/lm-evaluation-harness/pull/2736
- [change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2813
- Clean up README and pyproject.toml by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2814
- Llama3 mmlu correction by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2797
- Add Markdown linter by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2818
- Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2810
- fix typo in humaneval by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2820
- default temp=0.0 for vllm and slang by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2819
- Fixes to mmlu_pro_llama by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2816
- Add MMLU-ProX task by @heli-qi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2811
- Quick fix for mmlu_pro_llama by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2827
- Fix: tj-actions/changed-files is compromised by @Tautorn in https://github.com/EleutherAI/lm-evaluation-harness/pull/2828
- Multilingual MMLU for Llama instruct models by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2826
- bbh - changed dataset to parquet version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2845
- Fix typo in longbench metrics by @djwackey in https://github.com/EleutherAI/lm-evaluation-harness/pull/2854
- Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2849
- Adding ACPBench task by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2807
- add Darija (Moroccan dialects) tasks including darijammlu. darijahellaswag and darija_bench by @hadi-abdine in https://github.com/EleutherAI/lm-evaluation-harness/pull/2521
- Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests by @dazipe in https://github.com/EleutherAI/lm-evaluation-harness/pull/2824
- doc by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2857
- Fix: ACPBench Link by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2860
- Adds MMLU CoT, gsm8k and arc_challenge for llama instruct by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2829
- [leaderboard] math - sync with repo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2817
- Update supported models by @danielholanda in https://github.com/EleutherAI/lm-evaluation-harness/pull/2866
- Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs by @Saibo-creator in https://github.com/EleutherAI/lm-evaluation-harness/pull/2865
- leaderboard - add subtask scores by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2867
- Fix the deps of longbench from jeiba to jieba by @houseroad in https://github.com/EleutherAI/lm-evaluation-harness/pull/2873
- Optimization for evalita-llm rouge computation by @m-resta in https://github.com/EleutherAI/lm-evaluation-harness/pull/2878
- Update authentications methods, add support for deployment_id for IBM watsonx_ai by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2877
- Add GSM8K Platinum by @Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2771
- Add `--examples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] by @felipemaiapolo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2520
- Extend support for chat template in vLLM by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2902
- tasks README: fix dead link by @dtrifiro in https://github.com/EleutherAI/lm-evaluation-harness/pull/2899
- Add support for quantization_config by @jerryzh168 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2842
- Fix a typo in README for tasks by @eldarkurtic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2910
- fix resolve_hf_chat_template version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2917
- mmlu - switch dataset to cais/mmlu; fix tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2918
- init pixels before tokenizer creation by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2911
- Longbench bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2895
- Added softmax_dtype argument to HFLM to coerce log_softmax computations by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/2921
- [bbh] use np.nan for numpy > 2.0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2937
- Add support for enable_thinking argument in vllm model by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2947
- Added NorEval, a novel Norwegian benchmark by @vmkhlv in https://github.com/EleutherAI/lm-evaluation-harness/pull/2919
- Fix import error for eval_logger in score utils by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/2940
- Include all test files in sdist by @booxter in https://github.com/EleutherAI/lm-evaluation-harness/pull/2634
- Change citation name by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2956
- [vllm] add warning on truncation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2962
- fix: type error while checking context length by @llsj14 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2972
- Fix import error for deepcopy by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2969
- Pin unitxt to most recent minor version to avoid test failures by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2970
- mmlu pro generation_kwargs until Q: -> Question: by @yoonniverse in https://github.com/EleutherAI/lm-evaluation-harness/pull/2945
- AfroBench: How Good are Large Language Models on African Languages? by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825
- Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889
- Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870
- Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876
- Add device arg to model_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879
- paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759
- Delete scripts/cost_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985
- Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980
- [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997
- fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998
- Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002
- [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000
- Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993
- use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981
- Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958
- Revert "feat: add question suffix (#2876)" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007
- [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013
- [Fix] Update `resolve_hf_chat_template` arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992
- Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987
- [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011
- add arab_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006
- chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030
- Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999
- [Fix] acc_mutual_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035
- Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025
- add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995
- remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041
- [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983
- Fallback to super implementation in `fewshot_context` for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023
- Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057
- fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061
- mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070
- Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060
- bump version to `0.4.9` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073
New Contributors
- @wangcho2k made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2761
- @ruivieira made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2778
- @perlitz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2773
- @kdymkiewicz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2768
- @PabloAgustin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2714
- @ksurya made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2727
- @zhuzeyuan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2786
- @daniel-salib made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2785
- @oskarvanderwal made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1185
- @Avelina9X made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2802
- @agromanou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2769
- @whoisjones made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2788
- @jd730 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2813
- @anmarques made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2797
- @zhangruoxu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2810
- @heli-qi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2811
- @Tautorn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2828
- @djwackey made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2854
- @Aprilistic made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2849
- @harshakokel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2807
- @hadi-abdine made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2521
- @dazipe made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2824
- @danielholanda made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2866
- @Saibo-creator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2865
- @houseroad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2873
- @felipemaiapolo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2520
- @dtrifiro made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2899
- @jerryzh168 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2842
- @vmkhlv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2919
- @annafontanaa made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2940
- @booxter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2634
- @llsj14 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2972
- @yoonniverse made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2945
- @Zephyr271828 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889
- @sleepingcat4 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870
- @TingchenFu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876
- @momentino made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879
- @abzb1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000
- @Niccolo-Ajroldi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993
- @fxmarty-amd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992
- @ameyagodbole made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987
- @e1washere made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030
- @ysulsky made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999
- @younesbelkada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025
- @vtjl10 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057
- @maximevtush made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.8...v0.4.9
Breaking Changes
- MMLU dataset source switched to cais/mmlu
- Increased `max_gen_toks` default to 2048 and `max_length` default to 8192 for MMLU Pro tests
- Set default temperature to 0.0 for vLLM and SGLang backends