EleutherAI / Lm-Evaluation-Harness

v0.4.9 Breaking

This release includes 3 breaking changes that platform teams should review before upgrading.

Published 1y AI Coding Tools


✓ No known CVEs patched


Topics

evaluation-framework language-model transformer

Summary

AI summary

The MMLU tasks now load from the cais/mmlu dataset, the default max_gen_toks and max_length are raised for MMLU Pro tests, and the default temperature for the vLLM and SGLang backends is set to 0.0.

Full changelog

lm-eval v0.4.9 Release Notes

Key Improvements

Enhanced Backend Support:
- SGLang Generate API by @baberabb in #2997
- vLLM enhancements: Added support for enable_thinking argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb
- Chat template improvements: Extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
Multimodal Capabilities:
- Audio modality support for Qwen2 Audio models by @artemorloff in #2689
- Image processing improvements: Added resize images support (#2958) and enabled multimodal API usage (#2981) by @artemorloff and @baberabb
- ChartQA multimodal task implementation by @baberabb in #2544
Performance & Reliability:
- Quantization support added via quantization_config by @jerryzh168 in #2842
- Faster config loading: Use yaml.CLoader, when available, to load YAML files by @giuliolovisotto in #2777
- Bug fixes: Resolved MMLU generative metric aggregation (#2761) and context length handling issues (#2972)

New Benchmarks & Tasks

Code Evaluation

HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995

Language Modeling

C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889

Long Context Benchmarks

RULER and LongBench - Long-context evaluation suites added by @baberabb in #2629

Mathematical & Reasoning

GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865

Llama Reference Implementations

Llama Reference Implementations - Added task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829

Multilingual Expansion

Asian Languages:

Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
MMLU-ProX extended evaluation by @heli-qi in #2811
KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000

European Languages:

NorEval - Comprehensive Norwegian benchmark by @vmkhlv in #2919

African Languages:

AfroBench - Multi-African language evaluation by @JessicaOjo in #2825
Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521

Arabic Languages:

Arab Culture task for cultural understanding by @bodasadallah in #3006

Domain-Specific Benchmarks

CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
ACPBench & ACPBench Hard - Evaluation of reasoning about action, change, and planning by @harshakokel in #2807, #2980
INCLUDE tasks - Multilingual evaluation suite built on regional knowledge by @agromanou in #2769
Cocoteros VA dataset by @sgs97ua in #2787

Social & Bias Evaluation

Various social bias tasks for fairness assessment by @oskarvanderwal in #1185

Technical Enhancements

Fine-grained evaluation: Added --examples argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
Improved tokenization: Better handling of add_bos_token initialization by @baberabb in #2781
Memory management: Enhanced softmax computations with softmax_dtype argument for HFLM by @Avelina9X in #2921

Critical Bug Fixes

Collating Queries Fix - Resolved error with different continuation lengths that was causing evaluation failures by @ameyagodbole in #2987
Mutual Information Metric - Fixed acc_mutual_info calculation bug that affected metric accuracy by @baberabb in #3035
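
For readers unfamiliar with the metric named above: acc_mutual_info is usually described as scoring each answer choice by how much the context raises its log-likelihood, then checking whether the best-scoring choice is the gold answer. The sketch below illustrates that idea only. It is not the harness's implementation, and the function name, argument names, and numbers are hypothetical.

```python
from typing import Sequence


def acc_mutual_info(
    cond_lls: Sequence[float],
    uncond_lls: Sequence[float],
    gold: int,
) -> float:
    """Return 1.0 if the gold choice has the highest mutual-information score.

    cond_lls[i]   : log P(choice_i | context)   (assumed input)
    uncond_lls[i] : log P(choice_i) without the context (assumed input)
    """
    if len(cond_lls) != len(uncond_lls):
        raise ValueError("need one unconditional score per choice")
    # Score each choice by how much the context increases its log-likelihood.
    scores = [c - u for c, u in zip(cond_lls, uncond_lls)]
    # Pick the best-scoring choice; the first index wins on ties.
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if predicted == gold else 0.0


# Made-up numbers: choice 0 has the highest raw log-likelihood,
# but choice 1 gains the most from seeing the context.
cond = [-2.0, -2.5, -6.0]
uncond = [-1.5, -5.0, -6.5]
print(acc_mutual_info(cond, uncond, gold=1))  # prints 1.0
```

With plain accuracy on the conditional scores, choice 0 would be selected here, which is why a bug in this calculation can change reported numbers for tasks that use the metric.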

Breaking Changes & Important Updates

MMLU dataset migration: Switched to cais/mmlu dataset source by @baberabb in #2918
Default parameter updates: Increased max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests by @dazipe in #2824
Temperature defaults: Set default temperature to 0.0 for vLLM and SGLang backends by @baberabb in #2819
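
Because these defaults changed, one way to keep runs comparable across versions is to pass generation settings explicitly instead of relying on backend defaults. The sketch below assembles an lm_eval command line with the --gen_kwargs flag. The helper name build_lm_eval_command is hypothetical, the model id is a placeholder, and the chosen values are only an example, not a recommendation from the release notes.

```python
import shlex
from typing import List, Mapping


def build_lm_eval_command(
    model: str,
    model_args: Mapping[str, object],
    tasks: List[str],
    gen_kwargs: Mapping[str, object],
) -> List[str]:
    """Assemble an lm_eval argv list with explicit generation settings."""

    def join(pairs: Mapping[str, object]) -> str:
        # lm_eval takes these options as comma-separated key=value pairs.
        return ",".join(f"{key}={value}" for key, value in pairs.items())

    return [
        "lm_eval",
        "--model", model,
        "--model_args", join(model_args),
        "--tasks", ",".join(tasks),
        "--gen_kwargs", join(gen_kwargs),
    ]


cmd = build_lm_eval_command(
    model="vllm",
    model_args={"pretrained": "my-org/my-model"},  # placeholder model id
    tasks=["gsm8k"],
    # Pin the values explicitly so a change in defaults cannot alter the run.
    gen_kwargs={"temperature": 0.0, "max_gen_toks": 2048},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed string pasted into a shell or CI script.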

We extend our heartfelt thanks to all contributors who made this release possible, including 46 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.

What's Changed

fix mmlu (generative) metric aggregation by @wangcho2k in https://github.com/EleutherAI/lm-evaluation-harness/pull/2761
Bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2762
fix verbosity typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2765
docs: Fix typos in README.md by @ruivieira in https://github.com/EleutherAI/lm-evaluation-harness/pull/2778
initialize tokenizer with add_bos_token by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2781
improvement: Use yaml.CLoader to load yaml files when available. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2777
Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in https://github.com/EleutherAI/lm-evaluation-harness/pull/2773
Fix for mc2 calculation by @kdymkiewicz in https://github.com/EleutherAI/lm-evaluation-harness/pull/2768
New healthcare benchmark: careqa by @PabloAgustin in https://github.com/EleutherAI/lm-evaluation-harness/pull/2714
Capture gen_kwargs from CLI in squad_completion by @ksurya in https://github.com/EleutherAI/lm-evaluation-harness/pull/2727
humaneval instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2650
Update evaluator.py by @zhuzeyuan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2786
change piqa dataset path (uses parquet rather than dataset script) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2790
use verify_certificate flag in batch requests by @daniel-salib in https://github.com/EleutherAI/lm-evaluation-harness/pull/2785
add audio modality (qwen2 audio only) by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2689
Add various social bias tasks by @oskarvanderwal in https://github.com/EleutherAI/lm-evaluation-harness/pull/1185
update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2799
Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/2802
Add INCLUDE tasks by @agromanou in https://github.com/EleutherAI/lm-evaluation-harness/pull/2769
Add support for token-based auth for watsonx models by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2796
add version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2808
Add cocoteros_va dataset by @sgs97ua in https://github.com/EleutherAI/lm-evaluation-harness/pull/2787
Add MastermindEval by @whoisjones in https://github.com/EleutherAI/lm-evaluation-harness/pull/2788
Add loncxt tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2629
[hf-multimodal] pass kwargs to self.processor by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2667
[MM] Chartqa by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2544
Allow writing config to wandb by @ksurya in https://github.com/EleutherAI/lm-evaluation-harness/pull/2736
[change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2813
Clean up README and pyproject.toml by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2814
Llama3 mmlu correction by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2797
Add Markdown linter by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2818
Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2810
fix typo in humaneval by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2820
default temp=0.0 for vllm and sglang by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2819
Fixes to mmlu_pro_llama by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2816
Add MMLU-ProX task by @heli-qi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2811
Quick fix for mmlu_pro_llama by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2827
Fix: tj-actions/changed-files is compromised by @Tautorn in https://github.com/EleutherAI/lm-evaluation-harness/pull/2828
Multilingual MMLU for Llama instruct models by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2826
bbh - changed dataset to parquet version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2845
Fix typo in longbench metrics by @djwackey in https://github.com/EleutherAI/lm-evaluation-harness/pull/2854
Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2849
Adding ACPBench task by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2807
add Darija (Moroccan dialects) tasks including darijammlu. darijahellaswag and darija_bench by @hadi-abdine in https://github.com/EleutherAI/lm-evaluation-harness/pull/2521
Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests by @dazipe in https://github.com/EleutherAI/lm-evaluation-harness/pull/2824
doc by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2857
Fix: ACPBench Link by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2860
Adds MMLU CoT, gsm8k and arc_challenge for llama instruct by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2829
[leaderboard] math - sync with repo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2817
Update supported models by @danielholanda in https://github.com/EleutherAI/lm-evaluation-harness/pull/2866
Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs by @Saibo-creator in https://github.com/EleutherAI/lm-evaluation-harness/pull/2865
leaderboard - add subtask scores by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2867
Fix the deps of longbench from jeiba to jieba by @houseroad in https://github.com/EleutherAI/lm-evaluation-harness/pull/2873
Optimization for evalita-llm rouge computation by @m-resta in https://github.com/EleutherAI/lm-evaluation-harness/pull/2878
Update authentications methods, add support for deployment_id for IBM watsonx_ai by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2877
Add GSM8K Platinum by @Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2771
Add --examples Argument for Fine-Grained Task Evaluation in lm-evaluation-harness. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] by @felipemaiapolo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2520
Extend support for chat template in vLLM by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2902
tasks README: fix dead link by @dtrifiro in https://github.com/EleutherAI/lm-evaluation-harness/pull/2899
Add support for quantization_config by @jerryzh168 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2842
Fix a typo in README for tasks by @eldarkurtic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2910
fix resolve_hf_chat_template version by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2917
mmlu - switch dataset to cais/mmlu; fix tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2918
init pixels before tokenizer creation by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2911
Longbench bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2895
Added softmax_dtype argument to HFLM to coerce log_softmax computations by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/2921
[bbh] use np.nan for numpy > 2.0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2937
Add support for enable_thinking argument in vllm model by @anmarques in https://github.com/EleutherAI/lm-evaluation-harness/pull/2947
Added NorEval, a novel Norwegian benchmark by @vmkhlv in https://github.com/EleutherAI/lm-evaluation-harness/pull/2919
Fix import error for eval_logger in score utils by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/2940
Include all test files in sdist by @booxter in https://github.com/EleutherAI/lm-evaluation-harness/pull/2634
Change citation name by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2956
[vllm] add warning on truncation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2962
fix: type error while checking context length by @llsj14 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2972
Fix import error for deepcopy by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2969
Pin unitxt to most recent minor version to avoid test failures by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2970
mmlu pro generation_kwargs until Q: -> Question: by @yoonniverse in https://github.com/EleutherAI/lm-evaluation-harness/pull/2945
AfroBench: How Good are Large Language Models on African Languages? by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825
Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889
Fixed a bug in MMLU-Pro utils.py that threw an index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870
Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876
Add device arg to model_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879
paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759
Delete scripts/cost_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985
Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980
[SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997
fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998
Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002
[Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000
Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993
use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981
Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958
Revert "feat: add question suffix (#2876)" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007
[hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013
[Fix] Update resolve_hf_chat_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992
Fix error in collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987
[vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011
add arab_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006
chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030
Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999
[Fix] acc_mutual_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035
Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025
add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995
remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041
[longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983
Fallback to super implementation in fewshot_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023
Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057
fix longbench citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061
mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070
Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060
bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073

New Contributors

@wangcho2k made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2761
@ruivieira made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2778
@perlitz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2773
@kdymkiewicz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2768
@PabloAgustin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2714
@ksurya made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2727
@zhuzeyuan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2786
@daniel-salib made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2785
@oskarvanderwal made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1185
@Avelina9X made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2802
@agromanou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2769
@whoisjones made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2788
@jd730 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2813
@anmarques made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2797
@zhangruoxu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2810
@heli-qi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2811
@Tautorn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2828
@djwackey made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2854
@Aprilistic made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2849
@harshakokel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2807
@hadi-abdine made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2521
@dazipe made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2824
@danielholanda made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2866
@Saibo-creator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2865
@houseroad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2873
@felipemaiapolo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2520
@dtrifiro made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2899
@jerryzh168 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2842
@vmkhlv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2919
@annafontanaa made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2940
@booxter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2634
@llsj14 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2972
@yoonniverse made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2945
@Zephyr271828 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889
@sleepingcat4 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870
@TingchenFu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876
@momentino made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879
@abzb1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000
@Niccolo-Ajroldi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993
@fxmarty-amd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992
@ameyagodbole made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987
@e1washere made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030
@ysulsky made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999
@younesbelkada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025
@vtjl10 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057
@maximevtush made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.8...v0.4.9

Breaking Changes

MMLU dataset source switched to cais/mmlu
Increased `max_gen_toks` default to 2048 and `max_length` default to 8192 for MMLU Pro tests
Set default temperature to 0.0 for vLLM and SGLang backends
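
Since all three changes can shift scores, it may help to re-run a small task set before and after upgrading and compare the metrics. The sketch below assumes each run has been reduced to a task-to-metrics dictionary with keys such as "acc,none". The helper name, the tolerance, and all numbers are hypothetical placeholders, not measured results.

```python
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, object]]


def metric_deltas(
    before: Results,
    after: Results,
    tolerance: float = 0.005,
) -> List[Tuple[str, str, float]]:
    """List (task, metric, delta) for numeric metrics that moved beyond tolerance."""
    changed = []
    # Only compare tasks that appear in both runs.
    for task in sorted(before.keys() & after.keys()):
        for metric, old in before[task].items():
            new = after[task].get(metric)
            # Skip non-numeric entries such as aliases or missing metrics.
            if isinstance(old, bool) or isinstance(new, bool):
                continue
            if not isinstance(old, (int, float)) or not isinstance(new, (int, float)):
                continue
            delta = new - old
            if abs(delta) > tolerance:
                changed.append((task, metric, round(delta, 6)))
    return changed


# Placeholder numbers for illustration only.
old_run = {
    "mmlu": {"acc,none": 0.652, "alias": "mmlu"},
    "gsm8k": {"exact_match,strict-match": 0.41},
}
new_run = {
    "mmlu": {"acc,none": 0.653, "alias": "mmlu"},
    "gsm8k": {"exact_match,strict-match": 0.45},
}
print(metric_deltas(old_run, new_run))
```

A non-empty list does not by itself indicate a regression. It only marks the tasks worth re-checking against the dataset, generation-limit, and temperature changes listed above.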


Related context

Earlier breaking changes

v0.4.12 vLLM minimum version requirement bumped to 0.18
v0.4.12 enable_thinking now disallowed for multiple_choice and loglikelihood tasks
v0.4.12 SteeredHF backend renamed to SteeredModel, update imports
v0.4.12 TaskManager.load() returns flat dict instead of nested structure