This release includes 4 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Broad release touching reflexive white-box scorers, datasets, LLMs, and single-generation white-box scorers.
Full changelog
Highlights
1. Varied tutorials for more model and dataset coverage
We have updated the example notebooks to cover a broader range of LLMs and example datasets.
LLMs
- Gemini models
- GPT-4* models
- o3-mini
- Qwen
- Mistral
- Llama
- DeepSeek
Datasets
- GSM8K
- SVAMP
- PopQA
- NQ-Open
- AI2-ARC
- CSQA
- SimpleQA
- HotpotQA
- Image (multimodal demo)
2. New scorers added
This release adds 11 new scorers across the white-box and black-box categories (with accompanying unit tests). Details are provided below.
White-Box scorers
We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.
Single-generation white-box scorers
- Likelihood margin Farr et al., 2024
- Sequence probability Vashurin et al., 2024
- Mean top-k token entropy Scalena et al., 2025
- Max top-token entropy Scalena et al., 2025
Sampling-based white-box scorers
- Semantic Entropy (logprobs version) Farquhar et al., 2024
- Semantic Density Qiu et al., 2024 (can also be implemented with SemanticDensity from uqlm.scorers)
- Monte Carlo predictive entropy Kuhn et al., 2023
- CoCoA Vashurin et al., 2025
Reflexive white-box scorers
- P(True) Kadavath et al., 2022
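To make the effect of length_normalize concrete, the following is a minimal sketch of sequence probability computed from per-token log-probabilities, with and without length normalization. It uses the common textbook definitions (joint probability versus geometric mean of token probabilities) and is not taken from UQLM's source; the function names here are illustrative only, and the library's implementation may differ in detail.

```python
import math
from typing import List


def sequence_probability(token_logprobs: List[float], length_normalize: bool = False) -> float:
    """Probability of a response given its per-token log-probabilities.

    Illustrative helper, not a UQLM function.
    - length_normalize=False: joint probability, exp(sum(logprobs)).
    - length_normalize=True: geometric mean of token probabilities,
      exp(mean(logprobs)), so longer responses are not penalized
      merely for containing more tokens.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)
    return math.exp(total)


def min_probability(token_logprobs: List[float]) -> float:
    """Probability of the least likely token in the response."""
    return math.exp(min(token_logprobs))


# Three tokens with probabilities 0.9, 0.8, and 0.5.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
raw = sequence_probability(logprobs)                                # 0.9 * 0.8 * 0.5
normalized = sequence_probability(logprobs, length_normalize=True)  # cube root of the product
lowest = min_probability(logprobs)
```

The unnormalized score shrinks with every additional token, while the normalized score stays on a per-token scale, which is why the two can rank responses of different lengths differently.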
Black-Box scorers
We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.
- Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
- Entailment Probability (Lin et al., 2025; Chen & Mueller, 2023)
Definitions of new scorers are provided with LaTeX at the end of applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers. The readme has also been updated to reflect the new scorers.
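As a rough illustration of the idea behind Number of Semantic Sets, the sketch below groups sampled responses into clusters of equivalent meaning and counts the clusters. In the cited papers, equivalence is typically judged by an NLI model checking entailment in both directions; here a toy string comparison stands in for that check. Both function names are hypothetical and this is not UQLM's implementation.

```python
from typing import Callable, List


def count_semantic_sets(responses: List[str], equivalent: Callable[[str, str], bool]) -> int:
    """Greedily cluster responses and return the number of clusters.

    Each response joins the first cluster whose representative (its first
    member) it is equivalent to; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so this response opens a new one.
            clusters.append([response])
    return len(clusters)


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a bidirectional NLI entailment check."""
    def normalize(text: str) -> str:
        return text.strip().lower().rstrip(".")
    return normalize(a) == normalize(b)


samples = ["Paris", "paris.", "Lyon", "Paris ", "Marseille"]
num_sets = count_semantic_sets(samples, toy_equivalent)
```

More distinct clusters among the sampled responses suggest the model is less consistent, and therefore less certain, about its answer.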
3. New LLMGrader class and updated default grader for UQEnsemble
This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:
- in the example notebooks for evaluating hallucination detection performance.
- as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.
4. Option to provide additional context to LLM judges
Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.
5. New datasets available with load_example_dataset
The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.
6. uqlm.nli sub-package
Created uqlm.nli sub-package that contains the following:
- NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations are moved to the uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes, respectively.
- SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)
7. uqlm.white_box sub-package
Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:
- SingleLogprobsScorer for computing scores that depend only on logprobs from one generated response: normalized probability, sequence probability, minimum probability
- TopLogprobsScorer for computing scores that depend on top-K logprobs from a generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
- SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: Monte Carlo probability, CoCoA, semantic entropy, and semantic density
- PTrueScorer for implementing the P(True) method
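For readers unfamiliar with scores built on top-K logprobs, here is a minimal sketch of per-token entropy over the top-k candidates, aggregated by mean and max across the response. It assumes the top-k probabilities at each position are renormalized to sum to 1 before taking Shannon entropy; UQLM's scorers are described in terms of negentropy and may normalize or invert the quantity differently, so treat the definitions in the tutorial notebooks as authoritative. The function names are illustrative and not part of the library.

```python
import math
from typing import List, Tuple


def topk_token_entropy(top_logprobs: List[float]) -> float:
    """Shannon entropy (in nats) over one position's top-k candidates.

    The candidate probabilities are renormalized to sum to 1 first,
    since the top-k tokens rarely cover the full vocabulary mass.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy


def mean_and_max_entropy(per_token_top_logprobs: List[List[float]]) -> Tuple[float, float]:
    """Aggregate per-token entropies across a response."""
    entropies = [topk_token_entropy(candidates) for candidates in per_token_top_logprobs]
    return sum(entropies) / len(entropies), max(entropies)


# One position where the model is confident, one where it is not.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
uncertain = [math.log(0.40), math.log(0.35), math.log(0.25)]
mean_h, max_h = mean_and_max_entropy([confident, uncertain])
```

The max aggregate flags a response if any single token was highly uncertain, while the mean reflects uncertainty spread over the whole response.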
8. Minor changes & future deprecations
- Renamed NLIScorer -> ConsistencyScorer and moved some methods to the uqlm.nli.NLI class
- normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
- system_prompt and template_ques_ans are deprecated in favor of the additional_context parameter
- default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation
What's Changed
- Add Semantic Density scorer by @dross20 in https://github.com/cvs-health/uqlm/pull/209
- Adding HotPotQA and SimpleQA by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/210
- Semantic density, docs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/212
- Semantic density notebook by @dross20 in https://github.com/cvs-health/uqlm/pull/213
- Semantic density by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/214
- v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/220
- add judge customization option by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/221
- v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/223
- New White Box Scorers by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/219
- Diversify demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/232
- Update notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/233
- update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/229
- Update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/234
- Llm grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/238
- update demo notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/239
- Refactor: NLI Subpackage II by @mohitcek in https://github.com/cvs-health/uqlm/pull/237
- Feature: Integrate SemanticEntropy and SemanticDensity methods with WhiteBoxUQ class by @mohitcek in https://github.com/cvs-health/uqlm/pull/240
- Drop python 3.9 support by @doyajii1 in https://github.com/cvs-health/uqlm/pull/242
- Jmlr revisions by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/243
- Bump sphinx-autodoc-typehints from 2.2.0 to 2.3.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/230
- Polish notebooks and readme by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/246
- Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/176
- Number of semantic sets scorer by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/247
- Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/241
- Minor refactor + Improved test coverage by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/255
- Minor refactor + updated demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/256
- Update scorer definitions + fix logprobs bug in SemanticEntropy by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/257
- Reuse NLI Scores by @mohitcek in https://github.com/cvs-health/uqlm/pull/260
- Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/258
- Allow for torch.device in WhiteBoxUQ by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/261
- Fix logprob bug by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/262
- Update docs site by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/259
- Release PR: v0.4.0 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/263
New Contributors
- @dross20 made their first contribution in https://github.com/cvs-health/uqlm/pull/209
Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.3.1...v0.4.0
Breaking Changes
- Dropped support for Python 3.9
- Renamed `NLIScorer` to `ConsistencyScorer` and moved related methods to `uqlm.nli.NLI`
- Scheduled the `normalized_probability` scorer name for deprecation in v0.5 in favor of `sequence_probability` with `length_normalize`
- Deprecated `system_prompt` and `template_ques_ans`; use `additional_context` instead