UQLM

v0.4.0 Breaking

This release includes 4 breaking changes for platform teams planning a safe upgrade.

Published 8mo AI Agents & Assistants


✓ No known CVEs patched


Topics

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection

hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Summary

AI summary

Broad release: adds 11 new scorers (9 white-box, 2 black-box), a new LLMGrader class, new uqlm.nli and uqlm.white_box sub-packages, and tutorials covering more LLMs and datasets.

Full changelog

Highlights

1. Varied tutorials for more model and dataset coverage

We have updated the example notebooks to have broader coverage over LLMs and example datasets.

LLMs

Gemini models
GPT-4* models
o3-mini
Qwen
Mistral
Llama
DeepSeek

Datasets

GSM8K
SVAMP
PopQA
NQ-Open
AI2-ARC
CSQA
SimpleQA
HotpotQA
Image (multimodal demo)

2. New scorers added

This release includes the addition of 11 new scorers spanning various categories (with accompanying unit tests). Details are provided below.

White-Box scorers

We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.

Single-generation white-box scorers

Likelihood margin Farr et al., 2024
Sequence probability Vashurin et al., 2024
Mean top-k token entropy Scalena et al., 2025
Max top-token entropy Scalena et al., 2025
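
To make the top-logprob-based scores above more concrete, here is a minimal sketch computed directly from per-position top-k log-probabilities. The function names and toy inputs are illustrative assumptions, not UQLM's API, and the formulas follow common textbook definitions (entropy over renormalized top-k probabilities, margin as the mean gap between the two most likely tokens), which may differ in detail from the library's implementation.

```python
import math

def likelihood_margin(top_logprobs):
    """Mean gap between the two most likely tokens at each position."""
    gaps = []
    for candidates in top_logprobs:
        ranked = sorted((math.exp(lp) for lp in candidates), reverse=True)
        gaps.append(ranked[0] - ranked[1])
    return sum(gaps) / len(gaps)

def topk_entropies(top_logprobs):
    """Shannon entropy (in nats) of the renormalized top-k distribution per position."""
    entropies = []
    for candidates in top_logprobs:
        probs = [math.exp(lp) for lp in candidates]
        total = sum(probs)
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return entropies

# Toy response with two token positions and k = 2 candidates per position.
top2 = [
    [math.log(0.9), math.log(0.1)],  # confident position
    [math.log(0.5), math.log(0.5)],  # maximally uncertain position
]

margin = likelihood_margin(top2)
entropies = topk_entropies(top2)
mean_entropy = sum(entropies) / len(entropies)
max_entropy = max(entropies)
```

In this toy case the second position drives the maximum entropy, while the margin averages a wide gap at the first position with a zero gap at the second.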

Sampling-based white-box scorers

Semantic Entropy (logprobs version) Farquhar et al., 2024
Semantic Density Qiu et al., 2024 (can also be implemented with SemanticDensity from uqlm.scorers)
Monte Carlo predictive entropy Kuhn et al., 2023
CoCoA Vashurin et al., 2025
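
As a rough illustration of how sampling-based scores use logprobs from several sampled responses, the sketch below computes a Monte Carlo predictive entropy and a cluster-based semantic entropy. All names here are hypothetical, the cluster labels are assumed to come from a separate semantic clustering step, and the estimators are simplified versions of those in the cited papers rather than UQLM's exact formulas. The length_normalize flag mirrors the idea described above: each response's log-probability is divided by its token count.

```python
import math
from collections import defaultdict

def response_logprob(token_logprobs, length_normalize=True):
    """Log-probability of one response, optionally averaged over its tokens."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total

def mc_predictive_entropy(sampled_logprobs, length_normalize=True):
    """Negative mean response log-probability across the sampled responses."""
    scores = [response_logprob(lp, length_normalize) for lp in sampled_logprobs]
    return -sum(scores) / len(scores)

def semantic_entropy(sampled_logprobs, cluster_ids, length_normalize=True):
    """Entropy over semantic clusters, weighting each cluster by its probability mass."""
    mass = defaultdict(float)
    for lp, cid in zip(sampled_logprobs, cluster_ids):
        mass[cid] += math.exp(response_logprob(lp, length_normalize))
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())

# Three sampled responses of two tokens each; the first two share a meaning.
samples = [[math.log(0.5), math.log(0.5)] for _ in range(3)]
mc_entropy = mc_predictive_entropy(samples)
se_two_clusters = semantic_entropy(samples, cluster_ids=[0, 0, 1])
se_one_cluster = semantic_entropy(samples, cluster_ids=[0, 0, 0])
```

When every sample falls into one cluster the semantic entropy collapses to zero, which is the intuition behind using it as a hallucination signal.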

Reflexive white-box scorers

P(True) Kadavath et al., 2022

Black-Box scorers

We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.

Number of Semantic Sets (Lin et al., 2024; Vashurin et al., 2025; Kuhn et al., 2023)
Entailment Probability (Lin et al., 2025; Chen & Mueller, 2023)
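
The idea behind Number of Semantic Sets can be sketched as a greedy clustering of sampled responses by meaning. This is an illustration under stated assumptions: the function names are hypothetical, and the toy_equivalent predicate is a placeholder string comparison standing in for the NLI-based equivalence check a real implementation would use.

```python
def cluster_responses(responses, equivalent):
    """Greedily assign each response to the first cluster whose representative matches."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            if equivalent(cluster[0], response):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so this response starts a new one.
            clusters.append([response])
    return clusters

def num_semantic_sets(responses, equivalent):
    """More distinct meanings among samples suggests higher uncertainty."""
    return len(cluster_responses(responses, equivalent))

def _normalize(text):
    return text.lower().strip(" .!")

def toy_equivalent(a, b):
    """Placeholder for an NLI check: equal after lowercasing and trimming punctuation."""
    return _normalize(a) == _normalize(b)

samples = ["Paris", "paris.", "Lyon", "Paris!"]
clusters = cluster_responses(samples, toy_equivalent)
n_sets = num_semantic_sets(samples, toy_equivalent)
```

Here the four samples collapse into two semantic sets; a raw count like this would typically be rescaled before being used as a confidence score.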

Definitions of the new scorers are provided in LaTeX at the end of the applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers, and updated the README to reflect the new scorers.

3. New `LLMGrader` class and updated default grader for `UQEnsemble`

This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:

in the example notebooks for evaluating hallucination detection performance.
as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.
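
The general pattern of grading responses against an answer key with an LLM can be sketched as follows. This is not LLMGrader's actual interface: grade_responses, the prompt wording, and stub_model are all hypothetical, and stub_model is a deterministic stand-in for a chat model call so the example runs without network access.

```python
def grade_responses(ask_model, questions, responses, answers):
    """Ask a model whether each response matches the corresponding answer key entry."""
    grades = []
    for question, response, answer in zip(questions, responses, answers):
        prompt = (
            f"Question: {question}\n"
            f"Proposed answer: {response}\n"
            f"Correct answer: {answer}\n"
            "Reply with exactly 'yes' if the proposed answer is correct, otherwise 'no'."
        )
        verdict = ask_model(prompt).strip().lower()
        grades.append(verdict.startswith("yes"))
    return grades

def stub_model(prompt):
    """Stand-in for an LLM call: says yes when the key appears in the proposed answer."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    matched = fields["Correct answer"].lower() in fields["Proposed answer"].lower()
    return "yes" if matched else "no"

grades = grade_responses(
    stub_model,
    questions=["What is 2 + 2?", "What is the capital of France?"],
    responses=["The answer is 4", "Lyon"],
    answers=["4", "Paris"],
)
```

In practice ask_model would wrap a real chat model, and the resulting booleans serve as correctness labels when evaluating hallucination detection performance.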

4. Option to provide additional context to LLM judges

Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.

5. New datasets available with `load_example_dataset`

The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.

6. `uqlm.nli` sub-package

Created uqlm.nli sub-package that contains the following:

NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations are respectively moved to uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes.
SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)

7. `uqlm.white_box` sub-package

Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:

SingleLogprobsScorer for computing scores that depend only on logprobs from one generated response: normalized probability, sequence probability, minimum probability
TopLogprobsScorer for computing scores that depend on top-K logprobs from a generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: Monte Carlo probability, CoCoA, semantic entropy, and semantic density
PTrueScorer for implementing the P(True) method

8. Minor changes & future deprecations

Renamed NLIScorer -> ConsistencyScorer and moved some methods to uqlm.nli.NLI class
normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
system_prompt and template_ques_ans are deprecated in favor of additional_context parameter
default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation
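
The relationship between normalized_probability and sequence_probability with length normalization can be checked numerically. The sketch below assumes that normalized probability means the geometric mean of token probabilities; the helper functions are illustrative and are not UQLM's own code.

```python
import math

def sequence_probability(logprobs, length_normalize=True):
    """exp of the summed token logprobs, or of their mean when length-normalized."""
    total = sum(logprobs)
    if length_normalize:
        total /= len(logprobs)
    return math.exp(total)

def geometric_mean_probability(logprobs):
    """Geometric mean of token probabilities (assumed meaning of normalized probability)."""
    probs = [math.exp(lp) for lp in logprobs]
    return math.prod(probs) ** (1 / len(probs))

def min_probability(logprobs):
    """Probability of the least likely generated token."""
    return math.exp(min(logprobs))

logprobs = [math.log(0.8), math.log(0.5), math.log(0.9)]

normalized = sequence_probability(logprobs, length_normalize=True)
raw = sequence_probability(logprobs, length_normalize=False)
geo_mean = geometric_mean_probability(logprobs)
lowest = min_probability(logprobs)
```

With length normalization the sequence score matches the geometric mean, while the raw product (0.36 in this toy case) shrinks as responses get longer, which is why the normalized variant is easier to compare across responses of different lengths.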

What's Changed

Add Semantic Density scorer by @dross20 in https://github.com/cvs-health/uqlm/pull/209
Adding HotPotQA and SimpleQA by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/210
Semantic density, docs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/212
Semantic density notebook by @dross20 in https://github.com/cvs-health/uqlm/pull/213
Semantic density by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/214
v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/220
add judge customization option by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/221
v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/223
New White Box Scorers by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/219
Diversify demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/232
Update notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/233
update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/229
Update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/234
Llm grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/238
update demo notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/239
Refactor: NLI Subpackage II by @mohitcek in https://github.com/cvs-health/uqlm/pull/237
Feature: Integrate SemanticEntropy and SemanticDensity methods with WhiteBoxUQ class by @mohitcek in https://github.com/cvs-health/uqlm/pull/240
Drop python 3.9 support by @doyajii1 in https://github.com/cvs-health/uqlm/pull/242
Jmlr revisions by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/243
Bump sphinx-autodoc-typehints from 2.2.0 to 2.3.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/230
Polish notebooks and readme by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/246
Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/176
Number of semantic sets scorer by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/247
Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/241
Minor refactor + Improved test coverage by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/255
Minor refactor + updated demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/256
Update scorer definitions + fix logprobs bug in SemanticEntropy by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/257
Reuse NLI Scores by @mohitcek in https://github.com/cvs-health/uqlm/pull/260
Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/258
Allow for torch.device in WhitBoxUQ by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/261
Fix logprob bug by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/262
Update docs site by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/259
Release PR: v0.4.0 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/263

New Contributors

@dross20 made their first contribution in https://github.com/cvs-health/uqlm/pull/209

Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.3.1...v0.4.0

Breaking Changes

Dropped support for Python 3.9
Renamed `NLIScorer` to `ConsistencyScorer` and moved related methods to `uqlm.nli.NLI`
Deprecated `normalized_probability` scorer in favor of `sequence_probability` with `length_normalize` (removal planned in v0.5)
Deprecated `system_prompt` and `template_ques_ans`; use `additional_context` instead
