UQLM

v0.4.0 Breaking

This release includes 4 breaking changes for platform teams planning a safe upgrade.

✓ No known CVEs patched

Topics

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Summary

AI summary

Broad release that adds 11 new scorers (9 white-box, 2 black-box), a new LLMGrader utility class, the uqlm.nli and uqlm.white_box sub-packages, and tutorials covering more LLMs and datasets.

Full changelog

Highlights

1. Varied tutorials for more model and dataset coverage

We have updated the example notebooks to cover a broader range of LLMs and example datasets.

LLMs

  • Gemini models
  • GPT-4* models
  • o3-mini
  • Qwen
  • Mistral
  • Llama
  • DeepSeek

Datasets

  • GSM8K
  • SVAMP
  • PopQA
  • NQ-Open
  • AI2-ARC
  • CSQA
  • SimpleQA
  • HotpotQA
  • Image (multimodal demo)

2. New scorers added

This release adds 11 new scorers across several categories, each with accompanying unit tests. Details are provided below.

White-Box scorers

We are offering 9 new white-box scorers with this release. These scorers can be implemented with WhiteBoxUQ by specifying the respective scorer names in the scorers list. The length_normalize parameter determines whether response probabilities are length-normalized for the sampling-based white-box scorers.

  • Single-generation white-box scorers
  • Sampling-based white-box scorers
  • Reflexive white-box scorers
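
To make the effect of length normalization concrete, here is a minimal sketch in plain Python. It does not call UQLM; the function names and the token log-probabilities are illustrative assumptions, and the formulas are the common textbook ones (joint probability versus the geometric mean of token probabilities), which may differ in detail from UQLM's implementation.

```python
import math
from typing import Sequence


def sequence_probability(logprobs: Sequence[float], length_normalize: bool = True) -> float:
    """Response probability from token log-probabilities.

    With length_normalize=True this is the geometric mean of the token
    probabilities; otherwise it is their product (the joint probability).
    """
    if not logprobs:
        raise ValueError("logprobs must contain at least one token")
    total = sum(logprobs)
    if length_normalize:
        total /= len(logprobs)
    return math.exp(total)


def min_probability(logprobs: Sequence[float]) -> float:
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("logprobs must contain at least one token")
    return math.exp(min(logprobs))


# Hypothetical token log-probabilities for a four-token response.
token_logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.9)]

joint = sequence_probability(token_logprobs, length_normalize=False)
normalized = sequence_probability(token_logprobs, length_normalize=True)
weakest = min_probability(token_logprobs)
print(f"joint={joint:.4f} normalized={normalized:.4f} min={weakest:.4f}")
```

Without normalization, longer responses are penalized simply for having more tokens; the geometric mean removes that length dependence, which is why the two values can rank responses differently.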

Black-Box scorers

We are implementing two new black-box scorers with this release. They can be specified using the scorers parameter in BlackBoxUQ.

Definitions of the new scorers, typeset in LaTeX, are provided at the end of the applicable tutorial notebooks. We have also added new tutorial notebooks for Semantic Density and multi-generation white-box scorers, and updated the README to reflect the new scorers.

3. New LLMGrader class and updated default grader for UQEnsemble

This release includes a new utility class uqlm.utils.grader.LLMGrader which is instantiated from a BaseChatModel and grades LLM responses against an answer key. This class appears:

  • in the example notebooks for evaluating hallucination detection performance.
  • as the updated default grader, replacing vectara/hallucination_evaluation_model, as that model is now gated.

4. Option to provide additional context to LLM judges

Users can now pass additional instructions to their LLM judges by using the additional_context parameter in the constructor of LLMPanel.

5. New datasets available with load_example_dataset

The utility function load_example_dataset now offers HotpotQA and SimpleQA datasets.

6. uqlm.nli sub-package

Created uqlm.nli sub-package that contains the following:

  • NLI class for NLI scoring only. Semantic entropy and noncontradiction calculations are respectively moved to uqlm.scorers.SemanticEntropy and uqlm.black_box.ConsistencyScorer classes.
  • SemanticClusterer class for semantic clustering (used for semantic entropy, semantic density, and number of semantic sets)
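
As a rough illustration of what semantic clustering involves, the sketch below groups sampled responses by bidirectional entailment and then derives the number of semantic sets and a discrete semantic entropy from the cluster sizes. It does not use uqlm.nli: all function names are hypothetical, and `toy_entails` is a stand-in for a real NLI model that treats two responses as mutually entailing only when they match after simple normalization.

```python
import math
from typing import Callable, List


def cluster_by_entailment(
    responses: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: a response joins the first cluster whose
    representative it entails and is entailed by; otherwise it starts a new one."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, response) and entails(response, representative):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def discrete_semantic_entropy(clusters: List[List[str]]) -> float:
    """Shannon entropy of the empirical distribution over clusters."""
    total = sum(len(cluster) for cluster in clusters)
    proportions = [len(cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in proportions)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model (assumption): case-insensitive match
    after stripping surrounding spaces and periods."""
    return premise.lower().strip(" .") == hypothesis.lower().strip(" .")


# Hypothetical sampled responses to the same question.
samples = ["Paris", "paris.", "Lyon", "Paris"]
clusters = cluster_by_entailment(samples, toy_entails)
num_semantic_sets = len(clusters)
entropy = discrete_semantic_entropy(clusters)
print(num_semantic_sets, round(entropy, 4))
```

A single cluster gives zero entropy (the samples agree), while many small clusters push the entropy up, signalling that the model's answers are semantically inconsistent.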

7. uqlm.white_box sub-package

Created uqlm.white_box sub-package that contains four classes for white-box computations from logprobs:

  • SingleLogprobsScorer for computing scores that depend on only logprobs from one generated response: normalized probability, sequence probability, minimum probability
  • TopLogprobsScorer for computing scores that depend on top-K logprobs from a generated response: mean top-k token negentropy, min top-k token negentropy, and likelihood margin
  • SampledLogprobsScorer for computing scores that depend on logprobs from multiple sampled responses: Monte Carlo probability, CoCoA, semantic entropy, and semantic density
  • PTrueScorer for implementing the P(True) method

8. Minor changes & future deprecations

  • Renamed NLIScorer -> ConsistencyScorer and moved some methods to uqlm.nli.NLI class
  • normalized_probability scorer name in WhiteBoxUQ will be deprecated in v0.5 in favor of sequence_probability with length_normalize. The default scorers of WhiteBoxUQ will be scorers=["min_probability", "sequence_probability"]. The default value of length_normalize=True will apply to sequence_probability, so that it returns what normalized_probability currently returns.
  • system_prompt and template_ques_ans are deprecated in favor of additional_context parameter
  • default grader in UQEnsemble.tune now uses LLMGrader with the user-provided LLM used for generation
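
One way to prepare for the normalized_probability deprecation is to translate legacy scorer lists ahead of time. The helper below is a hypothetical sketch, not part of UQLM: it only rewrites a list of scorer names into the replacement described above, and it assumes that `scorers` and `length_normalize` are passed together as keyword arguments to WhiteBoxUQ.

```python
from typing import Dict, List, Union

LEGACY_NAME = "normalized_probability"
REPLACEMENT_NAME = "sequence_probability"


def migrate_white_box_scorers(scorers: List[str]) -> Dict[str, Union[List[str], bool]]:
    """Translate a legacy scorers list into equivalent keyword arguments.

    Hypothetical helper: replaces the deprecated name with
    sequence_probability and requests length normalization so the
    returned values match what normalized_probability produced.
    """
    if LEGACY_NAME not in scorers:
        # Nothing to migrate; return a copy of the list unchanged.
        return {"scorers": list(scorers)}
    if REPLACEMENT_NAME in scorers:
        # A single length_normalize flag cannot express both the
        # normalized and the unnormalized variant at once.
        raise ValueError(
            f"{LEGACY_NAME!r} and {REPLACEMENT_NAME!r} cannot be migrated together"
        )
    migrated = [REPLACEMENT_NAME if name == LEGACY_NAME else name for name in scorers]
    return {"scorers": migrated, "length_normalize": True}


legacy = ["min_probability", "normalized_probability"]
kwargs = migrate_white_box_scorers(legacy)
print(kwargs)
```

The resulting dictionary could then be unpacked into the constructor call, provided the assumption about the keyword arguments holds for the installed version.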

What's Changed

  • Add Semantic Density scorer by @dross20 in https://github.com/cvs-health/uqlm/pull/209
  • Adding HotPotQA and SimpleQA by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/210
  • Semantic density, docs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/212
  • Semantic density notebook by @dross20 in https://github.com/cvs-health/uqlm/pull/213
  • Semantic density by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/214
  • v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/220
  • add judge customization option by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/221
  • v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/223
  • New White Box Scorers by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/219
  • Diversify demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/232
  • Update notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/233
  • update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/229
  • Update demo notebooks by @zeya30 in https://github.com/cvs-health/uqlm/pull/234
  • Llm grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/238
  • update demo notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/239
  • Refactor: NLI Subpackage II by @mohitcek in https://github.com/cvs-health/uqlm/pull/237
  • Feature: Integrate SemanticEntropy and SemanticDensity methods with WhiteBoxUQ class by @mohitcek in https://github.com/cvs-health/uqlm/pull/240
  • Drop python 3.9 support by @doyajii1 in https://github.com/cvs-health/uqlm/pull/242
  • Jmlr revisions by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/243
  • Bump sphinx-autodoc-typehints from 2.2.0 to 2.3.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/230
  • Polish notebooks and readme by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/246
  • Bump pytest-cov from 6.3.0 to 7.0.0 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/176
  • Number of semantic sets scorer by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/247
  • Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/241
  • Minor refactor + Improved test coverage by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/255
  • Minor refactor + updated demos by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/256
  • Update scorer definitions + fix logprobs bug in SemanticEntropy by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/257
  • Reuse NLI Scores by @mohitcek in https://github.com/cvs-health/uqlm/pull/260
  • Improve unit tests code coverage by @zeya30 in https://github.com/cvs-health/uqlm/pull/258
  • Allow for torch.device in WhitBoxUQ by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/261
  • Fix logprob bug by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/262
  • Update docs site by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/259
  • Release PR: v0.4.0 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/263

New Contributors

  • @dross20 made their first contribution in https://github.com/cvs-health/uqlm/pull/209

Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.3.1...v0.4.0

Breaking Changes

  • Dropped support for Python 3.9
  • Renamed `NLIScorer` to `ConsistencyScorer` and moved related methods to `uqlm.nli.NLI`
  • Deprecated `normalized_probability` scorer in favor of `sequence_probability` with `length_normalize` (removal planned in v0.5)
  • Deprecated `system_prompt` and `template_ques_ans`; use `additional_context` instead
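
Before upgrading, it can help to scan a codebase for the names listed above. The script below is a minimal sketch rather than an official migration tool: the function names are hypothetical, the scan is a plain text search that can produce false positives (`system_prompt` in particular is a common identifier outside UQLM), and the version check assumes Python 3.10 is the new minimum now that 3.9 has been dropped.

```python
import re
import sys
from typing import Dict, List, Sequence

# Names affected by this release, with the action suggested in the notes.
AFFECTED_NAMES: Dict[str, str] = {
    "NLIScorer": "renamed to ConsistencyScorer",
    "normalized_probability": "use sequence_probability with length_normalize",
    "system_prompt": "use additional_context",
    "template_ques_ans": "use additional_context",
}


def find_affected_names(source: str) -> Dict[str, List[int]]:
    """Map each affected name to the 1-based line numbers where it appears."""
    hits: Dict[str, List[int]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name in AFFECTED_NAMES:
            # Whole-word match so that e.g. my_system_prompt_v2 is ignored.
            if re.search(rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])", line):
                hits.setdefault(name, []).append(lineno)
    return hits


def python_is_supported(version: Sequence[int] = sys.version_info) -> bool:
    """True when the interpreter is at least 3.10 (assumed new minimum)."""
    return tuple(version[:2]) >= (3, 10)


# Hypothetical legacy snippet; it is only scanned as text, never executed.
legacy_snippet = """\
scorer = NLIScorer()
scorers = ["min_probability", "normalized_probability"]
"""

report = find_affected_names(legacy_snippet)
for name, lines in report.items():
    print(f"{name} (lines {lines}): {AFFECTED_NAMES[name]}")
print("Python supported:", python_is_supported())
```

In practice the same function could be applied to each `.py` file and notebook in a repository, with the report reviewed by hand before any changes are made.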
