UQLM

v0.5.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

✓ No known CVEs patched

Topics

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Affected surfaces

deps

Summary

AI summary

Adds long-form UQ methods, other changes, and one breaking change; the release notes link to https://arxiv.org/abs/2403.20279.

Full changelog

New Methods: Long-Form UQ

Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.

Response Decomposition

We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
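As a rough illustration of what rule-based sentence decomposition can look like, the sketch below splits a response at sentence-ending punctuation. The function name `split_into_sentences` and the regular expression are assumptions made for this example; this is not the ResponseDecomposer implementation, and it does not handle abbreviations such as "Dr." correctly.

```python
import re

# Hypothetical rule for illustration only: a sentence boundary is whitespace
# that follows '.', '!' or '?' and precedes an uppercase letter or a digit.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_into_sentences(response: str) -> list[str]:
    """Split a response into sentence-level units with a simple regex rule."""
    parts = _SENTENCE_BOUNDARY.split(response.strip())
    # Drop empty fragments that can appear for blank input.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    text = "Marie Curie was born in Warsaw. She won two Nobel Prizes! Was she the first? Yes."
    for sentence in split_into_sentences(text):
        print(sentence)
```

Claim-level decomposition, by contrast, relies on an LLM and is not sketched here.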

Scoring methods

We add four families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, Unit-QA, and Graph-Based.

1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)

These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
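The averaging step can be sketched as follows. This is a minimal illustration and not the LongTextUQ implementation: `toy_entailment` is a hypothetical token-overlap stand-in for an NLI model, and `unit_response_scores` is a name chosen for this example.

```python
from statistics import mean


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: fraction of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(tok in premise_tokens for tok in hypothesis_tokens) / len(hypothesis_tokens)


def unit_response_scores(units, sampled_responses, entail=toy_entailment):
    """For each unit, average its entailment score over all sampled responses.

    Raises statistics.StatisticsError if sampled_responses is empty.
    """
    return [
        mean(entail(sample, unit) for sample in sampled_responses)
        for unit in units
    ]


if __name__ == "__main__":
    units = ["paris is the capital", "it has ten million residents"]
    samples = ["paris is the capital of france", "lyon is the capital"]
    print(unit_response_scores(units, samples))
```

Any callable that maps a (premise, hypothesis) pair to a score in [0, 1] can replace `toy_entailment`.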

2. Matched-Unit (Based on the LUQ-pair method)

These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
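The match-then-score flow can be sketched as below. The sketch assumes sampled responses have already been decomposed into units, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and takes the entailment function as a parameter; all function names are hypothetical and not part of the UQLM API.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(unit: str, candidate_units: list[str]) -> str:
    """Return the candidate unit most similar to the original unit."""
    return max(candidate_units, key=lambda cand: similarity(unit, cand))


def matched_unit_scores(units, sampled_units, entail):
    """Score each original unit against its best match in every sampled response.

    sampled_units holds one list of units per sampled response.
    entail(premise, hypothesis) returns a score in [0, 1].
    """
    scores = []
    for unit in units:
        per_sample = [
            entail(best_match(unit, candidates), unit)
            for candidates in sampled_units
            if candidates  # skip sampled responses that produced no units
        ]
        scores.append(mean(per_sample) if per_sample else 0.0)
    return scores


if __name__ == "__main__":
    # Toy entailment for the demo: exact match after lowercasing.
    exact = lambda premise, hypothesis: float(premise.lower() == hypothesis.lower())
    sampled = [["grass is green", "the sky is blue"], ["the sky is red", "water is wet"]]
    print(matched_unit_scores(["the sky is blue"], sampled, exact))
```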

3. Unit-QA (Based on the Longform Semantic Entropy method)

These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are the claims given context, sampling multiple answers to each question, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.
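The final step, scoring agreement across sampled answers, can be illustrated with a discrete entropy over answer clusters. In this sketch, normalized string equality stands in for semantic clustering, and the helper names are hypothetical; it is not the LongTextQA implementation, and question generation and answer sampling (both LLM calls) are assumed to have happened already.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim surrounding spaces and periods, and collapse whitespace."""
    return " ".join(answer.lower().strip(" .").split())


def answer_entropy_confidence(answers: list[str]) -> float:
    """Return 1 - H / log(N), where H is the entropy over answer clusters.

    1.0 means all sampled answers agree; 0.0 means every answer is different.
    With fewer than two answers there is no disagreement signal, so this
    sketch returns 1.0 by convention.
    """
    n = len(answers)
    if n <= 1:
        return 1.0
    counts = Counter(normalize(a) for a in answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)


if __name__ == "__main__":
    # One entry per claim: the answers sampled for the question generated from it.
    sampled_answers = {
        "Curie was born in Warsaw.": ["Warsaw", "warsaw.", "Warsaw"],
        "Curie won three Nobel Prizes.": ["two", "three", "one"],
    }
    for claim, answers in sampled_answers.items():
        print(claim, round(answer_entropy_confidence(answers), 3))
```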

4. Graph-Based (Based on Jiang et al., 2024)

Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.
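One simple centrality metric on such a bipartite graph is degree centrality: the fraction of responses that entail a given claim. The sketch below assumes claims have already been extracted per response and takes the entailment check as a callable; the names are hypothetical, and the metrics actually computed by LongTextGraph may differ.

```python
def claim_degree_centrality(claims_per_response, entails):
    """Degree centrality of each unique claim in the claim-response graph.

    claims_per_response: one list of claims per response (original and samples).
    entails(response_index, claim) -> bool marks an edge between that response
    and the claim.
    """
    # Union of unique claims across all responses, preserving first-seen order.
    unique_claims = list(
        dict.fromkeys(c for claims in claims_per_response for c in claims)
    )
    n_responses = len(claims_per_response)
    return {
        claim: sum(bool(entails(i, claim)) for i in range(n_responses)) / n_responses
        for claim in unique_claims
    }


if __name__ == "__main__":
    responses = [["a", "b"], ["a"], ["a", "c"]]
    # Toy entailment for the demo: a response entails a claim it contains.
    contains = lambda i, claim: claim in responses[i]
    print(claim_degree_centrality(responses, contains))
```

Claims supported by most responses receive scores near 1, while claims that appear in only one response receive low scores.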

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.

Response Refinement with Uncertainty Aware Decoding

Response refinement drops claims whose confidence score (from the scorer specified with the claim_filtering_scorer parameter) falls below a threshold (specified with the response_refinement_threshold parameter) and reconstructs the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
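The filtering step can be sketched in a few lines. The function `refine_response` is hypothetical, and joining the retained claims with spaces is a simplification assumed for this example; the reconstruction step in the library may work differently.

```python
def refine_response(claims, confidence_scores, threshold):
    """Keep claims whose confidence is at or above the threshold and rejoin them."""
    if len(claims) != len(confidence_scores):
        raise ValueError("claims and confidence_scores must have the same length")
    retained = [
        claim for claim, score in zip(claims, confidence_scores) if score >= threshold
    ]
    # Simplified reconstruction: concatenate the retained claims in order.
    return " ".join(retained)


if __name__ == "__main__":
    claims = [
        "Curie was born in Warsaw.",
        "Curie won three Nobel Prizes.",
        "Curie studied radioactivity.",
    ]
    scores = [0.95, 0.20, 0.88]
    print(refine_response(claims, scores, threshold=0.5))
```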

Performance Evaluation

We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a generated response to a FactScore question against the corresponding text of the subject's Wikipedia article.
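Once each unit has been judged, a FactScore-style grade is the fraction of units marked as supported. The sketch below assumes a hypothetical `is_supported` judge; in the release this judgment comes from an LLM comparing the unit to the article text, while here a substring check stands in so the example runs on its own.

```python
def factscore(units, is_supported):
    """Fraction of units that the judge marks as supported by the reference text."""
    if not units:
        return 0.0
    return sum(1 for unit in units if is_supported(unit)) / len(units)


if __name__ == "__main__":
    reference = "ada lovelace was born in london in 1815"
    # Toy judge for the demo: a unit is supported if it appears verbatim.
    judge = lambda unit: unit.lower() in reference
    print(factscore(["born in london", "born in paris"], judge))
```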

New docs site pages

We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.

Other changes

  • uqlm.scorers has been refactored into two subfolders: uqlm.scorers.shortform, which contains the scorer classes that existed as of v0.4, and uqlm.scorers.longform, which contains the classes implementing the scoring methods described above
  • the README has been updated to reflect the new long-form scorers, and a new README has been added inside the examples/ directory with more details on the available tutorials
  • various package upgrades to address security vulnerabilities identified by Dependabot

Breaking changes

  • normalized_probability has been deprecated from the list of accepted white-box scorers in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (default). This also affects the key/column names in the returned UQResult object.
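To make the relationship between the two scorers concrete, the sketch below computes a sequence probability from token log-probabilities with and without length normalization, assuming length normalization means taking the geometric mean of token probabilities. It also shows a small helper for updating scorer lists. Both functions are hypothetical illustrations and not UQLM code; only the scorer names and the length_normalize flag come from the release notes.

```python
import math


def sequence_probability(token_logprobs, length_normalize=True):
    """Probability of a token sequence from its per-token log-probabilities.

    With length_normalize=True the summed log-probability is divided by the
    number of tokens, which yields the geometric mean of token probabilities.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)
    return math.exp(total)


def migrate_scorer_names(scorers):
    """Replace the deprecated scorer name in a list of scorer names."""
    return [
        "sequence_probability" if name == "normalized_probability" else name
        for name in scorers
    ]


if __name__ == "__main__":
    logprobs = [math.log(0.5), math.log(0.5)]
    print(sequence_probability(logprobs))                          # length-normalized
    print(sequence_probability(logprobs, length_normalize=False))  # raw product
    print(migrate_scorer_names(["normalized_probability", "min_probability"]))
```

Code that reads scores from a UQResult by key or column name will also need updating, since those names change with the scorer name.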

What's Changed

  • v0.3 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/197
  • LLM based NLI + ResponseDecomposer upgrades + restructured prompts by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/199
  • Minor refactor by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/201
  • add aggregation method by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/202
  • Add mode, granularity parameters in place of scorers by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/204
  • Long-form Semantic Entropy by @mohitcek in https://github.com/cvs-health/uqlm/pull/203
  • add factscore grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/207
  • Enable more granular score return by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/208
  • Binary style for NLI class by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/206
  • update grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/215
  • Longform Feature: evaluate method to compute semantic entropy by @mohitcek in https://github.com/cvs-health/uqlm/pull/217
  • Refactor ClaimQA class by @mohitcek in https://github.com/cvs-health/uqlm/pull/218
  • Patch/v0.3.1 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/225
  • v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/224
  • update question template by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/227
  • Feat: ClaimQA class - multiple questions per factoid/claim by @mohitcek in https://github.com/cvs-health/uqlm/pull/228
  • Claimqa updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/235
  • v0.4.4 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/279
  • Merge develop -> longform UQ branch by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/282
  • v0.4.5 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/286
  • LongForm UQ by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/283
  • Created new directories for short-form and long-form responses by @mohitcek in https://github.com/cvs-health/uqlm/pull/288
  • Refactor uqlm.scorers for shorform vs. longform parent classes by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/289
  • Issue #244 - Added Scorer Definitions on Docs Site by @vgyani in https://github.com/cvs-health/uqlm/pull/287
  • Add long-text definition to docs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/298
  • Rearrange subpackages by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/300
  • Rename modules, add UAD scorer specification by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/304
  • Update notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/308
  • Graph based long-form scoring by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/307
  • Fix links and test by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/309
  • Add new unit tests by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/310
  • update uad graphics by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/311
  • update luq graphic and version by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/313
  • add qa unit test by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/314
  • Minor release: v0.5.0 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/315

Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.4.5...v0.5.0
