UQLM

v0.5.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 6mo AI Agents & Assistants

✓ No known CVEs patched

Topics

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Affected surfaces

deps

Summary

AI summary

This mixed release adds long-form uncertainty quantification methods, includes one breaking change and several other changes, and references https://arxiv.org/abs/2403.20279.

Full changelog

New Methods: Long-Form UQ

Short-form UQ methods have been shown to generalize poorly to long-form LLM outputs. Fine-grained methods for long-form UQ address these limitations by first decomposing responses into granular units (sentences or claims) and then scoring each unit.

Response Decomposition

We enable decomposition of responses into sentences or claims using our ResponseDecomposer class. This class implements claim decomposition using an LLM or sentence decomposition using a rule-based approach.
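To make the rule-based path concrete, here is a minimal sketch of sentence decomposition. It does not use the ResponseDecomposer API; the function name split_into_sentences and the splitting rule (break after '.', '!' or '?' followed by whitespace) are assumptions chosen for illustration, and the library's actual rules may differ.

```python
import re


def split_into_sentences(response: str) -> list[str]:
    """Hypothetical rule-based splitter (not the uqlm implementation).

    Splits on whitespace that directly follows '.', '!' or '?'.
    """
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    # Drop empty fragments, e.g. when the input is an empty string.
    return [part for part in parts if part]


units = split_into_sentences("Paris is in France. It is the capital! Is it large?")
print(units)
```

A rule this simple will mis-split abbreviations such as "Dr. Smith", which is one reason LLM-based claim decomposition is offered as an alternative.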

Scoring methods

We add four families of fine-grained scorers for long-form uncertainty quantification: Unit-Response, Matched-Unit, Unit-QA, and Graph-Based.

1. Unit-Response (Based on the LUQ/LUQ-Atomic methods)

These scorers measure whether sampled responses entail each unit (sentence or claim) in the original response and average across sampled responses to obtain unit-level confidence scores. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
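The averaging step can be sketched as follows. This is an illustration of the idea only, not the LongTextUQ implementation: unit_response_scores and toy_entails are hypothetical names, and the substring check stands in for a real NLI model or LLM judge.

```python
from statistics import mean
from typing import Callable


def unit_response_scores(
    units: list[str],
    sampled_responses: list[str],
    entails: Callable[[str, str], float],
) -> list[float]:
    """For each unit, average the entailment score over all sampled responses."""
    return [
        mean([entails(response, unit) for response in sampled_responses])
        for unit in units
    ]


def toy_entails(response: str, unit: str) -> float:
    """Stand-in for an NLI model (assumption): 1.0 if the unit appears verbatim."""
    return 1.0 if unit.lower() in response.lower() else 0.0


units = ["Paris is the capital of France", "Paris has 10 million residents"]
samples = [
    "Paris is the capital of France.",
    "I believe Paris is the capital of France and it is lovely.",
    "Paris has 10 million residents, and Paris is the capital of France.",
]
scores = unit_response_scores(units, samples, toy_entails)
print(scores)
```

In this toy run the first unit is supported by every sample, while the second is supported by only one of three, so it receives a lower confidence score.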

2. Matched-Unit (Based on the LUQ-pair method)

These scorers work by matching each original sentence or claim to its most similar counterpart in sampled responses before computing entailment scores. Matched scores are then averaged across sampled responses to obtain a confidence score for each unit in the original response. This is implemented with the uqlm.scorers.longform.LongTextUQ class.
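A rough sketch of the match-then-score idea, under stated assumptions: the sampled responses are already decomposed into units, difflib's SequenceMatcher stands in for the similarity measure, and a token-overlap ratio stands in for entailment. None of these names or choices come from the uqlm codebase.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_entails(premise: str, hypothesis: str) -> float:
    """Stand-in for NLI (assumption): share of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return 0.0
    return len(hypothesis_tokens & premise_tokens) / len(hypothesis_tokens)


def matched_unit_scores(units: list[str], sampled_units: list[list[str]]) -> list[float]:
    """Match each original unit to its closest unit in every sample, then average."""
    scores = []
    for unit in units:
        per_sample = []
        for candidates in sampled_units:
            if not candidates:  # a sample with no units offers no support
                per_sample.append(0.0)
                continue
            best = max(candidates, key=lambda c: similarity(c, unit))
            per_sample.append(toy_entails(best, unit))
        scores.append(mean(per_sample))
    return scores


result = matched_unit_scores(
    ["the sky is blue"],
    [["the sky is blue", "grass is green"], ["the sky is red", "grass is green"]],
)
print(result)
```

Matching first means each unit is compared against the single most relevant statement in a sample rather than against the whole sampled response.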

3. Unit-QA (Based on the Longform Semantic Entropy method)

These scorers work by decomposing a response into granular units (sentences or claims), generating questions whose answers are the claims given context, sampling multiple answers, and computing black-box UQ scores across these answers. This is implemented with the uqlm.scorers.longform.LongTextQA class.

4. Graph-Based (Based on Jiang et al., 2024)

Graph-based scorers decompose original and sampled responses into claims, obtain the union of unique claims across all responses, and compute graph centrality metrics on the bipartite graph of claim-response entailment to measure uncertainty. This is implemented with the uqlm.scorers.longform.LongTextGraph class.
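As a simplified illustration, the sketch below computes one such metric, degree centrality, on the claim side of the bipartite graph. It assumes entailment has already been resolved into one set of supported claims per response; claim_degree_centrality is a hypothetical helper, not part of LongTextGraph, and the library may offer other centrality metrics.

```python
def claim_degree_centrality(claim_sets: list[set[str]]) -> dict[str, float]:
    """Degree centrality of each claim node in a claim-response bipartite graph.

    claim_sets holds, for each response (original and sampled), the set of
    claims that response entails. A claim's centrality is the fraction of
    responses connected to it.
    """
    all_claims = set().union(*claim_sets)  # union of unique claims
    n_responses = len(claim_sets)
    return {
        claim: sum(claim in supported for supported in claim_sets) / n_responses
        for claim in sorted(all_claims)
    }


centrality = claim_degree_centrality([{"A", "B"}, {"A"}, {"A", "C"}])
print(centrality)
```

Claims entailed by most responses sit at the center of the graph and score high; claims supported by a single response score low and are the likelier hallucinations.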

These scorer classes all share the same parent class: uqlm.scorers.longform.baseclass.LongFormUQ.

Response Refinement with Uncertainty Aware Decoding

Response refinement works by dropping claims with confidence scores (specified with claim_filtering_scorer parameter) below a specified threshold (specified with response_refinement_threshold parameter) and reconstructing the response from the retained claims. This functionality is available in combination with any of the four methods described above by setting response_refinement=True in the constructor of the corresponding scorer class.
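The filtering step can be sketched in a few lines. This is an assumption-laden illustration: refine_response is a hypothetical function, and joining the retained claims with spaces is a placeholder for whatever reconstruction the library performs.

```python
def refine_response(claims: list[str], scores: list[float], threshold: float) -> str:
    """Drop claims scoring below the threshold and rebuild a response from the rest."""
    retained = [claim for claim, score in zip(claims, scores) if score >= threshold]
    # Placeholder reconstruction (assumption): simply join the retained claims.
    return " ".join(retained)


claims = ["Paris is the capital of France.", "Paris has 10 million residents."]
confidence = [0.95, 0.30]
refined = refine_response(claims, confidence, threshold=0.5)
print(refined)
```

Raising the threshold trades completeness for reliability: fewer claims survive, but the ones that do carry higher confidence.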

Performance Evaluation

We enable FactScore-based grading using an LLM. This works by comparing units (sentences or claims) in a generated response to a FactScore question against the corresponding text of the subject's Wikipedia article.

New docs site pages

We have added a "Scorer Definitions" tab to the docs site, intended to serve as an 'encyclopedia' of available scoring methods. It provides formal definitions, explanations in simple terms, and code snippets for all available methods.

Other changes

uqlm.scorers has been refactored into two subfolders: uqlm.scorers.shortform, which contains the scorer classes that existed as of v0.4, and uqlm.scorers.longform, which contains the classes implementing the scoring methods described above.
The README has been updated to reflect the new long-form scorers, and a new README has been added inside the examples/ directory with more detail on the available tutorials.
Various packages have been upgraded to address security vulnerabilities identified by Dependabot.

Breaking changes

normalized_probability has been deprecated from the list of acceptable white-box scorers in WhiteBoxUQ and UQEnsemble in favor of sequence_probability with length_normalize=True (the default). This also affects the key/column names in the returned UQResult object.
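For teams with scorer names stored in configuration, a small migration helper along these lines may ease the upgrade. It is a sketch only: migrate_scorer_names is hypothetical, it touches plain strings rather than any uqlm object, and it assumes the replacement key is spelled sequence_probability as in the note above. Check the actual column names in your UQResult output after upgrading.

```python
# Mapping from the deprecated scorer name to its replacement (per the note above).
RENAMED_SCORERS = {"normalized_probability": "sequence_probability"}


def migrate_scorer_names(names: list[str]) -> list[str]:
    """Rewrite deprecated scorer names, preserving order and removing duplicates."""
    renamed = [RENAMED_SCORERS.get(name, name) for name in names]
    return list(dict.fromkeys(renamed))


# "min_probability" is just an example of a name that is left untouched.
old_config = ["normalized_probability", "min_probability"]
new_config = migrate_scorer_names(old_config)
print(new_config)
```

The same mapping can be applied to any downstream code that reads result columns by name, since those keys change along with the scorer name.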

What's Changed

v0.3 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/197
LLM based NLI + ResponseDecomposer upgrades + restructured prompts by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/199
Minor refactor by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/201
add aggregation method by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/202
Add mode, granularity parameters in place of scorers by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/204
Long-form Semantic Entropy by @mohitcek in https://github.com/cvs-health/uqlm/pull/203
add factscore grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/207
Enable more granular score return by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/208
Binary style for NLI class by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/206
update grader by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/215
Longform Feature: evaluate method to compute semantic entropy by @mohitcek in https://github.com/cvs-health/uqlm/pull/217
Refactor ClaimQA class by @mohitcek in https://github.com/cvs-health/uqlm/pull/218
Patch/v0.3.1 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/225
v0.3.1 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/224
update question template by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/227
Feat: ClaimQA class - multiple questions per factoid/claim by @mohitcek in https://github.com/cvs-health/uqlm/pull/228
Claimqa updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/235
v0.4.4 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/279
Merge develop -> longform UQ branch by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/282
v0.4.5 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/286
LongForm UQ by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/283
Created new directories for short-form and long-form responses by @mohitcek in https://github.com/cvs-health/uqlm/pull/288
Refactor uqlm.scorers for shorform vs. longform parent classes by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/289
Issue #244 - Added Scorer Definitions on Docs Site by @vgyani in https://github.com/cvs-health/uqlm/pull/287
Add long-text definition to docs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/298
Rearrange subpackages by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/300
Rename modules, add UAD scorer specification by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/304
Update notebooks by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/308
Graph based long-form scoring by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/307
Fix links and test by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/309
Add new unit tests by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/310
update uad graphics by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/311
update luq graphic and version by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/313
add qa unit test by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/314
Minor release: v0.5.0 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/315

Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.4.5...v0.5.0
