UQLM

v0.3.0 Breaking

This release is flagged as containing breaking changes. Platform teams should review them before upgrading.

Published 9mo AI Agents & Assistants

✓ No known CVEs patched

Topics

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Summary

AI summary

A mixed release that includes bug fixes, a benchmark dataset extension, and LLM judge explanations, among other changes.

Full changelog

1. Dataset-specific confidence score calibration

- Introduced the new ScoreCalibrator class for calibrating confidence scores on specific datasets, using either Platt scaling or isotonic regression
- Includes the evaluate_calibration function, which evaluates score calibration with plots and metrics including ECE, MCE, Brier score, calibration gap, and log-loss
- For a detailed walkthrough of this feature, please refer to the demo notebook
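
To make the calibration metrics named above concrete, here is a minimal, standard-library sketch of ECE, MCE, Brier score, and log-loss. It is not UQLM's implementation: the function names, the equal-width binning, and the toy data are assumptions for illustration, and evaluate_calibration may bin or define these metrics differently.

```python
import math


def brier_score(confidences, labels):
    """Mean squared error between each confidence and its 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def log_loss(confidences, labels, eps=1e-12):
    """Average negative log-likelihood; confidences are clipped away from 0 and 1."""
    total = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1 - eps)
        total -= y * math.log(c) + (1 - y) * math.log(1 - c)
    return total / len(labels)


def calibration_errors(confidences, labels, n_bins=10):
    """Return (ECE, MCE) computed over equal-width confidence bins (an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))

    n = len(labels)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        gap = abs(avg_conf - accuracy)
        ece += len(members) / n * gap  # gap weighted by the bin's share of samples
        mce = max(mce, gap)            # worst gap over all non-empty bins
    return ece, mce


# Hypothetical data: confidence scores and whether each response was correct.
confidences = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

ece, mce = calibration_errors(confidences, labels)
print(f"ECE={ece:.3f} MCE={mce:.3f}")
print(f"Brier={brier_score(confidences, labels):.3f}")
print(f"log-loss={log_loss(confidences, labels):.3f}")
```

Lower values are better for all four metrics. A calibrator fitted on a held-out dataset would be expected to reduce them relative to the raw scores.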

2. Enabled use of LangChain `BaseMessage` with `prompts` argument

Added support for List[List[BaseMessage]] alongside the existing List[str] format for prompts argument of generate_and_score method in the following classes:
- UQEnsemble
- BlackBoxUQ
- WhiteBoxUQ
- SemanticEntropy
This enhancement enables uncertainty quantification and hallucination detection with:
- Multimodal inputs (e.g. image)
- Chat history
- Various message types (HumanMessage, AIMessage, SystemMessage)
Note: This feature is currently in Beta and is not compatible with LLM judges (LLMPanel or judge components of UQEnsemble)
For a detailed walkthrough of this feature, please refer to the demo notebook

3. LLM Judge explanations

Enhanced the LLMPanel class to provide explanations alongside scores
Judges can now justify their evaluations with detailed reasoning
Explanations are enabled with the boolean parameter `explanations`

4. Benchmark Dataset Extension

Added support for the FactScore benchmark dataset via the load_example_dataset function
Enables evaluation of long-form question answering capabilities in LLMs

5. Updated utility plotting functions

Added an option to plot_ranked_auc to compute AUPRC (rather than AUROC only, as before) and rank the results in a color-coded bar plot (as seen in our research paper). Also added the missing legend to this function.
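
As a rough illustration of the two metrics involved, the sketch below computes AUROC and an average-precision estimate of AUPRC for two scorers and ranks them. It does not use plot_ranked_auc and omits plotting. The scorer names, the data, and the use of average precision as the AUPRC estimate are assumptions, and the helper assumes no tied scores.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Average precision: mean of precision at the rank of each positive (no ties assumed)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical correctness labels and confidence scores from two made-up scorers.
labels = [1, 1, 0, 1, 0, 0]
scorers = {
    "scorer_a": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2],
    "scorer_b": [0.4, 0.9, 0.8, 0.3, 0.5, 0.1],
}

results = {
    name: {"auroc": auroc(s, labels), "auprc": average_precision(s, labels)}
    for name, s in scorers.items()
}

# Rank scorers from best to worst by AUPRC, as a ranked bar plot would order them.
ranking = sorted(results, key=lambda name: results[name]["auprc"], reverse=True)
for name in ranking:
    print(f"{name}: AUPRC={results[name]['auprc']:.3f} AUROC={results[name]['auroc']:.3f}")
```

AUPRC is often the more informative of the two when correct and incorrect responses are heavily imbalanced, which is one reason to report it alongside AUROC.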

6. Bug Fixes

Fixed the LiveError issue that occurred with rich progress bars when retrying after code interruption
Removed unused images for docs site
Added missing unit tests for utility plotting functions
Updated demo notebooks to use non-deprecated LLMs (gemini-1.5-flash -> gemini-2.5-flash)

What's Changed

Add score calibration by @jmabry in https://github.com/cvs-health/uqlm/pull/147
v0.2.7 updates by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/171
Feat: Integrate ScoreCalibration class to existing structure by @mohitcek in https://github.com/cvs-health/uqlm/pull/165
Confidence score calibration by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/181
Enable UQ with multimodal inputs by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/182
Bump sphinx from 7.3.7 to 7.4.7 by @dependabot[bot] in https://github.com/cvs-health/uqlm/pull/177
Removing unused images and set correct switcher json url by @doyajii1 in https://github.com/cvs-health/uqlm/pull/184
update URLs in README to use main branch by @vgyani in https://github.com/cvs-health/uqlm/pull/187
Removed a typo from black_box_demo.ipynb by @kaushik-42 in https://github.com/cvs-health/uqlm/pull/188
Update plot_ranked_auc by @zeya30 in https://github.com/cvs-health/uqlm/pull/183
Enable explanations with LLM judge scores by @NamrataWalanj7 in https://github.com/cvs-health/uqlm/pull/178
fix continuous judge output handling by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/189
Adding factscore dataset by @dskarbrevik in https://github.com/cvs-health/uqlm/pull/191
Minor release: v0.3 by @dylanbouchard in https://github.com/cvs-health/uqlm/pull/192

New Contributors

@jmabry made their first contribution in https://github.com/cvs-health/uqlm/pull/147
@vgyani made their first contribution in https://github.com/cvs-health/uqlm/pull/187
@kaushik-42 made their first contribution in https://github.com/cvs-health/uqlm/pull/188

Full Changelog: https://github.com/cvs-health/uqlm/compare/v0.2.7...v0.3.0
