cameronrye/openzim-mcp

v2.0.0a14 Breaking

This release includes breaking changes; platform teams should review them before upgrading.

Published 2mo MCP Data & Storage


✓ No known CVEs patched


Topics

kiwix mcp mcp-server openzim zim

ReleasePort's take

Light signal

editorial:auto 2mo

v2.0.0a14 enhances entity resolution for long prose queries and boosts section relevance in synthesize mode, while introducing new configuration options and refactoring internal matching logic.

Why it matters: Improves accuracy of canonical entity extraction for lengthy inputs and surfaces the most pertinent sections; new `section_affinity_threshold` and `section_affinity_boost` parameters let teams fine‑tune behavior before adopting v2.0.0a14.

Summary

AI summary

Prose questions now resolve canonical entities and lead with the most relevant section when synthesize is enabled.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	Greedy length-down tail-probe entity resolution improves long prose query handling.	llm_adapter@2026-05-21	high	None
Feature	Medium	Section-heading affinity boost in synthesize mode promotes relevant sections to lead passage.	llm_adapter@2026-05-21	high	None
Feature	Medium	Multi-round handles added to SynthesizeResponse expose candidate articles and sections for follow-up turns.	llm_adapter@2026-05-21	high	None
Feature	Medium	_boost_by_section_affinity pipeline stage computes section affinity boost based on query and heading token overlap.	llm_adapter@2026-05-21	high	None
Feature	Medium	SynthesizeConfig parameters `section_affinity_threshold` and `section_affinity_boost` added for tuning affinity boosting.	llm_adapter@2026-05-21	low	None
Feature	Medium	ConsideredArticle and ConsideredSection TypedDicts defined in `tool_schemas.py` to structure candidate handles.	llm_adapter@2026-05-21	low	None
Feature	Medium	SynthesizeResponse TypedDict now `total=False` to accommodate new optional fields without affecting existing callers.	llm_adapter@2026-05-21	low	None
Dependency	Medium	`iter_query_tails` helper introduced in `title_promotion.py` for shared trailing-token iteration.	llm_adapter@2026-05-21	low	None
Refactor	Medium	_promote_title_match removed M26 4-token short-circuit, enabling full tail probing for long queries.	llm_adapter@2026-05-21	high	None
Refactor	Medium	_promote_topic_via_title_index rewritten as two-pass strict then fuzzy tail probe to prioritize exact matches.	llm_adapter@2026-05-21	high	None

Full changelog

First post-beta-test alpha that ships a feature rather than a sweep:
natural-language prose questions now resolve to canonical entities
and (in synthesize=True mode) lead with the most relevant section
of the resolved article. Three coordinated changes:

Greedy length-down tail-probe entity resolution. A shared
iter_query_tails helper in title_promotion.py iterates the
trailing 4 → 3 → 2 → 1 tokens of a query. Both the default
_handle_tell_me_about path (via _promote_topic_via_title_index,
two-pass strict-then-fuzzy) and the synthesize path (via
_promote_title_match, single-pass strict) now probe each tail.
This replaces the M26 4-token short-circuit that previously caused
long prose queries like "who are some famous people from big
rapids, michigan" to fall through to BM25 noise instead of
resolving the canonical Big_Rapids,_Michigan entity.

Section-heading affinity boost in synthesize. A new
_boost_by_section_affinity pipeline stage runs after
_attribute_sections. For each passage carrying a #section_id,
it computes |query_tokens ∩ heading_tokens| / |heading_tokens|.
When that ratio meets SynthesizeConfig.section_affinity_threshold
(default 0.25), the passage score is multiplied by
section_affinity_boost (default 1.5) and the list is
re-sorted (with rank renumbered to match). Archive-agnostic:
the archive's own section headings supply the matching
vocabulary, with no curated synonym tables.

Multi-round handles on SynthesizeResponse. Two new optional
fields surface the candidate space:

considered_articles (top-3 article hits not featured) exposes
(archive, entry_path, title, score) so a follow-up turn can pivot
via get_zim_entries.

considered_sections (top-10 sections of the featured article,
in document order, minus the featured one) exposes
(section_id, title) so a follow-up turn can pivot via
get_section.

SynthesizeResponse switches to TypedDict(total=False) to
accommodate the additive shape; existing callers populating every
field are unaffected. Compact-mode markdown rendering of these
fields is deferred; the structured payload (structuredContent)
always carries them.
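
The affinity rule in the second change can be illustrated with a minimal sketch. This is not the project's implementation: the Passage dataclass, its field names, the helper names, and the 1-based rank are assumptions made for illustration. Only the overlap ratio, the 0.25 threshold, the 1.5 multiplier, and the re-sort with renumbered ranks come from the changelog.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional, Set

_TOKEN_RE = re.compile(r"[a-z0-9]+")


@dataclass(frozen=True)
class Passage:
    """Hypothetical passage record; the real structure is not shown in the changelog."""
    rank: int
    score: float
    heading: Optional[str]  # section heading, or None when no #section_id is attached


def _tokens(text: str) -> Set[str]:
    # Lowercase, then keep runs of [a-z0-9] as tokens.
    return set(_TOKEN_RE.findall(text.lower()))


def section_affinity(query: str, heading: str) -> float:
    """|query_tokens ∩ heading_tokens| / |heading_tokens|, or 0.0 for an empty heading."""
    heading_tokens = _tokens(heading)
    if not heading_tokens:
        return 0.0
    return len(_tokens(query) & heading_tokens) / len(heading_tokens)


def boost_by_section_affinity(
    query: str,
    passages: List[Passage],
    *,
    threshold: float = 0.25,
    boost: float = 1.5,
) -> List[Passage]:
    """Multiply qualifying scores by `boost`, re-sort, and renumber ranks from 1."""
    boosted = [
        replace(p, score=p.score * boost)
        if p.heading is not None and section_affinity(query, p.heading) >= threshold
        else p
        for p in passages
    ]
    # sorted() is stable, so passages with equal scores keep their prior order.
    ordered = sorted(boosted, key=lambda p: p.score, reverse=True)
    return [replace(p, rank=i) for i, p in enumerate(ordered, start=1)]
```

With the motivating query, a heading such as "Notable people" shares one of its two tokens with the query, giving a ratio of 0.5. That clears the default threshold, so a passage under that heading can overtake a slightly higher-scored passage under an unrelated heading.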

The motivating query "who are some famous people from big rapids,
michigan" now traces:

Default mode: tail probe resolves Big_Rapids,_Michigan and returns
the article body. This is better than the previous BM25-noise
outcome, though the response is not yet section-targeted in
default mode.

synthesize=True: tail probe resolves the entity, affinity boost
promotes the #Notable_people section to the lead passage, and
the response carries considered_articles + considered_sections
handles for the next turn.
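
To make the follow-up handles concrete, here is a hedged sketch of how a client might read them. The tuple fields of ConsideredArticle and ConsideredSection follow the changelog. The `answer` field, the `follow_up_pivots` helper, and all sample values are hypothetical, and the argument shapes of get_zim_entries and get_section are not described in the source, so the sketch only pairs each tool name with the handle it would use.

```python
from typing import List, Tuple, TypedDict


class ConsideredArticle(TypedDict):
    archive: str
    entry_path: str
    title: str
    score: float


class ConsideredSection(TypedDict):
    section_id: str
    title: str


class SynthesizeResponse(TypedDict, total=False):
    # total=False makes every key optional, so older payloads without
    # the considered_* keys still type-check.
    answer: str  # hypothetical stand-in for the pre-existing fields
    considered_articles: List[ConsideredArticle]
    considered_sections: List[ConsideredSection]


def follow_up_pivots(response: SynthesizeResponse) -> List[Tuple[str, str]]:
    """Pair each candidate handle with the tool a follow-up turn could call."""
    pivots: List[Tuple[str, str]] = []
    # .get() with a default covers payloads that omit the optional keys.
    for article in response.get("considered_articles", []):
        pivots.append(("get_zim_entries", article["entry_path"]))
    for section in response.get("considered_sections", []):
        pivots.append(("get_section", section["section_id"]))
    return pivots


# Illustrative payload; archive name, paths, and scores are made up.
example: SynthesizeResponse = {
    "answer": "...",
    "considered_articles": [
        {"archive": "example.zim", "entry_path": "Michigan",
         "title": "Michigan", "score": 0.42},
    ],
    "considered_sections": [
        {"section_id": "History", "title": "History"},
    ],
}
pivots = follow_up_pivots(example)
```

Reading the optional keys through `.get()` keeps the client tolerant of responses produced before these fields existed.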

Added

iter_query_tails(query, *, max_len=4, min_len=1) in
openzim_mcp/title_promotion.py: greedy length-down trailing-token
iterator, lowercased + [a-z0-9]+ tokenized. Shared by both
entity-resolution paths. Underscore is treated as a token boundary
so path-form input like Big_Rapids,_Michigan tokenizes correctly.

_boost_by_section_affinity pipeline stage in
openzim_mcp/synthesize.py plus the _section_titles_for and
_maybe_boost_passage helpers. Bundle-titles lookup is memoized
per call; exceptions and None bundles are no-ops (score unchanged).

SynthesizeConfig.section_affinity_threshold (default 0.25,
bounds [0.0, 1.0]) and section_affinity_boost (default 1.5,
bounds [1.0, 10.0]): Pydantic-validated tunables for the new
stage.

ConsideredArticle and ConsideredSection TypedDicts in
openzim_mcp/tool_schemas.py.

_build_considered_articles and _build_considered_sections
helpers in openzim_mcp/synthesize.py. Featured article and
section are excluded so the lists are alternatives, not
duplicates of the featured citation.
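
The iter_query_tails entry above gives a signature and a tokenization rule, which is enough for a rough sketch of the behavior. This is an approximation under stated assumptions, not the code in title_promotion.py: the changelog does not say whether the helper yields strings or token lists, so joining tokens with a single space is an assumption, as is the handling of queries shorter than max_len.

```python
import re
from typing import Iterator

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def iter_query_tails(query: str, *, max_len: int = 4, min_len: int = 1) -> Iterator[str]:
    """Yield trailing token windows of the query, longest first.

    Assumption: each tail is yielded as a space-joined string.
    """
    # "_" and "," are outside [a-z0-9], so they act as token boundaries.
    tokens = _TOKEN_RE.findall(query.lower())
    longest = min(max_len, len(tokens))
    shortest = max(min_len, 1)  # never yield an empty tail
    for size in range(longest, shortest - 1, -1):
        yield " ".join(tokens[-size:])


tails = list(iter_query_tails("who are some famous people from big rapids, michigan"))
path_tails = list(iter_query_tails("Big_Rapids,_Michigan"))
```

Under these assumptions the prose query yields a 4-token tail first and the single token "michigan" last, and the path-form input Big_Rapids,_Michigan tokenizes to the same three words as its prose spelling.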

Changed

_promote_title_match in synthesize.py: removed the M26 4-token
short-circuit. Long prose queries with a clear entity tail now
resolve canonically instead of falling through to BM25 noise.

_promote_topic_via_title_index in simple_tools.py: rewritten
as a two-pass tail-probe (strict 1.0-score gate across all tails
first, then 0.8-score typo-tolerant gate across all tails). The
two-pass ordering prevents a fuzzy 0.8 match on a long noisy tail
from winning over an exact 1.0 match on a clean shorter tail.

SynthesizeResponse TypedDict is now total=False to accommodate
the new optional fields. Existing callers populating every field
are unaffected.
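
The reason the two-pass ordering matters can be shown with a small sketch. The lookup callable, its (entry_path, score) return shape, the function names, and the sample index are all hypothetical; the real title-index API is not described in the changelog. Treating each gate as "score at or above the gate value" is also an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup: maps a tail to (entry_path, score), or None on a miss.
TitleLookup = Callable[[str], Optional[Tuple[str, float]]]


def single_pass_probe(tails: List[str], lookup: TitleLookup, gate: float = 0.8) -> Optional[str]:
    """Naive variant: accept the first tail whose score clears one gate."""
    for tail in tails:
        hit = lookup(tail)
        if hit is not None and hit[1] >= gate:
            return hit[0]
    return None


def two_pass_probe(
    tails: List[str],
    lookup: TitleLookup,
    *,
    strict: float = 1.0,
    fuzzy: float = 0.8,
) -> Optional[str]:
    """Try every tail at the strict gate before trying any tail at the fuzzy gate."""
    for gate in (strict, fuzzy):
        for tail in tails:
            hit = lookup(tail)
            if hit is not None and hit[1] >= gate:
                return hit[0]
    return None


# Made-up index: the long noisy tail fuzzily matches the wrong entity.
index: Dict[str, Tuple[str, float]] = {
    "from big rapids michigan": ("Grand_Rapids,_Michigan", 0.8),
    "big rapids michigan": ("Big_Rapids,_Michigan", 1.0),
}
tails = ["from big rapids michigan", "big rapids michigan", "rapids michigan", "michigan"]

naive = single_pass_probe(tails, index.get)
careful = two_pass_probe(tails, index.get)
```

In this contrived index the single-pass variant stops at the fuzzy hit on the longest tail, while the two-pass variant reaches the exact match on the shorter tail. That is the failure mode the changelog says the rewrite prevents.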

Tests

46 new unit tests across tests/test_iter_query_tails.py,
tests/test_simple_tools_tail_probe.py,
tests/test_synthesize_section_affinity.py,
tests/test_synthesize_considered_handles.py, and additions to
tests/test_synthesize_title_promotion_v2a9.py and
tests/test_tool_schemas.py. Test count: 1567 → 1566 (one fewer
because two affinity-boost tests with identical setup blocks were
merged into one combined assertion; SonarCloud flagged the
intra-file duplication).

Three golden snapshots refreshed
(synthesize_berlin_geography.json, synthesize_munich_history.json,
synthesize_capital_city.json): the new considered_* fields are
always emitted, and the score change from 1.0 → 1.5 on
entity-name section headings reflects the affinity boost firing.

test_metadata_namespace_from_metadata_keys threshold relaxed
from >= 10 to >= 5 after an upstream zim-testing-suite
fixture refresh changed nons/small.zim's metadata-key count
from 10 to 9 (this broke comprehensive-testing on main before
this alpha was cut).


About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and a comprehensive API.


Earlier breaking changes

v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.