
cameronrye/openzim-mcp

v2.0.0a14 Breaking

This release includes breaking changes. Platform teams should review them before upgrading.

Published 19d MCP Data & Storage
✓ No known CVEs patched

Topics

kiwix mcp mcp-server openzim zim

ReleasePort's take

Light signal
editorial:auto 9d

v2.0.0a14 enhances entity resolution for long prose queries and boosts section relevance in synthesize mode, while introducing new configuration options and refactoring internal matching logic.

Why it matters: Improves accuracy of canonical entity extraction for lengthy inputs and surfaces the most pertinent sections; new `section_affinity_threshold` and `section_affinity_boost` parameters let teams fine‑tune behavior before adopting v2.0.0a14.

Summary

AI summary

Prose questions now resolve canonical entities and lead with the most relevant section when synthesize is enabled.

Changes in this release

Feature Medium

Greedy length-down tail-probe entity resolution improves long prose query handling.


Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Section-heading affinity boost in synthesize mode promotes relevant sections to lead passage.


Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Multi-round handles added to SynthesizeResponse expose candidate articles and sections for follow-up turns.


Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

_boost_by_section_affinity pipeline stage computes section affinity boost based on query and heading token overlap.


Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

SynthesizeConfig parameters `section_affinity_threshold` and `section_affinity_boost` added for tuning affinity boosting.


Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

ConsideredArticle and ConsideredSection TypedDicts defined in `tool_schemas.py` to structure candidate handles.


Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

SynthesizeResponse TypedDict now `total=False` to accommodate new optional fields without affecting existing callers.


Source: llm_adapter@2026-05-21

Confidence: low

Dependency Medium

`iter_query_tails` helper introduced in `title_promotion.py` for shared trailing-token iteration.


Source: llm_adapter@2026-05-21

Confidence: low

Refactor Medium

_promote_title_match removed M26 4-token short-circuit, enabling full tail probing for long queries.


Source: llm_adapter@2026-05-21

Confidence: high

Refactor Medium

_promote_topic_via_title_index rewritten as two-pass strict then fuzzy tail probe to prioritize exact matches.


Source: llm_adapter@2026-05-21

Confidence: high

Full changelog

First post-beta-test alpha that ships a feature rather than a sweep:
natural-language prose questions now resolve to canonical entities
and (in synthesize=True mode) lead with the most relevant section
of the resolved article. Three coordinated changes:

  1. Greedy length-down tail-probe entity resolution. A shared
    iter_query_tails helper in title_promotion.py iterates the
    trailing 4 → 3 → 2 → 1 tokens of a query. Both the default
    _handle_tell_me_about path (via _promote_topic_via_title_index,
    two-pass strict-then-fuzzy) and the synthesize path (via
    _promote_title_match, single-pass strict) now probe each tail.
    This replaces the M26 4-token short-circuit that previously caused
    long prose queries like "who are some famous people from big
    rapids, michigan" to fall through to BM25 noise instead of
    resolving the canonical Big_Rapids,_Michigan entity.

  2. Section-heading affinity boost in synthesize. A new
    _boost_by_section_affinity pipeline stage runs after
    _attribute_sections. For each passage carrying a #section_id,
    it computes |query_tokens ∩ heading_tokens| / |heading_tokens|.
    When that ratio meets SynthesizeConfig.section_affinity_threshold
    (default 0.25), the passage score is multiplied by
    section_affinity_boost (default 1.5) and the list is
    re-sorted (with rank renumbered to match). Archive-agnostic:
    the archive's own section headings supply the matching
    vocabulary, no curated synonym tables.

  3. Multi-round handles on SynthesizeResponse. Two new optional
    fields surface the candidate space:
    considered_articles (top-3 article hits not featured) exposes
    (archive, entry_path, title, score) so a follow-up turn can pivot
    via get_zim_entries. considered_sections (top-10 sections of
    the featured article, in document order, minus the featured one)
    exposes (section_id, title) so a follow-up turn can pivot via
    get_section. SynthesizeResponse switches to
    TypedDict(total=False) to accommodate the additive shape;
    existing callers populating every field are unaffected. Compact-
    mode markdown rendering of these fields is deferred — the
    structured payload (structuredContent) always carries them.
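The affinity boost in item 2 can be sketched as follows. This is a minimal illustration, not the project's code: the `Passage` dataclass, the function names, and the `[a-z0-9]+` tokenization of headings are assumptions made for the example. Only the ratio, the defaults (0.25 and 1.5), and the re-sort with rank renumbering come from the changelog.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def _tokens(text: str) -> Set[str]:
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return set(_TOKEN_RE.findall(text.lower()))


@dataclass
class Passage:
    """Hypothetical stand-in for a synthesize passage."""
    heading: Optional[str]  # section heading, or None if unattributed
    score: float
    rank: int = 0


def section_affinity(query: str, heading: str) -> float:
    """|query_tokens ∩ heading_tokens| / |heading_tokens|."""
    heading_tokens = _tokens(heading)
    if not heading_tokens:
        return 0.0
    return len(_tokens(query) & heading_tokens) / len(heading_tokens)


def boost_by_section_affinity(
    query: str,
    passages: List[Passage],
    *,
    threshold: float = 0.25,
    boost: float = 1.5,
) -> List[Passage]:
    """Multiply qualifying scores by `boost`, re-sort, renumber ranks."""
    for passage in passages:
        if passage.heading is None:
            continue  # no section attached: score unchanged
        if section_affinity(query, passage.heading) >= threshold:
            passage.score *= boost
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)
    for position, passage in enumerate(ranked, start=1):
        passage.rank = position
    return ranked
```

With the motivating query, a heading such as "Notable people" shares one of its two tokens with the query (ratio 0.5), so it clears the default threshold and its score is multiplied by 1.5.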

The motivating query "who are some famous people from big rapids,
michigan" now traces:

  • Default mode: tail probe resolves Big_Rapids,_Michigan, returns
    the article body. Better than the previous BM25-noise outcome,
    though the response is not yet section-targeted in default mode.
  • synthesize=True: tail probe resolves the entity, affinity boost
    promotes the #Notable_people section to the lead passage, and
    the response carries considered_articles + considered_sections
    handles for the next turn.
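A follow-up turn might read the new handles roughly as below. The field names come from the changelog; the field types, the omission of the other SynthesizeResponse fields, and the `follow_up_targets` helper are assumptions made for this sketch. Because the TypedDict is `total=False`, the reader uses `.get` so that a payload without the new keys still works.

```python
from typing import Dict, List, TypedDict


class ConsideredArticle(TypedDict):
    archive: str
    entry_path: str
    title: str
    score: float  # assumed numeric


class ConsideredSection(TypedDict):
    section_id: str
    title: str


class SynthesizeResponse(TypedDict, total=False):
    # Only the two new optional fields are shown; the real response
    # carries other fields that are omitted here.
    considered_articles: List[ConsideredArticle]
    considered_sections: List[ConsideredSection]


def follow_up_targets(response: SynthesizeResponse) -> Dict[str, List[str]]:
    """Collect the identifiers a next turn could pass to
    get_zim_entries (entry paths) or get_section (section ids)."""
    articles = response.get("considered_articles", [])
    sections = response.get("considered_sections", [])
    return {
        "entry_paths": [a["entry_path"] for a in articles],
        "section_ids": [s["section_id"] for s in sections],
    }
```

The helper only extracts identifiers; how they are passed to get_zim_entries or get_section depends on those tools' actual parameters, which the changelog does not spell out.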

Added

  • iter_query_tails(query, *, max_len=4, min_len=1) in
    openzim_mcp/title_promotion.py — greedy length-down trailing-
    token iterator, lowercased + [a-z0-9]+ tokenized. Shared by both
    entity-resolution paths. Underscore is treated as a token boundary
    so path-form input like Big_Rapids,_Michigan tokenizes correctly.
  • _boost_by_section_affinity pipeline stage in
    openzim_mcp/synthesize.py plus the _section_titles_for and
    _maybe_boost_passage helpers. Bundle-titles lookup is memoized
    per call; exceptions and None bundles are no-ops (score unchanged).
  • SynthesizeConfig.section_affinity_threshold (default 0.25,
    bounds [0.0, 1.0]) and section_affinity_boost (default 1.5,
    bounds [1.0, 10.0]) — Pydantic-validated tunables for the new
    stage.
  • ConsideredArticle and ConsideredSection TypedDicts in
    openzim_mcp/tool_schemas.py.
  • _build_considered_articles and _build_considered_sections
    helpers in openzim_mcp/synthesize.py. Featured article and
    section are excluded so the lists are alternatives, not
    duplicates of the featured citation.
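The tail iterator described in the first bullet could look something like this. It is a sketch under assumptions: the function is named `iter_query_tails_sketch` to make clear it is not the project's implementation, and yielding tuples of tokens is a guess, since the changelog does not say whether the real helper yields token lists or joined strings.

```python
import re
from typing import Iterator, Tuple

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def iter_query_tails_sketch(
    query: str, *, max_len: int = 4, min_len: int = 1
) -> Iterator[Tuple[str, ...]]:
    """Yield the trailing max_len, ..., min_len tokens of `query`,
    longest tail first (greedy length-down)."""
    # Lowercase, then keep runs of [a-z0-9]; underscores, commas and
    # spaces all act as token boundaries.
    tokens = _TOKEN_RE.findall(query.lower())
    longest = min(max_len, len(tokens))
    shortest = max(min_len, 1)  # a zero-length tail is meaningless
    for length in range(longest, shortest - 1, -1):
        yield tuple(tokens[-length:])


for tail in iter_query_tails_sketch(
    "who are some famous people from big rapids, michigan"
):
    print(" ".join(tail))
```

For the motivating query the loop prints the 4-token tail first and "michigan" last; the 3-token tail "big rapids michigan" is the one that lines up with the Big_Rapids,_Michigan entry. Path-form input such as Big_Rapids,_Michigan yields the same three tokens because the underscore is not part of the token pattern.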

Changed

  • _promote_title_match in synthesize.py: removed the M26 4-token
    short-circuit. Long prose queries with a clear entity tail now
    resolve canonically instead of falling through to BM25 noise.
  • _promote_topic_via_title_index in simple_tools.py: rewritten
    as a two-pass tail-probe (strict 1.0-score gate across all tails
    first, then 0.8-score typo-tolerant gate across all tails). The
    two-pass ordering prevents a fuzzy 0.8 match on a long noisy tail
    from winning over an exact 1.0 match on a clean shorter tail.
  • SynthesizeResponse TypedDict is now total=False to accommodate
    the new optional fields. Existing callers populating every field
    are unaffected.
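The reason for the two-pass ordering is easier to see in code. The sketch below is illustrative only: `resolve_two_pass`, the `lookup` callable, and its `(entry_path, score)` return shape are hypothetical, and treating the gates as `>=` comparisons against 1.0 and 0.8 is an assumption about how the scores are checked.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical title-index lookup: tail text -> (entry_path, score) or None.
Lookup = Callable[[str], Optional[Tuple[str, float]]]


def resolve_two_pass(
    tails: Sequence[str],
    lookup: Lookup,
    *,
    strict_gate: float = 1.0,
    fuzzy_gate: float = 0.8,
) -> Optional[str]:
    """Try every tail against the strict gate before any tail is
    allowed to pass the looser, typo-tolerant gate."""
    hits = [lookup(tail) for tail in tails]  # one lookup per tail
    for gate in (strict_gate, fuzzy_gate):
        for hit in hits:
            if hit is not None and hit[1] >= gate:
                return hit[0]
    return None


# Made-up index contents, purely for demonstration.
index = {
    "from big rapids michigan": ("Some_Fuzzy_Match", 0.8),
    "big rapids michigan": ("Big_Rapids,_Michigan", 1.0),
}
tails = [
    "from big rapids michigan",
    "big rapids michigan",
    "rapids michigan",
    "michigan",
]
print(resolve_two_pass(tails, index.get))
```

A single pass in tail order would stop at the fuzzy 0.8 hit on the longest tail. Checking the strict gate across all tails first lets the exact match on the shorter tail win.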

Tests

  • 46 new unit tests across tests/test_iter_query_tails.py,
    tests/test_simple_tools_tail_probe.py,
    tests/test_synthesize_section_affinity.py,
    tests/test_synthesize_considered_handles.py, and additions to
    tests/test_synthesize_title_promotion_v2a9.py and
    tests/test_tool_schemas.py. Test count: 1567 → 1566 (one less
    because two affinity-boost tests with identical setup blocks were
    merged into one combined assertion; SonarCloud flagged the
    intra-file duplication).
  • Three golden snapshots refreshed
    (synthesize_berlin_geography.json, synthesize_munich_history.json,
    synthesize_capital_city.json) — the new considered_* fields are
    always emitted, and the score change from 1.0 → 1.5 on
    entity-name section headings reflects the affinity boost firing.
  • test_metadata_namespace_from_metadata_keys threshold relaxed
    from >= 10 to >= 5 after an upstream zim-testing-suite
    fixture refresh changed nons/small.zim's metadata-key count
    from 10 to 9 (broke comprehensive-testing on main before this
    alpha was cut).


About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.


Related context

Earlier breaking changes

  • v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
  • v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
  • v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
  • v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
  • v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.
