cameronrye/openzim-mcp

v2.0.0a10 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 22d MCP Data & Storage
✓ No known CVEs patched

Topics

kiwix mcp mcp-server openzim zim

Affected surfaces

auth rbac

ReleasePort's take

Light signal
editorial:auto 13d

Metadata APIs now correctly return ZIM metadata instead of silently serving article bodies, and cursor mismatches are rejected.

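Why it matters: Upgrade to v2.0.0a10 if your service relies on accurate metadata responses. On new-scheme archives, metadata calls previously returned article bodies without any error, and a cursor reused across unrelated queries paginated the wrong result set; both are fixed in this release.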

Summary

AI summary

Metadata APIs now return correct ZIM metadata instead of silently aliased article bodies, and cursor mismatches are rejected.

Changes in this release

Security Medium

Route M/<key> paths to get_metadata_item for new-scheme archives.

Source: llm_adapter@2026-05-21

Confidence: high

Security Low

Replace regex-based disambiguation lead detection with simple string endswith check to avoid ReDoS risk.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Breaking Medium

Infobox extraction now emits trailing rows without the preceding "GDP —" label, changing bullet-label strings.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Breaking Medium

`metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Breaking Medium

`get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Feature Low

Add stopword-saturation footer for queries matching many stopwords.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Feature Low

Update truncation hint to operation-agnostic guidance, removing self-reference.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Feature Low

Preserve inline disambiguation list in "tell me about" leads when pattern detected.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Feature Low

Demote List_of_*, Index_of_*, Outline_of_*, Timeline_of_* articles in synthesize ranking after title promotion.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Medium

Probe the title index before falling back to the literal path in the related-articles handler.

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Retry find_title_match with min_score=0.8 after strict gate failure for typo queries.

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Use get_metadata_item in _extract_zim_metadata for new-scheme archives.

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Reset current_section on KV rows with mergedtoprow after first row emitted.

Source: llm_adapter@2026-05-21

Confidence: low

Bugfix Low

Compute closest_match hint locally for "get section X of Y" natural-language path to suggest correct section name.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Low

Append scan_truncated footer when related articles scan cap fires in markdown rendering.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Low

Prepend canonical bare-title article when suggestions miss it, using a single SuggestionSearcher round trip.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Low

Mirror well-known namespace entries in walk_namespace_data to match the list namespaces output.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Low

Reject cursor when s.q shares no meaningful tokens with current query, preventing wrong-query pagination.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Bugfix Low

Trim _splice_title_match_into_search results to requested limit and update returned_count.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Refactor Low

Split five complex functions to reduce cognitive complexity per SonarCloud limits.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Refactor Low

Extract duplicate literals (MIME prefix, pseudo-namespace strings) into constants with shared helper.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Full changelog

Two-pass beta-test of v2.0.0a9 against a 118 GB Wikipedia ZIM (Feb
2026 snapshot), plus a self-review code-reviewer audit and a
SonarCloud Quality Gate cleanup. The first pass exercised the
markdown surface; the second pass audited the first-pass fixes and
extended live testing to surfaces not covered the first time. Several
recently-shipped backend features turned out to be unreachable from
the natural-language surface, several handlers had silent fall-through
bugs on common phrasings, and one libzim quirk (silent namespace-prefix
stripping) was masking the entire metadata API.

Net: 1425 tests pass (+5 over v2.0.0a9), 50 skipped, 38 deselected.
Live-verified key fixes against the real Wikipedia archive via
in-process ZimOperations calls.

Fixed — Critical (post-a9 beta sweep)

  • D1: infobox section-context leakage on every Wikipedia city /
    country.
    Berlin and Tokyo (and the broad city-template family)
    produced trailing rows labelled **GDP — Time zone:**,
    **GDP — Vehicle registration:**, **GDP — Website:**,
    **GDP — HDI (2022):** — clearly wrong. The post-a8 #2/Op5
    parent-context fix correctly tracked current_section from
    <th class="infobox-header"> rows but never reset it; trailing
    free-floating rows (which Wikipedia marks <tr class="mergedtoprow">)
    inherited the last header. Reset current_section on KV rows whose
    <tr> carries mergedtoprow AND only after at least one row has
    been emitted under the current section — the second guard is the
    third-pass fix, without which the reset stripped section context
    from the first KV row inside a section header (Wikipedia uses
    mergedtoprow on those too as the visual group lead). Both edges
    covered by new regression tests.
  • D7: M/<key> paths silently aliased to C-namespace articles.
    libzim's archive.get_entry_by_path("M/Title") strips the M/
    prefix and resolves to the C-namespace article with that name;
    get article M/Title against a Wikipedia ZIM returned the 172 KB
    disambiguation article on "Title" instead of the metadata entry.
    Route M/<key> paths to archive.get_metadata_item on new-scheme
    archives so the proper metadata API serves these requests. Verified:
    M/Title now returns "Wikipedia", M/Date returns "2026-02-15".
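
The D7 routing can be illustrated with a minimal sketch. FakeArchive below is a hypothetical stand-in, not the libzim API: it only reproduces the prefix-stripping quirk described above so the routing decision can be run in isolation. The names read_path, METADATA_PREFIX and new_scheme are assumptions made for this illustration.

```python
from dataclasses import dataclass, field

METADATA_PREFIX = "M/"


@dataclass
class FakeArchive:
    """Hypothetical stand-in for a ZIM archive (not the real libzim classes)."""

    new_scheme: bool
    metadata: dict[str, str] = field(default_factory=dict)
    articles: dict[str, str] = field(default_factory=dict)

    def get_metadata_item(self, key: str) -> str:
        return self.metadata[key]

    def get_entry_by_path(self, path: str) -> str:
        # Mimics the quirk from D7: on a new-scheme archive the "M/" prefix
        # is dropped and the lookup lands on the article with that name.
        if self.new_scheme and path.startswith(METADATA_PREFIX):
            path = path[len(METADATA_PREFIX):]
        return self.articles[path]


def read_path(archive: FakeArchive, path: str) -> str:
    """Send M/<key> to the metadata API on new-scheme archives only."""
    if archive.new_scheme and path.startswith(METADATA_PREFIX):
        return archive.get_metadata_item(path[len(METADATA_PREFIX):])
    # Old-scheme archives and ordinary paths keep the path-based lookup.
    return archive.get_entry_by_path(path)


archive = FakeArchive(
    new_scheme=True,
    metadata={"Title": "Wikipedia"},
    articles={"Title": "Title may refer to: ..."},
)
print(read_path(archive, "M/Title"))          # metadata value
print(archive.get_entry_by_path("M/Title"))   # the aliased article body
```

The same guard also covers the DD1 aggregator case: any code path that still calls the path-based lookup with an M/ prefix would keep receiving the aliased article.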

Fixed — High (post-a9 beta sweep)

  • D2: articles related to <topic> failed on natural phrasings.
    The intent parser hands the topic verbatim from the user's query
    (articles related to United States → United States), but the
    underlying entry path stores spaces as underscores
    (United_States). The handler called get_related_articles_data
    with the unresolved string and surfaced "Cannot find entry". Now
    probes the title index via find_title_match(min_score=0.8) first
    and falls through to the literal path only when no canonical
    title resolves.
  • D3: tell me about <typo> skipped the typo-tolerant title
    fallback.
    The first-pass title promotion required score 1.0;
    single-edit typos resolve at score 0.85 via _find_entry_typo_fallback.
    tell me about Photosythesis (missing n) fell through to Xapian
    search and returned International Year of Chemistry, which is
    actively misleading. Retry find_title_match(min_score=0.8) after
    the strict gate fails; same conservative typo chain
    (length-gated at ≥ 5 chars, ≤ 700 variants).
  • DD1: metadata for <file> aggregator returned 172 KB article
    bodies for new-scheme archives.
    D7 fixed the per-entry
    get article M/Title surface but _extract_zim_metadata
    (a separate code path used by the metadata for aggregator) was
    still calling get_entry_by_path("M/Title") and getting the same
    silently-aliased C-namespace article. Now uses get_metadata_item
    for new-scheme archives, with old-scheme get_entry_by_path
    fallback. Verified: Title returns "Wikipedia", Description
    returns "The free encyclopedia", Language returns "eng" (was
    172 K / 60 K / 364 K-char garbage respectively).
  • DD2: tell me about ignored content_offset. The handler
    hard-coded offset = 0 in the body fetch, so callers paginating a
    148 KB Photosynthesis article through zim_query couldn't reach
    the tail without dropping to a separate get article <path> call.
    Threaded options.get("content_offset", 0) through; suppress the
    compact-mode lead-with-TOC step when reading mid-article.
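
The strict-then-relaxed retry from D2 and D3 might look roughly like this. The find_title_match below is a hypothetical stand-in built on difflib, not the project's implementation, so its scores will differ from the 0.85 quoted above; TITLES and resolve_topic are likewise invented for the example.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical title index; a real archive would supply these.
TITLES = ["Photosynthesis", "United_States", "International_Year_of_Chemistry"]


def find_title_match(query: str, min_score: float) -> Optional[str]:
    """Stand-in matcher: best title whose similarity is at least min_score."""
    normalised = query.strip().replace(" ", "_").lower()
    best_title, best_score = None, 0.0
    for title in TITLES:
        score = SequenceMatcher(None, normalised, title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_score else None


def resolve_topic(query: str) -> Optional[str]:
    """Try the strict gate first; relax to 0.8 only when it fails."""
    strict = find_title_match(query, min_score=1.0)
    if strict is not None:
        return strict
    return find_title_match(query, min_score=0.8)


print(resolve_topic("United States"))   # exact match after normalisation
print(resolve_topic("Photosythesis"))   # resolved by the relaxed retry
```

Keeping the strict pass first means an exact title always wins before any fuzzy candidate is considered; a None result is the signal to fall through to full-text search.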

Fixed — Medium (post-a9 beta sweep)

  • D4: get section X of Y natural-language error path dropped the
    closest_match hint.
    The structured get_section operation
    computes a difflib-based closest-match (Op5 from a8) but the
    natural-language handler reimplemented section lookup against the
    headings list and never queried that operation. Compute the same
    hint locally so get section Goegraphy of Berlin now suggests
    "Did you mean Geography?".
  • D5: articles related to <hub> markdown dropped the
    scan_truncated signal.
    The a9 #A5 backend addition surfaced
    scan_truncated / scan_total_internal / _meta.reason for hub
    articles whose 500-link scan cap fired, but compact_renderers.render_related
    ignored all of it. Append a footer when the signal is set.
  • D6: suggestions for X missed the canonical bare-title article.
    suggestions for Photosyn returned 15 results, none of which was
    bare Photosynthesis — both libzim's SuggestionSearcher and
    Xapian rank disambiguator-bearing variants
    (Photosynthesis (song), Photosynthetic_efficiency) above the
    short canonical title. Probe SuggestionSearcher for parenthesised
    siblings (foo_(suffix)) and prepend the un-suffixed root path
    when the archive resolves it. The third-pass refactor restructured
    this to share a single SuggestionSearcher.suggest() round trip
    with Strategy 2, so the cold path stays at one title-index probe.
  • D8: walk namespace W returned zero entries while
    list namespaces claimed W had two.
    The two operations
    contradicted each other on the same archive. The W-namespace
    well-known entries (mainPage, favicon) live on the
    archive.main_entry / has_illustration API, not the iterable
    surface that walk_namespace_data falls back to. Mirror the same
    probe pair _add_new_scheme_well_known_namespace already uses
    for the namespace listing. Also fix the entries 1-0 off-by-one
    in the empty-walk header rendering.
  • D9: cursor s.q field silently ignored — wrong-query
    pagination.
    Cursor reused across queries silently paginated the
    new query at the old offset. Reject with a cursor_decode error
    when s.q shares no meaningful (≥ 3-char) tokens with the current
    query. Falls back to a bidirectional substring check for cursors
    whose stored query has only short tokens. Three regression tests
    cover the unrelated-query reject, the shortened-query accept, and
    the overlapping-tokens accept.
  • DD4: _splice_title_match_into_search returned limit + 1
    results.
    Prepending the canonical synthetic result didn't trim
    back to the requested limit; limit=3 produced 4 results with
    header "showing 1-4". Trim to page_info.limit and update
    page_info.returned_count so the header matches the row count.
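
One possible reading of the D9 check, as a hedged sketch: tokenise both queries, keep tokens of three or more characters, and accept the cursor only when the sets overlap, with a substring comparison when the stored query has no such tokens. The function names and the tokenisation regex are assumptions; the project may split or normalise queries differently.

```python
import re


def _meaningful_tokens(text: str, min_len: int = 3) -> set[str]:
    """Lower-cased word tokens with at least min_len characters."""
    return {t for t in re.findall(r"\w+", text.lower()) if len(t) >= min_len}


def cursor_matches_query(stored_q: str, current_q: str) -> bool:
    """Return True when a cursor's stored query plausibly belongs to current_q."""
    stored = _meaningful_tokens(stored_q)
    if stored:
        return bool(stored & _meaningful_tokens(current_q))
    # Stored query has only short tokens: compare as substrings, either way round.
    a, b = stored_q.strip().lower(), current_q.strip().lower()
    return a in b or b in a


print(cursor_matches_query("photosynthesis", "roman empire"))   # unrelated query
print(cursor_matches_query("history of berlin", "berlin"))      # shortened query
print(cursor_matches_query("berlin wall", "berlin history"))    # overlapping tokens
```

A caller would turn a False result into the cursor_decode error rather than paginating at the stale offset.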

Added — Opportunities (post-a9 beta sweep)

  • O2: stopword-saturation footer on search. Queries that match
    ≥ 1 M results (the stopword-only search for the and a is in to
    saturates at ~5 M) now carry a footer noting that top hits are
    ranked by general document importance, not topic relevance — so
    the model doesn't trust the "Found N matches" signal as
    meaningful.
  • O3: truncation hint no longer self-references. The previous
    hint suggested show structure of <path> as the recovery —
    silly when the truncated response IS the show-structure (or
    table-of-contents) output. Replaced with operation-agnostic
    guidance (page via cursor / tighten query / compact=False).
  • O4: disambiguation page leads preserve their inline list.
    tell me about Martin previously truncated to **Martin** may refer to: with no list, forcing a show structure round-trip.
    Detect "X may refer to:" leads and skip the H2 cut so the
    disambig list stays inline.
  • O5: synthesize demotes List_of_* / Index_of_* /
    Outline_of_* / Timeline_of_* etc.
    These articles ranked
    surprisingly high in synthesize because their bodies match many
    query tokens but the actual content is just an enumeration stub.
    Demote to the back of top_n AFTER title promotion runs (demoting
    before regressed the promotion's strong-match guard, which would
    treat Berlin_(disambiguation) as a match for Berlin).
  • O6: docstring notes distinguish show structure (flat heading
    list) from table of contents (nested children tree).
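
The O5 demotion can be sketched as a stable partition applied after title promotion has already run. The prefix tuple covers only the four prefixes named above (the changelog says "etc.", so the real list may be longer), and demote_enumerations and the sample paths are hypothetical.

```python
# Only the prefixes named in O5; the project's actual list may include more.
DEMOTED_PREFIXES = ("List_of_", "Index_of_", "Outline_of_", "Timeline_of_")


def demote_enumerations(ranked_paths: list[str]) -> list[str]:
    """Move enumeration-style titles to the back, preserving relative order."""
    kept = [p for p in ranked_paths if not p.startswith(DEMOTED_PREFIXES)]
    demoted = [p for p in ranked_paths if p.startswith(DEMOTED_PREFIXES)]
    return kept + demoted


ranked = [
    "List_of_examples",        # hypothetical enumeration article
    "Quantum_mechanics",
    "Timeline_of_examples",    # hypothetical enumeration article
    "Wave_function",
]
print(demote_enumerations(ranked))
```

Because the partition is stable, whatever order title promotion produced among the non-list articles is left untouched.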

Fixed — Code-reviewer audit findings (post-first-pass)

A feature-dev:code-reviewer agent audited the first-pass commit and
surfaced three real defects in the original fixes:

  • A1 (the second guard on D1, listed under Critical above).
  • A2: D6 ran SuggestionSearcher twice on the cold path. When
    Strategy 1 returned empty, both the canonical probe AND Strategy 2
    opened independent SuggestionSearcher instances against the same
    archive. The first-pass "skip canonical probe when Strategy 1
    empty" fix regressed the empty-Strategy-1 case (the canonical
    probe IS needed when Xapian misses). Restructured to share a
    single SuggestionSearcher.suggest() round trip via an optional
    result_paths= parameter on _find_canonical_prefix_match.
  • A3 (the token-overlap rewrite of D9, listed under Medium above).
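
The two guards on D1 (A1 above) can be shown with a small state machine. This is a sketch under assumptions: Row and label_rows are hypothetical, rows are pre-parsed rather than read from HTML, and the label format simply mirrors the "GDP — Time zone" strings quoted earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    kind: str                  # "header" or "kv"
    label: str
    merged_top: bool = False   # True when the <tr> carries mergedtoprow


def label_rows(rows: list[Row]) -> list[str]:
    """Prefix KV labels with their section, resetting on trailing groups."""
    labels: list[str] = []
    current_section: Optional[str] = None
    emitted_in_section = 0
    for row in rows:
        if row.kind == "header":
            current_section, emitted_in_section = row.label, 0
            continue
        # Reset only if this row starts a new visual group AND the current
        # section has already emitted a row (the second guard from A1).
        if row.merged_top and emitted_in_section > 0:
            current_section = None
        if current_section is None:
            labels.append(row.label)
        else:
            labels.append(f"{current_section} — {row.label}")
            emitted_in_section += 1
    return labels


rows = [
    Row("header", "GDP"),
    Row("kv", "Total", merged_top=True),      # group lead inside the section
    Row("kv", "Per capita"),
    Row("kv", "Time zone", merged_top=True),  # trailing free-floating row
    Row("kv", "Website"),
]
print(label_rows(rows))
```

Without the emitted_in_section check, the first row under the header would lose its section prefix, which is the regression the second guard addresses.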

Fixed — Quality gate (SonarCloud third-pass cleanup)

  • 5 cognitive-complexity reductions (S3776). Five functions added
    by the beta-test commits crossed SonarCloud's complexity-15 limit.
    Each was split into self-contained helpers without behaviour
    change: _find_canonical_prefix_match (53 → split into 5 helpers
    for path probing, root extraction, entry resolution, and the two
    ranking strategies), _handle_tell_me_about (19 → 17 → ~14 over
    two passes via _promote_topic_via_title_index and
    _fetch_topic_article_body), render_related (17 → ~10 via
    _render_related_link_line + _scan_truncated_footer),
    render_walk_namespace (19 → ~12 via _walk_namespace_header),
    and _get_metadata_entry (18 → ~13 via _decode_metadata_content).
  • 4 duplicate-literal extractions (S1192). The "text/" MIME prefix
    had three call sites in zim/content.py; "File:" / "Category:" /
    "Template:" each had three call sites in zim/search.py. Extracted
    to _TEXT_MIME_PREFIX and a _PSEUDO_NAMESPACE_* constant trio
    with a shared _is_pseudo_namespace_entry(extended=) helper.
  • 1 ReDoS hotspot (S5852). The O4 disambig-lead-detection regex
    \bmay\s+(?:also\s+)?refer\s+to\s*:?\s*$ was flagged for nested
    unbounded quantifiers. Not actually catastrophic on Python's re
    engine, but replaced anyway with a normalised
    str.endswith(("may refer to", "may also refer to")) check — same
    behaviour, no regex engine, and the phrase list is easier to
    extend.
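
A regex-free check in the spirit of the S5852 fix might look like this. The normalisation steps (lower-casing, collapsing whitespace, stripping a trailing colon) are a guess at what "normalised" means here, and is_disambiguation_lead is a hypothetical name.

```python
DISAMBIG_SUFFIXES = ("may refer to", "may also refer to")


def is_disambiguation_lead(lead: str) -> bool:
    """True when the lead ends with one of the disambiguation phrases."""
    # Collapse runs of whitespace, lower-case, then drop trailing colons/spaces.
    normalised = " ".join(lead.lower().split()).rstrip(": ")
    return normalised.endswith(DISAMBIG_SUFFIXES)


print(is_disambiguation_lead("**Martin** may refer to:"))
print(is_disambiguation_lead("Martin may  also refer to :"))
print(is_disambiguation_lead("Martin is a given name."))
```

str.endswith accepts a tuple, so extending the phrase list is a one-line change. Unlike the original pattern, this version has no word-boundary check before "may".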

Wire-format / surface changes (alpha-line clean breaks)

  • Infobox extraction labels for trailing rows change. Berlin /
    Tokyo terminal rows that previously emitted as GDP — Time zone
    now emit as Time zone. Callers parsing the bullet-prefix
    structure see different label strings.
  • metadata for <file> now returns short metadata strings
    instead of 172 KB article-body excerpts. Wire-format compatible
    (same keys); content is the actual ZIM metadata (Title =
    "Wikipedia", Date = "2026-02-15", etc.).
  • get article M/<key> now returns the ZIM metadata entry
    instead of the silently-aliased C-namespace article body.
    Wire-format compatible (same response envelope); content differs.
  • _splice_title_match_into_search trims to the requested
    limit. Callers receiving limit + 1 results will now get exactly
    limit.
  • Cursor with mismatched s.q now errors. Callers that
    previously got silent wrong-query results now receive a
    cursor_decode ToolErrorPayload.
  • Synthesize ranking demotes list articles. Citation order for a
    query like Quantum mechanics no longer includes
    List_of_textbooks_… in the top half.
  • Truncation hint footer text changed (O3). Callers parsing the
    trailing prose see different wording.

Investigated and deferred

  • Pseudo-namespace pollution in default search results
    (Portal: / User: / Help:).
    Filtering pseudo-namespace
    articles from default search is too opinionated; some callers
    legitimately want them. The canonical-promotion already pushes
    the real article to rank 1 in the common case (live-verified:
    search for biology → Biology at #1 via (canonical title match)).
    Revisit if the canonical-promotion fallback proves insufficient.

Breaking Changes

  • Cursor validation: mismatched `s.q` now raises a `cursor_decode` error instead of silent wrong‑query pagination.

About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.

Related context

Earlier breaking changes

  • v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
  • v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
  • v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
  • v2.0.0a9 HTTP rate-limiter client_id now derived from token or IP; defaults to "default" fallback.
