Skip to content

cameronrye/openzim-mcp

v2.0.0a11 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 21d MCP Data & Storage
✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

kiwix mcp mcp-server openzim zim

ReleasePort's take

Light signal
editorial:auto 13d

The `content_offset` parameter is now a top‑level option in `zim_query`, validated as >=0 and threaded through the query options.

Why it matters: Update any calls to `zim_query` that need article‑body paging to include the new `content_offset` field; failure will cause unreachable content errors.

Summary

AI summary

content_offset now exposed in zim_query, fixing unreachable paging and infobox concatenation issues.

Changes in this release

Security Medium

Cursor with missing/invalid `s` envelope now errors (`cursor_decode`).

Cursor with missing/invalid `s` envelope now errors (`cursor_decode`).

Source: llm_adapter@2026-05-21

Confidence: high

Breaking Medium

Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.

Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

`zim_query` accepts top-level `content_offset` parameter for article-body paging.

`zim_query` accepts top-level `content_offset` parameter for article-body paging.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

`walk namespace M` and `metadata for <file>` agree on metadata via new-scheme archive enumeration.

`walk namespace M` and `metadata for <file>` agree on metadata via new-scheme archive enumeration.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

Synthesize ranking demotes `Lists_of_*` and related suffixes in disambiguation results.

Synthesize ranking demotes `Lists_of_*` and related suffixes in disambiguation results.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

Every markdown response now carries trailing `<!-- intent=... cert=... -->` HTML comment.

Every markdown response now carries trailing `<!-- intent=... cert=... -->` HTML comment.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

Intent-parser chained-query guard returns guidance instead of silently dispatching rightmost intent.

Intent-parser chained-query guard returns guidance instead of silently dispatching rightmost intent.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Low

Synthesized ranking demotes titles matching `Lists_of_*` and suffixes like `_discography`, `_filmography`, etc., in disambiguation results.

Synthesized ranking demotes titles matching `Lists_of_*` and suffixes like `_discography`, `_filmography`, etc., in disambiguation results.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Performance Medium

Optimized title‑index probing: now runs once per query by gating canonical probe behind match count checks, reducing duplicate lookups.

Optimized title‑index probing: now runs once per query by gating canonical probe behind match count checks, reducing duplicate lookups.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

`get section` honors `max_content_length` and appends a truncation footer.

`get section` honors `max_content_length` and appends a truncation footer.

Source: llm_adapter@2026-05-21

Confidence: high

Bugfix Medium

Infobox cells render with intra-cell whitespace at block-tag boundaries only.

Infobox cells render with intra-cell whitespace at block-tag boundaries only.

Source: llm_adapter@2026-05-21

Confidence: low

Bugfix Medium

Infobox text extraction now inserts whitespace only at block‑tag boundaries, fixing silent concatenation issues.

Infobox text extraction now inserts whitespace only at block‑tag boundaries, fixing silent concatenation issues.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

`walk namespace` and `metadata for` now agree on metadata counts by enumerating new‑scheme archive keys.

`walk namespace` and `metadata for` now agree on metadata counts by enumerating new‑scheme archive keys.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

Auto‑pick canonical when strong matches are exactly `Foo` and `Foo (disambiguation)`, surfacing disambig as a footer hint.

Auto‑pick canonical when strong matches are exactly `Foo` and `Foo (disambiguation)`, surfacing disambig as a footer hint.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

Prefer exact‑topic canonical before disambig check, ensuring queries like `Apollo 11` return the correct article.

Prefer exact‑topic canonical before disambig check, ensuring queries like `Apollo 11` return the correct article.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

`show structure of <multi‑word title>` normalizes titles via `_resolve_natural_language_path` helper.

`show structure of <multi‑word title>` normalizes titles via `_resolve_natural_language_path` helper.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

Intent parser strips politeness tails (`to me`, `for me`, `please`) from phrases like `explain X to me`.

Intent parser strips politeness tails (`to me`, `for me`, `please`) from phrases like `explain X to me`.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Medium

`walk namespace` now uses per‑namespace entry count when available, falling back only if unknown.

`walk namespace` now uses per‑namespace entry count when available, falling back only if unknown.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

Fixed ReDoS hotspot in `_search_query_tail` regex by splitting into three single‑token patterns with plain Python slicing.

Fixed ReDoS hotspot in `_search_query_tail` regex by splitting into three single‑token patterns with plain Python slicing.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

Orphan bullet rows now inherit virtual parent context from preceding KV row when no active section exists.

Orphan bullet rows now inherit virtual parent context from preceding KV row when no active section exists.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

Trailing whitespace in `tell me about` now yields a clear “Topic Required” error instead of ambiguous article matches.

Trailing whitespace in `tell me about` now yields a clear “Topic Required” error instead of ambiguous article matches.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

`search for` with no terms now returns “Search Terms Required” error.

`search for` with no terms now returns “Search Terms Required” error.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

Reject non‑positive `limit` and negative `offset` in pagination, preventing infinite loops.

Reject non‑positive `limit` and negative `offset` in pagination, preventing infinite loops.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Bugfix Low

Wrap backend “Cannot find entry” errors for `articles related to <nonexistent>` with structured guidance message.

Wrap backend “Cannot find entry” errors for `articles related to <nonexistent>` with structured guidance message.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Refactor Medium

CodeQL fix: lifted assignment of `full_len` above conditional branch to avoid uninitialized warnings.

CodeQL fix: lifted assignment of `full_len` above conditional branch to avoid uninitialized warnings.

Source: llm_adapter@2026-05-21

Confidence: low

Refactor Low

Moved assignment of `full_len` above conditional to satisfy CodeQL uninitialized variable warning.

Moved assignment of `full_len` above conditional to satisfy CodeQL uninitialized variable warning.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Refactor Low

Unified metadata aggregation so `metadata for` and `walk namespace` report consistent totals across archive schemes.

Unified metadata aggregation so `metadata for` and `walk namespace` report consistent totals across archive schemes.

Source: granite4.1:30b@2026-05-22-audit

Confidence: low

Full changelog

Three-pass beta-test of v2.0.0a10 against a 118 GB Wikipedia ZIM
(Feb 2026 snapshot) via the simple-mode zim_query MCP surface. The
first pass surfaced 22 defects + 7 opportunities from live use; the
second and third passes were self-audits of the prior commit, each
finding fewer issues than the last (22 → 6 → 3). Every fix here was
first observed live; the existing 1425-test suite covered none of
them.

The single most user-visible regression is silent text concatenation
inside Wikipedia infoboxes — every city / country article had at
least one corrupted number (5th in Europe1st in Germany,
Berliner(s) (English)Berliner (m), 0.967very high,
TokyoTamaNorthern Izu Islands). A small LLM reading this would
emit those as single tokens. Plus one critical: the a10 DD2 fix
threaded content_offset through the article-paging handler, but
the parameter was never exposed on the MCP tool — the truncation
footer told callers to "pass content_offset=N" via a channel that
didn't exist.

Net: 1463 tests pass (+38 over v2.0.0a10), 50 skipped, 38
deselected. black / isort / flake8 / mypy / CodeQL /
SonarCloud all clean.

Fixed — Critical (post-a10 beta sweep)

  • C1: content_offset unreachable from zim_query. A10's DD2
    threaded options["content_offset"] through
    _fetch_topic_article_body, but the zim_query MCP signature
    never exposed the parameter. The top-level offset arg routes to
    options["offset"] (search / browse pagination), not
    options["content_offset"] (article-body paging). Result: every
    tell me about Photosynthesis truncation footer pointed to a
    paging channel that returned the same page 1. Exposed
    content_offset as a top-level zim_query parameter, validated
    >= 0, threaded through options. Truncation footers on
    truncate_content now report the correct next-page offset
    (Opp4 implemented inline).

Fixed — High (post-a10 beta sweep)

  • H2: tell me about Berlin non-determinism. Berlin and
    Berlin (disambiguation) both strong-matched by the candidate-
    extends-topic rule, so the disambig set fired 2+ → fork between
    the city article and the disambig page. Auto-pick the canonical
    when the strong-match set is exactly Foo + Foo (disambiguation);
    the disambig twin is surfaced as a footer hint on the returned
    body. Genuine multi-meaning topics (Apollo / Mercury / Java) still
    fork as before (Opp1 implemented inline).
  • H3: disambig hides the canonical it should be helping pick.
    tell me about Apollo 11 forked between Apollo_11_anniversaries,
    Apollo_11_lunar_sample_display, Apollo_11_goodwill_messages
    none of which is the canonical Apollo_11. Probe the title index
    for the exact-topic canonical BEFORE the disambig check; prepend
    it to the strong-match list when absent.
  • H4: infobox text-extraction silently concatenates adjacent
    block-level children.
    td.get_text() joined <br>, <li>,
    <span> runs without whitespace. Three-pass evolution:
    first-pass separator=" " mangled inline span groups
    (3,913,6443 , 913 , 644); second-pass _join_cell_text
    helper inserts whitespace at block-tag boundaries only and
    concatenates inline tags directly; third-pass filters
    Comment instances (a NavigableString subclass) so invisible
    formatnum/microformat comments stop leaking as visible text.
  • H5: intent parser preempts on later-occurring keywords.
    tell me about berlin then list namespaces silently ran only
    list namespaces (highest-confidence intent wins). New
    _chained_intent_guidance splits on then / ; / and then
    connectors; if both halves start with a recognised operation
    prefix, return a "split into separate calls" guidance message.
  • H6: orphan bullet rows lose parent context. Berlin's
    **• Summer (DST):** UTC+02:00 rendered without a parent
    because Time zone: was a regular KV (not an infobox-header).
    When a KV row's label starts with a bullet char AND there's no
    active section, treat the previous KV row's label as a virtual
    parent — applied for that row only, doesn't persist into the
    next non-bullet row.

Fixed — Medium (post-a10 beta sweep)

  • M7: show structure of <multi-word title> doesn't normalize.
    D2 in a10 added find_title_match(min_score=0.8) to
    _handle_related; M7 extends the same pattern via the new
    _resolve_natural_language_path helper applied to structure,
    table of contents, links, summary, get section, and
    get article (when the path contains spaces and no namespace
    separator — direct-path lookups stay zero-cost).
  • M8: get section ignores max_content_length. Section text
    was returned in full regardless of the cap. Honor the cap and
    append a one-line truncation footer reporting the original
    length.
  • M9: malformed cursor silent no-op. A base64+JSON token that
    decodes but lacks the expected s envelope (or whose s.o is
    missing/invalid) used to silently degrade to page 1. The contract
    now mirrors the totally-garbled-token case: structured
    cursor_decode error.
  • M10: trailing-whitespace tell me about produces an empty
    topic.
    A query of tell me about with a trailing space fell
    through to a topic of "tell me about" and disambiguated to
    articles titled "Tell Me About Tomorrow". The
    _extract_tell_me_about regex now uses \b + (.*?) so empty
    topics resolve to empty strings; simple_tools rejects with a
    clear "Topic Required" error.
  • M11: explain X to me parses incorrectly. "explain Berlin
    to me" extracted topic "Berlin to me" and returned a memorial
    article. Topic extractor strips to me / for me / please
    politeness tails, loop-until-idempotent so wrapping cases
    (DNA for me please) collapse cleanly.

Fixed — Low (post-a10 beta sweep)

  • L12: trailing-whitespace search for with no terms. Used to
    fall through to searching for the literal word "for". Validate
    the extracted tail before dispatch; surface "Search Terms
    Required".
  • L13: limit=0 nonsensical pagination. Showing 1-0 of N — pass offset=0 for the next page looped on itself. Reject
    non-positive limit and negative offset at the MCP boundary.
  • L15: articles related to <nonexistent> raw error. Wrap the
    backend's "Cannot find entry" with a structured guidance message
    pointing to suggestions for / find article titled /
    search for. (Second-pass F3 added the same hint trio to the
    outbound_error branch in render_related that the live case
    actually surfaces through.)
  • L16: walk namespace denominator misleading. walk namespace M
    with 13 entries used to render of ~27,199,904 archive-wide entries. Prefer a per-namespace denominator when available; fall
    through to the archive total only when no per-namespace count is
    known. Second-pass F4 plumbed namespace_entry_count through
    _build_walk_result so the new of N in namespace X shape
    actually renders.
  • L17 / L18: list namespaces total mismatch, metadata aggregator
    underreports.
    Header now annotates "X archive entries (per-
    namespace sum: Y)" when the two differ. _extract_zim_metadata
    enumerates archive.metadata_keys on new-scheme archives (filtering
    Illustration_* binaries) so metadata for and
    walk namespace M agree on what counts as metadata. Second-pass
    F6 replaced the first-pass hardcoded probe-list extension with
    the enumeration so future archive additions don't reopen the
    disagreement.

Added — Opportunities (post-a10 beta sweep)

  • Opp1: auto-fall-through twin. Implemented inline with H2.
  • Opp2: expanded demote patterns. _LIST_ARTICLE_PREFIX_RE
    picks up Lists_of_* (plural); two new patterns demote
    Listed_* stems and *_discography / *_filmography /
    *_videography / *_bibliography / *_albums / *_singles
    suffixes. tell me about cats returning a Rephlex Records
    discography at rank 2 is the canonical failure this fixes.
  • Opp3: synthesize relevance threshold. New
    _drop_low_relevance_tail cuts hits whose Xapian score is below
    25% of the top hit's. Only applied in xapian_score fallback
    (single-archive); multi-archive RRF keeps all hits because RRF
    normalizes scores. Always keeps at least one hit.
  • Opp4: content_offset in truncation footers. Implemented
    inline with C1 — truncate_content accepts current_offset so
    paginated reads compute the next offset relative to where the
    slice started in the original article. Third-pass F2 added a
    paginatable: bool = True kwarg so the three main-page call sites
    switch to operation-accurate guidance (the main-page surface
    doesn't accept content_offset).
  • Opp5: canonical-exists hint in disambig auto-pick. When the
    H2 auto-fall-through fires, append a _Note: this topic also has a disambiguation page — see ``get article <path>`` for alternate meanings._ footer so the disambiguation stays discoverable.
  • Opp6: intent telemetry on all responses. Every markdown
    response now carries a trailing <!-- intent=foo cert=0.85 -->
    HTML comment. Invisible to humans (HTML comments aren't rendered)
    but visible in the token stream so calling LLMs can branch on the
    parser's classification certainty without parsing the body.
  • Opp7: link-count rank on related articles. When the related-
    articles backend supplies a mention_count, surface it inline as
    - **Title** (path) · N× so a small LLM can rank which related
    article is most central to the source. (Second-pass H1 fixed the
    first-pass typo that read the wrong field name — link_count
    vs mention_count.)

Fixed — Second-pass self-audit findings

A self-audit of the first-pass commit surfaced six defects in the
fixes themselves:

  • D1 second-pass (folded into H4 above). get_text(separator=" ")
    mangled inline-span numeric groups.
  • F3 second-pass (folded into L15 above). Wrapped the wrong
    error path — backend serialises rather than re-raising.
  • F4 second-pass (folded into L16 above). namespace_entry_count
    was renderer-only and never plumbed through the data payload.
  • F6 second-pass (folded into L17/L18 above). Hardcoded probe-
    list extension still drifts; replaced with metadata_keys
    enumeration on new-scheme archives.
  • H1 second-pass (folded into Opp7 above). First-pass read
    link_count from the related-articles result; the backend stores
    the frequency-rank signal as mention_count.
  • C2 perf: title-index probe ran twice on the weak-top-hit path.
    Gated the H3 canonical-probe behind len(strong_matches) >= 2
    (the only condition under which the disambig page would otherwise
    render). Strong-top-hit and weak-then-promoted paths skip the
    second probe entirely. Third-pass extended the gate to also fire
    when the single strong match is itself the disambig twin.

Fixed — Third-pass self-audit findings

A second self-audit found three more defects in the second-pass
commit:

  • D1 third-pass (folded into H4 above). Comment is a
    NavigableString subclass; second-pass _join_cell_text caught
    comments and rendered their bodies as visible text.
  • C2 third-pass. Lone disambig-twin search case bypassed the
    second-pass >= 2 gate; extended the gate to also fire when the
    one strong match is Foo (disambiguation) itself.
  • F2 third-pass (folded into Opp4 above). The second-pass
    truncation hint pointed at a content_offset parameter the
    main-page operation doesn't accept; added a paginatable=False
    kwarg on the three main-page call sites and routed them to
    operation-accurate guidance.

Fixed — Quality gate (PR CI cleanup)

  • CodeQL: full_len may be uninitialized. In the M8 truncation-
    footer code path, full_len = len(text) was assigned only inside
    the truncation if block but referenced in a different (correlated)
    if truncated: block. The correlation was opaque to CodeQL.
    Lifted the assignment above the branch so the variable is always
    defined. No behaviour change.
  • SonarCloud python:S5852 (ReDoS hotspot). The
    _search_query_tail regex had adjacent \s* quantifiers that
    the heuristic flagged as polynomial-backtracking. Split into
    three single-token regexes (verb, optional up for look up,
    optional for connector) with plain-Python tail slicing between
    matches. Each individual pattern has at most one whitespace
    quantifier so the heuristic has nothing to flag. Behaviour
    verified identical across all 1463 tests.

Wire-format / surface changes (alpha-line clean breaks)

  • zim_query accepts a top-level content_offset parameter.
    Existing callers passing only the previous parameters are
    unaffected; new callers paginating long article bodies should
    use content_offset instead of the legacy offset (the latter
    remains the search / browse pagination knob).
  • Every markdown response now carries a trailing
    <!-- intent=... cert=... --> HTML comment
    (Opp6). Invisible
    to humans; callers that token-count or post-process the trailing
    bytes will see two extra tokens per response.
  • Intent-parser chained-query guard returns guidance instead of
    silently dispatching the rightmost intent.
    Callers sending
    X then Y queries that previously got Y's result silently now
    receive a structured "split into separate calls" message.
  • get section honors max_content_length and appends a
    truncation footer.
    Callers that previously got full section
    bodies now receive at most max_content_length bytes plus a
    one-line footer reporting the original length.
  • Cursor with missing/invalid s envelope now errors
    (cursor_decode).
    Callers that previously got silent page-1
    fall-through now receive a structured error.
  • Infobox cells render with intra-cell whitespace at block-tag
    boundaries only.
    Most callers see strictly better text (no
    5th in Europe1st in Germany-style concatenation); inline
    numeric / unit / coordinate microformats remain intact.
  • Synthesize ranking demotes Lists_of_* and *_discography /
    *_filmography / *_albums / *_singles suffixes.
    Citation
    order for queries like cats no longer surfaces a Rephlex
    Records discography in the top half.
  • walk namespace M and metadata for <file> agree on what
    counts as metadata
    (new-scheme archives enumerate
    metadata_keys directly). Old-scheme archives keep the
    hardcoded probe list as a fallback.

Breaking Changes

  • `content_offset` previously unreachable from `zim_query`; now added as a top‑level parameter and validated.

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track cameronrye/openzim-mcp

Get notified when new releases ship.

Sign up free

About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.

All releases →

Related context

Earlier breaking changes

  • v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
  • v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
  • v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
  • v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.
  • v2.0.0a10 Infobox extraction now emits trailing rows without the preceding "GDP —" label, changing bullet-label strings.

Beta — feedback welcome: [email protected]