
cameronrye/openzim-mcp

v2.0.0a16 Bugfix

This release fixes issues for SREs watching stability and regressions.

Published 17d MCP Data & Storage
✓ No known CVEs patched

Topics

kiwix mcp mcp-server openzim zim

Summary

AI summary

Fixed multiple parsing, validation, and response‑format defects affecting disambiguation handling, intent extraction, namespace walking, browsing, suggestions, chained intents, and politeness stripping.

Full changelog

The multi-pass live sweep of a15 against
wikipedia_en_all_maxi_2026-02.zim (~118 GB, ~27.2 M entries) ran
across seven passes. Pass 1 surfaced four user-facing defects (D4 in
the tell_me_about disambig-page handling for Mercury-class bare
titles; D5 in the intent parser's politeness-prefix regex; D6 in
find_by_title's response to namespace-prefixed input; D7 a
schema-consistency gap in walk_namespace). Pass 2 self-audited
every D-fix in both verbose and compact rendering modes and
exercised the canonical-article paths (Berlin / Apollo 11 / Java)
that the disambig-detection logic must not regress. Pass 3 re-tested
across a broader disambig set (Mars, Sun, Moon, Paris, Apollo bare),
walked empty namespaces B / X / Z, and exercised cross-fix
interactions (could you find article titled M/Title); both
passes 2 and 3 found zero new defects. Pass 4 then deliberately
stress-tested the four landed D-fixes from angles the earlier
passes hadn't probed (more bare-title disambigs, pathological
politeness combinations, find_by_title edge cases, walk_namespace
malformed args) AND exercised the intent paths the earlier passes
had barely touched (synthesize, browse namespace, show structure
of, links in, suggestions for, search in namespace); it surfaced
three more defects (P4-D1 / P4-D2 / P4-D3). Pass 5 verified those
three fixes; zero new defects. Pass 6 went deeper. A source-level
audit of every intent handler for the silent-default pattern that
P4-D3 fixed (params.get("X", DEFAULT)) caught the same shape in
_handle_browse, and a parallel audit of every intent extractor
for the trigger-word-capture pattern that P4-D1 fixed caught a
sibling permissiveness in _extract_browse. A leading-politeness
probe then surfaced a third defect (P6-D3): please tell me about X
leaks the leading politeness into the parsed topic,
just like the original D5 did for modal verbs. Pass 7 verified
all ten fixes and audited cumulative regressions across the three
commits; zero new defects.
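
The silent-default pattern that Pass 6 audited for can be illustrated with a minimal sketch. Only the params.get("namespace", "C") shape comes from the notes; the function names, the single-letter validity check, and the error text below are hypothetical stand-ins, not the actual openzim-mcp handlers.

```python
from typing import Any, Mapping


def walk_silent(params: Mapping[str, Any]) -> str:
    """Pre-fix shape: a missing namespace silently becomes "C"."""
    namespace = params.get("namespace", "C")
    return f"walking {namespace}"


def walk_guarded(params: Mapping[str, Any]) -> str:
    """Post-fix shape (assumed): reject missing or malformed input upfront."""
    namespace = params.get("namespace")
    is_single_letter = (
        isinstance(namespace, str)
        and len(namespace) == 1
        and namespace.isascii()
        and namespace.isalpha()
    )
    if not is_single_letter:
        # Hypothetical message; the real structured error carries a rule and examples.
        return "Missing or Invalid Namespace: expected one letter, e.g. 'walk namespace C'"
    return f"walking {namespace}"
```

With an empty params mapping, the first function walks C without any signal to the caller, while the second returns an error the caller can act on.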

Fixed

  • D4: tell me about Mercury no longer attaches a misleading
    _May also refer to: Mercury_Monterey — use tell me about <full title>_ footer to the disambiguation-page body.
    Two cooperating
    bugs: SimpleToolsHandler._is_disambig_lead returned False
    whenever pre_h2 exceeded 400 chars — Mercury's 628-char pre-H2
    (the "most commonly refers to" preamble, three top-level entries,
    and the "may also refer to" header) blew past the cap, so the
    existing disambig-page detection in _lead_with_toc never fired;
    AND the trailing-footer block in _handle_tell_me_about had no
    way to suppress the disambig_twin_path / related_extends_paths
    hints when the resolved body was itself a disambig page. Fixed
    by checking only the trailing 400 characters of pre_h2 (the
    regex-free endswith stays bounded, but long preambles now
    trigger) and by gating both trailing footers on a fresh
    body_is_disambig_page check on the fetched body. Canonical
    pages with disambig twins (Berlin) keep their footer; canonical
    pages with extends-topic siblings (Apollo 11 → anniversaries /
    lunar sample display / goodwill messages) keep their footer.
  • D5: could you tell me about Photosynthesis now parses
    topic = "Photosynthesis" instead of leaking the modal lead-in
    into the topic.
    The verb-prefix regex in
    _extract_tell_me_about anchored at ^\s* and never matched
    "could you" / "can you" / "would you" / "will you", so the whole
    query fell through to the topic = query.strip() fallback and
    downstream relied on the tail-probe entity rescue to find the
    article anyway. Fixed by stripping the modal scaffold
    ((?:could|can|would|will)\s+(?:you|we|i)\s+(?:please\s+)?) before
    the verb regex runs. Leaves non-modal queries unchanged; combines
    cleanly with the existing trailing-politeness strip
    (could you tell me about X please → topic=X).
  • D6: find article titled M/Title now redirects to get article M/Title instead of returning a silent 0_hits. The title index
    only stores titles (M/Title's title is "Title"), so passing a ZIM
    namespace path through the title-lookup backend was guaranteed to
    return nothing — with no signal to the caller that the wrong tool
    was in use. _handle_find_by_title now detects the
    uppercase-letter + slash + non-empty-suffix shape upfront and
    returns a structured Namespace Path, Not a Title message that
    points at both get article <path> (direct lookup) and find article titled <stripped> (title-only fallback). Lowercase
    prefixes (a/b) and titles without the namespace shape pass
    through to the backend unchanged.
  • D7: walk namespace A (and any other empty new-scheme
    namespace) now includes namespace_entry_count: 0 in the
    response.
    The short-circuit at
    openzim_mcp/zim/namespace.py for new-scheme non-C/M/W namespaces
    built an empty result without passing namespace_entry_count to
    _build_walk_result, so the field was omitted entirely while
    walk-M and walk-W (which surface their bounded totals) included
    it. Downstream consumers had to special-case "missing" vs "zero".
    Fixed by passing namespace_entry_count=0 in the short-circuit.
    Updated the walk_A_10 golden to reflect the new schema; walk-M
    and walk-W goldens are unchanged (already carried the field).
  • P4-D1: suggestions for (no actual prefix) now returns the
    structured "Missing Search Term" error instead of silently
    autocompleting against the literal word "for".
    The regex's
    optional (?:for\s+)? group failed to match without trailing
    whitespace, so the mandatory capture greedily swallowed "for"
    itself; the handler's existing missing-arg guard then saw a
    non-empty partial_query and ran the suggestion fallback (which
    spent ~70 s scanning for "for" — a high-frequency English token).
    Fixed in _extract_suggestions by discarding a bare-"for"
    capture so the guard takes over. Legitimate prefixes that happen
    to start with "for" (e.g., suggestions for forest) still work.
  • P4-D2: chained-intent detector no longer bypassed by a modal
    lead-in.
    _chained_intent_guidance's
    _CHAINED_OPERATION_PREFIX_RE is anchored at ^ and only
    recognised operation verbs at position 0, so could you tell me about Photosynthesis then list namespaces shifted the verb past
    the anchor — left_is_op evaluated False, the chain gate failed,
    and the query fell through to normal intent classification where
    the higher-confidence list_namespaces won and silently dropped
    the tell me about half. The D5 modal-strip lives inside
    _extract_tell_me_about; it only runs AFTER the chain detector
    has already decided. Fixed by pre-stripping the same modal
    scaffold ((?:could|can|would|will)\s+(?:you|we|i)\s+(?:please\s+)?) at the top of _chained_intent_guidance so
    detection sees the cleaned query.
  • P4-D3: walk namespace with a malformed argument now returns
    a structured "Missing or Invalid Namespace" error instead of
    silently walking C.
    Multi-char (AB), digit (1), special
    (_), and missing-argument forms all fell through to
    params.get("namespace", "C") in _handle_walk_namespace with
    no signal to the caller that the input was rejected. Sibling
    tools (find_by_title, links_in, suggestions,
    tell_me_about) already return structured missing-arg errors;
    this one didn't. Fixed by adding an upfront guard that mirrors
    their shape (rule / examples) before the C-default kicks in.
  • P6-D1 + P6-D2: browse namespace now reaches input-validation
    parity with walk namespace.
    Two cooperating gaps — the
    handler _handle_browse had the same
    params.get("namespace", "C") silent-default that P4-D3 fixed
    for walk; AND the extractor _extract_browse accepted multi-char,
    digit, and special-character namespace arguments
    (browse namespace AB / 1 / _) without uppercasing lowercase
    input — diverging from the strict
    _extract_walk_namespace. The two siblings now agree: regex
    tightened to namespace\s+['"]?([A-Za-z])\b['"]? with .upper()
    on the captured letter, and the handler returns a structured
    "Missing or Invalid Namespace" error when the extractor produces
    nothing.
  • P6-D3: leading please / kindly now strip cleanly from the
    parsed topic.
    please tell me about Photosynthesis and
    kindly describe Photosynthesis previously parsed with the
    politeness phrase leaking into the topic — same shape as the
    pass-1 D5 defect but for non-modal politeness words. The article
    still resolved via tail-probe rescue, but the parsed topic was
    wrong. The fix extends the leading-strip in _extract_tell_me_about
    to cover please / kindly AND wraps both the modal-strip and
    the politeness-strip in a loop so composite phrases
    (please could you tell me about X, please please tell me about X) peel cleanly. The same loop is also applied to the
    chain detector's _chained_intent_guidance pre-strip so leading
    politeness doesn't bypass chain detection (mirror of P4-D2).
    Leaves the existing trailing-politeness strip alone, so
    tell me about X please still works, and the leading-only
    anchor (^\s*) prevents stripping mid-query mentions of
    please / kindly that are legitimately part of the topic.
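
The leading-strip loop described under D5 and P6-D3 might look roughly like the sketch below. The modal scaffold regex is the one quoted in the notes; the please / kindly pattern, the case-insensitive flag, and the function name are assumptions made for illustration, not the project's actual code.

```python
import re

# Modal scaffold as quoted in the notes, anchored at the start of the query.
_MODAL_RE = re.compile(
    r"^\s*(?:could|can|would|will)\s+(?:you|we|i)\s+(?:please\s+)?",
    re.IGNORECASE,
)
# Leading politeness words; the exact pattern is an assumption.
_POLITE_RE = re.compile(r"^\s*(?:please|kindly)\s+", re.IGNORECASE)


def strip_leading_scaffold(query: str) -> str:
    """Peel modal and politeness lead-ins until the query stops changing."""
    while True:
        stripped = _MODAL_RE.sub("", query, count=1)
        stripped = _POLITE_RE.sub("", stripped, count=1)
        if stripped == query:
            return query
        query = stripped


print(strip_leading_scaffold("please could you tell me about Photosynthesis"))
```

Because both patterns are anchored at the start, a trailing please or a mid-query please is left for later stages to handle, which matches the behaviour the notes describe.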

Tests

  • tests/test_post_a15_beta_fixes.py — 80 regression tests
    pinning all ten defects. Each defect gets:
    • The fix-case test (Mercury body has no misleading trailer;
      could you tell me about X parses topic=X; find article titled M/Title returns redirect; _build_walk_result exposes the
      zero-count field; suggestions for triggers the missing-arg
      guard; could you tell me about X then list namespaces is
      detected as chained; walk namespace AB returns the missing-namespace
      error; browse namespace AB returns the same error
      and browse namespace c uppercases to "C"; please tell me about X strips cleanly).
    • Negative self-audit cases (Berlin keeps its disambig-twin
      footer; non-modal queries unchanged; lowercase a/b not
      redirected by find_by_title; namespace_entry_count omitted
      when caller passes None; legitimate suggestions for forest
      still captures the prefix; non-chained could you tell me about X not tripped by the chain detector; trailing please still
      works; mid-query please in linguistics not stripped).
    • Cross-defect probes (Java disambig body suppresses
      disambig_twin_path footer too; please could you tell me about X peels both layers; please tell me about X then list namespaces trips chain detector).
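
As a rough illustration of the fix-case / negative-case structure, here is a sketch of regression tests for the browse-namespace extractor. The regex is the one quoted under P6-D1 + P6-D2; the extractor function and the test names are hypothetical and are not taken from tests/test_post_a15_beta_fixes.py.

```python
import re
from typing import Optional

# Regex as quoted in the notes; the wrapper function name is hypothetical.
_BROWSE_NS_RE = re.compile(r"""namespace\s+['"]?([A-Za-z])\b['"]?""")


def extract_browse_namespace(query: str) -> Optional[str]:
    """Return the uppercased single-letter namespace, or None if absent or malformed."""
    match = _BROWSE_NS_RE.search(query)
    return match.group(1).upper() if match else None


def test_fix_case_rejects_malformed_namespace() -> None:
    # Multi-char, digit, and special-character arguments produce nothing.
    for bad in ("browse namespace AB", "browse namespace 1", "browse namespace _"):
        assert extract_browse_namespace(bad) is None


def test_fix_case_uppercases_lowercase_input() -> None:
    assert extract_browse_namespace("browse namespace c") == "C"


def test_negative_valid_namespace_unchanged() -> None:
    assert extract_browse_namespace("browse namespace M") == "M"
```

The functions use plain assert statements, so they can be called directly or collected by a test runner such as pytest.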


About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.


Related context

Earlier breaking changes

  • v2.0.0a15 _attribute_sections falls back to the first section when no section brackets the located passage
  • v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
  • v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
  • v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
  • v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.
