cameronrye/openzim-mcp

v2.0.0a10 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 2mo MCP Data & Storage

✓ No known CVEs patched

Topics

kiwix mcp mcp-server openzim zim

Affected surfaces

auth rbac

ReleasePort's take

Light signal

editorial:auto 2mo

Metadata APIs now correctly return ZIM metadata instead of silently serving article bodies, and cursor mismatches are rejected.

Why it matters: Upgrade to v2.0.0a10 if your service relies on accurate metadata responses; the fix stops metadata requests from silently returning article bodies, and cursors reused across unrelated queries are now rejected instead of paginating the wrong query.

Summary

AI summary

Metadata APIs now return correct ZIM metadata instead of silently aliased article bodies, and cursor mismatches are rejected.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Security	Medium	Route M/<key> paths to get_metadata_item for new-scheme archives.	llm_adapter@2026-05-21	high	—
Security	Low	Replace regex-based disambiguation lead detection with simple string endswith check to avoid ReDoS risk.	granite4.1:30b@2026-05-23-audit	low	—
Breaking	Medium	Infobox extraction now emits trailing rows without the preceding "GDP —" label, changing bullet-label strings.	granite4.1:30b@2026-05-23-audit	low	—
Breaking	Medium	`metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.	granite4.1:30b@2026-05-23-audit	low	—
Breaking	Medium	`get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.	granite4.1:30b@2026-05-23-audit	low	—
Feature	Low	Add stopword-saturation footer for queries matching many stopwords.	granite4.1:30b@2026-05-23-audit	low	—
Feature	Low	Update truncation hint to operation-agnostic guidance, removing self-reference.	granite4.1:30b@2026-05-23-audit	low	—
Feature	Low	Preserve inline disambiguation list in "tell me about" leads when pattern detected.	granite4.1:30b@2026-05-23-audit	low	—
Feature	Low	Demote List_of_, Index_of_, Outline_of_, Timeline_of_ articles in synthesize ranking after title promotion.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Medium	Probes title index first before literal path in related articles handler.	llm_adapter@2026-05-21	high	—
Bugfix	Medium	Retry find_title_match with min_score=0.8 after strict gate failure for typo queries.	llm_adapter@2026-05-21	high	—
Bugfix	Medium	Use get_metadata_item in _extract_zim_metadata for new-scheme archives.	llm_adapter@2026-05-21	high	—
Bugfix	Medium	Reset current_section on KV rows with mergedtoprow after first row emitted.	llm_adapter@2026-05-21	low	—
Bugfix	Low	Compute closest_match hint locally for "get section X of Y" natural-language path to suggest correct section name.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Low	Append scan_truncated footer when related articles scan cap fires in markdown rendering.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Low	Prepend canonical bare-title article when suggestions miss it, using a single SuggestionSearcher round trip.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Low	Mirror well-known namespace entries in walk_namespaceData to match listNamespaces output.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Low	Reject cursor when s.q shares no meaningful tokens with current query, preventing wrong-query pagination.	granite4.1:30b@2026-05-23-audit	low	—
Bugfix	Low	Trim _splice_title_match_into_search results to requested limit and update returned_count.	granite4.1:30b@2026-05-23-audit	low	—
Refactor	Low	Split five complex functions to reduce cognitive complexity per SonarCloud limits.	granite4.1:30b@2026-05-23-audit	low	—
Refactor	Low	Extract duplicate literals (MIME prefix, pseudo-namespace strings) into constants with shared helper.	granite4.1:30b@2026-05-23-audit	low	—

Full changelog

Two-pass beta-test of v2.0.0a9 against a 118 GB Wikipedia ZIM (Feb
2026 snapshot), plus a self-review code-reviewer audit and a
SonarCloud Quality Gate cleanup. The first pass exercised the
markdown surface; the second pass audited the first-pass fixes and
extended live testing to surfaces not covered the first time. Several
recently-shipped backend features turned out to be unreachable from
the natural-language surface, several handlers had silent fall-through
bugs on common phrasings, and one libzim quirk (silent namespace-prefix
stripping) was masking the entire metadata API.

Net: 1425 tests pass (+5 over v2.0.0a9), 50 skipped, 38 deselected.
Live-verified key fixes against the real Wikipedia archive via
in-process ZimOperations calls.

Fixed — Critical (post-a9 beta sweep)

D1: infobox section-context leakage on every Wikipedia city /
country. Berlin and Tokyo (and the broad city-template family)
produced trailing rows labelled **GDP — Time zone:**,
**GDP — Vehicle registration:**, **GDP — Website:**,
**GDP — HDI (2022):** — clearly wrong. The post-a8 #2/Op5
parent-context fix correctly tracked current_section from
<th class="infobox-header"> rows but never reset it; trailing
free-floating rows (which Wikipedia marks <tr class="mergedtoprow">)
inherited the last header. Reset current_section on KV rows whose
<tr> carries mergedtoprow AND only after at least one row has
been emitted under the current section — the second guard is the
third-pass fix, without which the reset stripped section context
from the first KV row inside a section header (Wikipedia uses
mergedtoprow on those too as the visual group lead). Both edges
covered by new regression tests.
D7: M/<key> paths silently aliased to C-namespace articles.
libzim's archive.get_entry_by_path("M/Title") strips the M/
prefix and resolves to the C-namespace article with that name;
get article M/Title against a Wikipedia ZIM returned the 172 KB
disambiguation article on "Title" instead of the metadata entry.
Route M/<key> paths to archive.get_metadata_item on new-scheme
archives so the proper metadata API serves these requests. Verified:
M/Title now returns "Wikipedia", M/Date returns "2026-02-15".
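
The two-guard reset described in D1 can be sketched as follows. This is a minimal illustration, not the project's extractor: the `Row` type, the `label_rows` function, and the idea of returning `(section, label)` pairs are all hypothetical stand-ins for whatever the real infobox parser uses. Only the `mergedtoprow` class and the `current_section` variable name come from the changelog.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    """Hypothetical, simplified view of one infobox <tr>."""
    kind: str                      # "header" or "kv"
    label: str
    classes: frozenset = frozenset()


def label_rows(rows: List[Row]) -> List[Tuple[Optional[str], str]]:
    """Pair each key/value row with the section it belongs to (or None)."""
    current_section: Optional[str] = None
    emitted_in_section = 0
    out: List[Tuple[Optional[str], str]] = []
    for row in rows:
        if row.kind == "header":
            # A section header starts a new group; nothing emitted under it yet.
            current_section = row.label
            emitted_in_section = 0
            continue
        # Guard 1: the row carries mergedtoprow.
        # Guard 2: at least one row was already emitted under this section,
        # so the group-lead row right after a header keeps its context.
        if "mergedtoprow" in row.classes and emitted_in_section > 0:
            current_section = None
        out.append((current_section, row.label))
        if current_section is not None:
            emitted_in_section += 1
    return out
```

With a `GDP` header followed by a `mergedtoprow` lead row, a plain row, and then a trailing `mergedtoprow` row such as `Time zone`, the first two rows stay under `GDP` and the trailing rows come out with no section.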

Fixed — High (post-a9 beta sweep)

D2: articles related to <topic> failed on natural phrasings.
The intent parser hands the topic verbatim from the user's query
(articles related to United States → United States), but the
underlying entry path stores spaces as underscores
(United_States). The handler called get_related_articles_data
with the unresolved string and surfaced "Cannot find entry". The
handler now probes the title index via find_title_match(min_score=0.8)
first and falls through to the literal path only when no canonical
title resolves.
D3: tell me about <typo> skipped the typo-tolerant title
fallback. The first-pass title promotion required score 1.0;
single-edit typos resolve at score 0.85 via _find_entry_typo_fallback.
tell me about Photosythesis (missing n) fell through to Xapian
search and returned International Year of Chemistry —
actively misleading. Retry find_title_match(min_score=0.8) after
the strict gate fails; same conservative typo chain
(length-gated at ≥ 5 chars, ≤ 700 variants).
DD1: metadata for <file> aggregator returned 172 KB article
bodies for new-scheme archives. D7 fixed the per-entry
get article M/Title surface but _extract_zim_metadata
(a separate code path used by the metadata for aggregator) was
still calling get_entry_by_path("M/Title") and getting the same
silently-aliased C-namespace article. Now uses get_metadata_item
for new-scheme archives, with old-scheme get_entry_by_path
fallback. Verified: Title returns "Wikipedia", Description
returns "The free encyclopedia", Language returns "eng" (was
172 K / 60 K / 364 K-char garbage respectively).
DD2: tell me about ignored content_offset. The handler
hard-coded offset = 0 in the body fetch, so callers paginating a
148 KB Photosynthesis article through zim_query couldn't reach
the tail without dropping to a separate get article <path> call.
Threaded options.get("content_offset", 0) through; suppress the
compact-mode lead-with-TOC step when reading mid-article.
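
The strict-then-relaxed title gate behind D2 and D3 can be illustrated with a small sketch. Everything here is an assumption made for illustration: `find_title_match_sketch` and `resolve_topic` are hypothetical names, and `difflib.SequenceMatcher` is swapped in as the scorer, so the scores will not match the project's own (the changelog reports 0.85 for a single-edit typo; this stand-in scores it differently).

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def find_title_match_sketch(
    query: str, titles: Iterable[str], min_score: float
) -> Optional[str]:
    """Return the best-scoring title, or None if it scores below min_score."""
    # Entry paths store spaces as underscores, so normalise the query first.
    needle = query.strip().replace(" ", "_").lower()
    best_title, best_score = None, 0.0
    for title in titles:
        score = SequenceMatcher(None, needle, title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_score else None


def resolve_topic(query: str, titles: Iterable[str]) -> Optional[str]:
    """Strict gate first (exact match), then one retry with a relaxed gate."""
    titles = list(titles)
    strict = find_title_match_sketch(query, titles, min_score=1.0)
    if strict is not None:
        return strict
    return find_title_match_sketch(query, titles, min_score=0.8)
```

Under these assumptions, `resolve_topic("United States", ...)` passes the strict gate once spaces are normalised, while `resolve_topic("Photosythesis", ...)` fails it and resolves on the relaxed retry instead of falling through to full-text search.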

Fixed — Medium (post-a9 beta sweep)

D4: get section X of Y natural-language error path dropped the
closest_match hint. The structured get_section operation
computes a difflib-based closest-match (Op5 from a8) but the
natural-language handler reimplemented section lookup against the
headings list and never queried that operation. Compute the same
hint locally so get section Goegraphy of Berlin now suggests
"Did you mean Geography?".
D5: articles related to <hub> markdown dropped the
scan_truncated signal. The a9 #A5 backend addition surfaced
scan_truncated / scan_total_internal / _meta.reason for hub
articles whose 500-link scan cap fired, but compact_renderers.render_related
ignored all of it. Append a footer when the signal is set.
D6: suggestions for X missed the canonical bare-title article.
suggestions for Photosyn returned 15 results, none of which was
bare Photosynthesis — both libzim's SuggestionSearcher and
Xapian rank disambiguator-bearing variants
(Photosynthesis (song), Photosynthetic_efficiency) above the
short canonical title. Probe SuggestionSearcher for parenthesised
siblings (foo_(suffix)) and prepend the un-suffixed root path
when the archive resolves it. The third-pass refactor restructured
this to share a single SuggestionSearcher.suggest() round trip
with Strategy 2, so the cold path stays at one title-index probe.
D8: walk namespace W returned zero entries while
list namespaces claimed W had two. The two operations
contradicted each other on the same archive. The W-namespace
well-known entries (mainPage, favicon) live on the
archive.main_entry / has_illustration API, not the iterable
surface that walk_namespace_data falls back to. Mirror the same
probe pair _add_new_scheme_well_known_namespace already uses
for the namespace listing. Also fix the "entries 1-0" off-by-one
in the header rendered for an empty walk.
D9: cursor s.q field silently ignored — wrong-query
pagination. Cursor reused across queries silently paginated the
new query at the old offset. Reject with a cursor_decode error
when s.q shares no meaningful (≥ 3-char) tokens with the current
query. Falls back to a bidirectional substring check for cursors
whose stored query has only short tokens. Three regression tests
cover the unrelated-query reject, the shortened-query accept, and
the overlapping-tokens accept.
DD4: _splice_title_match_into_search returned limit + 1
results. Prepending the canonical synthetic result didn't trim
back to the requested limit; limit=3 produced 4 results with
header "showing 1-4". Trim to page_info.limit and update
page_info.returned_count so the header matches the row count.
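
The D9 cursor check can be sketched roughly as below. The function name `cursor_matches_query`, the tokenisation by `\w+`, and the handling of empty queries are assumptions; the changelog only specifies the ≥ 3-character token overlap and the bidirectional substring fallback for stored queries that have only short tokens.

```python
import re
from typing import Set


def _meaningful_tokens(query: str) -> Set[str]:
    """Lower-cased word tokens of at least three characters."""
    return {t for t in re.findall(r"\w+", query.lower()) if len(t) >= 3}


def cursor_matches_query(stored_q: str, current_q: str) -> bool:
    """Decide whether a cursor minted for stored_q may page current_q."""
    stored = _meaningful_tokens(stored_q)
    if stored:
        # Normal case: accept only when at least one meaningful token overlaps.
        return bool(stored & _meaningful_tokens(current_q))
    # Fallback: the stored query has only short tokens, so compare the
    # whole strings as substrings in either direction.
    a, b = stored_q.strip().lower(), current_q.strip().lower()
    if not a or not b:
        return a == b
    return a in b or b in a
```

A caller would reject the cursor (the changelog describes a `cursor_decode` error) whenever this returns `False`. The three regression cases listed above map to an unrelated query (rejected), a shortened query (accepted), and a query with overlapping tokens (accepted).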

Added — Opportunities (post-a9 beta sweep)

O2: stopword-saturation footer on search. Queries that match
≥ 1 M results (the stopword-only query "search for the and a is in to"
saturates at ~5 M) now carry a footer noting that top hits are
ranked by general document importance, not topic relevance, so
the model doesn't treat the "Found N matches" count as
meaningful.
O3: truncation hint no longer self-references. The previous
hint suggested show structure of <path> as the recovery —
silly when the truncated response IS the show-structure (or
table-of-contents) output. Replaced with operation-agnostic
guidance (page via cursor / tighten query / compact=False).
O4: disambiguation page leads preserve their inline list.
tell me about Martin previously truncated to
"**Martin** may refer to:" with no list, forcing a show structure
round-trip. Detect "X may refer to:" leads and skip the H2 cut so
the disambig list stays inline.
O5: synthesize demotes List_of_* / Index_of_* /
Outline_of_* / Timeline_of_* etc. These articles ranked
surprisingly high in synthesize because their bodies match many
query tokens but the actual content is just an enumeration stub.
Demote to the back of top_n AFTER title promotion runs (demoting
before regressed the promotion's strong-match guard, which would
treat Berlin_(disambiguation) as a match for Berlin).
O6: docstring notes distinguish show structure (flat heading
list) from table of contents (nested children tree).
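
The O5 demotion amounts to a stable partition of the ranked list, applied after title promotion has run. The sketch below assumes results are bare entry paths and uses only the four prefixes named above (the changelog's "etc." suggests the real list is longer); `demote_enumerations` is a hypothetical name.

```python
from typing import List, Sequence

# Prefixes named in the changelog; the project's actual list may be longer.
_ENUMERATION_PREFIXES = ("List_of_", "Index_of_", "Outline_of_", "Timeline_of_")


def demote_enumerations(ranked_paths: Sequence[str]) -> List[str]:
    """Move enumeration-style articles to the back, keeping relative order.

    Intended to run AFTER title promotion, so a promoted exact-title match
    is already at the front before anything is demoted.
    """
    regular = [p for p in ranked_paths if not p.startswith(_ENUMERATION_PREFIXES)]
    demoted = [p for p in ranked_paths if p.startswith(_ENUMERATION_PREFIXES)]
    return regular + demoted
```

Because both list comprehensions preserve input order, articles that are not demoted keep the ranking the search backend gave them.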

Fixed — Code-reviewer audit findings (post-first-pass)

A feature-dev:code-reviewer agent audited the first-pass commit and
surfaced three real defects in the original fixes:

A1 (the second guard on D1, listed under Critical above).
A2: D6 ran SuggestionSearcher twice on the cold path. When
Strategy 1 returned empty, both the canonical probe AND Strategy 2
opened independent SuggestionSearcher instances against the same
archive. The first-pass "skip canonical probe when Strategy 1
empty" fix regressed the empty-Strategy-1 case (the canonical
probe IS needed when Xapian misses). Restructured to share a
single SuggestionSearcher.suggest() round trip via an optional
result_paths= parameter on _find_canonical_prefix_match.
A3 (the token-overlap rewrite of D9, listed under Medium above).

Fixed — Quality gate (SonarCloud third-pass cleanup)

5 cognitive-complexity reductions (S3776). Five functions added
by the beta-test commits crossed SonarCloud's complexity-15 limit.
Each was split into self-contained helpers without behaviour
change: _find_canonical_prefix_match (53 → split into 5 helpers
for path probing, root extraction, entry resolution, and the two
ranking strategies), _handle_tell_me_about (19 → 17 → ~14 over
two passes via _promote_topic_via_title_index and
_fetch_topic_article_body), render_related (17 → ~10 via
_render_related_link_line + _scan_truncated_footer),
render_walk_namespace (19 → ~12 via _walk_namespace_header),
and _get_metadata_entry (18 → ~13 via _decode_metadata_content).
4 duplicate-literal extractions (S1192). The "text/" MIME prefix
had three call sites in zim/content.py; "File:" / "Category:" /
"Template:" each had three call sites in zim/search.py. Extracted
to _TEXT_MIME_PREFIX and a _PSEUDO_NAMESPACE_* constant trio
with a shared _is_pseudo_namespace_entry(extended=) helper.
1 ReDoS hotspot (S5852). The O4 disambig-lead-detection regex
\bmay\s+(?:also\s+)?refer\s+to\s*:?\s*$ was flagged for nested
unbounded quantifiers. Not actually catastrophic on Python's re
engine, but replaced anyway with a normalised
str.endswith(("may refer to", "may also refer to")) check — same
behaviour, no regex engine, and the phrase list is easier to
extend.
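
A possible shape for the regex-free disambiguation check is shown below. The suffix tuple is taken from the changelog; the normalisation steps (lower-casing, collapsing whitespace, stripping a trailing colon) and the name `is_disambiguation_lead` are assumptions about what "normalised" means here.

```python
_DISAMBIG_SUFFIXES = ("may refer to", "may also refer to")


def is_disambiguation_lead(lead: str) -> bool:
    """True when the lead text ends with a 'may refer to' style phrase."""
    # Collapse all whitespace runs to single spaces and lower-case the text,
    # then drop any trailing colon or space before the suffix comparison.
    normalised = " ".join(lead.lower().split()).rstrip(": ")
    return normalised.endswith(_DISAMBIG_SUFFIXES)
```

This sketch is not character-for-character equivalent to the original pattern: it has no word-boundary check before "may", and it tolerates more than one trailing colon. For lead sentences of the form "X may refer to:" the two behave the same.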

Wire-format / surface changes (alpha-line clean breaks)

Infobox extraction labels for trailing rows change. Berlin /
Tokyo terminal rows that previously emitted as GDP — Time zone
now emit as Time zone. Callers parsing the bullet-prefix
structure see different label strings.
metadata for <file> now returns short metadata strings
instead of 172 KB article-body excerpts. Wire-format compatible
(same keys); content is the actual ZIM metadata (Title =
"Wikipedia", Date = "2026-02-15", etc.).
get article M/<key> now returns the ZIM metadata entry
instead of the silently-aliased C-namespace article body.
Wire-format compatible (same response envelope); content differs.
_splice_title_match_into_search trims to the requested
limit. Callers receiving limit + 1 results will now get exactly
limit.
Cursor with mismatched s.q now errors. Callers that
previously got silent wrong-query results now receive a
cursor_decode ToolErrorPayload.
Synthesize ranking demotes list articles. Citation order for a
query like Quantum mechanics no longer includes
List_of_textbooks_… in the top half.
Truncation hint footer text changed (O3). Callers parsing the
trailing prose see different wording.
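
For callers adapting to the M/<key> change, the routing can be pictured with the sketch below. It is duck-typed against an archive object exposing `has_new_namespace_scheme`, `get_metadata_item`, and `get_entry_by_path`, which is how I understand the python-libzim `Archive` API; treat those attribute names, the UTF-8 decoding, and the `read_metadata_path` helper itself as assumptions rather than the project's code.

```python
from typing import Any


def read_metadata_path(archive: Any, path: str) -> str:
    """Resolve an 'M/<key>' path to the metadata value as text."""
    if not path.startswith("M/"):
        raise ValueError(f"not a metadata path: {path!r}")
    key = path[len("M/"):]
    if getattr(archive, "has_new_namespace_scheme", False):
        # New-scheme archives: go through the metadata API so the M/ prefix
        # is not silently stripped and resolved as a C-namespace article.
        item = archive.get_metadata_item(key)
    else:
        # Old-scheme archives: the literal path lookup is the fallback.
        item = archive.get_entry_by_path(path).get_item()
    # Item content is assumed to be a bytes-like buffer (e.g. a memoryview).
    return bytes(item.content).decode("utf-8", errors="replace")
```

Callers that previously post-processed a large article body from these paths should expect a short string such as a title or date instead.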

Investigated and deferred

Pseudo-namespace pollution in default search results
(Portal: / User: / Help:). Filtering pseudo-namespace
articles from default search is too opinionated; some callers
legitimately want them. The canonical-promotion already pushes
the real article to rank 1 in the common case (live-verified:
search for biology → Biology at #1 via (canonical title match)). Revisit if the canonical-promotion fallback proves
insufficient.

Breaking Changes

Cursor validation: mismatched `s.q` now raises a `cursor_decode` error instead of silent wrong‑query pagination.

About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.

Earlier breaking changes

v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
v2.0.0a9 HTTP rate-limiter client_id now derived from token or IP; defaults to "default" fallback.