This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: Metadata APIs now correctly return ZIM metadata instead of silently serving article bodies, and cursor mismatches are rejected.
Why it matters: Upgrade to v2.0.0a10 if your service relies on accurate metadata responses; the fix stops metadata requests from silently returning article bodies and rejects cursors reused across unrelated queries.
Summary
AI summary: Metadata APIs now return correct ZIM metadata instead of silently aliased article bodies, and cursor mismatches are rejected.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Security | Medium | Route `M/<key>` paths to `get_metadata_item` for new-scheme archives. Source: llm_adapter@2026-05-21. Confidence: high. | — |
| Security | Low | Replace regex-based disambiguation lead detection with a simple string `endswith` check to avoid ReDoS risk. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Breaking | Medium | Infobox extraction now emits trailing rows without the preceding "GDP —" label, changing bullet-label strings. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Breaking | Medium | `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Breaking | Medium | `get article M/<key>` now returns the ZIM metadata entry rather than the aliased C-namespace article body. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Feature | Low | Add stopword-saturation footer for queries matching many stopwords. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Feature | Low | Update truncation hint to operation-agnostic guidance, removing self-reference. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Feature | Low | Preserve inline disambiguation list in "tell me about" leads when the pattern is detected. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Feature | Low | Demote `List_of_*`, `Index_of_*`, `Outline_of_*`, `Timeline_of_*` articles in synthesize ranking after title promotion. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Medium | Probe the title index before the literal path in the related-articles handler. Source: llm_adapter@2026-05-21. Confidence: high. | — |
| Bugfix | Medium | Retry `find_title_match` with `min_score=0.8` after strict gate failure for typo queries. Source: llm_adapter@2026-05-21. Confidence: high. | — |
| Bugfix | Medium | Use `get_metadata_item` in `_extract_zim_metadata` for new-scheme archives. Source: llm_adapter@2026-05-21. Confidence: high. | — |
| Bugfix | Medium | Reset `current_section` on KV rows with `mergedtoprow` after the first row is emitted. Source: llm_adapter@2026-05-21. Confidence: low. | — |
| Bugfix | Low | Compute `closest_match` hint locally for the "get section X of Y" natural-language path to suggest the correct section name. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Low | Append `scan_truncated` footer when the related-articles scan cap fires in markdown rendering. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Low | Prepend canonical bare-title article when suggestions miss it, using a single `SuggestionSearcher` round trip. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Low | Mirror well-known namespace entries in `walk_namespace_data` to match `list namespaces` output. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Low | Reject cursor when `s.q` shares no meaningful tokens with the current query, preventing wrong-query pagination. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Bugfix | Low | Trim `_splice_title_match_into_search` results to the requested limit and update `returned_count`. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Refactor | Low | Split five complex functions to reduce cognitive complexity per SonarCloud limits. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Refactor | Low | Extract duplicate literals (MIME prefix, pseudo-namespace strings) into constants with a shared helper. Source: granite4.1:30b@2026-05-23-audit. Confidence: low. | — |
| Other | Low | Metadata aggregator (`metadata for <file>`) now returns correct metadata strings instead of article bodies. Source: llm_adapter@2026-05-21. Confidence: low. | — |
Full changelog
Two-pass beta-test of v2.0.0a9 against a 118 GB Wikipedia ZIM (Feb
2026 snapshot), plus a self-review code-reviewer audit and a
SonarCloud Quality Gate cleanup. The first pass exercised the
markdown surface; the second pass audited the first-pass fixes and
extended live testing to surfaces not covered the first time. Several
recently-shipped backend features turned out to be unreachable from
the natural-language surface, several handlers had silent fall-through
bugs on common phrasings, and one libzim quirk (silent namespace-prefix
stripping) was masking the entire metadata API.
Net: 1425 tests pass (+5 over v2.0.0a9), 50 skipped, 38 deselected.
Live-verified key fixes against the real Wikipedia archive via
in-process ZimOperations calls.
Fixed — Critical (post-a9 beta sweep)
- D1: infobox section-context leakage on every Wikipedia city /
country. Berlin and Tokyo (and the broad city-template family)
produced trailing rows labelled `**GDP — Time zone:**`,
`**GDP — Vehicle registration:**`, `**GDP — Website:**`, and
`**GDP — HDI (2022):**`, which is clearly wrong. The post-a8 #2/Op5
parent-context fix correctly tracked `current_section` from
`<th class="infobox-header">` rows but never reset it; trailing
free-floating rows (which Wikipedia marks `<tr class="mergedtoprow">`)
inherited the last header. Reset `current_section` on KV rows whose
`<tr>` carries `mergedtoprow`, and only after at least one row has
been emitted under the current section. The second guard is the
third-pass fix, without which the reset stripped section context
from the first KV row inside a section header (Wikipedia uses
`mergedtoprow` on those too as the visual group lead). Both edges are
covered by new regression tests.
- D7: `M/<key>` paths silently aliased to C-namespace articles.
libzim's `archive.get_entry_by_path("M/Title")` strips the `M/`
prefix and resolves to the C-namespace article with that name;
`get article M/Title` against a Wikipedia ZIM returned the 172 KB
disambiguation article on "Title" instead of the metadata entry.
Route `M/<key>` paths to `archive.get_metadata_item` on new-scheme
archives so the proper metadata API serves these requests. Verified:
`M/Title` now returns `"Wikipedia"`, `M/Date` returns `"2026-02-15"`.
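
The two guards described in D1 can be illustrated with a minimal sketch. This is not the project's extractor: the row dictionaries, the `label_infobox_rows` name, and the `emitted_in_section` counter are hypothetical stand-ins for the parsed `<tr>` elements, and the separator is copied from the labels quoted above.

```python
from typing import Optional

# Separator copied from the labels quoted in D1 (space, em dash, space).
SEP = " \u2014 "


def label_infobox_rows(rows: list[dict]) -> list[str]:
    """Label key/value rows with their section, resetting on trailing rows.

    Each row is a hypothetical simplified dict:
      {"kind": "header", "text": ...} or
      {"kind": "kv", "key": ..., "mergedtoprow": bool}
    """
    labels: list[str] = []
    current_section: Optional[str] = None
    emitted_in_section = 0

    for row in rows:
        if row["kind"] == "header":
            current_section = row["text"]
            emitted_in_section = 0
            continue

        # Guard 1: the row carries mergedtoprow.
        # Guard 2: at least one row was already emitted under this section,
        # so the section's own group-lead row keeps its context.
        if row.get("mergedtoprow") and emitted_in_section > 0:
            current_section = None

        if current_section is not None:
            labels.append(f"{current_section}{SEP}{row['key']}")
            emitted_in_section += 1
        else:
            labels.append(row["key"])
    return labels
```

With a `GDP` header followed by a `mergedtoprow` group lead, one ordinary row, and two trailing `mergedtoprow` rows, the first two rows keep the `GDP` prefix and the trailing rows come out bare.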
Fixed — High (post-a9 beta sweep)
- D2: `articles related to <topic>` failed on natural phrasings.
The intent parser hands the topic verbatim from the user's query
(`articles related to United States` → `United States`), but the
underlying entry path stores spaces as underscores
(`United_States`). The handler called `get_related_articles_data`
with the unresolved string and surfaced "Cannot find entry". Now
probes the title index via `find_title_match(min_score=0.8)` first
and falls through to the literal path only when no canonical title resolves.
- D3: `tell me about <typo>` skipped the typo-tolerant title
fallback. The first-pass title promotion required score 1.0;
single-edit typos resolve at score 0.85 via `_find_entry_typo_fallback`.
`tell me about Photosythesis` (missing `n`) fell through to Xapian
search and returned `International Year of Chemistry`, which is
actively misleading. Retry `find_title_match(min_score=0.8)` after
the strict gate fails; same conservative typo chain
(length-gated at ≥ 5 chars, ≤ 700 variants).
- DD1: `metadata for <file>` aggregator returned 172 KB article
bodies for new-scheme archives. D7 fixed the per-entry
`get article M/Title` surface, but `_extract_zim_metadata`
(a separate code path used by the `metadata for` aggregator) was
still calling `get_entry_by_path("M/Title")` and getting the same
silently-aliased C-namespace article. Now uses `get_metadata_item`
for new-scheme archives, with an old-scheme `get_entry_by_path`
fallback. Verified: `Title` returns `"Wikipedia"`, `Description`
returns `"The free encyclopedia"`, `Language` returns `"eng"` (was
172 K / 60 K / 364 K-char garbage respectively).
- DD2: `tell me about` ignored `content_offset`. The handler
hard-coded offset = 0 in the body fetch, so callers paginating a
148 KB Photosynthesis article through `zim_query` couldn't reach
the tail without dropping to a separate `get article <path>` call.
Threaded `options.get("content_offset", 0)` through; suppress the
compact-mode lead-with-TOC step when reading mid-article.
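
The routing described in D7 and DD1 might look roughly like the sketch below. It runs against small stand-in classes rather than a real archive so it stays self-contained. `get_metadata_item` and `get_entry_by_path` are the calls named above; the `has_new_namespace_scheme` attribute, the `content` / `get_item()` shapes, and the `read_metadata` helper are assumptions made for illustration, not the project's code.

```python
class FakeItem:
    """Stand-in for an archive item exposing raw bytes via .content."""

    def __init__(self, content: bytes):
        self.content = content


class FakeEntry:
    """Stand-in for an archive entry that resolves to an item."""

    def __init__(self, content: bytes):
        self._item = FakeItem(content)

    def get_item(self) -> FakeItem:
        return self._item


class FakeNewSchemeArchive:
    """Mimics the quirk above: M/<key> paths alias to article bodies."""

    has_new_namespace_scheme = True  # assumed attribute name
    _metadata = {"Title": b"Wikipedia", "Date": b"2026-02-15"}

    def get_metadata_item(self, name: str) -> FakeItem:
        return FakeItem(self._metadata[name])

    def get_entry_by_path(self, path: str) -> FakeEntry:
        # The prefix is stripped and an unrelated article comes back.
        return FakeEntry(b"<html>article body, not metadata</html>")


class FakeOldSchemeArchive:
    """Old-scheme archives still serve metadata through M/ paths."""

    has_new_namespace_scheme = False

    def get_entry_by_path(self, path: str) -> FakeEntry:
        return FakeEntry(b"Wikipedia")


def read_metadata(archive, key: str) -> str:
    """Fetch one metadata value, preferring the metadata API when available."""
    if getattr(archive, "has_new_namespace_scheme", False):
        item = archive.get_metadata_item(key)
    else:
        item = archive.get_entry_by_path(f"M/{key}").get_item()
    return bytes(item.content).decode("utf-8", errors="replace").strip()
```

Calling `read_metadata(FakeNewSchemeArchive(), "Title")` goes through the metadata API and never touches the aliased path, which is the behaviour change callers see in this release.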
Fixed — Medium (post-a9 beta sweep)
- D4: `get section X of Y` natural-language error path dropped the
`closest_match` hint. The structured `get_section` operation
computes a `difflib`-based closest-match (Op5 from a8) but the
natural-language handler reimplemented section lookup against the
headings list and never queried that operation. Compute the same
hint locally so `get section Goegraphy of Berlin` now suggests
"Did you mean Geography?".
- D5: `articles related to <hub>` markdown dropped the
`scan_truncated` signal. The a9 #A5 backend addition surfaced
`scan_truncated` / `scan_total_internal` / `_meta.reason` for hub
articles whose 500-link scan cap fired, but `compact_renderers.render_related`
ignored all of it. Append a footer when the signal is set.
- D6: `suggestions for X` missed the canonical bare-title article.
`suggestions for Photosyn` returned 15 results, none of which was
bare `Photosynthesis`: both libzim's `SuggestionSearcher` and
Xapian rank disambiguator-bearing variants
(`Photosynthesis (song)`, `Photosynthetic_efficiency`) above the
short canonical title. Probe `SuggestionSearcher` for parenthesised
siblings (`foo_(suffix)`) and prepend the un-suffixed root path
when the archive resolves it. The third-pass refactor restructured
this to share a single `SuggestionSearcher.suggest()` round trip
with Strategy 2, so the cold path stays at one title-index probe.
- D8: `walk namespace W` returned zero entries while
`list namespaces` claimed W had two. The two operations
contradicted each other on the same archive. The W-namespace
well-known entries (`mainPage`, `favicon`) live on the
`archive.main_entry` / `has_illustration` API, not the iterable
surface that `walk_namespace_data` falls back to. Mirror the same
probe pair `_add_new_scheme_well_known_namespace` already uses
for the namespace listing. Also fix the `entries 1-0` off-by-one
in the empty-walk header rendering.
- D9: cursor `s.q` field silently ignored, causing wrong-query
pagination. Cursor reused across queries silently paginated the
new query at the old offset. Reject with a `cursor_decode` error
when `s.q` shares no meaningful (≥ 3-char) tokens with the current
query. Falls back to a bidirectional substring check for cursors
whose stored query has only short tokens. Three regression tests
cover the unrelated-query reject, the shortened-query accept, and
the overlapping-tokens accept.
- DD4: `_splice_title_match_into_search` returned `limit + 1`
results. Prepending the canonical synthetic result didn't trim
back to the requested limit; `limit=3` produced 4 results with
header `"showing 1-4"`. Trim to `page_info.limit` and update
`page_info.returned_count` so the header matches the row count.
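
As an illustration of the `difflib`-based hint in D4, a closest-match lookup over a headings list could be sketched as follows. The `closest_section` name, the case-insensitive comparison, and the 0.6 cutoff (difflib's default) are assumptions; the project's actual thresholds are not stated above.

```python
import difflib
from typing import Optional


def closest_section(
    requested: str, headings: list[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the heading most similar to the requested name, if any."""
    # Compare case-insensitively but return the heading as originally written.
    by_lower = {heading.lower(): heading for heading in headings}
    matches = difflib.get_close_matches(
        requested.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


headings = ["History", "Geography", "Economy", "Culture"]
suggestion = closest_section("Goegraphy", headings)
if suggestion is not None:
    print(f"Did you mean {suggestion}?")
```

Returning `None` when nothing clears the cutoff lets the caller fall back to a plain "section not found" error instead of suggesting an unrelated heading.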
Added — Opportunities (post-a9 beta sweep)
- O2: stopword-saturation footer on search. Queries that match
≥ 1 M results (the stopword-only `search for the and a is in to`
saturates at ~5 M) now carry a footer noting that top hits are
ranked by general document importance, not topic relevance, so
the model doesn't trust the "Found N matches" signal as
meaningful.
- O3: truncation hint no longer self-references. The previous
hint suggested `show structure of <path>` as the recovery, which is
silly when the truncated response IS the show-structure (or
table-of-contents) output. Replaced with operation-agnostic
guidance (page via cursor / tighten query / `compact=False`).
- O4: disambiguation page leads preserve their inline list.
`tell me about Martin` previously truncated to `**Martin** may refer to:` with no list, forcing a `show structure` round-trip.
Detect "X may refer to:" leads and skip the H2 cut so the
disambig list stays inline.
- O5: synthesize demotes `List_of_*` / `Index_of_*` /
`Outline_of_*` / `Timeline_of_*` etc. These articles ranked
surprisingly high in synthesize because their bodies match many
query tokens but the actual content is just an enumeration stub.
Demote to the back of `top_n` AFTER title promotion runs (demoting
before regressed the promotion's strong-match guard, which would
treat `Berlin_(disambiguation)` as a match for `Berlin`).
- O6: docstring notes distinguish `show structure` (flat heading
list) from `table of contents` (nested children tree).
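
The O5 demotion amounts to a stable partition applied to the ranked list after promotion has run. A minimal sketch, assuming ranked results are plain path strings; the `demote_enumeration_articles` name is hypothetical, and the prefix tuple covers only the four patterns named above (the changelog's "etc." suggests the real list is longer):

```python
# Only the four prefixes named in O5; the real list may contain more.
DEMOTED_PREFIXES = ("List_of_", "Index_of_", "Outline_of_", "Timeline_of_")


def demote_enumeration_articles(ranked_paths: list[str]) -> list[str]:
    """Move enumeration-style articles to the back, keeping relative order."""
    kept = [p for p in ranked_paths if not p.startswith(DEMOTED_PREFIXES)]
    demoted = [p for p in ranked_paths if p.startswith(DEMOTED_PREFIXES)]
    return kept + demoted
```

Because both halves preserve their incoming order, an article that title promotion moved to rank 1 stays there, and demoted articles remain available at the tail rather than being dropped.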
Fixed — Code-reviewer audit findings (post-first-pass)
A feature-dev:code-reviewer agent audited the first-pass commit and
surfaced three real defects in the original fixes:
- A1 (the second guard on D1, listed under Critical above).
- A2: D6 ran `SuggestionSearcher` twice on the cold path. When
Strategy 1 returned empty, both the canonical probe AND Strategy 2
opened independent `SuggestionSearcher` instances against the same
archive. The first-pass "skip canonical probe when Strategy 1
empty" fix regressed the empty-Strategy-1 case (the canonical
probe IS needed when Xapian misses). Restructured to share a
single `SuggestionSearcher.suggest()` round trip via an optional
`result_paths=` parameter on `_find_canonical_prefix_match`.
- A3 (the token-overlap rewrite of D9, listed under Medium above).
Fixed — Quality gate (SonarCloud third-pass cleanup)
- 5 cognitive-complexity reductions (S3776). Five functions added
by the beta-test commits crossed SonarCloud's complexity-15 limit.
Each was split into self-contained helpers without behaviour
change: `_find_canonical_prefix_match` (53 → split into 5 helpers
for path probing, root extraction, entry resolution, and the two
ranking strategies), `_handle_tell_me_about` (19 → 17 → ~14 over
two passes via `_promote_topic_via_title_index` and
`_fetch_topic_article_body`), `render_related` (17 → ~10 via
`_render_related_link_line` + `_scan_truncated_footer`),
`render_walk_namespace` (19 → ~12 via `_walk_namespace_header`),
and `_get_metadata_entry` (18 → ~13 via `_decode_metadata_content`).
- 4 duplicate-literal extractions (S1192). The "text/" MIME prefix
had three call sites in `zim/content.py`; "File:" / "Category:" /
"Template:" each had three call sites in `zim/search.py`. Extracted
to `_TEXT_MIME_PREFIX` and a `_PSEUDO_NAMESPACE_*` constant trio
with a shared `_is_pseudo_namespace_entry(extended=)` helper.
- 1 ReDoS hotspot (S5852). The O4 disambig-lead-detection regex
`\bmay\s+(?:also\s+)?refer\s+to\s*:?\s*$` was flagged for nested
unbounded quantifiers. Not actually catastrophic on Python's `re`
engine, but replaced anyway with a normalised
`str.endswith(("may refer to", "may also refer to"))` check: same
behaviour, no regex engine, and the phrase list is easier to
extend.
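
A regex-free check along the lines of the S5852 fix could be sketched like this. The changelog only says the input is "normalised", so the specific steps here (lower-casing, whitespace collapsing, stripping a trailing colon) and the `is_disambiguation_lead` name are assumptions.

```python
DISAMBIGUATION_SUFFIXES = ("may refer to", "may also refer to")


def is_disambiguation_lead(lead: str) -> bool:
    """Detect leads such as 'X may refer to:' without using a regex."""
    # Assumed normalisation: lower-case and collapse whitespace runs,
    # then drop the optional trailing colon and any space before it.
    normalised = " ".join(lead.lower().split()).rstrip(": ")
    return normalised.endswith(DISAMBIGUATION_SUFFIXES)
```

`str.endswith` accepts a tuple of suffixes, so supporting another phrasing only means adding one string to `DISAMBIGUATION_SUFFIXES`.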
Wire-format / surface changes (alpha-line clean breaks)
- Infobox extraction labels for trailing rows change. Berlin /
Tokyo terminal rows that previously emitted as `GDP — Time zone`
now emit as `Time zone`. Callers parsing the bullet-prefix
structure see different label strings.
- `metadata for <file>` now returns short metadata strings
instead of 172 KB article-body excerpts. Wire-format compatible
(same keys); content is the actual ZIM metadata (`Title="Wikipedia"`,
`Date="2026-02-15"`, etc.).
- `get article M/<key>` now returns the ZIM metadata entry
instead of the silently-aliased C-namespace article body.
Wire-format compatible (same response envelope); content differs.
- `_splice_title_match_into_search` trims to the requested
limit. Callers receiving `limit + 1` results will now get exactly
`limit`.
- Cursor with mismatched `s.q` now errors. Callers that
previously got silent wrong-query results now receive a
`cursor_decode` `ToolErrorPayload`.
- Synthesize ranking demotes list articles. Citation order for a
query like `Quantum mechanics` no longer includes
`List_of_textbooks_…` in the top half.
- Truncation hint footer text changed (O3). Callers parsing the
trailing prose see different wording.
Investigated and deferred
- Pseudo-namespace pollution in default search results
(`Portal:` / `User:` / `Help:`). Filtering pseudo-namespace
articles from default search is too opinionated; some callers
legitimately want them. The canonical-promotion already pushes
the real article to rank 1 in the common case (live-verified:
`search for biology` → `Biology` at #1 via `(canonical title match)`).
Revisit if the canonical-promotion fallback proves insufficient.
Breaking Changes
- Cursor validation: mismatched `s.q` now raises a `cursor_decode` error instead of silent wrong‑query pagination.
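
To make the cursor rule concrete, here is a hedged sketch of the token-overlap check with its short-token fallback. The 3-character threshold and the bidirectional substring fallback come from the changelog; the function names, the `\w+` tokenisation, and the choice to reject empty queries are assumptions.

```python
import re

MIN_TOKEN_LEN = 3  # "meaningful" tokens are at least 3 characters long


def _meaningful_tokens(query: str) -> set[str]:
    """Lower-cased word tokens that meet the minimum length."""
    return {t for t in re.findall(r"\w+", query.lower()) if len(t) >= MIN_TOKEN_LEN}


def cursor_matches_query(stored_query: str, current_query: str) -> bool:
    """Decide whether a cursor's stored query is compatible with the current one."""
    stored_tokens = _meaningful_tokens(stored_query)
    if stored_tokens:
        return bool(stored_tokens & _meaningful_tokens(current_query))

    # Stored query has only short tokens: fall back to a substring check
    # in both directions. Empty queries are rejected (an assumption).
    a = stored_query.strip().lower()
    b = current_query.strip().lower()
    return bool(a) and bool(b) and (a in b or b in a)
```

A caller would raise its `cursor_decode` error whenever this returns `False`; clients that hit that error should restart pagination for the new query instead of retrying with the old cursor.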
About cameronrye/openzim-mcp
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Earlier breaking changes
- v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
- v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
- v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
- v2.0.0a9 HTTP rate-limiter client_id now derived from token or IP; defaults to "default" fallback.