This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: The `content_offset` parameter is now a top-level option in `zim_query`, validated as `>= 0` and threaded through the query options.
Why it matters: Update any calls to `zim_query` that page through article bodies to pass the new `content_offset` field. The top-level `offset` only paginates search and browse results, so without `content_offset` every request returns the first page of the article and the rest stays unreachable.
Summary
AI summary: `content_offset` is now exposed in `zim_query`, fixing unreachable article-body paging; the release also fixes silent text concatenation in infoboxes.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Security | Medium | Cursor with missing/invalid `s` envelope now errors (`cursor_decode`). Source: llm_adapter@2026-05-21. Confidence: high. | - |
| Breaking | Medium | Exposed `content_offset` as top-level `zim_query` parameter, validated `>= 0`, threaded through options. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Medium | `zim_query` accepts top-level `content_offset` parameter for article-body paging. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Medium | `walk namespace M` and `metadata for <file>` agree on metadata via new-scheme archive enumeration. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Medium | Synthesize ranking demotes `Lists_of_*` and related suffixes in disambiguation results. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Medium | Every markdown response now carries trailing `<!-- intent=... cert=... -->` HTML comment. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Medium | Intent-parser chained-query guard returns guidance instead of silently dispatching rightmost intent. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Feature | Low | Synthesized ranking demotes titles matching `Lists_of_*` and suffixes like `_discography`, `_filmography`, etc., in disambiguation results. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Performance | Medium | Optimized title-index probing: now runs once per query by gating canonical probe behind match count checks, reducing duplicate lookups. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | `get section` honors `max_content_length` and appends a truncation footer. Source: llm_adapter@2026-05-21. Confidence: high. | - |
| Bugfix | Medium | Infobox cells render with intra-cell whitespace at block-tag boundaries only. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Bugfix | Medium | Infobox text extraction now inserts whitespace only at block-tag boundaries, fixing silent concatenation issues. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | `walk namespace` and `metadata for` now agree on metadata counts by enumerating new-scheme archive keys. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | Auto-pick canonical when strong matches are exactly `Foo` and `Foo (disambiguation)`, surfacing disambig as a footer hint. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | Prefer exact-topic canonical before disambig check, ensuring queries like `Apollo 11` return the correct article. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | `show structure of <multi-word title>` normalizes titles via `_resolve_natural_language_path` helper. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | Intent parser strips politeness tails (`to me`, `for me`, `please`) from phrases like `explain X to me`. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | `walk namespace` now uses per-namespace entry count when available, falling back only if unknown. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Fixed ReDoS hotspot in `_search_query_tail` regex by splitting into three single-token patterns with plain Python slicing. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Orphan bullet rows now inherit virtual parent context from preceding KV row when no active section exists. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Trailing whitespace in `tell me about` now yields a clear "Topic Required" error instead of ambiguous article matches. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | `search for` with no terms now returns "Search Terms Required" error. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Reject non-positive `limit` and negative `offset` in pagination, preventing infinite loops. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Wrap backend "Cannot find entry" errors for `articles related to <nonexistent>` with structured guidance message. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Refactor | Medium | CodeQL fix: lifted assignment of `full_len` above conditional branch to avoid uninitialized warnings. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Refactor | Low | Moved assignment of `full_len` above conditional to satisfy CodeQL uninitialized variable warning. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Refactor | Low | Unified metadata aggregation so `metadata for` and `walk namespace` report consistent totals across archive schemes. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
Full changelog
Three-pass beta-test of v2.0.0a10 against a 118 GB Wikipedia ZIM
(Feb 2026 snapshot) via the simple-mode `zim_query` MCP surface. The
first pass surfaced 22 defects + 7 opportunities from live use; the
second and third passes were self-audits of the prior commit, each
finding fewer issues than the last (22 → 6 → 3). Every fix here was
first observed live; the existing 1425-test suite covered none of
them.

The single most user-visible regression is silent text concatenation
inside Wikipedia infoboxes: every city / country article had at
least one corrupted number (`5th in Europe1st in Germany`,
`Berliner(s) (English)Berliner (m)`, `0.967very high`,
`TokyoTamaNorthern Izu Islands`). A small LLM reading this would
emit those as single tokens. Plus one critical: the a10 DD2 fix
threaded `content_offset` through the article-paging handler, but
the parameter was never exposed on the MCP tool, so the truncation
footer told callers to "pass content_offset=N" via a channel that
didn't exist.

Net: 1463 tests pass (+38 over v2.0.0a10), 50 skipped, 38
deselected. black / isort / flake8 / mypy / CodeQL /
SonarCloud all clean.
Fixed — Critical (post-a10 beta sweep)
- C1: `content_offset` unreachable from `zim_query`. A10's DD2
  threaded `options["content_offset"]` through
  `_fetch_topic_article_body`, but the `zim_query` MCP signature
  never exposed the parameter. The top-level `offset` arg routes to
  `options["offset"]` (search / browse pagination), not
  `options["content_offset"]` (article-body paging). Result: every
  `tell me about Photosynthesis` truncation footer pointed to a
  paging channel that returned the same page 1. Exposed
  `content_offset` as a top-level `zim_query` parameter, validated
  `>= 0`, threaded through `options`. Truncation footers on
  `truncate_content` now report the correct next-page offset
  (Opp4 implemented inline).
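
The split between the two paging knobs described in C1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `build_query_options` is a hypothetical helper, and only the option keys `offset` and `content_offset` and the `>= 0` check are taken from the notes above.

```python
from typing import Any, Dict, Optional


def build_query_options(
    offset: Optional[int] = None,
    content_offset: Optional[int] = None,
) -> Dict[str, Any]:
    """Route each paging knob to its own options key (hypothetical helper)."""
    options: Dict[str, Any] = {}
    if offset is not None:
        # Search / browse pagination.
        options["offset"] = offset
    if content_offset is not None:
        # Article-body paging; negative values are rejected at the boundary.
        if content_offset < 0:
            raise ValueError("content_offset must be >= 0")
        options["content_offset"] = content_offset
    return options
```

Keeping the two keys separate is what lets a truncation footer's suggested offset reach the article-body handler instead of the search pager.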
Fixed — High (post-a10 beta sweep)
- H2: `tell me about Berlin` non-determinism. `Berlin` and
  `Berlin (disambiguation)` both strong-matched by the candidate-
  extends-topic rule, so the disambig set fired 2+ → fork between
  the city article and the disambig page. Auto-pick the canonical
  when the strong-match set is exactly `Foo` + `Foo (disambiguation)`;
  the disambig twin is surfaced as a footer hint on the returned
  body. Genuine multi-meaning topics (Apollo / Mercury / Java) still
  fork as before (Opp1 implemented inline).
- H3: disambig hides the canonical it should be helping pick.
  `tell me about Apollo 11` forked between `Apollo_11_anniversaries`,
  `Apollo_11_lunar_sample_display`, `Apollo_11_goodwill_messages`,
  none of which is the canonical `Apollo_11`. Probe the title index
  for the exact-topic canonical BEFORE the disambig check; prepend
  it to the strong-match list when absent.
- H4: infobox text-extraction silently concatenates adjacent
  block-level children. `td.get_text()` joined `<br>`, `<li>`,
  `<span>` runs without whitespace. Three-pass evolution:
  first-pass `separator=" "` mangled inline span groups
  (`3,913,644` → `3 , 913 , 644`); second-pass `_join_cell_text`
  helper inserts whitespace at block-tag boundaries only and
  concatenates inline tags directly; third-pass filters
  `Comment` instances (a `NavigableString` subclass) so invisible
  formatnum/microformat comments stop leaking as visible text.
- H5: intent parser preempts on later-occurring keywords.
  `tell me about berlin then list namespaces` silently ran only
  `list namespaces` (highest-confidence intent wins). New
  `_chained_intent_guidance` splits on `then` / `;` / `and then`
  connectors; if both halves start with a recognised operation
  prefix, return a "split into separate calls" guidance message.
- H6: orphan bullet rows lose parent context. Berlin's
  `**• Summer (DST):** UTC+02:00` rendered without a parent
  because `Time zone:` was a regular KV (not an `infobox-header`).
  When a KV row's label starts with a bullet char AND there's no
  active section, treat the previous KV row's label as a virtual
  parent (applied for that row only; it doesn't persist into the
  next non-bullet row).
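
The H4 rule (whitespace at block-tag boundaries only, inline tags concatenated directly, comments dropped) can be illustrated with a small sketch. Note the assumptions: the changelog's `_join_cell_text` works on parsed `td` nodes, whereas this stand-in uses the standard library's `html.parser` on a raw HTML string, and the `BLOCK_TAGS` set is an illustrative guess rather than the project's actual list.

```python
from html.parser import HTMLParser

# Assumption: an illustrative set of block-level tags, not the project's list.
BLOCK_TAGS = {"br", "li", "p", "div", "ul", "ol"}


class _CellTextJoiner(HTMLParser):
    """Collect cell text, adding a space only where a block tag starts or ends."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline tags such as <span> add no separator, so their text runs
        # are concatenated directly.
        self.parts.append(data)

    # handle_comment is left at its default no-op, so comment bodies
    # never reach the output.


def join_cell_text(cell_html: str) -> str:
    """Return the visible text of one infobox cell (hypothetical helper)."""
    parser = _CellTextJoiner()
    parser.feed(cell_html)
    parser.close()
    # Collapse runs of whitespace introduced by adjacent boundaries.
    return " ".join("".join(parser.parts).split())
```

With this rule, `5th in Europe<br>1st in Germany` keeps its word break while `<span>3</span>,<span>913</span>,<span>644</span>` stays a single number.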
Fixed — Medium (post-a10 beta sweep)
- M7: `show structure of <multi-word title>` doesn't normalize.
  D2 in a10 added `find_title_match(min_score=0.8)` to
  `_handle_related`; M7 extends the same pattern via the new
  `_resolve_natural_language_path` helper applied to `structure`,
  `table of contents`, `links`, `summary`, `get section`, and
  `get article` (when the path contains spaces and no namespace
  separator; direct-path lookups stay zero-cost).
- M8: `get section` ignores `max_content_length`. Section text
  was returned in full regardless of the cap. Honor the cap and
  append a one-line truncation footer reporting the original
  length.
- M9: malformed cursor silent no-op. A base64+JSON token that
  decodes but lacks the expected `s` envelope (or whose `s.o` is
  missing/invalid) used to silently degrade to page 1. The contract
  now mirrors the totally-garbled-token case: structured
  `cursor_decode` error.
- M10: trailing-whitespace `tell me about` produces an empty
  topic. A query of `tell me about` with a trailing space fell
  through to a topic of `"tell me about"` and disambiguated to
  articles titled "Tell Me About Tomorrow". The
  `_extract_tell_me_about` regex now uses `\b+(.*?)` so empty
  topics resolve to empty strings; simple_tools rejects with a
  clear "Topic Required" error.
- M11: `explain X to me` parses incorrectly. "explain Berlin
  to me" extracted topic `"Berlin to me"` and returned a memorial
  article. Topic extractor strips `to me` / `for me` / `please`
  politeness tails, loop-until-idempotent so wrapping cases
  (`DNA for me please`) collapse cleanly.
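
The loop-until-idempotent stripping in M11 might look like the sketch below. The function name and the plain-string approach are assumptions for illustration; only the three tails and the "repeat until nothing changes" behaviour come from the notes above.

```python
# The three politeness tails named in the changelog.
_POLITENESS_TAILS = ("to me", "for me", "please")


def strip_politeness_tails(topic: str) -> str:
    """Remove trailing politeness phrases until a pass changes nothing."""
    topic = topic.strip()
    changed = True
    while changed:
        changed = False
        for tail in _POLITENESS_TAILS:
            # Require a preceding space so a topic that *is* the tail
            # (for example "please") is left alone.
            if topic.lower().endswith(" " + tail):
                topic = topic[: -len(tail)].rstrip()
                changed = True
    return topic
```

A single pass would leave `DNA for me` behind after removing `please`; repeating until the string is stable handles stacked tails in any order.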
Fixed — Low (post-a10 beta sweep)
- L12: trailing-whitespace `search for` with no terms. Used to
  fall through to searching for the literal word "for". Validate
  the extracted tail before dispatch; surface "Search Terms
  Required".
- L13: `limit=0` nonsensical pagination.
  `Showing 1-0 of N — pass offset=0 for the next page` looped on
  itself. Reject non-positive `limit` and negative `offset` at the
  MCP boundary.
- L15: `articles related to <nonexistent>` raw error. Wrap the
  backend's "Cannot find entry" with a structured guidance message
  pointing to `suggestions for` / `find article titled` /
  `search for`. (Second-pass F3 added the same hint trio to the
  outbound `_error` branch in `render_related` that the live case
  actually surfaces through.)
- L16: walk namespace denominator misleading. `walk namespace M`
  with 13 entries used to render
  `of ~27,199,904 archive-wide entries`. Prefer a per-namespace
  denominator when available; fall through to the archive total
  only when no per-namespace count is known. Second-pass F4 plumbed
  `namespace_entry_count` through `_build_walk_result` so the new
  `of N in namespace X` shape actually renders.
- L17 / L18: list namespaces total mismatch, metadata aggregator
  underreports. Header now annotates "X archive entries
  (per-namespace sum: Y)" when the two differ.
  `_extract_zim_metadata` enumerates `archive.metadata_keys` on
  new-scheme archives (filtering `Illustration_*` binaries) so
  `metadata for` and `walk namespace M` agree on what counts as
  metadata. Second-pass F6 replaced the first-pass hardcoded
  probe-list extension with the enumeration so future archive
  additions don't reopen the disagreement.
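
To see why L13's `limit=0` looped, note that the next-page offset is the current offset plus the limit, so a zero limit points the footer back at the same page. A minimal sketch of the boundary check follows; the function names and error messages are hypothetical, and only the rule (reject non-positive `limit` and negative `offset`) comes from the notes above.

```python
from typing import Optional


def validate_pagination(limit: int, offset: int) -> None:
    """Reject values that cannot produce forward progress (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    if offset < 0:
        raise ValueError("offset must be >= 0")


def next_page_offset(limit: int, offset: int, total: int) -> Optional[int]:
    """Return the offset of the next page, or None when this is the last page."""
    validate_pagination(limit, offset)
    candidate = offset + limit  # strictly greater than offset once limit > 0
    return candidate if candidate < total else None
```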
Added — Opportunities (post-a10 beta sweep)
- Opp1: auto-fall-through twin. Implemented inline with H2.
- Opp2: expanded demote patterns. `_LIST_ARTICLE_PREFIX_RE`
  picks up `Lists_of_*` (plural); two new patterns demote
  `Listed_*` stems and `*_discography` / `*_filmography` /
  `*_videography` / `*_bibliography` / `*_albums` / `*_singles`
  suffixes. `tell me about cats` returning a Rephlex Records
  discography at rank 2 is the canonical failure this fixes.
- Opp3: synthesize relevance threshold. New
  `_drop_low_relevance_tail` cuts hits whose Xapian score is below
  25% of the top hit's. Only applied in `xapian_score` fallback
  (single-archive); multi-archive RRF keeps all hits because RRF
  normalizes scores. Always keeps at least one hit.
- Opp4: `content_offset` in truncation footers. Implemented
  inline with C1: `truncate_content` accepts `current_offset` so
  paginated reads compute the next offset relative to where the
  slice started in the original article. Third-pass F2 added a
  `paginatable: bool = True` kwarg so the three main-page call sites
  switch to operation-accurate guidance (the main-page surface
  doesn't accept `content_offset`).
- Opp5: canonical-exists hint in disambig auto-pick. When the
  H2 auto-fall-through fires, append the footer
  "_Note: this topic also has a disambiguation page — see ``get article <path>`` for alternate meanings._"
  so the disambiguation stays discoverable.
- Opp6: intent telemetry on all responses. Every markdown
  response now carries a trailing `<!-- intent=foo cert=0.85 -->`
  HTML comment. Invisible to humans (HTML comments aren't rendered)
  but visible in the token stream so calling LLMs can branch on the
  parser's classification certainty without parsing the body.
- Opp7: link-count rank on related articles. When the related-
  articles backend supplies a `mention_count`, surface it inline as
  `- **Title** (path) · N×` so a small LLM can rank which related
  article is most central to the source. (Second-pass H1 fixed the
  first-pass typo that read the wrong field name: `link_count`
  vs `mention_count`.)
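
The Opp3 cut-off can be sketched in a few lines. This is an illustration under assumptions, not the project's `_drop_low_relevance_tail`: hits are modelled here as `(title, score)` tuples, and only the 25% ratio and the "always keep at least one hit" guarantee come from the notes above.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (title, relevance score); an assumed shape


def drop_low_relevance_tail(hits: List[Hit], ratio: float = 0.25) -> List[Hit]:
    """Keep hits scoring at least `ratio` times the top hit's score."""
    if not hits:
        return []
    top_hit = max(hits, key=lambda hit: hit[1])
    cutoff = ratio * top_hit[1]
    kept = [hit for hit in hits if hit[1] >= cutoff]
    # Fall back to the top hit so the result is never empty.
    return kept or [top_hit]
```

A relative threshold like this only makes sense when scores are comparable across hits, which matches the note that it is skipped for multi-archive RRF results.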
Fixed — Second-pass self-audit findings
A self-audit of the first-pass commit surfaced six defects in the
fixes themselves:
- D1 second-pass (folded into H4 above). `get_text(separator=" ")`
  mangled inline-span numeric groups.
- F3 second-pass (folded into L15 above). Wrapped the wrong
  error path: backend serialises rather than re-raising.
- F4 second-pass (folded into L16 above). `namespace_entry_count`
  was renderer-only and never plumbed through the data payload.
- F6 second-pass (folded into L17/L18 above). Hardcoded probe-
  list extension still drifts; replaced with `metadata_keys`
  enumeration on new-scheme archives.
- H1 second-pass (folded into Opp7 above). First-pass read
  `link_count` from the related-articles result; the backend stores
  the frequency-rank signal as `mention_count`.
- C2 perf: title-index probe ran twice on the weak-top-hit path.
  Gated the H3 canonical-probe behind `len(strong_matches) >= 2`
  (the only condition under which the disambig page would otherwise
  render). Strong-top-hit and weak-then-promoted paths skip the
  second probe entirely. Third-pass extended the gate to also fire
  when the single strong match is itself the disambig twin.
Fixed — Third-pass self-audit findings
A second self-audit found three more defects in the second-pass
commit:
- D1 third-pass (folded into H4 above). `Comment` is a
  `NavigableString` subclass; second-pass `_join_cell_text` caught
  comments and rendered their bodies as visible text.
- C2 third-pass. Lone disambig-twin search case bypassed the
  second-pass `>= 2` gate; extended the gate to also fire when the
  one strong match is `Foo (disambiguation)` itself.
- F2 third-pass (folded into Opp4 above). The second-pass
  truncation hint pointed at a `content_offset` parameter the
  main-page operation doesn't accept; added a `paginatable=False`
  kwarg on the three main-page call sites and routed them to
  operation-accurate guidance.
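
The final shape of the C2 gate, after the third-pass extension, can be written as a small predicate. This is a sketch under assumptions: the function name is hypothetical, and matches are modelled as plain title strings using the `Foo (disambiguation)` spelling from the notes above.

```python
from typing import Sequence

DISAMBIG_SUFFIX = " (disambiguation)"


def should_probe_canonical(strong_matches: Sequence[str]) -> bool:
    """Decide whether the extra title-index probe is worth running."""
    if len(strong_matches) >= 2:
        # Second-pass gate: the disambig page would otherwise render.
        return True
    # Third-pass extension: a lone match that is itself the disambig twin.
    return len(strong_matches) == 1 and strong_matches[0].endswith(DISAMBIG_SUFFIX)
```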
Fixed — Quality gate (PR CI cleanup)
- CodeQL: `full_len` may be uninitialized. In the M8 truncation-
  footer code path, `full_len = len(text)` was assigned only inside
  the truncation `if` block but referenced in a different (correlated)
  `if truncated:` block. The correlation was opaque to CodeQL.
  Lifted the assignment above the branch so the variable is always
  defined. No behaviour change.
- SonarCloud python:S5852 (ReDoS hotspot). The
  `_search_query_tail` regex had adjacent `\s*` quantifiers that
  the heuristic flagged as polynomial-backtracking. Split into
  three single-token regexes (verb, optional `up` for `look up`,
  optional `for` connector) with plain-Python tail slicing between
  matches. Each individual pattern has at most one whitespace
  quantifier so the heuristic has nothing to flag. Behaviour
  verified identical across all 1463 tests.
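
The regex split described for S5852 can be sketched as three small patterns with slicing in between. The verb list, the pattern names, and the return convention are assumptions for illustration; the structure (verb, optional `up`, optional `for`, at most one whitespace quantifier per pattern) follows the notes above.

```python
import re
from typing import Optional

# Assumption: an illustrative verb set; the project's actual verbs may differ.
_VERB_RE = re.compile(r"(search|find|look)\b", re.IGNORECASE)
_UP_RE = re.compile(r"\s*up\b", re.IGNORECASE)
_FOR_RE = re.compile(r"\s*for\b", re.IGNORECASE)


def search_query_tail(query: str) -> Optional[str]:
    """Return the search terms after the verb phrase, or None if no verb matches."""
    text = query.strip()
    verb = _VERB_RE.match(text)
    if verb is None:
        return None
    rest = text[verb.end():]  # plain slicing instead of one large regex
    if verb.group(1).lower() == "look":
        up = _UP_RE.match(rest)
        if up is None:
            return None
        rest = rest[up.end():]
    connector = _FOR_RE.match(rest)
    if connector is not None:
        rest = rest[connector.end():]
    return rest.strip()
```

An empty return value corresponds to the L12 case, where the caller can surface "Search Terms Required" instead of searching for the literal connector.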
Wire-format / surface changes (alpha-line clean breaks)
- `zim_query` accepts a top-level `content_offset` parameter.
  Existing callers passing only the previous parameters are
  unaffected; new callers paginating long article bodies should
  use `content_offset` instead of the legacy `offset` (the latter
  remains the search / browse pagination knob).
- Every markdown response now carries a trailing
  `<!-- intent=... cert=... -->` HTML comment (Opp6). Invisible
  to humans; callers that token-count or post-process the trailing
  bytes will see two extra tokens per response.
- Intent-parser chained-query guard returns guidance instead of
  silently dispatching the rightmost intent. Callers sending
  `X then Y` queries that previously got Y's result silently now
  receive a structured "split into separate calls" message.
- `get section` honors `max_content_length` and appends a
  truncation footer. Callers that previously got full section
  bodies now receive at most `max_content_length` bytes plus a
  one-line footer reporting the original length.
- Cursor with missing/invalid `s` envelope now errors
  (`cursor_decode`). Callers that previously got silent page-1
  fall-through now receive a structured error.
- Infobox cells render with intra-cell whitespace at block-tag
  boundaries only. Most callers see strictly better text (no
  `5th in Europe1st in Germany`-style concatenation); inline
  numeric / unit / coordinate microformats remain intact.
- Synthesize ranking demotes `Lists_of_*` and `*_discography` /
  `*_filmography` / `*_albums` / `*_singles` suffixes. Citation
  order for queries like `cats` no longer surfaces a Rephlex
  Records discography in the top half.
- `walk namespace M` and `metadata for <file>` agree on what
  counts as metadata (new-scheme archives enumerate
  `metadata_keys` directly). Old-scheme archives keep the
  hardcoded probe list as a fallback.
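
Callers that want to branch on the trailing intent comment could read it with something like the sketch below. It assumes the comment always has the `<!-- intent=NAME cert=NUMBER -->` shape shown in the changelog example and sits at the very end of the response; the function name and regex are illustrative, not part of the server's API.

```python
import re
from typing import Optional, Tuple

# Assumption: the trailer matches the shape of `<!-- intent=foo cert=0.85 -->`.
_INTENT_TRAILER_RE = re.compile(
    r"<!--\s*intent=(\S+)\s+cert=([0-9]*\.?[0-9]+)\s*-->\s*\Z"
)


def parse_intent_trailer(markdown: str) -> Optional[Tuple[str, float]]:
    """Return (intent, certainty) from a trailing HTML comment, if present."""
    match = _INTENT_TRAILER_RE.search(markdown)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

Returning `None` when the trailer is absent keeps the caller compatible with responses from servers older than this release.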
Breaking Changes
- `content_offset` previously unreachable from `zim_query`; now added as a top‑level parameter and validated.
About cameronrye/openzim-mcp
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Related context
Earlier breaking changes
- v2.0.0a15 `_attribute_sections` falls back to the first section when no section brackets the located passage.
- v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
- v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
- v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.
- v2.0.0a10 Infobox extraction now emits trailing rows without the preceding "GDP —" label, changing bullet-label strings.