This release fixes issues for SREs watching stability and regressions.
✓ No known CVEs patched in this version
Summary
AI summary: Updates Deferred, pass-2, and P1-D1 across a mixed release.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Added title-spans-connector suppression for prefixed topics like `notable people from Big Rapids`. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
| Feature | Medium | Switched `_TAIL_TOKEN_RE` to Unicode-aware `[^\W_]+` to preserve non-Latin characters in topics. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
| Dependency | Medium | Updated regular expression patterns to use Unicode-aware `\w` minus underscore for token boundaries. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: low. | - |
| Bugfix | Medium | `_soft_connector_footer` false-fires on titles spanning connector with subject-attribute prefix. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
| Bugfix | Medium | Non-Latin topic strings resolved to wrong articles at cert=0.85 due to ASCII-only tokenisation. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
| Bugfix | Medium | `walk namespace M` cursor round-trip false-failed with missing archive-identity field. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
| Refactor | Medium | Stashed archive-identity (`ai`) from cursor state into options and included it in rebuilt cursor_state for `walk_namespace`. Source: granite4.1:8b-q6_K@2026-05-19. Confidence: high. | - |
Full changelog
Pass 1 (live-MCP, against the freshly-shipped v2.0.0a17 build on
wikipedia_en_all_maxi_2026-02.zim) surfaced three user-facing
defects. Pass 2 source-level self-audit (sibling grep for the
landed fix shapes + edge-case unit tests) found zero new defects.
A live-MCP pass-3 reprobe is deferred until this release deploys — the
MCP server in the sweep environment couldn't be restarted mid-session
to load the new build. The recent post-a16 methodology refinement
(live-MCP catches a defect class unit tests structurally cannot)
should still apply for that follow-up pass.
Fixed
- `_soft_connector_footer` false-fires on titles that
structurally span the connector (P1-D1). Queries like
`notable people from Big Rapids, Michigan` resolved correctly to
the `Big_Rapids,_Michigan` article (a single entity whose title
literally contains the comma) but the footer claimed the article
for `Michigan` was returned and told the caller to query
separately for `notable people from Big Rapids`. Same shape for
`musicians from Romeo and Juliet` → "for Juliet". The existing
`left_in == right_in` suppression only catches the
both-halves-in-title case; a subject-attribute prefix
(`notable people from`, `musicians from`) leaves the left half
longer than the title and defeats it. Fix adds an earlier
title-spans-connector suppression: when `top_title` matches the
same connector regex as the topic, the connector is structural
to the title and the footer is suppressed. The docstring already
named `Vienna, Austria` as a case this should fire for; the new
guard makes it work in the prefixed-topic shape too.
- Non-Latin topic strings resolved to wrong articles at
cert=0.85 (P1-D2, critical). `tell me about München` returned
the `M` letter article; `tell me about Zürich` returned the
`Rich` disambig; `tell me about Köln` returned the `LN`
abbreviation. Root cause: `_TAIL_TOKEN_RE = [a-z0-9]+` in
`openzim_mcp/title_promotion.py` stripped non-ASCII characters,
so `iter_query_tails("München")` yielded `["m", "nchen"]` and
`iter_query_windows` then yielded `"m"`, which
`find_title_match("m")` cleanly resolved to the `M` letter
article at score 1.0. The backend `find_entry_by_title_data`
natively handles Unicode topics (`find article titled München`
resolves to Munich at score 1.00); only the tokenisation layer
destroyed the topic before the backend saw it. Fix: switch
`_TAIL_TOKEN_RE` to `[^\W_]+` (Unicode-aware `\w` minus
underscore, so underscore still acts as a token boundary for
path-form input like `Big_Rapids,_Michigan`).
- `walk namespace M` cursor round-trip false-failed with
"missing archive-identity field" (P1-D3). Paging `walk_namespace`
by passing back the `next_cursor` it just emitted produced
`Error: Cursor for 'walk_namespace' missing archive-identity field. Re-issue the request without a cursor.` even though the
cursor (decoded) carried `{"v":2,"t":"walk_namespace","s": {"o":3,"l":3,"ns":"M","ai":"e048666a9e92"}}`. The simple-tools
cursor dispatcher decoded the cursor and stashed only
`state["o"]` (as `options["offset"]`) and `state["ns"]` (as
`options["_cursor_ns"]`), dropping `ai`.
`_handle_walk_namespace` then rebuilt cursor_state as
`{scan_at, l}` without `ai`; downstream `walk_namespace_data`
called `verify_archive_identity` unconditionally and raised
"missing" because the field was gone. Fix: stash `state["ai"]`
(and re-stash `state["ns"]`) into options at decode time;
`_handle_walk_namespace` includes them in the rebuilt
cursor_state when present. The data-layer guard now has the real
`ai` to compare against and properly distinguishes "missing"
from "cross-archive mismatch". `browse_namespace` didn't surface
the same failure because its handler passes `offset` directly
(no cursor_state envelope) and the browse data layer only
verifies archive identity when an explicit
`cursor_archive_identity` kwarg is passed, which the
simple-tools handler doesn't pass.
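The P1-D3 round-trip described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper names `stash_cursor_fields`, `rebuild_cursor_state`, and `check_archive_identity` are hypothetical stand-ins for the dispatcher, `_handle_walk_namespace`, and `verify_archive_identity`; the error wording and the use of `_cursor_ns` as the "a cursor was supplied" marker are assumptions. Only the option keys, the `{scan_at, l}` envelope, and the decoded cursor come from the changelog.

```python
from typing import Optional


def stash_cursor_fields(state: dict, options: dict) -> None:
    """Dispatcher step (hypothetical helper): copy cursor fields into options."""
    options["offset"] = state["o"]
    options["_cursor_ns"] = state["ns"]
    # The fix: carry the archive identity forward instead of dropping it.
    if "ai" in state:
        options["_cursor_ai"] = state["ai"]


def rebuild_cursor_state(options: dict, limit: int) -> Optional[dict]:
    """Handler step (hypothetical helper): rebuild the data-layer envelope."""
    if "_cursor_ns" not in options:
        return None  # assumed marker for "no cursor was supplied"
    rebuilt = {"scan_at": options["offset"], "l": limit}
    # Include ns / ai only when the dispatcher stashed them.
    for key, option_name in (("ns", "_cursor_ns"), ("ai", "_cursor_ai")):
        if option_name in options:
            rebuilt[key] = options[option_name]
    return rebuilt


def check_archive_identity(cursor_state: dict, current_ai: str) -> None:
    """Stand-in for the data-layer guard; the error wording is illustrative."""
    if "ai" not in cursor_state:
        raise ValueError("missing archive-identity field")
    if cursor_state["ai"] != current_ai:
        raise ValueError("cross-archive mismatch")


# Decoded cursor copied from the changelog entry.
decoded = {"v": 2, "t": "walk_namespace",
           "s": {"o": 3, "l": 3, "ns": "M", "ai": "e048666a9e92"}}
options: dict = {}
stash_cursor_fields(decoded["s"], options)
cursor_state = rebuild_cursor_state(options, limit=decoded["s"]["l"])
check_archive_identity(cursor_state, "e048666a9e92")  # no error raised
```

Dropping the `_cursor_ai` stash in the first helper reproduces the pre-fix shape: the rebuilt envelope lacks `ai`, and the guard reports "missing" even though the original cursor carried the field.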
Tests
21 regression tests in tests/test_post_a17_beta_fixes.py:
- P1-D1 (6): comma title with subject-attribute prefix
suppresses; `and` title with subject-attribute prefix
suppresses; genuine two-entity query still emits the footer;
pre-fix both-halves-in-title still suppresses; slash-connector
title-spans suppression (pass-2); no-connector-in-title still
fires (pass-2).
- P1-D2 (11): München / Zürich / Köln tokenise as single
Unicode tokens; multi-word Unicode topic preserved; ASCII path
unchanged (regression guard for the original `big rapids michigan` example); underscore boundary preserved; digits
preserved; empty topic (pass-2); mixed Latin + non-Latin
(pass-2); single non-Latin char (pass-2); punctuation as
boundary (pass-2).
- P1-D3 (4): end-to-end cursor round-trip carries `ai`;
dispatcher stashes `_cursor_ai` into options; no-cursor case
preserved (cursor_state stays None); cross-archive `ai`
mismatch propagated correctly (pass-2; preserving `ai` must
not weaken the cross-archive enforcement guard).
Full test suite: 1814 passed, 50 skipped.
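The P1-D1 cases listed above can be approximated with a small stand-alone sketch. Several parts are assumptions rather than the project's code: the connector pattern `CONNECTOR_RE` (the real regex is not shown in the changelog), the function name `should_emit_footer`, the `title_guard` switch used to show pre-fix behaviour, and the use of substring containment for `left_in` / `right_in`.

```python
import re

# Assumed connector pattern (comma, the word "and", or a slash); the real
# regex is not shown in the changelog.
CONNECTOR_RE = re.compile(r"\s*,\s*|\s+and\s+|\s*/\s*", re.IGNORECASE)


def should_emit_footer(topic: str, top_title: str, title_guard: bool = True) -> bool:
    """Return True when the 'query the other half separately' footer fires."""
    title = top_title.replace("_", " ").lower()
    match = CONNECTOR_RE.search(topic)
    if match is None:
        return False  # topic has no connector, nothing to split
    # New guard: the connector is structural to the title itself.
    if title_guard and CONNECTOR_RE.search(title):
        return False
    left = topic[: match.start()].strip().lower()
    right = topic[match.end():].strip().lower()
    left_in, right_in = left in title, right in title
    # Pre-existing suppression from the changelog: left_in == right_in.
    return left_in != right_in


topic = "notable people from Big Rapids, Michigan"
print(should_emit_footer(topic, "Big_Rapids,_Michigan"))                     # False
print(should_emit_footer(topic, "Big_Rapids,_Michigan", title_guard=False))  # True
print(should_emit_footer("Paris and Berlin", "Paris"))                       # True
```

With the guard disabled, the prefixed left half (`notable people from big rapids`) is not contained in the title while `michigan` is, so the older check alone lets the footer through. That is the false-fire the changelog describes.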
Deferred
- P1-D4 (lower priority): `browse_namespace` silently accepts
cursors emitted by `walk_namespace` (cross-tool reuse at the
simple-tools dispatcher layer; the advanced tools already
enforce). Not user-facing critical: simple-tools reads
`state["o"]` and walks browse from that offset, which for the
metadata namespace coincidentally produces a continuation page.
A defence-in-depth follow-up would stash `state["t"]` and add a
`_cursor_t_mismatch` check alongside the existing
`_cursor_ns_mismatch`. Filed as follow-up rather than bundled
here to keep the sweep tight.
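One possible shape for the deferred P1-D4 check is sketched below. Nothing here is shipped code: the helper names, the `_cursor_t` option key, and the error wording are hypothetical, and the sketch assumes the tool tag sits at the top level of the decoded cursor, as in the `{"v":2,"t":"walk_namespace",...}` example quoted earlier.

```python
from typing import Optional


def stash_cursor_tool(cursor: dict, options: dict) -> None:
    """Dispatcher step (hypothetical): remember which tool emitted the cursor."""
    if "t" in cursor:
        options["_cursor_t"] = cursor["t"]


def cursor_t_mismatch(options: dict, tool_name: str) -> Optional[str]:
    """Return an error message when the cursor came from another tool."""
    emitted_by = options.get("_cursor_t")
    if emitted_by is None or emitted_by == tool_name:
        return None  # no cursor, or cursor belongs to this tool
    return (f"Cursor was emitted by '{emitted_by}' and cannot be used with "
            f"'{tool_name}'. Re-issue the request without a cursor.")


opts: dict = {}
stash_cursor_tool({"v": 2, "t": "walk_namespace", "s": {"o": 3}}, opts)
print(cursor_t_mismatch(opts, "browse_namespace"))  # prints the rejection message
print(cursor_t_mismatch(opts, "walk_namespace"))    # prints None
```

Returning a message instead of raising keeps the check composable with an existing namespace-mismatch check, though the real dispatcher may prefer a different convention.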
Methodology
Two passes (rather than the recent 3–7) because the three landed
fixes were narrow, well-characterised, and had no live-only
surfaces that source-level self-audit couldn't cover.
- `_AFFINITY_TOKEN_RE` in `synthesize.py` and
`_tokenize_for_relevance` in `zim/search.py` use the same ASCII
pattern as the pre-fix `_TAIL_TOKEN_RE` but are symmetric tokenisers (same
regex applied to both sides of the comparison); the P1-D2 shape
is a unidirectional probe that destroys the topic before the
backend sees it, which is structurally different. No siblings.
- `verify_archive_identity` is also called from
`browse_namespace_data`, `extract_article_links_data`, search
cursor paths, and structure cursors, but all gate on an explicit
`cursor_archive_identity` kwarg that the simple-tools handlers
don't pass; only `walk_namespace` builds a cursor_state envelope
whose `ai` the data layer unconditionally checks. No siblings.
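The difference between a symmetric tokeniser and the unidirectional P1-D2 probe can be illustrated with the two regexes quoted in this changelog. The helpers `tokenize` and `symmetric_overlap` are hypothetical, and lower-casing before matching is an assumption inferred from the `["m", "nchen"]` output reported above.

```python
import re

ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+")   # pre-fix pattern quoted in the changelog
UNICODE_TOKEN_RE = re.compile(r"[^\W_]+")   # post-fix pattern


def tokenize(pattern: re.Pattern, text: str) -> list:
    """Lower-case the text (an assumption) and split it with the pattern."""
    return pattern.findall(text.lower())


def symmetric_overlap(pattern: re.Pattern, query: str, title: str) -> float:
    """Share of query tokens also found in the title, same regex on both sides."""
    query_tokens = set(tokenize(pattern, query))
    title_tokens = set(tokenize(pattern, title))
    return len(query_tokens & title_tokens) / len(query_tokens) if query_tokens else 0.0


# Symmetric use: both sides are mangled identically, so they still agree.
print(symmetric_overlap(ASCII_TOKEN_RE, "München", "München"))  # 1.0

# Unidirectional use: the mangled token itself becomes the lookup key.
print(tokenize(ASCII_TOKEN_RE, "München"))                 # ['m', 'nchen']
print(tokenize(UNICODE_TOKEN_RE, "München"))               # ['münchen']
print(tokenize(UNICODE_TOKEN_RE, "Big_Rapids,_Michigan"))  # ['big', 'rapids', 'michigan']
```

In the symmetric case the ASCII pattern loses information on both sides equally, so the comparison still lines up. In the probe case the fragment `m` is handed to a title lookup on its own, which is how the `M` letter article was returned.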
PR: #145.
Commits on the sweep branch: d42213b (pass-1 fixes + 14 tests),
8f8a44e (pass-2 self-audit + 7 edge-case tests), e59b953 /
2f71bba (CI lint fixes — F401 unused-imports / isort).
About cameronrye/openzim-mcp
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Earlier breaking changes
- v2.0.0a15 `_attribute_sections` falls back to the first section when no section brackets the located passage.
- v2.0.0a13 Canonical-splice gate tightened to require exact path equality, fixing H2/H3 surface end-to-end behavior across all shapes.
- v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
- v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
- v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.