This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Updates Testing, P1-D3, and P1-D1 across a mixed release.
Changes in this release
| Type | Severity | Summary | Confidence | CVE |
|---|---|---|---|---|
| Feature | Medium | New `_looks_like_slashed_compound` helper protects slashed compounds like TCP/IP, AC/DC. | high | - |
| Feature | Medium | New ALL-CAPS clause in `_is_substantive_topic` accepts HTTP, TCP, IP, USA, EU, R&B etc. | high | - |
| Feature | Medium | Expanded politeness regex to include second-wave SMS variants (tx, txs, tyvm, thnks, etc.). | low | - |
| Feature | Medium | Added 73 regression tests covering new fixes and edge cases. | low | - |
| Performance | Medium | Lint discipline now catches static-analyzer noise pre-merge, reducing CI noise. | low | - |
| Bugfix | Medium | Fixes silent abandonment of short ALL-CAPS acronym chains (TCP/IP, AC/DC). | high | - |
| Bugfix | Medium | New `IntentParser._strip_param_leaks` removes `<param>=<value>` suffixes before title promotion. | low | - |
| Refactor | Medium | Switched q-emitting drift scanner glob from `Path.glob` to `rglob` for recursive file matching. | low | - |

Source for all rows: granite4.1:8b-q6_K@2026-05-20
Full changelog
Live-MCP beta sweep against wikipedia_en_all_maxi_2026-02.zim on
the freshly-deployed v2.0.0a23 build. Smoke gates 4/4 green
pre-fix; the live MCP session dropped after ~30 probes (same long-
session connection timeout pattern observed in the post-a22 sweep)
so pass-2 ran as a source-level sibling audit per the post-a17
methodology refinement — zero new defects surfaced. "Narrow-scope
sibling" pattern holds for the 4th sweep running: 3 of 4 defects
this sweep are narrower-than-needed enumerations on the matching
a23 fix shape (politeness regex missed a second wave of SMS /
multi-word / multilingual tokens, q-emitting drift-guard used
non-recursive glob, substantive filter rejected short ALL-CAPS
acronyms). The fourth defect (P1-D3) is a new defect class — the
title-promotion path silently resolves leaked <param>=<value>
suffixes to wildly unrelated articles.
Multi-entity chain ALL-CAPS / slashed-acronym silent abandonment (P1-D1)
The post-a22 _split_multi_entity / _is_substantive_topic pair
correctly handles long bare-topic chains (Berlin / Munich / Köln)
and non-Latin shorts (東京 / Köln, post-a19 P1-D3). But two
interacting failures left short ALL-CAPS acronym chains silently
abandoned:
- The slash splitter (`\s*/\s*` in `_SOFT_CHAIN_CONNECTOR_PATS`)
  fragmented slashed acronyms: TCP/IP → ["TCP", "IP"],
  AC/DC → ["AC", "DC"], Either/Or → ["Either", "Or"].
- `_is_substantive_topic` rejected the fragments because they fail
  every existing clause (HTTP=4, TCP=3, IP=2, AC=2, DC=2: none
  ≥5 chars, no digit, no non-ASCII letter). With every half
  failing substantive, `_split_multi_entity` returned None and
  the chain rejection silently abandoned.
Two live failures observed:
- `tell me about TCP/IP and HTTP and HTTPS` → silently returned
  the HTTPS article (matching-tail short-circuit picks the
  longest substantive ASCII tail), dropping TCP/IP and HTTP.
- `tell me about AC/DC and Iron Maiden and Metallica` → silently
  returned Metallica. Same path.
Two coordinated fixes:
- New `_looks_like_slashed_compound` helper protects slashed
  compounds whose halves are letter-only with a ≤2-char half
  (TCP/IP, AC/DC, Either/Or, A/B). Berlin / Munich (min half 6
  chars) still splits as a genuine 2-entity chain.
- New ALL-CAPS clause in `_is_substantive_topic`:
  `isupper() and len ≥ 2` accepts HTTP, TCP, IP, USA, EU, R&B
  etc. Mirrors the post-a19 P1-D3 non-Latin clause: short tokens
  with a clear proper-noun signal aren't English sentence-words.
  Mixed-case Now / Both / Here / Then stay rejected.
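
The two checks described above can be sketched roughly as follows. This is a minimal illustration built only from the rules stated in this changelog, not the project's actual code: the function names `looks_like_slashed_compound` and `is_short_all_caps` and the `_MAX_SHORT_HALF` constant are hypothetical stand-ins.

```python
# Hypothetical threshold: a slashed pair counts as one compound when at
# least one half is this short (the "<=2-char half" rule described above).
_MAX_SHORT_HALF = 2


def looks_like_slashed_compound(text: str) -> bool:
    """Return True for letter-only, two-part slashes with a short half."""
    parts = [p.strip() for p in text.split("/")]
    if len(parts) != 2:
        return False  # 3-part slashes are not protected
    if not all(p.isalpha() for p in parts):
        return False  # digit halves or empty halves are not protected
    return min(len(p) for p in parts) <= _MAX_SHORT_HALF


def is_short_all_caps(token: str) -> bool:
    """Sketch of the ALL-CAPS clause: isupper() and len >= 2."""
    # str.isupper() ignores uncased characters, so "R&B" passes while
    # mixed-case sentence words such as "Now" do not.
    return len(token) >= 2 and token.isupper()
```

Under these assumptions, TCP/IP, AC/DC and Either/Or are kept whole, while Berlin / Munich (shortest half is 6 characters) is still left to the splitter as a two-entity chain.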
Politeness regex second-wave family (P1-D2)
The post-a22 P1-D2 SMS extension added thnx / thanx / tysm
/ kthx / kthxbai but missed a second wave of live-observed
variants:
- 1-2 char compressions: tx, txs
- longer SMS: tyvm, thnks, thxx, kthxbye
- multi-word: thanks a million, thank (you|u) (so|very) much
- multilingual second tier: obrigado(a) (Portuguese),
  arigato(u) (Japanese romaji), spasibo (Russian)
Same narrow-scope sibling pattern as a22 P1-D2 → a23 P1-D2 — each
sweep so far has shipped narrower than the natural politeness
family. The word-boundary anchor (post-a21) already protects
short tokens from mid-word matches (manta / pasta / vista /
cantata stay intact).
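
To illustrate how a word-boundary anchor keeps short tokens such as tx from matching mid-word, here is a small sketch covering only the second-wave tokens listed above. The pattern name `_POLITENESS_SKETCH`, the helper `strip_politeness`, and the reading of obrigado(a) / arigato(u) as optional final letters are assumptions; the project's real regex is not shown in this changelog.

```python
import re

# Illustrative subset only: the second-wave tokens named in this section.
_POLITENESS_SKETCH = re.compile(
    r"""
    \b(?:
        thanks\s+a\s+million
      | thank\s+(?:you|u)\s+(?:so|very)\s+much
      | kthxbye | tyvm | thnks | thxx | txs | tx
      | obrigad[oa] | arigatou? | spasibo
    )\b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_politeness(query: str) -> str:
    """Remove matched politeness tokens and collapse leftover whitespace."""
    cleaned = _POLITENESS_SKETCH.sub(" ", query)
    return " ".join(cleaned.split())
```

For example, `strip_politeness("Berlin TYVM")` leaves `"Berlin"`, while a query such as `"txt files"` is untouched because the `\b` anchors require tx to stand as a whole word.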
<param>=<value> query-suffix silent fragmentation (P1-D3) — NEW defect class
Live: tell me about Photosynthesis limit=10 returns the article
for the number 10 (Wikipedia's number article). Same shape for
compact_budget=200 (returns the year 200 article),
content_offset=100 (returns 100), offset=5 (returns 5). Root
cause: a small model that doesn't know to pass limit as the
typed MCP parameter occasionally leaks limit=N INTO the query
text; the title-promotion tokeniser sees "10" as a clean ASCII
digit tail and scores it cleanly against the number-article
title index, returning a wildly unrelated body that masks the
model's actual topic.
Distinct from a23 P1-D5 (docstring nudge for atomic intents that
ignore limit). The docstring tells the model not to pass
limit as text on atomic intents, but it can't prevent a model
that's confused about parameter-passing semantics from typing
limit=10 as text anyway. Fix: new
IntentParser._strip_param_leaks peels \s+<param>=<value>
shapes BEFORE the politeness loop runs. Token list covers every
zim_query argument (limit, offset, content_offset,
max_content_length, max_words, compact_budget, compact,
synthesize, cursor, zim_file_path, entry_path,
namespace, partial_query). Idempotent loop handles multiple
leaks in one call. The \s+ leading anchor protects prose
mentions (offset printing, cursor algorithms, the compact disc) from accidental strip.
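
A rough sketch of the stripping step described above, assuming the leak always appears as a trailing `<param>=<value>` suffix preceded by whitespace. The standalone function `strip_param_leaks` and the `_PARAM_LEAK` pattern are hypothetical; only the parameter names come from this changelog, and the real `IntentParser._strip_param_leaks` may differ.

```python
import re

# Parameter names taken from the zim_query argument list above.
_PARAM_NAMES = (
    "limit", "offset", "content_offset", "max_content_length",
    "max_words", "compact_budget", "compact", "synthesize", "cursor",
    "zim_file_path", "entry_path", "namespace", "partial_query",
)

# Assumed shape: leading whitespace, a known name, "=", a non-space value,
# anchored at the end of the query text.
_PARAM_LEAK = re.compile(
    r"\s+(?:" + "|".join(_PARAM_NAMES) + r")=\S+\s*$",
    re.IGNORECASE,
)


def strip_param_leaks(query: str) -> str:
    """Peel trailing <param>=<value> suffixes until none remain."""
    previous = None
    while previous != query:  # loop handles several leaks in one query
        previous = query
        query = _PARAM_LEAK.sub("", query)
    return query
```

With this sketch, `tell me about Photosynthesis limit=10` reduces to `tell me about Photosynthesis`, and prose such as `offset printing` is unchanged because it contains no `=`.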
Q-emitting drift scanner non-recursive glob (P1-D4)
The post-a22 P1-D3 widening from zim/search.py to all of
zim/*.py used Path.glob (direct children only). The current
openzim_mcp/zim/ tree is flat (no subdirectories) so behaviour
is unchanged today, but a future contributor adding
openzim_mcp/zim/cursor/encoder.py or any subdirectory with
q-emitting Cursor.encode callsites would have those silently
missed by the scan, breaking the drift guard's promise. Same
narrow-scope sibling shape as the a22 P1-D3 widening from one
file to all direct-child files in the directory — the next
widening is naturally to all files in the tree. Fix: switch to
rglob.
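
The difference between the two `pathlib` calls can be reproduced in isolation. The sketch below builds a throwaway directory with one nested module; the helper names `scan_direct`, `scan_recursive` and `demo` are hypothetical and are not the project's scanner.

```python
import tempfile
from pathlib import Path


def scan_direct(root: Path) -> list[str]:
    """Path.glob("*.py") only sees direct children of root."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob("*.py"))


def scan_recursive(root: Path) -> list[str]:
    """Path.rglob("*.py") also descends into subdirectories."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))


def demo() -> tuple[list[str], list[str]]:
    """Build a tiny tree with one nested module and scan it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "zim"
        (root / "cursor").mkdir(parents=True)
        (root / "search.py").write_text("# top-level module\n")
        (root / "cursor" / "encoder.py").write_text("# nested module\n")
        return scan_direct(root), scan_recursive(root)
```

Here `demo()` returns `["search.py"]` for the non-recursive scan and `["cursor/encoder.py", "search.py"]` for the recursive one, which is the gap the switch to `rglob` closes.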
Methodology
The recurring "fix unlocks new paths" + "narrow-scope
sibling" pair held for the 4th sweep running. Three of four
defects this sweep are narrow-scope siblings of a23's own fixes
(P1-D1 narrow substantive filter + narrow slash split, P1-D2
narrow politeness enumeration, P1-D4 narrow scanner glob). The
fourth (P1-D3) is a new defect class — a small-model-leaked
parameter shape that silently fragments to an unrelated article
via title-promotion. The post-a22 lint-leak refinement (`make lint` locally before push; check SonarCloud findings via API
before merging; avoid `[\s\S]+?` + literal regex shapes) was
followed cleanly — only one SonarCloud finding emerged
(implicit-concat strings from black auto-format, S6571), fixed
in a single follow-up commit before merge. The methodology is
stabilising: the structural defect classes the sweep catches
remain consistent across alphas, and the lint discipline now
catches static-analyzer noise pre-merge rather than letting it
leak to CI.
Testing
- 73 regression tests in `tests/test_post_a23_beta_fixes.py`:
  - `TestP1D1MultiEntityAllCapsAndSlashedAcronyms` (12 cases):
    short ALL-CAPS substantive, R&B with ampersand, mixed-case
    rejected, slashed-compound helper identifies acronyms, rejects
    proper-noun pairs / 3-part slashes / digit halves, end-to-end
    split for TCP/IP, AC/DC, Berlin / Munich, ALL-CAPS chain.
  - `TestP1D2PolitenessSecondWave` (~30 parameterized cases): every
    new token + chained + word-boundary safety + case-insensitive +
    regression guards on every post-a22 token.
  - `TestP1D3ParamLeakSuffix` (~20 cases): every param-name × value
    shape strips, end-to-end `parse_intent`, multi-param chains, mix
    with politeness, prose-mention preservation, idempotence.
  - `TestP1D4QEmittingScannerRecursive` (3 cases): source-level
    `rglob` check + scanner returns expected pinned set.
  - `TestLiveMcpReproduction` (6 end-to-end probes mirroring the
    live-MCP queries the sweep observed).
  - `TestRegressionGuards` (6 cases pinning post-a17 / a18 / a19 /
    a22 fixes that share code with the changed paths).
- Full suite: 2053 passed, 50 skipped. mypy clean across all
  45 source files. `make lint` (flake8 + isort + black) clean.
  SonarCloud quality gate passed with 0 open issues post-merge.
Release process
After this changelog lands on main, push the v2.0.0a24 tag
on main to trigger .github/workflows/release.yml — PyPI
publish + GitHub release notes auto-extracted from the matching
CHANGELOG section.
About cameronrye/openzim-mcp
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Earlier breaking changes
- v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
- v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
- v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
- v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
- v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.