This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: v2.0.0a12 fixes query disambiguation (`France` returns the country article, not the football team), multi-word title extraction, and punctuation preservation in title lookups.
Why it matters: Fixes improve query disambiguation and title matching in openzim-mcp. This is a pre-release (v2.0.0a12); test in dev before production indexing.
Summary
AI summary: Fixed `tell me about France` silently returning the football-team article.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Security | High | SonarCloud ReDoS vulnerability in the L2 orphan-trim regex has been fixed by replacing it with safe string operations. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | `_extract_entry_path_keyworded` regex now captures multi-word titles correctly. Source: llm_adapter@2026-05-21. Confidence: high. | - |
| Bugfix | Medium | Title-index lookups preserve punctuation for topics like `C++`. Source: llm_adapter@2026-05-21. Confidence: high. | - |
| Bugfix | Medium | `tell me about France` returns the correct country article instead of the football-team article. Source: llm_adapter@2026-05-21. Confidence: low. | - |
| Bugfix | Medium | Filtered-search responses now include canonical title-match hits with a distinct badge instead of dropping them silently. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Medium | Walk-namespace **M** and `metadata for` now report identical sets of metadata keys (13 vs 12 previously). Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | `get article M/Illustration_48x48@1` preserves the `@` suffix during extraction, fixing truncation of metadata paths. Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Bugfix | Low | Truncation footer denominator stays stable across pagination, showing "showing chars X–Y of N-char body". Source: granite4.1:30b@2026-05-22-audit. Confidence: low. | - |
| Other | Low | `affected_surface`. Source: llm_adapter@2026-05-21. Confidence: low. | - |
Full changelog
Three-pass beta test of v2.0.0a11 against the same 118 GB Wikipedia
ZIM (Feb 2026 snapshot) that the a8 → a11 cuts targeted, via the
simple-mode `zim_query` MCP surface. The first pass surfaced 11 live
defects plus a handful of opportunities; the second pass self-audited
the first-pass commit and found 3 more; the third pass self-audited
the second-pass commit and found 1 deeper case. The 22 → 6 → 3 shape
of the a10 → a11 sweep repeats at 11 → 3 → 1.
The single most user-visible defect was `tell me about France`
silently returning `France_national_football_team_results_(2000–2019)`
while Germany / Italy / Spain / Brazil / Mexico all returned the
correct country article. Xapian's top hit was the football article,
and the existing H3 canonical-prepend gate explicitly skipped the
`len(strong_matches) == 1` non-twin case. The same root-cause shape
(silent fall-through to a wrong-but-similar article) drove most of
this sweep's catches.
The two structural root causes — _extract_entry_path_keyworded
regex character class and the early-return suffix-bypass pattern —
each accounted for multiple defects in different surfaces.
Net: 1493 tests pass (+30 over v2.0.0a11), 50 skipped, 38
deselected. black / isort / flake8 / mypy / CodeQL /
SonarCloud all clean.
Fixed — Critical (post-a11 beta sweep)
- C1: `tell me about France` returned the football-team article.
  Xapian's #1 hit was
  `France_national_football_team_results_(2000–2019)`, which
  strong-matched `topic=France` via the candidate-extends-topic rule,
  leaving `len(strong_matches) == 1` non-twin; the H3
  canonical-prepend gate explicitly skipped that case. Gate now also
  fires when the lone strong match's tokens differ from the topic's,
  and a sibling auto-pick `_auto_pick_canonical_over_extends_topic`
  prefers the canonical when the strong-match set is exactly
  `[canonical-with-topic-tokens, ..._extends-topic-only]`. Mercury /
  Apollo / Java / DNA forks unchanged. Apollo 11 and similar hub
  topics now auto-resolve to the canonical with variants surfaced as
  a _May also refer to: ..._ footer hint.
- C2: multi-word entry-path extraction silently dropped the second
  word on five operations. The shared `_extract_entry_path_keyworded`
  regex used `[A-Za-z0-9_/.-]+` for the capture, so
  `show structure of United States` matched `of United` and captured
  `United`. New extractor anchors at the LAST keyword and captures
  everything that follows, so `World War II`, `Albert Einstein`,
  `Quantum mechanics` all flow through correctly on `structure` /
  `summary` / `links` / `get_article` / `toc`.
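The C2 failure mode can be reproduced in a few lines. The sketch below is illustrative only: `OLD_PATTERN` wraps the character class quoted above in an assumed `of` prefix, and `KEYWORDS` and `extract_after_last_keyword` are hypothetical names, not the project's actual extractor.

```python
import re
from typing import Optional

# The quoted character class has no space in it, so the capture stops
# at the end of the first word. The "\bof\s+" prefix is an assumption.
OLD_PATTERN = re.compile(r"\bof\s+([A-Za-z0-9_/.-]+)")

# Hypothetical keyword list; the real extractor's keywords may differ.
KEYWORDS = ("of", "article")


def extract_after_last_keyword(query: str) -> Optional[str]:
    """Return everything after the last keyword occurrence, or None."""
    last_end = -1
    for keyword in KEYWORDS:
        pattern = rf"\b{re.escape(keyword)}\b"
        for match in re.finditer(pattern, query, flags=re.IGNORECASE):
            last_end = max(last_end, match.end())
    if last_end < 0:
        return None
    tail = query[last_end:].strip()
    return tail or None
```

With this sketch, `show structure of United States` yields `United States` and `get article M/Illustration_48x48@1` keeps its `@1` suffix. A naive last-keyword anchor like this one would mis-split a title that itself contains a keyword; the notes do not say how the real extractor handles that case.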
Fixed — High (post-a11 beta sweep)
- H1: title-index lookups for punctuated topics smeared to
  drop-the-punctuation candidates. `tell me about C++` resolved past
  the title index to `C` (the letter); paired with the C2 fix that
  now preserves `++` through extraction, the punctuation-count guard
  (`_punctuation_smear_detected`) rejects candidates that drop a
  `+` / `#` count present in the topic. Known limitation: topic →
  candidate pairs that preserve the punctuation count (`C++` →
  `C/C++`) require redirect-target inspection and are deferred.
- H2: filtered-search dropped the canonical title-match hit.
  `_handle_filtered_search` was a one-call delegate to
  `search_with_filters` (legacy markdown path), so the splice
  `_handle_search` runs at offset=0 never fired. New
  `search_with_filters_with_canonical_splice` runs the same probe +
  prepend as the basic-search path, gated to canonical hits whose
  path lives in the requested namespace.
- H3: Opp2 list / discography demote was synthesize-layer-only.
  `_demote_list_articles` lived inside `synthesize_query`; basic
  `search` left catalog-shape hits in place at their BM25 rank.
  Lifted the predicate `_is_list_article` for cross-call use and
  applied it inside `_splice_title_match_into_search` (basic search)
  and the new H2 filtered-search splice.
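The punctuation-count guard described in H1 can be sketched as a simple per-character comparison. This is a minimal illustration under the assumption that only `+` and `#` are tracked; `punctuation_smear_detected` here is a stand-in and may not match the signature of the project's `_punctuation_smear_detected`.

```python
def punctuation_smear_detected(topic: str, candidate: str, marks: str = "+#") -> bool:
    """Return True when the candidate has fewer of any tracked mark than the topic.

    Assumption: a "smear" means the candidate dropped at least one
    occurrence of a tracked punctuation character present in the topic.
    """
    return any(candidate.count(mark) < topic.count(mark) for mark in marks)
```

Under this sketch `C++` → `C` is rejected, while `C++` → `C/C++` passes because the `+` count is preserved, which matches the known limitation noted in H1.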
Fixed — Medium (post-a11 beta sweep)
- M1: `walk namespace M` and `metadata for` disagreed (13 vs 12
  keys). Shared `is_human_readable_metadata_key` predicate now
  consulted from both sites.
- M2: `get article M/Illustration_48x48@1` stripped `@1`. Same root
  cause as C2; fixed by the C2 extractor change.
- M3: `walk namespace C` reported "archive total" instead of
  per-namespace count. L16's `namespace_entry_count` plumbing now
  applies to new-scheme C (the count equals `archive.entry_count`).
- M4: truncation footer reported remaining-after-offset chars as
  "total". Added `original_total` kwarg, plumbed from the three
  callers in `zim/content.py`. Mid-article reads now switch to
  `showing chars X–Y of N-char body` so the denominator stays stable
  across pagination.
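The M4 fix comes down to computing the denominator from the full body rather than from the slice that remains after the offset. A minimal sketch, assuming 1-based inclusive character positions in the footer (the notes do not state the indexing) and using a hypothetical `truncation_footer` helper:

```python
def truncation_footer(body: str, offset: int, page_size: int) -> str:
    """Build a footer whose total does not shrink as the offset grows."""
    # The denominator comes from the whole body, not from body[offset:].
    original_total = len(body)
    chunk = body[offset:offset + page_size]
    start = offset + 1          # assumed 1-based start position
    end = offset + len(chunk)   # inclusive end position
    return f"showing chars {start}–{end} of {original_total}-char body"
```

Paging through the same body with increasing offsets then reports the same `N` on every page, which is the behaviour the footer change describes.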
Fixed — Low (post-a11 beta sweep)
- L1: structured guidance / error responses skipped the Opp6 intent
  telemetry comment. Three early-return paths (`Topic Required`,
  `Search Terms Required`, `Chained Operations Detected`) now carry
  their own deterministic telemetry comments at `cert=1.00`.
- L2: chained-intent splitter left the connector word attached to
  the left op. Strip trailing connectors / orphan punctuation so
  the suggested split-up call is cleanly pasteable.
- L3: canonical-title-match snippet rendered as snippet text. Now
  surfaced as a distinct `Match type: canonical title match` badge
  in both `_format_search_text` and `_format_filtered_response`.
Fixed — Second-pass self-audit (post-a11 sweep)
L1 covered three of the six structured early-return paths in the same
code section but missed the other three:
- `Query Required` (empty / whitespace query) →
  `intent=query_required cert=1.00`
- `_meta_query_guidance` (meta-only filler queries like `do both` /
  `try again` / `ok`) → `intent=meta_only_guidance cert=1.00`
- `No ZIM File Specified` (no archive selectable) →
  `intent=no_zim_file_specified cert=1.00`
Fixed — Third-pass self-audit (post-a11 sweep)
- H2 splice silently dropped the canonical when
  `search_with_filters_data` returned 0 hits but `find_title_match`
  reported the canonical exists in the requested namespace.
  Symmetric to the bug the first-pass H2 fix addressed (canonical
  missing from a non-empty result page): same wrong silent
  fall-through, different shape. Hoisted the synthetic-canonical row
  construction above the populated-vs-empty branch so both paths
  share the same prepend logic. The empty-results path now lands the
  canonical as a single-result page with the post-a11 L3 badge.
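The hoisting described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the dict-shaped hits, the `match_type` field, the de-duplication step, and the `splice_canonical` name are not taken from the project's code.

```python
from typing import Dict, List, Optional

BADGE = "canonical title match"


def splice_canonical(
    hits: List[Dict[str, str]],
    canonical: Optional[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Prepend the canonical hit, whether or not the search returned rows."""
    if canonical is None:
        return list(hits)
    # Build the synthetic row once, before branching on empty vs populated,
    # so both paths share the same prepend logic.
    row = {
        "path": canonical["path"],
        "title": canonical["title"],
        "match_type": BADGE,
    }
    if not hits:
        return [row]  # empty results: canonical becomes a single-result page
    # Assumed de-duplication: drop the canonical if it is already in the hits.
    rest = [hit for hit in hits if hit["path"] != row["path"]]
    return [row] + rest
```

The point of the shape is that the empty-results branch can no longer skip the canonical, since the row exists before the branch is taken.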
Fixed — Quality gate (post-a11 sweep)
- SonarCloud S5852 ReDoS on the L2 orphan-trim regex
  `\s+(?:and|or|but)\s*$|\s*[;,]\s*$` (multiple unbounded `\s*` /
  `\s+` quantifiers in alternation). Replaced with string ops that
  mirror the original "strip one of: trailing connector word OR
  trailing `;` / `,`" semantics, same approach as the existing
  `_is_disambig_lead` workaround in the same file.
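A regex-free version of the orphan trim might look like the sketch below. It approximates the quoted pattern with plain string operations; `strip_trailing_orphan` and `CONNECTORS` are hypothetical names, case-sensitive matching is an assumption, and edge cases (such as a string consisting only of whitespace plus a connector) may differ from the project's replacement.

```python
CONNECTORS = ("and", "or", "but")


def strip_trailing_orphan(text: str) -> str:
    """Strip one trailing `;`/`,` or one trailing connector word."""
    stripped = text.rstrip()
    # Case 1: trailing ';' or ',' (optionally followed by whitespace).
    if stripped.endswith((";", ",")):
        return stripped[:-1].rstrip()
    # Case 2: trailing connector word preceded by whitespace.
    parts = stripped.rsplit(None, 1)
    if len(parts) == 2 and parts[1] in CONNECTORS:
        return parts[0]
    # No orphan found: return the input unchanged.
    return text
```

Because each step is a bounded scan from the end of the string, there is no backtracking for a crafted input to exploit.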
Wire-format / surface changes (alpha-line clean breaks)
- `tell me about` auto-resolves to the canonical when the
  strong-match set is
  `[canonical-with-topic-tokens, ..._extends-topic-only]`
  (Apollo 11, Pride and Prejudice, hub topics with parenthesized
  siblings). Variants are surfaced as a _May also refer to: ..._
  footer hint instead of the prior disambig fork. Genuine
  multi-meaning topics (Apollo / Mercury / Java / DNA) still fork as
  before.
- `show structure of` (and `summary` / `links` / `get article` /
  `table of contents` of) actually accept multi-word titles. Pre-fix
  these silently truncated to the first word and rendered the wrong
  article.
- Filtered-search responses include canonical title-match hits
  with a distinct `Match type: canonical title match` badge instead
  of dropping them silently.
- `tell me about C++` (or any topic with `+` / `#`) no longer
  resolves to a candidate that dropped the punctuation. Falls
  through to search-fallback where canonical-title-match can find
  the actual `_programming_language`-suffixed article.
- `get article M/Illustration_48x48@1` (or any path with `@`)
  preserves the suffix through extraction. Pre-fix the regex
  character class stripped `@1` before the metadata API saw it.
- Walk-namespace M and metadata-for now agree on the metadata-key
  set (filtered `Illustration_*` binaries on both sides).
- Walk-namespace C reports `(of N in namespace C)` instead of
  `(archive total: ~N entries)` for new-scheme archives.
- Truncation footer denominator stays stable across pagination.
  Mid-article reads switch to `showing chars X–Y of N-char body` so
  a caller paging through a 146 KB article doesn't see the "total"
  decrease with every page.
- Every structured guidance / error response carries an intent
  telemetry comment so callers branching on
  `<!-- intent=... cert=... -->` see the rejection class.
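A caller that branches on the telemetry comment could parse it roughly as follows. This is a sketch under the assumption that the comment looks exactly like the examples in these notes (for instance `<!-- intent=query_required cert=1.00 -->`); `parse_intent` is a hypothetical client-side helper, not part of openzim-mcp.

```python
import re
from typing import Optional, Tuple

# Assumed layout: "<!-- intent=<token> cert=<decimal> -->".
_TELEMETRY = re.compile(r"<!--\s*intent=(\S+)\s+cert=(\d+(?:\.\d+)?)\s*-->")


def parse_intent(response: str) -> Optional[Tuple[str, float]]:
    """Return (intent, cert) from the first telemetry comment, or None."""
    match = _TELEMETRY.search(response)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

A client could then treat intents such as `query_required` or `no_zim_file_specified` as rejection classes and handle them without inspecting the human-readable body.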
About cameronrye/openzim-mcp
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Earlier breaking changes
- v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
- v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
- v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
- v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
- v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.