cameronrye/openzim-mcp

v2.0.0b1 Breaking

This release includes breaking changes; platform teams should review them before upgrading.

Published 2mo MCP Data & Storage

✓ No known CVEs patched

Topics

kiwix mcp mcp-server openzim zim

Affected surfaces

auth rbac

ReleasePort's take

Light signal

editorial:auto 2mo

v2.0.0b1 introduces a configurable cross‑encoder reranker, four idempotent query rewrites, and ReDoS hardening while fixing model ID defaults and lowercasing regressions.

Why it matters: Patch to v2.0.0b1 immediately; the new ReDoS safeguards require deployment before handling untrusted input.

Summary

AI summary

A mixed release: ReDoS hardening, an optional cross-encoder reranker, rule-based query rewriting, bug fixes, and test and CI updates.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Security	Medium	Adds ReDoS hardening via bounded quantifiers and token-count limits.	llm_adapter@2026-05-21	high	none
Feature	Medium	Adds optional cross-encoder reranker via [reranker] extra installation.	llm_adapter@2026-05-21	high	none
Feature	Medium	Adds four idempotent rule-based query rewrites to base install.	llm_adapter@2026-05-21	high	none
Feature	Medium	Enforces 5-second wall-clock timeout on first-call reranker model load.	llm_adapter@2026-05-21	high	none
Feature	Medium	Reranker sets persistent kill switch after first load failure.	llm_adapter@2026-05-21	high	none
Feature	Medium	Decomposes X-of-Y and possessive queries into structured entity+attribute hints.	llm_adapter@2026-05-21	high	none
Feature	Medium	Applies token-level misspelling substitution from bundled ~40-entry curated map.	llm_adapter@2026-05-21	low	none
Feature	Medium	Strips leading articles from queries unless matching canonical entity titles.	llm_adapter@2026-05-21	low	none
Feature	Medium	Adds master kill switch to disable all four query rewrite rules atomically.	llm_adapter@2026-05-21	low	none
Bugfix	Medium	Fixes reranker default model ID from unsupported Xenova to BAAI/bge-reranker-base.	llm_adapter@2026-05-21	high	none
Bugfix	Medium	Fixes four code paths broken by unconditional query lowercasing.	llm_adapter@2026-05-21	low	none
Bugfix	Low	Fixes several code paths broken by unconditional query lowercasing (e.g., command keywords, section extraction).	granite4.1:30b@2026-05-21-audit	low	none
Refactor	Medium	Adds 68 new tests: 43 query rewrite, 25 reranker; updates 13 pre-existing.	llm_adapter@2026-05-21	low	none
Refactor	Medium	CI expands to ubuntu/macos/windows with Python 3.12 and 3.13; all pass.	llm_adapter@2026-05-21	low	none

Full changelog

First b-series release. Two new Phase D features ship together in the alpha → beta cutover: a cross-encoder reranker behind an opt-in extra (sub-D-1, originally PR #163) and four idempotent rule-based query rewrites in the base install (sub-D-2, PR #164). Combined: every user sees a cleaner query reaching Xapian; users who opt into [reranker] also see semantic reranking on the BM25 results. Pre-release pending the live-MCP smoke sweep.

Sub-D-1 — Cross-encoder reranker behind `[reranker]` extra (#163)

Optional dependency. pip install openzim-mcp[reranker] pulls in FastEmbed (~150 MB Python packages) and the first rerank-eligible query lazily downloads BAAI/bge-reranker-base (~1.1 GB) from HuggingFace into the FastEmbed cache. Air-gapped operators run openzim-mcp download-models once on a network-connected machine to pre-stage the model. The base install (pip install openzim-mcp) is untouched — reranker code lazy-imports inside openzim_mcp/ml/reranker.py and is never loaded when the extra is absent.
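The lazy-import behaviour described above can be sketched roughly as follows. This is an illustration only: the helper names extra_installed and load_cross_encoder are hypothetical, and the real code in openzim_mcp/ml/reranker.py may be organised differently. The FastEmbed import path is the one that library documents for its cross-encoder.

```python
import importlib.util


def extra_installed(module_name: str = "fastembed") -> bool:
    """Return True when the optional dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None


def load_cross_encoder(model_name: str = "BAAI/bge-reranker-base"):
    """Build the cross-encoder lazily, or return None without the extra.

    Hypothetical helper, not the project's actual loader.
    """
    if not extra_installed():
        return None  # base install: nothing from FastEmbed is imported
    # Deferred import so the base install never touches FastEmbed.
    from fastembed.rerank.cross_encoder import TextCrossEncoder

    # Constructing the encoder may download the model on first use.
    return TextCrossEncoder(model_name=model_name)
```

Checking with find_spec keeps the availability test cheap, because the heavy package is only imported once a rerank is actually requested.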

Wired into all four search surfaces: _handle_search, _handle_filtered_search, _handle_search_all, and synthesize._collect_passages. Each surface emits per-call telemetry:

reranker_engaged — cross-encoder actually scored the candidates (the returned list carries a rerank_score field).
reranker_skipped.not_installed — [reranker] absent OR OPENZIM_RERANKER_DISABLE=1 OR config.ml.reranker.enabled=false.
reranker_skipped.no_results — Xapian returned zero candidates; nothing to rerank.
reranker_skipped.passthrough — reranker ran but returned without scoring (skip-on-short-query gate fired below min_query_tokens=4, OR mid-inference failure tripped @ml_fallback).

Skip-on-short-query gate: queries with fewer than 4 word tokens bypass rerank because single-word entity queries (Berlin, Photosynthesis) already get a Xapian-score-1.0 canonical-title hit — the cross-encoder adds cost without value there.
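As a rough sketch of how the four telemetry outcomes and the short-query gate fit together, the function below picks one event name per call. The function name, its parameters, the tokenization rule, and the order of the checks are assumptions made for illustration; only the event names, the environment variable, and the threshold of 4 come from the notes above.

```python
import os
import re
from typing import Mapping, Optional, Sequence

MIN_QUERY_TOKENS = 4  # default named in the release notes


def count_word_tokens(query: str) -> int:
    # Assumption: a "word token" is a run of word characters.
    return len(re.findall(r"\w+", query))


def rerank_outcome(
    query: str,
    candidates: Sequence[object],
    *,
    extra_installed: bool,
    config_enabled: bool = True,
    inference_ok: bool = True,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the telemetry event name for one search call (hypothetical)."""
    env = os.environ if env is None else env
    disabled = env.get("OPENZIM_RERANKER_DISABLE") == "1"
    if not extra_installed or disabled or not config_enabled:
        return "reranker_skipped.not_installed"
    if not candidates:
        return "reranker_skipped.no_results"
    # Short queries and mid-inference failures both fall back to Xapian order.
    if count_word_tokens(query) < MIN_QUERY_TOKENS or not inference_ok:
        return "reranker_skipped.passthrough"
    return "reranker_engaged"
```

Under these assumptions a single-word query such as Berlin reports reranker_skipped.passthrough even when the extra is installed, which matches the gate's intent.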

Risk mitigations baked in:

5-second wall-clock timeout on first-call model load with non-blocking shutdown, so operators don't hang waiting on a stuck download. (The plan's original pattern, a with ThreadPoolExecutor block, had a latent bug: __exit__ blocked on shutdown(wait=True) even though the timeout had fired. The fix manages the executor lifecycle manually and calls shutdown(wait=False).)
Persistent kill switch on load failure — BGEReranker._load_failed is set after the first failure; every subsequent call returns None immediately. No retry storms.
@ml_fallback on rerank() — mid-inference exceptions (OOM, garbled UTF-8 tokens) degrade to Xapian-only ordering via _rerank_passthrough, log WARNING once + DEBUG on retry.
Production model ID guarded by integration test — test_production_default_model_is_supported_by_fastembed checks BAAI/bge-reranker-base is still in TextCrossEncoder.list_supported_models(). Catches FastEmbed registry drift before users do.
Cross-archive rerank in _handle_search_all redistributes globally-top-K hits back to per-archive buckets; ordering test pins this contract.
synthesize rerank propagates rerank_score into p["score"] so the downstream _boost_by_section_affinity sort doesn't silently revert to BM25 order.
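The first two mitigations (timeout with non-blocking shutdown, and the persistent kill switch) can be sketched together. This is a minimal illustration under stated assumptions: LazyModelLoader, load_attempts, and slow_loader are hypothetical names, and only the _load_failed flag, the 5-second default, and the shutdown(wait=False) call are taken from the notes above.

```python
import concurrent.futures
import time
from typing import Any, Callable, Optional


class LazyModelLoader:
    """Load a model once, with a wall-clock timeout and a kill switch."""

    def __init__(self, loader: Callable[[], Any], timeout_s: float = 5.0) -> None:
        self._loader = loader
        self._timeout_s = timeout_s
        self._model: Optional[Any] = None
        self._load_failed = False  # persistent kill switch
        self.load_attempts = 0     # illustration only, for observing retries

    def get(self) -> Optional[Any]:
        if self._load_failed:
            return None  # never retry after the first failure
        if self._model is not None:
            return self._model
        self.load_attempts += 1
        # Manage the executor by hand: a "with" block would call
        # shutdown(wait=True) on exit and block despite the timeout.
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            future = executor.submit(self._loader)
            self._model = future.result(timeout=self._timeout_s)
        except Exception:  # includes the timeout raised by result()
            self._load_failed = True
            return None
        finally:
            executor.shutdown(wait=False)  # return without joining the worker
        return self._model


def slow_loader() -> str:
    """Stand-in for a stuck download (hypothetical)."""
    time.sleep(0.3)
    return "model"
```

With LazyModelLoader(slow_loader, timeout_s=0.05), the first get() gives up after roughly the timeout and every later call returns None without invoking the loader again. One caveat of this sketch: the worker thread keeps running in the background until the loader returns, and CPython still joins executor threads at interpreter exit.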

Reality-check correction from the integration test: the original plan's default Xenova/bge-reranker-base-onnx model ID does not exist in FastEmbed 0.8.0. Caught by test_reranker_live.py on first run; fixed in config.py to BAAI/bge-reranker-base.

Sub-D-2 — Tier 1 query rewriting (#164)

Four idempotent rule-based rewrites in the base install (no extras, no models). Run before the existing _strip_* chain in IntentParser.parse_intent, so every downstream pipeline — Xapian search, intent regex matching, the sub-D-1 reranker — inherits a cleaner query.

| Rule | Method | Behavior |
|---|---|---|
| 1 | _normalize_topic_case | Lowercase the query. Consolidates scattered .lower() calls into a single named pass. No telemetry (fires on every query). |
| 2 | _apply_misspelling_map | Token-level substitution from a bundled dict[str, str] (~40 starter entries from Wikipedia's "List of common misspellings for machines"). An optional title-index probe suppresses substitution when the original token is a canonical entity name. Telemetry: query_rewrite.misspelling. |
| 3 | _detect_stopword_phrase | Strip leading stopwords (the, a, an, of) unless the full query is a canonical title ("The Beatles" and "Of Mice and Men" stay intact when the probe is provided). Telemetry: query_rewrite.stopword_phrase. |
| 4 | _decompose_x_of_y | Decompose query shapes such as "population of berlin" and "berlin's population". Emits BOTH a cleaner query string ("berlin population") AND a structured {"entity": ..., "attribute": ...} hint that rides inside params["decomposition_hint"]. _handle_tell_me_about consumes the hint and uses the structured entity directly, skipping its own topic extraction. Telemetry: query_rewrite.x_of_y. |

Rule order is fixed: 1 → 2 → 3 → 4. Each rule is idempotent (running twice produces no further change). All four are pure-Python; no I/O on the hot path (the bundled data files load once at module init via @functools.lru_cache).
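To make the fixed rule order and the idempotence property concrete, here is a deliberately simplified sketch of the four rules. It is not the project's implementation: the function names drop the leading underscore, the misspelling entries are made-up examples, the title probe is a plain callable, and Rule 4 only handles single-word entities and attributes.

```python
import re
from typing import Callable, Dict, Optional, Tuple

TitleProbe = Callable[[str], bool]  # hypothetical: True for canonical titles

MISSPELLINGS = {"recieve": "receive", "goverment": "government"}  # examples
LEADING_STOPWORDS = ("the", "a", "an", "of")
X_OF_Y = re.compile(r"^(?P<attr>\w+) of (?P<entity>\w+)$")
POSSESSIVE = re.compile(r"^(?P<entity>\w+)'s (?P<attr>\w+)$")


def normalize_topic_case(query: str) -> str:
    return query.lower()  # Rule 1


def apply_misspelling_map(query: str, probe: Optional[TitleProbe] = None) -> str:
    fixed = []
    for token in query.split():  # Rule 2: token-level substitution
        if probe is not None and probe(token):
            fixed.append(token)  # canonical entity name: leave untouched
        else:
            fixed.append(MISSPELLINGS.get(token, token))
    return " ".join(fixed)


def strip_leading_stopwords(query: str, probe: Optional[TitleProbe] = None) -> str:
    if probe is not None and probe(query):
        return query  # Rule 3 exception: the whole query is a title
    tokens = query.split()
    while len(tokens) > 1 and tokens[0] in LEADING_STOPWORDS:
        tokens.pop(0)
    return " ".join(tokens)


def decompose_x_of_y(query: str) -> Tuple[str, Optional[Dict[str, str]]]:
    match = X_OF_Y.match(query) or POSSESSIVE.match(query)  # Rule 4
    if match is None:
        return query, None
    hint = {"entity": match["entity"], "attribute": match["attr"]}
    return f"{hint['entity']} {hint['attribute']}", hint


def rewrite(query: str, probe: Optional[TitleProbe] = None):
    """Apply the rules in the fixed order 1, 2, 3, 4."""
    query = normalize_topic_case(query)
    query = apply_misspelling_map(query, probe)
    query = strip_leading_stopwords(query, probe)
    return decompose_x_of_y(query)
```

In this sketch rewrite("The Population of Berlin") yields the query "berlin population" plus the hint {"entity": "berlin", "attribute": "population"}, and feeding any rule its own output leaves it unchanged.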

Risk mitigations baked in:

Master kill switch (QueryRewriteConfig.enabled = False) skips all four rules — not just telemetry. Verified by test_query_rewrite_disabled_skips_all_rules.
Title-index probe (when an archive is in scope) suppresses false-positive rewrites of real proper nouns. Bilogy is in the misspelling list but is also a surname; the probe checks for a canonical hit before substituting.
Hard 500-entry cap on the misspellings map; starter file ships with ~40 high-confidence entries and grows reactively from beta-test observations.
Exclusions file (empty seed) lets operators pin specific tokens as "never rewrite."
ReDoS hardening — both regex compile sites use bounded quantifiers ({1,200}) and token-count bounds ((?:\s+\S+){0,8}) so adversarial inputs cannot induce polynomial backtracking. Confirmed by SonarCloud's regex analyzer.
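The bounded-quantifier idea in the last item can be illustrated with a small pattern. The regex below is a hypothetical stand-in rather than the project's actual pattern; it only reuses the two bounds quoted above ({1,200} and (?:\s+\S+){0,8}) to show how they cap both token length and token count.

```python
import re
from typing import Dict, Optional

# Hypothetical X-of-Y pattern: the attribute is one token of at most 200
# characters, and the entity is one such token plus at most 8 more tokens.
X_OF_Y_BOUNDED = re.compile(
    r"^(?P<attr>\S{1,200})\s+of\s+(?P<entity>\S{1,200}(?:\s+\S+){0,8})$"
)


def match_x_of_y(query: str) -> Optional[Dict[str, str]]:
    """Return an entity/attribute hint, or None when the shape doesn't fit."""
    match = X_OF_Y_BOUNDED.match(query.strip())
    if match is None:
        return None
    return {"entity": match["entity"], "attribute": match["attr"]}
```

Because \s and \S never match the same character, the engine has only one way to split the input into tokens, and the {0,8} bound makes over-long inputs fail fast instead of being retried across many partitions.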

Reality-check ripples discovered during integration: Rule 1 unconditionally lowercases every query, which broke code paths that depended on case-preserving inputs. Compensating fixes shipped in the same PR:

_DECOMPOSE_SKIP_ATTRS frozenset prevents Rule 4 from decomposing intent command keywords (structure of X, links of Y, table of contents, etc.).
_RULE3_SECTION_COMMAND_RE prevents Rule 3 from stripping the leading "the" in get_section command phrases ("the evolution section of biology" stays parseable).
_extract_get_zim_entries regex widened from [A-Z]/ to [A-Za-z]/ for lowercase namespace paths (m/image.png instead of M/Image.png).
_looks_like_bare_topic length threshold lowered from 5→2 chars (so post-lowercased dna, pi, ai still qualify); filler-token set expanded with common 2-char prepositions.
_handle_find_by_title D6 redirect switched from isupper() to isalpha(); the dead not title[0].isupper() post-zero-results sub-condition removed.
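The namespace-path fix is easy to see in isolation. The two patterns below are assumptions built around the [A-Z]/ and [A-Za-z]/ fragments quoted above; the actual regex inside _extract_get_zim_entries is not shown in these notes and is likely more involved.

```python
import re
from typing import List, Pattern

# Hypothetical before/after patterns around the quoted character classes.
OLD_ENTRY_PATH = re.compile(r"\b[A-Z]/\S+")
NEW_ENTRY_PATH = re.compile(r"\b[A-Za-z]/\S+")


def extract_entry_paths(query: str, pattern: Pattern[str] = NEW_ENTRY_PATH) -> List[str]:
    """Return namespace-style paths such as m/image.png found in the query."""
    return pattern.findall(query)
```

Once Rule 1 has lowercased the query, the old uppercase-only class finds nothing in "get entries m/image.png", while the widened class still returns the path.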

13 pre-existing test files in the a-series regression suites updated to expect lowercase params (assert params["topic"] == "Berlin" → "berlin"). No tests were deleted or weakened.

What's NOT in this release

Multi-hop questions (what year did the inventor of X die) — deferred to a potential sub-D-3 if live evidence warrants.
HyDE / hypothetical document synthesis — locked-in non-goal.
Algorithmic spell-correction libraries (pyspellchecker, autocorrect, symspellpy) — wrong precision/recall tradeoff for encyclopedia search; the curated map + title-index probe is the right tool here.
Embeddings sidecar (hnswlib, deferred sub-D-4) — gated on live evidence that reranker engagement rate ≥15% AND operator-reported semantic-divergent misses.
Hybrid intent parser (deferred sub-D-3 alternative) — gated on live evidence of ≥5% low-confidence parse_intent calls OR multi-hop transcript failures.

Stats

Tests: 2245 passing, 54 skipped on the no-extras path. ~43 new tests in tests/test_query_rewrite_tier1.py (per-rule fix/no-op/boundary triads, integration, composition, hint handoff). ~25 new tests in tests/ml/ (reranker unit + integration). 13 pre-existing test files updated for the lowercase ripple.
CI: Full matrix passes — ubuntu/macos/windows × Python 3.12/3.13. New test-reranker job runs FastEmbed integration tests on Linux only (cost-controlled).
SonarCloud: Quality Gate OK — 0 open issues, 0 unreviewed hotspots.
Commits since a25: 27 (sub-D-1 squash + sub-D-2 squash; sub-D-1 originally shipped as 23 commits, sub-D-2 as 14).

Install + test

# Base install (sub-D-2 query rewriting only):
uv tool install --force --reinstall openzim-mcp==2.0.0b1

# With reranker (sub-D-1 + sub-D-2):
uv tool install --force --reinstall 'openzim-mcp[reranker]==2.0.0b1'

# Pre-stage the reranker model for offline use (needs network access; only needed if using [reranker]):
openzim-mcp download-models

A two-pass live-MCP test prompt for this build is bundled in the repo at docs/superpowers/specs/2026-05-21-v2-b1-live-test-prompt.md — paste it into a fresh Claude conversation that has openzim-mcp connected to your archive, and it'll walk through both per-rule probes and cross-feature integration with explicit telemetry verification.

About cameronrye/openzim-mcp

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.

Related context

Earlier breaking changes

v2.0.0a15 _attribute_sections falls back to first section when no section brackets located passage
v2.0.0a13 canonical‑splice gate tightened to require exact path equality, fixing H2/H3 surface end‑to‑end behavior across all shapes.
v2.0.0a11 Exposed `content_offset` as top-level `zim_query` parameter, validated >=0, threaded through options.
v2.0.0a10 `get article M/<key>` now returns ZIM metadata entry rather than aliased C-namespace article body.
v2.0.0a10 `metadata for <file>` returns concise metadata strings instead of full article bodies for new-scheme archives.