Release history

cameronrye/openzim-mcp releases

Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and a comprehensive API.

All releases

54 shown

Upgrade now
v2.1.6 Security relevant
Auth Breaking upgrade

pyjwt security bump

No immediate action
v2.1.4 Security relevant

Tail‑hijack fix

No immediate action
v2.1.3 Bug fix

cross‑archive leak fix

No immediate action
v2.1.2 Bug fix

HTTP allowed‑hosts fix

No immediate action
v2.1.1 Bug fix

Empty‑link fix & media exclusion

No immediate action
v2.1.0 New feature

Native libzim reader capabilities

No immediate action
v2.0.5 Maintenance

Routine maintenance and dependency updates.

No immediate action
v2.0.4 Maintenance

Routine maintenance and dependency updates.

No immediate action
v2.0.2 Bug fix

Dispatcher fixes & audit

No immediate action
v2.0.1 Bug fix

Bug fixes + documentation updates

No immediate action
v2.0.0 Maintenance

Stage E results & low dispatch accuracy on new ops

No immediate action
v2.0.0b13 Bug fix

Disambig phrase extension

Review required
v2.0.0b12 New feature
Auth RBAC

Z4 check + disambig rejection

Review required
v2.0.0b9 Mixed
Auth RBAC

Tail‑hijack fix + possessive rule relaxation

Review required
v2.0.0b8 Bug fix
Auth

Possessive redirect fix

Review required
v2.0.0b7 Bug fix
Auth RBAC

Possessive redirect fix + synthesize insert

Upgrade now
v2.0.0b6 Security relevant
Dependencies

starlette CVE patch

Review required
v2.0.0b4 Bug fix
Auth

Fixes possessive auto-fetch bug

No immediate action
v2.0.0b3 Bug fix

Trailing politeness + rerank + possessive + filter fix

Review required
v2.0.0b2 Mixed
Auth RBAC

CLI env fix + OTP cooldowns + timeout raise

Review required
v2.0.0b1 Breaking risk
Auth RBAC

Reranker + query rewrites

Review required
v2.0.0a25 Bug fix
Auth RBAC

Slashed‑compound widening + politeness expansion + param‑leak fixes

Review required
v2.0.0a24 Mixed
Auth RBAC

Query‑param leak + ALL-CAPS acronym fix

No immediate action
v2.0.0a23 Mixed

Multi‑entity parse fix + SMS politeness + drift guard

Review required
v2.0.0a22 Mixed
Auth RCE / SSRF

Multi‑entity chains + politeness strip

No immediate action
v2.0.0a21 Bug fix

Cursor, alias, politeness, docstring, path, error fixes

Review required
v2.0.0a20 Bug fix
Auth RBAC

Cursor guard fixes + Unicode footer

Monitor
v2.0.0a19 Bug fix

Table fallback, footer fix, cursor rejection

Review required
v2.0.0a18 Bug fix
Auth RCE / SSRF

Connector footer + Unicode tokenisation + Cursor ai

No immediate action
v2.0.0a17 Breaking risk

Section routing + empty‑lead fallback

No immediate action
v2.0.0a16 Bug fix

Defect fixes

No immediate action
v2.0.0a15 Bug fix

Citation attribution + bold handling fixes

No immediate action
v2.0.0a14 Breaking risk

Entity resolution + section affinity

No immediate action
v2.0.0a13 Bug fix

Canonical splice fix

Review required
v2.0.0a12 Bug fix
Auth RBAC

France query fix

Review required
v2.0.0a11 Breaking risk

`content_offset` exposure + infobox fixes

Review required
v2.0.0a10 Breaking risk
Auth RBAC

Metadata correctness + cursor validation

Review required
v2.0.0a9 Breaking risk
Auth Breaking upgrade

Cache accounting fixes + search error fix

v2.0.0a8 Security relevant
Security fixes
  • dep: CVE-2026-44431 — fixed by upgrading urllib3 to 2.7.0
  • dep: CVE-2026-44432 — fixed by upgrading urllib3 to 2.7.0
Notable features
  • make security passes --skip-editable to avoid pip-audit failure on local package
Full changelog

Re-cut of v2.0.0a7 — the v2.0.0a7 tag exists but its GitHub Release
failed to publish because pip-audit surfaced two upstream urllib3
CVEs (CVE-2026-44431 / 44432) that landed in the audit database
between the v2.0.0a6 and v2.0.0a7 builds. v2.0.0a8 carries the same
v2.0.0a7 content plus the urllib3 → 2.7.0 bump that closes the CVEs.
Also adjusts make security to pass --skip-editable so pip-audit
doesn't fail looking for the local package on PyPI mid-release.

Defect + opportunity batch on top of v2.0.0a6, found by end-to-end
testing against a real Wikipedia ZIM (118 GB, 27.2M entries,
Feb 2026 snapshot). 14 defects fixed, 8 opportunities added.
1388 tests pass (+13 from new test modules); no regressions.

Fixed — Phase A (snippets, infobox, typo fallback)

  • #14: _typo_variants now reaches "Photosythesis" → "Photosynthesis".
    v2.0.0a4 shipped only transposition + deletion edits — mathematically
    unable to recover the missing 'n' (insertion). Added insertion +
    substitution against the full a-z alphabet, length-gated at ≥ 5 chars
    to bound cost (~700 variants for a 13-char input; ≤ 10 ms/call).
  • #1: snippet highlighter no longer produces malformed markdown.
    _highlight_terms previously wrapped query terms verbatim, producing
    **Artificial **photosynthesis****, _****Berlin****_, and
    [**Photosynthesis**](**Photosynthesis** "**Photosynthesis**") when
    the match landed inside existing bold / italic / link constructs.
    Added a skip regex covering paired emphasis runs and full
    [text](href "tooltip") link constructs (deliberately not bare
    parens, so prose like (also called assimilation) keeps its
    highlighting).
  • #1: snippet fallback to stem-prefix substring match. When no
    whole-word match existed, the snippet used to drop to the lead
    paragraph. Now it falls back to a stem-prefix substring (first ⅔ of
    the query term) so "photosynthesis" catches paragraphs mentioning
    "photosynthetic" instead of returning the article's unrelated lead.
  • Op1: snippets drop the duplicate # <Title> H1. create_snippet
    accepts an optional title=; _get_entry_snippet forwards the
    entry title so the heading that already appears in the result row
    doesn't burn 5–15 tokens per result.
  • #2 / Op5: infobox extraction tracks parent-section context.
    extract_infobox now prefixes labels with their parent
    <th colspan> heading row, so a Berlin infobox renders
    Area — City/State / Population — City/State instead of three
    identical City/State rows. Also skips rows whose nearest table
    ancestor isn't the infobox (handles nested chronology / coords
    microformats) and rejects <th> / <td> candidates borrowed from
    inside nested tables.
  • Op6: strip image-caption / hatnote / sidebar / navbox / inline
    citation noise.
    UNWANTED_HTML_SELECTORS now drops figure,
    figcaption, .thumb, .thumbcaption, .gallery, .hatnote,
    .sidebar, .navbox, .metadata.mbox-small, sup.reference,
    .reference, .mw-collapsible-toggle, and the .geo-* coordinate
    microformats. Article leads now start with the actual prose, not
    Schematic of … For other uses, see X (disambiguation). Part of a series on … 52°31'07"N 13°24'16"E ….
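The edit classes named in the #14 entry (deletion, transposition, insertion, substitution) can be sketched as below. This is a minimal illustration, not the project's _typo_variants: the function name typo_variants, the min_len parameter, and the lowercase-only alphabet are assumptions made for the example.

```python
import string


def typo_variants(word: str, min_len: int = 5) -> set[str]:
    """Return every string exactly one edit away from `word` (hypothetical helper)."""
    if len(word) < min_len:
        # Length gate: skip short inputs to bound cost and noise.
        return set()
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    variants = deletes | transposes | substitutes | inserts
    variants.discard(word)  # substituting a letter with itself is not a variant
    return variants


candidates = typo_variants("photosythesis")
print("photosynthesis" in candidates)
```

Without the insertion class the missing "n" could not be recovered, which is the gap the entry describes. A caller would still need to check each candidate against the title index, since most variants are not real titles.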

Fixed — Phase B (response contract)

  • #3 / Op8: zim_query accepts a cursor parameter. Tools advertised
    opaque base64 cursors in their responses, but the simple-mode
    zim_query tool only took an integer offset — the cursors were
    decorative. Now decoded; s.o populates options["offset"] and the
    per-tool state is preserved. Length-capped at 2 KB as
    defense-in-depth.
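A length-capped opaque cursor could be decoded roughly as follows. The release notes do not specify the cursor's wire format, so the base64url-encoded JSON payload, the "offset" and "tool" field names, and the encode_cursor / decode_cursor helpers are all assumptions made for illustration.

```python
import base64
import binascii
import json

MAX_CURSOR_CHARS = 2048  # the 2 KB cap mentioned above


def encode_cursor(state: dict) -> str:
    """Serialize pagination state into an opaque token (assumed format)."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Decode a cursor, rejecting oversized or malformed input with ValueError."""
    if len(cursor) > MAX_CURSOR_CHARS:
        raise ValueError("cursor too long")
    try:
        raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
        state = json.loads(raw)
    except (binascii.Error, UnicodeError, json.JSONDecodeError) as exc:
        raise ValueError("malformed cursor") from exc
    offset = state.get("offset") if isinstance(state, dict) else None
    if not isinstance(offset, int) or offset < 0:
        raise ValueError("cursor missing a valid offset")
    return state


token = encode_cursor({"offset": 20, "tool": "search"})
print(decode_cursor(token)["offset"])
```

Checking the length before decoding keeps a hostile client from forcing the server to base64-decode and JSON-parse an arbitrarily large string.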

Fixed — Phase C (primitives)

  • #9 / #7: get_section table rendering now matches get_zim_entry.
    The bundle's rendered_markdown was built with compact=False while
    get_zim_entry rendered with compact=True. Result: get_section
    "Geography" returned pipe-soup tables while the surrounding article
    fetch path showed [Table N: M rows x P cols - pass compact=False to
    expand] placeholders. Bundle and search-snippet rendering paths now
    both apply compact=True, so the markdown is consistent everywhere.
  • #10 / D8: synthesize attribution carries the #section_id suffix.
    _locate_passage couldn't find passages containing **bold**
    highlight markers inside the bundle's plain markdown — every citation
    fell back to entry-level (section_id: null). Now strips **
    markers before locating so attribution resolves correctly.
  • #10 / D5: synthesize strips natural-language interrogative prefix.
    synthesize=True with "tell me about Berlin" previously fed the
    entire phrase to BM25 — returning Irving Berlin songs, Nat King Cole
    albums, and a graffiti article instead of the canonical Berlin
    entry. Intent-parses first, hands only the topic to the search
    stage; preserves the original query for response echo.
  • #10 / D8 / Op4: response dedupe + link-strip in compact mode.
    passages[].text_markdown previously duplicated answer_markdown
    verbatim (~50% token bloat on every synthesize call). In compact
    mode, passages now omit the body text. Wikipedia link-soup
    ([text](href "tooltip")) is also stripped from passages — small
    models can't follow inline links from inside tool responses anyway.
  • Op3: get_section supports narrow scoping. New
    include_subsections=False parameter on get_section_data (and the
    narrow section X of Y / just section X of Y query syntax in
    simple mode) ends the slice at the next heading of any level, so a
    caller can fetch just the H2 lead paragraphs without the cascading
    H3 sub-tree.
  • Op2: compact structure response carries per-heading summaries.
    The 80-char summary field is derived from each section's body
    preview so a small model can choose which section to drill into,
    not just see which exist.

Fixed — namespace / metadata / tell me about

  • D2: browse namespace C no longer crashes on new-scheme archives.
    Legacy code built a full 27 M-entry list before slicing 50 rows out
    of it — slow, memory-hostile, and triggered "session expired" errors
    on real Wikipedia archives. New _browse_new_scheme_c_paginated
    pages directly through the entry-id range.
  • D3: browse namespace W returns the actual W entries. New-scheme
    archives keep W off libzim's iterable surface, but the well-known
    paths (W/mainPage, W/favicon, ...) are reachable via
    has_entry_by_path. New _browse_new_scheme_w_paginated probes
    them so the response matches list_namespaces' count.
  • D11: metadata previews cap at 800 chars. Wikipedia ZIMs store
    M/Title as a full HTML document (~1 MB) rather than the bare title
    string. The metadata for <archive> call previously returned 980 KB,
    starving every other metadata field. Each entry is now capped with
    a [truncated, N chars total] marker.
  • D6 / Op7: tell me about <topic> auto-fetches on title-index hit.
    When the top BM25 result wasn't a strong-title match (Xapian ranked
    List of songs about Berlin above the canonical Berlin article),
    the response used to render the search list. Now falls back to
    find_entry_by_title_data; promotes any score-1.0 result past the
    BM25 ranking and inlines the article body.
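The D11 preview cap amounts to a guarded slice plus a marker. A minimal sketch, assuming a hypothetical helper name cap_preview and that the marker is appended after the kept prefix:

```python
PREVIEW_CAP = 800  # characters, per the D11 entry above


def cap_preview(value: str, cap: int = PREVIEW_CAP) -> str:
    """Return `value` unchanged if short, else a prefix plus a truncation marker."""
    if len(value) <= cap:
        return value
    return f"{value[:cap]} [truncated, {len(value)} chars total]"


print(cap_preview("Wikipedia"))
print(len(cap_preview("x" * 1_000_000)))
```

Reporting the original length in the marker lets a caller decide whether fetching the full value is worthwhile.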

CI / quality

  • 3 new test modules, 47 additional assertions covering each fix:
    test_typo_variants_v2a7.py, test_content_processor_fixes_v2a7.py,
    test_v2a7_fixes_helpers.py. End-to-end proof that "Photosythesis"
    resolves through the full call path (mock archive + suggester); perf
    guard against quadratic regressions in _typo_variants; cursor
    garbage-rejection; metadata cap on both long and short values.
  • Goldens regenerated (all strict improvements): pipe-soup infobox
    snippet → clean lead-paragraph snippet for Einstein; H1 dedup +
    section attribution on the Berlin / Munich synthesize fixtures.
  • Test infra: explicit encoding="utf-8" on golden read/write so
    non-ASCII characters in goldens survive Windows runners.
  • SonarCloud quality gate: factored shared test setup
    (_make_simple_handler, _build_metadata_mock_archive,
    _wire_typo_fallback_archive) and namespace browse-payload shape
    (_new_scheme_browse_payload, _materialise_paths) so new-code
    duplication stays under 3%.
v2.0.0a4 New feature
Notable features
  • `get_section` tool returns a specific markdown section with full metadata and optional truncation handling
  • `zim_query(synthesize=True)` mode performs pure retrieval‑plus‑concatenation synthesis without LLM generation, including passage extraction, citation attribution, budget enforcement, and structured response
Full changelog

v2 Phase C, part 2: completes the retrieval-primitives phase. Adds the
get_section tool (#7) and the zim_query(synthesize=True) mode (#10)
on top of the EntryBundle infrastructure that shipped in v2.0.0a3.
No wire-format breaks — both new surfaces are additive.

#7 — New tool get_section

get_section(zim_file_path, entry_path, section_id, *, max_chars=None)
  → Union[GetSectionResponse, ToolErrorPayload]

Returns a single section's body (~500-1500 tokens — small-model sweet
spot per parent-document-retrieval research) plus full metadata.
section_id values come from get_table_of_contents
(TocHeading.section_id). On miss, returns
tool_error("section_not_found", extras={"available_section_ids": [...]})
so the model can self-correct.

The data layer slices EntryBundle.rendered_markdown[char_start:char_end]
where the bundle's section ranges include subsections (a parent heading's
char_end extends to the next heading at the same or higher level).
Parent sections therefore return the full subtree body. max_chars
truncates the body and sets truncated=True plus _meta.truncated=True
in the envelope for budget-aware clients.
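The slicing rule described above (a parent section runs until the next heading at the same or a higher level) can be sketched as follows. The Heading dataclass, section_range, and slice_section are hypothetical stand-ins for the EntryBundle internals, which the notes do not show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heading:
    section_id: str
    level: int       # 1 for H1, 2 for H2, ...
    char_start: int  # offset of the heading within the rendered markdown


def section_range(headings: list[Heading], section_id: str, doc_len: int) -> tuple[int, int]:
    """Return (char_start, char_end) for a section, subsections included."""
    for i, heading in enumerate(headings):
        if heading.section_id != section_id:
            continue
        for later in headings[i + 1:]:
            if later.level <= heading.level:  # same or higher level ends the section
                return heading.char_start, later.char_start
        return heading.char_start, doc_len
    raise KeyError(section_id)


def slice_section(markdown: str, headings: list[Heading], section_id: str,
                  max_chars: Optional[int] = None) -> dict:
    """Slice one section body and flag truncation when max_chars is exceeded."""
    start, end = section_range(headings, section_id, len(markdown))
    body = markdown[start:end]
    truncated = max_chars is not None and len(body) > max_chars
    if truncated:
        body = body[:max_chars]
    return {"body": body, "truncated": truncated}


md = "## Geography\nintro\n### Climate\nmild\n## History\nold\n"
toc = [
    Heading("geography", 2, md.index("## Geography")),
    Heading("climate", 3, md.index("### Climate")),
    Heading("history", 2, md.index("## History")),
]
print(slice_section(md, toc, "geography")["body"])
```

In this sketch, requesting "geography" returns the Climate subsection as well, while requesting "climate" stops at the History heading.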

#10 — New zim_query(synthesize=True) mode

{
    "query": str,
    "answer_markdown": str,        # passages + inline [cite: ...] markers
    "passages": list[SynthesizePassage],
    "citations": list[Citation],
    "archives_searched": list[str],
    "fallback_used": Literal["xapian_score", "rrf_fusion", "reranker"],
    "total_chars": int,
    "total_words": int,
    "_meta": MetaEnvelope,
}

Pure retrieval + concatenation; no LLM generation. The seven-stage
pipeline (in openzim_mcp/synthesize.py):

  1. Per-archive search — Xapian top-K hits (search_top_k helper
    on ZimOperations).
  2. RRF fusion — Reciprocal Rank Fusion (k=60) when multiple archives
    are searched; identity passthrough for single-archive
    (fallback_used="xapian_score" vs "rrf_fusion").
  3. Identity rerank — placeholder for Phase D's cross-encoder.
  4. Passage extraction — libzim snippets rendered to markdown.
  5. Section attribution — best-effort lookup via EntryBundle;
    passages get cite_id = "{archive}/{entry_path}#{section_id}"
    when the snippet text is found in a section's char range. Bundle
    build failures keep the cite_id at entry level.
  6. Budget enforcement: output_char_budget truncates the last
    passage; subsequent passages are dropped.
  7. Render + citations — passages joined with \n\n and inline
    [cite: ...] markers; structured Citation list deduplicated by
    cite_id.

Zero hits returns an empty response with _meta.reason="0_hits".

Other

  • Extended tool_error() with an extras: Optional[Dict[str, Any]]
    kwarg so error payloads can carry self-correction hints (e.g. the
    available_section_ids list above) without # type: ignore at
    call sites.
  • New tests: tests/test_get_section.py (4),
    tests/test_synthesize.py (~20 unit + 3 end-to-end),
    tests/test_golden_v2_phase_c.py (3 get_section + 3 synthesize
    snapshots, deterministic via the new v2_phase_c_zim heading-rich
    fixture). test_response_contract exempts both new tools from the
    list-pagination contract while still asserting _meta is present.
  • The Phase A _meta envelope continues to attach on every response.
    _meta.truncated is now correctly forwarded by get_section_data
    on truncation (was a hidden gap in earlier scaffolding).
v2.0.0a1 New feature
⚠ Upgrade required
  • Added `tiktoken>=0.7.0` to core dependencies; ensure the environment provides this version.
  • Compact‑mode empty‑result prose now uses a new footer + suggestions format; existing integrations expecting the old paragraph output must handle the structured `_meta.suggestions` field.
Notable features
  • Every dict‑returning tool now includes a `_meta` envelope with token/character estimates, truncation flags, total characters, suggestions, and reason.
  • Compact‑mode responses add a markdown blockquote footer showing token count (configurable via `OPENZIM_MCP_META__FOOTER_ENABLED`).
  • In compact mode, `.infobox` / `.vcard` tables are emitted as Markdown key‑value lists; large tables are replaced with `[Table N: ...]` placeholders.
Full changelog

First v2 pre-release. Phase A of the multi-phase v2 effort. All changes additive at the tool-signature layer; small compact-mode prose change for empty search results (see Changed below).

Added

  • meta: every dict-returning tool now includes a _meta envelope (tokens_est, chars, truncated, more_at_offset, total_chars, suggestions, reason). tokens_est uses tiktoken cl100k_base with a 5% pad. (#5)
  • simple: compact-mode responses gain a one-line markdown blockquote footer (> ~4.2K tokens · ...). Set OPENZIM_MCP_META__FOOTER_ENABLED=false to suppress. (#5)
  • content: in compact mode, .infobox / .vcard tables emit a Markdown KV list prepended to the body. (#2)
  • content: in compact mode, tables exceeding row or character thresholds are replaced with [Table N: ...] placeholders. (#2)
  • search: every search response is query-aware — snippets contain the actual matched passage (with **bold** highlights, capped at 5 hits) rather than the article lead. (#1)
  • search: _meta.suggestions[] surfaces typo variants (alt_spelling) and other-archive candidates (alt_archive) for empty / low-confidence searches. (#4)
  • search: find_entry_by_title fuzzy fallback now triggers whenever no result clears 0.7 (previously only on zero hits). Score and length-gate are configurable via OPENZIM_MCP_SEARCH__FUZZY_TITLE_*. (#14)

Changed

  • simple: compact-mode empty-result prose now renders via the new footer + structured suggestions instead of the v1.2.0 paragraph. The information is one-for-one; the format is more model-readable. compact=False paths retain byte-identical v1.2.0 behavior. (#4)
  • search: find_entry_by_title typo-corrected hits now score 0.85 (was hardcoded 0.7) by default. (#14)

Dependencies

  • Added tiktoken>=0.7.0 to core dependencies.
v1.3.0 New feature
Notable features
  • Refinements and production‑readiness improvements
Full changelog

1.3.0 (2026-05-08)

Features

  • v1.2.0 follow-up — refinements + production-readiness improvements (#106) (e9396ec)
v1.2.0 New feature
Notable features
  • simple-mode enhancements: tell_me_about command, larger code snippets, compact pagination
Full changelog

1.2.0 (2026-05-06)

Features

  • http: operator-acknowledged auth bypass + rate-limit env-var docs (#104) (7294b1d)
  • v1.2.0 simple-mode tool ergonomics — tell_me_about, bigger snippets, compact pagination (#103) (212a60a)
v1.1.2 Bug fix

Fixed server CORS origins not being mirrored to the SDK.

Full changelog

1.1.2 (2026-05-05)

Bug Fixes

  • server: mirror cors_origins into SDK transport allowed_origins (#100) (96001d1)
v1.1.1 Bug fix

Fixed bugs in walk_namespace, related-articles, and confidence beta-refinement.

Full changelog

1.1.1 (2026-05-05)

Bug Fixes

  • walk_namespace, related-articles, and confidence beta-refinement fixes (#98) (912d346)
v1.1.0 New feature
Notable features
  • Tool responses use MCP structured content (no more double-stringified JSON)
Full changelog

1.1.0 (2026-05-05)

Features

  • tool responses use MCP structured content (no more double-stringified JSON) (#96) (5b541ec)

Bug Fixes

  • http: allow MCP-Protocol-Version header and DELETE method in CORS (#93) (dbb791e)
  • namespace, pagination, resources, and find-by-title beta-test fixes (#92) (4b572ef)
  • server: make simple mode actually expose only zim_query (#94) (92c725f)
v1.0.1 Bug fix

Fixed HTTP Host header handling to honor an operator‑configured allowlist.

Full changelog

Bug Fixes

  • http: allow operator-configured Host allowlist (#90) (c4dad8a)
v1.0.0 Breaking risk
Security fixes
  • Redact absolute paths from MCP error responses (CVE not specified)
Notable features
  • Streamable HTTP transport with bearer-token auth, CORS allow-list, and /healthz and /readyz endpoints
  • Multi-stage, multi-arch Docker image (linux/amd64, linux/arm64) running as non‑root with built‑in health check
Full changelog

Includes an end-to-end review pass before tagging — security hardening, correctness fixes, performance work, and a refactor that splits zim_operations.py into a zim/ package via mixin classes. See sections below.

Features

  • http: streamable HTTP transport with bearer-token auth, CORS allow-list, and /healthz and /readyz endpoints
  • http: safe-default startup check refuses to bind a non-localhost host without an auth token
  • transport: legacy SSE transport (--transport sse) for clients that haven't migrated to streamable-HTTP; bound to localhost only, no auth/CORS middleware
  • docker: multi-stage, multi-arch (linux/amd64, linux/arm64) image published to ghcr.io/cameronrye/openzim-mcp, runs as non-root with a built-in health check
  • content: get_zim_entries batch tool — fetch up to 50 entries in one call, with per-entry success/error reporting
  • resources: per-entry zim://{name}/entry/{path} resource serves entries with their native MIME type (clients must URL-encode / as %2F in the path segment)
  • subscriptions: clients can subscribe to zim://files and zim://{name}; mtime-polling watcher emits notifications/resources/updated when allowed directories or .zim files change
  • search: opaque cursor parameter on search_zim_file for resumable pagination
  • simple: intent pattern routes batch retrieval queries to get_zim_entries

Improvements

  • content: get_related_articles resolves relative hrefs against the source entry's directory and detects the content namespace correctly on domain-scheme archives (previously returned nothing)
  • content: suggestion fallback uses SuggestionSearcher(archive).suggest(text) (the prior archive.suggest() call did not exist)
  • tools: list_zim_files accepts a case-insensitive name_filter substring argument; one shared cache slot regardless of filter value
  • content: get_zim_entries accepts bare entry-path strings paired with a zim_file_path default (dicts still work for multi-archive batches)
  • content: heading-id resolution falls through id → mw-headline anchor → preceding <a name=""> → slug, returning (id, source) so consumers can distinguish real anchors from synthetic slugs
  • content: summary extraction skips USWDS banners and skip-nav blocks above the first <h1> (MedlinePlus / NIH / NIST style sites)
  • content: link extraction drops non-navigable schemes (javascript:, mailto:, tel:, data:, blob:, vbscript:)
  • server: __version__ reads from importlib.metadata; serverInfo.version reports openzim-mcp's actual version (no longer the FastMCP SDK default)

Removed

  • tools: advanced-mode tool surface drops 27 → 21. Removed: warm_cache, cache_stats, cache_clear, get_random_entry, diagnose_server_state, resolve_server_conflicts. The cache itself remains; the explicit management tools were dropped.
  • instance: multi-instance conflict tracking removed; instance_tracker.py deleted. HTTP server instances coexist freely.

Bug Fixes

  • content: sanitize per-entry paths in get_zim_entries and expand test coverage
  • resources: per-entry zim:// returns libzim's native MIME type
  • http: start subscription watcher via wrapped lifespan
  • instance: relax conflict logic for HTTP transport so multiple HTTP server instances can coexist

Security

  • errors: redact absolute paths from MCP error responses (rejected traversals previously leaked the canonical allowed-directory layout)
  • errors: regex-based path redaction with cross-platform separator handling and tightened lookbehind so wrapped/quoted paths ((/opt/foo), "/opt/bar") no longer slip through
  • diagnostics: redact filesystem paths and PIDs in get_server_health / get_server_configuration responses (no longer transport-gated; always redacted)
  • resources: sanitize URI-decoded entry paths before passing to libzim
  • search: always sanitize zim_file_path in find_entry_by_title (previously skipped when cross_file=True)
  • prompts: strip control characters and cap user-supplied arguments before interpolating into MCP prompt bodies; re-check empty after sanitization to avoid empty ('', ...) tool calls
  • http: require auth on OPTIONS /mcp (the unconditional preflight bypass let unauthenticated callers probe the endpoint)
  • http: resolve localhost before classifying as loopback; warn and fall through to the public-host path when /etc/hosts maps it elsewhere
  • rate-limit: make global + per-operation acquire atomic; concurrent callers no longer transiently over-consume the global bucket
  • rate-limit: per-client buckets with LRU eviction (10k cap) — infrastructure ready for HTTP context wiring

Correctness

  • search: reject mismatched cursor and query arguments instead of silently applying the cursor's offsets to a different query
  • cache: stop caching error sentinels and zero-result responses (previously a transient libzim error or index warmup poisoned the cache for the full TTL); audit follow-up extends the gate to get_search_suggestions, get_entry_summary, get_table_of_contents
  • cache: treat empty-string cache values as hits, not misses
  • content: resolve redirects to their target before rendering; cache the resolved path so subsequent lookups skip the chain; reject redirect cycles and chains deeper than MAX_REDIRECT_DEPTH = 10
  • content: instantiate html2text.HTML2Text per call to eliminate a shared-state race that corrupted concurrent conversions
  • content: preserve Unicode in heading slugs (Arabic, Chinese, Cyrillic, Japanese ZIMs no longer produce empty TOC anchors); disambiguate duplicate heading slugs with _2, _3 suffixes
  • content: drop trailing punctuation from path tokens extracted by the simple-tools get_zim_entries parser
  • simple-tools: dispatch the get_zim_entries intent (was silently falling through to search_zim_file); honor explicit zim_file_path for walk_namespace, find_by_title, and related intents
  • subscriptions: detect same-size ZIM replacements via mtime change (size-only detection silently missed identical-size replacements)
  • validation: browse_namespace and walk_namespace parameter checks now raise OpenZimMcpValidationError instead of OpenZimMcpArchiveError or markdown error strings; bound walk_namespace limit to [1, 500] per the documented contract
  • validation: validate get_zim_entries batch size before charging rate-limit so an oversized batch doesn't increment the limiter
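The heading-slug entry above (Unicode preserved, duplicates suffixed with _2, _3) could look roughly like this. The function name slugify_headings, the NFKC normalization step, the hyphen separator, and the "section" fallback are assumptions for illustration, not the project's actual rules.

```python
import re
import unicodedata


def slugify_headings(titles: list[str]) -> list[str]:
    """Build unique, Unicode-preserving slugs for a list of heading titles."""
    seen: dict[str, int] = {}
    slugs: list[str] = []
    for title in titles:
        text = unicodedata.normalize("NFKC", title).strip().lower()
        # \w is Unicode-aware for str patterns, so non-Latin letters survive.
        slug = re.sub(r"[^\w]+", "-", text).strip("-") or "section"
        seen[slug] = seen.get(slug, 0) + 1
        slugs.append(slug if seen[slug] == 1 else f"{slug}_{seen[slug]}")
    return slugs


print(slugify_headings(["History", "Географія", "歴史", "History", "History"]))
```

An ASCII-only pattern such as [^a-z0-9]+ would reduce the Cyrillic and Japanese headings to empty strings, which is the empty-anchor failure the entry describes.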

Performance

  • search: skip-counter pagination in _perform_filtered_search (offset=900, limit=10 went from ~1000 backend calls to ~10)
  • content: get_entries groups by ZIM file and opens each archive once
  • navigation: cache namespace listings per (archive, namespace); pagination now slices from cache instead of re-scanning
  • search: hoist Searcher construction in _find_entry_by_search (up to 5 Xapian opens collapse to 1)
  • suggestions: Strategy 2 uses libzim's SuggestionSearcher instead of a strided ID scan that skipped 95% of entries on large archives
  • subscriptions: SubscriberRegistry is set-backed for O(1) subscribe/unsubscribe/clear; broadcast fans out concurrently with per-call wait_for timeout so one slow subscriber doesn't stall the watcher

Refactoring

  • zim: split zim_operations.py (3557 → 39 lines, pure shim) into a zim/ package with _SearchMixin, _ContentMixin, _StructureMixin, _NamespaceMixin. Public API preserved via re-exports
  • simple-tools: extract IntentParser into intent_parser.py (parsing logic now unit-testable without ZimOperations mocks)
  • config: unify RateLimitConfig into a single Pydantic BaseModel; per_operation_limits is now reachable from environment variables and JSON config
  • defaults: default cache persistence_path to ~/.cache/openzim-mcp (absolute) rather than .openzim_mcp_cache (relative to CWD)
  • defaults: relocate MAX_REDIRECT_DEPTH and SUBSCRIPTION_SEND_SECONDS to defaults.py (matches existing project pattern)
  • resources: offload blocking list_zim_files_data directory scan via asyncio.to_thread
  • resources: extract _resolve_zim_name helper, replacing duplicated inline ZIM-name match loops
  • simple-tools: intent confidence boost capped (low-priority intents with extracted params can no longer overtake higher-priority param-less intents)
  • prompts: dedupe ask-for-args message into a _ask_for_args(prompt_name) helper

Hardening (other)

  • cache: validate values are JSON-serializable at write time when persistence is enabled (previously default=str silently coerced non-JSON types)
  • security: add an unconditional .. pattern to path normalization so embedded foo..bar traversal candidates trigger the regex layer
  • exceptions: drop details from Exception.args so it no longer leaks into repr() and tracebacks
  • main: route startup banner through the logger (now respects OPENZIM_MCP_LOGGING__LEVEL)
  • simple-tools: consistently append low-confidence note across all intents (was missing on search_all, walk_namespace, find_by_title, related)

Pre-release fix-up

Final bug-sweep passes after the main review work above. Categorised by area for easier scanning.

  • content/structure: _resolve_entry_with_fallback and get_binary_entry now follow the redirect chain (bounded by the shared MAX_REDIRECT_DEPTH = 10 cap with cycle detection) before calling entry.get_item(). Without this the structure, links, TOC, summary, and binary-entry tools all crashed with RuntimeError from libzim whenever the requested path was a redirect to the canonical article (the common case for Kiwix-generated ZIMs)
  • content: _get_main_page_content resolves archive.main_entry and the fallback main_page_paths entries before calling get_item(). Most ZIMs point W/mainPage at the real article via a redirect; previously this raised on every such archive
  • content: get_zim_metadata resolves redirect entries before reading metadata content
  • content: get_related_articles preserves trailing slash in path resolution and resolves relative links against the post-redirect path
  • zim: _resolve_link_to_entry_path rejects self-referential refs that previously fed back into the resolver
  • search: _perform_filtered_search canonicalises lowercase / long-form namespace input so filters stop silently dropping every result; suggestion cache now skips zero-result responses
  • search: search_all validates effective_limit is in the documented 1-50 range
  • simple-tools: get_article intent forwards options[content_offset] so simple-mode pagination works (previously always returned page 1); passthrough intents forward options[limit] / options[offset]
  • subscriptions: broadcast_resource_updated re-raises CancelledError that gather(return_exceptions=True) had silently collected, so stop() no longer hangs until the next sleep tick
  • subscriptions: MtimeWatcher.start() offloads initial _scan via asyncio.to_thread to match _tick, no longer blocking the ASGI lifespan on slow filesystems
  • subscriptions: mtime scan offloaded to thread; fan-out cleanup guarded against late exceptions
  • prompts: switch user-input interpolation delimiter to backticks and preserve quotes in user input
  • rate-limit: add missing RATE_LIMIT_COSTS keys for find_entry_by_title, get_zim_entries, get_related_articles (were silently using the cost=1 default)
  • http: add Mcp-Session-Id to CORS allow_headers and expose_headers so browser MCP clients can resume sessions
  • main: catch pydantic.ValidationError from OpenZimMcpConfig construction and re-surface as OpenZimMcpConfigurationError so operators see a clean message instead of a pydantic validation dump
  • cache: suppress shutdown logging spam; tolerate malformed persisted entries
  • security: symlink-tighten archive scan; harden error context; sanitise name_filter; reject whitespace-only CORS wildcard
  • tools: get_binary_entry docstring example uses keyword include_data=False (positional False was landing in max_size_bytes)
  • packaging: Development Status :: 5 - Production/Stable classifier for the 1.0.0 release
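
The CancelledError fix for broadcast_resource_updated listed above can be illustrated with a minimal sketch. The names `fan_out`, `ok_subscriber`, and `cancelled_subscriber` are hypothetical and are not the project's code. The sketch only shows that `asyncio.gather(return_exceptions=True)` hands cancellation back as an ordinary result value, so the caller has to re-raise it explicitly.

```python
import asyncio


async def ok_subscriber() -> str:
    return "delivered"


async def cancelled_subscriber() -> str:
    # Simulates a subscriber whose notification was cancelled mid-flight.
    raise asyncio.CancelledError()


async def fan_out(subscribers) -> list:
    """Hypothetical fan-out: notify everyone, then surface cancellation."""
    results = await asyncio.gather(
        *(sub() for sub in subscribers), return_exceptions=True
    )
    for result in results:
        # With return_exceptions=True, CancelledError arrives as a plain
        # list item, so it must be re-raised by hand.
        if isinstance(result, asyncio.CancelledError):
            raise result
    return results


async def demo() -> str:
    try:
        await fan_out([ok_subscriber, cancelled_subscriber])
    except asyncio.CancelledError:
        return "cancellation propagated"
    return "cancellation swallowed"


outcome = asyncio.run(demo())
```

Without the `isinstance` check, `fan_out` would return normally and a surrounding `stop()` routine would have no way to notice that it had been cancelled.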

Final pre-release sweep

  • resource: ZimEntryResource.read and the zim://files / zim://{name} resource handlers now offload archive opens via asyncio.to_thread; previously a single read stalled the HTTP/SSE event loop for every other concurrent client
  • resource: ZimEntryResource.read resolves redirect chains (with cycle detection and the shared MAX_REDIRECT_DEPTH = 10 cap) before entry.get_item(); previously every redirect-stub path crashed with RuntimeError from libzim
  • content: get_zim_entries (batch) replaces manual __enter__/__exit__ with a regular with block — cleaner cleanup on BaseException, no silent swallowing of __exit__ errors
  • content: drop _get_main_page_content's archive._get_entry_by_id(0) fallback (libzim private API; entry-zero is not the spec's main-page pointer); the inline redirect helper now uses MAX_REDIRECT_DEPTH and raises OpenZimMcpArchiveError on cycles or chain exhaustion to match the rest of the redirect helpers
  • server: OpenZimMcpServer.run() defaults to self.config.transport (translating the short name 'http' to FastMCP's 'streamable-http') and rejects an explicit transport= argument that contradicts the configured value — closes the gap where HTTP-mode subscriptions could be wired while a stdio transport was actually started
  • search/structure: find_entry_by_title, search_all, and get_related_articles raise OpenZimMcpValidationError on out-of-range limit / limit_per_file instead of returning a hand-formatted markdown string, so the tool layer sees a consistent exception shape
  • http: _is_loopback_host adds a 1-second timeout around socket.gethostbyname("localhost") so a slow resolver can't hang server startup
  • ci: drop pull_request_target trigger from test.yml / codeql.yml / performance.yml (closes the pwn-request gap where untrusted PR code could exfiltrate secrets); release-please prerelease detection reads the resolved tag name (works for workflow_dispatch); release-please bootstrap-sha placeholders removed; Dockerfile uv image pinned to 0.11
  • make: make benchmark selects via -k benchmark (the previously referenced tests/test_benchmarks.py does not exist); make security no longer swallows bandit / pip-audit non-zero exits, so make check (used by release.yml) actually fails on findings
  • docs: OPENZIM_MCP_TOOL_MODE, _TRANSPORT, _HOST, _PORT, _AUTH_TOKEN, _CORS_ORIGINS, _WATCH_INTERVAL_SECONDS, _SUBSCRIPTIONS_ENABLED documented in the README configuration table; install commands aligned across README / website/llms.txt / website/index.html (lead with uv tool install openzim-mcp); website/llm.txt renamed to website/llms.txt (matches the llmstxt.org convention) and advertised in the sitemap
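
The redirect handling described in the resource and content items above (cycle detection plus the shared MAX_REDIRECT_DEPTH = 10 cap) might look roughly like the following. This is a sketch under stated assumptions: a plain dict stands in for the archive's redirect entries, and `RedirectError` and `resolve_redirects` are hypothetical names, not the project's helpers.

```python
MAX_REDIRECT_DEPTH = 10  # cap named in the release notes


class RedirectError(Exception):
    """Hypothetical stand-in for OpenZimMcpArchiveError."""


def resolve_redirects(redirects: dict, path: str) -> str:
    """Follow redirect stubs until a real entry path is reached.

    `redirects` maps a redirect-stub path to its target; any path that is
    not a key is treated as a real (non-redirect) entry.
    """
    seen = {path}
    hops = 0
    while path in redirects:
        if hops == MAX_REDIRECT_DEPTH:
            raise RedirectError("redirect chain exceeds MAX_REDIRECT_DEPTH")
        path = redirects[path]
        hops += 1
        if path in seen:
            raise RedirectError(f"redirect cycle detected at {path!r}")
        seen.add(path)
    return path


resolved = resolve_redirects({"A/USA": "A/US", "A/US": "A/United_States"}, "A/USA")
```

Resolving before reading the item means a redirect stub never reaches the content read, and both failure modes surface as one exception type.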
v0.9.0 New feature
Security fixes
  • Resolved dependency CVEs: pyjwt upgraded to 2.12.0, requests to 2.33.0, pytest to 9.0.3, python-multipart to 0.0.26, python-dotenv to 1.2.2
Notable features
  • Multi-archive search (`search_all`) queries every allowed ZIM file simultaneously and merges results
  • Three new MCP prompts: `/research`, `/summarize`, `/explore` for workflow automation
  • Eight new tools in total, including power-user tools such as `walk_namespace`, `warm_cache`, and `get_related_articles`, plus two MCP resources (`zim://files`, `zim://{name}`)
Full changelog

Ships 8 new tools, 3 MCP prompts, 2 MCP resources, 6 reliability fixes, and refreshed user-facing docs.

Stats: 34 commits • 56 files • +6,711 / -1,423 • 715 tests passing (was 669 on main, +46 new) • coverage 78%

Headline features

Multi-archive search

search_all queries every ZIM file in your allowed directories at once and merges the results, so you don't need to know which archive holds the answer. Files that can't be searched are skipped without aborting the rest.
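
The skip-and-continue behaviour can be sketched as follows. The function `search_all_sketch`, the `search_one` callback, and the result shape are assumptions made for illustration. The loop here is sequential for simplicity and does not reflect how the project schedules the per-archive searches.

```python
from typing import Callable


def search_all_sketch(
    archives: list,
    search_one: Callable[[str, str], list],
    query: str,
) -> dict:
    """Search every archive, collecting hits and recording failures."""
    merged, skipped = [], []
    for archive in archives:
        try:
            hits = search_one(archive, query)
        except Exception as exc:
            # One unsearchable archive must not abort the whole request.
            skipped.append({"archive": archive, "reason": str(exc)})
            continue
        merged.extend({"archive": archive, "hit": hit} for hit in hits)
    return {"results": merged, "skipped": skipped}


def fake_search(archive: str, query: str) -> list:
    # Hypothetical per-archive search used only for this demonstration.
    if archive == "broken.zim":
        raise OSError("no full-text index")
    return [f"{query} in {archive}"]


report = search_all_sketch(
    ["wiki.zim", "broken.zim", "docs.zim"], fake_search, "python"
)
```

Reporting the skipped archives alongside the results lets a caller see that coverage was partial instead of silently trusting an incomplete answer.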

MCP Prompts (first use of the primitive)

Three pre-built workflows are available as slash commands in MCP-aware clients (Claude Code, Inspector):

  • /research <topic> — search across archives, then drill into top hits
  • /summarize <zim_file_path> <entry_path> — TOC + summary + key links
  • /explore <zim_file_path> — high-level briefing of a ZIM's contents

All three guard against empty arguments and ask the user for missing input rather than rendering a malformed workflow.

Find entries by title

find_entry_by_title resolves titles (or partial titles) to entry paths via libzim's title-indexed suggestion search, with a fast direct-path probe for exact matches. Optionally cross-file.

Power-user tools

  • walk_namespace — deterministic cursor-paginated namespace iteration (vs. browse_namespace which samples)
  • warm_cache — pre-populate cache for a ZIM file before a long session
  • get_random_entry — sample one random article (composes with /explore)
  • get_related_articles — link-graph nearest neighbours (outbound, inbound, or both, with cursor-based pagination)
  • cache_stats / cache_clear — inspect and manage the in-memory cache
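
Cursor-based pagination, as used by walk_namespace and get_related_articles, generally works by returning an opaque token that encodes where the next page starts. The release notes do not describe the project's cursor format, so the base64-encoded JSON offset below is purely an assumption, and `walk_page` is a hypothetical helper.

```python
import base64
import json
from typing import Optional


def encode_cursor(offset: int) -> str:
    """Pack the next offset into an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"o": offset}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["o"]


def walk_page(entries: list, cursor: Optional[str] = None, limit: int = 2):
    """Return one page plus the cursor for the next page (None when done)."""
    start = decode_cursor(cursor) if cursor else 0
    page = entries[start:start + limit]
    next_offset = start + len(page)
    next_cursor = encode_cursor(next_offset) if next_offset < len(entries) else None
    return page, next_cursor


entries = ["A/Alpha", "A/Beta", "A/Gamma", "A/Delta", "A/Epsilon"]
collected, cursor = [], None
while True:
    page, cursor = walk_page(entries, cursor, limit=2)
    collected.extend(page)
    if cursor is None:
        break
```

Because each page starts exactly where the previous one ended, repeated calls visit every entry once and in order, which is the property that distinguishes a deterministic walk from sampling.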

MCP Resources (first use of the primitive)

Your client's resource browser and @-mention picker now see ZIM files directly:

  • zim://files — index of all available ZIM files
  • zim://{name} — overview of one ZIM (metadata, namespace summary, main page preview)

Reliability fixes

  • Namespace listing now does deterministic known-prefix probes for minority namespaces (M, W, X, I) that random sampling silently missed on archives where one namespace dominated. Below 3 sampled hits per namespace, we report a lower-bound count instead of fabricating numbers from the sampling ratio.
  • Search filtering uses a streaming scan (BATCH_SIZE=500, MAX_SCAN=10000) instead of a hard 1000-hit cap, so rare-mime-type filters (e.g. image/png on an HTML-dominant corpus) now return matches that were previously hidden.
  • Error routing matches by message pattern first, so "entry not found" no longer gets "check disk space" advice from the broad OpenZimMcpArchiveError template.
  • Phantom server-instance conflicts are no longer reported — the conflict detector re-verifies process liveness right before raising and auto-unregisters phantoms.
  • max_content_length floor relaxed from 1000 → 100 to support short-preview use cases.
  • get_article_structure / get_table_of_contents docstrings now warn that "mini" / "nopic" flavour ZIMs strip subsection headings, so callers know low heading counts aren't necessarily a bug.
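
The streaming scan from the search-filtering fix above can be sketched like this. BATCH_SIZE and MAX_SCAN use the values quoted in the notes; everything else (`filtered_search`, the dict-shaped hits, the `keep` predicate) is a hypothetical stand-in for the project's internals.

```python
from itertools import islice
from typing import Callable, Iterable

BATCH_SIZE = 500
MAX_SCAN = 10_000


def filtered_search(
    hits: Iterable, keep: Callable[[dict], bool], limit: int
) -> list:
    """Scan hits in batches until `limit` matches or MAX_SCAN hits are seen."""
    matches = []
    scanned = 0
    iterator = iter(hits)
    while scanned < MAX_SCAN and len(matches) < limit:
        batch = list(islice(iterator, min(BATCH_SIZE, MAX_SCAN - scanned)))
        if not batch:
            break  # underlying result set is exhausted
        scanned += len(batch)
        for hit in batch:
            if keep(hit):
                matches.append(hit)
                if len(matches) == limit:
                    break
    return matches


# An HTML-dominant corpus with two rare PNG entries beyond the first 1000 hits.
corpus = [{"path": f"A/{i}", "mime": "text/html"} for i in range(5000)]
corpus[1500]["mime"] = "image/png"
corpus[4200]["mime"] = "image/png"

found = filtered_search(corpus, lambda h: h["mime"] == "image/png", limit=5)
```

In this example a hard cap at the first 1000 hits would return nothing, while the batched scan finds both PNG entries and still stops after at most MAX_SCAN hits.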

Simple-mode patterns

The natural-language Simple-mode tool now routes 8 new intent patterns to the new tools (search_all, walk_namespace, warm_cache, find_by_title, random_entry, related, cache_stats, cache_clear). The runtime-visible zim_query docstring lists all 26 patterns.

Documentation

  • README: new "What's new in v0.9.0" section, refreshed banner, tool count updated 18 → 26, added MCP prompts/resources section.
  • Website: hero banner refreshed, version stat updated, three new NEW-badged feature cards (multi-archive search, MCP prompts, find by title); displaced cards (article summaries, TOC, binary content) demoted to non-NEW.
  • Simple Mode Guide: tool count updated 18 → 26.
  • Spec & plan committed under docs/superpowers/ for traceability.

Testing

  • 715 tests passing (+46 new tests across 9 new test files)
  • make lint / make security (bandit) all clean
  • Coverage holds at 78% (baseline was 78%; new tests focus on validation/contract paths per project convention)
  • Manual smoke test of new tools and prompts via MCP Inspector recommended before merge

Out of scope (deferred)

Per the v0.9.0 spec, intentionally deferred to a future release:

  • Per-entry MCP resources (zim://{name}/entry/A/Article) — FastMCP URI templates don't handle literal / in path captures cleanly
  • HTTP/SSE/streamable-http transport — there is a contributed fork implementing this; planned for a focused 0.9.1 point release
  • Resource subscriptions
  • Batch get_entries

Known follow-ups

Tracked from the final code review pass; non-blocking but worth filing:

  • get_related_articles inbound mode opens N+1 archives per call (each candidate triggers a separate extract_article_links call). Cache hits hide this on warm runs, but if a network mount disappears mid-call, every subsequent candidate is silently skipped. Worth refactoring extract_article_links to accept an already-open archive.
  • search_all returns markdown text inside JSON rather than structured per-result data. This is consistent with how search_zim_file already works, but it is a candidate for a future contract cleanup.
  • ~~5 pre-existing dependency CVEs flagged by pip-audit~~ — resolved: pyjwt 2.10.1→2.12.0 (#65), requests 2.32.5→2.33.0 (#66), pytest 9.0.2→9.0.3 (#75), python-multipart 0.0.22→0.0.26 (#76), python-dotenv 1.2.1→1.2.2 (#77) all merged into main and rebased into this branch.
  • 5 newly-disclosed CVEs found by post-rebase pip-audit (cryptography 46.0.5, pygments 2.19.2, black 26.1.0, pip 26.0). All covered by existing Dependabot PRs (#64, #72, #74) — candidate for a 0.9.1 follow-up.
  • Wiki content (wiki-content/*.md) reflects pre-0.9 surface; out of scope for this PR.

Versioning

  • pyproject.toml, .release-please-manifest.json, openzim_mcp/__init__.py all bumped to 0.9.0
  • uv.lock regenerated
  • CHANGELOG will be auto-generated by release-please from the conventional commits on this branch
v0.8.3 Bug fix
Security fixes
  • Resolved GitHub code scanning alerts #133 (security.py) and #134 (test_main.py)
Full changelog

Bug Fixes

  • fix logo URL in README.md to use absolute GitHub raw URL for PyPI display (README.md)
  • resolve GitHub code scanning alert #133 - variable defined multiple times in security.py (security.py)
  • resolve GitHub code scanning alert #134 - mixed import styles in test_main.py (test_main.py)
  • remove unused contextlib import from security.py (flake8 fix)
v0.8.2 Bug fix

Fixed search pagination when offset exceeds total results.

Full changelog

Bug Fixes

  • fix search pagination when offset exceeds total results (zim_operations.py)
  • improve exception handling in instance tracker for Python 3 compatibility (instance_tracker.py)
  • add fallback to stderr for logging during shutdown (instance_tracker.py)
  • improve Windows process checking with debug logging (instance_tracker.py)
  • fix release workflow to skip automatic GitHub release creation (release-please.yml)
  • resolve linting issues in simple_tools.py and content_tools.py
v0.8.1 New feature
Notable features
  • Article summaries via get_entry_summary
  • Table of contents extraction via get_table_of_contents
  • Token‑based pagination cursors
Full changelog

0.8.1 (2026-01-29)

Features

  • add article summaries, table of contents, and pagination cursors

Bug Fixes

  • remove unused imports in test files for CI linting
  • resolve GitHub code scanning alerts

Details

  • Article Summaries (get_entry_summary): Extract concise article summaries from opening paragraphs
  • Table of Contents Extraction (get_table_of_contents): Build hierarchical TOC from article headings
  • Pagination Cursors: Token-based pagination for easier result navigation
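
Building a hierarchical TOC from article headings can be sketched with the standard library alone. This is an illustration under assumptions, not the implementation behind get_table_of_contents: `HeadingCollector` and `build_toc` are hypothetical names, and the nested-dict output shape is invented for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1..h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None


def build_toc(html: str) -> list:
    """Nest each heading under the nearest preceding heading of lower level."""
    collector = HeadingCollector()
    collector.feed(html)
    root = {"level": 0, "children": []}
    stack = [root]
    for level, text in collector.headings:
        node = {"level": level, "text": text, "children": []}
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


toc = build_toc(
    "<h1>Python</h1><p>Intro</p><h2>History</h2>"
    "<h3>Early <i>years</i></h3><h2>Syntax</h2>"
)
```

The stack keeps the chain of currently open ancestors, so a heading that skips a level (for example an h3 directly under an h1) still attaches to the closest shallower heading.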

Enhanced

  • Intent Parsing: Improved multi-match resolution with weighted scoring
  • Simple Mode: Added natural language support for new features
v0.7.1 Maintenance

Minor fixes and improvements.

Full changelog

0.7.1 (2026-01-28)

Bug Fixes

  • ci: handle existing GitHub releases in release workflow (#54) (63afa3d)
v0.7.0 New feature
Notable features
  • Added binary content retrieval for PDFs, images, and media files
Full changelog

Features

  • add binary content retrieval for PDFs, images, and media files (#52) (95611c9)
