Release history
cameronrye/openzim-mcp releases
Modern, secure MCP server for accessing ZIM format knowledge bases offline. Enables AI models to search and navigate Wikipedia, educational content, and other compressed knowledge archives with smart retrieval, caching, and comprehensive API.
Slashed‑compound widening + politeness expansion + param‑leak fixes
Connector footer + Unicode tokenisation + Cursor ai
Cache accounting fixes + search error fix
- dep: CVE-2026-44431 — fixed by upgrading urllib3 to 2.7.0
- dep: CVE-2026-44432 — fixed by upgrading urllib3 to 2.7.0
- make security passes --skip-editable to avoid pip-audit failure on local package
Full changelog
Re-cut of v2.0.0a7 — the v2.0.0a7 tag exists but its GitHub Release
failed to publish because pip-audit surfaced two upstream urllib3
CVEs (CVE-2026-44431 / 44432) that landed in the audit database
between the v2.0.0a6 and v2.0.0a7 builds. v2.0.0a8 carries the same
v2.0.0a7 content plus the urllib3 → 2.7.0 bump that closes the CVEs.
Also adjusts make security to pass --skip-editable so pip-audit
doesn't fail looking for the local package on PyPI mid-release.
Defect + opportunity batch on top of v2.0.0a6, found by end-to-end
testing against a real Wikipedia ZIM (118 GB, 27.2M entries,
Feb 2026 snapshot). 14 defects fixed, 8 opportunities added.
1388 tests pass (+13 from new test modules); no regressions.
Fixed — Phase A (snippets, infobox, typo fallback)
- #14: `_typo_variants` now reaches `"Photosythesis"` → `"Photosynthesis"`. v2.0.0a4 shipped only transposition + deletion edits, which are mathematically unable to recover the missing `'n'` (an insertion). Added insertion + substitution against the full a-z alphabet, length-gated at ≥ 5 chars to bound cost (~700 variants for a 13-char input; ≤ 10 ms/call).
- #1: snippet highlighter no longer produces malformed markdown. `_highlight_terms` previously wrapped query terms verbatim, producing `**Artificial **photosynthesis****`, `_****Berlin****_`, and `[**Photosynthesis**](**Photosynthesis** "**Photosynthesis**")` when the match landed inside existing bold / italic / link constructs. Added a skip regex covering paired emphasis runs and full `[text](href "tooltip")` link constructs (deliberately not bare parens, so prose like `(also called assimilation)` keeps its highlighting).
- #1: snippet fallback to stem-prefix substring match. When no whole-word match existed, the snippet used to drop to the lead paragraph. Now it falls back to a stem-prefix substring (first ⅔ of the query term) so `"photosynthesis"` catches paragraphs mentioning `"photosynthetic"` instead of returning the article's unrelated lead.
- Op1: snippets drop the duplicate `# <Title>` H1. `create_snippet` accepts an optional `title=`; `_get_entry_snippet` forwards the entry title so the heading that already appears in the result row doesn't burn 5–15 tokens per result.
- #2 / Op5: infobox extraction tracks parent-section context. `extract_infobox` now prefixes labels with their parent `<th colspan>` heading row, so a Berlin infobox renders `Area — City/State` / `Population — City/State` instead of three identical `City/State` rows. Also skips rows whose nearest table ancestor isn't the infobox (handles nested chronology / coords microformats) and rejects `<th>` / `<td>` candidates borrowed from inside nested tables.
- Op6: strip image-caption / hatnote / sidebar / navbox / inline citation noise. `UNWANTED_HTML_SELECTORS` now drops `figure`, `figcaption`, `.thumb`, `.thumbcaption`, `.gallery`, `.hatnote`, `.sidebar`, `.navbox`, `.metadata.mbox-small`, `sup.reference`, `.reference`, `.mw-collapsible-toggle`, and the `.geo-*` coordinate microformats. Article leads now start with the actual prose, not `Schematic of … For other uses, see X (disambiguation). Part of a series on … 52°31'07"N 13°24'16"E …`.
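The four edit classes behind the #14 fix can be sketched as below. This is a minimal illustration, not the project's `_typo_variants`: the function name `typo_variants`, the constant name, and the choice to lowercase the input are assumptions. Only the edit classes and the ≥ 5 character gate on insertion + substitution come from the notes above.

```python
import string

# Length gate from the notes; the constant name is hypothetical.
MIN_LENGTH_FOR_WIDE_EDITS = 5


def typo_variants(term: str) -> set[str]:
    """Return every string one edit away from ``term`` (illustrative helper)."""
    word = term.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    variants: set[str] = set()

    # Cheap edits, always generated: deletion and adjacent transposition.
    for left, right in splits:
        if right:
            variants.add(left + right[1:])
        if len(right) > 1:
            variants.add(left + right[1] + right[0] + right[2:])

    # Alphabet-wide edits, gated on length to bound the variant count.
    if len(word) >= MIN_LENGTH_FOR_WIDE_EDITS:
        for left, right in splits:
            for ch in string.ascii_lowercase:
                variants.add(left + ch + right)  # insertion
                if right:
                    variants.add(left + ch + right[1:])  # substitution

    variants.discard(word)
    return variants
```

For a 13-character input this produces at most 13 deletions + 12 transpositions + 364 insertions + 325 substitutions = 714 candidates before deduplication, which lines up with the "~700 variants" figure quoted above.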
Fixed — Phase B (response contract)
- #3 / Op8: `zim_query` accepts a `cursor` parameter. Tools advertised opaque base64 cursors in their responses, but the simple-mode `zim_query` tool only took an integer `offset`, so the cursors were decorative. Now decoded; `s.o` populates `options["offset"]` and the per-tool state is preserved. Length-capped at 2 KB as defense-in-depth.
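A rough sketch of what decoding a length-capped opaque cursor could look like. The payload layout (`{"s": {"o": <offset>}}` as URL-safe base64 JSON) is an assumption inferred from the `s.o` wording above, and `encode_cursor` / `decode_cursor` are hypothetical names rather than the project's own helpers.

```python
import base64
import json

MAX_CURSOR_CHARS = 2048  # the 2 KB cap mentioned in the notes


def encode_cursor(offset: int) -> str:
    """Build a cursor in the assumed layout (for illustration only)."""
    payload = json.dumps({"s": {"o": offset}}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Turn a cursor into tool options; raise ValueError on any garbage."""
    if len(cursor) > MAX_CURSOR_CHARS:
        raise ValueError("cursor exceeds length cap")
    try:
        raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
        payload = json.loads(raw)
    except ValueError as exc:
        # binascii.Error, UnicodeError and JSONDecodeError all subclass ValueError.
        raise ValueError("malformed cursor") from exc

    state = payload.get("s") if isinstance(payload, dict) else None
    offset = state.get("o") if isinstance(state, dict) else None
    if isinstance(offset, bool) or not isinstance(offset, int) or offset < 0:
        raise ValueError("cursor carries no usable offset")
    return {"offset": offset}
```

Checking the length before decoding means an oversized input is rejected without ever being parsed.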
Fixed — Phase C (primitives)
- #9 / #7: `get_section` table rendering now matches `get_zim_entry`. The bundle's `rendered_markdown` was built with `compact=False` while `get_zim_entry` rendered with `compact=True`. Result: `get_section "Geography"` returned pipe-soup tables while the surrounding article fetch path showed `[Table N: M rows x P cols - pass compact=False to expand]` placeholders. Bundle and search-snippet rendering paths now both apply `compact=True`, so the markdown is consistent everywhere.
- #10 / D8: synthesize attribution carries the `#section_id` suffix. `_locate_passage` couldn't find passages containing `**bold**` highlight markers inside the bundle's plain markdown, so every citation fell back to entry-level (`section_id: null`). Now strips `**` markers before locating so attribution resolves correctly.
- #10 / D5: synthesize strips natural-language interrogative prefix. `synthesize=True` with `"tell me about Berlin"` previously fed the entire phrase to BM25, returning Irving Berlin songs, Nat King Cole albums, and a graffiti article instead of the canonical Berlin entry. Intent-parses first, hands only the topic to the search stage; preserves the original query for response echo.
- #10 / D8 / Op4: response dedupe + link-strip in compact mode. `passages[].text_markdown` previously duplicated `answer_markdown` verbatim (~50% token bloat on every synthesize call). In compact mode, passages now omit the body text. Wikipedia link-soup (`[text](href "tooltip")`) is also stripped from passages, since small models can't follow inline links from inside tool responses anyway.
- Op3: `get_section` supports narrow scoping. New `include_subsections=False` parameter on `get_section_data` (and the `narrow section X of Y` / `just section X of Y` query syntax in simple mode) ends the slice at the next heading of any level, so a caller can fetch just the H2 lead paragraphs without the cascading H3 sub-tree.
- Op2: compact structure response carries per-heading summaries. The 80-char `summary` field is derived from each section's body preview so a small model can choose which section to drill into, not just see which exist.
Fixed — namespace / metadata / tell me about
- D2: `browse namespace C` no longer crashes on new-scheme archives. Legacy code built a full 27 M-entry list before slicing 50 rows out of it: slow, memory-hostile, and a trigger for "session expired" errors on real Wikipedia archives. New `_browse_new_scheme_c_paginated` pages directly through the entry-id range.
- D3: `browse namespace W` returns the actual W entries. New-scheme archives keep W off libzim's iterable surface, but the well-known paths (`W/mainPage`, `W/favicon`, ...) are reachable via `has_entry_by_path`. New `_browse_new_scheme_w_paginated` probes them so the response matches `list_namespaces`' count.
- D11: metadata previews cap at 800 chars. Wikipedia ZIMs store `M/Title` as a full HTML document (~1 MB) rather than the bare title string. The `metadata for <archive>` call previously returned 980 KB, starving every other metadata field. Each entry is now capped with a `[truncated, N chars total]` marker.
- D6 / Op7: `tell me about <topic>` auto-fetches on title-index hit. When the top BM25 result wasn't a strong-title match (Xapian ranked `List of songs about Berlin` above the canonical `Berlin` article), the response used to render the search list. Now falls back to `find_entry_by_title_data`; promotes any score-1.0 result past the BM25 ranking and inlines the article body.
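The D11 metadata cap is easy to picture with a small helper. This is a sketch under assumptions: the name `cap_preview` and the single space before the marker are illustrative. Only the 800-character limit and the `[truncated, N chars total]` marker text come from the notes.

```python
PREVIEW_CAP = 800  # per-entry cap from the D11 note


def cap_preview(value: str, cap: int = PREVIEW_CAP) -> str:
    """Return ``value`` unchanged if short, else a capped preview plus marker."""
    if len(value) <= cap:
        return value
    return f"{value[:cap]} [truncated, {len(value)} chars total]"
```

Reporting the original length in the marker lets a caller judge whether fetching the full value is worthwhile.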
CI / quality
- 3 new test modules, 47 additional assertions covering each fix: `test_typo_variants_v2a7.py`, `test_content_processor_fixes_v2a7.py`, `test_v2a7_fixes_helpers.py`. End-to-end proof that `"Photosythesis"` resolves through the full call path (mock archive + suggester); perf guard against quadratic regressions in `_typo_variants`; cursor garbage-rejection; metadata cap on both long and short values.
- Goldens regenerated (all strict improvements): pipe-soup infobox snippet → clean lead-paragraph snippet for Einstein; H1 dedup + section attribution on the Berlin / Munich synthesize fixtures.
- Test infra: explicit `encoding="utf-8"` on golden read/write so non-ASCII characters in goldens survive Windows runners.
- SonarCloud quality gate: factored shared test setup (`_make_simple_handler`, `_build_metadata_mock_archive`, `_wire_typo_fallback_archive`) and namespace browse-payload shape (`_new_scheme_browse_payload`, `_materialise_paths`) so new-code duplication stays under 3%.
- `get_section` tool returns a specific markdown section with full metadata and optional truncation handling
- `zim_query(synthesize=True)` mode performs pure retrieval‑plus‑concatenation synthesis without LLM generation, including passage extraction, citation attribution, budget enforcement, and structured response
Full changelog
v2 Phase C, part 2: completes the retrieval-primitives phase. Adds the
get_section tool (#7) and the zim_query(synthesize=True) mode (#10)
on top of the EntryBundle infrastructure that shipped in v2.0.0a3.
No wire-format breaks — both new surfaces are additive.
#7 — New tool get_section
get_section(zim_file_path, entry_path, section_id, *, max_chars=None)
→ Union[GetSectionResponse, ToolErrorPayload]
Returns a single section's body (~500-1500 tokens — small-model sweet
spot per parent-document-retrieval research) plus full metadata.
section_id values come from get_table_of_contents
(TocHeading.section_id). On miss, returns
tool_error("section_not_found", extras={"available_section_ids": [...] }) so the model can self-correct.
The data layer slices EntryBundle.rendered_markdown[char_start:char_end]
where the bundle's section ranges include subsections (a parent heading's
char_end extends to the next heading at the same or higher level).
Parent sections therefore return the full subtree body. max_chars
truncates the body and sets truncated=True plus _meta.truncated=True
in the envelope for budget-aware clients.
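The range rule described above (a section ends at the next heading of the same or a higher level) can be sketched as follows. The `Heading` dataclass, `section_ranges`, and `slice_section` are hypothetical stand-ins; the real `EntryBundle` and response types are not shown in these notes, so the returned dict shapes here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heading:
    level: int        # 2 for H2, 3 for H3, ...
    section_id: str
    char_start: int   # offset of the heading in the rendered markdown


def section_ranges(headings: list[Heading], total_chars: int) -> dict[str, tuple[int, int]]:
    """Map each section_id to (char_start, char_end), subsections included."""
    ranges = {}
    for i, heading in enumerate(headings):
        end = total_chars
        for later in headings[i + 1:]:
            if later.level <= heading.level:  # same or higher level closes it
                end = later.char_start
                break
        ranges[heading.section_id] = (heading.char_start, end)
    return ranges


def slice_section(markdown: str, headings: list[Heading], section_id: str,
                  max_chars: Optional[int] = None) -> dict:
    ranges = section_ranges(headings, len(markdown))
    if section_id not in ranges:
        # Mirrors the self-correction hint described above.
        return {"error": "section_not_found",
                "available_section_ids": list(ranges)}
    start, end = ranges[section_id]
    body = markdown[start:end]
    truncated = max_chars is not None and len(body) > max_chars
    return {"body": body[:max_chars] if truncated else body,
            "truncated": truncated}
```

With this rule a parent H2 returns its H3 children as part of the body, while an H3 stops at the next H3 or H2.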
#10 — New zim_query(synthesize=True) mode
{
"query": str,
"answer_markdown": str, # passages + inline [cite: ...] markers
"passages": list[SynthesizePassage],
"citations": list[Citation],
"archives_searched": list[str],
"fallback_used": Literal["xapian_score", "rrf_fusion", "reranker"],
"total_chars": int,
"total_words": int,
"_meta": MetaEnvelope,
}
Pure retrieval + concatenation; no LLM generation. The seven-stage
pipeline (in openzim_mcp/synthesize.py):
- Per-archive search: Xapian top-K hits (`search_top_k` helper on `ZimOperations`).
- RRF fusion: Reciprocal Rank Fusion (k=60) when multiple archives are searched; identity passthrough for single-archive (`fallback_used="xapian_score"` vs `"rrf_fusion"`).
- Identity rerank: placeholder for Phase D's cross-encoder.
- Passage extraction: libzim snippets rendered to markdown.
- Section attribution: best-effort lookup via `EntryBundle`; passages get `cite_id = "{archive}/{entry_path}#{section_id}"` when the snippet text is found in a section's char range. Bundle build failures keep the cite_id at entry level.
- Budget enforcement: `output_char_budget` truncates the last passage; subsequent passages are dropped.
- Render + citations: passages joined with `\n\n` and inline `[cite: ...]` markers; structured `Citation` list deduplicated by `cite_id`.
Zero hits returns an empty response with _meta.reason="0_hits".
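For readers unfamiliar with the fusion stage, here is a minimal sketch of Reciprocal Rank Fusion with k=60. It is the textbook formulation (each document scores the sum of 1 / (k + rank) over the lists it appears in), not a copy of `openzim_mcp/synthesize.py`. The function name `rrf_fuse` and the use of plain string ids are assumptions; in a multi-archive setting the ids would presumably be archive-qualified.

```python
from collections import defaultdict

RRF_K = 60  # the k value quoted in the pipeline description


def rrf_fuse(ranked_lists: list[list[str]], k: int = RRF_K) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (doc_id, score)."""
    scores: dict[str, float] = defaultdict(float)
    for hits in ranked_lists:
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A single input list keeps its original order, which matches the identity passthrough described for the single-archive case.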
Other
- Extended `tool_error()` with an `extras: Optional[Dict[str, Any]]` kwarg so error payloads can carry self-correction hints (e.g. the `available_section_ids` list above) without `# type: ignore` at call sites.
- New tests: `tests/test_get_section.py` (4), `tests/test_synthesize.py` (~20 unit + 3 end-to-end), `tests/test_golden_v2_phase_c.py` (3 `get_section` + 3 `synthesize` snapshots, deterministic via the new `v2_phase_c_zim` heading-rich fixture). `test_response_contract` exempts both new tools from the list-pagination contract while still asserting `_meta` is present.
- The Phase A `_meta` envelope continues to attach on every response. `_meta.truncated` is now correctly forwarded by `get_section_data` on truncation (was a hidden gap in earlier scaffolding).
- Added `tiktoken>=0.7.0` to core dependencies; ensure the environment provides this version.
- Compact‑mode empty‑result prose now uses a new footer + suggestions format; existing integrations expecting the old paragraph output must handle the structured `_meta.suggestions` field.
- Every dict‑returning tool now includes a `_meta` envelope with token/character estimates, truncation flags, total characters, suggestions, and reason.
- Compact‑mode responses add a markdown blockquote footer showing token count (configurable via `OPENZIM_MCP_META__FOOTER_ENABLED`).
- In compact mode, `.infobox` / `.vcard` tables are emitted as Markdown key‑value lists; large tables are replaced with `[Table N: ...]` placeholders.
Full changelog
First v2 pre-release. Phase A of the multi-phase v2 effort. All changes additive at the tool-signature layer; small compact-mode prose change for empty search results (see Changed below).
Added
- meta: every dict-returning tool now includes a `_meta` envelope (`tokens_est`, `chars`, `truncated`, `more_at_offset`, `total_chars`, `suggestions`, `reason`). `tokens_est` uses tiktoken `cl100k_base` with a 5% pad. (#5)
- simple: compact-mode responses gain a one-line markdown blockquote footer (`> ~4.2K tokens · ...`). Set `OPENZIM_MCP_META__FOOTER_ENABLED=false` to suppress. (#5)
- content: in compact mode, `.infobox` / `.vcard` tables emit a Markdown KV list prepended to the body. (#2)
- content: in compact mode, tables exceeding row or character thresholds are replaced with `[Table N: ...]` placeholders. (#2)
- search: every search response is query-aware: snippets contain the actual matched passage (with `**bold**` highlights, capped at 5 hits) rather than the article lead. (#1)
- search: `_meta.suggestions[]` surfaces typo variants (`alt_spelling`) and other-archive candidates (`alt_archive`) for empty / low-confidence searches. (#4)
- search: `find_entry_by_title` fuzzy fallback now triggers whenever no result clears 0.7 (previously only on zero hits). Score and length-gate are configurable via `OPENZIM_MCP_SEARCH__FUZZY_TITLE_*`. (#14)
Changed
- simple: compact-mode empty-result prose now renders via the new footer + structured suggestions instead of the v1.2.0 paragraph. The information is one-for-one; the format is more model-readable. `compact=False` paths retain byte-identical v1.2.0 behavior. (#4)
- search: `find_entry_by_title` typo-corrected hits now score `0.85` (was hardcoded `0.7`) by default. (#14)
Dependencies
- Added `tiktoken>=0.7.0` to core dependencies.
- Refinements and production‑readiness improvements
Full changelog
- simple-mode enhancements: tell_me_about command, larger code snippets, compact pagination
Full changelog
Fixed server CORS origins not being mirrored to the SDK.
Full changelog
Fixed bugs in walk_namespace, related-articles, and confidence beta-refinement.
Full changelog
- Tool responses use MCP structured content (no more double-stringified JSON)
Full changelog
Fixed http host header processing to honor an operator‑configured allowlist.
- Redact absolute paths from MCP error responses (CVE not specified)
- Streamable HTTP transport with bearer-token auth, CORS allow-list, and `/healthz` / `/readyz` endpoints
- Multi-stage, multi-arch Docker image (linux/amd64, linux/arm64) running as non‑root with built‑in health check
Full changelog
Includes an end-to-end review pass before tagging — security hardening, correctness fixes, performance work, and a refactor that splits zim_operations.py into a zim/ package via mixin classes. See sections below.
Features
- http: streamable HTTP transport with bearer-token auth, CORS allow-list, and `/healthz` / `/readyz` endpoints
- http: safe-default startup check refuses to bind a non-localhost host without an auth token
- transport: legacy SSE transport (`--transport sse`) for clients that haven't migrated to streamable-HTTP; bound to localhost only, no auth/CORS middleware
- docker: multi-stage, multi-arch (`linux/amd64`, `linux/arm64`) image published to `ghcr.io/cameronrye/openzim-mcp`, runs as non-root with a built-in health check
- content: `get_zim_entries` batch tool: fetch up to 50 entries in one call, with per-entry success/error reporting
- resources: per-entry `zim://{name}/entry/{path}` resource serves entries with their native MIME type (clients must URL-encode `/` as `%2F` in the path segment)
- subscriptions: clients can subscribe to `zim://files` and `zim://{name}`; mtime-polling watcher emits `notifications/resources/updated` when allowed directories or `.zim` files change
- search: opaque `cursor` parameter on `search_zim_file` for resumable pagination
- simple: intent pattern routes batch retrieval queries to `get_zim_entries`
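The `%2F` requirement on per-entry resource URIs is the kind of detail clients tend to trip over, so here is a small client-side sketch. The helper name `entry_resource_uri` is hypothetical, and the archive name is passed through untouched because the notes only specify encoding for the path segment.

```python
from urllib.parse import quote, unquote


def entry_resource_uri(archive_name: str, entry_path: str) -> str:
    """Build a zim://{name}/entry/{path} URI with the path fully percent-encoded."""
    # safe="" forces "/" to be encoded as %2F instead of being left literal.
    return f"zim://{archive_name}/entry/{quote(entry_path, safe='')}"


uri = entry_resource_uri("wikipedia", "A/Berlin")
# The encoded segment decodes back to the original entry path.
assert unquote(uri.rsplit("/", 1)[1]) == "A/Berlin"
```

Note that `quote` leaves `/` unescaped by default, which is why `safe=""` is needed here.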
Improvements
- content: `get_related_articles` resolves relative hrefs against the source entry's directory and detects the content namespace correctly on domain-scheme archives (previously returned nothing)
- content: suggestion fallback uses `SuggestionSearcher(archive).suggest(text)` (the prior `archive.suggest()` call did not exist)
- tools: `list_zim_files` accepts a case-insensitive `name_filter` substring argument; one shared cache slot regardless of filter value
- content: `get_zim_entries` accepts bare entry-path strings paired with a `zim_file_path` default (dicts still work for multi-archive batches)
- content: heading-id resolution falls through `id` → mw-headline anchor → preceding `<a name="">` → slug, returning `(id, source)` so consumers can distinguish real anchors from synthetic slugs
- content: summary extraction skips USWDS banners and skip-nav blocks above the first `<h1>` (MedlinePlus / NIH / NIST style sites)
- content: link extraction drops non-navigable schemes (`javascript:`, `mailto:`, `tel:`, `data:`, `blob:`, `vbscript:`)
- server: `__version__` reads from `importlib.metadata`; `serverInfo.version` reports openzim-mcp's actual version (no longer the FastMCP SDK default)
Removed
- tools: advanced-mode tool surface drops 27 → 21. Removed: `warm_cache`, `cache_stats`, `cache_clear`, `get_random_entry`, `diagnose_server_state`, `resolve_server_conflicts`. The cache itself remains; the explicit management tools were dropped.
- instance: multi-instance conflict tracking removed; `instance_tracker.py` deleted. HTTP server instances coexist freely.
Bug Fixes
- content: sanitize per-entry paths in `get_zim_entries` and expand test coverage
- resources: per-entry `zim://` returns libzim's native MIME type
- http: start subscription watcher via wrapped lifespan
- instance: relax conflict logic for HTTP transport so multiple HTTP server instances can coexist
Security
- errors: redact absolute paths from MCP error responses (rejected traversals previously leaked the canonical allowed-directory layout)
- errors: regex-based path redaction with cross-platform separator handling and tightened lookbehind so wrapped/quoted paths (`(/opt/foo)`, `"/opt/bar"`) no longer slip through
- diagnostics: redact filesystem paths and PIDs in `get_server_health` / `get_server_configuration` responses (no longer transport-gated; always redacted)
- resources: sanitize URI-decoded entry paths before passing to libzim
- search: always sanitize `zim_file_path` in `find_entry_by_title` (previously skipped when `cross_file=True`)
- prompts: strip control characters and cap user-supplied arguments before interpolating into MCP prompt bodies; re-check empty after sanitization to avoid empty `('', ...)` tool calls
- http: require auth on `OPTIONS /mcp` (the unconditional preflight bypass let unauthenticated callers probe the endpoint)
- http: resolve `localhost` before classifying as loopback; warn and fall through to the public-host path when `/etc/hosts` maps it elsewhere
- rate-limit: make global + per-operation acquire atomic; concurrent callers no longer transiently over-consume the global bucket
- rate-limit: per-client buckets with LRU eviction (10k cap); infrastructure ready for HTTP context wiring
Correctness
- search: reject mismatched `cursor` and `query` arguments instead of silently applying the cursor's offsets to a different query
- cache: stop caching error sentinels and zero-result responses (previously a transient libzim error or index warmup poisoned the cache for the full TTL); audit follow-up extends the gate to `get_search_suggestions`, `get_entry_summary`, `get_table_of_contents`
- cache: treat empty-string cache values as hits, not misses
- content: resolve redirects to their target before rendering; cache the resolved path so subsequent lookups skip the chain; reject redirect cycles and chains deeper than `MAX_REDIRECT_DEPTH = 10`
- content: instantiate `html2text.HTML2Text` per call to eliminate a shared-state race that corrupted concurrent conversions
- content: preserve Unicode in heading slugs (Arabic, Chinese, Cyrillic, Japanese ZIMs no longer produce empty TOC anchors); disambiguate duplicate heading slugs with `_2`, `_3` suffixes
- content: drop trailing punctuation from path tokens extracted by the simple-tools `get_zim_entries` parser
- simple-tools: dispatch the `get_zim_entries` intent (was silently falling through to `search_zim_file`); honor explicit `zim_file_path` for `walk_namespace`, `find_by_title`, and `related` intents
- subscriptions: detect same-size ZIM replacements via mtime change (size-only detection silently missed identical-size replacements)
- validation: `browse_namespace` and `walk_namespace` parameter checks now raise `OpenZimMcpValidationError` instead of `OpenZimMcpArchiveError` or markdown error strings; bound `walk_namespace` `limit` to `[1, 500]` per the documented contract
- validation: validate `get_zim_entries` batch size before charging rate-limit so an oversized batch doesn't increment the limiter
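Bounded redirect resolution with cycle detection recurs throughout these notes, so a compact sketch may help. A plain dict stands in for the archive's redirect entries here. `resolve_redirects` and `RedirectError` are hypothetical names, and only the depth cap of 10 and the cycle rejection come from the changelog.

```python
MAX_REDIRECT_DEPTH = 10  # cap named in the notes


class RedirectError(Exception):
    """Raised on redirect cycles or chains deeper than the cap."""


def resolve_redirects(redirects: dict[str, str], path: str,
                      max_depth: int = MAX_REDIRECT_DEPTH) -> str:
    """Follow ``path`` through ``redirects`` until a non-redirect entry is reached."""
    seen = {path}
    current = path
    hops = 0
    while current in redirects:
        if hops >= max_depth:
            raise RedirectError(f"redirect chain deeper than {max_depth}")
        current = redirects[current]
        if current in seen:
            raise RedirectError(f"redirect cycle at {current!r}")
        seen.add(current)
        hops += 1
    return current
```

Tracking visited paths catches short cycles immediately, while the hop counter bounds long but acyclic chains.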
Performance
- search: skip-counter pagination in `_perform_filtered_search` (offset=900, limit=10 went from ~1000 backend calls to ~10)
- content: `get_entries` groups by ZIM file and opens each archive once
- navigation: cache namespace listings per `(archive, namespace)`; pagination now slices from cache instead of re-scanning
- search: hoist `Searcher` construction in `_find_entry_by_search` (up to 5 Xapian opens collapse to 1)
- suggestions: Strategy 2 uses libzim's `SuggestionSearcher` instead of a strided ID scan that skipped 95% of entries on large archives
- subscriptions: `SubscriberRegistry` is set-backed for O(1) subscribe/unsubscribe/clear; broadcast fans out concurrently with per-call `wait_for` timeout so one slow subscriber doesn't stall the watcher
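The concurrent fan-out with a per-call timeout can be approximated as follows. This is a sketch, not the project's `SubscriberRegistry`: subscribers are modelled as bare async callables in a set, and the function name `broadcast` and its return value (the number of successful deliveries) are assumptions.

```python
import asyncio
from typing import Awaitable, Callable

Subscriber = Callable[[str], Awaitable[None]]


async def broadcast(subscribers: set[Subscriber], uri: str,
                    timeout: float = 1.0) -> int:
    """Notify every subscriber concurrently; return how many finished in time."""

    async def send_one(send: Subscriber) -> bool:
        try:
            # Each delivery gets its own deadline, so one slow
            # subscriber cannot delay the others.
            await asyncio.wait_for(send(uri), timeout)
            return True
        except asyncio.TimeoutError:
            return False

    results = await asyncio.gather(*(send_one(s) for s in subscribers))
    return sum(results)
```

Because the deliveries run under `asyncio.gather`, total latency is bounded by roughly one timeout rather than one timeout per subscriber.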
Refactoring
- zim: split `zim_operations.py` (3557 → 39 lines, pure shim) into a `zim/` package with `_SearchMixin`, `_ContentMixin`, `_StructureMixin`, `_NamespaceMixin`. Public API preserved via re-exports
- simple-tools: extract `IntentParser` into `intent_parser.py` (parsing logic now unit-testable without `ZimOperations` mocks)
- config: unify `RateLimitConfig` into a single Pydantic `BaseModel`; `per_operation_limits` is now reachable from environment variables and JSON config
- defaults: default cache `persistence_path` to `~/.cache/openzim-mcp` (absolute) rather than `.openzim_mcp_cache` (relative to CWD)
- defaults: relocate `MAX_REDIRECT_DEPTH` and `SUBSCRIPTION_SEND_SECONDS` to `defaults.py` (matches existing project pattern)
- resources: offload blocking `list_zim_files_data` directory scan via `asyncio.to_thread`
- resources: extract `_resolve_zim_name` helper, replacing duplicated inline ZIM-name match loops
- simple-tools: intent confidence boost capped (low-priority intents with extracted params can no longer overtake higher-priority param-less intents)
- prompts: dedupe ask-for-args message into a `_ask_for_args(prompt_name)` helper
Hardening (other)
- cache: validate values are JSON-serializable at write time when persistence is enabled (previously `default=str` silently coerced non-JSON types)
- security: add an unconditional `..` pattern to path normalization so embedded `foo..bar` traversal candidates trigger the regex layer
- exceptions: drop `details` from `Exception.args` so it no longer leaks into `repr()` and tracebacks
- main: route startup banner through the logger (now respects `OPENZIM_MCP_LOGGING__LEVEL`)
- simple-tools: consistently append low-confidence note across all intents (was missing on `search_all`, `walk_namespace`, `find_by_title`, `related`)
Pre-release fix-up
Final bug-sweep passes after the main review work above. Categorised by area for easier scanning.
- content/structure: `_resolve_entry_with_fallback` and `get_binary_entry` now follow the redirect chain (bounded by the shared `MAX_REDIRECT_DEPTH = 10` cap with cycle detection) before calling `entry.get_item()`. Without this the structure, links, TOC, summary, and binary-entry tools all crashed with `RuntimeError` from libzim whenever the requested path was a redirect to the canonical article (the common case for Kiwix-generated ZIMs)
- content: `_get_main_page_content` resolves `archive.main_entry` and the fallback `main_page_paths` entries before calling `get_item()`. Most ZIMs point `W/mainPage` at the real article via a redirect; previously this raised on every such archive
- content: `get_zim_metadata` resolves redirect entries before reading metadata content
- content: `get_related_articles` preserves trailing slash in path resolution and resolves relative links against the post-redirect path
- zim: `_resolve_link_to_entry_path` rejects self-referential refs that previously fed back into the resolver
- search: `_perform_filtered_search` canonicalises lowercase / long-form namespace input so filters stop silently dropping every result; suggestion cache now skips zero-result responses
- search: `search_all` validates effective_limit is in the documented 1-50 range
- simple-tools: `get_article` intent forwards `options[content_offset]` so simple-mode pagination works (previously always returned page 1); passthrough intents forward `options[limit]` / `options[offset]`
- subscriptions: `broadcast_resource_updated` re-raises `CancelledError` that `gather(return_exceptions=True)` had silently collected, so `stop()` no longer hangs until the next sleep tick
- subscriptions: `MtimeWatcher.start()` offloads initial `_scan` via `asyncio.to_thread` to match `_tick`, no longer blocking the ASGI lifespan on slow filesystems
- subscriptions: mtime scan offloaded to thread; fan-out cleanup guarded against late exceptions
- prompts: switch user-input interpolation delimiter to backticks and preserve quotes in user input
- rate-limit: add missing `RATE_LIMIT_COSTS` keys for `find_entry_by_title`, `get_zim_entries`, `get_related_articles` (were silently using the cost=1 default)
- http: add `Mcp-Session-Id` to CORS `allow_headers` and `expose_headers` so browser MCP clients can resume sessions
- main: catch `pydantic.ValidationError` from `OpenZimMcpConfig` construction and re-surface as `OpenZimMcpConfigurationError` so operators see a clean message instead of a pydantic validation dump
- cache: suppress shutdown logging spam; tolerate malformed persisted entries
- security: symlink-tighten archive scan; harden error context; sanitise `name_filter`; reject whitespace-only CORS wildcard
- tools: `get_binary_entry` docstring example uses keyword `include_data=False` (positional `False` was landing in `max_size_bytes`)
- packaging: `Development Status :: 5 - Production/Stable` classifier for the 1.0.0 release
Final pre-release sweep
- resource: `ZimEntryResource.read` and the `zim://files` / `zim://{name}` resource handlers now offload archive opens via `asyncio.to_thread`; previously a single read stalled the HTTP/SSE event loop for every other concurrent client
- resource: `ZimEntryResource.read` resolves redirect chains (with cycle detection and the shared `MAX_REDIRECT_DEPTH = 10` cap) before `entry.get_item()`; previously every redirect-stub path crashed with `RuntimeError` from libzim
- content: `get_zim_entries` (batch) replaces manual `__enter__` / `__exit__` with a regular `with` block: cleaner cleanup on `BaseException`, no silent swallowing of `__exit__` errors
- content: drop `_get_main_page_content`'s `archive._get_entry_by_id(0)` fallback (libzim private API; entry-zero is not the spec's main-page pointer); the inline redirect helper now uses `MAX_REDIRECT_DEPTH` and raises `OpenZimMcpArchiveError` on cycles or chain exhaustion to match the rest of the redirect helpers
- server: `OpenZimMcpServer.run()` defaults to `self.config.transport` (translating the short name `'http'` to FastMCP's `'streamable-http'`) and rejects an explicit `transport=` argument that contradicts the configured value, closing the gap where HTTP-mode subscriptions could be wired while a stdio transport was actually started
- search/structure: `find_entry_by_title`, `search_all`, and `get_related_articles` raise `OpenZimMcpValidationError` on out-of-range `limit` / `limit_per_file` instead of returning a hand-formatted markdown string, so the tool layer sees a consistent exception shape
- http: `_is_loopback_host` adds a 1-second timeout around `socket.gethostbyname("localhost")` so a slow resolver can't hang server startup
- ci: drop `pull_request_target` trigger from `test.yml` / `codeql.yml` / `performance.yml` (closes the pwn-request gap where untrusted PR code could exfiltrate secrets); release-please prerelease detection reads the resolved tag name (works for `workflow_dispatch`); release-please bootstrap-sha placeholders removed; Dockerfile uv image pinned to `0.11`
- make: `make benchmark` selects via `-k benchmark` (the previously referenced `tests/test_benchmarks.py` does not exist); `make security` no longer swallows bandit / pip-audit non-zero exits, so `make check` (used by `release.yml`) actually fails on findings
- docs: `OPENZIM_MCP_TOOL_MODE`, `_TRANSPORT`, `_HOST`, `_PORT`, `_AUTH_TOKEN`, `_CORS_ORIGINS`, `_WATCH_INTERVAL_SECONDS`, `_SUBSCRIPTIONS_ENABLED` documented in the README configuration table; install commands aligned across README / `website/llms.txt` / `website/index.html` (lead with `uv tool install openzim-mcp`); `website/llm.txt` renamed to `website/llms.txt` (matches the llmstxt.org convention) and advertised in the sitemap
- Resolved dependency CVEs: pyjwt upgraded to 2.12.0, requests to 2.33.0, pytest to 9.0.3, python-multipart to 0.0.26, python‑dotenv to 1.2.2
- Multi-archive search (`search_all`) queries every allowed ZIM file simultaneously and merges results
- Three new MCP prompts: `/research`, `/summarize`, `/explore` for workflow automation
- Eight additional power‑user tools (e.g., `walk_namespace`, `warm_cache`, `get_related_articles`) and two MCP resources (`zim://files`, `zim://{name}`)
Full changelog
Ships 8 new tools, 3 MCP prompts, 2 MCP resources, 6 reliability fixes, and refreshed user-facing docs.
Stats: 34 commits • 56 files • +6,711 / -1,423 • 715 tests passing (was 669 on main, +46 new) • coverage 78%
Headline features
Multi-archive search
search_all queries every ZIM file in your allowed directories at once and merges the results — no need to know which archive holds the answer. Skips files that can't be searched without aborting the rest.
MCP Prompts (first use of the primitive)
Three pre-built workflow slash commands in MCP-aware clients (Claude Code, Inspector):
- `/research <topic>`: search across archives, then drill into top hits
- `/summarize <zim_file_path> <entry_path>`: TOC + summary + key links
- `/explore <zim_file_path>`: high-level briefing of a ZIM's contents
All three guard against empty arguments and ask the user for missing input rather than rendering a malformed workflow.
Find entries by title
find_entry_by_title resolves titles (or partial titles) to entry paths via libzim's title-indexed suggestion search, with a fast direct-path probe for exact matches. Optionally cross-file.
Power-user tools
- `walk_namespace`: deterministic cursor-paginated namespace iteration (vs. `browse_namespace` which samples)
- `warm_cache`: pre-populate cache for a ZIM file before a long session
- `get_random_entry`: sample one random article (composes with `/explore`)
- `get_related_articles`: link-graph nearest neighbours (outbound, inbound, or both, with cursor-based pagination)
- `cache_stats` / `cache_clear`: inspect and manage the in-memory cache
MCP Resources (first use of the primitive)
Your client's resource browser and @-mention picker now see ZIM files directly:
- `zim://files`: index of all available ZIM files
- `zim://{name}`: overview of one ZIM (metadata, namespace summary, main page preview)
Reliability fixes
- Namespace listing now does deterministic known-prefix probes for minority namespaces (M, W, X, I) that random sampling silently missed on archives where one namespace dominated. Below 3 sampled hits per namespace, we report a lower-bound count instead of fabricating numbers from the sampling ratio.
- Search filtering uses a streaming scan (BATCH_SIZE=500, MAX_SCAN=10000) instead of a hard 1000-hit cap, so rare-mime-type filters (e.g. image/png on an HTML-dominant corpus) now return matches that were previously hidden.
- Error routing matches by message pattern first, so "entry not found" no longer gets "check disk space" advice from the broad OpenZimMcpArchiveError template.
- Phantom server-instance conflicts are no longer reported: the conflict detector re-verifies process liveness right before raising and auto-unregisters phantoms.
- max_content_length floor relaxed from 1000 → 100 to support short-preview use cases.
- get_article_structure / get_table_of_contents docstrings now warn that "mini" / "nopic" flavour ZIMs strip subsection headings, so callers know low heading counts aren't necessarily a bug.
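The streaming scan in the search-filtering fix can be sketched as below. Only the two constants come from the release note; `fetch_batch` and `keep` are hypothetical callables, and the real loop may differ in detail.

```python
from typing import Callable

BATCH_SIZE = 500    # value quoted in the release note
MAX_SCAN = 10_000   # value quoted in the release note


def filtered_hits_sketch(
    fetch_batch: Callable[[int, int], list[dict]],
    keep: Callable[[dict], bool],
    wanted: int,
) -> list[dict]:
    """Scan ranked hits in batches until `wanted` matches or MAX_SCAN hits are seen."""
    found: list[dict] = []
    offset = 0
    while offset < MAX_SCAN and len(found) < wanted:
        batch = fetch_batch(offset, min(BATCH_SIZE, MAX_SCAN - offset))
        if not batch:
            break  # result set exhausted
        for hit in batch:
            if keep(hit):
                found.append(hit)
                if len(found) == wanted:
                    break
        offset += len(batch)
    return found
```

With a fixed 1000-hit cap, a filter whose first match sits at rank 1500 returns nothing; scanning in batches up to MAX_SCAN lets the same filter surface it.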
Simple-mode patterns
The natural-language Simple-mode tool now routes 8 new intent patterns to the new tools (search_all, walk_namespace, warm_cache, find_by_title, random_entry, related, cache_stats, cache_clear). The runtime-visible zim_query docstring lists all 26 patterns.
Documentation
- README: new "What's new in v0.9.0" section, refreshed banner, tool count updated 18 → 26, added MCP prompts/resources section.
- Website: hero banner refreshed, version stat updated, three new NEW-badged feature cards (multi-archive search, MCP prompts, find by title); displaced cards (article summaries, TOC, binary content) demoted to non-NEW.
- Simple Mode Guide: tool count updated 18 → 26.
- Spec & plan committed under docs/superpowers/ for traceability.
Testing
- 715 tests passing (+46 new tests across 9 new test files)
- make lint / make security (bandit) all clean
- Coverage holds at 78% (baseline was 78%; new tests focus on validation/contract paths per project convention)
- Manual smoke test of new tools and prompts via MCP Inspector recommended before merge
Out of scope (deferred)
Per the v0.9.0 spec, intentionally deferred to a future release:
- Per-entry MCP resources (zim://{name}/entry/A/Article): FastMCP URI templates don't handle a literal / in path captures cleanly
- HTTP/SSE/streamable-http transport: there is a contributed fork implementing this; planned for a focused 0.9.1 point release
- Resource subscriptions
- Batch get_entries
Known follow-ups
Tracked from the final code review pass; non-blocking but worth filing:
- get_related_articles inbound mode opens N+1 archives per call (each candidate triggers a separate extract_article_links call). Cache hits hide it on warm runs; on a network unmount mid-call, every candidate after gets silently skipped. Worth refactoring extract_article_links to accept an already-open archive.
- search_all returns markdown-text inside JSON rather than structured per-result data. Consistent with how search_zim_file already works, but a future contract cleanup.
- ~~5 pre-existing dependency CVEs flagged by pip-audit~~ (resolved): pyjwt 2.10.1→2.12.0 (#65), requests 2.32.5→2.33.0 (#66), pytest 9.0.2→9.0.3 (#75), python-multipart 0.0.22→0.0.26 (#76), python-dotenv 1.2.1→1.2.2 (#77) all merged into main and rebased into this branch.
- 5 newly-disclosed CVEs found by post-rebase pip-audit (cryptography 46.0.5, pygments 2.19.2, black 26.1.0, pip 26.0). All covered by existing Dependabot PRs (#64, #72, #74); candidate for a 0.9.1 follow-up.
- Wiki content (wiki-content/*.md) reflects pre-0.9 surface; out of scope for this PR.
Versioning
- pyproject.toml, .release-please-manifest.json, openzim_mcp/__init__.py all bumped to 0.9.0
- uv.lock regenerated
- CHANGELOG will be auto-generated by release-please from the conventional commits on this branch
- Resolved GitHub code scanning alerts #133 and #134 in security.py
Full changelog
Bug Fixes
- fix logo URL in README.md to use absolute GitHub raw URL for PyPI display (README.md)
- resolve GitHub code scanning alert #133 - variable defined multiple times in security.py (security.py)
- resolve GitHub code scanning alert #134 - mixed import styles in test_main.py (test_main.py)
- remove unused contextlib import from security.py (flake8 fix)
Fixed search pagination when offset exceeds total results.
Full changelog
Bug Fixes
- fix search pagination when offset exceeds total results (zim_operations.py)
- improve exception handling in instance tracker for Python 3 compatibility (instance_tracker.py)
- add fallback to stderr for logging during shutdown (instance_tracker.py)
- improve Windows process checking with debug logging (instance_tracker.py)
- fix release workflow to skip automatic GitHub release creation (release-please.yml)
- resolve linting issues in simple_tools.py and content_tools.py
- Article summaries via get_entry_summary
- Table of contents extraction via get_table_of_contents
- Token‑based pagination cursors
Full changelog
0.8.1 (2026-01-29)
Features
- add article summaries, table of contents, and pagination cursors
Bug Fixes
- remove unused imports in test files for CI linting
- resolve GitHub code scanning alerts
Details
- Article Summaries (get_entry_summary): Extract concise article summaries from opening paragraphs
- Table of Contents Extraction (get_table_of_contents): Build hierarchical TOC from article headings
- Pagination Cursors: Token-based pagination for easier result navigation
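Building a hierarchical TOC from a flat heading list is commonly done with a stack. The sketch below shows that general technique under stated assumptions: headings arrive in document order as (level, title) pairs with levels from 1 upward, and the node shape is invented. It is not taken from get_table_of_contents itself.

```python
def build_toc_sketch(headings: list[tuple[int, str]]) -> list[dict]:
    """Nest (level, title) pairs, given in document order, into a tree."""
    root: list[dict] = []
    # Each stack item is (heading level, list that receives its children).
    stack: list[tuple[int, list[dict]]] = [(0, root)]
    for level, title in headings:
        node = {"level": level, "title": title, "children": []}
        # Pop until the top of the stack is a shallower heading: the parent.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root
```

Given an h1 followed by an h2, an h3, and another h2, the result has one top-level node with two children, and the h3 nested under the first h2.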
Enhanced
- Intent Parsing: Improved multi-match resolution with weighted scoring
- Simple Mode: Added natural language support for new features
Minor fixes and improvements.
Full changelog
- Added binary content retrieval for PDFs, images, and media files