Retio-ai/pagemap

MCP Browser & Automation

PageMap reduces raw HTML pages to compact, AI‑readable page maps (2‑5 K tokens) while preserving interactive capabilities like clicking and typing.

Python · Latest v1.1.1 · 13d ago

Features

  • Compresses large HTML pages into 2‑5 K token “page maps” (≈97% reduction).
  • Provides full interaction tools: click, type, select, hover, navigate across tabs.
  • Auto‑detects 16 page types and supports structured extraction for 60+ e‑commerce sites.
  • Smart recovery detects login barriers, cookie consent pop‑ups, bot challenges and suggests next steps.

Recent releases

View all 14 releases →
No immediate action
v1.1.0 Feature

Delta Intelligence evidence packet

v1.0.0 New feature
Security fixes
  • SSRF defense, prompt injection defense, robots.txt compliance, resource guards
Notable features
  • 13 MCP tools (e.g., get_page_map, execute_action, fill_form)
  • 16 auto-detected page types and extraction optimizations
  • Support for 60+ e-commerce sites across multiple regions
Full changelog

PageMap v1.0.0 — First Public Release

The browsing MCP server that fits in your context window. Compresses ~100K-token HTML into a 2-5K-token structured map while preserving every actionable element.

Highlights

  • 13 MCP tools — get_page_map, execute_action, fill_form, scroll_page, wait_for, take_screenshot, get_page_state, navigate_back, batch_get_page_map, open_tab, switch_tab, list_tabs, close_tab
  • 16 page types auto-detected with optimized extraction
  • 60+ e-commerce sites supported across Global, Korea, Japan, China
  • 8 JSON-LD schemas — Product, NewsArticle, VideoObject, FAQPage, Event, LocalBusiness, BreadcrumbList, ItemList
  • 10 languages with locale auto-detection and CJK token budget adjustment
  • 2-layer caching — cache hit (~10ms), content refresh (~500ms), full rebuild (~1.5s)
  • Security hardened — SSRF defense, prompt injection defense, robots.txt compliance, resource guards

Install

pip install retio-pagemap

MCP Client Config

{
  "mcpServers": {
    "pagemap": {
      "command": "uvx",
      "args": ["retio-pagemap"]
    }
  }
}

Docker

docker run -p 8000:8000 retio1001/pagemap --transport http

Full documentation: https://github.com/Retio-ai/Retio-pagemap#readme

v0.7.3 Breaking risk
Notable features
  • CreditMiddleware deducts credits per tool call and returns HTTP 402 with RFC 9457 `problem+json` when balance is insufficient
  • RedisRateLimiter implements a token‑bucket algorithm via an atomic Lua script, replacing the in‑process RateLimiter for multi‑worker deployments; selectable at runtime via RateLimiterProtocol
  • Paddle payment infrastructure adds webhook middleware with HMAC‑SHA256 verification, SQLite credit repository (schema v2), three tiered credit packs ($10/500, $25/1500, $50/5000) and four related telemetry events
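
As a rough illustration of the 402 behaviour described above, the sketch below builds an RFC 9457 problem document using only the standard library. The function name and the `balance`/`cost` extension members are hypothetical; the release notes do not say which fields PageMap actually returns.

```python
import json


def insufficient_credits_response(balance: int, cost: int) -> tuple[int, dict[str, str], bytes]:
    """Build a (status, headers, body) triple for an HTTP 402 problem response.

    Hypothetical helper: field names beyond the RFC 9457 core members are assumptions.
    """
    problem = {
        "type": "about:blank",        # RFC 9457 default when no specific problem type URI exists
        "title": "Payment Required",  # with about:blank, the title mirrors the HTTP status phrase
        "status": 402,
        "detail": f"Tool call costs {cost} credits; balance is {balance}.",
        "balance": balance,           # extension member (assumed)
        "cost": cost,                 # extension member (assumed)
    }
    body = json.dumps(problem).encode("utf-8")
    headers = {
        "Content-Type": "application/problem+json",
        "Content-Length": str(len(body)),
    }
    return 402, headers, body
```

A client can branch on the status code and the `application/problem+json` content type without parsing free-form error text.
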
Full changelog

[0.7.3] - 2026-02-26

Added

  • Credit debit middleware — CreditMiddleware ASGI middleware deducts credits per tool call before dispatching to the MCP handler. Integrates with CreditRepositoryProtocol (SQLite + in-memory). Returns HTTP 402 with RFC 9457 problem+json body when balance is insufficient
  • Redis distributed rate limiter — RedisRateLimiter implements a token-bucket algorithm via a Lua script executed atomically on Redis. Replaces in-process RateLimiter for multi-worker deployments. RateLimiterProtocol allows runtime selection between in-memory and Redis backends
  • CQP agent behavior signal events — Two new telemetry event types for tracking agent tool usage patterns: TOOL_CALL_SEQUENCE (session-level tool sequence with timing deltas) and TOOL_DISAGREEMENT (consecutive-same-tool and same-URL-recall signals). TypedDict payloads + builder functions added to telemetry/events.py
  • OtlpHttpExporter — Cloud telemetry exporter with OTLP-JSON over HTTP, gzip compression, retryable 429/502/503/504 responses, Retry-After header parsing, exponential backoff + jitter
  • config.py 3-layer telemetry config — TelemetryConfig resolves settings from YAML file → environment variables → CLI flags in priority order. Supports sample_rate, batch_size, flush_interval
  • FanOutWriter — Multiplexes telemetry events to multiple writers (e.g., local JSONL + remote OTLP) simultaneously
  • Paddle payment infrastructure — src/pagemap/paddle/ module: webhook.py (ASGI middleware, HMAC-SHA256 signature verification, 30 s replay tolerance, idempotency gate via event_id), signature.py (constant-time hmac.compare_digest), credits.py (CreditRepositoryProtocol, SQLite schema v2 with CHECK ≥ 0, BEGIN IMMEDIATE atomic writes), products.py (3 credit pack tiers: $10/500, $25/1500, $50/5000), checkout.py (paddle-python-sdk lazy import), config.py (PaddleConfig from env). 4 telemetry events: PADDLE_WEBHOOK_RECEIVED, PADDLE_CREDITS_ADDED, PADDLE_WEBHOOK_INVALID, PADDLE_WEBHOOK_DUPLICATE
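
The webhook checks listed above (HMAC-SHA256, a 30 s replay window, constant-time comparison) could look roughly like the sketch below. The header layout `ts=<unix seconds>;h1=<hex digest>` and the signed payload `<ts>:<body>` are assumptions made for illustration, and `verify_signature` is a hypothetical name rather than the function in signature.py.

```python
import hashlib
import hmac
import time
from typing import Optional

REPLAY_TOLERANCE_S = 30  # the 30 s tolerance mentioned in the changelog


def verify_signature(secret: str, header: str, body: bytes, now: Optional[float] = None) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    # Assumed header layout: "ts=<unix seconds>;h1=<hex digest>"
    try:
        parts = dict(item.split("=", 1) for item in header.split(";"))
        ts = int(parts["ts"])
        received = parts["h1"]
    except (KeyError, ValueError):
        return False  # malformed header

    now = time.time() if now is None else now
    if abs(now - ts) > REPLAY_TOLERANCE_S:
        return False  # too old (or too far in the future): possible replay

    # Assumed signed payload: "<ts>:<raw body>"
    signed = f"{ts}:".encode() + body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Verifying against the raw request bytes, before any JSON parsing, keeps the digest independent of how the body is later re-serialized.
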

Changed

  • _to_float() European thousand-separator parsing — Single-separator strings like "1.500" are now correctly parsed as 1500.0 when the separator is in a thousands position (3 digits follow). Previously returned 1.5
  • BBC News pre-AOM portal hint — bbc.co.uk and bbc.com domains are now classified as news portals before AOM processing, ensuring the news portal compressor is applied even on pages where <article> count is low
  • Inline element boundary spacing — Inline tags (<a>, <strong>, <em>, <span>, <b>, <i>) now insert a space at their boundary during text extraction, preventing word concatenation artifacts (e.g., "priceitem" → "price item")
  • product_detail option UI preservation — Option selector elements (size/color dropdowns, radio buttons) are now rescued from AOM removal for product_detail pages, recovering 47 → 100+ tokens of structured option information
  • Page classifier: category listing fix — Category index pages (e.g., /category/women) are now correctly classified as listing rather than article. Scoring weight for path-based listing signals increased
  • Semaphore pool slot leak fixed — BrowserPool no longer accesses Semaphore._value (private CPython attribute). Slot count is now tracked via an explicit _available counter, eliminating AttributeError on non-CPython runtimes and future CPython versions
  • ServeHelpAction parsing stabilization — _ServeHelpAction.__call__ now catches SystemExit raised by argparse during help generation; help text is always printed even if the subparser raises
  • Sensitive tests moved to tests/private/ — Auth, rate limiter, billing, telemetry, and SSRF telemetry test files relocated to tests/private/ (excluded from public release). release.sh updated accordingly
  • 4377 → 4735 tests passing (+358)
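
To make the separator rule in the `_to_float()` entry concrete, here is a minimal sketch of one way to disambiguate thousands and decimal separators. `to_float` is a hypothetical stand-in, not the project's `_to_float()`, and the heuristic is inherently ambiguous for inputs such as "0.500".

```python
import re
from typing import Optional


def to_float(text: str) -> Optional[float]:
    """Parse a price-like string that may use '.' or ',' as thousands or decimal mark."""
    s = re.sub(r"[^\d.,]", "", text)  # drop currency symbols, spaces, letters
    if not re.search(r"\d", s):
        return None

    if "." in s and "," in s:
        # Both present: the right-most separator is taken as the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "." in s or "," in s:
        sep = "." if "." in s else ","
        head, _, tail = s.rpartition(sep)
        if s.count(sep) > 1 or len(tail) == 3:
            s = s.replace(sep, "")  # thousands position: exactly 3 digits follow
        else:
            s = f"{head}.{tail}"    # otherwise treat it as a decimal mark

    try:
        return float(s)
    except ValueError:
        return None
```

Under these assumptions "1.500" yields 1500.0 and "4.9" stays 4.9, which is the distinction the changelog entry describes.
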

Fixed

  • Unused variable assignments removed from test_redis_rate_limiter.py and test_ssrf_telemetry.py
  • Import sort order corrected in rate_limiter.py, redis_rate_limiter.py, and related test files (ruff I001)
v0.7.2 Breaking risk
⚠ Upgrade required
  • OG image/thumbnail URLs now validated with _is_valid_url(); javascript: and data: URLs are rejected
  • _is_inside_article_or_main() uses O(1) lookup via pre‑computed set; no external action required but improves performance on large documents
  • VideoObject schema overrides compressor selection in _SCHEMA_OVERRIDES; existing configurations remain compatible
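
For the URL validation mentioned in the first item above, a scheme allow-list is one common approach. The sketch below is an assumption about how such a check might work; `is_valid_url` and `ALLOWED_SCHEMES` are illustrative names, and the real `_is_valid_url()` may apply different rules.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({"http", "https"})  # assumed allow-list


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs; reject javascript:, data:, and relative forms."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False  # e.g. malformed IPv6 literal
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)
```

An allow-list fails closed: any scheme not explicitly permitted is rejected, including ones nobody thought to block.
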
Security fixes
  • Added max_depth=5 parameter to _find_type_in_jsonld() to prevent RecursionError from maliciously nested @graph structures (DoS vector)
  • Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name and _parse_h1() return value — eliminates prompt injection vector
Notable features
  • _extract_price_from_html() lxml DOM fallback price extractor
  • _extract_video_meta_from_dom() class‑name and regex video metadata extraction
  • News portal detection & compression via _is_news_portal() / _compress_for_news_portal()
Full changelog

What's Changed

Security

  • _find_type_in_jsonld() recursion depth limit — Added max_depth=5 parameter to prevent RecursionError from maliciously nested @graph structures (DoS vector)
  • metadata.py field sanitization — Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name, and _parse_h1() return value — eliminates prompt injection vector from 8+ previously unsanitized fields
  • OG image/thumbnail URL validation — Applied _is_valid_url() to image_url/thumbnail_url OG fields; javascript: and data: URLs no longer pass through
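
The depth limit in the first item can be pictured as a recursive search that carries a shrinking budget. This is a sketch under stated assumptions: `find_type` is a hypothetical function, and how the real `_find_type_in_jsonld()` counts depth is not specified in the notes.

```python
from typing import Any, Optional


def find_type(node: Any, wanted: str, max_depth: int = 5) -> Optional[dict]:
    """Return the first JSON-LD object whose @type matches, giving up past max_depth."""
    if max_depth < 0:
        return None  # budget exhausted: stop instead of recursing further
    if isinstance(node, list):
        for item in node:
            hit = find_type(item, wanted, max_depth - 1)
            if hit is not None:
                return hit
        return None
    if isinstance(node, dict):
        types = node.get("@type")
        if types == wanted or (isinstance(types, list) and wanted in types):
            return node
        # Only descend through @graph; each level consumes one unit of depth.
        return find_type(node.get("@graph"), wanted, max_depth - 1)
    return None


# A pathologically nested document is abandoned after a few calls
# rather than exhausting the interpreter's recursion limit.
nested: Any = {"@type": "Product"}
for _ in range(1000):
    nested = {"@graph": [nested]}
result_for_nested = find_type(nested, "Product")
```
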

Added

  • _extract_price_from_html() — lxml DOM-based last-resort price extractor from raw/pruned HTML; priority: a-offscreen text > price-class text_content() > aria-label (handles Amazon nested span structures)
  • _extract_video_meta_from_dom() — Class-name and regex-based video metadata extraction from heading chunks (channel, view_count, duration); called as last-resort fallback for VideoObject schema
  • _is_news_portal() / _compress_for_news_portal() — Detects news portal pattern (≥3 <article> elements or ≥3 headline links) in dashboard-classified pages; dedicated numbered headline list compressor with optional per-article summaries (BBC News improvement)
  • pagemap serve --help forwarded options — _ServeHelpAction + _get_server_options_help() dynamically append server options (e.g. --transport, --port, --allow-local) to pagemap serve --help output
  • 3 new test files — test_hn_regression.py, test_news_portal_compression.py, test_pruned_context_builder_fixes.py

Fixed

  • HN/forum table-based grid whitelist — Added table/tbody to _GRID_CONTAINER_TAGS; table-based content listings (Hacker News, forums) now receive link-density penalty exemption (fixes HN content regression: 1,010 → 511 tok in v0.7.1)
  • AOM content rescue detached parent — Added tree.getpath(parent) check before re-inserting rescued elements; skips rescue when parent was detached during removal phase (prevents silent ghost subtree insertion)
  • removed_nodes stat overcounting — Rescued nodes are now subtracted from removed_nodes and removal_reasons counters (previously inflated telemetry)
  • _extract_price_from_dom_chunks() empty-text fallback — Added aria-label and data-* attribute fallbacks when chunk text is empty
  • _to_float() European number format — "1.500,99" now correctly parses as 1500.99 instead of 1.5
  • _to_int() silent truncation — Changed from int(float) to round(float): "4.9" → 5 instead of 4 (fixes silent corruption in reviewCount etc.)
  • _extract_price_from_offers() zero-price falsy — lowPrice or price pattern replaced with explicit None check; price=0 is preserved correctly
  • _extract_image_url() ImageObject dict — Added support for "image": {"@type": "ImageObject", "url": "..."} pattern (previously missed)
  • extract_metadata() pruned_html parameter — Product and VideoObject schemas receive pruned HTML for lxml-based price extraction fallback; VideoObject also gets DOM-based metadata fallback
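
Two of the fixes above, truncation in `_to_int()` and the falsy zero price, come from common Python pitfalls. The sketch below reproduces both with hypothetical helpers (`to_int`, `pick_price`) and is not the project's code.

```python
from typing import Optional


def to_int(text: str) -> int:
    """Round to the nearest integer instead of truncating toward zero."""
    # int(float("4.9")) truncates to 4; round() gives 5.
    # Note: Python's round() uses banker's rounding, so round(2.5) == 2.
    return round(float(text))


def pick_price(offers: dict) -> Optional[float]:
    """Prefer lowPrice, falling back to price only when lowPrice is absent."""
    low = offers.get("lowPrice")
    # `offers.get("lowPrice") or offers.get("price")` would skip a legitimate 0,
    # because 0 is falsy. An explicit None check keeps free items at price 0.
    return low if low is not None else offers.get("price")
```
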

Changed

  • _is_inside_article_or_main() O(1) lookup — Pre-computes _article_main_descendants set before filtering loop; passes to _compute_weight() via new article_main_descendants parameter. Eliminates O(nodes × depth) traversal on large documents
  • _compress_for_dashboard() news portal delegation — Now accepts doc parameter and delegates to _compress_for_news_portal() when _is_news_portal() detects news portal structure
  • VideoObject added to _SCHEMA_OVERRIDES — VideoObject schema now overrides page_type-based compressor selection; _SCHEMA_OVERRIDES moved to module level (eliminates per-call frozenset recreation)
  • Phase 4 product price regex — Replaced inline regex with pre-compiled _PRICE_CLASS_RE (named groups, handles single-quote class attributes); extracted price injected back into metadata dict for downstream consumers
  • VideoObject itemprop author → channel — Added "author": "channel" to _ITEMPROP_FIELD_MAP["VideoObject"]
  • VideoObject OG og:site_name removed — Removed og:site_name → channel mapping (was incorrectly using the site name, e.g. "YouTube", as the channel name)
  • Video description CJK budget factor — Reduced from 0.95 to 0.85 to account for CJK-heavy descriptions (~1.5 chars/token vs English ~4 chars/token); _truncate_to_tokens() guard handles overshoot
  • 4014 → 4194 tests passing (+180)
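
The O(1) lookup in the first item of this list amounts to trading a per-node ancestor walk for one up-front pass. The sketch below uses the standard library's ElementTree instead of lxml so that it runs without extra dependencies; `article_main_descendants` here is an illustrative function, not the project's internal set.

```python
import xml.etree.ElementTree as ET


def article_main_descendants(root: ET.Element) -> set:
    """Collect every element nested inside an <article> or <main> container."""
    inside = set()
    for container in root.iter():
        if container.tag in ("article", "main"):
            for node in container.iter():
                if node is not container:
                    inside.add(node)  # elements hash by identity
    return inside


root = ET.fromstring(
    "<html><body>"
    "<article><p>kept</p></article>"
    "<div><p>sidebar</p></div>"
    "</body></html>"
)
inside = article_main_descendants(root)
paragraphs = root.findall(".//p")
# Membership is now a set lookup rather than a walk up the ancestor chain.
flags = [p in inside for p in paragraphs]
```

Each later membership test costs a single hash lookup, which is what removes the O(nodes × depth) term the entry mentions.
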

Full Changelog: https://github.com/Retio-ai/Retio-pagemap/blob/main/CHANGELOG.md

v0.7.1 Breaking risk
⚠ Upgrade required
  • `pagemap serve --transport http --port <PORT>` now functional; CLI forwards remaining arguments to the server.
  • CLI help text cleaned: duplicate "build build" epilog removed, `retio-pagemap --help` restored.
Notable features
  • Article extraction now yields 400+ tokens (up from 84), using a budget‑based compressor in place of the fixed 2‑paragraph limit
  • YouTube page type added with VideoObject metadata parser and formatted K/M number display
  • Amazon price extraction enhanced via DOM fallback for `a-price`/`a-offscreen` classes
Full changelog

Content Extraction Quality + DX Improvements

Resolves 11 of 13 issues from QUALITY_REVIEW_v0.7.0. Core product value — content extraction — significantly improved for article, video, and product pages.

Highlights

  • Article content: 84 → 400+ tokens — Readability.js-inspired AOM filter exemption for <p> tags inside <article>/<main> (long paragraphs survive link-density penalty). Budget-based article compressor replaces fixed 2-paragraph limit.
  • YouTube: video page type + rich metadata — New page classifier with URL/meta/DOM/JSON-LD signals. VideoObject metadata parser extracts views, likes, channel, duration. Dedicated video compressor with K/M number formatting.
  • Amazon price extraction — DOM price fallback scans a-price/a-offscreen class patterns. Product schema content rescue enhanced. Product compressor price fallback from pruned HTML.
  • pagemap serve HTTP transport — pagemap serve --transport http --port 8000 now works. CLI forwards remaining args to server.
  • DX fixes — --help epilog build build duplicate removed, retio-pagemap --help restored, README version updated.
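
The K/M number formatting mentioned for the video compressor can be sketched as below. The exact output format PageMap uses (decimal places, thresholds, upper or lower case) is not stated in the notes, so `format_count` is only an assumed illustration.

```python
def format_count(n: int) -> str:
    """Abbreviate large counts, e.g. view or like totals, with K/M suffixes."""

    def trim(value: float) -> str:
        # One decimal place, with a trailing ".0" removed.
        return f"{value:.1f}".rstrip("0").rstrip(".")

    if n >= 1_000_000:
        return trim(n / 1_000_000) + "M"
    if n >= 1_000:
        return trim(n / 1_000) + "K"
    return str(n)
```

Abbreviated counts cost fewer tokens than full digit strings, which fits the compression goal of the page map.
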

Bug Fixes

  • session_manager resource leak — pool.acquire() exception now properly releases semaphore slot
  • Card detection entity leak — &amp; entities no longer leak into agent output (e.g., "H&M")

Stats

  • 4,014 tests passing (+31 net, +48 new)
  • 22 files changed, 566 insertions, 84 deletions

See CHANGELOG for full details.

About

Stars
32
Forks
5
Languages
Python TypeScript Shell

Install & Platforms

Install via
pip docker

Alternative to

Playwright MCP Firecrawl Jina Reader