Retio-ai/pagemap

MCP Browser & Automation

PageMap converts large HTML pages into compact, AI-readable page maps (2-5K tokens) with interaction capabilities, reducing token usage by ~97%.

Python · Latest v1.1.1 · 2mo ago

Features

Compresses raw HTML (100K+ tokens) to a structured map of 2‑5K tokens
Provides full browsing interaction: click, type, select, hover across pages
Supports 16 auto‑detected page types and built‑in extraction for 30+ e‑commerce sites
Detects and handles barriers (login, bot checks, consent popups) with recovery suggestions

Recent releases

View all 14 releases →

No immediate action

v1.1.0 Feature 2mo

Delta Intelligence evidence packet

v1.0.0 New feature 4mo

Security fixes

SSRF defense, prompt injection defense, robots.txt compliance, resource guards

Notable features

13 MCP tools (e.g., get_page_map, execute_action, fill_form)
16 auto-detected page types and extraction optimizations
Support for 60+ e-commerce sites across multiple regions

Full changelog

PageMap v1.0.0 — First Public Release

The browsing MCP server that fits in your context window. Compresses ~100K-token HTML into a 2-5K-token structured map while preserving every actionable element.

Highlights

13 MCP tools — get_page_map, execute_action, fill_form, scroll_page, wait_for, take_screenshot, get_page_state, navigate_back, batch_get_page_map, open_tab, switch_tab, list_tabs, close_tab
16 page types auto-detected with optimized extraction
60+ e-commerce sites supported across Global, Korea, Japan, China
8 JSON-LD schemas — Product, NewsArticle, VideoObject, FAQPage, Event, LocalBusiness, BreadcrumbList, ItemList
10 languages with locale auto-detection and CJK token budget adjustment
2-layer caching — cache hit (~10ms), content refresh (~500ms), full rebuild (~1.5s)
Security hardened — SSRF defense, prompt injection defense, robots.txt compliance, resource guards

Install

pip install retio-pagemap

MCP Client Config

{
  "mcpServers": {
    "pagemap": {
      "command": "uvx",
      "args": ["retio-pagemap"]
    }
  }
}

Docker

docker run -p 8000:8000 retio1001/pagemap --transport http

Full documentation: https://github.com/Retio-ai/Retio-pagemap#readme

v0.7.3 Breaking risk 5mo

Notable features

CreditMiddleware deducts credits per tool call and returns HTTP 402 with RFC 9457 `problem+json` when balance is insufficient
RedisRateLimiter implements a token‑bucket algorithm via an atomic Lua script, replacing the in‑process RateLimiter for multi‑worker deployments; selectable at runtime via RateLimiterProtocol
Paddle payment infrastructure adds webhook middleware with HMAC‑SHA256 verification, SQLite credit repository (schema v2), three tiered credit packs ($10/500, $25/1500, $50/5000) and four related telemetry events
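
The token-bucket algorithm named above can be illustrated with a minimal single-process sketch. This is not PageMap's `RedisRateLimiter`: the class name, parameters, and injectable clock are assumptions made for illustration, and the Redis variant would run the same refill-and-take step inside a Lua script so that it stays atomic across workers.

```python
import time


class TokenBucket:
    """Hypothetical in-memory bucket: up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock          # injectable so the behavior can be tested deterministically
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `capacity=2` and `rate=1.0` admits a burst of two calls, rejects the third, and admits one more call after a second has passed.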

Full changelog

[0.7.3] - 2026-02-26

Added

Credit debit middleware — CreditMiddleware ASGI middleware deducts credits per tool call before dispatching to the MCP handler. Integrates with CreditRepositoryProtocol (SQLite + in-memory). Returns HTTP 402 with RFC 9457 problem+json body when balance is insufficient
Redis distributed rate limiter — RedisRateLimiter implements token-bucket algorithm via a Lua script executed atomically on Redis. Replaces in-process RateLimiter for multi-worker deployments. RateLimiterProtocol allows runtime selection between in-memory and Redis backends
CQP agent behavior signal events — Two new telemetry event types for tracking agent tool usage patterns: TOOL_CALL_SEQUENCE (session-level tool sequence with timing deltas) and TOOL_DISAGREEMENT (consecutive-same-tool and same-URL-recall signals). TypedDict payloads + builder functions added to telemetry/events.py
OtlpHttpExporter — Cloud telemetry exporter with OTLP-JSON over HTTP, gzip compression, retryable 429/502/503/504 responses, Retry-After header parsing, exponential backoff + jitter
config.py 3-layer telemetry config — TelemetryConfig resolves settings from YAML file → environment variables → CLI flags in priority order. Supports sample_rate, batch_size, flush_interval
FanOutWriter — Multiplexes telemetry events to multiple writers (e.g., local JSONL + remote OTLP) simultaneously
Paddle payment infrastructure — src/pagemap/paddle/ module: webhook.py (ASGI middleware, HMAC-SHA256 signature verification, 30 s replay tolerance, idempotency gate via event_id), signature.py (constant-time hmac.compare_digest), credits.py (CreditRepositoryProtocol, SQLite schema v2 with CHECK ≥ 0, BEGIN IMMEDIATE atomic writes), products.py (3 credit pack tiers: $10/500, $25/1500, $50/5000), checkout.py (paddle-python-sdk lazy import), config.py (PaddleConfig from env). 4 telemetry events: PADDLE_WEBHOOK_RECEIVED, PADDLE_CREDITS_ADDED, PADDLE_WEBHOOK_INVALID, PADDLE_WEBHOOK_DUPLICATE
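
The webhook checks listed for the Paddle module (HMAC-SHA256 signature, 30 s replay tolerance, constant-time comparison, idempotency via event_id) might be combined roughly as follows. This is a sketch under stated assumptions, not the contents of webhook.py: the function name, the `timestamp:body` signing layout, and the in-memory `seen` set are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional, Set


def verify_webhook(secret: bytes, body: bytes, timestamp: int, signature_hex: str,
                   event_id: str, seen: Set[str],
                   now: Optional[float] = None, tolerance_s: int = 30) -> bool:
    """Return True only for a fresh, correctly signed, not-yet-processed event."""
    now = time.time() if now is None else now
    # Replay tolerance: reject events whose timestamp is too far from the current time.
    if abs(now - timestamp) > tolerance_s:
        return False
    # Assumed signing layout: "<timestamp>:<raw body>".
    signed = f"{timestamp}:".encode() + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, signature_hex):
        return False
    # Idempotency gate: each event_id is accepted at most once.
    if event_id in seen:
        return False
    seen.add(event_id)
    return True
```

The signature is checked before the idempotency gate so that an unsigned request cannot mark an event_id as already processed.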

Changed

_to_float() European thousand-separator parsing — Single-separator strings like "1.500" are now correctly parsed as 1500.0 when the separator is in a thousands position (3 digits follow). Previously returned 1.5
BBC News pre-AOM portal hint — bbc.co.uk and bbc.com domains are now classified as news portals before AOM processing, ensuring the news portal compressor is applied even on pages where <article> count is low
Inline element boundary spacing — Inline tags (<a>, <strong>, <em>, <span>, <b>, <i>) now insert a space at their boundary during text extraction, preventing word concatenation artifacts (e.g., "priceitem" → "price item")
product_detail option UI preservation — Option selector elements (size/color dropdowns, radio buttons) are now rescued from AOM removal for product_detail pages, recovering 47 → 100+ tokens of structured option information
Page classifier: category listing fix — Category index pages (e.g., /category/women) are now correctly classified as listing rather than article. Scoring weight for path-based listing signals increased
Semaphore pool slot leak fixed — BrowserPool no longer accesses Semaphore._value (private CPython attribute). Slot count is now tracked via an explicit _available counter, eliminating AttributeError on non-CPython runtimes and future CPython versions
ServeHelpAction parsing stabilization — _ServeHelpAction.__call__ now catches SystemExit raised by argparse during help generation; help text is always printed even if the subparser raises
Sensitive tests moved to tests/private/ — Auth, rate limiter, billing, telemetry, and SSRF telemetry test files relocated to tests/private/ (excluded from public release). release.sh updated accordingly
4377 → 4735 tests passing (+358)
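
The thousands-position rule described for _to_float() can be sketched as below. The function is an illustrative stand-in rather than the project's implementation; the regular expression and the fallback order are assumptions.

```python
import re

# One to three leading digits (not starting with 0), then groups of exactly
# three digits that all use the same separator: "1.500", "1,500", "1.500.000".
_GROUPED = re.compile(r"[1-9]\d{0,2}([.,])\d{3}(?:\1\d{3})*")


def to_float(text: str) -> float:
    s = text.strip()
    if "," in s and "." in s:
        # Both separators present: whichever comes last is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")   # European: 1.500,99
        else:
            s = s.replace(",", "")                      # US/UK: 1,500.99
    elif _GROUPED.fullmatch(s):
        # A single separator type in thousands position: drop it.
        s = s.replace(",", "").replace(".", "")
    else:
        # Otherwise treat a lone comma as a decimal comma.
        s = s.replace(",", ".")
    return float(s)
```

Under these assumptions `to_float("1.500")` gives 1500.0 while `to_float("1.5")` stays 1.5. A value such as "1.500" is inherently ambiguous, so a rule like this trades rare three-decimal prices for correct handling of European listings.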

Fixed

Unused variable assignments removed from test_redis_rate_limiter.py and test_ssrf_telemetry.py
Import sort order corrected in rate_limiter.py, redis_rate_limiter.py, and related test files (ruff I001)

v0.7.2 Breaking risk 5mo

⚠ Upgrade required

OG image/thumbnail URLs now validated with _is_valid_url(); javascript: and data: URLs are rejected
_is_inside_article_or_main() now uses an O(1) lookup via a pre-computed set; no action is required, and performance improves on large documents
VideoObject schema overrides compressor selection in _SCHEMA_OVERRIDES; existing configurations remain compatible

Security fixes

Added max_depth=5 parameter to _find_type_in_jsonld() to prevent RecursionError from maliciously nested @graph structures (DoS vector)
Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name and _parse_h1() return value — eliminates prompt injection vector
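
A depth limit of the kind added to _find_type_in_jsonld() could look like the following sketch. The function name and traversal details are hypothetical; only the max_depth=5 default and the @graph nesting come from the notes above.

```python
from typing import Any, Optional


def find_type(node: Any, wanted: str, max_depth: int = 5) -> Optional[dict]:
    """Search nested JSON-LD for a node of type `wanted`, giving up past max_depth."""
    if max_depth < 0:
        return None                      # budget exhausted: stop instead of recursing further
    if isinstance(node, list):
        for item in node:
            found = find_type(item, wanted, max_depth - 1)
            if found is not None:
                return found
        return None
    if isinstance(node, dict):
        types = node.get("@type")
        if types == wanted or (isinstance(types, list) and wanted in types):
            return node
        # Every descent into @graph spends one unit of the depth budget.
        return find_type(node.get("@graph"), wanted, max_depth - 1)
    return None
```

Because the budget shrinks on every call, a document with thousands of nested @graph levels returns None after a handful of frames rather than raising RecursionError.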

Notable features

_extract_price_from_html() lxml DOM fallback price extractor
_extract_video_meta_from_dom() class‑name and regex video metadata extraction
News portal detection & compression via _is_news_portal() / _compress_for_news_portal()

Full changelog

What's Changed

Security

_find_type_in_jsonld() recursion depth limit — Added max_depth=5 parameter to prevent RecursionError from maliciously nested @graph structures (DoS vector)
metadata.py field sanitization — Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name, and _parse_h1() return value — eliminates prompt injection vector from 8+ previously unsanitized fields
OG image/thumbnail URL validation — Applied _is_valid_url() to image_url/thumbnail_url OG fields; javascript: and data: URLs no longer pass through
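
The URL check applied to OG image fields can be approximated with an allow-list of schemes. This sketch is an assumption about what such a validator might do, not the body of _is_valid_url(); in particular, accepting only http and https is a choice made here.

```python
from urllib.parse import urlsplit


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs; javascript:, data:, and relative URLs fail."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False                     # malformed input such as a bad IPv6 literal
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

An allow-list is generally safer than rejecting known-bad schemes, since any scheme not listed (vbscript:, file:, and so on) is rejected by default.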

Added

_extract_price_from_html() — lxml DOM-based last-resort price extractor from raw/pruned HTML; priority: a-offscreen text > price-class text_content() > aria-label (handles Amazon nested span structures)
_extract_video_meta_from_dom() — Class-name and regex-based video metadata extraction from heading chunks (channel, view_count, duration); called as last-resort fallback for VideoObject schema
_is_news_portal() / _compress_for_news_portal() — Detects news portal pattern (≥3 <article> elements or ≥3 headline links) in dashboard-classified pages; dedicated numbered headline list compressor with optional per-article summaries (BBC News improvement)
pagemap serve --help forwarded options — _ServeHelpAction + _get_server_options_help() dynamically append server options (e.g. --transport, --port, --allow-local) to pagemap serve --help output
3 new test files — test_hn_regression.py, test_news_portal_compression.py, test_pruned_context_builder_fixes.py
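
The news portal heuristic (at least 3 article elements or at least 3 headline links) can be sketched with the standard-library HTML parser. The class and function names are hypothetical, and treating a "headline link" as an anchor inside an h1 to h3 heading is an assumption; the notes above do not define it.

```python
from html.parser import HTMLParser

_HEADINGS = ("h1", "h2", "h3")


class _PortalCounter(HTMLParser):
    """Counts <article> elements and links that sit inside h1-h3 headings."""

    def __init__(self):
        super().__init__()
        self.articles = 0
        self.headline_links = 0
        self._heading_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.articles += 1
        elif tag in _HEADINGS:
            self._heading_depth += 1
        elif tag == "a" and self._heading_depth > 0:
            self.headline_links += 1

    def handle_endtag(self, tag):
        if tag in _HEADINGS and self._heading_depth > 0:
            self._heading_depth -= 1


def is_news_portal(html: str, threshold: int = 3) -> bool:
    counter = _PortalCounter()
    counter.feed(html)
    counter.close()
    return counter.articles >= threshold or counter.headline_links >= threshold
```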

Fixed

HN/forum table-based grid whitelist — Added table/tbody to _GRID_CONTAINER_TAGS; table-based content listings (Hacker News, forums) now receive link-density penalty exemption (fixes HN content regression: 1,010 → 511 tok in v0.7.1)
AOM content rescue detached parent — Added tree.getpath(parent) check before re-inserting rescued elements; skips rescue when parent was detached during removal phase (prevents silent ghost subtree insertion)
removed_nodes stat overcounting — Rescued nodes are now subtracted from removed_nodes and removal_reasons counters (previously inflated telemetry)
_extract_price_from_dom_chunks() empty-text fallback — Added aria-label and data-* attribute fallbacks when chunk text is empty
_to_float() European number format — "1.500,99" now correctly parses as 1500.99 instead of 1.5
_to_int() silent truncation — Changed from int(float) to round(float): "4.9" → 5 instead of 4 (fixes silent corruption in reviewCount etc.)
_extract_price_from_offers() zero-price falsy — lowPrice or price pattern replaced with explicit None check; price=0 is preserved correctly
_extract_image_url() ImageObject dict — Added support for "image": {"@type": "ImageObject", "url": "..."} pattern (previously missed)
extract_metadata() pruned_html parameter — Product and VideoObject schemas receive pruned HTML for lxml-based price extraction fallback; VideoObject also gets DOM-based metadata fallback
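
Two of the fixes above, the zero-price falsy bug and the silent truncation in _to_int(), come down to small Python idioms. The helpers below are illustrative stand-ins with hypothetical names, not the project's functions.

```python
from typing import Optional


def pick_price(offers: dict) -> Optional[float]:
    """Prefer lowPrice, fall back to price, and keep a legitimate price of 0."""
    low = offers.get("lowPrice")
    # `low or offers.get("price")` would skip lowPrice=0 because 0 is falsy.
    value = low if low is not None else offers.get("price")
    return None if value is None else float(value)


def to_int(text: str) -> int:
    """Round to the nearest integer instead of truncating toward zero."""
    return round(float(text))
```

Here `pick_price({"lowPrice": 0, "price": 19.99})` keeps 0.0, and `to_int("4.9")` gives 5 where `int(float("4.9"))` gives 4. Python's `round()` resolves exact halves to the nearest even integer, so `to_int("2.5")` is 2.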

Changed

_is_inside_article_or_main() O(1) lookup — Pre-computes _article_main_descendants set before filtering loop; passes to _compute_weight() via new article_main_descendants parameter. Eliminates O(nodes × depth) traversal on large documents
_compress_for_dashboard() news portal delegation — Now accepts doc parameter and delegates to _compress_for_news_portal() when _is_news_portal() detects news portal structure
VideoObject added to _SCHEMA_OVERRIDES — VideoObject schema now overrides page_type-based compressor selection; _SCHEMA_OVERRIDES moved to module level (eliminates per-call frozenset recreation)
Phase 4 product price regex — Replaced inline regex with pre-compiled _PRICE_CLASS_RE (named groups, handles single-quote class attributes); extracted price injected back into metadata dict for downstream consumers
VideoObject itemprop author → channel — Added "author": "channel" to _ITEMPROP_FIELD_MAP["VideoObject"]
VideoObject OG og:site_name removed — Removed og:site_name → channel mapping (was incorrectly using site name e.g. "YouTube" as channel name)
Video description CJK budget factor — Reduced from 0.95 to 0.85 to account for CJK-heavy descriptions (~1.5 chars/token vs English ~4 chars/token); _truncate_to_tokens() guard handles overshoot
4014 → 4194 tests passing (+180)
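
The O(1) lookup behind _is_inside_article_or_main() amounts to collecting every descendant of article and main once, then testing set membership. The sketch below uses the standard-library ElementTree on well-formed markup for simplicity, whereas the changelog refers to lxml; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def article_main_descendants(root: ET.Element) -> set:
    """Collect all elements nested inside <article> or <main> in one pass."""
    inside = set()
    for container in root.iter():        # document order: ancestors before descendants
        # A container already in the set was covered when its ancestor was processed.
        if container.tag in ("article", "main") and container not in inside:
            for node in container.iter():
                if node is not container:
                    inside.add(node)
    return inside
```

A filtering loop can then replace a per-node walk up the ancestor chain with `node in inside`, which is what removes the O(nodes × depth) cost.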

Full Changelog: https://github.com/Retio-ai/Retio-pagemap/blob/main/CHANGELOG.md

v0.7.1 Breaking risk 5mo

⚠ Upgrade required

`pagemap serve --transport http --port <PORT>` now functional; CLI forwards remaining arguments to the server.
CLI help text cleaned: duplicate "build build" epilog removed, `retio-pagemap --help` restored.

Notable features

Article extraction now yields 400+ tokens (up from 84), using a budget-based compressor in place of the fixed 2-paragraph limit
YouTube page type added with VideoObject metadata parser and formatted K/M number display
Amazon price extraction enhanced via DOM fallback for `a-price`/`a-offscreen` classes

Full changelog

Content Extraction Quality + DX Improvements

Resolves 11 of 13 issues from QUALITY_REVIEW_v0.7.0. Core product value — content extraction — significantly improved for article, video, and product pages.

Highlights

Article content: 84 → 400+ tokens — Readability.js-inspired AOM filter exemption for <p> tags inside <article>/<main> (long paragraphs survive link-density penalty). Budget-based article compressor replaces fixed 2-paragraph limit.
YouTube: video page type + rich metadata — New page classifier with URL/meta/DOM/JSON-LD signals. VideoObject metadata parser extracts views, likes, channel, duration. Dedicated video compressor with K/M number formatting.
Amazon price extraction — DOM price fallback scans a-price/a-offscreen class patterns. Product schema content rescue enhanced. Product compressor price fallback from pruned HTML.
pagemap serve HTTP transport — pagemap serve --transport http --port 8000 now works. CLI forwards remaining args to server.
DX fixes — --help epilog build build duplicate removed, retio-pagemap --help restored, README version updated.
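
The K/M number formatting mentioned for the video compressor might be implemented along these lines. The function name, the single decimal place, and the choice to truncate rather than round are assumptions.

```python
def format_count(n: int) -> str:
    """Format view or like counts compactly, e.g. 1500 as 1.5K."""
    for threshold, suffix in ((1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            tenths = n * 10 // threshold          # integer math, truncated to one decimal
            whole, frac = divmod(tenths, 10)
            return f"{whole}.{frac}{suffix}" if frac else f"{whole}{suffix}"
    return str(n)
```

Truncating keeps values just under a boundary readable: 999,999 becomes 999.9K rather than rounding up to 1000.0K.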

Bug Fixes

session_manager resource leak — pool.acquire() exception now properly releases semaphore slot
Card detection entity leak — `&amp;` entities no longer leak into agent output (e.g., "H&amp;M" instead of "H&M")
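
The session_manager fix follows a common pattern: if work done after taking a semaphore slot fails, the slot must be returned before the exception propagates. The asyncio sketch below uses a hypothetical `Pool` class and `factory` callback; it is not the BrowserPool code.

```python
import asyncio


class Pool:
    """Hypothetical pool that hands out at most `size` resources at a time."""

    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.available = size            # explicit counter, no private Semaphore attributes

    async def acquire(self, factory):
        await self._sem.acquire()
        self.available -= 1
        try:
            return await factory()       # e.g. launching a browser page; may raise
        except BaseException:
            self.release()               # give the slot back so it is not leaked
            raise

    def release(self) -> None:
        self.available += 1
        self._sem.release()
```

Without the except branch, each failed `factory()` call would permanently consume one slot, and the pool would eventually block every caller.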