Retio-ai/pagemap
MCP Browser & Automation
PageMap reduces raw HTML pages to compact, AI‑readable page maps (2‑5 K tokens) while preserving interactive capabilities such as clicking and typing.
Features
- Compresses large HTML pages into 2‑5 K token “page maps” (≈97% reduction).
- Provides full interaction tools: click, type, select, hover, navigate across tabs.
- Auto‑detects 16 page types and supports structured extraction for 30+ e‑commerce sites.
- Smart recovery detects login barriers, cookie consent pop‑ups, bot challenges and suggests next steps.
Recent releases
- SSRF defense, prompt injection defense, robots.txt compliance, resource guards
- 13 MCP tools (e.g., get_page_map, execute_action, fill_form)
- 16 auto-detected page types and extraction optimizations
- Support for 60+ e-commerce sites across multiple regions
Full changelog
PageMap v1.0.0 — First Public Release
The browsing MCP server that fits in your context window. Compresses ~100K-token HTML into a 2-5K-token structured map while preserving every actionable element.
Highlights
- 13 MCP tools — get_page_map, execute_action, fill_form, scroll_page, wait_for, take_screenshot, get_page_state, navigate_back, batch_get_page_map, open_tab, switch_tab, list_tabs, close_tab
- 16 page types auto-detected with optimized extraction
- 60+ e-commerce sites supported across Global, Korea, Japan, China
- 8 JSON-LD schemas — Product, NewsArticle, VideoObject, FAQPage, Event, LocalBusiness, BreadcrumbList, ItemList
- 10 languages with locale auto-detection and CJK token budget adjustment
- 2-layer caching — cache hit (~10ms), content refresh (~500ms), full rebuild (~1.5s)
- Security hardened — SSRF defense, prompt injection defense, robots.txt compliance, resource guards
Install
pip install retio-pagemap
MCP Client Config
{
  "mcpServers": {
    "pagemap": {
      "command": "uvx",
      "args": ["retio-pagemap"]
    }
  }
}
Docker
docker run -p 8000:8000 retio1001/pagemap --transport http
Full documentation: https://github.com/Retio-ai/Retio-pagemap#readme
- CreditMiddleware deducts credits per tool call and returns HTTP 402 with RFC 9457 `problem+json` when balance is insufficient
- RedisRateLimiter implements a token‑bucket algorithm via an atomic Lua script, replacing the in‑process RateLimiter for multi‑worker deployments; selectable at runtime via RateLimiterProtocol
- Paddle payment infrastructure adds webhook middleware with HMAC‑SHA256 verification, SQLite credit repository (schema v2), three tiered credit packs ($10/500, $25/1500, $50/5000) and four related telemetry events
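The token-bucket idea behind `RedisRateLimiter` can be sketched in-process. This is a minimal illustration, not PageMap's code: the class name `TokenBucket`, its parameters, and the injectable clock are all assumptions. According to the notes, the real limiter runs its logic inside a Lua script on Redis so that the refill-and-spend step is atomic across workers.

```python
import time


class TokenBucket:
    """Hypothetical in-memory token bucket (illustration only)."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock              # injectable so tests can control time
        self.tokens = capacity          # the bucket starts full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill according to elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill, but never beyond the configured capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With several worker processes, each would hold its own copy of this state, which is the gap a shared Redis-backed limiter is meant to close.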
Full changelog
[0.7.3] - 2026-02-26
Added
- Credit debit middleware: `CreditMiddleware` ASGI middleware deducts credits per tool call before dispatching to the MCP handler. Integrates with `CreditRepositoryProtocol` (SQLite + in-memory). Returns HTTP 402 with an RFC 9457 `problem+json` body when the balance is insufficient
- Redis distributed rate limiter: `RedisRateLimiter` implements a token-bucket algorithm via a Lua script executed atomically on Redis. Replaces the in-process `RateLimiter` for multi-worker deployments. `RateLimiterProtocol` allows runtime selection between in-memory and Redis backends
- CQP agent behavior signal events: Two new telemetry event types for tracking agent tool usage patterns: `TOOL_CALL_SEQUENCE` (session-level tool sequence with timing deltas) and `TOOL_DISAGREEMENT` (consecutive-same-tool and same-URL-recall signals). TypedDict payloads + builder functions added to `telemetry/events.py`
- `OtlpHttpExporter`: Cloud telemetry exporter with OTLP-JSON over HTTP, gzip compression, retryable 429/502/503/504 responses, `Retry-After` header parsing, exponential backoff + jitter
- `config.py` 3-layer telemetry config: `TelemetryConfig` resolves settings from YAML file → environment variables → CLI flags in priority order. Supports `sample_rate`, `batch_size`, `flush_interval`
- `FanOutWriter`: Multiplexes telemetry events to multiple writers (e.g., local JSONL + remote OTLP) simultaneously
- Paddle payment infrastructure: `src/pagemap/paddle/` module with `webhook.py` (ASGI middleware, HMAC-SHA256 signature verification, 30 s replay tolerance, idempotency gate via `event_id`), `signature.py` (constant-time `hmac.compare_digest`), `credits.py` (`CreditRepositoryProtocol`, SQLite schema v2 with `CHECK ≥ 0`, `BEGIN IMMEDIATE` atomic writes), `products.py` (3 credit pack tiers: $10/500, $25/1500, $50/5000), `checkout.py` (`paddle-python-sdk` lazy import), `config.py` (`PaddleConfig` from env). 4 telemetry events: `PADDLE_WEBHOOK_RECEIVED`, `PADDLE_CREDITS_ADDED`, `PADDLE_WEBHOOK_INVALID`, `PADDLE_WEBHOOK_DUPLICATE`
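The webhook checks named above (HMAC-SHA256 signature, 30 s replay tolerance, constant-time comparison) might look roughly like the following. This is a sketch under stated assumptions: the function name `verify_webhook`, its parameters, and the `"<timestamp>:<body>"` signed-payload layout are hypothetical and are not taken from PageMap or from Paddle's documentation.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, timestamp: int, body: bytes,
                   signature_hex: str, now: int, tolerance_s: int = 30) -> bool:
    """Hypothetical verifier: checks freshness first, then the signature."""
    # Reject deliveries whose timestamp falls outside the replay window.
    if abs(now - timestamp) > tolerance_s:
        return False
    # Assumed signed payload layout: "<timestamp>:<raw body>".
    signed = str(timestamp).encode() + b":" + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not reveal
    # how many leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

An idempotency gate keyed on `event_id` would sit after this check, so that a valid but repeated delivery does not add credits twice.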
Changed
- `_to_float()` European thousand-separator parsing: Single-separator strings like `"1.500"` are now correctly parsed as `1500.0` when the separator is in a thousands position (3 digits follow). Previously returned `1.5`
- BBC News pre-AOM portal hint: `bbc.co.uk` and `bbc.com` domains are now classified as news portals before AOM processing, ensuring the news portal compressor is applied even on pages where `<article>` count is low
- Inline element boundary spacing: Inline tags (`<a>`, `<strong>`, `<em>`, `<span>`, `<b>`, `<i>`) now insert a space at their boundary during text extraction, preventing word concatenation artifacts (e.g., `"priceitem"` → `"price item"`)
- `product_detail` option UI preservation: Option selector elements (size/color dropdowns, radio buttons) are now rescued from AOM removal for `product_detail` pages, recovering 47 → 100+ tokens of structured option information
- Page classifier category listing fix: Category index pages (e.g., `/category/women`) are now correctly classified as `listing` rather than `article`. Scoring weight for path-based listing signals increased
- Semaphore pool slot leak fixed: `BrowserPool` no longer accesses `Semaphore._value` (private CPython attribute). Slot count is now tracked via an explicit `_available` counter, eliminating `AttributeError` on non-CPython runtimes and future CPython versions
- `ServeHelpAction` parsing stabilization: `_ServeHelpAction.__call__` now catches `SystemExit` raised by argparse during help generation; help text is always printed even if the subparser raises
- Sensitive tests moved to `tests/private/`: Auth, rate limiter, billing, telemetry, and SSRF telemetry test files relocated to `tests/private/` (excluded from public release). `release.sh` updated accordingly
- 4377 → 4735 tests passing (+358)
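The separator heuristic described for `_to_float()` can be illustrated with a small stand-alone parser. This is a sketch only: `to_float_sketch` is a hypothetical name, and the rules below (the last of two different marks is the decimal mark; a lone separator followed by exactly three digits is a thousands separator) are an assumption about the behaviour, not the project's implementation.

```python
import re
from typing import Optional


def to_float_sketch(text: str) -> Optional[float]:
    """Hypothetical parser for numbers such as '1.500', '1.500,99', '1,234,567'."""
    s = text.strip()
    if not re.fullmatch(r"\d[\d.,]*", s):
        return None
    try:
        if "." in s and "," in s:
            # Both marks present: whichever appears last is the decimal mark.
            decimal = "." if s.rfind(".") > s.rfind(",") else ","
            thousands = "," if decimal == "." else "."
            return float(s.replace(thousands, "").replace(decimal, "."))
        sep = "." if "." in s else ("," if "," in s else "")
        if not sep:
            return float(s)
        tail = s.rpartition(sep)[2]
        if s.count(sep) > 1 or len(tail) == 3:
            # Repeated mark, or exactly three digits after it: thousands grouping.
            return float(s.replace(sep, ""))
        return float(s.replace(sep, "."))
    except ValueError:
        return None
```

The three-digit rule is a trade-off: a genuine three-decimal value such as `"1.500"` meaning 1.5 would be read as 1500.0, which seems acceptable for prices but not for arbitrary measurements.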
Fixed
- Unused variable assignments removed from `test_redis_rate_limiter.py` and `test_ssrf_telemetry.py`
- Import sort order corrected in `rate_limiter.py`, `redis_rate_limiter.py`, and related test files (ruff I001)
- OG image/thumbnail URLs now validated with _is_valid_url(); javascript: and data: URLs are rejected
- _is_inside_article_or_main() uses O(1) lookup via pre‑computed set; no external action required but improves performance on large documents
- VideoObject schema overrides compressor selection in _SCHEMA_OVERRIDES; existing configurations remain compatible
- Added max_depth=5 parameter to _find_type_in_jsonld() to prevent RecursionError from maliciously nested @graph structures (DoS vector)
- Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name and _parse_h1() return value — eliminates prompt injection vector
- Added _extract_price_from_html(), an lxml DOM fallback price extractor
- Added _extract_video_meta_from_dom() for class‑name and regex‑based video metadata extraction
- Added news portal detection and compression via _is_news_portal() / _compress_for_news_portal()
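Rejecting `javascript:` and `data:` URLs is commonly done with a scheme allow-list. The sketch below shows one way to do it with the standard library; `is_valid_url_sketch` and the allowed-scheme set are assumptions and may differ from what `_is_valid_url()` actually checks.

```python
from urllib.parse import urlsplit

# Assumption: only absolute http(s) URLs are acceptable for image fields.
_ALLOWED_SCHEMES = {"http", "https"}


def is_valid_url_sketch(url: str) -> bool:
    """Hypothetical validator: allow-list the scheme and require a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. bad IPv6 literals.
        return False
    return parts.scheme in _ALLOWED_SCHEMES and bool(parts.netloc)
```

An allow-list fails closed: any scheme not listed is rejected, including ones nobody thought to block. Note that this sketch also rejects protocol-relative URLs such as `//cdn.example.com/x.png`.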
Full changelog
What's Changed
Security
- `_find_type_in_jsonld()` recursion depth limit: Added `max_depth=5` parameter to prevent `RecursionError` from maliciously nested `@graph` structures (DoS vector)
- metadata.py field sanitization: Applied `sanitize_text()` to `currency`, `telephone`, `price_range`, `datePublished`, `upload_date`, `duration`, `start_date`/`end_date`, BreadcrumbList `name`, and `_parse_h1()` return value; eliminates prompt injection vector from 8+ previously unsanitized fields
- OG image/thumbnail URL validation: Applied `_is_valid_url()` to `image_url`/`thumbnail_url` OG fields; `javascript:` and `data:` URLs no longer pass through
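A depth budget of the kind described for `_find_type_in_jsonld()` can be sketched as follows. The function name `find_type_sketch` and the traversal details are assumptions; only the `max_depth=5` default and the `@graph` nesting come from the notes above.

```python
from typing import Any, Optional


def find_type_sketch(node: Any, wanted: str, max_depth: int = 5) -> Optional[dict]:
    """Hypothetical depth-limited search for a JSON-LD node whose @type is `wanted`."""
    if max_depth < 0:
        return None                       # budget exhausted: stop descending
    if isinstance(node, list):
        for item in node:
            found = find_type_sketch(item, wanted, max_depth - 1)
            if found is not None:
                return found
        return None
    if isinstance(node, dict):
        types = node.get("@type")
        if types == wanted or (isinstance(types, list) and wanted in types):
            return node
        # Only @graph is followed here; each level consumes one unit of budget.
        return find_type_sketch(node.get("@graph"), wanted, max_depth - 1)
    return None
```

Because every recursive call decrements the budget, a document with thousands of nested `@graph` levels returns `None` after a handful of frames instead of exhausting the interpreter's recursion limit.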
Added
- `_extract_price_from_html()`: lxml DOM-based last-resort price extractor from raw/pruned HTML; priority: `a-offscreen` text > price-class `text_content()` > `aria-label` (handles Amazon nested span structures)
- `_extract_video_meta_from_dom()`: Class-name and regex-based video metadata extraction from heading chunks (channel, view_count, duration); called as last-resort fallback for VideoObject schema
- `_is_news_portal()` / `_compress_for_news_portal()`: Detects news portal pattern (≥3 `<article>` elements or ≥3 headline links) in dashboard-classified pages; dedicated numbered headline list compressor with optional per-article summaries (BBC News improvement)
- `pagemap serve --help` forwarded options: `_ServeHelpAction` + `_get_server_options_help()` dynamically append server options (e.g. `--transport`, `--port`, `--allow-local`) to `pagemap serve --help` output
- 3 new test files: `test_hn_regression.py`, `test_news_portal_compression.py`, `test_pruned_context_builder_fixes.py`
Fixed
- HN/forum table-based grid whitelist: Added `table`/`tbody` to `_GRID_CONTAINER_TAGS`; table-based content listings (Hacker News, forums) now receive link-density penalty exemption (fixes HN content regression: 1,010 → 511 tok in v0.7.1)
- AOM content rescue detached parent: Added `tree.getpath(parent)` check before re-inserting rescued elements; skips rescue when parent was detached during removal phase (prevents silent ghost subtree insertion)
- `removed_nodes` stat overcounting: Rescued nodes are now subtracted from `removed_nodes` and `removal_reasons` counters (previously inflated telemetry)
- `_extract_price_from_dom_chunks()` empty-text fallback: Added `aria-label` and `data-*` attribute fallbacks when chunk text is empty
- `_to_float()` European number format: `"1.500,99"` now correctly parses as `1500.99` instead of `1.5`
- `_to_int()` silent truncation: Changed from `int(float)` to `round(float)`: `"4.9"` → `5` instead of `4` (fixes silent corruption in reviewCount etc.)
- `_extract_price_from_offers()` zero-price falsy: `lowPrice or price` pattern replaced with explicit `None` check; price=0 is preserved correctly
- `_extract_image_url()` ImageObject dict: Added support for `"image": {"@type": "ImageObject", "url": "..."}` pattern (previously missed)
- `extract_metadata()` pruned_html parameter: Product and VideoObject schemas receive pruned HTML for lxml-based price extraction fallback; VideoObject also gets DOM-based metadata fallback
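Two of the fixes above are general Python pitfalls and are easy to reproduce in isolation. The helper names below are hypothetical; the snippet only demonstrates why `a or b` loses a zero price and why `int()` truncates where `round()` does not.

```python
from typing import Optional


def pick_price_buggy(offer: dict) -> Optional[float]:
    # `or` treats 0 as missing, so a zero lowPrice falls through to `price`.
    return offer.get("lowPrice") or offer.get("price")


def pick_price(offer: dict) -> Optional[float]:
    # An explicit None check keeps a legitimate price of 0.
    low = offer.get("lowPrice")
    return low if low is not None else offer.get("price")


def to_int_sketch(text: str) -> int:
    # round() goes to the nearest integer; int() would cut "4.9" down to 4.
    return round(float(text))
```

One caveat with `round()`: Python rounds exact halves to the nearest even integer, so `round(2.5)` is `2`. For values like review counts this is unlikely to matter, but it is worth knowing before relying on it for prices.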
Changed
- `_is_inside_article_or_main()` O(1) lookup: Pre-computes `_article_main_descendants` set before filtering loop; passes to `_compute_weight()` via new `article_main_descendants` parameter. Eliminates O(nodes × depth) traversal on large documents
- `_compress_for_dashboard()` news portal delegation: Now accepts `doc` parameter and delegates to `_compress_for_news_portal()` when `_is_news_portal()` detects news portal structure
- `VideoObject` added to `_SCHEMA_OVERRIDES`: VideoObject schema now overrides page_type-based compressor selection; `_SCHEMA_OVERRIDES` moved to module level (eliminates per-call frozenset recreation)
- Phase 4 product price regex: Replaced inline regex with pre-compiled `_PRICE_CLASS_RE` (named groups, handles single-quote class attributes); extracted price injected back into metadata dict for downstream consumers
- VideoObject itemprop `author` → `channel`: Added `"author": "channel"` to `_ITEMPROP_FIELD_MAP["VideoObject"]`
- VideoObject OG `og:site_name` removed: Removed `og:site_name` → `channel` mapping (was incorrectly using site name e.g. "YouTube" as channel name)
- Video description CJK budget factor: Reduced from 0.95 to 0.85 to account for CJK-heavy descriptions (~1.5 chars/token vs English ~4 chars/token); `_truncate_to_tokens()` guard handles overshoot
- 4014 → 4194 tests passing (+180)
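The pre-computed descendant set can be illustrated with the standard library's ElementTree (the project itself uses lxml, per the notes above). The function name and the decision to include the container element itself are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def article_main_descendants(root: ET.Element) -> set:
    """Collect every element at or below an <article>/<main>."""
    inside = set()
    for container in root.iter():
        # Skip containers already covered by an enclosing article/main.
        if container.tag in ("article", "main") and container not in inside:
            inside.update(container.iter())   # iter() yields the container too
    return inside
```

After this one-time walk, "is this node inside an article or main?" becomes a set membership test (`node in inside`) rather than a climb through the node's ancestors, which is the O(nodes × depth) cost the changelog entry refers to.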
Full Changelog: https://github.com/Retio-ai/Retio-pagemap/blob/main/CHANGELOG.md
- `pagemap serve --transport http --port <PORT>` now functional; CLI forwards remaining arguments to the server.
- CLI help text cleaned: duplicate "build build" epilog removed, `retio-pagemap --help` restored.
- Article extraction now yields 400+ tokens (up from 84) via a budget‑based compressor
- YouTube page type added with VideoObject metadata parser and formatted K/M number display
- Amazon price extraction enhanced via DOM fallback for `a-price`/`a-offscreen` classes
Full changelog
Content Extraction Quality + DX Improvements
Resolves 11 of 13 issues from QUALITY_REVIEW_v0.7.0. Core product value — content extraction — significantly improved for article, video, and product pages.
Highlights
- Article content (84 → 400+ tokens): Readability.js-inspired AOM filter exemption for `<p>` tags inside `<article>`/`<main>` (long paragraphs survive link-density penalty). Budget-based article compressor replaces fixed 2-paragraph limit.
- YouTube `video` page type + rich metadata: New page classifier with URL/meta/DOM/JSON-LD signals. VideoObject metadata parser extracts views, likes, channel, duration. Dedicated video compressor with K/M number formatting.
- Amazon price extraction: DOM price fallback scans `a-price`/`a-offscreen` class patterns. Product schema content rescue enhanced. Product compressor price fallback from pruned HTML.
- `pagemap serve` HTTP transport: `pagemap serve --transport http --port 8000` now works. CLI forwards remaining args to server.
- DX fixes: `--help` epilog `build build` duplicate removed, `retio-pagemap --help` restored, README version updated.
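K/M number formatting of the kind mentioned for the video compressor is a small transformation. The helper below is a hypothetical sketch; the one-decimal precision and the dropping of a trailing `.0` are assumptions, not the project's exact output format.

```python
def format_count(n: int) -> str:
    """Hypothetical compact formatter for view and like counts."""
    for threshold, suffix in ((1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            # One decimal place, with a trailing ".0" dropped (2000 becomes "2K").
            value = f"{n / threshold:.1f}".rstrip("0").rstrip(".")
            return f"{value}{suffix}"
    return str(n)
```

Shortening counts this way costs a little precision but saves tokens, which fits the goal of keeping the page map within a small budget.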
Bug Fixes
- session_manager resource leak: `pool.acquire()` exception now properly releases semaphore slot
- Card detection entity leak: `&amp;` entities no longer leak into agent output (e.g., "H&amp;M")
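The semaphore leak described above follows a common pattern: a slot is taken, the work that follows raises, and nothing gives the slot back. A minimal asyncio sketch of the fix is shown below. `PoolSketch`, its `factory` argument, and the explicit `available` counter are assumptions made for illustration and do not mirror the project's pool class.

```python
import asyncio


class PoolSketch:
    """Hypothetical pool that never loses a slot when setup fails."""

    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.available = size             # explicit counter, no private attributes

    async def acquire(self, factory):
        await self._sem.acquire()
        self.available -= 1
        try:
            # `factory` stands in for whatever creates the pooled resource.
            return await factory()
        except BaseException:
            # Creation failed (or was cancelled): hand the slot back, then re-raise.
            self.available += 1
            self._sem.release()
            raise

    def release(self) -> None:
        self.available += 1
        self._sem.release()
```

Catching `BaseException` rather than `Exception` also covers `asyncio.CancelledError`, so a cancelled acquisition returns its slot as well.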
Stats
- 4,014 tests passing (+31 net, +48 new)
- 22 files changed, 566 insertions, 84 deletions
See CHANGELOG for full details.