
Release history

Retio-ai/pagemap releases

Compresses ~100K-token HTML into 2-5K-token structured maps while preserving every actionable element. AI agents can read and interact with any web page at 97% fewer tokens.

All releases

14 shown

Review required
v1.1.1 Maintenance
Dependencies Auth

Dependency floor refresh

No immediate action
v1.1.0 Feature

Delta Intelligence evidence packet

v1.0.0 New feature
Security fixes
  • SSRF defense, prompt injection defense, robots.txt compliance, resource guards
Notable features
  • 13 MCP tools (e.g., get_page_map, execute_action, fill_form)
  • 16 auto-detected page types and extraction optimizations
  • Support for 60+ e-commerce sites across multiple regions
Full changelog

PageMap v1.0.0 — First Public Release

The browsing MCP server that fits in your context window. Compresses ~100K-token HTML into a 2-5K-token structured map while preserving every actionable element.

Highlights

  • 13 MCP tools — get_page_map, execute_action, fill_form, scroll_page, wait_for, take_screenshot, get_page_state, navigate_back, batch_get_page_map, open_tab, switch_tab, list_tabs, close_tab
  • 16 page types auto-detected with optimized extraction
  • 60+ e-commerce sites supported across Global, Korea, Japan, China
  • 8 JSON-LD schemas — Product, NewsArticle, VideoObject, FAQPage, Event, LocalBusiness, BreadcrumbList, ItemList
  • 10 languages with locale auto-detection and CJK token budget adjustment
  • 2-layer caching — cache hit (~10ms), content refresh (~500ms), full rebuild (~1.5s)
  • Security hardened — SSRF defense, prompt injection defense, robots.txt compliance, resource guards

Install

pip install retio-pagemap

MCP Client Config

{
  "mcpServers": {
    "pagemap": {
      "command": "uvx",
      "args": ["retio-pagemap"]
    }
  }
}

Docker

docker run -p 8000:8000 retio1001/pagemap --transport http

Full documentation: https://github.com/Retio-ai/Retio-pagemap#readme

v0.7.3 Breaking risk
Notable features
  • CreditMiddleware deducts credits per tool call and returns HTTP 402 with RFC 9457 `problem+json` when balance is insufficient
  • RedisRateLimiter implements a token‑bucket algorithm via an atomic Lua script, replacing the in‑process RateLimiter for multi‑worker deployments; selectable at runtime via RateLimiterProtocol
  • Paddle payment infrastructure adds webhook middleware with HMAC‑SHA256 verification, SQLite credit repository (schema v2), three tiered credit packs ($10/500, $25/1500, $50/5000) and four related telemetry events
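
The token-bucket idea named above can be illustrated with a minimal in-process sketch. This is not the project's RedisRateLimiter or its Lua script; the class name `TokenBucket` and the parameters `capacity`, `refill_rate`, and `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative in-process token bucket (hypothetical, not PageMap's RedisRateLimiter)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock             # injectable so the behavior can be tested
        self._tokens = capacity         # the bucket starts full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill according to elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With several worker processes, the refill-and-spend step has to be atomic across all of them, which is presumably why the release moves that step into a Lua script executed on Redis.
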
Full changelog

[0.7.3] - 2026-02-26

Added

  • Credit debit middleware: CreditMiddleware ASGI middleware deducts credits per tool call before dispatching to the MCP handler. Integrates with CreditRepositoryProtocol (SQLite + in-memory). Returns HTTP 402 with RFC 9457 problem+json body when balance is insufficient
  • Redis distributed rate limiter: RedisRateLimiter implements token-bucket algorithm via a Lua script executed atomically on Redis. Replaces in-process RateLimiter for multi-worker deployments. RateLimiterProtocol allows runtime selection between in-memory and Redis backends
  • CQP agent behavior signal events — Two new telemetry event types for tracking agent tool usage patterns: TOOL_CALL_SEQUENCE (session-level tool sequence with timing deltas) and TOOL_DISAGREEMENT (consecutive-same-tool and same-URL-recall signals). TypedDict payloads + builder functions added to telemetry/events.py
  • OtlpHttpExporter — Cloud telemetry exporter with OTLP-JSON over HTTP, gzip compression, retryable 429/502/503/504 responses, Retry-After header parsing, exponential backoff + jitter
  • config.py 3-layer telemetry config: TelemetryConfig resolves settings from YAML file → environment variables → CLI flags in priority order. Supports sample_rate, batch_size, flush_interval
  • FanOutWriter — Multiplexes telemetry events to multiple writers (e.g., local JSONL + remote OTLP) simultaneously
  • Paddle payment infrastructure: src/pagemap/paddle/ module: webhook.py (ASGI middleware, HMAC-SHA256 signature verification, 30 s replay tolerance, idempotency gate via event_id), signature.py (constant-time hmac.compare_digest), credits.py (CreditRepositoryProtocol, SQLite schema v2 with CHECK ≥ 0, BEGIN IMMEDIATE atomic writes), products.py (3 credit pack tiers: $10/500, $25/1500, $50/5000), checkout.py (paddle-python-sdk lazy import), config.py (PaddleConfig from env). 4 telemetry events: PADDLE_WEBHOOK_RECEIVED, PADDLE_CREDITS_ADDED, PADDLE_WEBHOOK_INVALID, PADDLE_WEBHOOK_DUPLICATE
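
The webhook checks listed above (HMAC-SHA256, a 30 s replay window, constant-time comparison) can be sketched roughly as follows. The function name `verify_webhook`, its parameters, and the `timestamp:body` signed-payload layout are assumptions made for illustration; the project's webhook.py and Paddle's actual header format may differ.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(secret: bytes, timestamp: int, body: bytes, signature_hex: str,
                   tolerance_s: int = 30, now: Optional[float] = None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance_s:
        return False  # outside the replay window
    # Assumed layout of the signed payload: "<timestamp>:<raw body>".
    signed = str(timestamp).encode() + b":" + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the position of the first mismatching byte.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

An idempotency gate would sit after this check, for example by recording each accepted event_id and ignoring repeats; that part is not modeled here.
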

Changed

  • _to_float() European thousand-separator parsing — Single-separator strings like "1.500" are now correctly parsed as 1500.0 when the separator is in a thousands position (3 digits follow). Previously returned 1.5
  • BBC News pre-AOM portal hint: bbc.co.uk and bbc.com domains are now classified as news portals before AOM processing, ensuring the news portal compressor is applied even on pages where <article> count is low
  • Inline element boundary spacing: Inline tags (<a>, <strong>, <em>, <span>, <b>, <i>) now insert a space at their boundary during text extraction, preventing word concatenation artifacts (e.g., "priceitem" → "price item")
  • product_detail option UI preservation — Option selector elements (size/color dropdowns, radio buttons) are now rescued from AOM removal for product_detail pages, recovering 47 → 100+ tokens of structured option information
  • Page classifier: category listing fix — Category index pages (e.g., /category/women) are now correctly classified as listing rather than article. Scoring weight for path-based listing signals increased
  • Semaphore pool slot leak fixed: BrowserPool no longer accesses Semaphore._value (private CPython attribute). Slot count is now tracked via an explicit _available counter, eliminating AttributeError on non-CPython runtimes and future CPython versions
  • ServeHelpAction parsing stabilization: _ServeHelpAction.__call__ now catches SystemExit raised by argparse during help generation; help text is always printed even if the subparser raises
  • Sensitive tests moved to tests/private/ — Auth, rate limiter, billing, telemetry, and SSRF telemetry test files relocated to tests/private/ (excluded from public release). release.sh updated accordingly
  • 4377 → 4735 tests passing (+358)
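
The separator rule described for _to_float() can be illustrated with a small sketch. The function below is a hypothetical reimplementation written for this example, not the project's code; in particular, the regular expression that decides what counts as a "thousands position" is an assumption.

```python
import re

# One to three leading digits (no leading zero) followed by one or more
# groups of exactly three digits, all joined by the same separator.
_THOUSANDS_ONLY = re.compile(r"[1-9]\d{0,2}(?:\.\d{3})+|[1-9]\d{0,2}(?:,\d{3})+")


def to_float(text: str) -> float:
    """Parse a number that may use '.' or ',' as thousands or decimal separator."""
    s = text.strip()
    if "." in s and "," in s:
        # Both present: whichever appears last is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif _THOUSANDS_ONLY.fullmatch(s):
        # Single separator type in a thousands position, e.g. "1.500".
        s = s.replace(".", "").replace(",", "")
    else:
        # Otherwise treat a lone comma as a decimal mark, e.g. "1,5".
        s = s.replace(",", ".")
    return float(s)
```

A string such as "1.500" is inherently ambiguous; this sketch follows the rule stated above and reads it as 1500.0, while "0.500" and "1.5" remain decimals.
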

Fixed

  • Unused variable assignments removed from test_redis_rate_limiter.py and test_ssrf_telemetry.py
  • Import sort order corrected in rate_limiter.py, redis_rate_limiter.py, and related test files (ruff I001)
v0.7.2 Breaking risk
⚠ Upgrade required
  • OG image/thumbnail URLs now validated with _is_valid_url(); javascript: and data: URLs are rejected
  • _is_inside_article_or_main() uses O(1) lookup via pre‑computed set; no external action required but improves performance on large documents
  • VideoObject schema overrides compressor selection in _SCHEMA_OVERRIDES; existing configurations remain compatible
Security fixes
  • Added max_depth=5 parameter to _find_type_in_jsonld() to prevent RecursionError from maliciously nested @graph structures (DoS vector)
  • Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name and _parse_h1() return value — eliminates prompt injection vector
Notable features
  • _extract_price_from_html() lxml DOM fallback price extractor
  • _extract_video_meta_from_dom() class‑name and regex video metadata extraction
  • News portal detection & compression via _is_news_portal() / _compress_for_news_portal()
Full changelog

What's Changed

Security

  • _find_type_in_jsonld() recursion depth limit — Added max_depth=5 parameter to prevent RecursionError from maliciously nested @graph structures (DoS vector)
  • metadata.py field sanitization — Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name, and _parse_h1() return value — eliminates prompt injection vector from 8+ previously unsanitized fields
  • OG image/thumbnail URL validation — Applied _is_valid_url() to image_url/thumbnail_url OG fields; javascript: and data: URLs no longer pass through
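
A depth-limited search of the kind described for _find_type_in_jsonld() might look like the sketch below. The function name, the traversal order, and the way the depth budget is spent are assumptions for illustration; only the max_depth=5 default comes from the notes above.

```python
from typing import Any, Optional


def find_type_in_jsonld(node: Any, target: str, max_depth: int = 5) -> Optional[dict]:
    """Return the first dict whose @type matches `target`, or None.

    Each descent into a list or an @graph value consumes one level of the
    depth budget, so hostile nesting ends the search instead of exhausting
    the interpreter's recursion limit.
    """
    if max_depth < 0:
        return None
    if isinstance(node, list):
        for item in node:
            found = find_type_in_jsonld(item, target, max_depth - 1)
            if found is not None:
                return found
        return None
    if isinstance(node, dict):
        types = node.get("@type")
        if types == target or (isinstance(types, list) and target in types):
            return node
        return find_type_in_jsonld(node.get("@graph"), target, max_depth - 1)
    return None
```

A document nested thousands of levels deep simply yields None here rather than raising RecursionError.
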

Added

  • _extract_price_from_html() — lxml DOM-based last-resort price extractor from raw/pruned HTML; priority: a-offscreen text > price-class text_content() > aria-label (handles Amazon nested span structures)
  • _extract_video_meta_from_dom() — Class-name and regex-based video metadata extraction from heading chunks (channel, view_count, duration); called as last-resort fallback for VideoObject schema
  • _is_news_portal() / _compress_for_news_portal() — Detects news portal pattern (≥3 <article> elements or ≥3 headline links) in dashboard-classified pages; dedicated numbered headline list compressor with optional per-article summaries (BBC News improvement)
  • pagemap serve --help forwarded options: _ServeHelpAction + _get_server_options_help() dynamically append server options (e.g. --transport, --port, --allow-local) to pagemap serve --help output
  • 3 new test files: test_hn_regression.py, test_news_portal_compression.py, test_pruned_context_builder_fixes.py

Fixed

  • HN/forum table-based grid whitelist — Added table/tbody to _GRID_CONTAINER_TAGS; table-based content listings (Hacker News, forums) now receive link-density penalty exemption (fixes HN content regression: 1,010 → 511 tok in v0.7.1)
  • AOM content rescue detached parent — Added tree.getpath(parent) check before re-inserting rescued elements; skips rescue when parent was detached during removal phase (prevents silent ghost subtree insertion)
  • removed_nodes stat overcounting — Rescued nodes are now subtracted from removed_nodes and removal_reasons counters (previously inflated telemetry)
  • _extract_price_from_dom_chunks() empty-text fallback — Added aria-label and data-* attribute fallbacks when chunk text is empty
  • _to_float() European number format: "1.500,99" now correctly parses as 1500.99 instead of 1.5
  • _to_int() silent truncation: Changed from int(float) to round(float); "4.9" → 5 instead of 4 (fixes silent corruption in reviewCount etc.)
  • _extract_price_from_offers() zero-price falsy: lowPrice or price pattern replaced with explicit None check; price=0 is preserved correctly
  • _extract_image_url() ImageObject dict — Added support for "image": {"@type": "ImageObject", "url": "..."} pattern (previously missed)
  • extract_metadata() pruned_html parameter — Product and VideoObject schemas receive pruned HTML for lxml-based price extraction fallback; VideoObject also gets DOM-based metadata fallback
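
Two of the fixes above come down to small Python pitfalls, sketched here under assumed names (`pick_price` and `to_int` are hypothetical, not the project's functions). The first shows why `a or b` silently drops a legitimate price of 0; the second shows truncation versus rounding.

```python
from typing import Optional


def pick_price(offers: dict) -> Optional[float]:
    """Prefer lowPrice, fall back to price, and keep a legitimate 0."""
    low = offers.get("lowPrice")
    # `low or offers.get("price")` would skip lowPrice=0 because 0 is falsy.
    value = low if low is not None else offers.get("price")
    return None if value is None else float(value)


def to_int(text: str) -> int:
    """Round to the nearest integer instead of truncating toward zero."""
    return round(float(text))
```

Note that Python's built-in round() rounds exact halves to the nearest even integer, so "4.5" becomes 4 with this sketch.
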

Changed

  • _is_inside_article_or_main() O(1) lookup — Pre-computes _article_main_descendants set before filtering loop; passes to _compute_weight() via new article_main_descendants parameter. Eliminates O(nodes × depth) traversal on large documents
  • _compress_for_dashboard() news portal delegation — Now accepts doc parameter and delegates to _compress_for_news_portal() when _is_news_portal() detects news portal structure
  • VideoObject added to _SCHEMA_OVERRIDES — VideoObject schema now overrides page_type-based compressor selection; _SCHEMA_OVERRIDES moved to module level (eliminates per-call frozenset recreation)
  • Phase 4 product price regex — Replaced inline regex with pre-compiled _PRICE_CLASS_RE (named groups, handles single-quote class attributes); extracted price injected back into metadata dict for downstream consumers
  • VideoObject itemprop author → channel: Added "author": "channel" to _ITEMPROP_FIELD_MAP["VideoObject"]
  • VideoObject OG og:site_name removed: Removed og:site_name → channel mapping (was incorrectly using site name e.g. "YouTube" as channel name)
  • Video description CJK budget factor — Reduced from 0.95 to 0.85 to account for CJK-heavy descriptions (~1.5 chars/token vs English ~4 chars/token); _truncate_to_tokens() guard handles overshoot
  • 4014 → 4194 tests passing (+180)

Full Changelog: https://github.com/Retio-ai/Retio-pagemap/blob/main/CHANGELOG.md

v0.7.1 Breaking risk
⚠ Upgrade required
  • `pagemap serve --transport http --port <PORT>` now functional; CLI forwards remaining arguments to the server.
  • CLI help text cleaned: duplicate "build build" epilog removed, `retio-pagemap --help` restored.
Notable features
  • Article extraction now yields 400+ tokens (up from 84) through a budget‑based compressor
  • YouTube page type added with VideoObject metadata parser and formatted K/M number display
  • Amazon price extraction enhanced via DOM fallback for `a-price`/`a-offscreen` classes
Full changelog

Content Extraction Quality + DX Improvements

Resolves 11 of 13 issues from QUALITY_REVIEW_v0.7.0. Core product value — content extraction — significantly improved for article, video, and product pages.

Highlights

  • Article content: 84 → 400+ tokens — Readability.js-inspired AOM filter exemption for <p> tags inside <article>/<main> (long paragraphs survive link-density penalty). Budget-based article compressor replaces fixed 2-paragraph limit.
  • YouTube: video page type + rich metadata — New page classifier with URL/meta/DOM/JSON-LD signals. VideoObject metadata parser extracts views, likes, channel, duration. Dedicated video compressor with K/M number formatting.
  • Amazon price extraction — DOM price fallback scans a-price/a-offscreen class patterns. Product schema content rescue enhanced. Product compressor price fallback from pruned HTML.
  • pagemap serve HTTP transport: pagemap serve --transport http --port 8000 now works. CLI forwards remaining args to server.
  • DX fixes: --help epilog "build build" duplicate removed, retio-pagemap --help restored, README version updated.
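
The K/M number formatting mentioned for the video compressor could be approximated as below. The function name `format_count` and the exact rounding and suffix rules are assumptions; the notes do not specify the project's output format.

```python
def format_count(n: int) -> str:
    """Abbreviate large counts with a K or M suffix and at most one decimal."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}".rstrip("0").rstrip(".") + "M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}".rstrip("0").rstrip(".") + "K"
    return str(n)
```

Shortening view and like counts this way spends fewer tokens per number while keeping the order of magnitude readable for an agent.
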

Bug Fixes

  • session_manager resource leak: pool.acquire() exception now properly releases semaphore slot
  • Card detection entity leak: &amp; entities no longer leak into agent output (e.g., "H&M")

Stats

  • 4,014 tests passing (+31 net, +48 new)
  • 22 files changed, 566 insertions, 84 deletions

See CHANGELOG for full details.

v0.7.0 New feature
Notable features
  • HTTP Transport: `--transport http` flag enabling Streamable HTTP, structlog JSON logging, K8s probes, API Gateway middleware, and graceful drain
  • 4‑Layer Security Middleware adding Rate Limiting, API Key Auth, TLS 1.3 enforcement, session isolation, browser context recycling, per‑session resource quotas
  • RFC 9457 Error Standardization providing structured 15‑type error taxonomy with secret masking
Full changelog

v0.7.0 — HTTP Transport + Security + Quality

The biggest release yet: HTTP transport, production-grade security, and comprehensive quality improvements.

Highlights

  • HTTP Transport: --transport http flag with Streamable HTTP, structlog JSON logging, K8s probes, API Gateway middleware, graceful drain
  • 4-Layer Security Middleware — Gateway → Rate Limiting → API Key Auth → Security Headers. TLS 1.3 enforcement, session isolation, browser context recycling, per-session resource quotas
  • Quality Review (12 issues resolved) — Content extraction overhaul, captcha/WAF detection, interactable noise filtering, 4-phase image pipeline, Unicode script language filtering, CLI UX improvements
  • RFC 9457 Error Standardization — Structured error responses with 15-type taxonomy and secret masking
  • i18n Expansion — 6 → 10 locales (added zh, es, it, pt, nl) with full detection terms
  • JSON-LD Schema Expansion — NewsArticle, BreadcrumbList, FAQPage, Event, LocalBusiness + SaaS/Government/Wiki compressors
  • robots.txt Compliance — RFC 9309 checker with fail-open semantics
  • 3,983 tests (+1,043 from v0.6.0)
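
The "fail-open" behavior mentioned for robots.txt compliance can be sketched with the standard library parser. This is only an illustration: `is_allowed` is a hypothetical helper, and urllib.robotparser stands in for the project's own RFC 9309 checker, which may interpret rules differently.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: Optional[str], url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content.

    `robots_txt` is None when the file could not be fetched; in that case the
    check fails open and allows the request.
    """
    if robots_txt is None:
        return True  # fail-open: an unavailable robots.txt does not block browsing
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Failing open trades strictness for availability: a transient fetch error does not stop the agent, whereas a fail-closed policy would block every page on that host until robots.txt became reachable.
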

Full changelog

See CHANGELOG.md for the complete list of changes.

v0.6.0 New feature
⚠ Upgrade required
  • Configure PAGEMAP_MAX_TEXT_BYTES and PAGEMAP_MAX_IMAGE_BYTES env vars to adjust MCP response size limits
  • New telemetry events (pagemap.guard.resource_triggered, pagemap.tool.error) are emitted for resource guards and errors
Notable features
  • DOM node guard: raises ResourceExhaustionError when >50K nodes
  • HTML size limit: caps page.content() at 5 MB across all build paths
  • MCP response size guards: text ≤1 MB and screenshot ≤5 MB, configurable via env vars
Full changelog

[0.6.0] - 2026-02-23

Added

  • Phase α (RequestContext extraction): RequestContext frozen dataclass + _create_stdio_context() helper. All 9 tool _impl functions accept ctx keyword-only parameter, zero direct _state references achieved. Foundation for Phase β (HTTP transport)
  • DOM node guard: ResourceExhaustionError when DOM exceeds 50K nodes. Single getComputedStyle()-based evaluate combines node counting + hidden element detection
  • HTML size limit — 5MB cap on page.content(). Applied to all 3 code paths: build_page_map_live, build_page_map_from_page, rebuild_content_only
  • Hidden content 2-layer detection — (1) JS getComputedStyle() DOM removal (display:none, visibility:hidden, opacity:0, font-size:0, off-screen), (2) AOM filter inline style patterns. 43 new tests
  • MCP response size guards — text 1MB (PAGEMAP_MAX_TEXT_BYTES) + screenshot 5MB (PAGEMAP_MAX_IMAGE_BYTES). Configurable via env vars, truncation with recovery hint tail marker, telemetry event emission
  • Agent-friendly error messages: _safe_error() + _RECOVERY_HINTS recovery hint system. Actionable hints on all error paths
  • TOOL_ERROR telemetry: pagemap.tool.error event emitted on _safe_error() calls, enriched with session_id
  • Resource guard telemetry: pagemap.guard.resource_triggered, pagemap.guard.response_size_exceeded event types with TypedDict payloads and builder functions
  • Docker infrastructure: Dockerfile 2-stage build (uv + Playwright + non-root user), docker-compose.yml, .github/workflows/docker.yml CI/CD
  • GitHub Codespaces / Devcontainer: .devcontainer/devcontainer.json (MS Playwright image + uv feature + 3 VS Code extensions)
  • Cursor Marketplace plugin: .cursor-plugin/plugin.json + rules/ + mcp.json
  • VS Code MCP Gallery support — OCI package entry added to server.json
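
The text response size guard described above might work along these lines. The environment variable name PAGEMAP_MAX_TEXT_BYTES comes from the notes; the function name `guard_text`, the default of 1,000,000 bytes, and the marker wording are assumptions for illustration.

```python
import os
from typing import Optional

DEFAULT_MAX_TEXT_BYTES = 1_000_000  # assumed reading of the "1MB" default
TRUNCATION_MARKER = "\n[truncated: response exceeded size limit]"  # hypothetical wording


def guard_text(text: str, max_bytes: Optional[int] = None) -> str:
    """Return `text` unchanged if it fits, else a truncated copy with a tail marker."""
    if max_bytes is None:
        max_bytes = int(os.environ.get("PAGEMAP_MAX_TEXT_BYTES", DEFAULT_MAX_TEXT_BYTES))
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # Reserve room for the marker so the final result stays within the limit.
    budget = max(0, max_bytes - len(TRUNCATION_MARKER.encode("utf-8")))
    # Cutting on a byte boundary can split a multi-byte character;
    # errors="ignore" drops the incomplete trailing bytes.
    kept = raw[:budget].decode("utf-8", errors="ignore")
    return kept + TRUNCATION_MARKER
```

Measuring in bytes rather than characters matters for CJK-heavy pages, where one character typically occupies three bytes in UTF-8.
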

Changed

  • CI/CD hardening — GitHub Actions SHA pinning, pip-audit, bandit -r, CodeQL SAST, Dependabot (pip + github-actions + docker), gitleaks pre-commit hook, weekly latest-deps CI job
  • Guard helper extraction: _check_html_size(), _check_resource_limits() refactored into standalone functions
  • Telemetry async flush: flush_async prevents event loop blocking
  • 2198 → 2266 tests passing (+68)

Fixed

  • font-size:0 regex false positive — valid values like 0.5em, 0.875rem were incorrectly matched as font-size:0. 43 regression tests added
  • release.sh pyproject.toml readme path patching
  • ruff lint/format fixes for telemetry and test files
v0.5.2 New feature
Notable features
  • Chromium auto‑install eliminates manual `playwright install` steps
  • Claude Code Plugin integration via `claude plugin add pagemap`
  • Six client configurations (Claude Code, Cursor, Claude Desktop for macOS and Windows, VS Code Copilot, Windsurf) and ten ready‑to‑use example prompts
Full changelog

Benchmark leap: 63.6% → 84.7% task success

PageMap now leads all competitors in both accuracy and token efficiency across 94 tasks on 16 sites.

Highlights

  • 84.7% task success — +21.1%p from v0.5.0, +23.3%p ahead of Playwright MCP/CLI, +20.2%p ahead of Firecrawl
  • 5.1x fewer tokens than Playwright/Firecrawl at higher accuracy
  • 20.3x fewer tokens than full Playwright HTML
  • Chromium auto-install — no manual playwright install needed
  • Claude Code Plugin: claude plugin add pagemap
  • 6 client configs — Claude Code, Cursor, Claude Desktop (macOS/Windows), VS Code Copilot, Windsurf
  • 10 example prompts — ready-to-use prompts across 5 categories

Engine fixes

  • Wikipedia page type misclassification resolved
  • Product detail pruned_context enrichment (was 61 chars → full content)
  • Pruning error hierarchy, hostname matching, content_hash fixes
  • Snapshot recovery for 5 sites (H&M, Zara, COS, W Concept, SSF Shop)
  • AOM filter improvements: role="main" detection, e-commerce patterns

Full changelog

See CHANGELOG.md for details.


Install: uvx retio-pagemap or pip install retio-pagemap==0.5.2

v0.5.0 Breaking risk
Breaking changes
  • License updated from MIT to AGPL-3.0-only, requiring downstream projects to comply with AGPL-3.0 conditions.
Notable features
  • Diff-based updates output only changed sections (cache hit ~100ms, refresh ~500ms, full rebuild ~1.5s)
  • URL-based PageMap cache with 2‑layer LRU (20 entries) and 90‑second TTL
  • CJK token penalty correction for Korean (9.4x penalty compensated via language‑aware weights)
Full changelog

What's New

Core Features

  • Diff-based updates: to_agent_prompt_diff() outputs changed sections only, 3-tier rebuild: cache hit (~100ms) / content refresh (~500ms) / full rebuild (~1.5s)
  • URL-based PageMap cache — 2-layer architecture (active + URL LRU 20 entries), TTL 90s, DOM fingerprint validation
  • CJK token penalty correction — Korean 9.4x penalty compensated via language-aware budget weights
  • batch_get_page_map tool — parallel multi-URL processing (max 5 concurrent), per-tab 60s + global 120s timeout
  • Session concurrency guard: tool_lock serializes all 9 MCP tool handlers
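
The URL cache parameters listed above (20 LRU entries, 90 s TTL) can be illustrated with a small sketch. `UrlCache` and its method names are hypothetical; the project's cache additionally has an active-page layer and DOM fingerprint validation, neither of which is modeled here.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class UrlCache:
    """Illustrative LRU cache keyed by URL with a per-entry time-to-live."""

    def __init__(self, max_entries: int = 20, ttl_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self._clock = clock
        self._items: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def get(self, url: str) -> Optional[Any]:
        entry = self._items.get(url)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_s:
            del self._items[url]  # expired: drop it and report a miss
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return value

    def put(self, url: str, value: Any) -> None:
        self._items[url] = (self._clock(), value)
        self._items.move_to_end(url)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry
```

A short TTL keeps stale page maps from being served for long, while the LRU bound caps memory when an agent visits many URLs in one session.
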

Pruning & Latency

  • FORM/MEDIA chunk pruning restoration, aside filter sidebar preservation
  • Orchestrator parallelization, dead regex removal, CDP session reuse, dynamic navigation wait
  • Hybrid networkidle strategy (load → 6s budget → DOM settle fallback)

Benchmark (94 tasks, 7 conditions)

| | PageMap | Playwright MCP | Firecrawl | Jina Reader |
|--|:------:|:---------:|:-----------:|:--------:|
| Task success | 63.6% | 61.5% | 64.5% | 57.8% |
| Avg tokens | 2,403 | 13,737 | 13,886 | 11,423 |
| Cost (94 tasks) | $0.97 | $4.09 | $3.97 | $2.26 |

Comparable accuracy, 5.7x fewer tokens, and the only tool with full interaction support.
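
The token ratio quoted above can be checked directly from the "Avg tokens" row of the table. The snippet only restates the table's figures; the variable names are arbitrary.

```python
# Average tokens per task, copied from the benchmark table above.
avg_tokens = {
    "PageMap": 2403,
    "Playwright MCP": 13737,
    "Firecrawl": 13886,
    "Jina Reader": 11423,
}

# How many times more tokens each alternative uses than PageMap.
ratios = {
    name: tokens / avg_tokens["PageMap"]
    for name, tokens in avg_tokens.items()
    if name != "PageMap"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The 5.7x figure corresponds to the Playwright MCP column; by the same calculation Firecrawl comes out at about 5.8x and Jina Reader at about 4.8x.
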

Other

  • License: MIT → AGPL-3.0-only — all source files updated with SPDX headers
  • 1899 tests passing (+751 from v0.3.0)
  • MCP server tool count: 8 → 9

See CHANGELOG for full details.

v0.4.0 Breaking risk
Breaking changes
  • License updated from MIT to AGPL-3.0-only
Full changelog

Changed

  • License: MIT → AGPL-3.0-only — SPDX headers on all source files, classifier and badge updates
  • Template cache for domain+page_type structural knowledge
  • CJK token budget optimization
  • Phase 4 interactable-pruning coherence improvements

For commercial licensing, contact [email protected]

v0.3.0 New feature
Notable features
  • Added `scroll_page` tool for scrolling by page, half-page, or pixel amount
  • Added `take_screenshot` tool to capture viewport or full-page screenshots
  • Added `navigate_back` tool to go back in browser history
Full changelog

What's New

New MCP Tools (P6) — 3 → 8 tools

| Tool | Description |
|------|-------------|
| scroll_page | Scroll up/down by page, half-page, or pixel amount |
| take_screenshot | Capture viewport or full-page screenshot |
| navigate_back | Go back in browser history |
| fill_form | Batch-fill multiple form fields in one call |
| wait_for | Wait for text to appear or disappear on the page |
| hover (action) | Hover on any interactive element |

Reliability Improvements

  • Popup/new tab auto-handling — auto-detected, SSRF-checked, and switched to
  • JS dialog auto-handling — alert auto-accepted, confirm/prompt auto-dismissed, content reported to agent
  • Same role:name disambiguation — CSS selector fallback for duplicate labels

Full Changelog

See CHANGELOG.md for details.

1148 tests passing (+210 from v0.2.0)

v0.2.0 New feature
⚠ Upgrade required
  • --allow-local CLI flag (and PAGEMAP_ALLOW_LOCAL env var) grants loopback, RFC 1918 and IPv6 ULA access; cloud metadata endpoints remain blocked
  • _validate_url() now checks cloud metadata hosts first and respects --allow-local
  • _validate_resolved_ips() prioritizes cloud metadata IPs and exempts local IPs when --allow-local is active
Security fixes
  • SSRF defense added: _normalize_ip() pure‑arithmetic parsing, pre/post DNS resolve & validation, route guard and navigation block
  • Browser hardening: Chromium launch args (`--block-new-web-contents`, WebRTC leak prevention), service workers blocked, markdown injection neutralization
Notable features
  • 3‑strategy locator fallback (role → CSS selector → degraded role) with CSS selector field on Interactable
  • Action retry logic up to 2 retries, DOM change detection via structural fingerprinting
  • Overall execute_action timeout of 30 seconds using asyncio.wait_for
Full changelog

What's New in v0.2.0

Added

  • execute_action reliability overhaul (P2)
    • 3-strategy locator fallback chain: get_by_role(exact) → CSS selector → role(.first, degraded)
    • CSS selector field on Interactable (Tier 1-2 CDP-based + Tier 3 batch JS inline generation)
    • Action retry logic: up to 2 retries with 15s wall-clock budget and locator re-resolution; click retried only on pre-dispatch failures
    • DOM change detection: pre/post structural fingerprint comparison catches URL-stable DOM mutations (modals, SPA navigations)
    • Overall execute_action timeout (30s) via asyncio.wait_for
    • Browser death detection (TargetClosedError, connection lost) with automatic _last_page_map invalidation and recovery guidance
    • Affordance-action compatibility pre-check (e.g. type on a button blocked early with suggested action)
    • Tier 3 CDP N+1 elimination: per-element 4x sequential CDP calls → single batch Runtime.evaluate
  • SSRF 4-layer defense (S2)
    • _normalize_ip() pure-arithmetic parsing (octal/hex/decimal bypass defense)
    • Pre-nav DNS resolve + IP validation (_resolve_dns + _validate_resolved_ips, dual is_global check)
    • Post-nav DNS revalidation (redirect chain TOCTOU mitigation)
    • Context route guard (install_ssrf_route_guard, document/subdocument JS navigation blocking)
    • Post-action navigation SSRF check with about:blank redirect on block
  • Browser hardening (S3)
    • Chromium launch args hardening (--block-new-web-contents, WebRTC IP leak prevention, ServiceWorker disable)
    • Context options hardening (service_workers="block", accept_downloads=False)
    • Internal protocol blocking expanded (view-source://, blob:, data:, about: — page-level → context-level)
    • Markdown injection defense (javascript:/vbscript:/data:/blob: URI neutralization)
  • --allow-local flag for local development (P6)
    • --allow-local CLI flag: opt-in access to loopback (127.x, ::1), RFC 1918 (10.x, 172.16-31.x, 192.168.x), IPv6 ULA (fc00::/7)
    • PAGEMAP_ALLOW_LOCAL env var: alternative for containerized deployments
    • Cloud metadata endpoints (169.254.x.x, metadata.google.internal) remain unconditionally blocked
  • AX tree failure isolation (P8): detect_interactables_ax() failure no longer crashes entire build; graceful degradation returns pruning results only
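
The --allow-local policy described above (local ranges opt-in, cloud metadata always blocked) can be sketched with the standard ipaddress module. `is_ip_allowed` is a hypothetical helper, not the project's _validate_resolved_ips(), and it does not reproduce the octal/hex normalization done by _normalize_ip().

```python
import ipaddress

# Link-local range that hosts cloud metadata endpoints; blocked unconditionally.
_METADATA_NET = ipaddress.ip_network("169.254.0.0/16")

# Ranges that --allow-local opts into, as listed in the notes above.
_LOCAL_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "::1/128", "10.0.0.0/8",
                 "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")
]


def is_ip_allowed(ip: str, allow_local: bool = False) -> bool:
    """Decide whether a resolved IP address may be contacted."""
    addr = ipaddress.ip_address(ip)
    # Metadata check comes first so that allow_local can never unlock it.
    if addr.version == 4 and addr in _METADATA_NET:
        return False
    if any(addr.version == net.version and addr in net for net in _LOCAL_NETS):
        return allow_local
    # Everything else must be a globally routable address.
    return addr.is_global
```

Ordering matters in this sketch: because the metadata range is tested before the local exemption, enabling local access for development cannot expose 169.254.x.x.
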

Changed

  • _validate_url(): cloud metadata hosts checked first (always blocked), BLOCKED_HOSTS now respects --allow-local
  • _validate_resolved_ips(): cloud metadata IPs prioritized, _is_local_ip() exemption for --allow-local
  • main(): extracted _parse_server_args(), SECURITY warning logged when --allow-local is active
  • 938 tests passing (+332 from v0.1.3)

Full changelog: https://github.com/Retio-ai/Retio-pagemap/compare/v0.1.3...v0.2.0

v0.1.3 Feature
Notable features
  • GitHub Actions CI pipeline for lint and test across Python 3.11‑3.13 matrix
  • Automated PyPI publishing CD pipeline via GitHub Release using OIDC trusted publishers
Full changelog

Added

  • GitHub Actions CI pipeline (lint + test, Python 3.11/3.12/3.13 matrix)
  • CD pipeline for automated PyPI publishing via GitHub Release (OIDC trusted publishers)
  • CI badge in README

Changed

  • Applied ruff format to entire codebase
  • Excluded internal config.yaml from public release

Beta — feedback welcome: [email protected]