Release history
Retio-ai/pagemap releases
Compresses ~100K-token HTML into 2-5K-token structured maps while preserving every actionable element. AI agents can read and interact with any web page using 97% fewer tokens.
- SSRF defense, prompt injection defense, robots.txt compliance, resource guards
- 13 MCP tools (e.g., get_page_map, execute_action, fill_form)
- 16 auto-detected page types and extraction optimizations
- Support for 60+ e-commerce sites across multiple regions
Full changelog
PageMap v1.0.0 — First Public Release
The browsing MCP server that fits in your context window. Compresses ~100K-token HTML into a 2-5K-token structured map while preserving every actionable element.
Highlights
- 13 MCP tools — get_page_map, execute_action, fill_form, scroll_page, wait_for, take_screenshot, get_page_state, navigate_back, batch_get_page_map, open_tab, switch_tab, list_tabs, close_tab
- 16 page types auto-detected with optimized extraction
- 60+ e-commerce sites supported across Global, Korea, Japan, China
- 8 JSON-LD schemas — Product, NewsArticle, VideoObject, FAQPage, Event, LocalBusiness, BreadcrumbList, ItemList
- 10 languages with locale auto-detection and CJK token budget adjustment
- 2-layer caching — cache hit (~10ms), content refresh (~500ms), full rebuild (~1.5s)
- Security hardened — SSRF defense, prompt injection defense, robots.txt compliance, resource guards
Install
pip install retio-pagemap
MCP Client Config
{
  "mcpServers": {
    "pagemap": {
      "command": "uvx",
      "args": ["retio-pagemap"]
    }
  }
}
Docker
docker run -p 8000:8000 retio1001/pagemap --transport http
Full documentation: https://github.com/Retio-ai/Retio-pagemap#readme
- CreditMiddleware deducts credits per tool call and returns HTTP 402 with RFC 9457 `problem+json` when balance is insufficient
- RedisRateLimiter implements a token‑bucket algorithm via an atomic Lua script, replacing the in‑process RateLimiter for multi‑worker deployments; selectable at runtime via RateLimiterProtocol
- Paddle payment infrastructure adds webhook middleware with HMAC‑SHA256 verification, SQLite credit repository (schema v2), three tiered credit packs ($10/500, $25/1500, $50/5000) and four related telemetry events
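To make the token-bucket idea concrete, here is a minimal single-process sketch. It is not PageMap's `RateLimiter` or `RedisRateLimiter`; the class name, parameters, and injectable clock are assumptions for illustration. A Redis-backed variant would run the same refill-and-spend arithmetic inside a Lua script so that it executes atomically across workers.

```python
import time


class TokenBucket:
    """Hypothetical in-memory token bucket (illustration only)."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now              # injectable clock, handy for tests
        self._tokens = capacity      # start with a full bucket
        self._last = now()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        # Spend tokens only if enough are available.
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `capacity=2` and `refill_per_sec=1.0`, two calls succeed immediately, the third is rejected, and one more call is admitted after a second has passed.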
Full changelog
[0.7.3] - 2026-02-26
Added
- Credit debit middleware — `CreditMiddleware` ASGI middleware deducts credits per tool call before dispatching to the MCP handler. Integrates with `CreditRepositoryProtocol` (SQLite + in-memory). Returns HTTP 402 with RFC 9457 `problem+json` body when balance is insufficient
- Redis distributed rate limiter — `RedisRateLimiter` implements token-bucket algorithm via a Lua script executed atomically on Redis. Replaces in-process `RateLimiter` for multi-worker deployments. `RateLimiterProtocol` allows runtime selection between in-memory and Redis backends
- CQP agent behavior signal events — Two new telemetry event types for tracking agent tool usage patterns: `TOOL_CALL_SEQUENCE` (session-level tool sequence with timing deltas) and `TOOL_DISAGREEMENT` (consecutive-same-tool and same-URL-recall signals). TypedDict payloads + builder functions added to `telemetry/events.py`
- `OtlpHttpExporter` — Cloud telemetry exporter with OTLP-JSON over HTTP, gzip compression, retryable 429/502/503/504 responses, `Retry-After` header parsing, exponential backoff + jitter
- `config.py` 3-layer telemetry config — `TelemetryConfig` resolves settings from YAML file → environment variables → CLI flags in priority order. Supports `sample_rate`, `batch_size`, `flush_interval`
- `FanOutWriter` — Multiplexes telemetry events to multiple writers (e.g., local JSONL + remote OTLP) simultaneously
- Paddle payment infrastructure — `src/pagemap/paddle/` module: `webhook.py` (ASGI middleware, HMAC-SHA256 signature verification, 30 s replay tolerance, idempotency gate via `event_id`), `signature.py` (constant-time `hmac.compare_digest`), `credits.py` (`CreditRepositoryProtocol`, SQLite schema v2 with `CHECK ≥ 0`, `BEGIN IMMEDIATE` atomic writes), `products.py` (3 credit pack tiers: $10/500, $25/1500, $50/5000), `checkout.py` (`paddle-python-sdk` lazy import), `config.py` (`PaddleConfig` from env). 4 telemetry events: `PADDLE_WEBHOOK_RECEIVED`, `PADDLE_CREDITS_ADDED`, `PADDLE_WEBHOOK_INVALID`, `PADDLE_WEBHOOK_DUPLICATE`
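The webhook checks described above (HMAC-SHA256, constant-time comparison, 30 s replay tolerance) can be sketched as follows. The function name, the `timestamp:body` signed-payload layout, and the hex encoding are assumptions for illustration, not the actual contents of `signature.py`.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_signature(secret: bytes, timestamp: int, body: bytes,
                     signature_hex: str, tolerance_s: int = 30,
                     now: Optional[float] = None) -> bool:
    """Hypothetical verifier: reject stale timestamps, then compare MACs."""
    current = time.time() if now is None else now
    # Replay tolerance: refuse events whose timestamp is too far from "now".
    if abs(current - timestamp) > tolerance_s:
        return False
    # Assumed signed payload layout: "<timestamp>:<raw body>".
    signed = f"{timestamp}:".encode() + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

An idempotency gate would sit after this check, for example by recording each `event_id` and skipping events already seen.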
Changed
- `_to_float()` European thousand-separator parsing — Single-separator strings like `"1.500"` are now correctly parsed as `1500.0` when the separator is in a thousands position (3 digits follow). Previously returned `1.5`
- BBC News pre-AOM portal hint — `bbc.co.uk` and `bbc.com` domains are now classified as news portals before AOM processing, ensuring the news portal compressor is applied even on pages where `<article>` count is low
- Inline element boundary spacing — Inline tags (`<a>`, `<strong>`, `<em>`, `<span>`, `<b>`, `<i>`) now insert a space at their boundary during text extraction, preventing word concatenation artifacts (e.g., `"priceitem"` → `"price item"`)
- `product_detail` option UI preservation — Option selector elements (size/color dropdowns, radio buttons) are now rescued from AOM removal for `product_detail` pages, recovering 47 → 100+ tokens of structured option information
- Page classifier: category listing fix — Category index pages (e.g., `/category/women`) are now correctly classified as `listing` rather than `article`. Scoring weight for path-based listing signals increased
- Semaphore pool slot leak fixed — `BrowserPool` no longer accesses `Semaphore._value` (private CPython attribute). Slot count is now tracked via an explicit `_available` counter, eliminating `AttributeError` on non-CPython runtimes and future CPython versions
- `ServeHelpAction` parsing stabilization — `_ServeHelpAction.__call__` now catches `SystemExit` raised by argparse during help generation; help text is always printed even if the subparser raises
- Sensitive tests moved to `tests/private/` — Auth, rate limiter, billing, telemetry, and SSRF telemetry test files relocated to `tests/private/` (excluded from public release). `release.sh` updated accordingly
- 4377 → 4735 tests passing (+358)
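The thousands-position rule for `_to_float()` can be illustrated with a small heuristic parser. This is a sketch under assumptions, not the project's implementation: the function name `to_float` and the exact tie-breaking rules are hypothetical, and a string such as `"0.500"` would also be read as a thousands grouping here.

```python
import re


def to_float(text: str) -> float:
    """Hypothetical parser for price-like strings with European separators."""
    s = re.sub(r"[^\d.,-]", "", text)  # drop currency symbols and spaces
    has_dot, has_comma = "." in s, "," in s
    if has_dot and has_comma:
        # With both separators present, the one that appears last is the decimal mark.
        decimal = "." if s.rfind(".") > s.rfind(",") else ","
        thousands = "," if decimal == "." else "."
        s = s.replace(thousands, "").replace(decimal, ".")
    elif has_dot or has_comma:
        sep = "." if has_dot else ","
        tail = s.rsplit(sep, 1)[1]
        if s.count(sep) > 1 or len(tail) == 3:
            s = s.replace(sep, "")   # thousands grouping: exactly 3 digits follow
        else:
            s = s.replace(sep, ".")  # otherwise treat it as a decimal mark
    return float(s)
```

Under these rules `"1.500"` parses as `1500.0`, while `"4.9"` stays `4.9` because only one digit follows the separator.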
Fixed
- Unused variable assignments removed from `test_redis_rate_limiter.py` and `test_ssrf_telemetry.py`
- Import sort order corrected in `rate_limiter.py`, `redis_rate_limiter.py`, and related test files (ruff I001)
- OG image/thumbnail URLs now validated with _is_valid_url(); javascript: and data: URLs are rejected
- _is_inside_article_or_main() uses O(1) lookup via pre‑computed set; no external action required but improves performance on large documents
- VideoObject schema overrides compressor selection in _SCHEMA_OVERRIDES; existing configurations remain compatible
- Added max_depth=5 parameter to _find_type_in_jsonld() to prevent RecursionError from maliciously nested @graph structures (DoS vector)
- Applied sanitize_text() to currency, telephone, price_range, datePublished, upload_date, duration, start_date/end_date, BreadcrumbList name and _parse_h1() return value — eliminates prompt injection vector
- _extract_price_from_html() lxml DOM fallback price extractor
- _extract_video_meta_from_dom() class‑name and regex video metadata extraction
- News portal detection & compression via _is_news_portal() / _compress_for_news_portal()
Full changelog
What's Changed
Security
- `_find_type_in_jsonld()` recursion depth limit — Added `max_depth=5` parameter to prevent `RecursionError` from maliciously nested `@graph` structures (DoS vector)
- metadata.py field sanitization — Applied `sanitize_text()` to `currency`, `telephone`, `price_range`, `datePublished`, `upload_date`, `duration`, `start_date`/`end_date`, BreadcrumbList `name`, and `_parse_h1()` return value — eliminates prompt injection vector from 8+ previously unsanitized fields
- OG image/thumbnail URL validation — Applied `_is_valid_url()` to `image_url`/`thumbnail_url` OG fields; `javascript:` and `data:` URLs no longer pass through
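A depth-limited JSON-LD search along the lines of the first item might look like the sketch below. The function body is an assumption (only the `max_depth=5` default comes from the notes above), and here both lists and `@graph` hops count against the depth budget.

```python
from typing import Any, Optional


def find_type_in_jsonld(node: Any, wanted: str, max_depth: int = 5) -> Optional[dict]:
    """Hypothetical depth-limited search for a node whose @type matches."""
    if max_depth < 0:
        return None  # budget exhausted: stop instead of recursing further
    if isinstance(node, list):
        for item in node:
            found = find_type_in_jsonld(item, wanted, max_depth - 1)
            if found is not None:
                return found
        return None
    if isinstance(node, dict):
        types = node.get("@type")
        if types == wanted or (isinstance(types, list) and wanted in types):
            return node
        # Only @graph is followed in this sketch.
        return find_type_in_jsonld(node.get("@graph"), wanted, max_depth - 1)
    return None
```

A document that nests `@graph` thousands of levels deep simply yields `None` once the budget runs out, rather than raising `RecursionError`.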
Added
- `_extract_price_from_html()` — lxml DOM-based last-resort price extractor from raw/pruned HTML; priority: `a-offscreen` text > price-class `text_content()` > `aria-label` (handles Amazon nested span structures)
- `_extract_video_meta_from_dom()` — Class-name and regex-based video metadata extraction from heading chunks (channel, view_count, duration); called as last-resort fallback for VideoObject schema
- `_is_news_portal()` / `_compress_for_news_portal()` — Detects news portal pattern (≥3 `<article>` elements or ≥3 headline links) in dashboard-classified pages; dedicated numbered headline list compressor with optional per-article summaries (BBC News improvement)
- `pagemap serve --help` forwarded options — `_ServeHelpAction` + `_get_server_options_help()` dynamically append server options (e.g. `--transport`, `--port`, `--allow-local`) to `pagemap serve --help` output
- 3 new test files — `test_hn_regression.py`, `test_news_portal_compression.py`, `test_pruned_context_builder_fixes.py`
Fixed
- HN/forum table-based grid whitelist — Added `table`/`tbody` to `_GRID_CONTAINER_TAGS`; table-based content listings (Hacker News, forums) now receive link-density penalty exemption (fixes HN content regression: 1,010 → 511 tok in v0.7.1)
- AOM content rescue detached parent — Added `tree.getpath(parent)` check before re-inserting rescued elements; skips rescue when parent was detached during removal phase (prevents silent ghost subtree insertion)
- `removed_nodes` stat overcounting — Rescued nodes are now subtracted from `removed_nodes` and `removal_reasons` counters (previously inflated telemetry)
- `_extract_price_from_dom_chunks()` empty-text fallback — Added `aria-label` and `data-*` attribute fallbacks when chunk text is empty
- `_to_float()` European number format — `"1.500,99"` now correctly parses as `1500.99` instead of `1.5`
- `_to_int()` silent truncation — Changed from `int(float)` to `round(float)`: `"4.9"` → `5` instead of `4` (fixes silent corruption in reviewCount etc.)
- `_extract_price_from_offers()` zero-price falsy — `lowPrice or price` pattern replaced with explicit `None` check; price=0 is preserved correctly
- `_extract_image_url()` ImageObject dict — Added support for `"image": {"@type": "ImageObject", "url": "..."}` pattern (previously missed)
- `extract_metadata()` pruned_html parameter — Product and VideoObject schemas receive pruned HTML for lxml-based price extraction fallback; VideoObject also gets DOM-based metadata fallback
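Two of the fixes above come down to common Python pitfalls: `or` treating `0` as missing, and `int()` truncating toward zero. The helper names below are hypothetical and only illustrate the pattern.

```python
from typing import Optional


def pick_price(offers: dict) -> Optional[float]:
    """Hypothetical helper: prefer lowPrice, fall back to price, keep zero."""
    # `offers.get("lowPrice") or offers.get("price")` would skip a legitimate 0,
    # because 0 is falsy. Checking against None preserves it.
    for key in ("lowPrice", "price"):
        value = offers.get(key)
        if value is not None:
            return float(value)
    return None


def to_int(text: str) -> int:
    """Hypothetical helper: round instead of truncating."""
    # int(float("4.9")) gives 4; round(float("4.9")) gives 5.
    # Note that Python's round() sends exact halves to the nearest even integer.
    return round(float(text))
```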
Changed
- `_is_inside_article_or_main()` O(1) lookup — Pre-computes `_article_main_descendants` set before filtering loop; passes to `_compute_weight()` via new `article_main_descendants` parameter. Eliminates O(nodes × depth) traversal on large documents
- `_compress_for_dashboard()` news portal delegation — Now accepts `doc` parameter and delegates to `_compress_for_news_portal()` when `_is_news_portal()` detects news portal structure
- `VideoObject` added to `_SCHEMA_OVERRIDES` — VideoObject schema now overrides page_type-based compressor selection; `_SCHEMA_OVERRIDES` moved to module level (eliminates per-call frozenset recreation)
- Phase 4 product price regex — Replaced inline regex with pre-compiled `_PRICE_CLASS_RE` (named groups, handles single-quote class attributes); extracted price injected back into metadata dict for downstream consumers
- VideoObject itemprop `author` → `channel` — Added `"author": "channel"` to `_ITEMPROP_FIELD_MAP["VideoObject"]`
- VideoObject OG `og:site_name` removed — Removed `og:site_name` → `channel` mapping (was incorrectly using site name e.g. "YouTube" as channel name)
- Video description CJK budget factor — Reduced from 0.95 to 0.85 to account for CJK-heavy descriptions (~1.5 chars/token vs English ~4 chars/token); `_truncate_to_tokens()` guard handles overshoot
- 4014 → 4194 tests passing (+180)
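The O(1) lookup in the first item amounts to collecting every descendant of `<article>`/`<main>` once, then answering membership queries from a set instead of walking up the ancestor chain per node. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET


def article_main_descendants(root: ET.Element) -> set:
    """Hypothetical precomputation: all elements inside <article> or <main>."""
    inside = set()
    for container in root.iter():
        if container.tag in ("article", "main"):
            # iter() yields the container itself plus all of its descendants.
            inside.update(container.iter())
    return inside
```

During filtering, `node in inside` is then a constant-time set lookup rather than a per-node ancestor traversal.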
Full Changelog: https://github.com/Retio-ai/Retio-pagemap/blob/main/CHANGELOG.md
- `pagemap serve --transport http --port <PORT>` now functional; CLI forwards remaining arguments to the server.
- CLI help text cleaned: duplicate "build build" epilog removed, `retio-pagemap --help` restored.
- Article extraction now yields 400+ tokens (up from 84) via a budget-based compressor
- YouTube page type added with VideoObject metadata parser and formatted K/M number display
- Amazon price extraction enhanced via DOM fallback for `a-price`/`a-offscreen` classes
Full changelog
Content Extraction Quality + DX Improvements
Resolves 11 of 13 issues from QUALITY_REVIEW_v0.7.0. Core product value — content extraction — significantly improved for article, video, and product pages.
Highlights
- Article content: 84 → 400+ tokens — Readability.js-inspired AOM filter exemption for `<p>` tags inside `<article>`/`<main>` (long paragraphs survive link-density penalty). Budget-based article compressor replaces fixed 2-paragraph limit.
- YouTube: `video` page type + rich metadata — New page classifier with URL/meta/DOM/JSON-LD signals. VideoObject metadata parser extracts views, likes, channel, duration. Dedicated video compressor with K/M number formatting.
- Amazon price extraction — DOM price fallback scans `a-price`/`a-offscreen` class patterns. Product schema content rescue enhanced. Product compressor price fallback from pruned HTML.
- `pagemap serve` HTTP transport — `pagemap serve --transport http --port 8000` now works. CLI forwards remaining args to server.
- DX fixes — `--help` epilog `build build` duplicate removed, `retio-pagemap --help` restored, README version updated.
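As an illustration of K/M number formatting for view and like counts, here is one possible helper. The name `format_count` and the exact output style (one truncated decimal, trailing `.0` dropped) are assumptions; the release notes do not specify the format PageMap emits.

```python
def format_count(n: int) -> str:
    """Hypothetical compact formatter: 1500 -> '1.5K', 2000000 -> '2M'."""
    for threshold, suffix in ((1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            # Integer arithmetic truncates to one decimal place, so 999_999
            # renders as 999.9K and never rounds up to an odd-looking "1000K".
            tenths = n * 10 // threshold
            whole, frac = divmod(tenths, 10)
            return f"{whole}.{frac}{suffix}" if frac else f"{whole}{suffix}"
    return str(n)
```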
Bug Fixes
- session_manager resource leak — `pool.acquire()` exception now properly releases semaphore slot
- Card detection entity leak — `&amp;` entities no longer leak into agent output (e.g., "H&M")
Stats
- 4,014 tests passing (+31 net, +48 new)
- 22 files changed, 566 insertions, 84 deletions
See CHANGELOG for full details.
- HTTP Transport: `--transport http` flag enabling Streamable HTTP, structlog JSON logging, K8s probes, API Gateway middleware, and graceful drain
- 4‑Layer Security Middleware adding Rate Limiting, API Key Auth, TLS 1.3 enforcement, session isolation, browser context recycling, per‑session resource quotas
- RFC 9457 Error Standardization providing structured 15‑type error taxonomy with secret masking
Full changelog
v0.7.0 — HTTP Transport + Security + Quality
The biggest release yet: HTTP transport, production-grade security, and comprehensive quality improvements.
Highlights
- HTTP Transport — `--transport http` flag with Streamable HTTP, structlog JSON logging, K8s probes, API Gateway middleware, graceful drain
- 4-Layer Security Middleware — Gateway → Rate Limiting → API Key Auth → Security Headers. TLS 1.3 enforcement, session isolation, browser context recycling, per-session resource quotas
- Quality Review (12 issues resolved) — Content extraction overhaul, captcha/WAF detection, interactable noise filtering, 4-phase image pipeline, Unicode script language filtering, CLI UX improvements
- RFC 9457 Error Standardization — Structured error responses with 15-type taxonomy and secret masking
- i18n Expansion — 6 → 10 locales (added zh, es, it, pt, nl) with full detection terms
- JSON-LD Schema Expansion — NewsArticle, BreadcrumbList, FAQPage, Event, LocalBusiness + SaaS/Government/Wiki compressors
- robots.txt Compliance — RFC 9309 checker with fail-open semantics
- 3,983 tests (+1,043 from v0.6.0)
Full changelog
See CHANGELOG.md for the complete list of changes.
- Configure PAGEMAP_MAX_TEXT_BYTES and PAGEMAP_MAX_IMAGE_BYTES env vars to adjust MCP response size limits
- New telemetry events (pagemap.guard.resource_triggered, pagemap.tool.error) are emitted for resource guards and errors
- DOM node guard: raises ResourceExhaustionError when >50K nodes
- HTML size limit: caps page.content() at 5 MB across all build paths
- MCP response size guards: text ≤1 MB and screenshot ≤5 MB, configurable via env vars
Full changelog
[0.6.0] - 2026-02-23
Added
- Phase α: RequestContext extraction — `RequestContext` frozen dataclass + `_create_stdio_context()` helper. All 9 tool `_impl` functions accept `ctx` keyword-only parameter, zero direct `_state` references achieved. Foundation for Phase β (HTTP transport)
- DOM node guard — `ResourceExhaustionError` when DOM exceeds 50K nodes. Single `getComputedStyle()`-based evaluate combines node counting + hidden element detection
- HTML size limit — 5MB cap on `page.content()`. Applied to all 3 code paths: `build_page_map_live`, `build_page_map_from_page`, `rebuild_content_only`
- Hidden content 2-layer detection — (1) JS `getComputedStyle()` DOM removal (`display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`, off-screen), (2) AOM filter inline style patterns. 43 new tests
- MCP response size guards — text 1MB (`PAGEMAP_MAX_TEXT_BYTES`) + screenshot 5MB (`PAGEMAP_MAX_IMAGE_BYTES`). Configurable via env vars, truncation with recovery hint tail marker, telemetry event emission
- Agent-friendly error messages — `_safe_error()` + `_RECOVERY_HINTS` recovery hint system. Actionable hints on all error paths
- TOOL_ERROR telemetry — `pagemap.tool.error` event emitted on `_safe_error()` calls, enriched with `session_id`
- Resource guard telemetry — `pagemap.guard.resource_triggered`, `pagemap.guard.response_size_exceeded` event types with TypedDict payloads and builder functions
- Docker infrastructure — `Dockerfile` 2-stage build (uv + Playwright + non-root user), `docker-compose.yml`, `.github/workflows/docker.yml` CI/CD
- GitHub Codespaces / Devcontainer — `.devcontainer/devcontainer.json` (MS Playwright image + uv feature + 3 VS Code extensions)
- Cursor Marketplace plugin — `.cursor-plugin/plugin.json` + `rules/` + `mcp.json`
- VS Code MCP Gallery support — OCI package entry added to `server.json`
Changed
- CI/CD hardening — GitHub Actions SHA pinning, `pip-audit`, `bandit -r`, CodeQL SAST, Dependabot (pip + github-actions + docker), gitleaks pre-commit hook, weekly latest-deps CI job
- Guard helper extraction — `_check_html_size()`, `_check_resource_limits()` refactored into standalone functions
- Telemetry async flush — `flush_async` prevents event loop blocking
- 2198 → 2266 tests passing (+68)
Fixed
- font-size:0 regex false positive — valid values like `0.5em`, `0.875rem` were incorrectly matched as `font-size:0`. 43 regression tests added
- release.sh pyproject.toml readme path patching
- ruff lint/format fixes for telemetry and test files
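The `font-size:0` false positive is easy to reproduce with a naive pattern such as `font-size:\s*0`, which also matches `0.5em`. One way to avoid it is to require that the zero value ends before the next declaration. The pattern and helper below are an illustrative sketch, not the regex used in PageMap.

```python
import re

# Matches font-size values that are exactly zero (0, 0px, 0.0em, ...),
# followed by ";", "!" (as in !important), or end of string.
_FONT_SIZE_ZERO = re.compile(
    r"font-size\s*:\s*0+(?:\.0+)?(?:px|em|rem|pt|%)?\s*(?:;|!|$)",
    re.IGNORECASE,
)


def has_zero_font_size(style: str) -> bool:
    """Hypothetical check for an inline style that hides text via font-size."""
    return _FONT_SIZE_ZERO.search(style) is not None
```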
- Chromium auto‑install eliminates manual `playwright install` steps
- Claude Code Plugin integration via `claude plugin add pagemap`
- Six client configurations (Claude Code, Cursor, Claude Desktop for macOS and Windows, VS Code Copilot, Windsurf) and ten ready‑to‑use example prompts
Full changelog
Benchmark leap: 63.6% → 84.7% task success
PageMap now leads all competitors in both accuracy and token efficiency across 94 tasks on 16 sites.
Highlights
- 84.7% task success — +21.1%p from v0.5.0, +23.3%p ahead of Playwright MCP/CLI, +20.2%p ahead of Firecrawl
- 5.1x fewer tokens than Playwright/Firecrawl at higher accuracy
- 20.3x fewer tokens than full Playwright HTML
- Chromium auto-install — no manual `playwright install` needed
- Claude Code Plugin — `claude plugin add pagemap`
- 6 client configs — Claude Code, Cursor, Claude Desktop (macOS/Windows), VS Code Copilot, Windsurf
- 10 example prompts — ready-to-use prompts across 5 categories
Engine fixes
- Wikipedia page type misclassification resolved
- Product detail pruned_context enrichment (was 61 chars → full content)
- Pruning error hierarchy, hostname matching, content_hash fixes
- Snapshot recovery for 5 sites (H&M, Zara, COS, W Concept, SSF Shop)
- AOM filter improvements: role="main" detection, e-commerce patterns
Full changelog
See CHANGELOG.md for details.
Install: uvx retio-pagemap or pip install retio-pagemap==0.5.2
- License updated from MIT to AGPL-3.0-only, requiring downstream projects to comply with AGPL-3.0 conditions.
- Diff-based updates output only changed sections (cache hit ~100ms, refresh ~500ms, full rebuild ~1.5s)
- URL-based PageMap cache with 2‑layer LRU (20 entries) and 90‑second TTL
- CJK token penalty correction for Korean (9.4x penalty compensated via language‑aware weights)
Full changelog
What's New
Core Features
- Diff-based updates — `to_agent_prompt_diff()` outputs changed sections only, 3-tier rebuild: cache hit (~100ms) / content refresh (~500ms) / full rebuild (~1.5s)
- URL-based PageMap cache — 2-layer architecture (active + URL LRU 20 entries), TTL 90s, DOM fingerprint validation
- CJK token penalty correction — Korean 9.4x penalty compensated via language-aware budget weights
- `batch_get_page_map` tool — parallel multi-URL processing (max 5 concurrent), per-tab 60s + global 120s timeout
- Session concurrency guard — `tool_lock` serializes all 9 MCP tool handlers
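The URL cache layer (LRU with 20 entries and a 90 s TTL) can be approximated with an `OrderedDict`. This sketch is an assumption about the general shape only: the class name and injectable clock are hypothetical, and the DOM fingerprint validation mentioned above is left out.

```python
import time
from collections import OrderedDict


class UrlCache:
    """Hypothetical LRU cache with per-entry TTL (illustration only)."""

    def __init__(self, max_entries: int = 20, ttl_s: float = 90.0, now=time.monotonic):
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self._now = now
        self._entries = OrderedDict()  # url -> (stored_at, value)

    def get(self, url: str):
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, value = entry
        if self._now() - stored_at > self.ttl_s:
            del self._entries[url]      # expired: drop and report a miss
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return value

    def put(self, url: str, value) -> None:
        self._entries[url] = (self._now(), value)
        self._entries.move_to_end(url)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```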
Pruning & Latency
- FORM/MEDIA chunk pruning restoration, aside filter sidebar preservation
- Orchestrator parallelization, dead regex removal, CDP session reuse, dynamic navigation wait
- Hybrid `networkidle` strategy (load → 6s budget → DOM settle fallback)
Benchmark (94 tasks, 7 conditions)
| | PageMap | Playwright MCP | Firecrawl | Jina Reader |
|--|:------:|:---------:|:-----------:|:--------:|
| Task success | 63.6% | 61.5% | 64.5% | 57.8% |
| Avg tokens | 2,403 | 13,737 | 13,886 | 11,423 |
| Cost (94 tasks) | $0.97 | $4.09 | $3.97 | $2.26 |
Comparable accuracy, 5.7x fewer tokens, and the only tool with full interaction support.
Other
- License: MIT → AGPL-3.0-only — all source files updated with SPDX headers
- 1899 tests passing (+751 from v0.3.0)
- MCP server tool count: 8 → 9
See CHANGELOG for full details.
- License updated from MIT to AGPL-3.0-only
Full changelog
Changed
- License: MIT → AGPL-3.0-only — SPDX headers on all source files, classifier and badge updates
- Template cache for domain+page_type structural knowledge
- CJK token budget optimization
- Phase 4 interactable-pruning coherence improvements
For commercial licensing, contact [email protected]
- Added `scroll_page` tool for scrolling by page, half-page, or pixel amount
- Added `take_screenshot` tool to capture viewport or full-page screenshots
- Added `navigate_back` tool to go back in browser history
Full changelog
What's New
New MCP Tools (P6) — 3 → 8 tools
| Tool | Description |
|------|-------------|
| scroll_page | Scroll up/down by page, half-page, or pixel amount |
| take_screenshot | Capture viewport or full-page screenshot |
| navigate_back | Go back in browser history |
| fill_form | Batch-fill multiple form fields in one call |
| wait_for | Wait for text to appear or disappear on the page |
| hover (action) | Hover on any interactive element |
Reliability Improvements
- Popup/new tab auto-handling — auto-detected, SSRF-checked, and switched to
- JS dialog auto-handling — alert auto-accepted, confirm/prompt auto-dismissed, content reported to agent
- Same role:name disambiguation — CSS selector fallback for duplicate labels
Full Changelog
See CHANGELOG.md for details.
1148 tests passing (+200 from v0.2.0)
- --allow-local CLI flag (and PAGEMAP_ALLOW_LOCAL env var) grants loopback, RFC 1918 and IPv6 ULA access; cloud metadata endpoints remain blocked
- _validate_url() now checks cloud metadata hosts first and respects --allow-local
- _validate_resolved_ips() prioritizes cloud metadata IPs and exempts local IPs when --allow-local is active
- SSRF defense added: _normalize_ip() pure‑arithmetic parsing, pre/post DNS resolve & validation, route guard and navigation block
- Browser hardening: Chromium launch args (`--block-new-web-contents`, WebRTC leak prevention), service workers blocked, markdown injection neutralization
- 3‑strategy locator fallback (role → CSS selector → degraded role) with CSS selector field on Interactable
- Action retry logic up to 2 retries, DOM change detection via structural fingerprinting
- Overall execute_action timeout of 30 seconds using asyncio.wait_for
Full changelog
What's New in v0.2.0
Added
- execute_action reliability overhaul (P2)
  - 3-strategy locator fallback chain: `get_by_role(exact)` → CSS selector → role (`.first`, degraded)
  - CSS selector field on `Interactable` (Tier 1-2 CDP-based + Tier 3 batch JS inline generation)
  - Action retry logic: up to 2 retries with 15s wall-clock budget and locator re-resolution; click retried only on pre-dispatch failures
  - DOM change detection: pre/post structural fingerprint comparison catches URL-stable DOM mutations (modals, SPA navigations)
  - Overall execute_action timeout (30s) via `asyncio.wait_for`
  - Browser death detection (`TargetClosedError`, connection lost) with automatic `_last_page_map` invalidation and recovery guidance
  - Affordance-action compatibility pre-check (e.g. `type` on a button blocked early with suggested action)
  - Tier 3 CDP N+1 elimination: per-element 4x sequential CDP calls → single batch `Runtime.evaluate`
- SSRF 4-layer defense (S2)
  - `_normalize_ip()` pure-arithmetic parsing (octal/hex/decimal bypass defense)
  - Pre-nav DNS resolve + IP validation (`_resolve_dns` + `_validate_resolved_ips`, dual `is_global` check)
  - Post-nav DNS revalidation (redirect chain TOCTOU mitigation)
  - Context route guard (`install_ssrf_route_guard`, document/subdocument JS navigation blocking)
  - Post-action navigation SSRF check with `about:blank` redirect on block
- Browser hardening (S3)
  - Chromium launch args hardening (`--block-new-web-contents`, WebRTC IP leak prevention, ServiceWorker disable)
  - Context options hardening (`service_workers="block"`, `accept_downloads=False`)
  - Internal protocol blocking expanded (`view-source://`, `blob:`, `data:`, `about:` — page-level → context-level)
  - Markdown injection defense (`javascript:`/`vbscript:`/`data:`/`blob:` URI neutralization)
- `--allow-local` flag for local development (P6)
  - `--allow-local` CLI flag: opt-in access to loopback (127.x, ::1), RFC 1918 (10.x, 172.16-31.x, 192.168.x), IPv6 ULA (fc00::/7)
  - `PAGEMAP_ALLOW_LOCAL` env var: alternative for containerized deployments
  - Cloud metadata endpoints (169.254.x.x, `metadata.google.internal`) remain unconditionally blocked
- AX tree failure isolation (P8) — `detect_interactables_ax()` failure no longer crashes entire build; graceful degradation returns pruning results only
Changed
- `_validate_url()`: cloud metadata hosts checked first (always blocked), `BLOCKED_HOSTS` now respects `--allow-local`
- `_validate_resolved_ips()`: cloud metadata IPs prioritized, `_is_local_ip()` exemption for `--allow-local`
- `main()`: extracted `_parse_server_args()`, SECURITY warning logged when `--allow-local` is active
- 938 tests passing (+332 from v0.1.3)
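The ordering described above (metadata range always blocked, local ranges allowed only with `--allow-local`, everything else required to be globally routable) can be sketched with the standard `ipaddress` module. The function name and network lists are assumptions based on the ranges named in these notes. The sketch covers resolved IP literals only; hostname checks such as `metadata.google.internal` and non-canonical octal/hex forms are out of scope.

```python
import ipaddress

# Cloud metadata range from the notes above; blocked unconditionally.
_METADATA_NET = ipaddress.ip_network("169.254.0.0/16")

# Loopback, RFC 1918, and IPv6 ULA ranges that --allow-local opts into.
_LOCAL_NETS = [
    ipaddress.ip_network(net)
    for net in ("127.0.0.0/8", "::1/128", "10.0.0.0/8",
                "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")
]


def is_allowed_ip(raw: str, allow_local: bool = False) -> bool:
    """Hypothetical check for a resolved IP address."""
    ip = ipaddress.ip_address(raw)
    if ip in _METADATA_NET:
        return False            # checked first, regardless of allow_local
    if any(ip in net for net in _LOCAL_NETS):
        return allow_local      # local ranges are opt-in only
    return ip.is_global         # anything else must be publicly routable
```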
Full changelog: https://github.com/Retio-ai/Retio-pagemap/compare/v0.1.3...v0.2.0
- GitHub Actions CI pipeline for lint and test across Python 3.11‑3.13 matrix
- Automated PyPI publishing CD pipeline via GitHub Release using OIDC trusted publishers
Full changelog
Added
- GitHub Actions CI pipeline (lint + test, Python 3.11/3.12/3.13 matrix)
- CD pipeline for automated PyPI publishing via GitHub Release (OIDC trusted publishers)
- CI badge in README
Changed
- Applied ruff format to entire codebase
- Excluded internal config.yaml from public release