
Scrapling

AI Agents & Assistants

An adaptive Python framework for web scraping that automatically adjusts to site changes and bypasses anti‑bot protections.

Python · Latest v0.4.8 · 24d ago

Features

  • Adaptive parser that relocates elements when pages update
  • Built‑in fetchers that bypass anti‑bot systems (e.g., Cloudflare Turnstile)
  • Scalable spider framework with concurrent, multi‑session crawls and automatic proxy rotation

Recent releases

v0.4.8 New feature
⚠ Upgrade required
  • Adaptive relocation now defaults to a 40% similarity threshold; lower the threshold if needed and heed the new warning on weak matches.
  • Run `scrapling install --force` after updating to refresh browsers and fingerprints.
Notable features
  • Added `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` for URL extraction with fine‑grained controls.
  • Introduced `CrawlSpider` and `CrawlRule` templates to simplify "follow links matching a pattern" boilerplate.
  • Provided `SitemapSpider` template that seeds crawls from sitemaps or `robots.txt`, handling gzip‑compressed sitemaps.
Full changelog

A big spider update that takes the crawling framework to the next level 🕷️

[!NOTE]
Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

  • Added a LinkExtractor primitive in scrapling.spiders.LinkExtractor to pull URLs out of a Response. It exposes many filtering controls, such as allow and deny_domains; check the docs for the full list.

    from scrapling.spiders import LinkExtractor
    
    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    
  • Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor with an optional callback. (Check the docs)

    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor
    
    class QuotesSpider(CrawlSpider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]
    
        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
    
  • Added a SitemapSpider template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and offers many other controls and options. URLs are dispatched through crawl rules, as shown above for CrawlSpider. (Check the docs)

    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor
    
    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]
    
        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
    
  • Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, which should make the adaptive feature return fewer false matches. When no candidate crosses the threshold, a warning now reports the top score that was seen, so you can lower the percentage deliberately if needed.

  • Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh them.
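
The threshold behaviour in the adaptive relocation note can be pictured with a small sketch. This is not Scrapling's relocation algorithm: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, and `relocate`, `saved`, `candidates`, and `threshold` are hypothetical names chosen for illustration.

```python
import warnings
from difflib import SequenceMatcher


def relocate(saved: str, candidates: list[str], threshold: float = 40.0):
    """Return the candidate most similar to `saved`, or None if none reaches `threshold` percent."""
    if not candidates:
        return None
    # Score every candidate as a percentage (stand-in for a real element similarity score).
    scored = [(SequenceMatcher(None, saved, c).ratio() * 100, c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        # Report the top score so the caller can lower the threshold on purpose.
        warnings.warn(
            f"No candidate reached {threshold:.0f}%; top score was {best_score:.1f}%"
        )
        return None
    return best
```

With a threshold of 0, the best-scoring candidate would always be returned, however poor the match. A non-zero default turns that silent guess into an explicit warning.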

🐛 Bug Fixes

  • Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
  • Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
  • Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
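
For context on the fingerprinting fix, here is a minimal sketch of how a crawler might fingerprint requests so that equivalent ones are deduplicated. The canonicalisation rules (sorted query parameters, dropped fragment, lowercased host) and the function name `request_fingerprint` are assumptions for illustration; the source does not describe Scrapling's actual scheme.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that decide whether two requests are 'the same'."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\n")
    return digest.hexdigest()


seen: set[str] = set()


def is_duplicate(method: str, url: str, body: bytes = b"") -> bool:
    """Return True if an equivalent request was already recorded; otherwise record it."""
    fp = request_fingerprint(method, url, body)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```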

Docs

  • Refreshed older code examples across the documentation to match the current version.
  • Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.7 Mixed
Notable features
  • screenshot MCP tool capturing pages as ImageContent
  • Custom session_id parameter for named sessions
Full changelog

A focused update bringing eyes to your AI agents 📸

🚀 New Stuff and quality of life changes

  • Added a screenshot MCP tool that captures a page and returns it as a real MCP ImageContent block so the model can actually see it. The tool requires an open browser session, so you call open_session first (either dynamic or stealthy) and pass the session_id here. Supports PNG and JPEG, full-page captures, JPEG quality, and the usual readiness controls (wait, wait_selector, network_idle, timeout). (implements #244)
  • Added a custom session_id parameter to open_session so you can name sessions meaningfully ("search", "checkout") instead of the random 12-character hex default. By @hauntedhost in #243
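
The naming behaviour can be sketched as follows. This is a hypothetical helper, not the MCP server's code: `resolve_session_id` and the `existing` set are illustrative names, and `secrets.token_hex(6)` is just one way to produce a random 12-character hex string.

```python
import secrets


def resolve_session_id(session_id=None, existing=frozenset()):
    """Use the caller's name if given, else fall back to a random 12-character hex ID."""
    sid = session_id or secrets.token_hex(6)  # 6 random bytes -> 12 hex characters
    if sid in existing:
        raise ValueError(f"session_id {sid!r} is already in use")
    return sid
```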

🐛 Bug Fixes

  • Fixed FetcherSession state corruption and a lazy session close crash. By @yetval in #245
  • Fixed TypeError: Session.request() got an unexpected keyword argument 'block_ads' when using the CLI's --ai-targeted flag with HTTP commands. By @voidborne-d in #249 (Fixes #247)

Translations

  • Added a Brazilian Portuguese README translation. By @rgomids in #250


v0.4.6 Mixed
Notable features
  • Built-in ad blocking with ~3,500 known ad and tracker domains
  • DNS-over-HTTPS support for privacy with proxies
  • page_setup callback for pre-navigation setup
Full changelog

A focused update on browser stealth, privacy, and developer experience 🔒


🚀 New Stuff and quality of life changes

  • Added built-in ad blocking for browser fetchers. Pass block_ads=True to block requests to ~3,500 known ad and tracker domains at the route interception level, so matching requests are aborted before any DNS lookup or TCP connection is made. It can be combined with blocked_domains for custom lists. The MCP server and the CLI's --ai-targeted mode enable this automatically to save tokens and speed up page loads.
    from scrapling.fetchers import StealthyFetcher

    page = StealthyFetcher.fetch('https://example.com', block_ads=True)
    
  • Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass dns_over_https=True to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.
    from scrapling.fetchers import StealthyFetcher

    page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)
    
  • Added a page_setup callback for browser fetchers: a function that runs before page.goto(), letting you register event listeners, routes, or scripts that must be set up before the page navigates. It pairs with page_action, which runs after navigation. (Solves #237)
    from scrapling.fetchers import DynamicFetcher

    def capture_websockets(page):
        # Register the listener before navigation so no WebSocket is missed.
        page.on("websocket", lambda ws: print(f"WS: {ws.url}"))

    page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)
    
  • Added --block-ads and --dns-over-https CLI options to both fetch and stealthy-fetch commands.
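
To illustrate what matching a request against a blocked-domain list can look like, here is a minimal sketch. The suffix-matching rule (a listed domain also blocks its subdomains) and the name `is_blocked` are assumptions for illustration; the source does not state how Scrapling matches domains.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check the host and each parent domain: a.b.example -> b.example -> example
    return any(".".join(labels[i:]) in blocked_domains for i in range(len(labels)))
```

A route handler in a browser automation library would call a check like this for every outgoing request and abort the ones that match, which is why no connection to the ad server is ever opened.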

🐛 Bug Fixes

  • Fixed Seconds type alias rejecting float values. Passing wait=1.5 or timeout=500.0 to browser fetchers would fail with a type error because the type alias incorrectly treated float as metadata instead of a type. by @kuishou68 in #240
  • Fixed duplicate ID segments in full-path selector generation. Elements with id attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like body > #main > #main > #target > #target. Also fixed full-path XPath emitting bare [@id='x'] predicates (invalid XPath) instead of *[@id='x']. by @sjhddh in #241
  • Fixed missing shell signature parameters. The interactive shell was missing blocked_domains, block_ads, retries, retry_delay, capture_xhr, executable_path, and dns_over_https from its function signatures.
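
The Seconds bug above is a general `typing.Annotated` pitfall, sketched below. The two alias definitions are guesses at the shape of the problem, not Scrapling's actual code; `BrokenSeconds`, `FixedSeconds`, and `accepted_types` are hypothetical names.

```python
from typing import Annotated, Union, get_args

# Only the FIRST argument of Annotated is the type; everything after it is metadata.
BrokenSeconds = Annotated[int, float]  # type is int; float is silently just metadata
FixedSeconds = Annotated[Union[int, float], "seconds"]  # type is int-or-float


def accepted_types(alias) -> tuple:
    """Return the runtime types an Annotated alias actually accepts."""
    base = get_args(alias)[0]  # the annotated type itself
    return get_args(base) or (base,)  # unpack a Union, or wrap a plain type
```

A validator that reads the alias would therefore reject `1.5` under the first definition and accept it under the second.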


v0.4.5 Security relevant
Security fixes
  • Default follow_redirects='safe' prevents SSRF attacks on internal IPs
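
As a rough picture of what rejecting redirects to internal addresses involves, here is a sketch built on the standard-library `ipaddress` module. The function names are hypothetical, and the sketch only inspects literal IP addresses plus `localhost`. A real implementation would also need to resolve hostnames before checking them, and the source does not describe how Scrapling does this.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_host(host: str) -> bool:
    """Return True for loopback, private-network, and link-local IP literals."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Resolving hostnames is out of scope for this sketch.
        return host.lower() == "localhost"
    return ip.is_loopback or ip.is_private or ip.is_link_local


def redirect_allowed(location: str, policy="safe") -> bool:
    """Mirror the three documented policies: "safe", "all", or False."""
    if policy is False:
        return False  # redirects disabled entirely
    if policy == "all":
        return True  # old behaviour: follow everything
    host = urlsplit(location).hostname or ""
    return not is_internal_host(host)
```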
Notable features
  • Spider development mode with response caching
Full changelog

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉


🚀 New Stuff and quality of life changes

  • Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

    from scrapling.spiders import Spider

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        development_mode = True
    
        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}
    

    The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.

  • Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
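
The development-mode cache described in this release can be pictured as a small disk-backed store. This is a minimal sketch under stated assumptions, not Scrapling's implementation: the `ResponseCache` class, the SHA-256 file naming, and the JSON layout are invented for illustration. Only the `cache_hits` and `cache_misses` counter names come from the release notes.

```python
import hashlib
import json
from pathlib import Path


class ResponseCache:
    """Store each response body on disk on first fetch and replay it afterwards."""

    def __init__(self, cache_dir):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.cache_hits = 0
        self.cache_misses = 0

    def _path(self, url: str) -> Path:
        # One file per URL, named by a hash so any URL maps to a safe filename.
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def fetch(self, url: str, download):
        path = self._path(url)
        if path.exists():
            self.cache_hits += 1  # replay from disk, no network request
            return json.loads(path.read_text(encoding="utf-8"))["body"]
        self.cache_misses += 1
        body = download(url)  # first run: hit the network, then persist
        path.write_text(json.dumps({"url": url, "body": body}), encoding="utf-8")
        return body
```

Because replay never calls `download`, delays and rate limits that guard the network path have nothing to guard, which matches the note that cached responses skip them.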

🐛 Bug Fixes

  • Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.





v0.4.4 Breaking risk
Notable features
  • robots.txt compliance with Disallow, Crawl-delay, Request-rate
  • robots.txt cache pre-warming before crawl
  • robots_disallowed_count stat in CrawlStats
Full changelog

A new update with important spider improvements and bug fixes 🎉

🚀 New Stuff and quality of life changes

  • Added robots.txt compliance to the Spider framework with a new robots_txt_obey option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including Disallow, Crawl-delay, and Request-rate directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
  • Added robots.txt cache pre-warming so all start_urls domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
  • Added a new robots_disallowed_count stat to CrawlStats to track how many requests were blocked by robots.txt rules during a crawl.


🐛 Bug Fixes

  • Fixed a critical MRO issue with ProxyRotator where the _build_context_with_proxy stub was shadowing the real implementation from child classes, causing proxy rotation to always raise NotImplementedError (Fixes #215). Thanks @yetval
  • Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
  • Fixed a missing type assertion in the static fetcher where curl_cffi could return None from session.request(), causing downstream errors.
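
The ProxyRotator fix concerns method resolution order (MRO). The sketch below shows, with a hypothetical class layout, how a stub can win over a real implementation under multiple inheritance. Apart from `_build_context_with_proxy`, the class and method names are invented for illustration and do not reflect Scrapling's actual hierarchy.

```python
class RotatorMixin:
    """Provides rotation logic and a stub the concrete session should supply."""

    def _build_context_with_proxy(self, proxy):
        raise NotImplementedError

    def fetch_with(self, proxy):
        return self._build_context_with_proxy(proxy)


class BrowserSession:
    """Holds the real implementation."""

    def _build_context_with_proxy(self, proxy):
        return f"context via {proxy}"


class BrokenSession(RotatorMixin, BrowserSession):
    pass  # MRO finds the mixin's stub first, so the real method is shadowed


class FixedSession(BrowserSession, RotatorMixin):
    pass  # MRO finds the real implementation first
```

Inspecting `BrokenSession.__mro__` shows `RotatorMixin` ahead of `BrowserSession`, which is why every call raised `NotImplementedError` in this layout. Reordering the bases or removing the stub are two common ways out.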

Other

  • Updated dependencies, so expect the latest fingerprints along with other upstream changes.
  • Added protego as a new dependency under the fetchers optional group for robots.txt parsing.



About

Stars: 59,761
Forks: 5,767
Languages: Python, Dockerfile

Install & Platforms

Install via
pip
