Skip to content

Release history

Scrapling releases

All releases

24 shown

v0.4.8 New feature
⚠ Upgrade required
  • Adaptive relocation now defaults to a 40% similarity threshold; lower the threshold if needed and heed the new warning on weak matches.
  • Run `scrapling install --force` after updating to refresh browsers and fingerprints.
Notable features
  • Added `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` for URL extraction with fine‑grained controls.
  • Introduced `CrawlSpider` and `CrawlRule` templates to simplify "follow links matching a pattern" boilerplate.
  • Provided `SitemapSpider` template that seeds crawls from sitemaps or `robots.txt`, handling gzip‑compressed sitemaps.
Full changelog

A big spider update that takes the crawling framework to the next level 🕷️

[!NOTE]
Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

  • Added a LinkExtractor primitive in scrapling.spiders.LinkExtractor to pull URLs out of a Response. There are a lot of controls (Check the docs)

    from scrapling.spiders import LinkExtractor
    
    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    
  • Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor. (Check the docs)

    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor
    
    class QuotesSpider(CrawlSpider):
        name = "blog"
        start_urls = ["https://quotes.toscrape.com/"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]
    
        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
    
  • Added a SitemapSpider template that seeds a crawl directly from a sitemap, or robots.txt URLs. Handles gzip-compressed sitemaps, and a lot of controls and options. URLs are dispatched via the crawl rules as shown above for CrawlSpider. (Check the docs)

    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor
    
    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]
    
        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
    
  • Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods. This will make the adaptive feature work better. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower percentage deliberately if needed.

  • Updated all browsers and fingerprints. Run a new scrapling install --force after updating to refresh the browsers and fingerprints.

🐛 Bug Fixes

  • Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
  • Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
  • Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.

Docs

  • Refreshed older code examples across the documentation to match the current version.
  • Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.7 Mixed
Notable features
  • screenshot MCP tool capturing pages as ImageContent
  • Custom session_id parameter for named sessions
Full changelog

A focused update bringing eyes to your AI agents 📸

[!NOTE]
Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

  • Added a screenshot MCP tool that captures a page and returns it as a real MCP ImageContent block so the model can actually see it. The tool requires an open browser session, so you call open_session first (either dynamic or stealthy) and pass the session_id here. Supports PNG and JPEG, full-page captures, JPEG quality, and the usual readiness controls (wait, wait_selector, network_idle, timeout). (implements #244)
  • Added a custom session_id parameter to open_session so you can name sessions meaningfully ("search", "checkout") instead of the random 12-character hex default. By @hauntedhost in #243

🐛 Bug Fixes

  • Fixed FetcherSession state corruption and a lazy session close crash. By @yetval in #245
  • Fixed TypeError: Session.request() got an unexpected keyword argument 'block_ads' when using the CLI's --ai-targeted flag with HTTP commands. By @voidborne-d in #249 (Fixes #247)

Translations

  • Added a Brazilian Portuguese README translation By @rgomids in #250

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.6 Mixed
Notable features
  • Built-in ad blocking with ~3,500 known ad and tracker domains
  • DNS-over-HTTPS support for privacy with proxies
  • page_setup callback for pre-navigation setup
Full changelog

A focused update on browser stealth, privacy, and developer experience 🔒

[!NOTE]
Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

  • Added built-in ad blocking for browser fetchers. Pass block_ads=True to block requests to ~3,500 known ad and tracker domains at the route interception level -- no DNS, no TCP, instant abort. Can be combined with blocked_domains for custom lists. The MCP server and CLI --ai-targeted mode enable this automatically to save tokens and speed up page loads.
    page = StealthyFetcher.fetch('https://example.com', block_ads=True)
    
  • Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass dns_over_https=True to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.
    page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)
    
  • Added page_setup callback for browser fetchers. A function that runs before page.goto(), letting you register event listeners, routes, or scripts that must be set up before the page navigates. Pairs with page_action (which runs after navigation). (Solves #237)
    def capture_websockets(page):
        page.on("websocket", lambda ws: print(f"WS: {ws.url}"))
    
    page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)
    
  • Added --block-ads and --dns-over-https CLI options to both fetch and stealthy-fetch commands.

🐛 Bug Fixes

  • Fixed Seconds type alias rejecting float values. Passing wait=1.5 or timeout=500.0 to browser fetchers would fail with a type error because the type alias incorrectly treated float as metadata instead of a type. by @kuishou68 in #240
  • Fixed duplicate ID segments in full-path selector generation. Elements with id attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like body > #main > #main > #target > #target. Also fixed full-path XPath emitting bare [@id='x'] predicates (invalid XPath) instead of *[@id='x']. by @sjhddh in #241
  • Fixed missing shell signature parameters. The interactive shell was missing blocked_domains, block_ads, retries, retry_delay, capture_xhr, executable_path, and dns_over_https from its function signatures.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.5 Security relevant
Security fixes
  • Default follow_redirects='safe' prevents SSRF attacks on internal IPs
Notable features
  • Spider development mode with response caching
Full changelog

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

[!NOTE]
Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

  • Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        development_mode = True
    
        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}
    

    The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.

  • Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.

🐛 Bug Fixes

  • Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.

🙏 Special thanks to the community for all the continuous testing and feedback




v0.4.4 Breaking risk
Notable features
  • robots.txt compliance with Disallow, Crawl-delay, Request-rate
  • robots.txt cache pre-warming before crawl
  • robots_disallowed_count stat in CrawlStats
Full changelog

A new update with important spider improvements and bug fixes 🎉

🚀 New Stuff and quality of life changes

  • Added robots.txt compliance to the Spider framework with a new robots_txt_obey option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including Disallow, Crawl-delay, and Request-rate directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
  • Added robots.txt cache pre-warming so all start_urls domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
  • Added a new robots_disallowed_count stat to CrawlStats to track how many requests were blocked by robots.txt rules during a crawl.

Check it out on the website from here

🐛 Bug Fixes

  • Fixed a critical MRO issue with ProxyRotator where the _build_context_with_proxy stub was shadowing the real implementation from child classes, causing proxy rotation to always raise NotImplementedError (Fixes #215). Thanks @yetval
  • Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
  • Fixed a missing type assertion in the static fetcher where curl_cffi could return None from session.request(), causing downstream errors.

Other

  • Updated dependencies, so expect the latest fingerprints and other stuff.
  • Added protego as a new dependency under the fetchers optional group for robots.txt parsing.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.3 Mixed
Notable features
  • Persistent browser sessions via MCP tools
  • Session listing tool for tracking open browsers
  • Prompt injection sanitizer for MCP server
Full changelog

A new update with many important changes 🎉

🚀 New Stuff and quality of life changes

  • Added a new MCP tool to open a persistent normal/stealthy browser to keep using it with the rest of the tools, and another new tool to close it. (Examples)
  • Added a new MCP tool to list all existing browser sessions. Aimed to be used with the new tools.
  • Added a new option to browser sessions to automatically collect all background requests that happen during a request (Solves #159) [Examples].
  • Added a new sanitizer to protect the MCP server from common Prompt Injection attacks by removing hidden/invisible content.
  • Added a new commandline option called --ai-targeted to the Web Scraping commands to make content targeted to AI and safe against common Prompt Injection attacks like the MCP server.
  • Added a new option to browser sessions called executable_path to allow setting a custom browser path (Solves #202)
  • Refactored the MCP server code to be easily maintained and unified all tools to be async.
  • Refactored the CLI commands code to be easily maintained and shorter by 210 lines.

🐛 Bug Fixes

  • A fix to preserve HTTP method across retries in spider session by @karesansui-u in #201
  • Added a max retry limit to getting page content to prevent infinite loop by @haosenwang1018 & @D4Vinci in #197
  • Replace bare raise with return False in _restore_from_checkpoint by @haosenwang1018 in #196
  • Replaced get_all with getall in Texthandler to match the Selector class.

Coverage/tests improvement

  • Added _normalize_credentials edge case coverage tests by @Bortlesboat in #192
  • Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
  • Added coverage for TextHandler regex paths and TextHandlers.re() by @haosenwang1018 in #194
  • Added edge case tests for filter, iterancestors, and find_similar by @awanawana in #200

Agent Skill improvement

  • Fixed broken markdown links in skill references by @yetval in #204
  • Improved the skill structure to be more acceptable by Clawhub validation.
  • Forced the skill to use the --ai-targeted commandline option when scraping through commandline commands.

Docs improvement

  • Added Korean README translation by @greatsk55 in #187
  • CJK Latin spacing fixes for the Chinese and Japanese READMEs.
  • Fixed broken links from the old website design.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.2 Mixed
Notable features
  • Agent Skill for Claude Code and AI tools
Full changelog

A new maintenance update with important changes

Bug fixes

  • The function get_all_text() now captures tail text nodes. This will make the MCP server and commands see text that was missed before (#168). Thanks @mhillebrand
  • Referer now returns a bare Google url instead of a Google search URL. The previous logic was incorrect and may have produced a fingerprinting signal (#179). Thanks @Bortlesboat
  • Fixed an issue with extra flags concatenation in all browsers. Thanks @rostchri
  • Fixed a type hints issue with Python versions below 3.12 that caused it to crash. (Solves #163)

Other

  • Added an Agent Skill for Claude Code / OpenClaw and other AI agentic tools.
  • Added the Agent Skill to Clawhub.
  • Updates all browsers and Playwright versions to the latest.
  • Added a French translation to the main README file.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

v0.4.1 New feature
Notable features
  • Improved Cloudflare detection and ~2x faster solver
  • Enhanced MCP schema for AI tool compatibility
  • Reduced token consumption via HTML tag stripping
Full changelog

A new update with many important changes

🚀 New Stuff and quality of life changes

  • Improved regex precision for Cloudflare challenge detection (Thanks to @Rinz27 #133)
  • Improved the speed and efficiency of the Cloudflare solver. Now it is nearly twice as fast.
  • Improved the Cloudflare solver to handle the case where websites sometimes show the Cloudflare page twice before redirecting to the main website.
  • Improved the stealthy browser's stealth mode and speed by removing the injected JS files.
  • Improved the MCP schema to be acceptable by OpenCode (Thanks to @robin-ede #137)
  • Made the MCP schema even more MCP-friendly to be accepted by VS Code Copilot and other strict tools. (Solves #150 )
  • Improved the MCP server tokens consumption by a large margin through stripping useless HTML tags while the main_content_only option is activated.
  • Fixed the PyPI page and added the files to register the MCP server to the MCP servers registry.
  • Added a new code snippet to show how to install the browsers deps through code instead of using the commandline to allow easier automation.
  • Improved all workflows by using the latest actions versions (Thanks to @salmanmkc #143/#144)

🙏 Special thanks to the community for all the continuous testing and feedback

v0.4 Breaking risk
Breaking changes
  • css_first/xpath_first removed; use css().first or css().get()
  • All selection now returns Selectors not TextHandlers
  • Response.body always returns bytes
Notable features
  • Spider framework for structured crawling
  • ProxyRotator for thread-safe proxy rotation
  • Domain blocking in browser fetchers
Full changelog

The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements

This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.

🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
  name = "demo"
  start_urls = ["https://example.com/"]

  async def parse(self, response: Response):
      for item in response.css('.product'):
          yield {"title": item.css('h2::text').get()}

MySpider().start()
  • Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and priority queue.
  • Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
  • Multi-Session Support: Unified interface for HTTP requests, and stealthy headless browsers in a single spider - route requests to different sessions by ID. Supports lazy session initialization.
  • Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to gracefully shut down; then restart to resume from where you left off.
  • Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UI, pipelines, and long-running crawls.
  • Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
  • Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl() respectively.
  • Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
  • Detailed crawl stats: track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
  • uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.

A new section has been added to the website with the Full details. Click here

🔄 Proxy Rotation

  • New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:
    from scrapling import ProxyRotator
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    Fetcher.get(url, proxy_rotator=rotator)
    
  • Custom rotation strategies: Make your own proxy rotation logic
  • Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.

🌐 Browser Fetcher Improvements

  • Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
  • Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
  • Response metadata: Response.meta dict automatically stores the proxy used, and merges request metadata.
  • Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
  • No autoplay: Browser sessions are now blocking autoplay content, which caused issues before.
  • Speed: Improved stealth and speed by adjusting browser flags.

🔧 Bug Fixes & Improvements

  • Parser optimization: Optimized the parser for repeated operations, improving performance.
  • Errored pages: Fixed a bug that caused the browser to not close when pages gave errors.
  • Empty body: Handle responses with empty body.
  • Playwright loop: Solving an issue with leaving the Playwright loop open when CDP connection fails
  • Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.

⚠️ Breaking Changes

  • css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
  • All selection now returns Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
  • Response.body is always bytes: Previously could be str or bytes, now always returns bytes.
  • get()/getall() behavior: On Selector: get() returns TextHandler (serialized HTML or text value), getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. Old get_all() on Selectors is removed.
  • Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
  • Internal constants renamed: DEFAULT_FLAGSDEFAULT_ARGS, DEFAULT_STEALTH_FLAGSSTEALTH_ARGS, HARMFUL_DEFAULT_ARGSHARMFUL_ARGS, DEFAULT_DISABLED_RESOURCESEXTRA_RESOURCES.

🔨 Other Changes

  • Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, added typing_extensions as a hard requirement.
  • Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.14 Bug fix

Fixed cookie persistence in StealthyFetcher on Windows

Full changelog

A minor maintenance update to fix issues that happened with some devices in v0.3.13

  • Disabled the incognito mode in StealthyFetcher and its session classes since it made cookies not persistent across pages on Windows devices. It didn't happen on MacOS and Linux (Fixes #123, thanks to @frugality4121 for bringing it up and to @gembleman for pointing out the solution).
  • Pinned down the last version of browserforge to solve the issue with old header models for users with an already old browserforge version.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.13 Breaking risk
Breaking changes
  • Removed Camoufox dependency
  • Removed stealth argument from DynamicFetcher
  • disable_webgl moved to StealthyFetcher and renamed allow_webgl
Full changelog

This is a big update with many improvements across many places, but also many breaking changes for good reasons. Please read the below before updating

  • For many reasons, we decided that from now on, we will stop using Camoufox entirely, and we might switch back to it in the future if its development continues. If you prefer to continue using Camoufox as before this release, there are instructions for that in this section.

  • Previously, we were using patchright in the stealth mode inside DynamicFetcher and its session classes. Now we removed the stealth mode from them and started using patchright inside StealthyFetcher and its session classes, with A LOT of improvements, as you will see, improving the stealth overall on top of patchright.

This makes StealthyFetcher and its session classes 101% faster than before, use less memory and space, and have ~400 lines of code shorter, but, most importantly, are more stable than when we used Camoufox before.

This will also shorten the installation time of the scrapling install command, reduce the size of the Docker image, improve test smoothness in GitHub's CI, and make scrapling less confusing for new users.

Breaking changes

  1. The stealth argument was removed from the DynamicFetcher class and its session class, while the hide_canvas argument was moved to the StealthyFetcher and its session classes.
  2. The disable_webgl argument has been moved from DynamicFetcher to the StealthyFetcher class and renamed as allow_webgl. All session classes as well.
  3. The StealthyFetcher class is now basically the new stealthy version of DynamicFetcher, so the following arguments are removed: block_images, humanize, addons, os_randomize, disable_ads, and geoip. I tried to replicate them in Chromium, but each had its own problem. This might change with upcoming releases before v0.4.

Now to the good news, we have improved and fixed a lot of stuff :)

Improvements

  • You already know that the StealthyFetcher class and its session classes are now 101% faster than before, but now also the DynamicFetcher class and its session class are 20% faster.
  • Cloudflare's solver algorithm has been improved over before now to finish faster and handle more cases. Also, thanks to the new refactor, expect the solver to solve the captcha twice as fast!
  • All fetchers now use less memory.
  • The MCP server now uses fewer tokens to save more money!
  • The Docker image is now 60% smaller.
  • The whole documentation website has been updated with the new stuff. At the same time, it was made more explicit, many sections were shortened, more examples were added, missing arguments were included, the API reference section was updated with graphs, and many other improvements were made. The Website now loads 130% faster, uses less data, and is better for SEO.

Fixes

  • Added the arguments that were missing before in the Web Scraping shell shortcuts and made them more accurate.
  • Fixed the issue where the google_search argument was creating a Google referrer even if the URL is a localhost/IP.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.12 New feature
Notable features
  • timezone_id argument for browser timezone matching
Full changelog

What's Changed

  • Added a new argument to DynamicSession/AsyncDynamicSession classes called timezone_id, which allows you to set the timezone of the browser so that it matches the timezone of the Proxy/VPN you are using. That way, the websites can't detect that you are using a proxy through the timezone mismatch technique.
  • Improved the automated conversion of response to JSON.
  • Renamed the internal function __create__ to start inside fetchers' session classes to make it easier to use them outside the with context.
  • Updated curl_cffi and other deps to the latest versions.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.11 Bug fix

Improves timeout handling and fixes autocompletion.

Full changelog

What's Changed

  • Added a better logic for handling timeout errors when the network_idle argument is used on an unstable website (websites with media playing, etc.)
  • Fixed the autocompletion for the stealthy_fetch shortcut in the Web Scraping Shell

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.10 Breaking risk
Breaking changes
  • All fetchers require keyword-only arguments (only url as positional)
  • custom_config parameter unified to selector_config
Full changelog

A maintenance update with many significant changes and possible breaking changes

  • Solved all encoding issues by using a better approach which will handle web pages where encoding is not correctly declared (Thanks to @Kemsty2's efforts for pointing that out in #110 #111 )
  • Solved a logical issue with overriding session-level parameters with request-level parameters in all browser-based fetchers that was present since v0.3
  • Fixed the signatures of the shortcuts in the interactive web scraping shell, which made a perfect autocompletion experience for the shortcuts in the shell. This issue has been present since v0.3 as well.
  • Pumped up the version for the Maxmind database, which will improve the geoip argument for StealthyFetcher and its session classes.
  • Updated all used browser versions to the latest available ones.
  • BREAKING - all fetchers had gone through a big refactor, which resulted in some interesting things that might break your code:
    1. Scrapling codebase is now smaller by ~750 lines and many changes which would make maintenance very much easier in the future and use a bit less resources.
    2. The validation for all fetchers and their session classes became much faster, which will reflect on their overall speed.
    3. To achieve this, now all fetchers can't accept standard arguments other than the url argument; the rest of the arguments must be keyword-arguments so your code must be like Fetcher.get('https://google.com', stealthy_headers=True) not Fetcher.get('https://google.com', True) if you were doing that for some reason!
    4. An annoying difference between browser-based fetchers and their session classes since v0.3 was that the argument used to pass custom parser settings per request was called custom_config, while it was named selector_config in the session classes. This refactor allowed us to unify the naming to selector_config without breaking your code, so the main one is now selector_config with backward compatibility for the custom_config argument. The autocompletion support will be available only for the selector_config argument.
    5. Also, to achieve all of this, we had to make the type hints of the fetchers' functions dynamically generated, so if you don't get a proper autocompletion in your IDE, make sure you are using a modern version of it. We have tested almost all known IDEs/editors.

We have also updated all benchmark tables with the current numbers against the latest versions of all alternative libraries.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.9 Bug fix
Notable features
  • Random browser selection in impersonate argument
  • HTML entity removal in clean method
Full changelog

A new update with many important changes

🚀 New Stuff and quality of life changes

  • Now the impersonate argument in Fetcher and FetcherSession can accept a list of browsers that the library will choose a random browser from them with each request.
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate=['chrome', 'firefox', 'safari']) as s:
  s.get('https://github.com/D4Vinci/Scrapling')
  • A new argument to the clean method in TextHandler to remove html entities from the current text easily.
  • Huge improvements to the documentation with more precise explanations of many parts and automatic translations of the main README.md file.

🐛 Bug Fixes

  • Fixed a big issue with retrieving responses from browser-based fetchers. Now, there is intelligent content type detection that ensures response.body contains the rendered browser content only if the content is HTML; otherwise, it contains the raw content of the last request made. This allows you to download binary files and text-based files without having to find them wrapped in HTML tags, while being able to retrieve the rendered content you want from the website when fetching it.

🔨 Misc

  • Updated the contributing guide to make it clearer and easier.
  • Add a new workflow to enforce code quality tools (Same ones used as pre-commit hooks).

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.8 Bug fix
Notable features
  • extra_flags for custom Chrome flag configuration
Full changelog

A new update with many important changes

🚀 New Stuff and quality of life changes

  • For all browser-based fetchers: websites that never finish loading their requests won't crash the code now if you used network_idle with them.
  • The logic for collecting/checking for page content in browser-based fetchers has been changed to make browsers more stable on Windows systems now, as Linux/MacOS (All this difference in behaviour is because of Playwright's different implementation on Windows systems).
  • Refactored all the validation logic, which made all requests done from all browser-based fetchers faster by 8-15%
  • A New option called extra_flags has been added to DynamicFetcher and its session to allow users to add custom Chrome flags to the existing ones while launching the browser.
  • Reverted the route logic for catching responses (changed in the last version) to use the old routing version when page_action is used. This was added to collect the latest version of a page's content in case page_action changes it without making a request. (Thanks for @gembleman to pointing it in #100 and #102 )

🐛 Bug Fixes

  • Fixed a typo in load_dom in DynamicSession's async_fetch
  • Fixed an issue with Cloudflare solver that made the solver wait forever for embedded captchas that don't disappear after solving. Now it will wait for the captcha to disappear for 30 seconds, then assume it's the type that doesn't disappear (Fixes #100 )

🔨 Misc

  • Now the Docker image is automatically pushed to Dockerhub and GitHub's container registry for user convenience.
  • Added a new documentation page to show how to use Scrapeless browser with Scrapling.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.7 New feature
Notable features
  • Full Pyright type compliance
  • user_data_dir for session persistence
  • additional_args for Dynamic fetcher customization
Full changelog

A new update with many important changes

🚀 New Stuff and quality of life changes

  • Reworked solve_cloudflare argument in StealthyFetcher to make it able to solve all kinds of custom implementations of Turnstile.
  • Refactored the entire codebase to be acceptable by Pyright, so expect a flawless IDE experience now with all software and many bugs solved.
  • Refactored the requests logic to be cleaner and faster (Also solves #97 )
  • Added a new option user_data_dir to all browser-based session classes to allow the user to reuse the browser session data (cookies/storage/etc...) from previous sessions. Leaving it will cause Playwright to use a random directory on each run, as was happening before.
  • Added a new customization option additional_args to Dynamic fetcher and its session class to enable the user to pass extra arguments to Playwright's context, as we had with StealthyFetcher before.
  • The route logic for collecting the last navigation response for all browsers has been improved, which allows the raw responses to be passed to the parser before being processed by the browsers as before. This will be very helpful with text/JSON responses.

🐛 Bug Fixes

  • The rework of the route logic solved an issue with retrieving the content of unstable websites on some Windows devices.
  • All the refactors that happened in this version solved a lot of bugs along the way that were hard to spot before, and weird autocompletion issues with some IDEs.
  • Many fixes to the documentation website

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.6 New feature
Notable features
  • Docker image with all browsers
  • Streamable HTTP option for MCP server
  • Universal Cloudflare/Turnstile challenge solving
Full changelog

🚀 New Stuff

  • Improved the solve_cloudflare argument in StealthyFetcher and its session classes to be able to solve all types of both Turnstile and interstitial Cloudflare challenges 🎉
  • Now the MCP server has the option to use Streamable HTTP, so you can easily expose the server.
  • Added Docker support, so now an image is built and pushed to Docker Hub automatically with each release (contains all browsers)

🐛 Bug Fixes

  • Fixed an encoding issue with the parser that happened in some cases (the famous invalid start byte error)
  • Restructured multiple parts of the library to fix some memory leaks, so now enjoy noticably lower memory usage based on your config (Also solves #92 )
  • Improved type annotation in many parts of the code so you can have a better IDE experience (Also solves #93 )

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.5 New feature
Notable features
  • 15-20% speed improvement in browser fetchers
  • Improved stealth mode with PatchRight
Full changelog

Necessary release that fixes multiple issues

🚀 New Stuff

  • All browser-based fetchers (DynamicFetcher/StealthyFetcher/...) and their session classes are now fetching websites 15-20%:

    1. Page management is now much faster due to the logic improvement by @AbdullahY36 in #87
    2. Optimized the validation logic overall and improved page creation for sync fetches, which together introduced a lot of speed improvements
  • Big improvements to the stealth mode in DynamicFetcher and its session classes by replacing rebrowser-playwright with PatchRight:

    1. Before this update, rebrowser-playwright was turned off when you enabled stealth and real_chrome because they weren't doing well together, but now we don't have this issue with PatchRight
    2. You can now interact with Closed-Shadow Roots since PatchRight can handle them automatically.

🐛 Bug Fixes

  • Fixed a bug that happens while using the re method from the Selectors class.
  • Fixed a bug with uncurl and curl2fetcher commands in the Web Scraping Shell that made curl's --data-raw flag parse incorrectly.
  • Fixed a bug with the view command in the Web Scraping Shell that depended on the website's encoding to happen.
  • Fixed a bug with content converting that affected the mcp mode and extract commands.

New Contributors

  • @AbdullahY36 made their first contribution in #87

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.4 Bug fix
Notable features
  • Fetcher session classes available in interactive shell
Full changelog

Necessary release that fixes multiple issues

🚀 New Stuff

  • Added all the fetchers session classes to the interactive shell to be available right away without import.

🐛 Bug Fixes

  • Added a workaround for a bug with the Playwright API on Windows that happened while retrieving content while solving Cloudflare.
  • Fixed an encoding issue with the view command in the interactive shell
  • Fixed a bug with the max_pages argument in AsyncStealthySession that was crashing the code.
  • Fixed an issue that happened with the last updates that made the html_content and prettify properties in the Selector class return bytes, depending on the encoding. Both are returning strings as they were.

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.3 Breaking

Fixed browser crash from default tab removal logic

Full changelog
  • Removed the logic that is removing the default browser tab on browser-based fetchers since it caused a crashing error (Not happening on Mac, only managed to produce on Windows and Linux)

Big shoutout to our biggest Sponsors

v0.3.2 Breaking risk
Breaking changes
  • Removed max_pages parameter from sync StealthySession
Notable features
  • Optional dependency groups for fetchers
  • Per-page configuration in browser sessions
  • Enhanced .body property for file downloads
Full changelog

Release Notes for v0.3.2

🚀 New Stuff

  • Optional fetcher dependencies: All fetchers are now part of optional dependency groups, reducing core package size. So the base scrapling module is now the parser only, and to use the fetchers or the commandline options, you have to do: pip install "scrapling[fetchers]". Check out the detailed installation instructions from here

  • Per-page configuration in sessions: Session classes for browser fetchers now support individual configuration per page in sessions. All fetch-level parameters are now validated like session-level ones. More details on the documentation website here

    Example:

    with StealthySession(headless=True, solve_cloudflare=True) as session:
        page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    
  • Improved browser-based fetchers

    • A new option to control whether to wait for JavaScript execution to finish in pages or not (it's enabled by default now, as it was before)
      with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
         page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
      
    • The Stealth mode is now more reliable in DynamicFetcher and its session classes.
    • Both DynamicFetcher and StealthyFetcher are now using fewer resources (Automatically finding and closing the default tab opened by Persistent contexts in Playwright API)
    • Fixed a vital logic bug in browser-based fetchers' pages rotation - previous pages are now replaced with fresh ones. (Tabs that get reused in rotation are possibly contaminated from previous settings used on them)
    • StealthyFetcher and its session classes are now slightly faster (5%)
  • Enhanced .body property: Now returns the passed content as-is without processing, enabling file downloads and handling non-HTML requests. Below is an example of downloading a photo:

    from scrapling.fetchers import Fetcher
    
    page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/poster.png')
    with open(file='poster.png', mode='wb') as f:
       f.write(page.body)
    

🐛 Bug Fixes

  • Encoding issues resolved: Fixed multiple encoding problems that happened with some websites in parser, mcp mode, and extract commands (Also solves #80 and #81)
  • Faster parsing: Due to many changes here and there, the library is now faster, and it's reflected in the updated benchmarks

🔨 Misc

  • Updated benchmarks: Refreshed performance benchmarks to compare the current speed improvements to the latest versions of similar libraries
  • Refactored a lot of the code and replaced dead code with better implementations: Fewer code, cleaner code, easier maintenance
  • Added YouTube video: Included video content for MCP documentation.
  • A new issues template: Easy new template for users who can't use the current templates.
  • CI workflow optimization: Tests workflow now skips runs when only documentation or non-code files are changed.
  • Updated dependencies: Bumped up various dependencies to the latest versions.
  • Code style improvements: Applied new ruff rules across all files.
  • Pre-commit hooks: Updated pre-commit configuration.

🎯 Breaking Changes

  • Removed max_pages parameter from sync StealthySession to match DynamicSession (it's meaningless to have in the sync version)

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors

v0.3.1 New feature
Notable features
  • init_script argument for custom JS execution on page creation
Full changelog

Scrapling v0.3.1 release notes

  1. Fixed an issue with scrapling installation when you install it without the shell extra (#76 )
  2. Added a new argument to all browser-based fetchers and their session classes to add a JS file to be executed on page creation (#56) :
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.fetch('https://example.com', init_script="/absolute/path/to/js/script.js")

Big shoutout to our biggest Sponsors

v0.3 Breaking risk
Breaking changes
  • Python 3.10+ required
  • Adaptor renamed to Selector
  • Adaptors renamed to Selectors
Notable features
  • Session-based architecture with persistent browser tabs
  • Cloudflare Turnstile solver
  • MCP server with 6 tools
Full changelog

Scrapling v0.3.0 Release Notes

🎉 Major Release — Complete Architecture Overhaul

Scrapling v0.3 represents the most significant update in the project's history, featuring a complete architectural rewrite, considerable performance improvements, and powerful new features, including AI integration and interactive Web Scraping shell capabilities.

This release includes multiple breaking changes; please review the release notes carefully.

🚀 Major New Features

Session-Based Architecture

  • New Session Classes: Complete rewrite introducing persistent session support
    • FetcherSession - HTTP requests with persistent state management that works with both sync and async code
    • DynamicSession/AsyncDynamicSession - Browser automation while keeping the browser open till you finish
    • StealthySession/AsyncStealthySession - Stealth browsing while keeping the browser open till you finish
  • Async Browser Tabs Management: A new pool of tabs feature through the max_pages argument that rotates browser tabs for concurrent browser fetches
  • Concurrent Sessions: Run multiple isolated sessions simultaneously

Refer to the Fetching section on the website for more details.

A lot of new stealth/anti-bot Capabilities

  • 🤖 Cloudflare Solver: Automatic Cloudflare Turnstile challenge solving in StealthyFetcher and its session classes
  • Browser fingerprint impersonation: Mimic real browsers' TLS fingerprints, version-matching browser headers, HTTP/3 support, and more with the all-new Fetcher class
  • Improved stealth mode: The stealth mode for DynamicFetcher and its session classes is now more robust and reliable (AKA PlayWrightFetcher)

AI Integration & MCP Server

  • Built-in MCP Server: Model Context Protocol server for AI-assisted web scraping
  • 6 Powerful Tools: get, bulk_get, fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch
  • Smart Content Extraction: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content
  • CSS Selector Support: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
  • Anti-Bot Bypass: Handle Cloudflare Turnstile and other protections
  • Proxy Support: Use proxies for anonymity and geo-targeting
  • Browser Impersonation: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more
  • Parallel Processing: Scrape multiple URLs concurrently for efficiency
  • and more...

New Interactive Web Scraping Shell

  • A New Shell: Custom IPython shell with many smart Built-in Shortcuts like get, post, put, delete, fetch, and stealthy_fetch
  • Smart Page Management: New commands page and pages to automatically store the current page and history for all requests done through the shell
  • Curl Integration: Convert browser DevTools curl commands with uncurl and curl2fetcher functions to Fetcher requests
  • and more...

Scrape from the terminal without programming

  • New Extract Commands: Terminal-based scraping without programming
    • scrapling extract get/post/put/delete - Simple HTTP requests
    • scrapling extract fetch - Dynamic content scraping
    • scrapling extract stealthy-fetch - Anti-bot bypass
  • Downloads web pages and saves their content to files.
  • Converts HTML to readable formats like Markdown, keeps it as HTML, or just extracts the text content of the page.
  • Supports custom CSS selectors to extract specific parts of the page.
  • Handles HTTP requests and fetching through browsers.
  • Highly customizable with custom headers, cookies, proxies, and the rest of the options. Almost all the options available through the code are also accessible through the terminal.
  • and more...

🔧 Technical Improvements

Performance Enhancements

  • Fetcher is now 4 times faster - Yes you have read it right!
  • DynamicFetcher is now ~60% faster - A much faster version depending on your config (especially stealth mode)
  • StealthyFetcher is now 20–30% faster - Using the new structure, and starting to use our implementation instead of Camoufox Python interface
  • 50%+ combined speed gains across core selection methods (find_by_text, find_similar, find_by_regex, relocate, etc.) 🚀
  • ~10% CSS/XPath first methods speed increase - css_first and xpath_first are now faster than css and xpath
  • 40% faster get_all_text() method for content extraction
  • 20% speed improvement in adaptive element relocation
  • Navigation properties optimization — Properties like next, previous, below_elements, and more are now noticeably faster
  • 5x faster text cleaning operations
  • Memory efficiency improvements with optimized imports and reduced overhead
  • ⚡ Lightning-fast imports: Reduced startup time with optimized module loading
  • Better benchmarks: All the speed improvements Scrapling got made it much faster than before, compared to other libraries (1775x faster than BeautifulSoup and 5.1x faster than AutoScraper, check benchmarks)

Architecture/Code Quality, and Quality of life

  • Persistent Context: All browser-based fetchers now use persistent context by default. (Solves #64 too)
  • Using msgspec to validate all browser-based fetchers very fast before running the requests, so now it's easier to debug errors.
  • All cookies returned from fetchers are now matching the format accepted by the same fetcher. So you can retrieve cookies and pass them again to all fetchers and their session classes.
  • Faster linting and formatting due to migrating to ruff
  • Modern Build System: Migrated from setup.py to pyproject.toml 📦
  • Better GitHub actions and workflows for smoother development and testing
  • 🎨 Enhanced Type Hints: Complete type coverage with modern Python standards for better IDE support and reliability
  • Cleaner Codebase: Removed dead code and optimized core functions 🧹
  • 🚀 Backward Compatibility: Added shortcuts to maintain compatibility with older code

Breaking Changes

Minimum Python Version

  • Python 3.10+ Required: Dropped support for Python 3.9 and below

Class and Method Naming

These renamings are intended to improve clarity and consistency, particularly for new users.

  • AdaptorSelector: Core parsing class renamed (But still can be imported as Adaptor for backward compatibility)
  • AdaptorsSelectors: Collection class renamed (But still can be imported as Adaptors for backward compatibility)
  • auto_matchadaptive: Parameter renamed across all methods
  • adaptor_argumentsselector_config: Configuration parameter renamed
  • automatch_domainadaptive_domain: Domain parameter renamed
  • additional_argumentsadditional_args: Shortened parameter name
  • ⚠️ text/bodycontent: Selector constructor parameter is now accepting both str and bytes format
  • PlayWrightFetcherDynamicFetcher: Browser automation class renamed (But still can be imported as PlayWrightFetcher for backward compatibility)
  • DynamicFetcher doesn't have the NSTBrowser logic/arguments anymore since it's pointless to leave this logic now anyway.
  • StealthyFetcher's headless argument can't accept 'virtual' as an argument anymore since we are not using Camoufox's library right now in anything other than getting the browser installation path and the rest of the launch options

🐛 Bug Fixes

  • Fixed nested children counting in ignored tags for get_all_text (#61)
  • Fixed the issue with installation due to spaces in Python's executable path (#57)
  • Resolved threading issues in storage with recursion handling while the adaptive feature is enabled
  • Fixed argument precedence issues using the Sentinel pattern in FetcherSession
  • Resolved proxy type handling in StealthyFetcher
  • Fixed referer and google_search argument conflicts
  • Fixed async stealth script injection problems

🙏 Special thanks to our Discord community for all the continuous testing, feedback, and contributions across the last four months


Big shoutout to our biggest Sponsors

Beta — feedback welcome: [email protected]