v0.4.8 New feature 2mo

⚠ Upgrade required

Adaptive relocation now defaults to a 40% similarity threshold; lower the threshold if needed and heed the new warning on weak matches.
Run `scrapling install --force` after updating to refresh browsers and fingerprints.

Notable features

Added `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` for URL extraction with fine‑grained controls.
Introduced `CrawlSpider` and `CrawlRule` templates to simplify "follow links matching a pattern" boilerplate.
Provided `SitemapSpider` template that seeds crawls from sitemaps or `robots.txt`, handling gzip‑compressed sitemaps.

Full changelog

A big spider update that takes the crawling framework to the next level 🕷️


🚀 New Stuff and quality of life changes

Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` to pull URLs out of a `Response`. It exposes many fine-grained controls (check the docs).
```
from scrapling.spiders import LinkExtractor

extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
```
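To make the two arguments above concrete, here is a minimal standard-library sketch of the kind of filtering that `allow` and `deny_domains` express. The `filter_links` helper is hypothetical and is not Scrapling's implementation; the real `LinkExtractor` may resolve, match, and deduplicate links differently.

```python
import re
from urllib.parse import urljoin, urlparse


def filter_links(base_url, hrefs, allow=None, deny_domains=()):
    """Hypothetical helper, not part of Scrapling.

    Keeps absolute URLs that match the `allow` regex and whose host
    is not listed in `deny_domains`.
    """
    pattern = re.compile(allow) if allow else None
    denied = {domain.lower() for domain in deny_domains}
    kept, seen = [], set()
    for href in hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        host = (urlparse(url).hostname or "").lower()
        if host in denied:
            continue  # blocked domain
        if pattern and not pattern.search(url):
            continue  # does not match the allow pattern
        if url not in seen:  # drop duplicates, keep first-seen order
            seen.add(url)
            kept.append(url)
    return kept


links = filter_links(
    "https://example.com/blog/",
    ["/posts/1", "/about", "https://ads.example.com/posts/2", "/posts/1"],
    allow=r"/posts/",
    deny_domains=["ads.example.com"],
)
print(links)
```

Only `https://example.com/posts/1` survives here: `/about` fails the pattern, the ads link is on a denied domain, and the repeated link is dropped.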

Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback (check the docs).

```
from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

class QuotesSpider(CrawlSpider):
    name = "blog"
    start_urls = ["https://quotes.toscrape.com/"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
            CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
        ]

    async def parse_author(self, response):
        yield {
            "name": response.css(".author-title::text").get(),
            "birthday": response.css(".author-born-date::text").get(),
            "url": response.url,
        }
```

Added a `SitemapSpider` template that seeds a crawl directly from sitemap or `robots.txt` URLs. It handles gzip-compressed sitemaps and offers many controls and options. URLs are dispatched via crawl rules, as shown above for `CrawlSpider` (check the docs).

```
from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

class NewsSitemap(SitemapSpider):
    name = "news"
    sitemap_urls = ["https://example.com/robots.txt"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
        ]

    async def parse_article(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```
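For readers curious what "handles gzip-compressed sitemaps" involves, the sketch below shows one common approach using only the standard library: detect the gzip magic bytes, decompress if present, then collect every `<loc>` entry. The `sitemap_locs` function is a hypothetical illustration and says nothing about how `SitemapSpider` is implemented internally.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def sitemap_locs(raw: bytes) -> list[str]:
    """Hypothetical helper: return every <loc> value from a sitemap body,
    whether or not the body is gzip-compressed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    # Compare the local tag name so the sitemap XML namespace does not matter.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


xml_body = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/articles/1</loc></url>"
    b"<url><loc>https://example.com/about</loc></url>"
    b"</urlset>"
)

plain = sitemap_locs(xml_body)
compressed = sitemap_locs(gzip.compress(xml_body))
print(plain == compressed)
```

Checking the magic bytes rather than the `.gz` file extension also covers servers that serve compressed sitemaps under a plain `.xml` URL.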

Adaptive relocation now defaults to a 40% similarity threshold, instead of 0, across all methods, which should make the adaptive feature return more reliable matches. When nothing crosses the threshold, a warning now reports the top score it did see, so you can lower the percentage deliberately if needed.
Updated all browsers and fingerprints. Run `scrapling install --force` again after updating to refresh the browsers and fingerprints.
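The threshold behavior described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `relocate` function is hypothetical, `difflib.SequenceMatcher` stands in for whatever similarity measure Scrapling actually uses, and treating a score equal to the threshold as a pass is a guess.

```python
import warnings
from difflib import SequenceMatcher


def relocate(saved, candidates, threshold=40.0):
    """Hypothetical sketch, not Scrapling's code.

    Returns the candidate most similar to `saved` if its score reaches
    `threshold` percent; otherwise warns with the top score and returns None.
    """
    scored = [
        (SequenceMatcher(None, saved, cand).ratio() * 100, cand)
        for cand in candidates
    ]
    top_score, top = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    if top is None or top_score < threshold:
        warnings.warn(
            f"No candidate reached {threshold:.0f}% similarity; "
            f"top score seen was {top_score:.1f}%"
        )
        return None
    return top


match = relocate("div.product-title", ["span.price", "div.product-title"])
print(match)
```

With `threshold=0` the same function returns its best guess no matter how weak the match is, which is the failure mode the new default is meant to avoid.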

🐛 Bug Fixes

Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
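The release note does not say what was wrong with the fingerprinting, so the following is only a generic sketch of how request fingerprints are commonly built for deduplication: canonicalize the URL, then hash it together with the method and body. The `fingerprint` function and its canonicalization rules are assumptions, not Scrapling's implementation.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hypothetical request fingerprint for deduplication."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 are treated as equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, default the path to "/", drop the fragment.
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\x00")  # separator so adjacent fields cannot blur together
    return digest.hexdigest()


seen = set()
scheduled = []
for url in [
    "https://example.com/a?b=2&a=1#top",
    "https://Example.com/a?a=1&b=2",
    "https://example.com/b",
]:
    fp = fingerprint("GET", url)
    if fp not in seen:  # schedule only requests not seen before
        seen.add(fp)
        scheduled.append(url)
print(scheduled)
```

The first two URLs differ only in parameter order, host casing, and fragment, so they share a fingerprint and only one of them is scheduled.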

Docs

Refreshed older code examples across the documentation to match the current version.
Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback

