This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
ReleasePort's take
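Moderate signal: Scrapling 0.4.8 adds the LinkExtractor primitive and the CrawlSpider and SitemapSpider templates, plus fixes for request fingerprinting and for Fetcher.configure settings not being applied to per-request calls.
Why it matters: The fingerprinting bug caused duplicate requests in spiders, so upgrade promptly if you are affected. The new templates remove link-following boilerplate. Adaptive relocation now defaults to a 40% similarity threshold instead of 0, so weak matches produce a warning rather than a best-guess element. Test the templates and the new threshold in a development environment before rollout.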
Summary
AI summary: Added LinkExtractor, CrawlSpider/CrawlRule, and SitemapSpider templates, plus a change to the adaptive relocation threshold.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Added LinkExtractor primitive to scrapling.spiders.LinkExtractor for URL extraction. | None |
| Feature | Medium | Introduced CrawlSpider and CrawlRule templates for automated link following. | None |
| Feature | Medium | Added SitemapSpider template to crawl from sitemaps or robots.txt URLs. | None |
| Performance | Medium | Adaptive relocation now defaults to 40% similarity threshold for better accuracy. | None |
| Bugfix | Medium | Fixed Fetcher.configure not applying to per-request calls; also fixed in AsyncFetcher. | None |
| Bugfix | Medium | Resolved incorrect request fingerprinting causing duplicate requests in spiders. | None |
| Bugfix | Medium | Fixed Adaptive scraping engine staying silent on weak matches; now warns instead. | None |

Source for all entries: llm_adapter@2026-05-21. Confidence: high.
Full changelog
A big spider update that takes the crawling framework to the next level 🕷️
🚀 New Stuff and quality of life changes
- Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` to pull URLs out of a `Response`. It has many controls (check the docs).

  ```python
  from scrapling.spiders import LinkExtractor

  extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
  ```

- Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback (check the docs).

  ```python
  from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

  class QuotesSpider(CrawlSpider):
      name = "blog"
      start_urls = ["https://quotes.toscrape.com/"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
              CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
          ]

      async def parse_author(self, response):
          yield {
              "name": response.css(".author-title::text").get(),
              "birthday": response.css(".author-born-date::text").get(),
              "url": response.url,
          }
  ```

- Added a `SitemapSpider` template that seeds a crawl directly from sitemap or `robots.txt` URLs. It handles gzip-compressed sitemaps and offers many controls and options. URLs are dispatched via crawl rules, as shown above for `CrawlSpider` (check the docs).

  ```python
  from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

  class NewsSitemap(SitemapSpider):
      name = "news"
      sitemap_urls = ["https://example.com/robots.txt"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
          ]

      async def parse_article(self, response):
          yield {"url": response.url, "title": response.css("h1::text").get()}
  ```

- Adaptive relocation now defaults to a 40% similarity threshold instead of `0` across all methods. This will make the adaptive feature work better. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower `percentage` deliberately if needed.
- Updated all browsers and fingerprints. Run `scrapling install --force` after updating to refresh the browsers and fingerprints.
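The threshold behaviour described for adaptive relocation can be pictured with a small stand-alone sketch. It does not call Scrapling and does not reproduce its scoring: `difflib.SequenceMatcher` stands in as an assumed similarity measure, and `relocate`, `THRESHOLD`, and the sample strings are hypothetical names chosen for illustration. Only the 40% default, the warning that reports the top score, and the `percentage` knob come from the notes above.

```python
import warnings
from difflib import SequenceMatcher

# Assumed default, mirroring the 40% threshold named in the release notes.
THRESHOLD = 40.0


def relocate(saved, candidates, percentage=THRESHOLD):
    """Return candidates scoring at least `percentage`, best first.

    Hypothetical helper: SequenceMatcher is a stand-in similarity measure,
    not Scrapling's actual scoring.
    """
    scored = sorted(
        ((SequenceMatcher(None, saved, c).ratio() * 100, c) for c in candidates),
        reverse=True,
    )
    matches = [c for score, c in scored if score >= percentage]
    if not matches and scored:
        # Nothing crossed the threshold: report the best score that was seen.
        top = scored[0][0]
        warnings.warn(
            f"No candidate reached {percentage:.0f}% similarity; "
            f"top score was {top:.1f}%"
        )
    return matches


if __name__ == "__main__":
    saved = '<div class="price main">'
    candidates = ['<div class="price primary">', '<footer id="legal">']
    print(relocate(saved, candidates))
```

With a threshold of 0, every candidate would qualify and the caller could not tell a strong match from a weak one. A non-zero default turns a weak match into an explicit warning, and the reported top score gives a basis for lowering `percentage` on purpose.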
🐛 Bug Fixes
- Fixed `Fetcher.configure(...)` not applying to per-request calls. Same fix applied to `AsyncFetcher`.
- Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
- Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
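For readers unfamiliar with why a fingerprinting bug leads to duplicate requests, here is a minimal sketch of the general technique. It is not Scrapling's implementation and says nothing about what #255 changed: `fingerprint`, `SeenFilter`, and the canonicalization rules (lower-cased host, sorted query parameters, dropped fragment) are assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url, body=b""):
    """Hash a request so that equivalent requests map to the same key.

    Hypothetical canonicalization: lower-case scheme and host, sort query
    parameters, drop the fragment. Not Scrapling's implementation.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha1()
    for piece in (method.upper().encode(), canonical.encode(), body):
        digest.update(piece)
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


class SeenFilter:
    """Remember fingerprints and report whether a request is new."""

    def __init__(self):
        self._seen = set()

    def is_new(self, method, url, body=b""):
        key = fingerprint(method, url, body)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    seen = SeenFilter()
    for url in ("https://example.com/a?b=2&a=1", "https://example.com/a?a=1&b=2"):
        print(url, "new" if seen.is_new("GET", url) else "duplicate")
```

If the fingerprint differs for two requests that are actually the same, the filter lets both through and the spider fetches the page twice, which is the symptom the fix above addresses.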
Docs
- Refreshed older code examples across the documentation to match the current version.
- Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.
🙏 Special thanks to the community for all the continuous testing and feedback
Big shoutout to our Platinum Sponsors