We are pushing daily updates and improvements.
Notable changes to this actor and our web scraping framework will be documented here.
- Added graph mode to all crawlers. This combines structured extraction for item pages with capturing the relationships between all pages of the website
- Improved extracted item metadata to always include the page title, page kind (home/list/item/other) and the list of outbound links, in addition to the URL and extraction timestamp
- Implemented direct API scraping for item pages to replace JS rendering. This combines the speed of simple HTTP requests with the richness of a fully rendered page
- Reduced cold start time for new jobs
- Increased extraction speed by ~10%
- Improved crawler stealth
- Improved Chrome fingerprint spoofing
- Improved HTTP client impersonation
- Added direct API scraping for paginated list pages (alternative to Chrome rendering with infinite scroll), resulting in a 10x speedup for crawling catalog pages
- Improved URL detection logic for different page types
- Added universal extraction modes, in addition to structured (schema-driven) extraction
- Improved crawler stealth and robustness to anti-bot scripts
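
To illustrate how the item metadata and graph mode entries above fit together, here is a minimal Python sketch. It is not the actor's actual output schema: the `PageRecord` class, its field names, and the `build_link_graph` helper are all hypothetical, chosen only to mirror the fields listed in the changelog (URL, extraction timestamp, page title, page kind, outbound links).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

# Page kinds as listed in the changelog entry.
PageKind = Literal["home", "list", "item", "other"]


@dataclass
class PageRecord:
    """Hypothetical shape of one extracted page (field names are assumptions)."""
    url: str
    extracted_at: datetime
    title: str
    page_kind: PageKind
    outbound_links: list[str] = field(default_factory=list)


def build_link_graph(records: list[PageRecord]) -> dict[str, list[str]]:
    """Build an adjacency list restricted to pages that were actually crawled."""
    known = {r.url for r in records}
    return {
        r.url: [link for link in r.outbound_links if link in known]
        for r in records
    }


now = datetime.now(timezone.utc)
records = [
    PageRecord("https://example.com/", now, "Home", "home",
               ["https://example.com/catalog", "https://other.example/"]),
    PageRecord("https://example.com/catalog", now, "Catalog", "list",
               ["https://example.com/item/1"]),
    PageRecord("https://example.com/item/1", now, "Item 1", "item",
               ["https://example.com/"]),
]

graph = build_link_graph(records)
for source, targets in graph.items():
    print(source, "->", targets)
```

In this sketch, links to pages outside the crawled set (such as `https://other.example/`) are dropped from the graph, so every edge connects two records that carry full metadata.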