AI Content Scraper & Cleaner
Scrapes structured content (documentation, articles, FAQs, blog posts) and converts it into clean, normalized JSON datasets for LLM training. Extracts text, detects content types, estimates tokens, and removes boilerplate to produce ready-to-use training data.
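As a rough illustration of what "clean, normalized JSON" with a token estimate could look like, here is a minimal sketch. The record field names (`url`, `title`, `text`, `estimated_tokens`), the helper names, and the four-characters-per-token heuristic are all assumptions made for this example; the Actor's actual output schema and estimation method are not described here.

```python
import json
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token, rounded up.
    # This is a common rule of thumb, not the Actor's documented method.
    return (len(text) + 3) // 4


def to_record(url: str, title: str, raw_text: str) -> dict:
    # Collapse all runs of whitespace (newlines, tabs, repeated spaces).
    text = re.sub(r"\s+", " ", raw_text).strip()
    return {
        "url": url,                      # hypothetical field name
        "title": title.strip(),          # hypothetical field name
        "text": text,                    # hypothetical field name
        "estimated_tokens": estimate_tokens(text),  # hypothetical field name
    }


record = to_record(
    "https://example.com/docs/intro",
    "  Introduction ",
    "Crawlee   is a web\nscraping library.\n",
)
print(json.dumps(record, indent=2))
```

A real tokenizer would give different counts; the heuristic only serves to size datasets approximately.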
Comma-separated list of URLs to start crawling from.
Default value of this property is "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Introduction, https://react.dev/learn, https://nextjs.org/docs/getting-started"
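Because the start URLs arrive as one comma-separated string, a consumer has to split and validate them before crawling. The sketch below shows one way this might be done; `parse_start_urls` is a hypothetical helper, and dropping entries that are not absolute http(s) URLs is an assumption rather than documented behavior.

```python
from urllib.parse import urlparse


def parse_start_urls(raw: str) -> list[str]:
    """Split a comma-separated string into a list of absolute http(s) URLs."""
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue  # skip empty entries such as a trailing comma
        parsed = urlparse(candidate)
        # Assumption: silently drop anything that is not an absolute http(s) URL.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


default_value = (
    "https://crawlee.dev/docs/introduction, "
    "https://docs.apify.com/platform/actors, "
    "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Introduction, "
    "https://react.dev/learn, "
    "https://nextjs.org/docs/getting-started"
)
start_urls = parse_start_urls(default_value)
print(len(start_urls))
```

Note that a plain comma split cannot handle URLs that themselves contain commas; such URLs would need to be percent-encoded first.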
Max requests per crawl
maxRequestsPerCrawl string Optional
Maximum number of requests allowed for this run.
Default value of this property is "50"
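Since `maxRequestsPerCrawl` is typed as a string, it must be converted to an integer before it can act as a limit. The following is a minimal sketch of that conversion; `parse_max_requests` is a hypothetical helper, and falling back to 50 for missing, non-numeric, or non-positive values is an assumption, not the Actor's documented behavior.

```python
def parse_max_requests(raw: str, default: int = 50) -> int:
    """Convert the string input to a positive integer request limit."""
    try:
        value = int(raw.strip())
    except (ValueError, AttributeError):
        # Non-numeric text or a missing (None) value: use the default.
        return default
    # Assumption: zero or negative limits are treated as invalid.
    return value if value > 0 else default


print(parse_max_requests("50"))
```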
Content CSS selectors (comma-separated)
contentSelectors string Optional
Comma-separated CSS selectors used to extract the main content.
Default value of this property is "article, .doc-content, .post-content"
Title CSS selectors (comma-separated)
titleSelectors string Optional
Comma-separated CSS selectors used to extract the page title.
Default value of this property is "h1, .post-title"
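To illustrate how comma-separated selector lists such as `contentSelectors` and `titleSelectors` might be applied, here is a minimal standard-library sketch. It assumes the selectors are tried in order and the first one that yields text wins; the Actor may instead pass the whole string to a CSS engine as a selector group. The helper names are hypothetical, and the matcher only understands bare tag names (`article`, `h1`) and single class selectors (`.post-title`), not full CSS.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def split_selectors(raw: str) -> list[str]:
    return [s.strip() for s in raw.split(",") if s.strip()]


class FirstMatchText(HTMLParser):
    """Collects the text of the first element matching a simple selector."""

    def __init__(self, selector: str):
        super().__init__()
        self.selector = selector
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the matched element has closed
        self.parts = []

    def _matches(self, tag, attrs):
        if self.selector.startswith("."):
            classes = (dict(attrs).get("class") or "").split()
            return self.selector[1:] in classes
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def extract_first(html: str, raw_selectors: str):
    """Return normalized text for the first selector that matches, else None."""
    for selector in split_selectors(raw_selectors):
        parser = FirstMatchText(selector)
        parser.feed(html)
        text = " ".join(" ".join(parser.parts).split())
        if text:
            return text
    return None


page = ('<html><body><h1 class="post-title">Hello <em>World</em></h1>'
        '<div class="doc-content"><p>First.</p><br><p>Second.</p></div>'
        '</body></html>')
print(extract_first(page, "h1, .post-title"))
print(extract_first(page, "article, .doc-content, .post-content"))
```

In practice a full CSS selector engine would replace the hand-written matcher; the point here is only the split-then-fallback order.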