AI Website Content Extractor
Pricing
$5.00/month + usage
Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.
Rating
0.0 (0)
Developer
ScrapeAI (maintained by Community)
Actor stats
0 bookmarked, 2 total users, 1 monthly active user
Last modified
3 days ago
This Apify Actor crawls one or more website pages using Playwright, removes navigation, ads, and other noise, then converts the main content to clean Markdown that is ready for RAG pipelines, vector databases, and LLM training datasets.
Features
- Crawl any public website page(s)
- Automatically dismiss cookie / consent dialogs
- Strip navigation bars, headers, footers, sidebars, ads, and modals
- Detect the main content area using semantic HTML selectors (`main`, `article`, `[role="main"]`, etc.)
- Convert HTML to clean Markdown via `turndown`
- Skip low-content pages (login walls, redirects) automatically
- Output a structured dataset ready for AI use cases
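The noise-stripping and main-content detection steps above can be approximated with a short sketch. This is not the Actor's implementation, which runs Playwright and `turndown`. It uses only Python's standard library to illustrate the idea, and the names `NOISE_TAGS`, `MAIN_TAGS`, `MainTextExtractor`, and `extract_main_text` are hypothetical. For brevity it recognizes only the `main` and `article` tags, not `[role="main"]`.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the Actor's real selector list may differ.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"main", "article"}


class MainTextExtractor(HTMLParser):
    """Collect text found inside main/article, skipping nested noise elements."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside a noise element
        self.main_depth = 0   # > 0 while inside a main-content element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside main content and outside any noise element.
        if self.main_depth > 0 and self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A page with no `main` or `article` element yields an empty string here, which is one simple signal for treating a page as low-content.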
Input
| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | List of {url} objects or plain URL strings to crawl | [{url: "https://example.com"}] |
| maxPages | Number | Maximum number of pages to process | 20 |
| proxyConfiguration | Object | Apify proxy settings (optional) | {} |
Example Input

    {
      "startUrls": [
        { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" },
        { "url": "https://openai.com/blog" }
      ],
      "maxPages": 10
    }

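Because `startUrls` accepts both `{url}` objects and plain strings, it can help to normalize the input before submitting a run. The sketch below is a hypothetical client-side helper, not part of the Actor: `normalize_start_urls` and its constants are assumed names, the defaults are taken from the Input table, and capping the list at `maxPages` is an assumption about how the limit applies.

```python
from typing import Any

# Defaults copied from the Input table.
DEFAULT_START_URLS = [{"url": "https://example.com"}]
DEFAULT_MAX_PAGES = 20


def normalize_start_urls(actor_input: dict[str, Any]) -> list[str]:
    """Flatten startUrls into plain URL strings, capped at maxPages."""
    max_pages = int(actor_input.get("maxPages", DEFAULT_MAX_PAGES))
    urls: list[str] = []
    for entry in actor_input.get("startUrls", DEFAULT_START_URLS):
        # Accept both {"url": "..."} objects and plain URL strings.
        url = entry.get("url") if isinstance(entry, dict) else entry
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())
    return urls[:max_pages]
```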
Output
Each extracted page produces one dataset record:
| Field | Type | Description |
|---|---|---|
| url | String | URL of the crawled page |
| title | String | Page <title> |
| markdown | String | Clean Markdown of the main content |
| text | String | Plain-text version of the main content |
| wordCount | Number | Approximate word count of the Markdown |
| extractedAt | String | ISO 8601 timestamp |
Example Output

    {
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "title": "Artificial intelligence - Wikipedia",
      "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
      "text": "Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
      "wordCount": 4312,
      "extractedAt": "2026-03-13T08:00:00.000Z"
    }

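To make the record shape concrete, here is a minimal sketch that assembles a dictionary with the same six fields. `build_record` is a hypothetical helper, and counting whitespace-separated tokens of the Markdown is only one plausible reading of "approximate word count"; the Actor's own counting method is not documented here.

```python
from datetime import datetime, timezone
from typing import Optional


def build_record(url: str, title: str, markdown: str, text: str,
                 now: Optional[datetime] = None) -> dict:
    """Assemble one dataset record with the documented fields."""
    now = now or datetime.now(timezone.utc)
    # ISO 8601 with millisecond precision and a trailing "Z" for UTC.
    extracted_at = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "text": text,
        # Assumption: whitespace-separated tokens, so Markdown markers count too.
        "wordCount": len(markdown.split()),
        "extractedAt": extracted_at.replace("+00:00", "Z"),
    }
```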
Use Cases
- RAG pipelines — ingest Markdown directly into your vector store
- LLM fine-tuning — build clean text corpora from any website
- AI chatbots — feed domain knowledge to your assistant
- Research — extract and archive article content at scale
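For the RAG use case, the `markdown` field usually needs to be split into chunks before embedding. The following is one possible approach, not something the Actor does for you: `split_markdown_by_heading` and `max_chars` are hypothetical names, and the sketch splits naively on heading lines, so a `#` line inside a code block would also start a new chunk.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into one chunk per heading section, capped at max_chars."""
    # Zero-width split just before each ATX heading line (#, ##, ... ######).
    sections = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fall back to fixed-size slices for sections longer than max_chars.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each returned chunk can then be embedded and stored alongside the record's `url` and `title` as metadata.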
Tips
- Supply multiple `startUrls` to crawl several pages in one run
- Increase `maxPages` to crawl an entire site (combine with Apify's link-following features)
- For authenticated pages, configure a proxy or session in `proxyConfiguration`