📄 Website Content Extractor
Pricing
Pay per event
Extract clean main content from any webpage as text, markdown, or HTML. Removes navigation, ads, and scripts. Perfect for RAG pipelines, LLM training data, and content aggregation workflows.
Rating: 0.0 (0)
Developer: 太郎 山田
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 days ago
Store Quickstart
Start with the Quickstart template (3 demo pages, markdown output). For LLM data prep, use the RAG Pipeline template (200 URLs, markdown + metadata).
Key Features
- 🧠 Readability-style extraction — Removes nav, sidebar, ads, scripts — keeps main content only
- 📝 Multiple output formats — Plain text, markdown, or cleaned HTML
- 🏷️ Rich metadata — Title, author, publish date, description, canonical URL
- 📊 Word count — Per-page stats for content analysis
- 🌐 Any webpage — Blog posts, articles, documentation, product pages
- 🔑 No API key needed — Pure HTTP + heuristic content extraction
Use Cases
| Who | Why |
|---|---|
| AI engineers | Pre-process web content for LLM/RAG pipelines at scale |
| Content aggregators | Clean article extraction without ad clutter |
| Research teams | Bulk content gathering for NLP datasets |
| SEO analysts | Compare content across competitor pages |
| Accessibility auditors | Check reading-only content structure |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to extract (max 200) |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include metadata in output |
Input Example
{
  "urls": [
    "https://blog.example.com/post-1",
    "https://docs.example.com/guide"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true
}
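The constraints in the Input table can be checked before a run is started. The following is a minimal Python sketch, not part of the Actor itself: the helper name `build_input` is hypothetical, and the only rules it enforces are the ones stated in the table (required `urls`, at most 200 of them, and `outputFormat` limited to `text`, `markdown`, or `html`).

```python
import json

ALLOWED_FORMATS = {"text", "markdown", "html"}  # values listed in the Input table
MAX_URLS = 200                                  # limit stated in the Input table


def build_input(urls, output_format="markdown", include_metadata=True):
    """Return a run-input dict (hypothetical helper), raising ValueError on bad values."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per run, got {len(urls)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
    }


run_input = build_input(
    ["https://blog.example.com/post-1", "https://docs.example.com/guide"]
)
print(json.dumps(run_input, indent=2))
```

Validating locally like this avoids paying the start fee for a run that would be rejected for an oversized URL list or a misspelled format.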
Output Example
{
  "url": "https://blog.example.com/post-1",
  "title": "How to Build a SaaS",
  "author": "Jane Doe",
  "publishedDate": "2026-03-15",
  "content": "In this article we explore...",
  "contentMarkdown": "# How to Build a SaaS\n\nIn this article we explore...",
  "wordCount": 2450,
  "metadata": {
    "description": "...",
    "language": "en"
  }
}
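For RAG or dataset preparation, each output item usually needs to be reduced to a text body plus a few source fields. Here is a rough Python sketch of that step. The field names come from the output example; the function `to_rag_record` and the shape of the record it returns are assumptions made for illustration, not something the Actor produces.

```python
def to_rag_record(item):
    """Map one output item to a flat record for indexing (hypothetical shape)."""
    # Prefer the markdown body when present, otherwise fall back to plain text.
    text = item.get("contentMarkdown") or item.get("content", "")
    metadata = item.get("metadata") or {}
    return {
        "id": item["url"],
        "text": text,
        "title": item.get("title"),
        "author": item.get("author"),
        "published": item.get("publishedDate"),
        "language": metadata.get("language"),
        "word_count": item.get("wordCount", 0),
    }


example_item = {
    "url": "https://blog.example.com/post-1",
    "title": "How to Build a SaaS",
    "author": "Jane Doe",
    "publishedDate": "2026-03-15",
    "content": "In this article we explore...",
    "contentMarkdown": "# How to Build a SaaS\n\nIn this article we explore...",
    "wordCount": 2450,
    "metadata": {"description": "...", "language": "en"},
}

record = to_rag_record(example_item)
print(record["id"], record["word_count"])
```

Using `.get` for everything except `url` keeps the mapping tolerant of items where metadata was not found or `includeMetadata` was turned off.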
FAQ
How is this different from apify/website-content-crawler?
This Actor does not launch a browser, which makes it much faster and cheaper. It uses plain HTTP requests and heuristic extraction, which works well for standard HTML sites.
Does it work on JavaScript-heavy sites?
Only server-rendered content is extracted. SPAs that render content client-side won't work.
What's the extraction accuracy?
Roughly 90% for news, blog, and documentation pages. Product pages and complex layouts may need custom extraction.
Can I customize which elements to remove?
Not in the current version. The standard removal list is: nav, header, footer, aside, script, style, and ads.
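To make the removal list concrete, here is a small Python sketch of tag-based stripping using only the standard library. It illustrates the general idea and is not the Actor's actual implementation: the class name `MainTextExtractor` is hypothetical, it only handles the six tag names listed above, and ad detection (which needs class or id heuristics) is left out.

```python
from html.parser import HTMLParser

# Tag names taken from the standard removal list above (ads are not covered here).
REMOVED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any removed tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside at least one removed element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


page = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<script>var x = 1;</script><footer>Copyright</footer></body></html>"
)
extractor = MainTextExtractor()
extractor.feed(page)
print(extractor.text())  # Title Body text.
```

Because a sketch like this only sees the HTML returned by the server, it shares the limitation noted above for client-side rendered pages.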
Related Actors
News & Content cluster — explore related Apify tools:
- 📰 Google News Scraper — Scrape Google News articles for any search query via official RSS feed.
- 📰 Article Extractor — Extract clean article content with title, author, publish date, images from news and blog pages.
- 📡 RSS Feed Aggregator — Aggregate multiple RSS and Atom feeds with keyword filtering and deduplication.
- 📰 Hacker News Scraper — Fetch Hacker News top, new, best, ask, show, job stories via official Firebase API.
Cost
Pay Per Event:
- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.005 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.005) = $5.01
No subscription required — you only pay for what you use.
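The pricing arithmetic can be written as a small Python helper for budgeting. This is a sketch built on the two event prices listed above. The function name `estimate_cost` and the `runs` parameter are assumptions: if a job is split across several runs, each run would add its own start fee.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.01")    # flat fee per run, from the price list
DATASET_ITEM_FEE = Decimal("0.005")  # fee per output item, from the price list


def estimate_cost(items, runs=1):
    """Estimate the total cost in USD for `items` output items over `runs` runs."""
    return ACTOR_START_FEE * runs + DATASET_ITEM_FEE * items


print(f"${estimate_cost(1000):.2f}")          # $5.01
print(f"${estimate_cost(1000, runs=5):.2f}")  # $5.05
```

`Decimal` is used instead of floats so that the cent-level amounts add up exactly.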