Website Main Content Extractor
Pricing
from $0.50 / 1,000 URLs processed
Developer: Alam (maintained by Community)
Extracts clean main content from web pages by stripping navigation, ads, footers, and sidebars. Returns structured text perfect for AI/RAG applications, LLM training data, and content analysis.
Features
- 🎯 Main Content Extraction - Uses the readability algorithm to identify main content
- 🧹 Automatic Cleanup - Removes nav, sidebar, ads, footer, scripts, styles
- 📊 Metadata Extraction - Title, description, Open Graph tags, canonical URL
- 📝 Multiple Formats - Markdown, plain text, or both (see outputFormat under Input Parameters)
- ⚡ Fast & Lightweight - Pure HTTP/HTML processing (no browser overhead)
- 🔒 Link Control - Optional link preservation or removal
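The cleanup step listed above can be illustrated with a minimal sketch. It uses only the Python standard library, not the readability-lxml and beautifulsoup4 stack listed under Dependencies, so it shows the idea rather than the Actor's actual implementation. The names `SKIP_TAGS`, `MainTextParser`, and `strip_boilerplate` are hypothetical, and ad detection (which usually depends on class or id heuristics) is not covered.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is dropped (assumed list, mirroring the features above).
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}
# Tags that start a new paragraph in the plain-text output.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div"}


class MainTextParser(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped subtree
        self.paragraphs = [[]]   # list of text-chunk lists, one per paragraph

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.paragraphs[-1].append(data)


def strip_boilerplate(html: str) -> str:
    """Return paragraph text separated by blank lines, boilerplate removed."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each paragraph and drop empty ones.
    cleaned = (" ".join("".join(chunks).split()) for chunks in parser.paragraphs)
    return "\n\n".join(p for p in cleaned if p)
```

A readability-style algorithm goes further by scoring candidate containers on text density and link ratio. This sketch only removes subtrees by tag name.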
Use Cases
- AI/RAG Applications - Feed clean text to LLMs and vector databases
- Content Analysis - Extract articles for NLP, sentiment analysis
- Training Data - Prepare web content for ML models
- Knowledge Bases - Clean documentation for chatbots
- SEO Tools - Extract page content for analysis
- Article Scraping - Get clean article text from news sites and blogs
Input
    {
      "urls": ["https://example.com", "https://example.com/page"],
      "outputFormat": "markdown",
      "preserveLinks": false,
      "includeMetadata": true,
      "maxContentLength": 100000
    }
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | array | [] | List of URLs to process |
| outputFormat | string | "markdown" | Format: markdown, plain, or both |
| preserveLinks | boolean | false | Keep links in output |
| includeMetadata | boolean | true | Include page metadata |
| maxContentLength | integer | 100000 | Max characters per page (0 = no limit) |
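The parameters above can be assembled and checked before a run. The following is a minimal sketch using the field names and defaults from the table. The class `ExtractorInput`, the helpers `to_payload` and `truncate`, and the specific validation rules (non-empty `urls`, http/https only) are assumptions for illustration. In particular, whether the Actor truncates with a hard character cut, as done here, is not documented.

```python
import json
from dataclasses import asdict, dataclass, field
from urllib.parse import urlparse

ALLOWED_FORMATS = {"markdown", "plain", "both"}  # values listed in the table


@dataclass
class ExtractorInput:
    urls: list = field(default_factory=list)
    outputFormat: str = "markdown"
    preserveLinks: bool = False
    includeMetadata: bool = True
    maxContentLength: int = 100000

    def validate(self) -> None:
        # Assumed rule: an empty URL list is treated as an error.
        if not self.urls:
            raise ValueError("urls must contain at least one URL")
        for url in self.urls:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if self.outputFormat not in ALLOWED_FORMATS:
            raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
        if self.maxContentLength < 0:
            raise ValueError("maxContentLength must be >= 0")


def to_payload(inp: ExtractorInput) -> str:
    """Validate and serialize the input as JSON."""
    inp.validate()
    return json.dumps(asdict(inp))


def truncate(text: str, max_len: int) -> str:
    """Apply the documented limit: 0 means no limit."""
    return text if max_len == 0 else text[:max_len]
```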
Output
    {
      "url": "https://example.com",
      "status": "success",
      "cleaned_markdown": "# Article Title\n\nThis is the main content...",
      "cleaned_text": "Article Title\n\nThis is the main content...",
      "metadata": {
        "title": "Article Title",
        "description": "Page description",
        "language": "en",
        "canonical": "https://example.com/canonical",
        "og_title": "Open Graph Title",
        "og_description": "Open Graph Description",
        "og_image": "https://example.com/og-image.jpg"
      },
      "stats": {
        "word_count": 1234,
        "char_count": 5678,
        "paragraph_count": 45,
        "has_title": true,
        "has_description": true
      }
    }
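The `stats` object in the output can be approximated from `cleaned_text` and the metadata. The sketch below shows one plausible way to derive the same fields. The function `compute_stats` is hypothetical, and the counting rules (whitespace-separated words, paragraphs split on blank lines) are assumptions, so the numbers may differ from what the Actor reports.

```python
import re


def compute_stats(text: str, title: str = "", description: str = "") -> dict:
    """Derive stats fields from cleaned text (assumed counting rules)."""
    # Paragraphs are taken to be blocks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "word_count": len(text.split()),      # whitespace-separated tokens
        "char_count": len(text),              # all characters, including newlines
        "paragraph_count": len(paragraphs),
        "has_title": bool(title),
        "has_description": bool(description),
    }
```

Recomputing these values on the consumer side can serve as a sanity check, for example to filter out pages whose extracted text is too short to be useful.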
Pricing
$0.50 per 1,000 results (pay per result)
Each URL processed counts as one result.
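As a quick worked example of the pricing above, the cost of a run can be estimated from the number of URLs. The helper `estimate_cost` is hypothetical, and it assumes billing is strictly linear at the listed rate with no minimum charge or rounding, which the page does not state.

```python
from decimal import Decimal

RATE_PER_1000 = Decimal("0.50")  # USD per 1,000 results, from the pricing above


def estimate_cost(url_count: int) -> Decimal:
    """Estimated cost in USD, assuming one result per URL and linear billing."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return Decimal(url_count) * RATE_PER_1000 / Decimal(1000)
```

Under these assumptions, `estimate_cost(2500)` is 1.25 USD and a single URL costs 0.0005 USD.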
Dependencies
- Python 3.12+
- beautifulsoup4
- lxml
- markdownify
- readability-lxml
- aiohttp
Local Testing
    # Install dependencies
    python3 -m venv venv
    ./venv/bin/pip install -r requirements.txt

    # Run tests
    ./venv/bin/python test_local.py
    ./venv/bin/python test_cleanup.py
Development
Built for the Apify platform. See TEST_REPORT.md for test results.
License
MIT