Website Main Content Extractor

Pricing

from $0.50 / 1,000 URLs processed

Rating

0.0 (0)

Developer

Alam (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 days ago

Extracts the main content of web pages by stripping navigation, ads, footers, and sidebars. Returns structured text suited to AI/RAG applications, LLM training data, and content analysis.

Features

  • 🎯 Main Content Extraction - Uses a readability algorithm to identify the main content
  • 🧹 Automatic Cleanup - Removes nav, sidebar, ads, footer, scripts, styles
  • 📊 Metadata Extraction - Title, description, Open Graph tags, canonical URL
  • 📝 Multiple Formats - Markdown, plain text, or both
  • Fast & Lightweight - Pure HTTP/HTML processing (no browser overhead)
  • 🔒 Link Control - Optional link preservation or removal
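
The cleanup and metadata steps listed above can be sketched roughly as follows. This is a minimal illustration only: it uses Python's standard-library `html.parser` rather than the readability-lxml and BeautifulSoup stack named under Dependencies, and the names `MainTextParser`, `extract_main_text`, `SKIP_TAGS`, and `BLOCK_TAGS` are hypothetical, not part of the Actor.

```python
from html.parser import HTMLParser

# Assumption: subtrees under these tags are dropped entirely. The Actor's
# real cleanup rules may be broader (for example class-based ad detection).
SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}
# Tags treated as paragraph boundaries when rebuilding plain text.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div", "article", "section", "br"}
META_KEYS = {"description", "og:title", "og:description", "og:image"}


class MainTextParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.in_title = False
        self.parts = []       # text fragments and paragraph breaks
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key in META_KEYS:
                self.metadata[key.replace("og:", "og_")] = attrs.get("content", "")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.metadata["canonical"] = attrs.get("href", "")
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.in_title:
            self.metadata["title"] = data.strip()
        elif self.skip_depth == 0:
            self.parts.append(data)


def extract_main_text(html: str) -> dict:
    """Return plain text outside skipped subtrees plus basic page metadata."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each line, then drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    text = "\n\n".join(line for line in lines if line)
    return {"cleaned_text": text, "metadata": parser.metadata}
```

A real readability algorithm additionally scores candidate blocks by text density and link ratio, so it can discard boilerplate that is not wrapped in semantic tags; the tag-based filter here only covers the simple case.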

Use Cases

  • AI/RAG Applications - Feed clean text to LLMs and vector databases
  • Content Analysis - Extract articles for NLP, sentiment analysis
  • Training Data - Prepare web content for ML models
  • Knowledge Bases - Clean documentation for chatbots
  • SEO Tools - Extract page content for analysis
  • Article Scraping - Get clean article text from news sites and blogs
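
For the AI/RAG use case, extracted text usually has to be split into chunks before it is embedded. The sketch below shows one possible approach, greedily packing blank-line-separated paragraphs into chunks under a character budget. The function name `chunk_text` and the `max_chars` default are assumptions for illustration; the Actor itself does not perform chunking.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: flush what has been collected so far.
        if current:
            chunks.append(current)
        # A single paragraph longer than the budget is hard-split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps each chunk semantically coherent, which tends to matter more for retrieval quality than hitting an exact chunk size.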

Input

{
  "urls": ["https://example.com", "https://example.com/page"],
  "outputFormat": "markdown",
  "preserveLinks": false,
  "includeMetadata": true,
  "maxContentLength": 100000
}

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | array | [] | List of URLs to process |
| outputFormat | string | "markdown" | Format: markdown, plain, or both |
| preserveLinks | boolean | false | Keep links in output |
| includeMetadata | boolean | true | Include page metadata |
| maxContentLength | integer | 100000 | Max characters per page (0 = no limit) |
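
To make the defaults and the `0 = no limit` rule concrete, here is a minimal sketch of how a caller might normalize input before starting a run. The helpers `normalize_input` and `apply_length_limit` are hypothetical. The Actor's own validation is not documented here, and truncating by simple slicing is an assumption about how `maxContentLength` is applied.

```python
from urllib.parse import urlparse

# Defaults copied from the parameter table above.
DEFAULTS = {
    "urls": [],
    "outputFormat": "markdown",
    "preserveLinks": False,
    "includeMetadata": True,
    "maxContentLength": 100000,
}
ALLOWED_FORMATS = {"markdown", "plain", "both"}


def normalize_input(raw: dict) -> dict:
    """Merge caller input over the documented defaults and validate it."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}
    if not cfg["urls"]:
        raise ValueError("urls must contain at least one URL")
    for url in cfg["urls"]:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if cfg["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if cfg["maxContentLength"] < 0:
        raise ValueError("maxContentLength must be 0 (no limit) or positive")
    return cfg


def apply_length_limit(text: str, max_length: int) -> str:
    """Truncate to max_length characters; 0 means no limit."""
    return text if max_length == 0 else text[:max_length]
```

For example, `normalize_input({"urls": ["https://example.com"]})` fills in the four remaining parameters from the table.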

Output

{
  "url": "https://example.com",
  "status": "success",
  "cleaned_markdown": "# Article Title\n\nThis is the main content...",
  "cleaned_text": "Article Title\n\nThis is the main content...",
  "metadata": {
    "title": "Article Title",
    "description": "Page description",
    "language": "en",
    "canonical": "https://example.com/canonical",
    "og_title": "Open Graph Title",
    "og_description": "Open Graph Description",
    "og_image": "https://example.com/og-image.jpg"
  },
  "stats": {
    "word_count": 1234,
    "char_count": 5678,
    "paragraph_count": 45,
    "has_title": true,
    "has_description": true
  }
}
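
Downstream code will typically filter these records before using them. The sketch below, built only on the field names shown in the sample output, keeps successful items with enough text and picks a title. The helper `usable_items` and the `min_words` threshold are hypothetical, and since the sample shows only `"success"`, every other `status` value is treated as a failure by assumption.

```python
from typing import Iterable, Iterator


def usable_items(items: Iterable[dict], min_words: int = 50) -> Iterator[dict]:
    """Yield a trimmed record for each successful, non-trivial result."""
    for item in items:
        # Assumption: any status other than "success" means the URL failed.
        if item.get("status") != "success":
            continue
        if item.get("stats", {}).get("word_count", 0) < min_words:
            continue
        metadata = item.get("metadata", {})
        yield {
            "url": item["url"],
            # Fall back from the page title to the Open Graph title to the URL.
            "title": metadata.get("title") or metadata.get("og_title") or item["url"],
            # Prefer Markdown when present, otherwise the plain-text field.
            "text": item.get("cleaned_markdown") or item.get("cleaned_text") or "",
        }
```

Using `.get()` with fallbacks keeps the loop tolerant of runs where `includeMetadata` is false or only one output format was requested.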

Pricing

$0.50 per 1,000 results (pay per result)

Each URL processed counts as one result.
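
As a quick worked example of the arithmetic, the following sketch estimates the cost of a run. It assumes a flat rate of $0.50 per 1,000 results with no tiered discounts or additional platform charges; `estimate_cost` is a hypothetical helper, not part of the Actor.

```python
from decimal import Decimal

# Assumption: flat rate taken from the pricing line above.
PRICE_PER_1000_RESULTS = Decimal("0.50")


def estimate_cost(url_count: int) -> Decimal:
    """Estimated cost in USD for processing url_count URLs (one result each)."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return PRICE_PER_1000_RESULTS * url_count / 1000
```

Under these assumptions, 25,000 URLs cost 25,000 × 0.50 / 1,000 = $12.50. `Decimal` is used so that currency values are not subject to binary floating-point rounding.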

Dependencies

  • Python 3.12+
  • beautifulsoup4
  • lxml
  • markdownify
  • readability-lxml
  • aiohttp

Local Testing

# Install dependencies
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
# Run tests
./venv/bin/python test_local.py
./venv/bin/python test_cleanup.py

Development

Built for the Apify platform. See TEST_REPORT.md for test results.

License

MIT