Pricing

from $0.50 / 1,000 URLs processed

Website Main Content Extractor

Developer

Alam

Last modified

4 months ago

Features

🎯 Main Content Extraction - Uses readability algorithm to identify main content
🧹 Automatic Cleanup - Removes nav, sidebar, ads, footer, scripts, styles
📊 Metadata Extraction - Title, description, Open Graph tags, canonical URL
📝 Multiple Formats - Markdown, plain text, or both
⚡ Fast & Lightweight - Pure HTTP/HTML processing (no browser overhead)
🔒 Link Control - Optional link preservation or removal
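
The cleanup step in the list above can be illustrated with a minimal standard-library sketch. It is not the Actor's implementation (the Actor lists readability-lxml and beautifulsoup4 as dependencies); the `NOISE_TAGS` set and the `strip_noise` helper are assumptions made only for illustration.

```python
from html.parser import HTMLParser

# Assumption: tags treated as noise, loosely mirroring the feature list
# (nav, sidebar, footer, scripts, styles). The Actor's real rules may differ.
NOISE_TAGS = {"nav", "aside", "footer", "script", "style"}


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # greater than 0 while inside a noise element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._noise_depth == 0:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def strip_noise(html: str) -> str:
    """Hypothetical helper: return visible text outside the noise tags."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Main text.</p></article>"
    "<footer>Copyright</footer><script>track()</script></body></html>"
)
print(strip_noise(sample))
```

A readability-style extractor goes further by scoring candidate blocks to find the article body; this sketch only shows the tag-based removal.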

Use Cases

AI/RAG Applications - Feed clean text to LLMs and vector databases
Content Analysis - Extract articles for NLP, sentiment analysis
Training Data - Prepare web content for ML models
Knowledge Bases - Clean documentation for chatbots
SEO Tools - Extract page content for analysis
Article Scraping - Get clean article text from news sites and blogs
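
For the AI/RAG use case, cleaned text is usually split into chunks before it is embedded. The sketch below shows one simple way to do that by packing paragraphs greedily. The `chunk_paragraphs` helper and its `max_chars` default are hypothetical and not part of the Actor.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds at most max_chars characters, except that a single
    paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = "aaaa\n\nbbbb\n\ncccc"
print(chunk_paragraphs(demo, max_chars=10))
```

Splitting on blank lines keeps paragraphs intact, which suits Markdown output where headings and paragraphs are separated the same way.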

Input

{
  "urls": ["https://example.com", "https://example.com/page"],
  "outputFormat": "markdown",
  "preserveLinks": false,
  "includeMetadata": true,
  "maxContentLength": 100000
}

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `urls` | array | `[]` | List of URLs to process |
| `outputFormat` | string | `"markdown"` | Format: `markdown`, `plain`, or `both` |
| `preserveLinks` | boolean | `false` | Keep links in output |
| `includeMetadata` | boolean | `true` | Include page metadata |
| `maxContentLength` | integer | `100000` | Max characters per page (0 = no limit) |
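
As a rough sketch, the parameters above could be assembled and checked in Python before starting a run. The `build_run_input` and `run_actor` helpers are hypothetical, and rejecting an empty `urls` list is an assumption rather than documented behavior. `run_actor` uses the `apify-client` package; the Actor ID is not stated in this README, so it is left as an argument.

```python
ALLOWED_FORMATS = {"markdown", "plain", "both"}


def build_run_input(urls, output_format="markdown", preserve_links=False,
                    include_metadata=True, max_content_length=100_000):
    """Build the input dict, using the defaults from the parameter table."""
    if not urls:
        # Assumption: a run without URLs is not useful, so fail early.
        raise ValueError("urls must contain at least one URL")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if max_content_length < 0:
        raise ValueError("maxContentLength must be >= 0 (0 = no limit)")
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "preserveLinks": preserve_links,
        "includeMetadata": include_metadata,
        "maxContentLength": max_content_length,
    }


def run_actor(token, actor_id, run_input):
    """Run the Actor and return its dataset items.

    Requires `pip install apify-client`; imported lazily so the builder
    above works without it.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


print(build_run_input(["https://example.com"], output_format="both"))
```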

Output

{
  "url": "https://example.com",
  "status": "success",
  "cleaned_markdown": "# Article Title\n\nThis is the main content...",
  "cleaned_text": "Article Title\n\nThis is the main content...",
  "metadata": {
    "title": "Article Title",
    "description": "Page description",
    "language": "en",
    "canonical": "https://example.com/canonical",
    "og_title": "Open Graph Title",
    "og_description": "Open Graph Description",
    "og_image": "https://example.com/og-image.jpg"
  },
  "stats": {
    "word_count": 1234,
    "char_count": 5678,
    "paragraph_count": 45,
    "has_title": true,
    "has_description": true
  }
}
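
A short sketch of how a result item shaped like the one above might be consumed. The `summarize` helper is hypothetical, and the `"error"` status in the second sample is an assumption: this README only shows `"success"`.

```python
def summarize(item: dict) -> str:
    """Return a one-line summary of a single result item."""
    if item.get("status") != "success":
        return f"{item.get('url')}: failed ({item.get('status')})"
    # metadata may be absent when includeMetadata is false
    title = item.get("metadata", {}).get("title") or "(no title)"
    words = item.get("stats", {}).get("word_count", 0)
    return f"{item['url']}: {title} ({words} words)"


items = [
    {
        "url": "https://example.com",
        "status": "success",
        "cleaned_markdown": "# Article Title\n\nThis is the main content...",
        "metadata": {"title": "Article Title"},
        "stats": {"word_count": 1234},
    },
    # Assumed failure shape, for illustration only.
    {"url": "https://example.com/page", "status": "error"},
]

for item in items:
    print(summarize(item))
```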

Pricing

$0.50 per 1,000 results (pay per result)

Each URL processed counts as one result.
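
The rate above works out as a simple multiplication. The `estimate_cost` helper below is a hypothetical illustration that applies only the stated $0.50 per 1,000 results.

```python
from decimal import Decimal

PRICE_PER_1000_RESULTS = Decimal("0.50")  # USD, from the pricing line above


def estimate_cost(url_count: int) -> Decimal:
    """Estimated cost in USD for processing url_count URLs."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return PRICE_PER_1000_RESULTS * url_count / 1000


print(estimate_cost(2500))  # 2,500 URLs at $0.50 per 1,000
```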

Dependencies

Python 3.12+
beautifulsoup4
lxml
markdownify
readability-lxml
aiohttp

Local Testing

# Install dependencies
python3 -m venv venv
./venv/bin/pip install -r requirements.txt

# Run tests
./venv/bin/python test_local.py
./venv/bin/python test_cleanup.py

Development

Built for the Apify platform. See TEST_REPORT.md for test results.

License

MIT
