📄 Website Content Extractor

Pricing

Pay per event

Extract clean main content from any webpage as text, markdown, or HTML. Removes navigation, ads, and scripts. Perfect for RAG pipelines, LLM training data, and content aggregation workflows.

Rating

0.0

(0)

Developer

太郎 山田

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

Extract clean main content from any webpage as text, markdown, or HTML. Removes nav, ads, scripts. Perfect for RAG pipelines and LLM training.

Store Quickstart

Start with the Quickstart template (3 demo pages, markdown output). For LLM data prep, use RAG Pipeline (200 URLs, markdown + metadata).

Key Features

  • 🧠 Readability-style extraction — Removes nav, sidebar, ads, scripts — keeps main content only
  • 📝 Multiple output formats — Plain text, markdown, or cleaned HTML
  • 🏷️ Rich metadata — Title, author, publish date, description, canonical URL
  • 📊 Word count — Per-page stats for content analysis
  • 🌐 Any webpage — Blog posts, articles, documentation, product pages
  • 🔑 No API key needed — Pure HTTP + heuristic content extraction

Use Cases

| Who | Why |
| --- | --- |
| AI engineers | Pre-process web content for LLM/RAG pipelines at scale |
| Content aggregators | Clean article extraction without ad clutter |
| Research teams | Bulk content gathering for NLP datasets |
| SEO analysts | Compare content across competitor pages |
| Accessibility auditors | Check reading-only content structure |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | (required) | URLs to extract (max 200) |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include metadata in output |

Input Example

{
  "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"],
  "outputFormat": "markdown",
  "includeMetadata": true
}
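
The constraints in the Input table can be checked before a run is started. The following is a minimal Python sketch, not part of the Actor itself: `build_input`, `ALLOWED_FORMATS`, and `MAX_URLS` are hypothetical names introduced here, and the only rules enforced are the ones the table states (required `urls`, at most 200 entries, three allowed formats).

```python
from typing import Any

# Values taken from the Input table; the constant names are illustrative only.
ALLOWED_FORMATS = {"text", "markdown", "html"}
MAX_URLS = 200


def build_input(
    urls: list[str],
    output_format: str = "markdown",
    include_metadata: bool = True,
) -> dict[str, Any]:
    """Return a dict shaped like the Input Example, or raise ValueError."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per run, got {len(urls)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(
            f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}"
        )
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
    }
```

The returned dict can be serialized with `json.dumps` and used as the run input. Lists longer than 200 URLs would need to be split across several runs.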

Output Example

{
  "url": "https://blog.example.com/post-1",
  "title": "How to Build a SaaS",
  "author": "Jane Doe",
  "publishedDate": "2026-03-15",
  "content": "In this article we explore...",
  "contentMarkdown": "# How to Build a SaaS\n\nIn this article we explore...",
  "wordCount": 2450,
  "metadata": {"description": "...", "language": "en"}
}
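
For RAG use, an output item like the one above usually gets split into smaller passages before embedding. The sketch below shows one simple way to do that in Python. It assumes items carry the `contentMarkdown`, `content`, `url`, and `title` fields from the example. The `chunk_item` helper and its `max_words` parameter are hypothetical and not something the Actor provides.

```python
def chunk_item(item: dict, max_words: int = 200) -> list[dict]:
    """Split one output item into paragraph-aligned chunks of about max_words."""
    # Prefer the markdown body; fall back to plain text if it is absent.
    text = item.get("contentMarkdown") or item.get("content") or ""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))

    # Keep the source URL and title on every chunk so answers can cite them.
    return [
        {"url": item.get("url"), "title": item.get("title"), "chunk": i, "text": c}
        for i, c in enumerate(chunks)
    ]
```

Paragraphs are never split, so a single paragraph longer than `max_words` becomes its own oversized chunk. A production pipeline would likely count tokens rather than words.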

FAQ

How is this different from apify/website-content-crawler?

This Actor does not launch a browser, which makes runs faster and cheaper. It uses plain HTTP requests plus heuristic extraction, which works well for standard server-rendered HTML sites.

Does it work on JavaScript-heavy sites?

Only server-rendered content is extracted. SPAs that render content client-side won't work.

What's the extraction accuracy?

About 90% on news, blog, and documentation pages. Product pages and complex layouts may need custom extraction.

Can I customize which elements to remove?

Not in the current version. The standard removal list is: nav, header, footer, aside, script, style, and ads.
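
To illustrate the general idea of tag-based removal, here is a minimal Python sketch using only the standard library's `html.parser`. It is not the Actor's actual implementation, which is not published in this page: `MainTextParser` and `strip_boilerplate` are hypothetical names, and ad detection is left out because it needs class or id heuristics that the page does not describe.

```python
from html.parser import HTMLParser

# Tag names from the standard removal list (ads are not covered here).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_boilerplate(html: str) -> str:
    """Return the remaining text, one text node per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

The depth counter assumes skipped elements are properly closed. Malformed markup with an unclosed `<nav>` would hide everything after it, which is one reason real extractors build a full DOM first.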

Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.005 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.005) = $5.01

No subscription required — you only pay for what you use.
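
The pricing formula above can be expressed as a small Python helper for budgeting runs. This is a sketch using the two fees listed in this section. The function and constant names are made up for illustration, and `Decimal` is used so that cent amounts stay exact.

```python
from decimal import Decimal

# Fees as listed under "Pay Per Event".
ACTOR_START_FEE = Decimal("0.01")  # flat fee per run
PER_ITEM_FEE = Decimal("0.005")    # per output item


def estimate_cost(items: int) -> Decimal:
    """Estimated USD cost of one run that produces `items` output items."""
    if items < 0:
        raise ValueError("items must be non-negative")
    return ACTOR_START_FEE + items * PER_ITEM_FEE
```

`estimate_cost(1000)` equals `Decimal("5.01")`, matching the example above. The estimate assumes a single run, so every additional run adds another $0.01 start fee.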