πŸ“„ Web Content Extractor

Extract clean text, markdown, and HTML from websites. Scrape article details and page content to feed AI models, RAG pipelines, and LLM applications.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: ε€ͺιƒŽ ε±±η”° (maintained by Community)
Actor stats: 0 bookmarks Β· 5 total users Β· 1 monthly active user Β· last modified 18 hours ago

πŸ“„ Website Content Extractor

Extract clean main content from any webpage as text, markdown, or HTML. Removes nav, ads, scripts. Perfect for RAG pipelines and LLM training.

Store Quickstart

Start with the Quickstart template (3 demo pages, markdown output). For LLM data prep, use the RAG Pipeline template (200 URLs, markdown + metadata).

Key Features

  • 🧠 Readability-style extraction β€” Removes nav, sidebar, ads, scripts β€” keeps main content only
  • πŸ“ Multiple output formats β€” Plain text, markdown, or cleaned HTML
  • 🏷️ Rich metadata β€” Title, author, publish date, description, canonical URL
  • πŸ“Š Word count β€” Per-page stats for content analysis
  • 🌐 Any webpage β€” Blog posts, articles, documentation, product pages
  • πŸ”‘ No API key needed β€” Pure HTTP + heuristic content extraction

Use Cases

| Who | Why |
| --- | --- |
| AI engineers | Pre-process web content for LLM/RAG pipelines at scale |
| Content aggregators | Clean article extraction without ad clutter |
| Research teams | Bulk content gathering for NLP datasets |
| SEO analysts | Compare content across competitor pages |
| Accessibility auditors | Check reading-only content structure |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | (required) | URLs to extract (max 200) |
| outputFormat | string | markdown | `text`, `markdown`, or `html` |
| includeMetadata | boolean | true | Include metadata in output |

Input Example

```json
{
  "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"],
  "outputFormat": "markdown",
  "includeMetadata": true
}
```
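A small client-side check against the documented constraints (max 200 URLs, three output formats) can catch bad input before a run is started. The helper below is a hypothetical sketch, not part of the actor:

```python
ALLOWED_FORMATS = {"text", "markdown", "html"}
MAX_URLS = 200  # documented input limit

def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems with a run input; empty list means valid."""
    problems = []
    urls = run_input.get("urls")
    if not urls:
        problems.append("urls is required and must be non-empty")
    elif len(urls) > MAX_URLS:
        problems.append(f"urls exceeds the {MAX_URLS}-URL limit")
    # outputFormat defaults to markdown per the input table above.
    fmt = run_input.get("outputFormat", "markdown")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return problems
```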

Output

| Field | Type | Description |
| --- | --- | --- |
| url | string | Page URL |
| title | string | Extracted page title |
| content | string | Main content body (markdown, html, or text per outputFormat) |
| wordCount | integer | Word count of extracted content |
| language | string | Detected language code |
| publishedDate | string | ISO date if metadata available |
| author | string | Author name if metadata available |
| images | string[] | Image URLs found in main content |

Output Example

```json
{
  "url": "https://blog.example.com/post-1",
  "title": "How to Build a SaaS",
  "author": "Jane Doe",
  "publishedDate": "2026-03-15",
  "content": "In this article we explore...",
  "contentMarkdown": "# How to Build a SaaS\n\nIn this article we explore...",
  "wordCount": 2450,
  "metadata": {"description": "...", "language": "en"}
}
```
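Downstream, dataset items can be filtered before ingestion. The sketch below assumes the top-level `wordCount` and `language` fields from the output table; the filter itself is illustrative, not part of the actor:

```python
def select_for_rag(items, min_words=100, language="en"):
    """Keep items that are long enough and in the target language.

    `items` is a list of dicts shaped like the actor's output schema
    (assumption: `wordCount` and `language` are top-level fields).
    """
    return [
        it for it in items
        if it.get("wordCount", 0) >= min_words
        and it.get("language", language) == language
    ]
```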

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console β†’ Settings β†’ Integrations.

cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~website-content-extractor/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"], "outputFormat": "markdown", "includeMetadata": true }'
```

Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/website-content-extractor").call(run_input={
    "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"],
    "outputFormat": "markdown",
    "includeMetadata": True,  # Python booleans are capitalized
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```

JavaScript / Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/website-content-extractor').call({
    urls: ['https://blog.example.com/post-1', 'https://docs.example.com/guide'],
    outputFormat: 'markdown',
    includeMetadata: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

Tips & Limitations

  • Use outputFormat: "markdown" for LLM/RAG ingestion β€” preserves structure without HTML noise.
  • Set includeMetadata: true to capture publish date, author, and OpenGraph data.
  • Concurrency 5 is a safe default. Increase to 10 only on bandwidth-rich sites.
  • Pair with a vector store to build a searchable knowledge base from any website.

FAQ

How is this different from apify/website-content-crawler?

No headless browser is involved, so runs are much faster and cheaper. This actor uses plain HTTP requests plus heuristic extraction, which works well for standard server-rendered HTML sites.

Does it work on JavaScript-heavy sites?

Only server-rendered content is extracted. SPAs that render content client-side won't work.

What's the extraction accuracy?

~90% for news/blog/docs. Product pages and complex layouts may need custom extraction.

Can I customize which elements to remove?

Not in current version. Standard removal: nav, header, footer, aside, script, style, ads.

Can I exclude navigation and ads?

Yes β€” the actor uses readability heuristics to extract main content and drop boilerplate.

Cost

Pay Per Event:

  • actor-start: $0.01 (flat fee per run)
  • dataset-item: $0.005 per output item

Example: 1,000 items = $0.01 + (1,000 Γ— $0.005) = $5.01

No subscription required β€” you only pay for what you use.
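The pricing arithmetic above can be expressed as a small helper. Constants are taken from the event list; the function itself is illustrative, not an official calculator:

```python
ACTOR_START_USD = 0.01  # flat fee per run
PER_ITEM_USD = 0.005    # per output dataset item

def run_cost(items: int) -> float:
    """Estimated cost in USD of one run producing `items` dataset items."""
    return round(ACTOR_START_USD + items * PER_ITEM_USD, 2)
```

For example, `run_cost(1000)` reproduces the $5.01 figure above.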