Developer: 太陽山田 · Pricing: Pay per event
# Website Content Extractor
Extract clean main content from any webpage as text, markdown, or HTML. Removes nav, ads, scripts. Perfect for RAG pipelines and LLM training.
## Store Quickstart

Start with the Quickstart preset (3 demo pages, markdown output). For LLM data prep, use the RAG Pipeline preset (200 URLs, markdown + metadata).
## Key Features

- **Readability-style extraction**: removes nav, sidebars, ads, and scripts, keeping only the main content
- **Multiple output formats**: plain text, markdown, or cleaned HTML
- **Rich metadata**: title, author, publish date, description, canonical URL
- **Word count**: per-page stats for content analysis
- **Any webpage**: blog posts, articles, documentation, product pages
- **No API key needed**: pure HTTP + heuristic content extraction
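The "heuristic content extraction" named above can be illustrated with a minimal, stdlib-only sketch. This is not the actor's actual implementation; the tag set and the simple depth counter are my assumptions:

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate and skipped.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class _MainTextParser(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees.

    Sketch only: nesting is tracked with a plain counter, so unclosed
    void tags inside a skipped subtree would confuse it.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._skip_depth:
            self._skip_depth += 1      # nested tag inside a skipped subtree
        elif tag in BOILERPLATE_TAGS:
            self._skip_depth = 1       # enter a boilerplate subtree

    def handle_endtag(self, tag):
        if self._skip_depth:
            self._skip_depth -= 1      # leave (part of) a skipped subtree

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the page's visible main text with boilerplate removed."""
    parser = _MainTextParser()
    parser.feed(html)
    return " ".join(parser._parts)
```

Real readability implementations also score blocks by text density and link ratio; this sketch only shows the tag-pruning half.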
## Use Cases
| Who | Why |
|---|---|
| AI engineers | Pre-process web content for LLM/RAG pipelines at scale |
| Content aggregators | Clean article extraction without ad clutter |
| Research teams | Bulk content gathering for NLP datasets |
| SEO analysts | Compare content across competitor pages |
| Accessibility auditors | Check reading-only content structure |
## Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to extract (max 200) |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include metadata in output |
## Input Example

```json
{
  "urls": [
    "https://blog.example.com/post-1",
    "https://docs.example.com/guide"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true
}
```
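The input constraints from the table above (at most 200 URLs, three allowed formats) can be enforced client-side before calling the API. `build_run_input` is a hypothetical helper, not part of the actor or the Apify SDK:

```python
ALLOWED_FORMATS = {"text", "markdown", "html"}
MAX_URLS = 200  # per-run limit from the input schema


def build_run_input(urls, output_format="markdown", include_metadata=True):
    """Validate and assemble the actor's run input (illustrative helper)."""
    if not urls:
        raise ValueError("at least one URL is required")
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per run")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
    }
```

Failing fast locally avoids paying the flat actor-start fee for a run that would only error out.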
## Output

| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Extracted page title |
| content | string | Main content body (markdown, html, or text per `outputFormat`) |
| wordCount | integer | Word count of extracted content |
| language | string | Detected language code |
| publishedDate | string | ISO date, if metadata available |
| author | string | Author name, if metadata available |
| images | string[] | Image URLs found in main content |
## Output Example

```json
{
  "url": "https://blog.example.com/post-1",
  "title": "How to Build a SaaS",
  "author": "Jane Doe",
  "publishedDate": "2026-03-15",
  "content": "In this article we explore...",
  "contentMarkdown": "# How to Build a SaaS\n\nIn this article we explore...",
  "wordCount": 2450,
  "metadata": {"description": "...", "language": "en"}
}
```
## API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~website-content-extractor/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"],
    "outputFormat": "markdown",
    "includeMetadata": true
  }'
```
### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("taroyamada/website-content-extractor").call(run_input={
    "urls": ["https://blog.example.com/post-1", "https://docs.example.com/guide"],
    "outputFormat": "markdown",
    "includeMetadata": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
### JavaScript / Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/website-content-extractor').call({
    urls: ['https://blog.example.com/post-1', 'https://docs.example.com/guide'],
    outputFormat: 'markdown',
    includeMetadata: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
## Tips & Limitations

- Use `outputFormat: "markdown"` for LLM/RAG ingestion; it preserves document structure without HTML noise.
- Set `includeMetadata: true` to capture publish date, author, and OpenGraph data.
- A concurrency of 5 is a safe default; increase to 10 only on bandwidth-rich sites.
- Pair the output with a vector store to build a searchable knowledge base from any website.
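For the vector-store pairing mentioned in the tips, extracted markdown is typically split into overlapping chunks before embedding. A minimal word-window chunker; the sizes are arbitrary defaults of mine, not actor parameters:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100):
    """Split text into overlapping word-window chunks for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```

Word windows ignore sentence and heading boundaries; a production pipeline would usually split on the markdown structure the actor preserves.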
## FAQ

**How is this different from apify/website-content-crawler?**
No browser means it is much faster and cheaper. This actor uses plain HTTP plus heuristic extraction, which works well for standard server-rendered HTML sites.

**Does it work on JavaScript-heavy sites or SPAs?**
Only server-rendered content is extracted. SPAs that render content client-side won't work; use a browser-based scraper for those.

**What's the extraction accuracy?**
Roughly 90% for news, blog, and documentation pages. Product pages and complex layouts may need custom extraction.

**Can I customize which elements to remove?**
Not in the current version. The standard removal set is nav, header, footer, aside, script, style, and ads.

**Can I exclude navigation and ads?**
Yes; the actor uses readability heuristics to extract the main content and drop boilerplate automatically.
## Related Actors

Explore related Apify tools in the News & Content cluster:

- **Google News Scraper**: scrape Google News articles for any search query via the official RSS feed.
- **Article Extractor**: extract clean article content (title, author, publish date, images) from news and blog pages.
- **RSS Feed Aggregator**: aggregate multiple RSS and Atom feeds with keyword filtering and deduplication.
- **Hacker News Scraper**: fetch Hacker News top, new, best, ask, show, and job stories via the official Firebase API.
- **Reddit All-in-One Scraper**: scrape Reddit subreddits, posts, comments, user profiles, and search results via public JSON endpoints.
- **Reddit Keyword Monitor Alerts**: a focused Reddit keyword and subreddit monitor built for recurring alerts, snapshot diffing, and webhook handoff.
## Cost

Pay per event:

- actor-start: $0.01 (flat fee per run)
- dataset-item: $0.005 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.005) = $5.01
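The arithmetic above is easy to encode as a quick estimator (a convenience sketch, not an official Apify calculator):

```python
ACTOR_START_USD = 0.01  # flat fee per run
PER_ITEM_USD = 0.005    # per dataset item


def estimate_cost(items: int, runs: int = 1) -> float:
    """Estimated pay-per-event cost in USD for a given item count."""
    return round(runs * ACTOR_START_USD + items * PER_ITEM_USD, 2)
```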
No subscription required; you only pay for what you use.