📄 Website Content Extractor

Pricing: Pay per event

Strip noise from general website pages to extract clean markdown and structured text. Perfect for building LLM datasets from docs, pricing, and product pages.

Developer: 太郎 山田

Maintained by Community


Extract clean, structured text and pristine markdown from docs, product pages, policy pages, help centers, and other public website pages without the heavy overhead of a headless browser. Website Content Extractor is the flagship content-cleaning actor in this cluster: start here when a buyer already has page URLs and needs canonical dataset rows ready for LLM, RAG, search, review, or content operations workflows.

The actor strips away navigation menus, footers, ads, and other boilerplate markup so buyers can validate clean page copy on the first run. Use it for recurring docs watches, product and FAQ knowledge-base ingestion, policy review prep, competitive page monitoring, or webhook handoff into content operations.

Because it bypasses the browser, it can process large URL batches quickly on public server-rendered pages. When the URL is a real article, blog post, newsroom item, or press release, route that URL to Article Content Extractor as the article-specialized feeder; keep docs, product, policy, help, and broad website pages here.
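
The routing rule above can be sketched as a pre-run splitting step. This is a hypothetical heuristic for preparing URL lists before calling either actor, not the actor's internal logic; the path patterns are assumptions:

```javascript
// Hypothetical routing heuristic -- NOT the actor's internal logic.
// Sends likely article/blog/news URLs to Article Content Extractor
// and keeps docs, product, policy, and other broad pages here.
const ARTICLE_PATH_HINTS = /\/(blog|news|newsroom|press(-releases?)?|articles?|posts?)(\/|$)/i;
const DATED_PATH = /\/\d{4}\/\d{1,2}(\/\d{1,2})?\//; // e.g. /2026/04/20/some-slug

function routeUrl(url) {
  const { pathname } = new URL(url);
  if (ARTICLE_PATH_HINTS.test(pathname) || DATED_PATH.test(pathname)) {
    return "article-content-extractor";
  }
  return "website-content-extractor";
}

// Split a mixed URL list before running either actor.
function splitUrls(urls) {
  const articles = [];
  const pages = [];
  for (const url of urls) {
    (routeUrl(url) === "article-content-extractor" ? articles : pages).push(url);
  }
  return { articles, pages };
}
```

Real sites vary, so treat any such pattern list as a starting point and spot-check the split on your own URLs.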

Store Quickstart

  • Start here for broad website pages. Use store-input.example.json or Quickstart — Clean 3 Pages for the cheapest reliable proof run.
  • Then use the buyer upgrade ladder from store-input.templates.json:
    1. Quickstart — Clean 3 Pages for first proof
    2. Recurring Docs Watch for scheduled monitoring
    3. Webhook → Content Ops Handoff for routed downstream delivery
  • Route article/blog/news URLs to Article Content Extractor instead of forcing them through a general page workflow.
  • Side presets stay available for job-specific lanes: Competitor Page Extract and Policy / Terms Diff Prep.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Which actor should I use?

| Surface | Best for |
| --- | --- |
| Website Content Extractor | Flagship/default cleaner for docs, product, pricing, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blog posts, newsroom URLs, and pages where byline/date metadata matters |
| Google News Scraper | Upstream discovery when the buyer needs fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery when the buyer has known feeds and needs article URLs before cleanup |

Key Features

  • 📄 Generic page cleanup — Removes common boilerplate from standard HTML pages
  • 🧭 Flagship routing — Default starting point for broad website content in the content cluster
  • 📊 Buyer-trust signals — Returns contentQualityScore, mainElementHint, and truncatedOrThinContent
  • 📝 Flexible output — Export text, markdown, or sanitized HTML
  • 🔀 Cross-sell fit — Sends true article URLs to Article Content Extractor instead of diluting page-cleanup proof
  • ⚡ HTTP-only — Fast first runs on public server-rendered pages

Use Cases

| Who | Why |
| --- | --- |
| AI / RAG teams | Clean docs and help-center pages before indexing |
| RevOps / enablement | Capture product, pricing, and FAQ pages for internal search |
| Compliance teams | Normalize policy and legal pages before diffing |
| Competitive intelligence | Clean product pages before structured analysis |
| Content operations | Send cleaned page rows into review queues or webhook handoffs |

Buyer Workflows and Upgrade Routing

| Buyer workflow | Start here | Route next |
| --- | --- | --- |
| Clean a known list of docs, help, product, pricing, or policy URLs | Quickstart — Clean 3 Pages | Scale to Recurring Docs Watch when the same pages need monitoring |
| Build an LLM/RAG corpus from broad website pages | Website Content Extractor | Keep markdown output and review contentQualityScore before indexing |
| Hand cleaned pages to another system | Webhook → Content Ops Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| Mixed list contains blog or newsroom URLs | Split the list first | Send article URLs to Article Content Extractor and keep broad pages here |
| Buyer does not have URLs yet | Add Google News Scraper or RSS Feed Aggregator only for discovery | Route discovered article URLs to Article Content Extractor; route general pages back here |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | required | Public broad website page URLs (max 200); route article/news/blog URLs to Article Content Extractor |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include title/description/author/date/language when available |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-page timeout |
| delivery | string | dataset | dataset writes canonical dataset rows; webhook writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| webhookUrl | string | — | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation; disables dataset writes and webhook delivery |
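
A hedged sketch of how a caller might validate input against this table before a run. The field names, defaults, and the 200-URL limit come from the table above, but the normalizeInput helper itself is illustrative, not part of the actor:

```javascript
// Illustrative input normalization -- field names and defaults mirror the
// README's input table; the helper itself is an assumption, not actor code.
const DEFAULTS = {
  outputFormat: "markdown",
  includeMetadata: true,
  concurrency: 5,
  timeoutMs: 15000,
  delivery: "dataset",
  dryRun: false,
};

function normalizeInput(raw) {
  if (!Array.isArray(raw.urls) || raw.urls.length === 0) {
    throw new Error("urls is required and must be a non-empty array");
  }
  if (raw.urls.length > 200) {
    throw new Error("urls accepts at most 200 entries");
  }
  const input = { ...DEFAULTS, ...raw };
  if (!["text", "markdown", "html"].includes(input.outputFormat)) {
    throw new Error(`unknown outputFormat: ${input.outputFormat}`);
  }
  if (input.delivery === "webhook" && !input.webhookUrl) {
    throw new Error("webhookUrl is required when delivery=webhook");
  }
  return input;
}
```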

Input Example

```json
{
  "urls": [
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/platform/storage/dataset",
    "https://docs.apify.com/platform/storage/key-value-store"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true,
  "concurrency": 3,
  "delivery": "dataset",
  "dryRun": false
}
```

Delivery and PPE output

Non-dry-run runs always write canonical dataset rows first. This is true for both delivery=dataset and delivery=webhook.

When delivery=webhook, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If dataset/PPE output fails, no webhook request is sent.

dryRun=true writes only local output/result.json and disables both dataset writes and webhook delivery.

Docker and local runtime require Node.js 20+; the actor Dockerfile uses node:20-slim.
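
The ordering rules above can be sketched as follows. The writeLocal, writeDataset, and sendWebhook functions are injected stand-ins for illustration, not the actor's real implementation:

```javascript
// Sketch of the delivery ordering described above. The io object is an
// injected stand-in -- writeLocal/writeDataset/sendWebhook are assumptions.
async function deliver(rows, input, io) {
  // Local output/result.json always records the full attempted row set.
  await io.writeLocal(rows);

  // dryRun: local output only; no dataset write, no webhook.
  if (input.dryRun) return { dataset: false, webhook: false };

  // Canonical dataset/PPE output always comes first (charged rows only).
  await io.writeDataset(rows.filter((r) => r.chargedEvent !== null));

  // Webhook is a downstream handoff: sent only after the dataset write succeeds.
  if (input.delivery === "webhook") {
    await io.sendWebhook(input.webhookUrl, rows);
    return { dataset: true, webhook: true };
  }
  return { dataset: true, webhook: false };
}
```

Because the dataset write precedes the webhook call, a dataset/PPE failure aborts the run before any webhook request is sent, matching the guarantee above.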

Output

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source page URL |
| title | string | Extracted page title |
| content | string | Main content in the selected format |
| wordCount | integer | Word count of the cleaned content |
| contentLength | integer | Character length of the cleaned content |
| extractionMode | string | Which main-content strategy won (semantic-main, article-like, role-main, body-fallback) |
| mainElementHint | string | Main HTML container that was used |
| contentQualityScore | integer | Heuristic confidence score from 0–100 |
| truncatedOrThinContent | boolean | True when the page looks suspiciously short |
| author | string | Author when metadata exists |
| publishedDate | string | Publish date when metadata exists |
| language | string | HTML language hint |
| status | string | Result billing status: success, partial, empty, or error_no_result |
| chargedEvent | string or null | apify-default-dataset-item for charged dataset rows; null for uncharged rows |
| checkedAt | string | ISO 8601 timestamp of the extraction |
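
Downstream consumers can gate rows on the buyer-trust signals above before indexing. A minimal sketch; the default quality threshold of 60 is an assumption for illustration, not an actor default:

```javascript
// Hedged consumer-side filter over the output fields documented above.
// The minQuality threshold (60) is an assumption, not an actor default.
function keepForIndexing(row, minQuality = 60) {
  // Only useful full/partial results are candidates.
  if (row.status !== "success" && row.status !== "partial") return false;
  // Drop pages the actor flagged as suspiciously short.
  if (row.truncatedOrThinContent) return false;
  return row.contentQualityScore >= minQuality;
}
```

Tune the threshold per corpus: docs sites with dense navigation may extract with lower scores than simple product pages.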

Output Example

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors overview",
  "extractionMode": "semantic-main",
  "mainElementHint": "main",
  "contentQualityScore": 88,
  "truncatedOrThinContent": false,
  "wordCount": 1642,
  "contentLength": 10384,
  "content": "# Actors overview\n\nActors are serverless programs...",
  "language": "en",
  "status": "success",
  "chargedEvent": "apify-default-dataset-item",
  "checkedAt": "2026-04-20T17:30:00.000Z"
}
```

First-run buyer experience

  1. Run store-input.example.json or the Quickstart — Clean 3 Pages template on broad website pages.
  2. Open the default dataset for charged rows or local output/result.json for the full attempted row set, then compare it with sample-output.example.json.
  3. Check contentQualityScore, mainElementHint, and truncatedOrThinContent before scaling.
  4. Move successful first runs to Recurring Docs Watch when the buyer needs monitoring.
  5. Move handoff workflows to Webhook → Content Ops Handoff only after the dataset/PPE output shape is accepted.
  6. If a URL is actually a blog/news/article page, route it to Article Content Extractor.

Tips & Limitations

  • Best on standard server-rendered HTML pages.
  • Use markdown for the clearest first-run proof and easiest reuse in LLM/RAG workflows.
  • This actor is not a full crawler and does not render JS-heavy SPAs.
  • HTTP errors are returned as error rows so bad demo URLs do not masquerade as valid content.

FAQ

How is this different from Article Content Extractor?

Use this actor as the flagship cleaner for broad website pages like docs, pricing, help, policy, and product pages. Use Article Content Extractor only when the URL is an article/blog/newsroom page and article-specific metadata or article confidence matters.

Can I use this after Google News or RSS discovery?

Yes — but only when the discovered URL is a general website page. News/blog/article URLs should route to Article Content Extractor.

Does it work on JavaScript-heavy sites?

No browser is used. If the page renders most content client-side, switch to a browser-based actor.

Start with Website Content Extractor when the buyer needs cleaned broad-page copy first. Cross-sell a neighboring actor only when routing or enrichment changes the job: Article Content Extractor for article/blog/newsroom URLs, and Google News Scraper or RSS Feed Aggregator for upstream URL discovery.

Cost

Pay Per Event:

  • Actor start pricing: check the Apify Store Pricing tab for the current live rate.
  • Chargeable dataset rows: useful full and partial page results are pushed to the Apify default dataset and carry chargedEvent: "apify-default-dataset-item".
  • No-charge statuses: empty and error_no_result rows stay in local output/result.json and webhook payloads with chargedEvent: null; they are not pushed to the Apify default dataset and are not charged.
  • Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
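
The billing split above maps directly from the row's status to its chargedEvent. A one-function sketch, with the event name and statuses taken from this README:

```javascript
// Billing-split sketch: status -> chargedEvent, per the cost bullets above.
// Event name and statuses come from the README; the helper is illustrative.
function chargedEventFor(status) {
  return status === "success" || status === "partial"
    ? "apify-default-dataset-item" // pushed to the default dataset and charged
    : null; // "empty" and "error_no_result" rows are never charged
}
```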

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store.