📄 Website Content Extractor
Strip noise from general website pages to extract clean markdown and structured text. Perfect for building LLM datasets from docs, pricing, and product pages.
Pricing
Pay per event
- Rating: 0.0 (0 reviews)
- Developer: 太郎 山田
- Actor stats: 1 bookmark · 17 total users · 7 monthly active users
- Last modified: 2 hours ago
Extract clean, structured text and pristine markdown from docs, product pages, policy pages, help centers, and other public website pages without the heavy overhead of a headless browser. Website Content Extractor is the flagship content-cleaning actor in this cluster: start here when a buyer already has page URLs and needs canonical dataset rows ready for LLM, RAG, search, review, or content operations workflows.
The actor strips away navigation menus, footers, ads, and boilerplate code so buyers can validate clean page copy on the first run. Use it for recurring docs watches, product and FAQ knowledge-base ingestion, policy review prep, competitive page monitoring, or webhook handoff into content operations.
Because it bypasses the browser, it can process large URL batches quickly on public server-rendered pages. When the URL is a real article, blog post, newsroom item, or press release, route that URL to Article Content Extractor as the article-specialized feeder; keep docs, product, policy, help, and broad website pages here.
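The article-vs-broad-page routing above can be sketched in a few lines. This is an illustrative pre-filter you would run on your own URL list before starting the actor; the path keywords are assumptions for the example, not logic the actor itself applies:

```python
from urllib.parse import urlparse

# Hypothetical path hints that suggest an article-style page.
# Adjust these to match the URL conventions of the sites you target.
ARTICLE_HINTS = ("/blog/", "/news/", "/newsroom/", "/press/", "/article")

def route_url(url: str) -> str:
    """Return which extractor lane a URL belongs to."""
    path = urlparse(url).path.lower()
    if any(hint in path for hint in ARTICLE_HINTS):
        return "article-content-extractor"
    return "website-content-extractor"

urls = [
    "https://example.com/docs/getting-started",
    "https://example.com/blog/launch-announcement",
    "https://example.com/pricing",
]
lanes = {u: route_url(u) for u in urls}
```

Docs and pricing URLs stay in this actor's lane; the blog URL is handed to Article Content Extractor.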
Store Quickstart
- Start here for broad website pages. Use `store-input.example.json` or Quickstart — Clean 3 Pages for the cheapest reliable proof run.
- Then use the buyer upgrade ladder from `store-input.templates.json`:
  - Quickstart — Clean 3 Pages for first proof
  - Recurring Docs Watch for scheduled monitoring
  - Webhook → Content Ops Handoff for routed downstream delivery
- Route article/blog/news URLs to Article Content Extractor instead of forcing them through a general page workflow.
- Side presets stay available for job-specific lanes: Competitor Page Extract and Policy / Terms Diff Prep.
- Buyer-facing proof assets live in `sample-output.example.json` and `live-proof.example.json`.
Which actor should I use?
| Surface | Best for |
|---|---|
| Website Content Extractor | Flagship/default cleaner for docs, product, pricing, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blog posts, newsroom URLs, and pages where byline/date metadata matters |
| Google News Scraper | Upstream discovery when the buyer needs fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery when the buyer has known feeds and needs article URLs before cleanup |
Key Features
- 📄 Generic page cleanup — Removes common boilerplate from standard HTML pages
- 🧭 Flagship routing — Default starting point for broad website content in the content cluster
- 📊 Buyer-trust signals — Returns `contentQualityScore`, `mainElementHint`, and `truncatedOrThinContent`
- 📝 Flexible output — Export text, markdown, or sanitized HTML
- 🔀 Cross-sell fit — Sends true article URLs to Article Content Extractor instead of diluting page-cleanup proof
- ⚡ HTTP-only — Fast first runs on public server-rendered pages
Use Cases
| Who | Why |
|---|---|
| AI / RAG teams | Clean docs and help-center pages before indexing |
| RevOps / enablement | Capture product, pricing, and FAQ pages for internal search |
| Compliance teams | Normalize policy and legal pages before diffing |
| Competitive intelligence | Clean product pages before structured analysis |
| Content operations | Send cleaned page rows into review queues or webhook handoffs |
Buyer Workflows and Upgrade Routing
| Buyer workflow | Start here | Route next |
|---|---|---|
| Clean a known list of docs, help, product, pricing, or policy URLs | Quickstart — Clean 3 Pages | Scale to Recurring Docs Watch when the same pages need monitoring |
| Build an LLM/RAG corpus from broad website pages | Website Content Extractor | Keep markdown output and review contentQualityScore before indexing |
| Hand cleaned pages to another system | Webhook → Content Ops Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| Mixed list contains blog or newsroom URLs | Split the list first | Send article URLs to Article Content Extractor and keep broad pages here |
| Buyer does not have URLs yet | Add Google News Scraper or RSS Feed Aggregator only for discovery | Route discovered article URLs to Article Content Extractor; route general pages back here |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | Public broad website page URLs (max 200); route article/news/blog URLs to Article Content Extractor |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include title/description/author/date/language when available |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-page timeout |
| delivery | string | dataset | dataset writes canonical dataset rows. webhook writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| webhookUrl | string | — | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation; disables dataset writes and webhook delivery |
Input Example
{"urls": ["https://docs.apify.com/platform/actors","https://docs.apify.com/platform/storage/dataset","https://docs.apify.com/platform/storage/key-value-store"],"outputFormat": "markdown","includeMetadata": true,"concurrency": 3,"delivery": "dataset","dryRun": false}
Delivery and PPE output
Non-dry-run runs always write canonical dataset rows first. This is true for both delivery=dataset and delivery=webhook.
When delivery=webhook, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If dataset/PPE output fails, no webhook request is sent.
dryRun=true writes only local output/result.json and disables both dataset writes and webhook delivery. Docker and local runtime require Node.js 20+; the actor Dockerfile uses node:20-slim.
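The delivery ordering described above (dataset rows first, webhook only after the dataset write succeeds, dryRun disabling both) can be expressed as a small control-flow sketch. The function names and callbacks here are hypothetical stand-ins for the actor's internals:

```python
def deliver(rows, write_dataset, send_webhook, delivery="dataset", dry_run=False):
    """Sketch of the documented delivery order:
    - dataset rows are canonical and are written first
    - the webhook fires only after the dataset write succeeds
    - dryRun disables both dataset writes and webhook delivery
    """
    if dry_run:
        return {"dataset": False, "webhook": False}
    write_dataset(rows)  # raises on failure, so no webhook is ever sent
    if delivery == "webhook":
        send_webhook(rows)
        return {"dataset": True, "webhook": True}
    return {"dataset": True, "webhook": False}
```

If `write_dataset` raises, the exception propagates before `send_webhook` is reached, which mirrors the rule that no webhook request is sent when dataset/PPE output fails.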
Output
| Field | Type | Description |
|---|---|---|
| url | string | Source page URL |
| title | string | Extracted page title |
| content | string | Main content in the selected format |
| wordCount | integer | Word count of the cleaned content |
| contentLength | integer | Character length of the cleaned content |
| extractionMode | string | Which main-content strategy won (semantic-main, article-like, role-main, body-fallback) |
| mainElementHint | string | Main HTML container that was used |
| contentQualityScore | integer | Heuristic confidence score from 0-100 |
| truncatedOrThinContent | boolean | True when the page looks suspiciously short |
| author | string | Author when metadata exists |
| publishedDate | string | Publish date when metadata exists |
| language | string | HTML language hint |
| status | string | Result billing status: success, partial, empty, or error_no_result |
| chargedEvent | string or null | `apify-default-dataset-item` for charged rows; `null` for no-charge rows |
Output Example
{"url": "https://docs.apify.com/platform/actors","title": "Actors overview","extractionMode": "semantic-main","mainElementHint": "main","contentQualityScore": 88,"truncatedOrThinContent": false,"wordCount": 1642,"contentLength": 10384,"content": "# Actors overview\n\nActors are serverless programs...","language": "en","status": "success","chargedEvent": "apify-default-dataset-item","checkedAt": "2026-04-20T17:30:00.000Z"}
First-run buyer experience
- Run `store-input.example.json` or the Quickstart — Clean 3 Pages template on broad website pages.
- Open the default dataset for charged rows or local `output/result.json` for the full attempted row set, then compare it with `sample-output.example.json`.
- Check `contentQualityScore`, `mainElementHint`, and `truncatedOrThinContent` before scaling.
- Move successful first runs to Recurring Docs Watch when the buyer needs monitoring.
- Move handoff workflows to Webhook → Content Ops Handoff only after the dataset/PPE output shape is accepted.
- If a URL is actually a blog/news/article page, route it to Article Content Extractor.
Tips & Limitations
- Best on standard server-rendered HTML pages.
- Use `markdown` for the clearest first-run proof and easiest reuse in LLM/RAG workflows.
- This actor is not a full crawler and does not render JS-heavy SPAs.
- HTTP errors are returned as error rows so bad demo URLs do not masquerade as valid content.
FAQ
How is this different from Article Content Extractor?
Use this actor as the flagship cleaner for broad website pages like docs, pricing, help, policy, and product pages. Use Article Content Extractor only when the URL is an article/blog/newsroom page and article-specific metadata or article confidence matters.
Can I use this after Google News or RSS discovery?
Yes — but only when the discovered URL is a general website page. News/blog/article URLs should route to Article Content Extractor.
Does it work on JavaScript-heavy sites?
No browser is used. If the page renders most content client-side, switch to a browser-based actor.
Related Actors
Start with Website Content Extractor when the buyer needs cleaned broad-page copy first. Cross-sell the next actor only when routing or enrichment changes the job:
- 📰 Article Content Extractor — Article-specialized feeder for newsroom, blog, and press URLs discovered inside a broad page list.
- 📰 Google News Scraper and 📡 RSS Feed Aggregator — Upstream discovery when the buyer does not already have URLs; route article URLs to Article Content Extractor and broad pages back here.
- Shopify Store Intelligence API — Upgrade when the target is a Shopify storefront and the buyer needs products, collections, vendors, and merch rollups instead of page text alone.
- 📧 Contact Details Extractor — Add after page cleanup when public emails, phones, or social handles are needed from contact/about/support pages.
- Domain Security Audit API — Add when the cleaned pages belong to owned domains that also need SSL, DMARC, expiry, or security-header trust checks.
Cost
Pay Per Event:
- Actor start pricing: check the Apify Store Pricing tab for the current live rate.
- Chargeable dataset rows: useful full and partial page results are pushed to the Apify default dataset and carry `chargedEvent: "apify-default-dataset-item"`.
- No-charge statuses: `empty` and `error_no_result` rows stay in local `output/result.json` and webhook payloads with `chargedEvent: null`; they are not pushed to the Apify default dataset and are not charged.
- Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
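For cost auditing, the full attempted row set (from local output or a webhook payload) can be split into charged and no-charge rows using the `chargedEvent` field described above. A minimal sketch:

```python
def billing_split(rows):
    """Split an attempted row set into charged vs. no-charge rows,
    using the chargedEvent convention from the Cost section."""
    charged = [r for r in rows if r.get("chargedEvent") == "apify-default-dataset-item"]
    free = [r for r in rows if r.get("chargedEvent") is None]
    return charged, free
```

Comparing `len(charged)` against the run's billed event count is a quick sanity check that only useful rows were charged.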
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store.