📄 Website Content Extractor

Pricing: Pay per event

Strip noise from general website pages to extract clean markdown and structured text. Perfect for building LLM datasets from docs, pricing, and product pages.

Developer: 太郎 山田

Maintained by Community


Extract clean, structured text and pristine markdown from docs, product pages, policy pages, help centers, and other public website pages without the heavy overhead of a headless browser. Website Content Extractor is the flagship content-cleaning actor in this cluster: start here when a buyer already has page URLs and needs canonical dataset rows ready for LLM, RAG, search, review, or content operations workflows.

The actor strips away navigation menus, footers, ads, and other boilerplate markup so buyers can validate clean page copy on the first run. Use it for recurring docs watches, product and FAQ knowledge-base ingestion, policy review prep, competitive page monitoring, or webhook handoff into content operations.

Because it bypasses the browser, it can process large URL batches quickly on public server-rendered pages. When the URL is a real article, blog post, newsroom item, or press release, route that URL to Article Content Extractor as the article-specialized feeder; keep docs, product, policy, help, and broad website pages here.
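
The routing rule above can be sketched as a pre-run splitting step. This is a hypothetical heuristic for preparing URL lists before calling either actor, not the actor's internal logic; the path patterns are assumptions:

```javascript
// Hypothetical routing heuristic -- NOT the actor's internal logic.
// Sends likely article/blog/news URLs to Article Content Extractor
// and keeps docs, product, policy, and other broad pages here.
const ARTICLE_PATH_HINTS = /\/(blog|news|newsroom|press(-releases?)?|articles?|posts?)(\/|$)/i;
const DATED_PATH = /\/\d{4}\/\d{1,2}(\/\d{1,2})?\//; // e.g. /2026/04/20/some-slug

function routeUrl(url) {
  const { pathname } = new URL(url);
  if (ARTICLE_PATH_HINTS.test(pathname) || DATED_PATH.test(pathname)) {
    return "article-content-extractor";
  }
  return "website-content-extractor";
}

// Split a mixed URL list before running either actor.
function splitUrls(urls) {
  const articles = [];
  const pages = [];
  for (const url of urls) {
    (routeUrl(url) === "article-content-extractor" ? articles : pages).push(url);
  }
  return { articles, pages };
}
```

Real sites vary, so treat any such pattern list as a starting point and spot-check the split on your own URLs.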

Store Quickstart

  • Start here for broad website pages. Use store-input.example.json or Quickstart — Clean 3 Pages for the cheapest reliable proof run.
  • Then use the buyer upgrade ladder from store-input.templates.json:
    1. Quickstart — Clean 3 Pages for first proof
    2. Recurring Docs Watch for scheduled monitoring
    3. Webhook → Content Ops Handoff for routed downstream delivery
  • Route article/blog/news URLs to Article Content Extractor instead of forcing them through a general page workflow.
  • Side presets stay available for job-specific lanes: Competitor Page Extract and Policy / Terms Diff Prep.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Which actor should I use?

| Surface | Best for |
| --- | --- |
| Website Content Extractor | Flagship/default cleaner for docs, product, pricing, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blog posts, newsroom URLs, and pages where byline/date metadata matters |
| Google News Scraper | Upstream discovery when the buyer needs fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery when the buyer has known feeds and needs article URLs before cleanup |

Key Features

  • 📄 Generic page cleanup — Removes common boilerplate from standard HTML pages
  • 🧭 Flagship routing — Default starting point for broad website content in the content cluster
  • 📊 Buyer-trust signals — Returns contentQualityScore, mainElementHint, and truncatedOrThinContent
  • 📝 Flexible output — Export text, markdown, or sanitized HTML
  • 🔀 Cross-sell fit — Sends true article URLs to Article Content Extractor instead of diluting page-cleanup proof
  • ⚡ HTTP-only — Fast first runs on public server-rendered pages

Use Cases

| Who | Why |
| --- | --- |
| AI / RAG teams | Clean docs and help-center pages before indexing |
| RevOps / enablement | Capture product, pricing, and FAQ pages for internal search |
| Compliance teams | Normalize policy and legal pages before diffing |
| Competitive intelligence | Clean product pages before structured analysis |
| Content operations | Send cleaned page rows into review queues or webhook handoffs |

Buyer Workflows and Upgrade Routing

| Buyer workflow | Start here | Route next |
| --- | --- | --- |
| Clean a known list of docs, help, product, pricing, or policy URLs | Quickstart — Clean 3 Pages | Scale to Recurring Docs Watch when the same pages need monitoring |
| Build an LLM/RAG corpus from broad website pages | Website Content Extractor | Keep markdown output and review contentQualityScore before indexing |
| Hand cleaned pages to another system | Webhook → Content Ops Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| Mixed list contains blog or newsroom URLs | Split the list first | Send article URLs to Article Content Extractor and keep broad pages here |
| Buyer does not have URLs yet | Add Google News Scraper or RSS Feed Aggregator only for discovery | Route discovered article URLs to Article Content Extractor; route general pages back here |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | required | Public broad website page URLs (max 200); route article/news/blog URLs to Article Content Extractor |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include title/description/author/date/language when available |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-page timeout |
| delivery | string | dataset | dataset writes canonical dataset rows; webhook writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| webhookUrl | string | — | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation; disables dataset writes and webhook delivery |
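
A hedged sketch of how a caller might validate input against this table before a run. The field names, defaults, and the 200-URL limit come from the table above, but the normalizeInput helper itself is illustrative, not part of the actor:

```javascript
// Illustrative input normalization -- field names and defaults mirror the
// README's input table; the helper itself is an assumption, not actor code.
const DEFAULTS = {
  outputFormat: "markdown",
  includeMetadata: true,
  concurrency: 5,
  timeoutMs: 15000,
  delivery: "dataset",
  dryRun: false,
};

function normalizeInput(raw) {
  if (!Array.isArray(raw.urls) || raw.urls.length === 0) {
    throw new Error("urls is required and must be a non-empty array");
  }
  if (raw.urls.length > 200) {
    throw new Error("urls accepts at most 200 entries");
  }
  const input = { ...DEFAULTS, ...raw };
  if (!["text", "markdown", "html"].includes(input.outputFormat)) {
    throw new Error(`unknown outputFormat: ${input.outputFormat}`);
  }
  if (input.delivery === "webhook" && !input.webhookUrl) {
    throw new Error("webhookUrl is required when delivery=webhook");
  }
  return input;
}
```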

Input Example

```json
{
  "urls": [
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/platform/storage/dataset",
    "https://docs.apify.com/platform/storage/key-value-store"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true,
  "concurrency": 3,
  "delivery": "dataset",
  "dryRun": false
}
```

Delivery and PPE output

Non-dry-run runs always write canonical dataset rows first. This is true for both delivery=dataset and delivery=webhook.

When delivery=webhook, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If dataset/PPE output fails, no webhook request is sent.

dryRun=true writes only local output/result.json and disables both dataset writes and webhook delivery.

Docker and local runtime require Node.js 20+; the actor Dockerfile uses node:20-slim.
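
The ordering rules above can be sketched as follows. The writeLocal, writeDataset, and sendWebhook functions are injected stand-ins for illustration, not the actor's real implementation:

```javascript
// Sketch of the delivery ordering described above. The io object is an
// injected stand-in -- writeLocal/writeDataset/sendWebhook are assumptions.
async function deliver(rows, input, io) {
  // Local output/result.json always records the full attempted row set.
  await io.writeLocal(rows);

  // dryRun: local output only; no dataset write, no webhook.
  if (input.dryRun) return { dataset: false, webhook: false };

  // Canonical dataset/PPE output always comes first (charged rows only).
  await io.writeDataset(rows.filter((r) => r.chargedEvent !== null));

  // Webhook is a downstream handoff: sent only after the dataset write succeeds.
  if (input.delivery === "webhook") {
    await io.sendWebhook(input.webhookUrl, rows);
    return { dataset: true, webhook: true };
  }
  return { dataset: true, webhook: false };
}
```

Because the dataset write precedes the webhook call, a dataset/PPE failure aborts the run before any webhook request is sent, matching the guarantee above.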

Output

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source page URL |
| title | string | Extracted page title |
| content | string | Main content in the selected format |
| wordCount | integer | Word count of the cleaned content |
| contentLength | integer | Character length of the cleaned content |
| extractionMode | string | Which main-content strategy won (semantic-main, article-like, role-main, body-fallback) |
| mainElementHint | string | Main HTML container that was used |
| contentQualityScore | integer | Heuristic confidence score from 0–100 |
| truncatedOrThinContent | boolean | True when the page looks suspiciously short |
| author | string | Author when metadata exists |
| publishedDate | string | Publish date when metadata exists |
| language | string | HTML language hint |
| status | string | Result billing status: success, partial, empty, or error_no_result |
| chargedEvent | string or null | apify-default-dataset-item for charged dataset rows; null for uncharged rows |
| checkedAt | string | ISO 8601 timestamp of the extraction |
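
Downstream consumers can gate rows on the buyer-trust signals above before indexing. A minimal sketch; the default quality threshold of 60 is an assumption for illustration, not an actor default:

```javascript
// Hedged consumer-side filter over the output fields documented above.
// The minQuality threshold (60) is an assumption, not an actor default.
function keepForIndexing(row, minQuality = 60) {
  // Only useful full/partial results are candidates.
  if (row.status !== "success" && row.status !== "partial") return false;
  // Drop pages the actor flagged as suspiciously short.
  if (row.truncatedOrThinContent) return false;
  return row.contentQualityScore >= minQuality;
}
```

Tune the threshold per corpus: docs sites with dense navigation may extract with lower scores than simple product pages.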

Output Example

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors overview",
  "extractionMode": "semantic-main",
  "mainElementHint": "main",
  "contentQualityScore": 88,
  "truncatedOrThinContent": false,
  "wordCount": 1642,
  "contentLength": 10384,
  "content": "# Actors overview\n\nActors are serverless programs...",
  "language": "en",
  "status": "success",
  "chargedEvent": "apify-default-dataset-item",
  "checkedAt": "2026-04-20T17:30:00.000Z"
}
```

First-run buyer experience

  1. Run store-input.example.json or the Quickstart — Clean 3 Pages template on broad website pages.
  2. Open the default dataset for charged rows or local output/result.json for the full attempted row set, then compare it with sample-output.example.json.
  3. Check contentQualityScore, mainElementHint, and truncatedOrThinContent before scaling.
  4. Move successful first runs to Recurring Docs Watch when the buyer needs monitoring.
  5. Move handoff workflows to Webhook → Content Ops Handoff only after the dataset/PPE output shape is accepted.
  6. If a URL is actually a blog/news/article page, route it to Article Content Extractor.

Tips & Limitations

  • Best on standard server-rendered HTML pages.
  • Use markdown for the clearest first-run proof and easiest reuse in LLM/RAG workflows.
  • This actor is not a full crawler and does not render JS-heavy SPAs.
  • HTTP errors are returned as error rows so bad demo URLs do not masquerade as valid content.

FAQ

How is this different from Article Content Extractor?

Use this actor as the flagship cleaner for broad website pages like docs, pricing, help, policy, and product pages. Use Article Content Extractor only when the URL is an article/blog/newsroom page and article-specific metadata or article confidence matters.

Can I use this after Google News or RSS discovery?

Yes — but only when the discovered URL is a general website page. News/blog/article URLs should route to Article Content Extractor.

Does it work on JavaScript-heavy sites?

No browser is used. If the page renders most content client-side, switch to a browser-based actor.

Start with Website Content Extractor when the buyer needs cleaned broad-page copy first. Cross-sell a neighboring actor only when routing or enrichment changes the job: Article Content Extractor for article/blog/newsroom URLs, and Google News Scraper or RSS Feed Aggregator for upstream URL discovery.

Cost

Pay Per Event:

  • Actor start pricing: check the Apify Store Pricing tab for the current live rate.
  • Chargeable dataset rows: useful full and partial page results are pushed to the Apify default dataset and carry chargedEvent: "apify-default-dataset-item".
  • No-charge statuses: empty and error_no_result rows stay in local output/result.json and webhook payloads with chargedEvent: null; they are not pushed to the Apify default dataset and are not charged.
  • Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
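
The billing split above maps directly from the row's status to its chargedEvent. A one-function sketch, with the event name and statuses taken from this README:

```javascript
// Billing-split sketch: status -> chargedEvent, per the cost bullets above.
// Event name and statuses come from the README; the helper is illustrative.
function chargedEventFor(status) {
  return status === "success" || status === "partial"
    ? "apify-default-dataset-item" // pushed to the default dataset and charged
    : null; // "empty" and "error_no_result" rows are never charged
}
```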

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store.