📄 Website Content Extractor

Strip noise from general website pages to extract clean markdown and structured text. Perfect for building LLM datasets from docs, pricing, and product pages.

Pricing: Pay per event

Developer: 太郎 山田 (Maintained by Community)

Actor stats: 1 bookmark · 14 total users · 5 monthly active users · last modified 4 days ago

Extract clean, structured text and pristine markdown from arbitrary website pages without the heavy overhead of a headless browser. The Website Content Extractor strips away navigation menus, footers, ads, and boilerplate code to deliver the core readable content you actually need. Designed specifically for AI developers, content teams, and data scientists, this scraper turns noisy web URLs into high-quality datasets ready for Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) pipelines, and vector databases.

Whether you need to scrape competitor pricing pages, download technical docs, or extract policy updates, this tool handles the baseline cleanup automatically. Use it to run a recurring docs watch, scrape product details for market analysis, or feed a webhook directly into your content operations handoff.

Because it bypasses the browser, you can extract data from hundreds of websites in seconds. You provide the URLs, and the scraper returns clean markdown, plain text, page titles, descriptions, and metadata. By isolating the main body from general website pages, you get accurate results without writing complex, site-specific CSS selectors. Schedule recurring runs to track competitor changes or build massive text corpora from help centers and product catalogs.
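
The main-body isolation described above can be sketched with nothing but Python's standard library. This is an illustrative heuristic, not the actor's actual implementation; the tag list is an assumption about what counts as boilerplate:

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as boilerplate in this sketch.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Illustrative boilerplate stripper: keeps text outside NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # >0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A production extractor weighs many more signals (semantic `<main>`, text density, link ratio), but the principle is the same: drop known-noise subtrees, keep the rest.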

Store Quickstart

  • Start with store-input.example.json or Quickstart — Clean 3 Pages for the cheapest reliable first run.
  • Then use the upgrade ladder from store-input.templates.json:
    1. Quickstart — Clean 3 Pages
    2. Recurring Docs Watch
    3. Webhook → Content Ops Handoff
  • Side presets stay available for job-specific lanes: Competitor Page Extract and Policy / Terms Diff Prep.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Which actor should I use?

| Surface | Best for |
| --- | --- |
| Website Content Extractor | Docs, product, pricing, policy, help-center, and general website pages |
| Article Content Extractor | News stories, blog posts, newsroom URLs, and article pages with byline/date metadata |
| Google News Scraper | Discover article URLs from Google News before cleanup |
| RSS Feed Aggregator | Discover article URLs from known feeds before cleanup |

Key Features

  • 📄 Generic page cleanup — Removes common boilerplate from standard HTML pages
  • 🧭 Role clarity — Designed for broad pages, not premium article extraction
  • 📊 Buyer-trust signals — Returns contentQualityScore, mainElementHint, and truncatedOrThinContent
  • 📝 Flexible output — Export text, markdown, or sanitized HTML
  • HTTP-only — Fast first runs on public server-rendered pages
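
The buyer-trust signals could be approximated like this. The thresholds and scaling below are invented for illustration; the actor's real scoring formula is not documented here:

```python
def quality_signals(word_count: int, min_words: int = 120) -> dict:
    """Illustrative heuristic: scale word count into a 0-100 confidence
    score (saturating at 5x the minimum) and flag suspiciously short pages."""
    score = min(100, round(100 * word_count / (min_words * 5)))
    return {
        "contentQualityScore": score,
        "truncatedOrThinContent": word_count < min_words,
    }
```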

Use Cases

| Who | Why |
| --- | --- |
| AI / RAG teams | Clean docs and help-center pages before indexing |
| RevOps / enablement | Capture product, pricing, and FAQ pages for internal search |
| Compliance teams | Normalize policy and legal pages before diffing |
| Competitive intelligence | Clean product pages before structured analysis |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | required | Public HTML page URLs (max 200) |
| outputFormat | string | markdown | text, markdown, or html |
| includeMetadata | boolean | true | Include title/description/author/date/language when available |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-page timeout in milliseconds |
| delivery | string | dataset | dataset or webhook |
| webhookUrl | string | (none) | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation |
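
The documented defaults and constraints can be sketched as a small validator. This is a client-side sketch for illustration only; the actor validates its own input server-side:

```python
# Defaults as documented in the input table above.
DEFAULTS = {
    "outputFormat": "markdown",
    "includeMetadata": True,
    "concurrency": 5,
    "timeoutMs": 15000,
    "delivery": "dataset",
    "dryRun": False,
}

def normalize_input(raw: dict) -> dict:
    """Apply documented defaults and enforce documented constraints."""
    urls = raw.get("urls")
    if not urls or not isinstance(urls, list):
        raise ValueError("urls is required")
    if len(urls) > 200:
        raise ValueError("max 200 URLs per run")
    merged = {**DEFAULTS, **raw}
    if merged["outputFormat"] not in {"text", "markdown", "html"}:
        raise ValueError("outputFormat must be text, markdown, or html")
    if merged["delivery"] == "webhook" and not merged.get("webhookUrl"):
        raise ValueError("webhookUrl is required when delivery=webhook")
    return merged
```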

Input Example

```json
{
  "urls": [
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/platform/storage/dataset",
    "https://docs.apify.com/platform/storage/key-value-store"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true,
  "concurrency": 3
}
```

Output

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source page URL |
| title | string | Extracted page title |
| content | string | Main content in the selected format |
| wordCount | integer | Word count of the cleaned content |
| contentLength | integer | Character length of the cleaned content |
| extractionMode | string | Which main-content strategy won (semantic-main, article-like, role-main, body-fallback) |
| mainElementHint | string | Main HTML container that was used |
| contentQualityScore | integer | Heuristic confidence score from 0-100 |
| truncatedOrThinContent | boolean | True when the page looks suspiciously short |
| author | string | Author when metadata exists |
| publishedDate | string | Publish date when metadata exists |
| language | string | HTML language hint |
| checkedAt | string | ISO 8601 timestamp of the extraction |
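
Before indexing for RAG, you can gate dataset rows on the two trust signals above. The threshold of 70 is an arbitrary example; tune it to your corpus:

```python
def rag_ready(items: list[dict], min_score: int = 70) -> list[dict]:
    """Keep only rows confident enough to index (illustrative threshold)."""
    return [
        item for item in items
        if item.get("contentQualityScore", 0) >= min_score
        and not item.get("truncatedOrThinContent", False)
    ]
```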

Output Example

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors overview",
  "extractionMode": "semantic-main",
  "mainElementHint": "main",
  "contentQualityScore": 88,
  "truncatedOrThinContent": false,
  "wordCount": 1642,
  "contentLength": 10384,
  "content": "# Actors overview\n\nActors are serverless programs...",
  "language": "en",
  "checkedAt": "2026-04-20T17:30:00.000Z"
}
```

First-run buyer experience

  1. Run store-input.example.json or the Quickstart — Clean 3 Pages template.
  2. Open the dataset or local output/result.json, then compare it with sample-output.example.json.
  3. Check contentQualityScore and truncatedOrThinContent before scaling.
  4. Move successful first runs to Recurring Docs Watch or Webhook → Content Ops Handoff.
  5. If a URL is actually a blog/news post, move it to Article Content Extractor.

Tips & Limitations

  • Best on standard server-rendered HTML pages.
  • Use markdown for the clearest first-run proof and easiest reuse in LLM/RAG workflows.
  • This actor is not a full crawler and does not render JS-heavy SPAs.
  • HTTP errors are returned as error rows so bad demo URLs do not masquerade as valid content.
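
Since failed URLs surface as error rows rather than being dropped, a downstream consumer can split them out explicitly. The `error` field name is an assumption for this sketch; check the shape of your actual dataset rows:

```python
def split_rows(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate clean content rows from error rows (rows carrying an
    'error' field in this sketch)."""
    ok, errors = [], []
    for item in items:
        (errors if item.get("error") else ok).append(item)
    return ok, errors
```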

FAQ

How is this different from Article Content Extractor?

Use this actor for broad pages like docs, pricing, help, policy, and product pages. Use Article Content Extractor when article-specific metadata and article confidence matter.

Can I use this after Google News or RSS discovery?

Yes — but only when the discovered URL is a general page. News/blog URLs should usually go to Article Content Extractor.

Does it work on JavaScript-heavy sites?

No browser is used. If the page renders most content client-side, switch to a browser-based actor.
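
One way to spot such pages before switching tools is a text-to-HTML ratio check: a near-empty body plus heavy scripts usually means client-side rendering. The 5% threshold is an invented heuristic, not a rule the actor applies:

```python
import re

def looks_client_rendered(html: str, min_ratio: float = 0.05) -> bool:
    """Rough check: if visible text is a tiny fraction of the raw HTML,
    the page probably renders its content client-side."""
    if not html:
        return True
    # Drop script/style bodies, then strip remaining tags.
    stripped = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / len(html) < min_ratio
```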

Related actors

Start here when the buyer needs cleaned page copy first. Add the next actor only when the job changes:

  • 📰 Article Content Extractor — Switch to this when the URL is a newsroom or blog article and byline / publish-date confidence matters.
  • 📰 Google News Scraper and 📡 RSS Feed Aggregator — Add upstream discovery when you do not already have URLs; send general pages back here and article pages to Article Content Extractor.
  • Shopify Store Intelligence API — Use this instead when the site is a Shopify storefront and you need products, collections, vendors, and merch rollups instead of cleaned page text alone.
  • 📧 Contact Details Extractor — Add after page cleanup when you want public emails, phones, or social handles from contact, about, or support pages on the same domain.
  • Domain Security Audit API — Add when the cleaned pages belong to owned domains you also need to audit for SSL, DMARC, expiry, or security-header trust.
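
When wiring discovery into this split, a naive URL router can pre-sort pages between this actor and Article Content Extractor. The URL patterns below are an invented heuristic, not a product rule; article-likeness ultimately depends on the page itself:

```python
import re

# URL shapes that usually indicate an article: blog/news path segments
# or a /YYYY/MM/ date component (assumed patterns for illustration).
ARTICLE_HINTS = re.compile(r"/(blog|news|press|article|posts?)/|/\d{4}/\d{2}/", re.I)

def route_url(url: str) -> str:
    """Route article-looking URLs to Article Content Extractor,
    everything else here."""
    if ARTICLE_HINTS.search(url):
        return "article-content-extractor"
    return "website-content-extractor"
```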

Cost

Pay Per Event:

  • actor-start: $0.01
  • dataset-item: $0.005 per output item
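
With those two events, a run's cost is easy to estimate up front, assuming one dataset item per input URL (error rows may change the item count):

```python
ACTOR_START_USD = 0.01     # charged once per run
DATASET_ITEM_USD = 0.005   # charged per output item

def estimate_run_cost(url_count: int) -> float:
    """Estimated pay-per-event cost, assuming one item per input URL."""
    return round(ACTOR_START_USD + url_count * DATASET_ITEM_USD, 4)
```

For example, a 3-page quickstart run costs about $0.025, and a maxed-out 200-URL run about $1.01.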

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store.