📄 Website Content Extractor

Strip noise from general website pages to extract clean markdown and structured text. Perfect for building LLM datasets from docs, pricing, and product pages.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: 太郎 山田

Actor stats
- Bookmarked: 1
- Total users: 14
- Monthly active users: 5
- Last modified: 4 days ago
Extract clean, structured text and pristine markdown from arbitrary website pages without the heavy overhead of a headless browser. The Website Content Extractor strips away navigation menus, footers, ads, and boilerplate code to deliver the core readable content you actually need. Designed specifically for AI developers, content teams, and data scientists, this scraper turns noisy web URLs into high-quality datasets ready for Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) pipelines, and vector databases.
Whether you need to scrape competitor pricing pages, download technical docs, or extract policy updates, this tool handles the baseline cleanup automatically. Use it to run a recurring docs watch, scrape product details for market analysis, or feed a webhook directly into your content operations handoff.
Because it bypasses the browser, you can extract data from hundreds of websites in seconds. You provide the URLs, and the scraper returns clean markdown, plain text, page titles, descriptions, and metadata. By isolating the main body from general website pages, you get accurate results without writing complex, site-specific CSS selectors. Schedule recurring runs to track competitor changes or build massive text corpora from help centers and product catalogs.
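The browserless flow described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual implementation; the selector priorities simply mirror the `extractionMode` values the actor reports.

```python
import re

# Order of main-content strategies, mirroring the extractionMode values
# reported in the output (semantic-main, article-like, role-main, body-fallback).
MAIN_PATTERNS = [
    ("semantic-main", r"<main\b[^>]*>(.*?)</main>"),
    ("article-like", r"<article\b[^>]*>(.*?)</article>"),
    ("role-main", r"<[^>]+role=[\"']main[\"'][^>]*>(.*?)</"),
]

def extract_main(html: str) -> tuple[str, str]:
    """Return (extractionMode, raw inner HTML) for the best-matching container."""
    for mode, pattern in MAIN_PATTERNS:
        match = re.search(pattern, html, re.S | re.I)
        if match:
            return mode, match.group(1)
    # Nothing semantic found: fall back to the whole <body>.
    body = re.search(r"<body\b[^>]*>(.*?)</body>", html, re.S | re.I)
    return "body-fallback", body.group(1) if body else html

def strip_tags(fragment: str) -> str:
    """Drop scripts, styles, nav, and footers; collapse remaining markup to plain text."""
    fragment = re.sub(r"<(script|style|nav|footer)\b.*?</\1>", " ", fragment, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", text).strip()
```

The real actor also produces markdown, metadata, and quality signals; the sketch only shows why no headless browser is needed for server-rendered pages.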
Store Quickstart
- Start with `store-input.example.json` or Quickstart — Clean 3 Pages for the cheapest reliable first run.
- Then use the upgrade ladder from `store-input.templates.json`:
  - Quickstart — Clean 3 Pages
  - Recurring Docs Watch
  - Webhook → Content Ops Handoff
- Side presets stay available for job-specific lanes: Competitor Page Extract and Policy / Terms Diff Prep.
- Buyer-facing proof assets live in `sample-output.example.json` and `live-proof.example.json`.
Which actor should I use?
| Surface | Best for |
|---|---|
| Website Content Extractor | Docs, product, pricing, policy, help-center, and general website pages |
| Article Content Extractor | News stories, blog posts, newsroom URLs, and article pages with byline/date metadata |
| Google News Scraper | Discover article URLs from Google News before cleanup |
| RSS Feed Aggregator | Discover article URLs from known feeds before cleanup |
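The routing in the table above could be expressed as a small dispatch helper. The URL heuristics below are illustrative assumptions, not the actors' actual logic; real routing should also consider page metadata.

```python
from urllib.parse import urlparse

# Rough path-based hints for article-style URLs (assumed, not exhaustive).
ARTICLE_HINTS = ("/blog/", "/news/", "/article", "/post/")

def pick_actor(url: str) -> str:
    """Choose a cleanup actor for an already-discovered URL."""
    path = urlparse(url).path.lower()
    if any(hint in path for hint in ARTICLE_HINTS):
        return "Article Content Extractor"
    return "Website Content Extractor"
```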
Key Features
- 📄 Generic page cleanup — Removes common boilerplate from standard HTML pages
- 🧭 Role clarity — Designed for broad pages, not premium article extraction
- 📊 Buyer-trust signals — Returns `contentQualityScore`, `mainElementHint`, and `truncatedOrThinContent`
- 📝 Flexible output — Export text, markdown, or sanitized HTML
- ⚡ HTTP-only — Fast first runs on public server-rendered pages
Use Cases
| Who | Why |
|---|---|
| AI / RAG teams | Clean docs and help-center pages before indexing |
| RevOps / enablement | Capture product, pricing, and FAQ pages for internal search |
| Compliance teams | Normalize policy and legal pages before diffing |
| Competitive intelligence | Clean product pages before structured analysis |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | required | Public HTML page URLs (max 200) |
| `outputFormat` | string | `markdown` | `text`, `markdown`, or `html` |
| `includeMetadata` | boolean | `true` | Include title/description/author/date/language when available |
| `concurrency` | integer | `5` | Number of parallel fetches |
| `timeoutMs` | integer | `15000` | Per-page timeout in milliseconds |
| `delivery` | string | `dataset` | `dataset` or `webhook` |
| `webhookUrl` | string | — | Webhook target when `delivery=webhook` |
| `dryRun` | boolean | `false` | Write only local output, for validation |
Input Example
```json
{
  "urls": [
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/platform/storage/dataset",
    "https://docs.apify.com/platform/storage/key-value-store"
  ],
  "outputFormat": "markdown",
  "includeMetadata": true,
  "concurrency": 3
}
```
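Before launching a large run, the documented constraints can be checked locally. This is a hypothetical helper, not part of the actor; the limits come from the input table above.

```python
ALLOWED_FORMATS = {"text", "markdown", "html"}

def validate_input(run_input: dict) -> list[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    problems = []
    urls = run_input.get("urls")
    if not urls:
        problems.append("urls is required")
    elif len(urls) > 200:
        problems.append("urls exceeds the 200-URL limit")
    fmt = run_input.get("outputFormat", "markdown")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if run_input.get("delivery") == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery=webhook")
    return problems
```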
Output
| Field | Type | Description |
|---|---|---|
| `url` | string | Source page URL |
| `title` | string | Extracted page title |
| `content` | string | Main content in the selected format |
| `wordCount` | integer | Word count of the cleaned content |
| `contentLength` | integer | Character length of the cleaned content |
| `extractionMode` | string | Which main-content strategy won (`semantic-main`, `article-like`, `role-main`, `body-fallback`) |
| `mainElementHint` | string | The main HTML container that was used |
| `contentQualityScore` | integer | Heuristic confidence score from 0 to 100 |
| `truncatedOrThinContent` | boolean | `true` when the page looks suspiciously short |
| `author` | string | Author, when metadata exists |
| `publishedDate` | string | Publish date, when metadata exists |
| `language` | string | HTML language hint |
Output Example
```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors overview",
  "extractionMode": "semantic-main",
  "mainElementHint": "main",
  "contentQualityScore": 88,
  "truncatedOrThinContent": false,
  "wordCount": 1642,
  "contentLength": 10384,
  "content": "# Actors overview\n\nActors are serverless programs...",
  "language": "en",
  "checkedAt": "2026-04-20T17:30:00.000Z"
}
```
First-run buyer experience
- Run `store-input.example.json` or the Quickstart — Clean 3 Pages template.
- Open the dataset or local `output/result.json`, then compare it with `sample-output.example.json`.
- Check `contentQualityScore` and `truncatedOrThinContent` before scaling.
- Move successful first runs to Recurring Docs Watch or Webhook → Content Ops Handoff.
- If a URL is actually a blog/news post, move it to Article Content Extractor.
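The "check before scaling" step can be automated when triaging a first run. The filter below is illustrative; the threshold of 60 is an assumption, not a documented cutoff.

```python
def triage(items: list[dict], min_score: int = 60) -> dict:
    """Split dataset items into keep/review buckets using the actor's
    buyer-trust signals (contentQualityScore, truncatedOrThinContent)."""
    keep, review = [], []
    for item in items:
        score = item.get("contentQualityScore", 0)
        thin = item.get("truncatedOrThinContent", True)
        (review if thin or score < min_score else keep).append(item)
    return {"keep": keep, "review": review}
```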
Tips & Limitations
- Best on standard server-rendered HTML pages.
- Use `markdown` for the clearest first-run proof and easiest reuse in LLM/RAG workflows.
- This actor is not a full crawler and does not render JS-heavy SPAs.
- HTTP errors are returned as error rows so bad demo URLs do not masquerade as valid content.
FAQ
How is this different from Article Content Extractor?
Use this actor for broad pages like docs, pricing, help, policy, and product pages. Use Article Content Extractor when article-specific metadata and article confidence matter.
Can I use this after Google News or RSS discovery?
Yes — but only when the discovered URL is a general page. News/blog URLs should usually go to Article Content Extractor.
Does it work on JavaScript-heavy sites?
No browser is used. If the page renders most content client-side, switch to a browser-based actor.
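A quick pre-flight heuristic can flag pages that likely render client-side before you waste a run on them. This is a rough sketch; the thresholds are assumptions.

```python
import re

def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: many <script> tags but almost no visible text suggests the
    content is rendered in the browser and needs a headless actor instead."""
    stripped = re.sub(r"<script\b.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    script_count = len(re.findall(r"<script\b", html, re.I))
    return len(text) < min_text_chars and script_count >= 3
```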
Related Actors
Start here when the buyer needs cleaned page copy first. Add the next actor only when the job changes:
- 📰 Article Content Extractor — Switch to this when the URL is a newsroom or blog article and byline / publish-date confidence matters.
- 📰 Google News Scraper and 📡 RSS Feed Aggregator — Add upstream discovery when you do not already have URLs; send general pages back here and article pages to Article Content Extractor.
- Shopify Store Intelligence API — Use this instead when the site is a Shopify storefront and you need products, collections, vendors, and merch rollups instead of cleaned page text alone.
- 📧 Contact Details Extractor — Add after page cleanup when you want public emails, phones, or social handles from contact, about, or support pages on the same domain.
- Domain Security Audit API — Add when the cleaned pages belong to owned domains you also need to audit for SSL, DMARC, expiry, or security-header trust.
Cost
Pay Per Event:
- `actor-start`: $0.01
- `dataset-item`: $0.005 per output item
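With only these two events, the cost of a run is easy to estimate from the listed prices:

```python
ACTOR_START_USD = 0.01
DATASET_ITEM_USD = 0.005

def estimate_cost(num_items: int) -> float:
    """Estimated pay-per-event cost in USD for one run producing num_items rows."""
    return round(ACTOR_START_USD + num_items * DATASET_ITEM_USD, 4)
```

For example, a full 200-URL batch that yields one row per URL costs about $1.01.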
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store.