Article Content Extractor & Reader Scraper

Pricing

from $8.00 / 1,000 results

Article content extractor + reader scraper for news, blog, and press URLs. Returns article body, byline, publish date, excerpt, and hero image. Cookie banner / nav / share-button stripping is more aggressive than off-the-shelf readability libraries.

Developer

naoki anzai

Maintained by Community

Article Content Extractor

After this run

Turn this Actor's output into a capped paid report with Website RAG Readiness Audit Report. Use it when AI builders, documentation teams, support teams, and technical marketers need to decide whether public website pages are clean and complete enough for RAG ingestion.

  • First report: $9 / website_rag_snapshot_report; set maxChargeUsd to $9.
  • Deeper report: $29 / website_rag_readiness_report; use it only when the first report leaves you needing competitor comparison or more depth on recommended actions.
  • This is an internal Apify flow aid. It is not revenue proof until accounted paid usage appears.

Content teams, researchers, SEO teams, and AI dataset builders use this Actor to turn user-supplied public article URLs into a clean dataset for the Site QA & Content Intelligence Pack. Provide focused source inputs, keep the first run small, and expand only once the output shape proves useful. Each emitted row includes source context, timestamps, and fields designed for monitoring, QA, research, or workflow handoff.

Store Quickstart

Start with a small list of article URLs, review body extraction quality, then schedule recurring publisher checks.

Recommended first run:

{
  "urls": [
    "https://example.com/news/example"
  ],
  "includeImages": true,
  "limit": 10,
  "delivery": "dataset",
  "dryRun": false
}

Input examples

Article URLs

{
  "urls": [
    "https://example.com/news/example"
  ],
  "includeImages": true,
  "limit": 10,
  "delivery": "dataset",
  "dryRun": false
}

Press pages

{
  "urls": [
    "https://example.com/press/release"
  ],
  "includeImages": false,
  "limit": 10,
  "delivery": "dataset",
  "dryRun": false
}

Research webhook

{
  "urls": [
    "https://example.com/blog/post"
  ],
  "delivery": "webhook",
  "webhookUrl": "https://example.com/webhook",
  "dryRun": false
}
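
The three input shapes above can be assembled programmatically. The following is a minimal sketch, not the Actor's own validation: the helper name `build_input` is hypothetical, and the rules it enforces (only "dataset" and "webhook" as delivery modes, and a required webhookUrl for webhook delivery) are assumptions inferred from the examples rather than documented behaviour.

```python
from typing import Any, Optional


def build_input(
    urls: list[str],
    delivery: str = "dataset",
    webhook_url: Optional[str] = None,
    include_images: bool = True,
    limit: int = 10,
    dry_run: bool = False,
) -> dict[str, Any]:
    """Assemble an input object shaped like the examples above."""
    if not urls:
        raise ValueError("urls must contain at least one article URL")
    for url in urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Assumption: these are the only delivery modes, as seen in the examples.
    if delivery not in ("dataset", "webhook"):
        raise ValueError(f"unexpected delivery mode: {delivery!r}")
    # Assumption: webhook delivery is meaningless without a target URL.
    if delivery == "webhook" and not webhook_url:
        raise ValueError("delivery='webhook' needs webhook_url")

    payload: dict[str, Any] = {
        "urls": list(urls),
        "includeImages": include_images,
        "limit": limit,
        "delivery": delivery,
        "dryRun": dry_run,
    }
    if delivery == "webhook":
        payload["webhookUrl"] = webhook_url
    return payload
```

Unlike the webhook example above, this sketch always emits includeImages and limit; whether the Actor treats them as optional for webhook delivery is not stated in the listing.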

Sample output

{
  "meta": {
    "actorName": "article-content-extractor",
    "actorTitle": "Article Content Extractor",
    "bundle": "Site QA & Content Intelligence Pack",
    "fetchedAt": "2026-05-06T00:00:00.000Z",
    "totalRows": 1
  },
  "rows": [
    {
      "actorName": "article-content-extractor",
      "rowType": "article",
      "url": "https://example.com/news/example",
      "headline": "Example Headline",
      "author": "Example Author",
      "publishedAt": "2026-05-06",
      "articleText": "Example article body.",
      "sourceUrl": "https://example.com/news/example",
      "fetchedAt": "2026-05-06T00:00:00.000Z"
    }
  ],
  "warnings": []
}
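
A consumer might read this envelope as sketched below. It assumes the payload arrives as a single JSON object with the meta, rows, and warnings keys shown in the sample; the function name `read_envelope` and the totalRows consistency check are illustrative additions, not part of the Actor.

```python
import json
from typing import Any


def read_envelope(raw: str) -> tuple[list[dict[str, Any]], list[Any]]:
    """Return (rows, warnings) from a payload shaped like the sample output."""
    data = json.loads(raw)
    rows = data.get("rows", [])
    warnings = data.get("warnings", [])
    # Assumption: meta.totalRows, when present, should equal the row count.
    declared = data.get("meta", {}).get("totalRows")
    if declared is not None and declared != len(rows):
        raise ValueError(f"meta.totalRows={declared} but got {len(rows)} rows")
    return rows, warnings


# Trimmed copy of the sample output above.
sample = """
{
  "meta": {"actorName": "article-content-extractor", "totalRows": 1},
  "rows": [
    {
      "rowType": "article",
      "url": "https://example.com/news/example",
      "headline": "Example Headline",
      "articleText": "Example article body."
    }
  ],
  "warnings": []
}
"""

rows, warnings = read_envelope(sample)
for row in rows:
    print(row["url"], row["headline"])
# prints: https://example.com/news/example Example Headline
```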

Output fields

  • rowType
  • url
  • headline
  • author
  • publishedAt
  • articleText
  • excerpt
  • heroImage
  • sourceUrl

Rows also include source URLs, fetch timestamps, warnings when a source is partial, and stable IDs when the workflow supports recurring change detection.
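
The sample output omits excerpt and heroImage even though both are listed here, so downstream code probably should not assume every field is present. The sketch below maps a row onto a typed record under that assumption. The class name `ArticleRow` is hypothetical, and treating every field except url as optional is a defensive guess, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ArticleRow:
    url: str
    row_type: Optional[str] = None
    headline: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None
    article_text: Optional[str] = None
    excerpt: Optional[str] = None
    hero_image: Optional[str] = None
    source_url: Optional[str] = None

    @classmethod
    def from_dict(cls, row: dict[str, Any]) -> "ArticleRow":
        # Assumption: url is always present; other keys may be missing.
        return cls(
            url=row["url"],
            row_type=row.get("rowType"),
            headline=row.get("headline"),
            author=row.get("author"),
            published_at=row.get("publishedAt"),
            article_text=row.get("articleText"),
            excerpt=row.get("excerpt"),
            hero_image=row.get("heroImage"),
            source_url=row.get("sourceUrl"),
        )


row = ArticleRow.from_dict(
    {
        "rowType": "article",
        "url": "https://example.com/news/example",
        "headline": "Example Headline",
    }
)
print(row.headline, row.excerpt)  # prints: Example Headline None
```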

Pricing and no-change runs

Pricing is $0.001 per Actor start plus $0.008 per useful article row. Failed and no-content rows should stay out of the default dataset.

The default dataset is the billable surface. Dry runs, validation-only runs, missing-key warnings, and unchanged recurring polls should not write payable default-dataset rows.
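
As a rough budgeting aid, the two prices above combine into a simple estimate. The sketch assumes one start charge per run and that only rows written to the default dataset are billed, as described above; the function name `estimate_cost` is illustrative.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.001")  # per Actor start, from the pricing above
PER_ROW_USD = Decimal("0.008")      # per useful article row


def estimate_cost(useful_rows: int, starts: int = 1) -> Decimal:
    """Estimate the charge in USD for a number of billable rows and starts."""
    if useful_rows < 0 or starts < 0:
        raise ValueError("useful_rows and starts must be non-negative")
    return ACTOR_START_USD * starts + PER_ROW_USD * useful_rows


print(estimate_cost(10))    # 0.081
print(estimate_cost(1000))  # 8.001
```

The second figure lines up with the listed price of $8.00 per 1,000 results plus a single start charge. Decimal is used rather than float so that fractions of a cent add up exactly.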

Compliance guardrails

  • Fetch public article pages supplied by the user.
  • Do not imply content ownership transfer or publisher endorsement.
  • Use output for research, QA, and internal workflows.
  • Do not use provider emblems or wording that implies approval by an upstream data provider.

See also

If this Actor gave you raw rows or source context and you want a decision-ready report rather than more data, these follow-on report Actors are designed for a small, capped paid run. They use public or user-provided inputs, respect maxChargeUsd, and do not promise rankings, revenue, conversion lifts, or sales outcomes.

  • Website RAG Readiness Audit Report - decide whether public website pages are clean and complete enough for RAG ingestion. Entry $9 / website_rag_snapshot_report; premium $29 / website_rag_readiness_report.

Keep maxChargeUsd equal to the selected tier. Internal links are traffic aids only; real proof requires accounted paid usage.

Have a feature request, bug, or sample workflow you'd like to share? Open an issue — we read every one and use them to prioritise the next release.