📄 Article Content Scraper & Extractor
Scrape clean article bodies, authors, and metadata from messy newsrooms. Built for AI models, NLP datasets, and SEO audits requiring pristine text.
📰 Article Content Extractor
Extract clean article bodies, author bylines, publish dates, excerpts, hero images, and article confidence signals from public newsroom, blog, and press pages. Article Content Extractor is the article-specialized feeder in the content cluster: use it when the URL is clearly an article and the buyer needs article metadata that the flagship Website Content Extractor should not guess at.
Use this actor after Google News or RSS discovery, or after a broad page list is split and article URLs are routed out of Website Content Extractor. It keeps the article lane focused on article body cleanup, temporal context, and publisher metadata for LLM, RAG, SEO, comms, and research workflows.
For docs, product pages, policy pages, help centers, pricing pages, and general website URLs, start with Website Content Extractor instead. This actor is the feeder for article-grade URLs, not the default cleaner for every page on a website.
Store Quickstart
- Start here only for article/blog/newsroom URLs. Use `store-input.example.json` or Quickstart — 3 Public Articles for a reliable article proof run.
- Then use the article feeder ladder from `store-input.templates.json`:
  - Quickstart — 3 Public Articles for byline/date/body proof
  - Recurring News Watch for scheduled article monitoring
  - Webhook → Article Desk Handoff for routed editorial or research delivery
- Route docs, product, policy, pricing, help-center, and other broad website URLs back to Website Content Extractor.
- Side presets stay available for job-specific lanes: Competitor Blog Watch and Google News → Article Cleanup.
- Buyer-facing proof assets live in `sample-output.example.json` and `live-proof.example.json`.
Which actor should I use?
| Surface | Best for |
|---|---|
| Website Content Extractor | Flagship/default cleaner for docs, pricing, product, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blogs, newsroom posts, and press releases |
| Google News Scraper | Upstream discovery for fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery for fresh article URLs from known publishers |
Key Features
- 📰 Article-first extraction — Prioritizes article containers and article metadata
- 👤 Rich metadata — Captures author, publish date, site name, excerpt, and hero image when available
- 📊 Buyer-trust signal — Returns `articleConfidenceScore` for quick validation
- 🔀 Feeder routing — Accepts article URLs from Google News, RSS, or broad-page triage and sends non-article work back to Website Content Extractor
- 🖼️ Image-aware — Captures `heroImage` when available and keeps the optional inline `images` array for downstream review
- ⚡ Fast HTTP-only workflow — Good for public article pages that do not need a browser
Use Cases
| Who | Why |
|---|---|
| PR / comms teams | Clean article mentions for newsroom monitoring |
| Competitive intelligence | Turn blog watchlists into clean, comparable text |
| Research teams | Collect article-grade datasets with richer metadata |
| AI / RAG builders | Ingest article pages without writing custom parsers |
| Content cluster buyers | Feed article URLs discovered by Google News, RSS, or website-page triage into a dedicated article lane |
Buyer Workflows and Upgrade Routing
| Buyer workflow | Start here | Route next |
|---|---|---|
| Buyer has news, blog, newsroom, or press URLs | Quickstart — 3 Public Articles | Scale to Recurring News Watch when the same sources need monitoring |
| Buyer discovered URLs with Google News or RSS | Article Content Extractor for the discovered article URLs | Keep article rows canonical and review articleConfidenceScore before ingestion |
| Buyer needs editorial or research handoff | Webhook → Article Desk Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| URL is docs, pricing, product, policy, help, or a broad page | Do not force it into the article lane | Send it to Website Content Extractor |
| Buyer needs broader website cleanup first | Start with Website Content Extractor | Route only true article/blog/news URLs here |
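The lane-routing rules in the table above can be sketched as a simple URL heuristic. This is an illustrative client-side triage helper, not part of this actor or its input schema; the hint keywords are assumptions drawn from the routing guidance in this README.

```python
import re
from urllib.parse import urlparse

# Hypothetical triage helper: decide which lane a URL belongs to before
# building the `urls` input for either actor.
ARTICLE_HINTS = re.compile(
    r"(^|\.)(blog|news|press)\.|/(blog|news|newsroom|press|stories|articles)(/|$)", re.I
)
WEBSITE_HINTS = re.compile(
    r"/(docs|pricing|products?|policy|legal|help|support)(/|$)", re.I
)

def route(url: str) -> str:
    parsed = urlparse(url)
    haystack = parsed.netloc + parsed.path
    if ARTICLE_HINTS.search(haystack):
        return "article-content-extractor"
    if WEBSITE_HINTS.search(parsed.path):
        return "website-content-extractor"
    # Per the routing table, broad or ambiguous pages default to the
    # flagship Website Content Extractor, not the article lane.
    return "website-content-extractor"
```

A real triage step would likely also inspect page markup (e.g. an `<article>` container), but a path-based pass like this is often enough to split a mixed URL list into the two lanes.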
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | required | Article/news/blog/newsroom/press URLs (max 300); route broad website pages to Website Content Extractor |
| `outputFormat` | string | `markdown` | `text` or `markdown` |
| `includeImages` | boolean | `true` | Include the inline `images` array; `heroImage` can still be returned separately when available |
| `concurrency` | integer | `5` | Parallel fetches |
| `timeoutMs` | integer | `15000` | Per-article timeout |
| `delivery` | string | `dataset` | `dataset` writes canonical dataset rows. `webhook` writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| `webhookUrl` | string | — | Webhook target when `delivery=webhook` |
| `dryRun` | boolean | `false` | Write only local output for validation; disables dataset writes and webhook delivery |
Input Example
```json
{
  "urls": [
    "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
    "https://blog.google/products-and-platforms/products/search/summer-travel-tips-google-search-ai/",
    "https://blog.google/products-and-platforms/products/travel/2026-travel-trends/"
  ],
  "outputFormat": "markdown",
  "includeImages": true,
  "concurrency": 3,
  "delivery": "dataset",
  "dryRun": false
}
```
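The input constraints documented above can be sanity-checked before a run. A hypothetical client-side pre-flight validator (field names and limits mirror the Input table; this is not the actor's own validation):

```python
def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks runnable."""
    problems = []
    urls = run_input.get("urls") or []
    if not urls:
        problems.append("urls is required")
    if len(urls) > 300:
        problems.append("urls exceeds the 300-URL limit")
    if run_input.get("outputFormat", "markdown") not in {"text", "markdown"}:
        problems.append("outputFormat must be 'text' or 'markdown'")
    delivery = run_input.get("delivery", "dataset")
    if delivery not in {"dataset", "webhook"}:
        problems.append("delivery must be 'dataset' or 'webhook'")
    if delivery == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery='webhook'")
    return problems
```

Running this against the example input above should return an empty list.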
Delivery and PPE output
Non-dry-run runs always write canonical dataset rows first. This is true for both `delivery=dataset` and `delivery=webhook`.
When `delivery=webhook`, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If the dataset/PPE output fails, no webhook request is sent.
`dryRun=true` writes only the local `output/result.json` and disables both dataset writes and webhook delivery. Docker and local runtime require Node.js 20+; the actor Dockerfile uses `node:20-slim`.
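The delivery ordering described above can be summarized as a small decision sketch. This is a reader-facing model of the documented behavior, not the actor's source code:

```python
def plan_delivery(delivery: str = "dataset", dry_run: bool = False) -> list[str]:
    """Sketch of the documented delivery order.

    Canonical dataset rows are always written first on non-dry runs; the
    webhook fires only after the dataset/PPE write succeeds.
    """
    if dry_run:
        # dryRun=true: only local output/result.json, no dataset, no webhook.
        return ["write_local_output"]
    steps = ["write_local_output", "push_dataset_rows"]
    if delivery == "webhook":
        # Downstream handoff only; skipped entirely if the dataset write fails.
        steps.append("send_webhook")
    return steps
```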
Output
| Field | Type | Description |
|---|---|---|
| `url` | string | Source article URL |
| `title` | string | Headline |
| `author` | string | Byline when available |
| `publishedDate` | string | Publish timestamp when available |
| `excerpt` | string | Description or lead-text excerpt |
| `siteName` | string | Publisher / site name |
| `heroImage` | string | Hero image URL when available |
| `images` | array | Inline image objects when `includeImages` is enabled |
| `mainElementHint` | string | HTML container used for extraction |
| `articleConfidenceScore` | integer | Heuristic confidence score from 0-100 |
| `content` | string | Clean article body |
| `contentLength` | integer | Character length of the body |
| `wordCount` | integer | Body word count |
| `status` | string | Result billing status: `success`, `partial`, `empty`, or `error_no_result` |
| `chargedEvent` | string \| null | `apify-default-dataset-item` for charged rows; `null` for uncharged rows |
Output Example
```json
{
  "url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
  "title": "New touch-up tools in Google Photos’ image editor let you make quick, subtle fixes.",
  "author": null,
  "publishedDate": "2026-04-20T17:00:00+00:00",
  "excerpt": "New touch-up tools in Google Photos’ image editor let you refine skin texture, remove blemishes, brighten eyes or whiten teeth.",
  "siteName": "Google",
  "heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
  "images": [
    {
      "url": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
      "alt": ""
    }
  ],
  "mainElementHint": "article",
  "articleConfidenceScore": 89,
  "wordCount": 612,
  "status": "success",
  "chargedEvent": "apify-default-dataset-item",
  "content": "# New touch-up tools...\n\nYour photos should capture how you feel in the moment..."
}
```
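Rows shaped like the example above can be filtered before downstream ingestion using the `status` and `articleConfidenceScore` fields. A minimal sketch; the 70-point threshold is an illustrative choice, not a documented cutoff:

```python
def ingestible(rows: list[dict], min_confidence: int = 70) -> list[dict]:
    """Keep rows worth sending to a RAG/LLM pipeline.

    Status values and articleConfidenceScore come from the Output table;
    empty and error rows are dropped, as is anything below the threshold.
    """
    return [
        r for r in rows
        if r.get("status") in {"success", "partial"}
        and (r.get("articleConfidenceScore") or 0) >= min_confidence
        and r.get("content")
    ]
```

Lowering `min_confidence` trades cleaner input for higher recall; partial rows from paywalled pages often land in that middle band.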
First-run buyer experience
- Run `store-input.example.json` or Quickstart — 3 Public Articles on article/blog/newsroom URLs.
- Confirm the actor returns a real article body instead of an error, teaser page, or category page.
- Compare the charged dataset rows and the full local `output/result.json` with `sample-output.example.json`.
- Review `articleConfidenceScore`, `excerpt`, `publishedDate`, `heroImage`, and `images`.
- Move successful article runs to Recurring News Watch when the buyer needs monitoring.
- Move handoff workflows to Webhook → Article Desk Handoff only after the dataset/PPE output shape is accepted.
- For non-article URLs, switch to Website Content Extractor.
Tips & Limitations
- Best for public newsroom, blog, and article pages.
- Markdown is usually the strongest first-run proof for buyers and the easiest output for downstream LLM pipelines.
- Paywalled pages can still return partial or teaser text.
- HTTP errors are returned as error rows so broken demo URLs do not pretend to be valid article extractions.
FAQ
How is this different from Website Content Extractor?
This actor is tuned for article pages and returns richer article metadata. Website Content Extractor is the flagship/default fit for docs, help, pricing, policy, product pages, and other broad website pages.
What should I pair it with?
Use Google News Scraper or RSS Feed Aggregator to discover article URLs, then send only article/blog/newsroom URLs here for cleanup. Route broad website pages to Website Content Extractor.
Does it work on paywalled sites?
It only extracts what is publicly visible in the fetched HTML.
Related Actors
Content Intelligence Pack routing:
- 📄 Website Content Extractor — flagship/default cleaner for non-article docs, pricing, policy, product, help-center, and broad website pages
- 📰 Google News Scraper — upstream discovery for fresh article URLs by query before this feeder runs
- 📡 RSS Feed Aggregator — upstream discovery for fresh article URLs from publisher feeds before this feeder runs
Cost
Pay Per Event:
- Actor start pricing: check the Apify Store Pricing tab for the current live rate.
- Chargeable dataset rows: useful full and partial article rows are pushed to the Apify default dataset and carry `chargedEvent: "apify-default-dataset-item"`.
- No-charge statuses: `empty` and `error_no_result` rows stay in the local `output/result.json` and webhook payloads with `chargedEvent: null`; they are not pushed to the Apify default dataset and are not charged.
- Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
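The charged/uncharged split described above can be reconciled from the attempted row set in the local output. A hypothetical audit helper, keyed on the documented `chargedEvent` values:

```python
def billing_summary(rows: list[dict]) -> dict:
    """Reconcile attempted rows against charged rows, per the PPE rules.

    Rows carrying chargedEvent == "apify-default-dataset-item" are billable;
    rows with chargedEvent None (empty / error_no_result) are not.
    """
    charged = [r for r in rows if r.get("chargedEvent") == "apify-default-dataset-item"]
    free = [r for r in rows if r.get("chargedEvent") is None]
    return {"attempted": len(rows), "charged": len(charged), "uncharged": len(free)}
```

Comparing `attempted` against `charged` on each run is a quick way to spot demo URL lists that are quietly failing without incurring cost.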
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store.