📄 Article Content Scraper & Extractor

Scrape clean article bodies, authors, and metadata from messy newsrooms. Built for AI models, NLP datasets, and SEO audits requiring pristine text.

Pricing: Pay per event
Developer: 太郎 山田 (Maintained by Community)

📰 Article Content Extractor

Extract clean article bodies, author bylines, publish dates, excerpts, hero images, and article confidence signals from public newsroom, blog, and press pages. Article Content Extractor is the article-specialized feeder in the content cluster: use it when the URL is clearly an article and the buyer needs article metadata that the flagship Website Content Extractor should not guess at.

Use this actor after Google News or RSS discovery, or after a broad page list is split and article URLs are routed out of Website Content Extractor. It keeps the article lane focused on article body cleanup, temporal context, and publisher metadata for LLM, RAG, SEO, comms, and research workflows.

For docs, product pages, policy pages, help centers, pricing pages, and general website URLs, start with Website Content Extractor instead. This actor is the feeder for article-grade URLs, not the default cleaner for every page on a website.

Store Quickstart

  • Start here only for article/blog/newsroom URLs. Use store-input.example.json or Quickstart — 3 Public Articles for a reliable article proof run.
  • Then use the article feeder ladder from store-input.templates.json:
    1. Quickstart — 3 Public Articles for byline/date/body proof
    2. Recurring News Watch for scheduled article monitoring
    3. Webhook → Article Desk Handoff for routed editorial or research delivery
  • Route docs, product, policy, pricing, help-center, and other broad website URLs back to Website Content Extractor.
  • Side presets stay available for job-specific lanes: Competitor Blog Watch and Google News → Article Cleanup.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Which actor should I use?

| Surface | Best for |
|---|---|
| Website Content Extractor | Flagship/default cleaner for docs, pricing, product, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blogs, newsroom posts, and press releases |
| Google News Scraper | Upstream discovery for fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery for fresh article URLs from known publishers |
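The routing rule in the table above can be sketched as a small URL classifier. This is an illustrative sketch only: the path keywords below are assumptions for demonstration, not the actors' actual routing logic, and ambiguous URLs default to the flagship cleaner as the docs recommend.

```python
from urllib.parse import urlparse

# Illustrative keyword lists -- assumptions, not the actors' real heuristics.
ARTICLE_HINTS = ("/blog/", "/news/", "/newsroom/", "/press/", "/article/")
WEBSITE_HINTS = ("/docs/", "/pricing", "/product", "/policy", "/help")

def route_url(url: str) -> str:
    """Return which actor lane a URL should be sent to."""
    path = urlparse(url).path.lower()
    if any(hint in path for hint in ARTICLE_HINTS):
        return "article-content-extractor"
    if any(hint in path for hint in WEBSITE_HINTS):
        return "website-content-extractor"
    # Ambiguous pages go to the flagship/default cleaner.
    return "website-content-extractor"
```

A triage step like this keeps the article lane focused before any extraction run is started.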

Key Features

  • 📰 Article-first extraction — Prioritizes article containers and article metadata
  • 👤 Rich metadata — Captures author, publish date, site name, excerpt, and hero image when available
  • 📊 Buyer-trust signal — Returns articleConfidenceScore for quick validation
  • 🔀 Feeder routing — Accepts article URLs from Google News, RSS, or broad-page triage and sends non-article work back to Website Content Extractor
  • 🖼️ Image-aware — Captures heroImage when available and keeps the optional inline images array for downstream review
  • Fast HTTP-only workflow — Good for public article pages that do not need a browser

Use Cases

| Who | Why |
|---|---|
| PR / comms teams | Clean article mentions for newsroom monitoring |
| Competitive intelligence | Turn blog watchlists into clean, comparable text |
| Research teams | Collect article-grade datasets with richer metadata |
| AI / RAG builders | Ingest article pages without writing custom parsers |
| Content cluster buyers | Feed article URLs discovered by Google News, RSS, or website-page triage into a dedicated article lane |

Buyer Workflows and Upgrade Routing

| Buyer workflow | Start here | Route next |
|---|---|---|
| Buyer has news, blog, newsroom, or press URLs | Quickstart — 3 Public Articles | Scale to Recurring News Watch when the same sources need monitoring |
| Buyer discovered URLs with Google News or RSS | Article Content Extractor for the discovered article URLs | Keep article rows canonical and review articleConfidenceScore before ingestion |
| Buyer needs editorial or research handoff | Webhook → Article Desk Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| URL is docs, pricing, product, policy, help, or a broad page | Do not force it into the article lane | Send it to Website Content Extractor |
| Buyer needs broader website cleanup first | Start with Website Content Extractor | Route only true article/blog/news URLs here |

Input

| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | Article/news/blog/newsroom/press URLs (max 300); route broad website pages to Website Content Extractor |
| outputFormat | string | markdown | `text` or `markdown` |
| includeImages | boolean | true | Include the inline images array; heroImage can still be returned separately when available |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-article timeout in milliseconds |
| delivery | string | dataset | `dataset` writes canonical dataset rows; `webhook` writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| webhookUrl | string | — | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation; disables dataset writes and webhook delivery |
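A minimal pre-flight check for the input object can mirror the table above. This is a local sketch using the documented field names, defaults, and the 300-URL cap, not the actor's own schema validation:

```python
def validate_input(run_input: dict) -> dict:
    """Apply documented defaults and basic checks to an actor input dict."""
    urls = run_input.get("urls")
    if not urls:
        raise ValueError("urls is required")
    if len(urls) > 300:
        raise ValueError("urls accepts at most 300 entries")
    out_format = run_input.get("outputFormat", "markdown")
    if out_format not in ("text", "markdown"):
        raise ValueError("outputFormat must be 'text' or 'markdown'")
    return {
        "urls": urls,
        "outputFormat": out_format,
        "includeImages": run_input.get("includeImages", True),
        "concurrency": run_input.get("concurrency", 5),
        "timeoutMs": run_input.get("timeoutMs", 15000),
        "delivery": run_input.get("delivery", "dataset"),
        "dryRun": run_input.get("dryRun", False),
    }
```

Running a check like this before submitting a run catches oversized batches and typoed fields locally instead of burning a charged start.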

Input Example

```json
{
  "urls": [
    "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
    "https://blog.google/products-and-platforms/products/search/summer-travel-tips-google-search-ai/",
    "https://blog.google/products-and-platforms/products/travel/2026-travel-trends/"
  ],
  "outputFormat": "markdown",
  "includeImages": true,
  "concurrency": 3,
  "delivery": "dataset",
  "dryRun": false
}
```

Delivery and PPE output

Non-dry-run runs always write canonical dataset rows first. This is true for both delivery=dataset and delivery=webhook.

When delivery=webhook, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If dataset/PPE output fails, no webhook request is sent.

dryRun=true writes only local output/result.json and disables both dataset writes and webhook delivery. Docker and local runtime require Node.js 20+; the actor Dockerfile uses node:20-slim.
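The delivery ordering described above can be sketched as a small control-flow function. The `push_rows`, `send_webhook`, and `write_local` callables are hypothetical stand-ins for the actor's internals, shown only to make the dataset-first guarantee concrete:

```python
def deliver(rows, delivery, webhook_url=None, dry_run=False,
            push_rows=None, send_webhook=None, write_local=None):
    """Dataset-first delivery: the webhook fires only after dataset/PPE succeeds."""
    write_local(rows)                       # output/result.json is always written
    if dry_run:
        return "local-only"                 # dryRun disables dataset and webhook
    push_rows(rows)                         # canonical dataset/PPE write comes first;
                                            # if this raises, no webhook is sent
    if delivery == "webhook":
        send_webhook(webhook_url, rows)     # downstream handoff only on success
    return "delivered"
```

Because `push_rows` runs before `send_webhook`, a failed dataset write short-circuits the function and no webhook request goes out, matching the behavior described above.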

Output

| Field | Type | Description |
|---|---|---|
| url | string | Source article URL |
| title | string | Headline |
| author | string | Byline when available |
| publishedDate | string | Publish timestamp when available |
| excerpt | string | Description or lead-text excerpt |
| siteName | string | Publisher / site name |
| heroImage | string | Hero image URL when available |
| images | array | Inline image objects when includeImages is enabled |
| mainElementHint | string | HTML container used for extraction |
| articleConfidenceScore | integer | Heuristic confidence score from 0-100 |
| content | string | Clean article body |
| contentLength | integer | Character length of the body |
| wordCount | integer | Body word count |
| status | string | Result billing status: success, partial, empty, or error_no_result |
| chargedEvent | string \| null | PPE event name for charged rows; null for no-charge rows |

Output Example

```json
{
  "url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
  "title": "New touch-up tools in Google Photos’ image editor let you make quick, subtle fixes.",
  "author": null,
  "publishedDate": "2026-04-20T17:00:00+00:00",
  "excerpt": "New touch-up tools in Google Photos’ image editor let you refine skin texture, remove blemishes, brighten eyes or whiten teeth.",
  "siteName": "Google",
  "heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
  "images": [
    {
      "url": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
      "alt": ""
    }
  ],
  "mainElementHint": "article",
  "articleConfidenceScore": 89,
  "wordCount": 612,
  "status": "success",
  "chargedEvent": "apify-default-dataset-item",
  "content": "# New touch-up tools...\n\nYour photos should capture how you feel in the moment..."
}
```
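Downstream consumers usually gate rows like the one above before ingestion. A small post-processing sketch using the documented fields; the threshold of 70 is an arbitrary illustration, not a recommended value:

```python
def usable_rows(rows, min_confidence=70):
    """Keep billable article rows confident enough for ingestion."""
    return [
        row for row in rows
        if row.get("status") in ("success", "partial")
        and (row.get("articleConfidenceScore") or 0) >= min_confidence
        and row.get("content")
    ]
```

Reviewing `articleConfidenceScore` this way before a RAG or dataset load is the "buyer-trust signal" workflow from the feature list.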

First-run buyer experience

  1. Run store-input.example.json or Quickstart — 3 Public Articles on article/blog/newsroom URLs.
  2. Confirm the actor returns a real article body instead of an error, teaser page, or category page.
  3. Compare the charged dataset rows and full local output/result.json with sample-output.example.json.
  4. Review articleConfidenceScore, excerpt, publishedDate, heroImage, and images.
  5. Move successful article runs to Recurring News Watch when the buyer needs monitoring.
  6. Move handoff workflows to Webhook → Article Desk Handoff only after the dataset/PPE output shape is accepted.
  7. For non-article URLs, switch to Website Content Extractor.

Tips & Limitations

  • Best for public newsroom, blog, and article pages.
  • Markdown is usually the strongest first-run proof for buyers and the easiest output for downstream LLM pipelines.
  • Paywalled pages can still return partial or teaser text.
  • HTTP errors are returned as error rows so broken demo URLs do not pretend to be valid article extractions.

FAQ

How is this different from Website Content Extractor?

This actor is tuned for article pages and returns richer article metadata. Website Content Extractor is the flagship/default fit for docs, help, pricing, policy, product pages, and other broad website pages.

What should I pair it with?

Use Google News Scraper or RSS Feed Aggregator to discover article URLs, then send only article/blog/newsroom URLs here for cleanup. Route broad website pages to Website Content Extractor.

Does it work on paywalled sites?

It only extracts what is publicly visible in the fetched HTML.


Cost

Pay Per Event:

  • Actor start pricing: check the Apify Store Pricing tab for the current live rate.
  • Chargeable dataset rows: useful full and partial article rows are pushed to the Apify default dataset and carry chargedEvent: "apify-default-dataset-item".
  • No-charge statuses: empty and error_no_result rows stay in local output/result.json and webhook payloads with chargedEvent: null; they are not pushed to the Apify default dataset and are not charged.
  • Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
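The billing split in the bullets above maps onto a one-line classifier. The statuses and event name are taken directly from the Cost section; this is a sketch of the stated rule, not the actor's billing code:

```python
# Per the Cost section: full and partial rows are charged; empty and
# error_no_result rows carry chargedEvent: null and are not charged.
CHARGED_STATUSES = {"success", "partial"}

def charged_event(status: str):
    """Return the PPE event for a row's status, or None for no-charge rows."""
    return "apify-default-dataset-item" if status in CHARGED_STATUSES else None
```

This makes it easy to reconcile a local `output/result.json` against the charged dataset rows during an audit.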

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store.