📄 Article Content Scraper & Extractor
Scrape clean article bodies, authors, and metadata from messy newsrooms. Built for AI models, NLP datasets, and SEO audits requiring pristine text.
📰 Article Content Extractor
Extract clean article bodies, author bylines, publish dates, excerpts, hero images, and article confidence signals from public newsroom, blog, and press pages. Article Content Extractor is the article-specialized feeder in the content cluster: use it when the URL is clearly an article and the buyer needs article metadata that the flagship Website Content Extractor should not guess at.
Use this actor after Google News or RSS discovery, or after a broad page list is split and article URLs are routed out of Website Content Extractor. It keeps the article lane focused on article body cleanup, temporal context, and publisher metadata for LLM, RAG, SEO, comms, and research workflows.
For docs, product pages, policy pages, help centers, pricing pages, and general website URLs, start with Website Content Extractor instead. This actor is the feeder for article-grade URLs, not the default cleaner for every page on a website.
Store Quickstart
- Start here only for article/blog/newsroom URLs. Use `store-input.example.json` or Quickstart — 3 Public Articles for a reliable article proof run.
- Then use the article feeder ladder from `store-input.templates.json`:
  - Quickstart — 3 Public Articles for byline/date/body proof
  - Recurring News Watch for scheduled article monitoring
  - Webhook → Article Desk Handoff for routed editorial or research delivery
- Route docs, product, policy, pricing, help-center, and other broad website URLs back to Website Content Extractor.
- Side presets stay available for job-specific lanes: Competitor Blog Watch and Google News → Article Cleanup.
- Buyer-facing proof assets live in `sample-output.example.json` and `live-proof.example.json`.
Which actor should I use?
| Surface | Best for |
|---|---|
| Website Content Extractor | Flagship/default cleaner for docs, pricing, product, policy, help-center, and broad website pages |
| Article Content Extractor | Article-specialized feeder for news stories, blogs, newsroom posts, and press releases |
| Google News Scraper | Upstream discovery for fresh article URLs by query |
| RSS Feed Aggregator | Upstream discovery for fresh article URLs from known publishers |
Key Features
- 📰 Article-first extraction — Prioritizes article containers and article metadata
- 👤 Rich metadata — Captures author, publish date, site name, excerpt, and hero image when available
- 📊 Buyer-trust signal — Returns `articleConfidenceScore` for quick validation
- 🔀 Feeder routing — Accepts article URLs from Google News, RSS, or broad-page triage and sends non-article work back to Website Content Extractor
- 🖼️ Image-aware — Captures `heroImage` when available and keeps the optional inline `images` array for downstream review
- ⚡ Fast HTTP-only workflow — Good for public article pages that do not need a browser
Use Cases
| Who | Why |
|---|---|
| PR / comms teams | Clean article mentions for newsroom monitoring |
| Competitive intelligence | Turn blog watchlists into clean, comparable text |
| Research teams | Collect article-grade datasets with richer metadata |
| AI / RAG builders | Ingest article pages without writing custom parsers |
| Content cluster buyers | Feed article URLs discovered by Google News, RSS, or website-page triage into a dedicated article lane |
Buyer Workflows and Upgrade Routing
| Buyer workflow | Start here | Route next |
|---|---|---|
| Buyer has news, blog, newsroom, or press URLs | Quickstart — 3 Public Articles | Scale to Recurring News Watch when the same sources need monitoring |
| Buyer discovered URLs with Google News or RSS | Article Content Extractor for the discovered article URLs | Keep article rows canonical and review articleConfidenceScore before ingestion |
| Buyer needs editorial or research handoff | Webhook → Article Desk Handoff | Dataset/PPE output remains canonical; webhook delivery is downstream only |
| URL is docs, pricing, product, policy, help, or a broad page | Do not force it into the article lane | Send it to Website Content Extractor |
| Buyer needs broader website cleanup first | Start with Website Content Extractor | Route only true article/blog/news URLs here |
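The lane-routing rules in the table above can be sketched as a simple URL heuristic. This is an illustrative client-side triage helper, not part of this actor or its input schema; the hint keywords are assumptions drawn from the routing guidance in this README.

```python
import re
from urllib.parse import urlparse

# Hypothetical triage helper: decide which lane a URL belongs to before
# building the `urls` input for either actor.
ARTICLE_HINTS = re.compile(
    r"(^|\.)(blog|news|press)\.|/(blog|news|newsroom|press|stories|articles)(/|$)", re.I
)
WEBSITE_HINTS = re.compile(
    r"/(docs|pricing|products?|policy|legal|help|support)(/|$)", re.I
)

def route(url: str) -> str:
    parsed = urlparse(url)
    haystack = parsed.netloc + parsed.path
    if ARTICLE_HINTS.search(haystack):
        return "article-content-extractor"
    if WEBSITE_HINTS.search(parsed.path):
        return "website-content-extractor"
    # Per the routing table, broad or ambiguous pages default to the
    # flagship Website Content Extractor, not the article lane.
    return "website-content-extractor"
```

A real triage step would likely also inspect page markup (e.g. an `<article>` container), but a path-based pass like this is often enough to split a mixed URL list into the two lanes.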
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | required | Article/news/blog/newsroom/press URLs (max 300); route broad website pages to Website Content Extractor |
| `outputFormat` | string | `markdown` | `text` or `markdown` |
| `includeImages` | boolean | `true` | Include the inline `images` array; `heroImage` can still be returned separately when available |
| `concurrency` | integer | `5` | Parallel fetches |
| `timeoutMs` | integer | `15000` | Per-article timeout |
| `delivery` | string | `dataset` | `dataset` writes canonical dataset rows. `webhook` writes canonical dataset rows first, then sends the webhook after dataset/PPE output succeeds |
| `webhookUrl` | string | — | Webhook target when `delivery=webhook` |
| `dryRun` | boolean | `false` | Write only local output for validation; disables dataset writes and webhook delivery |
Input Example
```json
{
  "urls": [
    "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
    "https://blog.google/products-and-platforms/products/search/summer-travel-tips-google-search-ai/",
    "https://blog.google/products-and-platforms/products/travel/2026-travel-trends/"
  ],
  "outputFormat": "markdown",
  "includeImages": true,
  "concurrency": 3,
  "delivery": "dataset",
  "dryRun": false
}
```
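The input constraints documented above can be sanity-checked before a run. A hypothetical client-side pre-flight validator (field names and limits mirror the Input table; this is not the actor's own validation):

```python
def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks runnable."""
    problems = []
    urls = run_input.get("urls") or []
    if not urls:
        problems.append("urls is required")
    if len(urls) > 300:
        problems.append("urls exceeds the 300-URL limit")
    if run_input.get("outputFormat", "markdown") not in {"text", "markdown"}:
        problems.append("outputFormat must be 'text' or 'markdown'")
    delivery = run_input.get("delivery", "dataset")
    if delivery not in {"dataset", "webhook"}:
        problems.append("delivery must be 'dataset' or 'webhook'")
    if delivery == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery='webhook'")
    return problems
```

Running this against the example input above should return an empty list.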
Delivery and PPE output
Non-dry-run runs always write canonical dataset rows first. This is true for both `delivery=dataset` and `delivery=webhook`.
When `delivery=webhook`, the webhook is a downstream handoff: it is sent only after the dataset write and PPE output succeed. If the dataset/PPE output fails, no webhook request is sent.
`dryRun=true` writes only the local `output/result.json` and disables both dataset writes and webhook delivery. Docker and local runtime require Node.js 20+; the actor Dockerfile uses `node:20-slim`.
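The delivery ordering described above can be summarized as a small decision sketch. This is a reader-facing model of the documented behavior, not the actor's source code:

```python
def plan_delivery(delivery: str = "dataset", dry_run: bool = False) -> list[str]:
    """Sketch of the documented delivery order.

    Canonical dataset rows are always written first on non-dry runs; the
    webhook fires only after the dataset/PPE write succeeds.
    """
    if dry_run:
        # dryRun=true: only local output/result.json, no dataset, no webhook.
        return ["write_local_output"]
    steps = ["write_local_output", "push_dataset_rows"]
    if delivery == "webhook":
        # Downstream handoff only; skipped entirely if the dataset write fails.
        steps.append("send_webhook")
    return steps
```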
Output
| Field | Type | Description |
|---|---|---|
| `url` | string | Source article URL |
| `title` | string | Headline |
| `author` | string | Byline when available |
| `publishedDate` | string | Publish timestamp when available |
| `excerpt` | string | Description or lead-text excerpt |
| `siteName` | string | Publisher / site name |
| `heroImage` | string | Hero image URL when available |
| `images` | array | Inline image objects when `includeImages` is enabled |
| `mainElementHint` | string | HTML container used for extraction |
| `articleConfidenceScore` | integer | Heuristic confidence score from 0-100 |
| `content` | string | Clean article body |
| `contentLength` | integer | Character length of the body |
| `wordCount` | integer | Body word count |
| `status` | string | Result billing status: `success`, `partial`, `empty`, or `error_no_result` |
| `chargedEvent` | string \| null | `apify-default-dataset-item` for charged rows; `null` for uncharged rows |
Output Example
```json
{
  "url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
  "title": "New touch-up tools in Google Photos’ image editor let you make quick, subtle fixes.",
  "author": null,
  "publishedDate": "2026-04-20T17:00:00+00:00",
  "excerpt": "New touch-up tools in Google Photos’ image editor let you refine skin texture, remove blemishes, brighten eyes or whiten teeth.",
  "siteName": "Google",
  "heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
  "images": [
    {
      "url": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
      "alt": ""
    }
  ],
  "mainElementHint": "article",
  "articleConfidenceScore": 89,
  "wordCount": 612,
  "status": "success",
  "chargedEvent": "apify-default-dataset-item",
  "content": "# New touch-up tools...\n\nYour photos should capture how you feel in the moment..."
}
```
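Rows shaped like the example above can be filtered before downstream ingestion using the `status` and `articleConfidenceScore` fields. A minimal sketch; the 70-point threshold is an illustrative choice, not a documented cutoff:

```python
def ingestible(rows: list[dict], min_confidence: int = 70) -> list[dict]:
    """Keep rows worth sending to a RAG/LLM pipeline.

    Status values and articleConfidenceScore come from the Output table;
    empty and error rows are dropped, as is anything below the threshold.
    """
    return [
        r for r in rows
        if r.get("status") in {"success", "partial"}
        and (r.get("articleConfidenceScore") or 0) >= min_confidence
        and r.get("content")
    ]
```

Lowering `min_confidence` trades cleaner input for higher recall; partial rows from paywalled pages often land in that middle band.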
First-run buyer experience
- Run `store-input.example.json` or Quickstart — 3 Public Articles on article/blog/newsroom URLs.
- Confirm the actor returns a real article body instead of an error, teaser page, or category page.
- Compare the charged dataset rows and the full local `output/result.json` with `sample-output.example.json`.
- Review `articleConfidenceScore`, `excerpt`, `publishedDate`, `heroImage`, and `images`.
- Move successful article runs to Recurring News Watch when the buyer needs monitoring.
- Move handoff workflows to Webhook → Article Desk Handoff only after the dataset/PPE output shape is accepted.
- For non-article URLs, switch to Website Content Extractor.
Tips & Limitations
- Best for public newsroom, blog, and article pages.
- Markdown is usually the strongest first-run proof for buyers and the easiest output for downstream LLM pipelines.
- Paywalled pages can still return partial or teaser text.
- HTTP errors are returned as error rows so broken demo URLs do not pretend to be valid article extractions.
FAQ
How is this different from Website Content Extractor?
This actor is tuned for article pages and returns richer article metadata. Website Content Extractor is the flagship/default fit for docs, help, pricing, policy, product pages, and other broad website pages.
What should I pair it with?
Use Google News Scraper or RSS Feed Aggregator to discover article URLs, then send only article/blog/newsroom URLs here for cleanup. Route broad website pages to Website Content Extractor.
Does it work on paywalled sites?
It only extracts what is publicly visible in the fetched HTML.
Related Actors
Content Intelligence Pack routing:
- 📄 Website Content Extractor — flagship/default cleaner for non-article docs, pricing, policy, product, help-center, and broad website pages
- 📰 Google News Scraper — upstream discovery for fresh article URLs by query before this feeder runs
- 📡 RSS Feed Aggregator — upstream discovery for fresh article URLs from publisher feeds before this feeder runs
Cost
Pay Per Event:
- Actor start pricing: check the Apify Store Pricing tab for the current live rate.
- Chargeable dataset rows: useful full and partial article rows are pushed to the Apify default dataset and carry `chargedEvent: "apify-default-dataset-item"`.
- No-charge statuses: `empty` and `error_no_result` rows stay in the local `output/result.json` and webhook payloads with `chargedEvent: null`; they are not pushed to the Apify default dataset and are not charged.
- Role split: the default dataset is the billable charged-row surface; local output and webhook payloads preserve the full attempted row set for audit and repair.
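The charged/uncharged split described above can be reconciled from the attempted row set in the local output. A hypothetical audit helper, keyed on the documented `chargedEvent` values:

```python
def billing_summary(rows: list[dict]) -> dict:
    """Reconcile attempted rows against charged rows, per the PPE rules.

    Rows carrying chargedEvent == "apify-default-dataset-item" are billable;
    rows with chargedEvent None (empty / error_no_result) are not.
    """
    charged = [r for r in rows if r.get("chargedEvent") == "apify-default-dataset-item"]
    free = [r for r in rows if r.get("chargedEvent") is None]
    return {"attempted": len(rows), "charged": len(charged), "uncharged": len(free)}
```

Comparing `attempted` against `charged` on each run is a quick way to spot demo URL lists that are quietly failing without incurring cost.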
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store.