📄 Article Content Scraper & Extractor
Scrape clean article bodies, authors, and metadata from messy newsrooms. Built for AI models, NLP datasets, and SEO audits requiring pristine text.
📰 Article Content Extractor
Extract clean article text, author bylines, and core metadata from newsroom, blog, and PR pages with the Article Content Scraper & Extractor. Generic web scrapers often pull in footer text, sidebars, and navigation links; this tool prioritizes article containers to isolate the actual article body. It is built for engineers and SEO professionals who need clean text without writing custom selectors for every new target website.
If you are building an AI application, training Large Language Models (LLMs), or populating a Retrieval-Augmented Generation (RAG) database, raw HTML is not enough. You need the exact content body, the hero image, and the original publish date to ensure context and temporal accuracy. This scraper handles exactly that workflow. By feeding it URLs scraped from Google News or RSS feeds, you can automate a continuous pipeline of clean text extraction.
Use cases range from competitor blog tracking to daily industry news aggregation. Each run returns structured output with the article headline, the cleaned article text, the author's byline when available, an excerpt, and a heuristic article confidence score from 0 to 100. The score helps you keep failed pages and category indexes out of your database, so the output is ready to use and scales with your daily content volume.
Store Quickstart
- Start with `store-input.example.json` or Quickstart — 3 Public Articles for a reliable first run.
- Then use the upgrade ladder from `store-input.templates.json`:
  - Quickstart — 3 Public Articles
  - Recurring News Watch
  - Webhook → Article Desk Handoff
- Side presets stay available for job-specific lanes: Competitor Blog Watch and Google News → Article Cleanup.
- Buyer-facing proof assets live in `sample-output.example.json` and `live-proof.example.json`.
Which actor should I use?
| Surface | Best for |
|---|---|
| Article Content Extractor | News stories, blogs, newsroom posts, press releases |
| Website Content Extractor | Docs, pricing, product, policy, help-center, and broad website pages |
| Google News Scraper | Find fresh article URLs by query |
| RSS Feed Aggregator | Find fresh article URLs from known publishers |
Key Features
- 📰 Article-first extraction — Prioritizes article containers and article metadata
- 👤 Rich metadata — Captures author, publish date, site name, excerpt, and hero image when available
- 📊 Buyer-trust signal — Returns `articleConfidenceScore` for quick validation
- 🖼️ Image-aware — Keeps hero and inline images for downstream review
- ⚡ Fast HTTP-only workflow — Good for public article pages that do not need a browser
Use Cases
| Who | Why |
|---|---|
| PR / comms teams | Clean article mentions for newsroom monitoring |
| Competitive intelligence | Turn blog watchlists into clean, comparable text |
| Research teams | Collect article-grade datasets with richer metadata |
| AI / RAG builders | Ingest article pages without writing custom parsers |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | required | Article/news/blog URLs (max 300) |
| `outputFormat` | string | markdown | text or markdown |
| `includeImages` | boolean | true | Include hero and inline image URLs |
| `concurrency` | integer | 5 | Parallel fetches |
| `timeoutMs` | integer | 15000 | Per-article timeout in milliseconds |
| `delivery` | string | dataset | dataset or webhook |
| `webhookUrl` | string | — | Webhook target when delivery=webhook |
| `dryRun` | boolean | false | Write only local output for validation |
Input Example
{"urls": ["https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/","https://blog.google/products-and-platforms/products/search/summer-travel-tips-google-search-ai/","https://blog.google/products-and-platforms/products/travel/2026-travel-trends/"],"outputFormat": "markdown","includeImages": true,"concurrency": 3}
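The constraints in the input table can be checked before a run is started. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper, and the validation rules are only what the table states (at most 300 URLs, `outputFormat` of text or markdown, a `webhookUrl` when `delivery` is webhook). The field names and defaults are copied from the table.

```python
import json
from urllib.parse import urlparse

MAX_URLS = 300  # limit stated in the input table
ALLOWED_FORMATS = {"text", "markdown"}
ALLOWED_DELIVERY = {"dataset", "webhook"}


def build_input(urls, output_format="markdown", include_images=True,
                concurrency=5, timeout_ms=15000, delivery="dataset",
                webhook_url=None, dry_run=False):
    """Hypothetical helper: return an input dict with the documented fields."""
    if not urls:
        raise ValueError("urls is required")
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per run, got {len(urls)}")
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if delivery not in ALLOWED_DELIVERY:
        raise ValueError(f"delivery must be one of {sorted(ALLOWED_DELIVERY)}")
    if delivery == "webhook" and not webhook_url:
        raise ValueError("webhookUrl is needed when delivery is 'webhook'")

    payload = {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeImages": include_images,
        "concurrency": concurrency,
        "timeoutMs": timeout_ms,
        "delivery": delivery,
        "dryRun": dry_run,
    }
    if webhook_url:
        payload["webhookUrl"] = webhook_url
    return payload


if __name__ == "__main__":
    # example.com is a placeholder, not a tested article URL
    print(json.dumps(build_input(["https://example.com/news/story"], concurrency=3), indent=2))
```

Failing early on the client side avoids paying the start fee for a run that the input would not support.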
Output
| Field | Type | Description |
|---|---|---|
| `url` | string | Source article URL |
| `title` | string | Headline |
| `author` | string | Byline when available |
| `publishedDate` | string | Publish timestamp when available |
| `excerpt` | string | Description or lead-text excerpt |
| `siteName` | string | Publisher / site name |
| `heroImage` | string | Hero image URL when available |
| `images` | array | Inline and hero image objects |
| `mainElementHint` | string | HTML container used for extraction |
| `articleConfidenceScore` | integer | Heuristic confidence score from 0-100 |
| `content` | string | Clean article body |
| `contentLength` | integer | Character length of the body |
| `wordCount` | integer | Body word count |
Output Example
{"url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/","title": "New touch-up tools in Google Photos’ image editor let you make quick, subtle fixes.","author": null,"publishedDate": "2026-04-20T17:00:00+00:00","excerpt": "New touch-up tools in Google Photos’ image editor let you refine skin texture, remove blemishes, brighten eyes or whiten teeth.","siteName": "Google","heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp","mainElementHint": "article","articleConfidenceScore": 89,"wordCount": 612,"content": "# New touch-up tools...\n\nYour photos should capture how you feel in the moment..."}
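One way to use `articleConfidenceScore` downstream is to gate what reaches your database. The sketch below assumes output rows have already been loaded as Python dicts with the fields listed above. `split_by_confidence` is a hypothetical helper, and the threshold of 70 is an arbitrary illustration, not a value recommended by the actor.

```python
def split_by_confidence(items, min_score=70):
    """Split rows into (accepted, rejected).

    A row is accepted when it has a non-empty `content` body and its
    `articleConfidenceScore` is at least `min_score` (an assumed cutoff).
    """
    accepted, rejected = [], []
    for item in items:
        score = item.get("articleConfidenceScore") or 0
        if item.get("content") and score >= min_score:
            accepted.append(item)
        else:
            rejected.append(item)
    return accepted, rejected


# Illustrative rows; only the first mirrors the documented example.
rows = [
    {"url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
     "articleConfidenceScore": 89, "content": "# New touch-up tools..."},
    {"url": "https://example.com/category/news", "articleConfidenceScore": 35,
     "content": "Latest posts"},
    {"url": "https://example.com/broken", "articleConfidenceScore": None, "content": None},
]

accepted, rejected = split_by_confidence(rows)
for row in accepted:
    print("ingest:", row["url"])
for row in rejected:
    print("review:", row["url"])
```

Keeping the rejected rows instead of dropping them makes it easier to tune the cutoff against your own sources.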
First-run buyer experience
- Run `store-input.example.json` or Quickstart — 3 Public Articles.
- Confirm the actor returns a real article body instead of an error or teaser page.
- Compare `output/result.json` with `sample-output.example.json`.
- Review `articleConfidenceScore`, `excerpt`, `publishedDate`, and `heroImage`.
- Move successful runs to Recurring News Watch or Webhook → Article Desk Handoff.
- For non-article URLs, switch to Website Content Extractor.
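The review step in the checklist above can be scripted as a quick sanity check. This is a rough sketch under assumptions: `review_item` is a hypothetical helper, it treats a field as missing when it is absent, `null`, or an empty string, and it operates on a single output row that has already been parsed into a dict.

```python
REVIEW_FIELDS = ("articleConfidenceScore", "excerpt", "publishedDate", "heroImage")


def review_item(item):
    """Return a list of notes; an empty list means nothing stood out."""
    notes = []
    if not item.get("content"):
        notes.append("no article body returned")
    for field in REVIEW_FIELDS:
        # A score of 0 is a real value, so only None and "" count as missing.
        if item.get(field) in (None, ""):
            notes.append(f"{field} is missing")
    return notes


# Row abridged from the documented output example.
sample = {
    "title": "New touch-up tools in Google Photos' image editor let you make quick, subtle fixes.",
    "author": None,
    "publishedDate": "2026-04-20T17:00:00+00:00",
    "excerpt": "New touch-up tools in Google Photos' image editor let you refine skin texture.",
    "heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
    "articleConfidenceScore": 89,
    "content": "# New touch-up tools...",
}

print(review_item(sample) or "looks complete")
```

`author` is left out of the check on purpose, since the output table marks it as available only on some pages.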
Tips & Limitations
- Best for public newsroom, blog, and article pages.
- Markdown is usually the strongest first-run proof for buyers and the easiest output for downstream LLM pipelines.
- Paywalled pages can still return partial or teaser text.
- HTTP errors are returned as error rows so broken demo URLs do not pretend to be valid article extractions.
FAQ
How is this different from Website Content Extractor?
This actor is tuned for article pages and returns richer article metadata. Website Content Extractor is the better fit for docs, help, pricing, policy, and product pages.
What should I pair it with?
Use Google News Scraper or RSS Feed Aggregator to discover URLs, then send those article URLs here for cleanup.
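Discovery tools tend to return more URLs than a single run accepts, and often with repeats. The sketch below shows one way to prepare them: `batch_urls` is a hypothetical helper that removes exact duplicates and splits the rest into groups of at most 300, the per-run limit from the input table. It does not normalize URLs, so variants that differ only by tracking parameters are treated as distinct.

```python
def batch_urls(urls, batch_size=300):
    """Deduplicate URLs (order preserved) and split them into batches."""
    seen = set()
    unique = []
    for url in urls:
        cleaned = url.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


# Placeholder URLs standing in for discovery results.
discovered = [f"https://example.com/news/{n}" for n in range(650)]
discovered += discovered[:10]  # simulate repeats across feeds

for index, batch in enumerate(batch_urls(discovered), start=1):
    print(f"run {index}: {len(batch)} URLs")
```

Each batch can then be sent as the `urls` field of a separate run.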
Does it work on paywalled sites?
It only extracts what is publicly visible in the fetched HTML.
Related Actors
Content Intelligence Pack handoffs:
- 📰 Google News Scraper — discover fresh article URLs by query
- 📡 RSS Feed Aggregator — discover fresh article URLs from publisher feeds
- 📄 Website Content Extractor — clean non-article docs, pricing, and policy pages
Cost
Pay Per Event:
- `actor-start`: $0.01
- `dataset-item`: $0.005 per output item
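As a worked example of the two prices above, the following sketch estimates the cost of a run. `estimate_cost` is a hypothetical helper. It assumes every output row is billed as one `dataset-item` event, and whether error rows are billed the same way is not stated here.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")    # per run, from the price list above
DATASET_ITEM = Decimal("0.005")  # per output item, from the price list above


def estimate_cost(output_items, runs=1):
    """Estimated cost in USD for `runs` starts producing `output_items` rows."""
    return ACTOR_START * runs + DATASET_ITEM * output_items


for items in (3, 300):
    print(f"{items} items in one run: ${estimate_cost(items)}")
```

`Decimal` is used instead of floats so that small per-item prices add up without rounding noise. With these prices, the three-article quickstart comes to $0.025 and a full 300-URL run to $1.51.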
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store.