📄 Article Content Scraper & Extractor

Scrape clean article bodies, authors, and metadata from messy newsrooms. Built for AI models, NLP datasets, and SEO audits requiring pristine text.

Developer: 太郎 山田 (maintained by Community)

📰 Article Content Extractor

Extract clean article text, author bylines, and core metadata from newsroom, blog, and PR pages with the Article Content Scraper & Extractor. Generic web scrapers often pull in footer text, sidebars, and navigation links; this tool prioritizes article containers to isolate the actual article body. It is aimed at engineers and SEO professionals who need clean text without writing custom selectors for every new target website.

If you are building an AI application, training large language models (LLMs), or populating a retrieval-augmented generation (RAG) database, raw HTML is not enough. You need the article body, the hero image, and the original publish date to preserve context and temporal accuracy. This scraper covers that workflow: feed it URLs discovered through Google News or RSS feeds to automate a continuous pipeline of clean text extraction.

Use cases range from competitor blog tracking to daily industry news aggregation. Every run delivers structured output with the article headline, the cleaned article text, the author's byline when available, an excerpt, and a heuristic article confidence score (articleConfidenceScore, 0-100). Use the score to keep failed pages and category indexes out of your database, so that downstream systems ingest only rows that look like real articles.

Store Quickstart

  • Start with store-input.example.json or Quickstart — 3 Public Articles for a reliable first run.
  • Then use the upgrade ladder from store-input.templates.json:
    1. Quickstart — 3 Public Articles
    2. Recurring News Watch
    3. Webhook → Article Desk Handoff
  • Side presets stay available for job-specific lanes: Competitor Blog Watch and Google News → Article Cleanup.
  • Buyer-facing proof assets live in sample-output.example.json and live-proof.example.json.

Which actor should I use?

| Surface | Best for |
| --- | --- |
| Article Content Extractor | News stories, blogs, newsroom posts, press releases |
| Website Content Extractor | Docs, pricing, product, policy, help-center, and broad website pages |
| Google News Scraper | Find fresh article URLs by query |
| RSS Feed Aggregator | Find fresh article URLs from known publishers |

Key Features

  • 📰 Article-first extraction — Prioritizes article containers and article metadata
  • 👤 Rich metadata — Captures author, publish date, site name, excerpt, and hero image when available
  • 📊 Buyer-trust signal — Returns articleConfidenceScore for quick validation
  • 🖼️ Image-aware — Keeps hero and inline images for downstream review
  • Fast HTTP-only workflow — Good for public article pages that do not need a browser

Use Cases

| Who | Why |
| --- | --- |
| PR / comms teams | Clean article mentions for newsroom monitoring |
| Competitive intelligence | Turn blog watchlists into clean, comparable text |
| Research teams | Collect article-grade datasets with richer metadata |
| AI / RAG builders | Ingest article pages without writing custom parsers |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | required | Article/news/blog URLs (max 300) |
| outputFormat | string | markdown | text or markdown |
| includeImages | boolean | true | Include hero and inline image URLs |
| concurrency | integer | 5 | Parallel fetches |
| timeoutMs | integer | 15000 | Per-article timeout |
| delivery | string | dataset | dataset or webhook |
| webhookUrl | string | - | Webhook target when delivery=webhook |
| dryRun | boolean | false | Write only local output for validation |

Input Example

{
  "urls": [
    "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
    "https://blog.google/products-and-platforms/products/search/summer-travel-tips-google-search-ai/",
    "https://blog.google/products-and-platforms/products/travel/2026-travel-trends/"
  ],
  "outputFormat": "markdown",
  "includeImages": true,
  "concurrency": 3
}
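
The constraints in the input table can be checked before a run is started. The following is a minimal sketch, not part of the actor itself: `build_input` and its parameter names are hypothetical, the http(s) check is an added assumption, and only the field names, defaults, and the 300-URL limit come from the table above.

```python
from urllib.parse import urlparse

MAX_URLS = 300                       # limit stated in the input table
OUTPUT_FORMATS = {"text", "markdown"}
DELIVERY_MODES = {"dataset", "webhook"}


def build_input(urls, output_format="markdown", include_images=True,
                concurrency=5, timeout_ms=15000, delivery="dataset",
                webhook_url=None, dry_run=False):
    """Return an actor input dict, raising ValueError on invalid values.

    Hypothetical helper: the defaults mirror the input table.
    """
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if len(urls) > MAX_URLS:
        raise ValueError(f"urls accepts at most {MAX_URLS} entries, got {len(urls)}")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are useful to an HTTP-only fetcher.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(OUTPUT_FORMATS)}")
    if delivery not in DELIVERY_MODES:
        raise ValueError(f"delivery must be one of {sorted(DELIVERY_MODES)}")
    if delivery == "webhook" and not webhook_url:
        raise ValueError("webhookUrl is needed when delivery is 'webhook'")

    run_input = {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeImages": include_images,
        "concurrency": concurrency,
        "timeoutMs": timeout_ms,
        "delivery": delivery,
        "dryRun": dry_run,
    }
    if webhook_url:
        run_input["webhookUrl"] = webhook_url
    return run_input
```

Calling `build_input(urls, concurrency=3)` with the three URLs above yields a dict equivalent to the input example, with the remaining fields filled in from the documented defaults.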

Output

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source article URL |
| title | string | Headline |
| author | string | Byline when available |
| publishedDate | string | Publish timestamp when available |
| excerpt | string | Description or lead-text excerpt |
| siteName | string | Publisher / site name |
| heroImage | string | Hero image URL when available |
| images | array | Inline and hero image objects |
| mainElementHint | string | HTML container used for extraction |
| articleConfidenceScore | integer | Heuristic confidence score from 0-100 |
| content | string | Clean article body |
| contentLength | integer | Character length of the body |
| wordCount | integer | Body word count |

Output Example

{
  "url": "https://blog.google/products-and-platforms/products/photos/new-touch-up-tools-google-photos/",
  "title": "New touch-up tools in Google Photos’ image editor let you make quick, subtle fixes.",
  "author": null,
  "publishedDate": "2026-04-20T17:00:00+00:00",
  "excerpt": "New touch-up tools in Google Photos’ image editor let you refine skin texture, remove blemishes, brighten eyes or whiten teeth.",
  "siteName": "Google",
  "heroImage": "https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Static_header.max-600x600.format-webp.webp",
  "mainElementHint": "article",
  "articleConfidenceScore": 89,
  "wordCount": 612,
  "content": "# New touch-up tools...\n\nYour photos should capture how you feel in the moment..."
}
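
Downstream code will usually want to separate usable articles from teaser pages, category indexes, and error rows. The sketch below shows one way to do that with the documented `articleConfidenceScore`, `wordCount`, and `content` fields. The thresholds and the `split_rows` helper are hypothetical and should be tuned against your own runs; the shape of error rows is not documented here, so any row missing these fields is simply rejected.

```python
MIN_CONFIDENCE = 70   # hypothetical threshold on the 0-100 score
MIN_WORDS = 150       # hypothetical floor to catch teaser or paywall stubs


def split_rows(rows, min_confidence=MIN_CONFIDENCE, min_words=MIN_WORDS):
    """Split output rows into (accepted, rejected) lists.

    A row is accepted only if it has a body, a confidence score at or
    above min_confidence, and at least min_words words.
    """
    accepted, rejected = [], []
    for row in rows:
        score = row.get("articleConfidenceScore")
        words = row.get("wordCount")
        content = row.get("content")
        usable = (
            isinstance(score, int) and score >= min_confidence
            and isinstance(words, int) and words >= min_words
            and bool(content)
        )
        (accepted if usable else rejected).append(row)
    return accepted, rejected
```

With these thresholds the output example above (score 89, 612 words) is accepted, while a row that carries only a `url` lands in the rejected list for manual review.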

First-run buyer experience

  1. Run store-input.example.json or Quickstart — 3 Public Articles.
  2. Confirm the actor returns a real article body instead of an error or teaser page.
  3. Compare output/result.json with sample-output.example.json.
  4. Review articleConfidenceScore, excerpt, publishedDate, and heroImage.
  5. Move successful runs to Recurring News Watch or Webhook → Article Desk Handoff.
  6. For non-article URLs, switch to Website Content Extractor.

Tips & Limitations

  • Best for public newsroom, blog, and article pages.
  • Markdown is usually the strongest first-run proof for buyers and the easiest output for downstream LLM pipelines.
  • Paywalled pages can still return partial or teaser text.
  • HTTP errors are returned as error rows so broken demo URLs do not pretend to be valid article extractions.

FAQ

How is this different from Website Content Extractor?

This actor is tuned for article pages and returns richer article metadata. Website Content Extractor is the better fit for docs, help, pricing, policy, and product pages.

What should I pair it with?

Use Google News Scraper or RSS Feed Aggregator to discover URLs, then send those article URLs here for cleanup.
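
Discovery sources tend to return duplicates and can produce more URLs than one run accepts. A minimal sketch of the handoff step, assuming the 300-URL limit from the input table; `batch_urls` is a hypothetical helper and deduplication here is by exact string match after trimming whitespace:

```python
def batch_urls(urls, batch_size=300):
    """Deduplicate URLs (keeping first-seen order) and split them into batches.

    batch_size defaults to the documented maximum of 300 URLs per run.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    # Slice the deduplicated list into consecutive chunks of batch_size.
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Each returned batch can then be used as the `urls` field of one run.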

Does it work on paywalled sites?

It only extracts what is publicly visible in the fetched HTML.

Content Intelligence Pack handoffs:

Cost

Pay Per Event:

  • actor-start: $0.01
  • dataset-item: $0.005 per output item
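
The two event prices make run costs easy to estimate. The sketch below assumes one actor-start event per run and one dataset-item event per output row, and it covers only the two events listed above; whether error rows are billed as dataset items is not stated here, so treat the result as an upper-bound style estimate. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.01")    # per run, from the price list above
DATASET_ITEM_USD = Decimal("0.005")  # per output item, from the price list above


def estimate_cost(items_per_run, runs=1):
    """Estimate the cost in USD of `runs` runs with `items_per_run` items each."""
    if items_per_run < 0 or runs < 0:
        raise ValueError("items_per_run and runs must be non-negative")
    return runs * (ACTOR_START_USD + DATASET_ITEM_USD * items_per_run)


print(estimate_cost(3))  # → 0.025
```

Under the same assumptions, a full 300-URL run comes to about $1.51, and a daily 3-article watch over 30 days comes to about $0.75.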

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store.