Article/News Extractor

CAPABILITIES: extract_article, extract_metadata, detect_language, clean_text, batch_urls. INPUT: URLs (single or array) of articles/news pages. OUTPUT: structured JSON with title, author, date, content, language, word_count. FORMATS: json, markdown, text. PRICING: PPE $0.001/article.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Bado (Maintained by Community)
Bookmarked: 0
Total users: 2
Monthly active users: 2
Last modified: 20 hours ago

Article & News Extractor

Extract clean, structured article text and metadata from any news site or blog. Built by Tropical Tools — structured data extraction APIs optimized for AI agents.

Feed in URLs and get back title, author, publication date, full article content, language, tags, word count, reading time, and token estimates. The article extractor is purpose-built for RAG pipelines, content analysis, and AI agent workflows that need reliable, clean text from the web.

What Does It Do?

Article & News Extractor takes any article or blog URL and returns structured, clean content stripped of ads, navigation, sidebars, and other noise. It uses readability-based extraction combined with JSON-LD and Schema.org metadata parsing to pull the most accurate data possible from each page.
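The JSON-LD signal mentioned above is worth seeing concretely. Below is a simplified sketch of pulling a page's first JSON-LD block with the standard library; it is not the actor's own parser, and regex-over-HTML is a deliberate simplification that a production extractor would replace with a real HTML parser:

```python
import json
import re

def jsonld_metadata(html: str) -> dict:
    """Extract the first JSON-LD <script> block from a page.
    This is the kind of publisher-embedded metadata the actor reads
    for accurate dates, authors, and categories."""
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.S,
    )
    return json.loads(match.group(1)) if match else {}

# A minimal page carrying Schema.org Article metadata:
page = '''<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Web scraping", "datePublished": "2024-11-15"}
</script></head></html>'''
meta = jsonld_metadata(page)
```

When both readability extraction and JSON-LD disagree (for example, on the publication date), structured metadata is usually the more reliable source, which is why the actor parses it separately.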

Every extraction returns:

  • Title — The article headline
  • Author — Byline attribution when available
  • Publication date — Parsed and normalized to ISO 8601
  • Content — Full article body, cleaned of HTML cruft
  • Language — Detected language code (ISO 639-1)
  • Tags/categories — Topic tags and section labels from the source
  • Word count — Total words in the extracted content
  • Reading time — Estimated minutes to read
  • Token estimate — Approximate token count for LLM context planning
  • Paywall detection — Flags articles behind paywalls (without bypass)
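The reading-time and token-estimate fields can be approximated from word count alone. The actor's exact formulas are not documented, so the 250 words-per-minute and 0.75 words-per-token constants below are assumptions; expect small deviations from the actor's own output:

```python
import math

WORDS_PER_MINUTE = 250   # assumed average adult reading speed
WORDS_PER_TOKEN = 0.75   # rough English words-to-LLM-tokens ratio (assumption)

def reading_time_minutes(word_count: int) -> int:
    """Estimate minutes to read, rounding up so short pieces show >= 1."""
    return max(1, math.ceil(word_count / WORDS_PER_MINUTE))

def token_estimate(word_count: int) -> int:
    """Rough LLM token count: about 1 token per 0.75 English words."""
    return round(word_count / WORDS_PER_TOKEN)
```

For the sample output later on this page (4,250 words), these heuristics give 17 minutes and roughly 5,700 tokens, in line with the actor's reported fields.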

Why Use This Actor?

Most web scrapers return raw HTML or half-broken text full of navigation elements and ad copy. This news scraper is different:

  • RAG-optimized output — Clean text with metadata fields that map directly to vector DB schemas. No post-processing needed before chunking and embedding.
  • Paywall detection — Identifies paywalled content upfront so your pipeline doesn't ingest truncated articles into your knowledge base.
  • Language detection — Automatic language identification lets you route multilingual content to the right embedding model or translation step.
  • 3 output formats — Get results as structured JSON, clean Markdown, or plain text depending on your downstream needs.
  • Batch processing — Pass hundreds of URLs in a single run. The actor processes them concurrently and returns results as they complete.
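To make the "maps directly to vector DB schemas" claim concrete, here is one way a single extraction result can be reshaped before chunking. The `{text, metadata}` shape follows a common document-store convention, not anything this actor mandates:

```python
def to_vector_record(article: dict) -> dict:
    """Reshape one extractor result into a {text, metadata} document,
    the shape most vector DB ingestion pipelines expect."""
    return {
        "text": article["content"],
        "metadata": {
            "source": article["url"],
            "title": article["title"],
            "author": article.get("author"),
            "published": article.get("publishedDate"),
            "language": article.get("language"),
            "tokens": article.get("tokenEstimate"),
        },
    }

# A trimmed-down extraction result:
sample = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "title": "Web scraping",
    "content": "Web scraping is data scraping used for extracting data from websites.",
    "language": "en",
    "tokenEstimate": 14,
}
record = to_vector_record(sample)
```

Because the actor already separates content from metadata, this mapping is a pure field rename with no text cleaning required.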

Features

  • Readability extraction — Mozilla Readability-based content isolation that strips ads, navigation, footers, and related-article blocks
  • JSON-LD / Schema.org parsing — Extracts structured metadata embedded by publishers for maximum accuracy on dates, authors, and categories
  • Multi-platform support — Tested and optimized for WordPress, Medium, Substack, Ghost, major news outlets (Reuters, AP, BBC, NYT), and thousands of independent blogs
  • Paywall detection — Detects soft and hard paywalls and flags them in output metadata (does not bypass paywalls)
  • Content deduplication — Identifies and removes repeated boilerplate text across batch runs
  • Configurable output — Choose between JSON, Markdown, and plain text formats
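Cross-batch boilerplate deduplication can be sketched as counting verbatim paragraphs across articles and dropping the ones that repeat. This is a crude stand-in for whatever the actor does internally, assuming paragraphs are separated by blank lines:

```python
from collections import Counter

def drop_repeated_boilerplate(articles: list[str], threshold: int = 2) -> list[str]:
    """Remove paragraphs that appear verbatim in `threshold` or more
    articles, e.g. subscription prompts repeated across a batch."""
    counts = Counter(
        para for text in articles for para in set(text.split("\n\n"))
    )
    def clean(text: str) -> str:
        return "\n\n".join(
            para for para in text.split("\n\n") if counts[para] < threshold
        )
    return [clean(text) for text in articles]
```

The `set(...)` inside the counter matters: a paragraph repeated within one article should not count as cross-article boilerplate.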

Input Configuration

Field | Type | Default | Description
urls | array | (required) | List of article URLs to extract
outputFormat | string | "json" | Output format: json, markdown, or text
includeMetadata | boolean | true | Include full metadata (author, date, tags, etc.)
extractStructuredData | boolean | true | Parse JSON-LD and Schema.org data from pages
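It can be worth checking an input object locally before starting a run. A small validation sketch against the fields above; the allowed-format set mirrors the table, while the error messages and function shape are this sketch's own invention:

```python
ALLOWED_FORMATS = {"json", "markdown", "text"}

def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    urls = run_input.get("urls")
    if not isinstance(urls, list) or not urls:
        errors.append("urls is required and must be a non-empty array")
    fmt = run_input.get("outputFormat", "json")
    if fmt not in ALLOWED_FORMATS:
        errors.append(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    for key in ("includeMetadata", "extractStructuredData"):
        if key in run_input and not isinstance(run_input[key], bool):
            errors.append(f"{key} must be a boolean")
    return errors
```

Defaults from the table (`"json"`, `true`, `true`) apply when a field is omitted, so only `urls` is strictly required.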

Example Input

{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://example.com/blog/ai-agents-2025"
  ],
  "outputFormat": "json",
  "includeMetadata": true,
  "extractStructuredData": true
}

Output Example

Extracting a Wikipedia article returns structured data like this:

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "author": "Wikipedia contributors",
  "publishedDate": "2024-11-15T08:22:00Z",
  "content": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser...",
  "language": "en",
  "tags": ["Web scraping", "Data extraction", "Web harvesting"],
  "wordCount": 4250,
  "readingTimeMinutes": 17,
  "tokenEstimate": 5660,
  "paywallDetected": false,
  "structuredData": {
    "@type": "Article",
    "name": "Web scraping",
    "inLanguage": "en",
    "isPartOf": {
      "@type": "WebSite",
      "name": "Wikipedia"
    }
  },
  "extractedAt": "2025-03-15T12:00:00Z"
}
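Since paywalled articles arrive truncated, a pipeline will typically filter on `paywallDetected` (and on empty content) before ingesting anything. A minimal sketch over a list of results in the shape shown above:

```python
def ingestable(results: list[dict]) -> list[dict]:
    """Keep only complete, non-paywalled extractions."""
    return [
        r for r in results
        if not r.get("paywallDetected") and r.get("content")
    ]
```

Filtering at this stage keeps truncated previews out of a knowledge base, where they would otherwise surface as confidently wrong retrieval hits.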

Cost Estimation

Each successfully extracted article costs $0.001 (one-tenth of a cent). Failed extractions (404 errors, unreachable URLs) are not charged.

Volume | Cost | Cost per 1K articles
100 articles | $0.10 | $1.00
1,000 articles | $1.00 | $1.00
10,000 articles | $10.00 | $1.00
100,000 articles | $100.00 | $1.00
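Because only successful extractions are charged, budgeting a run is one multiplication plus an assumed success rate. A sketch (the success-rate parameter is this sketch's addition, not something the actor reports up front):

```python
PRICE_PER_ARTICLE = 0.001  # USD, charged only on successful extraction

def estimate_cost(total_urls: int, success_rate: float = 1.0) -> float:
    """Estimated charge in USD for a batch; failed extractions are free."""
    return round(total_urls * success_rate * PRICE_PER_ARTICLE, 2)
```

For example, 10,000 URLs at a 95% success rate would cost about $9.50 rather than the full $10.00.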

Use Cases

  • RAG / vector DB ingestion — Extract and chunk articles for retrieval-augmented generation pipelines. Clean text with metadata makes embedding and retrieval more accurate.
  • News monitoring — Track coverage across dozens of publications. Feed URLs from RSS feeds or news APIs and get structured, comparable output.
  • Content analysis — Analyze word counts, reading levels, topic tags, and publication patterns across large content sets.
  • AI agent reading — Give your AI agent the ability to read and understand web articles. The token estimate field helps with context window planning.
  • Competitive intelligence — Monitor competitor blogs, press releases, and thought leadership content in structured form.
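For the context-window planning mentioned above, the `tokenEstimate` field lets an agent decide how many articles fit before sending anything to a model. A greedy sketch, where the budget value is up to the caller:

```python
def fit_in_context(articles: list[dict], budget: int) -> list[dict]:
    """Greedily select articles (in input order) until the token budget
    is exhausted, using the actor's tokenEstimate field."""
    selected, used = [], 0
    for article in articles:
        cost = article.get("tokenEstimate", 0)
        if used + cost > budget:
            break
        selected.append(article)
        used += cost
    return selected
```

A smarter packer could reorder by relevance score first; greedy in-order selection is just the simplest policy that respects the budget.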

FAQ

Does it bypass paywalls?

No. The actor detects paywalled content and flags it in the output (paywallDetected: true), but it does not bypass, circumvent, or work around any paywall mechanisms. You will receive whatever content is publicly accessible on the page.

What sites are supported?

The article extractor works with any standard HTML page that contains article content. It has been optimized for WordPress, Medium, Substack, Ghost, and major news platforms including Reuters, AP, BBC, The Guardian, and The New York Times. Sites with heavy JavaScript rendering may require additional processing time.

How does language detection work?

Language is detected using a combination of HTML lang attributes, metadata tags, and content-level analysis. The actor returns an ISO 639-1 language code (e.g., en, es, fr, de, ja). Detection is reliable for widely used languages; accuracy can drop on very short texts and less common languages.

Does it handle AMP pages?

Yes. When an AMP version of a page is detected, the actor extracts content from the AMP markup. If you provide an AMP URL directly, it will be processed normally. The actor prefers canonical (non-AMP) versions when both are available, as they typically contain richer metadata.

Can I use it with RSS feeds?

The actor accepts direct article URLs, not RSS feed URLs. However, it pairs well with RSS feed actors — use an RSS parser to get article URLs, then pass those URLs to this actor for full content extraction. This is a common pattern for news monitoring pipelines.
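The RSS-to-extractor pattern above reduces to pulling `<link>` values out of a feed and passing them as the `urls` input. A standard-library sketch for RSS 2.0 (real feeds may also use Atom, which names its elements differently):

```python
import xml.etree.ElementTree as ET

def article_urls_from_rss(feed_xml: str) -> list[str]:
    """Collect <link> values from an RSS 2.0 feed's <item> elements."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]

# A minimal RSS 2.0 feed:
feed = """<rss version="2.0"><channel>
  <item><link>https://example.com/post-1</link></item>
  <item><link>https://example.com/post-2</link></item>
</channel></rss>"""
urls = article_urls_from_rss(feed)  # these become the actor's "urls" input
```

From here, `{"urls": urls}` is a complete input object for a run, since every other field has a default.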

What happens if a URL returns a 404 or is unreachable?

Failed URLs are reported in the output with an error status and message. They are not charged. The actor continues processing remaining URLs in the batch without stopping.
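Downstream code usually wants the successes and failures separated. The page says failed URLs carry "an error status and message" but does not pin down the field name, so the `"error"` key below is an assumption to verify against real output:

```python
def partition_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (ok, failed) lists.
    NOTE: keying on an "error" field is an assumption; inspect actual
    actor output to confirm the exact error-field name."""
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed
```

Failures can then be logged or retried separately without re-running (or re-paying for) the URLs that already succeeded.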