Article/News Extractor

CAPABILITIES: extract_article, extract_metadata, detect_language, clean_text, batch_urls. INPUT: URLs (single or array) of articles/news pages. OUTPUT: structured JSON with title, author, date, content, language, word_count. FORMATS: json, markdown, text. PRICING: PPE $0.001/article.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Bado (Maintained by Community)
Bookmarked: 0
Total users: 2
Monthly active users: 2
Last modified: 20 hours ago

Article & News Extractor

Extract clean, structured article text and metadata from any news site or blog. Built by Tropical Tools — structured data extraction APIs optimized for AI agents.

Feed in URLs and get back title, author, publication date, full article content, language, tags, word count, reading time, and token estimates. The article extractor is purpose-built for RAG pipelines, content analysis, and AI agent workflows that need reliable, clean text from the web.

What Does It Do?

Article & News Extractor takes any article or blog URL and returns structured, clean content stripped of ads, navigation, sidebars, and other noise. It uses readability-based extraction combined with JSON-LD and Schema.org metadata parsing to pull the most accurate data possible from each page.
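The JSON-LD signal mentioned above is worth seeing concretely. Below is a simplified sketch of pulling a page's first JSON-LD block with the standard library; it is not the actor's own parser, and regex-over-HTML is a deliberate simplification that a production extractor would replace with a real HTML parser:

```python
import json
import re

def jsonld_metadata(html: str) -> dict:
    """Extract the first JSON-LD <script> block from a page.
    This is the kind of publisher-embedded metadata the actor reads
    for accurate dates, authors, and categories."""
    match = re.search(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.S,
    )
    return json.loads(match.group(1)) if match else {}

# A minimal page carrying Schema.org Article metadata:
page = '''<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Web scraping", "datePublished": "2024-11-15"}
</script></head></html>'''
meta = jsonld_metadata(page)
```

When both readability extraction and JSON-LD disagree (for example, on the publication date), structured metadata is usually the more reliable source, which is why the actor parses it separately.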

Every extraction returns:

  • Title — The article headline
  • Author — Byline attribution when available
  • Publication date — Parsed and normalized to ISO 8601
  • Content — Full article body, cleaned of HTML cruft
  • Language — Detected language code (ISO 639-1)
  • Tags/categories — Topic tags and section labels from the source
  • Word count — Total words in the extracted content
  • Reading time — Estimated minutes to read
  • Token estimate — Approximate token count for LLM context planning
  • Paywall detection — Flags articles behind paywalls (without bypass)
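The reading-time and token-estimate fields can be approximated from word count alone. The actor's exact formulas are not documented, so the 250 words-per-minute and 0.75 words-per-token constants below are assumptions; expect small deviations from the actor's own output:

```python
import math

WORDS_PER_MINUTE = 250   # assumed average adult reading speed
WORDS_PER_TOKEN = 0.75   # rough English words-to-LLM-tokens ratio (assumption)

def reading_time_minutes(word_count: int) -> int:
    """Estimate minutes to read, rounding up so short pieces show >= 1."""
    return max(1, math.ceil(word_count / WORDS_PER_MINUTE))

def token_estimate(word_count: int) -> int:
    """Rough LLM token count: about 1 token per 0.75 English words."""
    return round(word_count / WORDS_PER_TOKEN)
```

For the sample output later on this page (4,250 words), these heuristics give 17 minutes and roughly 5,700 tokens, in line with the actor's reported fields.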

Why Use This Actor?

Most web scrapers return raw HTML or half-broken text full of navigation elements and ad copy. This news scraper is different:

  • RAG-optimized output — Clean text with metadata fields that map directly to vector DB schemas. No post-processing needed before chunking and embedding.
  • Paywall detection — Identifies paywalled content upfront so your pipeline doesn't ingest truncated articles into your knowledge base.
  • Language detection — Automatic language identification lets you route multilingual content to the right embedding model or translation step.
  • 3 output formats — Get results as structured JSON, clean Markdown, or plain text depending on your downstream needs.
  • Batch processing — Pass hundreds of URLs in a single run. The actor processes them concurrently and returns results as they complete.
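To make the "maps directly to vector DB schemas" claim concrete, here is one way a single extraction result can be reshaped before chunking. The `{text, metadata}` shape follows a common document-store convention, not anything this actor mandates:

```python
def to_vector_record(article: dict) -> dict:
    """Reshape one extractor result into a {text, metadata} document,
    the shape most vector DB ingestion pipelines expect."""
    return {
        "text": article["content"],
        "metadata": {
            "source": article["url"],
            "title": article["title"],
            "author": article.get("author"),
            "published": article.get("publishedDate"),
            "language": article.get("language"),
            "tokens": article.get("tokenEstimate"),
        },
    }

# A trimmed-down extraction result:
sample = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "title": "Web scraping",
    "content": "Web scraping is data scraping used for extracting data from websites.",
    "language": "en",
    "tokenEstimate": 14,
}
record = to_vector_record(sample)
```

Because the actor already separates content from metadata, this mapping is a pure field rename with no text cleaning required.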

Features

  • Readability extraction — Mozilla Readability-based content isolation that strips ads, navigation, footers, and related-article blocks
  • JSON-LD / Schema.org parsing — Extracts structured metadata embedded by publishers for maximum accuracy on dates, authors, and categories
  • Multi-platform support — Tested and optimized for WordPress, Medium, Substack, Ghost, major news outlets (Reuters, AP, BBC, NYT), and thousands of independent blogs
  • Paywall detection — Detects soft and hard paywalls and flags them in output metadata (does not bypass paywalls)
  • Content deduplication — Identifies and removes repeated boilerplate text across batch runs
  • Configurable output — Choose between JSON, Markdown, and plain text formats
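Cross-batch boilerplate deduplication can be sketched as counting verbatim paragraphs across articles and dropping the ones that repeat. This is a crude stand-in for whatever the actor does internally, assuming paragraphs are separated by blank lines:

```python
from collections import Counter

def drop_repeated_boilerplate(articles: list[str], threshold: int = 2) -> list[str]:
    """Remove paragraphs that appear verbatim in `threshold` or more
    articles, e.g. subscription prompts repeated across a batch."""
    counts = Counter(
        para for text in articles for para in set(text.split("\n\n"))
    )
    def clean(text: str) -> str:
        return "\n\n".join(
            para for para in text.split("\n\n") if counts[para] < threshold
        )
    return [clean(text) for text in articles]
```

The `set(...)` inside the counter matters: a paragraph repeated within one article should not count as cross-article boilerplate.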

Input Configuration

Field | Type | Default | Description
urls | array | (required) | List of article URLs to extract
outputFormat | string | "json" | Output format: json, markdown, or text
includeMetadata | boolean | true | Include full metadata (author, date, tags, etc.)
extractStructuredData | boolean | true | Parse JSON-LD and Schema.org data from pages
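It can be worth checking an input object locally before starting a run. A small validation sketch against the fields above; the allowed-format set mirrors the table, while the error messages and function shape are this sketch's own invention:

```python
ALLOWED_FORMATS = {"json", "markdown", "text"}

def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    urls = run_input.get("urls")
    if not isinstance(urls, list) or not urls:
        errors.append("urls is required and must be a non-empty array")
    fmt = run_input.get("outputFormat", "json")
    if fmt not in ALLOWED_FORMATS:
        errors.append(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    for key in ("includeMetadata", "extractStructuredData"):
        if key in run_input and not isinstance(run_input[key], bool):
            errors.append(f"{key} must be a boolean")
    return errors
```

Defaults from the table (`"json"`, `true`, `true`) apply when a field is omitted, so only `urls` is strictly required.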

Example Input

{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://example.com/blog/ai-agents-2025"
  ],
  "outputFormat": "json",
  "includeMetadata": true,
  "extractStructuredData": true
}

Output Example

Extracting a Wikipedia article returns structured data like this:

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "author": "Wikipedia contributors",
  "publishedDate": "2024-11-15T08:22:00Z",
  "content": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser...",
  "language": "en",
  "tags": ["Web scraping", "Data extraction", "Web harvesting"],
  "wordCount": 4250,
  "readingTimeMinutes": 17,
  "tokenEstimate": 5660,
  "paywallDetected": false,
  "structuredData": {
    "@type": "Article",
    "name": "Web scraping",
    "inLanguage": "en",
    "isPartOf": {
      "@type": "WebSite",
      "name": "Wikipedia"
    }
  },
  "extractedAt": "2025-03-15T12:00:00Z"
}
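Since paywalled articles arrive truncated, a pipeline will typically filter on `paywallDetected` (and on empty content) before ingesting anything. A minimal sketch over a list of results in the shape shown above:

```python
def ingestable(results: list[dict]) -> list[dict]:
    """Keep only complete, non-paywalled extractions."""
    return [
        r for r in results
        if not r.get("paywallDetected") and r.get("content")
    ]
```

Filtering at this stage keeps truncated previews out of a knowledge base, where they would otherwise surface as confidently wrong retrieval hits.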

Cost Estimation

Each successfully extracted article costs $0.001 (one-tenth of a cent). Failed extractions (404 errors, unreachable URLs) are not charged.

Volume | Cost | Cost per 1K articles
100 articles | $0.10 | $1.00
1,000 articles | $1.00 | $1.00
10,000 articles | $10.00 | $1.00
100,000 articles | $100.00 | $1.00
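Because only successful extractions are charged, budgeting a run is one multiplication plus an assumed success rate. A sketch (the success-rate parameter is this sketch's addition, not something the actor reports up front):

```python
PRICE_PER_ARTICLE = 0.001  # USD, charged only on successful extraction

def estimate_cost(total_urls: int, success_rate: float = 1.0) -> float:
    """Estimated charge in USD for a batch; failed extractions are free."""
    return round(total_urls * success_rate * PRICE_PER_ARTICLE, 2)
```

For example, 10,000 URLs at a 95% success rate would cost about $9.50 rather than the full $10.00.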

Use Cases

  • RAG / vector DB ingestion — Extract and chunk articles for retrieval-augmented generation pipelines. Clean text with metadata makes embedding and retrieval more accurate.
  • News monitoring — Track coverage across dozens of publications. Feed URLs from RSS feeds or news APIs and get structured, comparable output.
  • Content analysis — Analyze word counts, reading levels, topic tags, and publication patterns across large content sets.
  • AI agent reading — Give your AI agent the ability to read and understand web articles. The token estimate field helps with context window planning.
  • Competitive intelligence — Monitor competitor blogs, press releases, and thought leadership content in structured form.
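For the context-window planning mentioned above, the `tokenEstimate` field lets an agent decide how many articles fit before sending anything to a model. A greedy sketch, where the budget value is up to the caller:

```python
def fit_in_context(articles: list[dict], budget: int) -> list[dict]:
    """Greedily select articles (in input order) until the token budget
    is exhausted, using the actor's tokenEstimate field."""
    selected, used = [], 0
    for article in articles:
        cost = article.get("tokenEstimate", 0)
        if used + cost > budget:
            break
        selected.append(article)
        used += cost
    return selected
```

A smarter packer could reorder by relevance score first; greedy in-order selection is just the simplest policy that respects the budget.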

FAQ

Does it bypass paywalls?

No. The actor detects paywalled content and flags it in the output (paywallDetected: true), but it does not bypass, circumvent, or work around any paywall mechanisms. You will receive whatever content is publicly accessible on the page.

What sites are supported?

The article extractor works with any standard HTML page that contains article content. It has been optimized for WordPress, Medium, Substack, Ghost, and major news platforms including Reuters, AP, BBC, The Guardian, and The New York Times. Sites with heavy JavaScript rendering may require additional processing time.

How does language detection work?

Language is detected using a combination of HTML lang attributes, metadata tags, and content-level analysis. The actor returns an ISO 639-1 language code (e.g., en, es, fr, de, ja). Detection is reliable for widely used languages; accuracy can drop on very short texts and less common languages.

Does it handle AMP pages?

Yes. When an AMP version of a page is detected, the actor extracts content from the AMP markup. If you provide an AMP URL directly, it will be processed normally. The actor prefers canonical (non-AMP) versions when both are available, as they typically contain richer metadata.

Can I use it with RSS feeds?

The actor accepts direct article URLs, not RSS feed URLs. However, it pairs well with RSS feed actors — use an RSS parser to get article URLs, then pass those URLs to this actor for full content extraction. This is a common pattern for news monitoring pipelines.
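The RSS-to-extractor pattern above reduces to pulling `<link>` values out of a feed and passing them as the `urls` input. A standard-library sketch for RSS 2.0 (real feeds may also use Atom, which names its elements differently):

```python
import xml.etree.ElementTree as ET

def article_urls_from_rss(feed_xml: str) -> list[str]:
    """Collect <link> values from an RSS 2.0 feed's <item> elements."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]

# A minimal RSS 2.0 feed:
feed = """<rss version="2.0"><channel>
  <item><link>https://example.com/post-1</link></item>
  <item><link>https://example.com/post-2</link></item>
</channel></rss>"""
urls = article_urls_from_rss(feed)  # these become the actor's "urls" input
```

From here, `{"urls": urls}` is a complete input object for a run, since every other field has a default.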

What happens if a URL returns a 404 or is unreachable?

Failed URLs are reported in the output with an error status and message. They are not charged. The actor continues processing remaining URLs in the batch without stopping.
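Downstream code usually wants the successes and failures separated. The page says failed URLs carry "an error status and message" but does not pin down the field name, so the `"error"` key below is an assumption to verify against real output:

```python
def partition_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (ok, failed) lists.
    NOTE: keying on an "error" field is an assumption; inspect actual
    actor output to confirm the exact error-field name."""
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed
```

Failures can then be logged or retried separately without re-running (or re-paying for) the URLs that already succeeded.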