Pricing: $19.00/month + usage

Article Extractor & News Scraper

Extract articles from any news site, blog, or webpage. Get the title, full text, author, date, images, and metadata using 7 extraction engines (including Newspaper4k, Trafilatura, and Goose3), with anti-bot bypass, proxy rotation, and automatic fallback. Suited to news monitoring, NLP datasets, and content aggregation.

Rating: 5.0 (2)

Developer: Web Harvester

Last modified: 3 months ago

✨ Features

Core Capabilities

🔍 7 Specialized Extraction Engines — Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
🌐 Universal Website Compatibility — Works with news sites, blogs, magazines, and any article-based content
📊 Complete Content Extraction — Captures title, description, full text, authors, publication date, images, keywords, and metadata
🔄 Smart Fallback System — Automatically tries alternative extractors if the primary one fails

Anti-Blocking Technology

🎭 Browser Fingerprint Generation — Uses browserforge for realistic browser headers
🔀 Proxy Rotation — Automatic proxy rotation with support for residential proxies
⏱️ Intelligent Rate Limiting — Domain-specific delays and concurrency control
☁️ CloudScraper Integration — Bypasses Cloudflare and similar protections
📦 Google Cache Fallback — Retrieves content from Google's cache when direct access fails

Output Options

📝 Plain Text — Clean, extracted article text
🔖 Article HTML — Preserved formatting with links and media
📄 Full Page HTML — Complete webpage source for custom processing
📋 Structured JSON — All metadata in a standardized format

🎯 Use Cases

Industry	Application
Media Monitoring	Track news coverage, brand mentions, and competitor activity
Research & Academia	Collect data for NLP, sentiment analysis, and content studies
Content Aggregation	Build news feeds, curated content platforms, and newsletters
SEO Analysis	Analyze competitor content, keywords, and publishing patterns
Market Intelligence	Monitor industry news, trends, and market developments
Web Archiving	Preserve article content with full metadata
AI/ML Training	Generate training datasets for language models

⚙️ How It Works

graph LR
    A[Input URLs] --> B[Fetch Pages]
    B --> C{Anti-Bot Check}
    C -->|Blocked| D[Rotate Proxy/Headers]
    D --> B
    C -->|Success| E[Extract Content]
    E --> F{Extraction OK?}
    F -->|No| G[Try Fallback Engine]
    G --> E
    F -->|Yes| H[Output JSON]

Input Processing — Accepts a list of article URLs
Smart Fetching — Uses randomized browser headers and proxy rotation
Anti-Bot Evasion — Detects and bypasses blocking with CloudScraper and fingerprint rotation
Content Extraction — Applies the selected extraction engine
Fallback Logic — Automatically tries alternative engines if extraction fails
Output Generation — Returns structured JSON with all extracted data
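The extraction and fallback steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Actor's actual code: `extract_with_fallback` and the `extractors` mapping of plain callables are hypothetical stand-ins for the real engines, and only the output field names (`extractorEngine`, `fallbackUsed`, `originalExtractor`) come from this README.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: an extractor takes HTML and returns the article
# text, or None / an empty string when it cannot find any content.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, primary: str,
                          extractors: Dict[str, Extractor],
                          use_fallback: bool = True) -> dict:
    """Try the primary engine first, then the remaining engines in order."""
    order = [primary]
    if use_fallback:
        order += [name for name in extractors if name != primary]

    for name in order:
        try:
            text = extractors[name](html)
        except Exception:
            text = None  # a crashing or unknown engine counts as a failure
        if text and text.strip():
            result = {
                "text": text.strip(),
                "extractorEngine": name,
                "fallbackUsed": name != primary,
            }
            if name != primary:
                result["originalExtractor"] = primary
            return result
    raise ValueError("no engine could extract content")
```

With `use_fallback=False` only the primary engine is tried, which mirrors what the `useFallbackExtractors` input is described as controlling.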

📊 Extraction Engines Comparison

Engine	Best For	Speed	Metadata	NLP Features	Special Capabilities
Newspaper4k	General news	⚡⚡⚡	✅ Full	✅ Yes	Summary, keywords, NER
Trafilatura	News & blogs	⚡⚡⚡⚡	✅ Full	❌ No	Language detection, categories
Boilerpy3	Simple articles	⚡⚡⚡⚡⚡	⚠️ Basic	❌ No	Text density metrics
News-Please	Rich metadata	⚡⚡	✅ Full	❌ No	Multiple fallback methods
Goose3	Image extraction	⚡⚡⚡	✅ Full	❌ No	Top image detection
Article Parser	HTML/Markdown	⚡⚡⚡	⚠️ Basic	❌ No	Multiple output formats
JusText	Boilerplate removal	⚡⚡⚡⚡	⚠️ Basic	❌ No	Language-aware filtering

🚀 Quick Start

Run on Apify Platform

{
  "startUrls": [
    "https://www.nytimes.com/2024/01/15/technology/ai-developments.html",
    "https://www.theguardian.com/world/2024/jan/15/breaking-news"
  ],
  "extractorEngine": "newspaper4k"
}

Run Locally with Apify CLI

# Install Apify CLI
npm install -g apify-cli

# Pull the Actor's source code and run it locally
apify pull YOUR_USERNAME/article-extractor-news-scraper
cd article-extractor-news-scraper
apify run --input='{"startUrls": ["https://example.com/article"]}'

Call via API

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~article-extractor-news-scraper/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": ["https://www.bbc.com/news/world-12345"],
    "extractorEngine": "newspaper4k"
  }'

📝 Input Configuration

Parameter	Type	Default	Description
`startUrls`	array	required	List of article URLs to extract
`extractorEngine`	string	`newspaper4k`	Extraction engine to use
`useFallbackExtractors`	boolean	`true`	Try alternative engines if primary fails
`saveHtml`	boolean	`false`	Include full page HTML in output
`saveArticleHtml`	boolean	`false`	Include cleaned article HTML
`maxRetries`	integer	`15`	Retry attempts for failed requests
`useHeaderGenerator`	boolean	`true`	Generate realistic browser headers
`headerGeneratorOptions`	object	`{}`	Browser/device emulation settings
`customHeaders`	object	`{}`	Additional HTTP headers
`proxyConfiguration`	object	residential	Proxy settings

Full Input Example

{
  "startUrls": [
    "https://www.nytimes.com/2024/01/15/world/article.html",
    "https://www.theguardian.com/world/2024/jan/15/story",
    "https://www.bbc.com/news/world-12345678"
  ],
  "extractorEngine": "newspaper4k",
  "useFallbackExtractors": true,
  "saveHtml": false,
  "saveArticleHtml": true,
  "maxRetries": 15,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
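When building inputs programmatically, it can help to start from the documented defaults and override only what changes. The sketch below is one way to do that; `build_run_input` and `DEFAULTS` are hypothetical names, and it assumes the `residential` proxy default in the table corresponds to the `proxyConfiguration` object shown in the full example.

```python
import copy
from typing import Any, Dict, Iterable

# Defaults copied from the Input Configuration table. The proxy object is an
# assumption: the table only says "residential".
DEFAULTS: Dict[str, Any] = {
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": True,
    "saveHtml": False,
    "saveArticleHtml": False,
    "maxRetries": 15,
    "useHeaderGenerator": True,
    "headerGeneratorOptions": {},
    "customHeaders": {},
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}


def build_run_input(start_urls: Iterable[str], **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and attach startUrls."""
    urls = list(start_urls)
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    run_input = copy.deepcopy(DEFAULTS)  # never mutate the shared defaults
    run_input.update(overrides)
    run_input["startUrls"] = urls
    return run_input
```

For example, `build_run_input(["https://example.com/article"], maxRetries=5)` keeps every default except the retry count. Rejecting unknown keys catches typos such as `maxRetry` before a run is started.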

📤 Output Format

Each extracted article produces a JSON object with the following fields:

Common Fields (All Engines)

Field	Type	Description
`url`	string	Original article URL
`title`	string	Article headline
`text`	string	Full article text (cleaned)
`sourceDomain`	string	Website domain
`extractorEngine`	string	Engine used for extraction
`extractedAt`	string	ISO 8601 timestamp

Extended Fields (Engine-Dependent)

Field	Type	Available In
`description`	string	newspaper4k, goose3, news-please
`author`	array	newspaper4k, news-please
`publishedDate`	string	newspaper4k, trafilatura, news-please
`image`	string	newspaper4k, goose3, news-please
`keywords`	array	newspaper4k, goose3
`summary`	string	newspaper4k
`language`	string	newspaper4k, trafilatura, justext
`categories`	array	trafilatura
`tags`	array	trafilatura
`allImages`	array	newspaper4k
`metaData`	object	newspaper4k
`siteName`	string	newspaper4k
`favicon`	string	newspaper4k
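Because field availability depends on the engine, the table above can double as a lookup when choosing `extractorEngine`. The following sketch encodes the table as data; `FIELD_ENGINES` and `engines_supporting` are hypothetical helper names, not part of the Actor.

```python
from typing import Dict, Iterable, List, Optional, Set

# Transcribed from the "Extended Fields" table above.
FIELD_ENGINES: Dict[str, Set[str]] = {
    "description": {"newspaper4k", "goose3", "news-please"},
    "author": {"newspaper4k", "news-please"},
    "publishedDate": {"newspaper4k", "trafilatura", "news-please"},
    "image": {"newspaper4k", "goose3", "news-please"},
    "keywords": {"newspaper4k", "goose3"},
    "summary": {"newspaper4k"},
    "language": {"newspaper4k", "trafilatura", "justext"},
    "categories": {"trafilatura"},
    "tags": {"trafilatura"},
    "allImages": {"newspaper4k"},
    "metaData": {"newspaper4k"},
    "siteName": {"newspaper4k"},
    "favicon": {"newspaper4k"},
}


def engines_supporting(fields: Iterable[str]) -> List[str]:
    """Return the engines that provide every requested extended field."""
    candidates: Optional[Set[str]] = None
    for field in fields:
        if field not in FIELD_ENGINES:
            raise KeyError(f"unknown extended field: {field}")
        engines = FIELD_ENGINES[field]
        candidates = set(engines) if candidates is None else candidates & engines
    # No requested fields yields an empty list rather than "all engines".
    return sorted(candidates or [])
```

An empty result, as for `["categories", "summary"]`, means no single engine covers the combination, so the fields would have to come from separate runs.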

Metadata Fields

Field	Type	Description
`fallbackUsed`	boolean	Whether a fallback engine was used
`originalExtractor`	string	Originally requested engine (if fallback used)
`fetchedFromCache`	boolean	Whether content was fetched from Google Cache
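These metadata fields make it straightforward to audit a run after the fact. As a rough sketch, the hypothetical helper below tallies how often fallback engines or the cache were needed, given dataset items shaped like the fields documented above.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_items(items: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count engines used, fallbacks, cache hits, and empty-text results."""
    engines: Counter = Counter()
    fallback = cache = empty = 0
    for item in items:
        engines[item.get("extractorEngine", "unknown")] += 1
        fallback += bool(item.get("fallbackUsed"))
        cache += bool(item.get("fetchedFromCache"))
        # Treat missing, None, or whitespace-only text as empty.
        empty += not (item.get("text") or "").strip()
    return {
        "total": sum(engines.values()),
        "byEngine": dict(engines),
        "fallbackUsed": fallback,
        "fetchedFromCache": cache,
        "emptyText": empty,
    }
```

A high `fallbackUsed` count for one site suggests that a different primary engine may suit it better.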

📋 Example Outputs

🛡️ Anti-Blocking Features

This Actor combines several anti-blocking techniques to reduce the number of failed requests:

Browser Fingerprint Generation

Uses browserforge to generate realistic browser fingerprints including:

Chrome, Firefox, Safari, and Edge user agents
Proper sec-ch-ua client hints
Consistent platform and viewport data
Session-based fingerprint persistence

Proxy Rotation

Automatic proxy rotation on 403/429 errors
Support for residential, datacenter, and custom proxies
Domain-specific proxy strategies
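Rotation on 403/429 responses can be illustrated with a small loop. This is a sketch under assumptions, not the Actor's implementation: `fetch_with_rotation` is a hypothetical name, and `fetch` is an injected callable returning `(status, body)` so the example stays independent of any HTTP library.

```python
from itertools import cycle
from typing import Callable, Sequence, Tuple

BLOCKED_STATUSES = {403, 429}

# Hypothetical fetcher signature: (url, proxy) -> (status_code, body)
Fetch = Callable[[str, str], Tuple[int, str]]


def fetch_with_rotation(url: str, proxies: Sequence[str], fetch: Fetch,
                        max_retries: int = 15) -> Tuple[int, str, str]:
    """Switch to the next proxy whenever the response status signals a block."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    last_status = None
    # One initial attempt plus up to max_retries retries.
    for _ in range(max_retries + 1):
        proxy = next(pool)
        status, body = fetch(url, proxy)
        if status not in BLOCKED_STATUSES:
            return status, body, proxy
        last_status = status
    raise RuntimeError(
        f"still blocked (HTTP {last_status}) after {max_retries} retries")
```

The default of 15 retries matches the documented `maxRetries` default; the round-robin order is an arbitrary choice for the sketch.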

Intelligent Rate Limiting

Per-domain concurrency control
Adaptive delays based on site response
Strict mode for heavily protected sites
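Per-domain delays can be pictured as a small limiter that remembers when each domain was last requested. The class below is a simplified sketch with fixed delays; the adaptive behaviour and strict mode mentioned above are not modelled, and `DomainRateLimiter` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, default_delay: float = 1.0,
                 per_domain: Optional[Dict[str, float]] = None,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.default_delay = default_delay
        self.per_domain = per_domain or {}
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}  # domain -> time of last request

    def wait(self, url: str) -> float:
        """Block until the URL's domain may be requested; return the pause."""
        domain = urlparse(url).netloc.lower()
        delay = self.per_domain.get(domain, self.default_delay)
        now = self._clock()
        last = self._last.get(domain)
        pause = 0.0 if last is None else max(0.0, last + delay - now)
        if pause > 0:
            self._sleep(pause)
        self._last[domain] = now + pause
        return pause
```

Calling `limiter.wait(url)` before each fetch is enough to space out requests. The injectable `clock` and `sleep` exist only to make the class testable without real waiting.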

CloudScraper Integration

Bypasses Cloudflare browser verification
Handles JavaScript challenges
Automatic cookie management

Google Cache Fallback

When direct access fails after all retries, the Actor attempts to retrieve content from Google's cache as a last resort.

⚡ Performance Tips

For Maximum Speed

{
  "extractorEngine": "boilerpy3",
  "maxRetries": 5,
  "useFallbackExtractors": false
}

For Maximum Success Rate

{
  "extractorEngine": "newspaper4k",
  "maxRetries": 15,
  "useFallbackExtractors": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

For Rich Metadata

{
  "extractorEngine": "newspaper4k",
  "saveArticleHtml": true,
  "useFallbackExtractors": true
}

🔌 Integrations

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "startUrls": ["https://www.example.com/article"],
    "extractorEngine": "newspaper4k"
}

# Start the Actor and wait for the run to finish
run = client.actor("YOUR_USERNAME/article-extractor-news-scraper").call(run_input=run_input)

# Read the extracted articles from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item.get('title')}")
    print(f"Text: {(item.get('text') or '')[:200]}...")

JavaScript/Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_USERNAME/article-extractor-news-scraper').call({
    startUrls: ['https://www.example.com/article'],
    extractorEngine: 'newspaper4k'
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`Title: ${item.title}`);
    console.log(`Text: ${item.text.substring(0, 200)}...`);
});

Webhooks

Configure webhooks to receive results automatically:

{
  "webhooks": [
    {
      "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
      "requestUrl": "https://your-server.com/webhook"
    }
  ]
}
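On the receiving side, the webhook request body has to be parsed to find the run's dataset. The sketch below assumes Apify's default payload template, which carries an `eventType` and a `resource` object describing the run; `dataset_id_from_webhook` is a hypothetical helper, and a custom payload template would need different parsing.

```python
import json
from typing import Optional, Union


def dataset_id_from_webhook(body: Union[bytes, str]) -> Optional[str]:
    """Return the default dataset ID for a succeeded run, else None."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return None  # not valid JSON
    if not isinstance(payload, dict):
        return None
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None  # ignore other event types
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")
```

The returned ID can then be passed to `client.dataset(...)` as in the Python integration example to fetch the extracted articles.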

Zapier / Make (Integromat)

Use the Apify integration in Zapier or Make to connect extracted articles to:

Google Sheets
Notion databases
Slack notifications
Email newsletters
CRM systems

🔧 Troubleshooting

Common Issues

Issue	Cause	Solution
Empty text output	Anti-bot blocking	Enable residential proxies, reduce concurrency
403/429 errors	Rate limiting	Increase `maxRetries`
Timeout errors	Slow server response	Increase timeout, try Google Cache
Missing metadata	Engine limitation	Switch to a different extraction engine
Garbled text	Encoding issues	Try trafilatura or newspaper4k
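Several of these fixes amount to re-running the affected URLs with different settings. As a rough sketch, the hypothetical helper below collects URLs whose `text` came back empty and builds a retry input that switches engine and enables residential proxies; the choice of `trafilatura` as the retry engine is only an example.

```python
from typing import Any, Dict, Iterable, List, Optional


def retry_input_for_empty(items: Iterable[Dict[str, Any]],
                          engine: str = "trafilatura") -> Optional[Dict[str, Any]]:
    """Build a new run input for URLs that produced no article text."""
    urls: List[str] = []
    for item in items:
        has_text = bool((item.get("text") or "").strip())
        url = item.get("url")
        if not has_text and url and url not in urls:
            urls.append(url)  # keep order, skip duplicates
    if not urls:
        return None  # nothing to retry
    return {
        "startUrls": urls,
        "extractorEngine": engine,
        "useFallbackExtractors": True,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }
```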

Reporting Issues

If you encounter persistent issues:

Check if the URL works in a regular browser
Try different extraction engines
Open an issue with:
- The problematic URL
- Your input configuration

❓ Frequently Asked Questions

📝 Changelog

v1.0.0 (December 2025)

✨ Initial public release
🔍 7 extraction engines: Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, JusText
🛡️ Advanced anti-blocking with browserforge fingerprinting
🔄 Automatic fallback extraction
☁️ Google Cache fallback for blocked pages
📊 Multiple dataset views (Overview, Content, Metadata)
⚙️ Configurable concurrency and retry settings

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Built with ❤️ for the data extraction community

Keywords: article extractor, news scraper, web scraping, content extraction, newspaper4k, trafilatura, apify actor, python scraper, text extraction, metadata extraction, NLP, news monitoring, content aggregation
