Article Extractor & News Scraper
Pricing
$19.00/month + usage
Extract articles from any news site, blog, or webpage. Get title, full text, author, date, images, and metadata using 7 extraction engines (including Newspaper4k, Trafilatura, and Goose3). Anti-bot bypass, proxy rotation, and automatic fallback. Suited to news monitoring, NLP datasets, and content aggregation.
Developer
Web Harvester
Extract articles, news content, and blog posts from any website. Get clean, structured data with title, full text, authors, publication date, images, keywords, and metadata, powered by 7 specialized extraction engines.
Table of Contents
- Features
- Use Cases
- How It Works
- Extraction Engines Comparison
- Quick Start
- Input Configuration
- Output Format
- Example Outputs
- Anti-Blocking Features
- Performance Tips
- Integrations
- Troubleshooting
- FAQ
- Changelog
Features
Core Capabilities
- 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
- Universal Website Compatibility: Works with news sites, blogs, magazines, and any article-based content
- Complete Content Extraction: Captures title, description, full text, authors, publication date, images, keywords, and metadata
- Smart Fallback System: Automatically tries alternative extractors if the primary one fails
Anti-Blocking Technology
- Browser Fingerprint Generation: Uses browserforge for realistic browser headers
- Proxy Rotation: Automatic proxy rotation with support for residential proxies
- Intelligent Rate Limiting: Domain-specific delays and concurrency control
- CloudScraper Integration: Bypasses Cloudflare and similar protections
- Google Cache Fallback: Retrieves content from Google's cache when direct access fails
Output Options
- Plain Text: Clean, extracted article text
- Article HTML: Preserved formatting with links and media
- Full Page HTML: Complete webpage source for custom processing
- Structured JSON: All metadata in a standardized format
Use Cases
| Industry | Application |
|---|---|
| Media Monitoring | Track news coverage, brand mentions, and competitor activity |
| Research & Academia | Collect data for NLP, sentiment analysis, and content studies |
| Content Aggregation | Build news feeds, curated content platforms, and newsletters |
| SEO Analysis | Analyze competitor content, keywords, and publishing patterns |
| Market Intelligence | Monitor industry news, trends, and market developments |
| Web Archiving | Preserve article content with full metadata |
| AI/ML Training | Generate training datasets for language models |
How It Works
```mermaid
graph LR
    A[Input URLs] --> B[Fetch Pages]
    B --> C{Anti-Bot Check}
    C -->|Blocked| D[Rotate Proxy/Headers]
    D --> B
    C -->|Success| E[Extract Content]
    E --> F{Extraction OK?}
    F -->|No| G[Try Fallback Engine]
    G --> E
    F -->|Yes| H[Output JSON]
```
- Input Processing: Accepts a list of article URLs
- Smart Fetching: Uses randomized browser headers and proxy rotation
- Anti-Bot Evasion: Detects and bypasses blocking with CloudScraper and fingerprint rotation
- Content Extraction: Applies the selected extraction engine
- Fallback Logic: Automatically tries alternative engines if extraction fails
- Output Generation: Returns structured JSON with all extracted data
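The extraction and fallback steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's actual code: the `Extractor` signature, the `extract_with_fallback` helper, and the stand-in engines are all hypothetical. Only the output field names (`extractorEngine`, `fallbackUsed`, `originalExtractor`) come from this README, and the shape of the result when every engine fails is an assumption.

```python
from typing import Callable, Optional

# Hypothetical extractor signature: takes raw HTML, returns article text or None.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, primary: str, extractors: dict[str, Extractor]) -> dict:
    """Try the primary engine first, then every other registered engine in order."""
    order = [primary] + [name for name in extractors if name != primary]
    for name in order:
        try:
            text = extractors[name](html)
        except Exception:
            text = None  # a crashing or unknown engine counts as a failed extraction
        if text and text.strip():
            used_fallback = name != primary
            return {
                "text": text.strip(),
                "extractorEngine": name,
                "fallbackUsed": used_fallback,
                "originalExtractor": primary if used_fallback else None,
            }
    # Assumed shape for the "every engine failed" case.
    return {"text": "", "extractorEngine": None, "fallbackUsed": False, "originalExtractor": None}


# Stand-in engines for demonstration only; real engines parse the HTML.
demo_extractors = {
    "newspaper4k": lambda html: None,            # simulates a failed extraction
    "trafilatura": lambda html: "Article body",  # simulates a success
}
result = extract_with_fallback("<html>...</html>", "newspaper4k", demo_extractors)
print(result["extractorEngine"], result["fallbackUsed"])
```

Injecting the engines as plain callables keeps the loop independent of any particular extraction library.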
Extraction Engines Comparison
| Engine | Best For | Speed | Metadata | NLP Features | Special Capabilities |
|---|---|---|---|---|---|
| Newspaper4k | General news | ⚡⚡⚡ | Full | Yes | Summary, keywords, NER |
| Trafilatura | News & blogs | ⚡⚡⚡⚡ | Full | No | Language detection, categories |
| Boilerpy3 | Simple articles | ⚡⚡⚡⚡⚡ | Basic | No | Text density metrics |
| News-Please | Rich metadata | ⚡⚡ | Full | No | Multiple fallback methods |
| Goose3 | Image extraction | ⚡⚡⚡ | Full | No | Top image detection |
| Article Parser | HTML/Markdown | ⚡⚡⚡ | Basic | No | Multiple output formats |
| JusText | Boilerplate removal | ⚡⚡⚡⚡ | Basic | No | Language-aware filtering |
Recommended Engines by Content Type
- News Sites: Newspaper4k or Trafilatura
- Blog Posts: Trafilatura or Goose3
- Long-form Articles: Newspaper4k (with NLP for summarization)
- Image-heavy Content: Goose3
- High-volume Scraping: Boilerpy3 or Trafilatura
- Non-English Content: JusText (40+ languages supported)
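As a rough sketch, the recommendations above can be encoded as a lookup table that picks the `extractorEngine` value for a run. The content-type keys and the `build_input` helper are hypothetical names introduced here. The engine identifiers are the lowercase names used elsewhere in this README, and the first engine in each list is treated as the preferred one.

```python
# Mapping copied from the recommendations above; list order reflects the order given there.
RECOMMENDED_ENGINES = {
    "news": ["newspaper4k", "trafilatura"],
    "blog": ["trafilatura", "goose3"],
    "long-form": ["newspaper4k"],
    "image-heavy": ["goose3"],
    "high-volume": ["boilerpy3", "trafilatura"],
    "non-english": ["justext"],
}


def build_input(urls, content_type="news"):
    """Return an Actor input using the first recommended engine for the content type."""
    # Unknown content types fall back to the Actor's documented default engine.
    engines = RECOMMENDED_ENGINES.get(content_type, ["newspaper4k"])
    return {"startUrls": list(urls), "extractorEngine": engines[0]}


print(build_input(["https://example.com/post"], content_type="blog"))
```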
Quick Start
Run on Apify Platform
```json
{
  "startUrls": [
    "https://www.nytimes.com/2024/01/15/technology/ai-developments.html",
    "https://www.theguardian.com/world/2024/jan/15/breaking-news"
  ],
  "extractorEngine": "newspaper4k"
}
```
Run Locally with Apify CLI
```bash
# Install Apify CLI
npm install -g apify-cli

# Clone and run
apify pull article-extractor-news-scraper
cd article-extractor-news-scraper
apify run --input='{"startUrls": ["https://example.com/article"]}'
```
Call via API
```bash
curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~article-extractor-news-scraper/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": ["https://www.bbc.com/news/world-12345"], "extractorEngine": "newspaper4k"}'
```
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | List of article URLs to extract |
| `extractorEngine` | string | `newspaper4k` | Extraction engine to use |
| `useFallbackExtractors` | boolean | `true` | Try alternative engines if primary fails |
| `saveHtml` | boolean | `false` | Include full page HTML in output |
| `saveArticleHtml` | boolean | `false` | Include cleaned article HTML |
| `maxRetries` | integer | `15` | Retry attempts for failed requests |
| `useHeaderGenerator` | boolean | `true` | Generate realistic browser headers |
| `headerGeneratorOptions` | object | `{}` | Browser/device emulation settings |
| `customHeaders` | object | `{}` | Additional HTTP headers |
| `proxyConfiguration` | object | residential | Proxy settings |
Full Input Example
```json
{
  "startUrls": [
    "https://www.nytimes.com/2024/01/15/world/article.html",
    "https://www.theguardian.com/world/2024/jan/15/story",
    "https://www.bbc.com/news/world-12345678"
  ],
  "extractorEngine": "newspaper4k",
  "useFallbackExtractors": true,
  "saveHtml": false,
  "saveArticleHtml": true,
  "maxRetries": 15,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
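When inputs are generated by a script, it can help to merge overrides into the documented defaults and reject obvious mistakes before starting a run. The sketch below assumes the defaults listed in the parameter table. The `make_run_input` helper and its validation rules are hypothetical and are not part of the Actor.

```python
import copy
from urllib.parse import urlparse

# Defaults taken from the parameter table above.
DEFAULTS = {
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": True,
    "saveHtml": False,
    "saveArticleHtml": False,
    "maxRetries": 15,
    "useHeaderGenerator": True,
    "headerGeneratorOptions": {},
    "customHeaders": {},
}


def make_run_input(start_urls, **overrides):
    """Merge overrides into the documented defaults and validate the URL list."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    for url in start_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
    # proxyConfiguration is accepted but has no default here, so the Actor's own default applies.
    unknown = set(overrides) - set(DEFAULTS) - {"proxyConfiguration"}
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return {"startUrls": list(start_urls), **copy.deepcopy(DEFAULTS), **overrides}


run_input = make_run_input(["https://example.com/article"], maxRetries=5)
print(run_input["extractorEngine"], run_input["maxRetries"])
```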
Output Format
Each extracted article produces a JSON object with the following fields:
Common Fields (All Engines)
| Field | Type | Description |
|---|---|---|
| `url` | string | Original article URL |
| `title` | string | Article headline |
| `text` | string | Full article text (cleaned) |
| `sourceDomain` | string | Website domain |
| `extractorEngine` | string | Engine used for extraction |
| `extractedAt` | string | ISO 8601 timestamp |
Extended Fields (Engine-Dependent)
| Field | Type | Available In |
|---|---|---|
| `description` | string | newspaper4k, goose3, news-please |
| `author` | array | newspaper4k, news-please |
| `publishedDate` | string | newspaper4k, trafilatura, news-please |
| `image` | string | newspaper4k, goose3, news-please |
| `keywords` | array | newspaper4k, goose3 |
| `summary` | string | newspaper4k |
| `language` | string | newspaper4k, trafilatura, justext |
| `categories` | array | trafilatura |
| `tags` | array | trafilatura |
| `allImages` | array | newspaper4k |
| `metaData` | object | newspaper4k |
| `siteName` | string | newspaper4k |
| `favicon` | string | newspaper4k |
Metadata Fields
| Field | Type | Description |
|---|---|---|
| `fallbackUsed` | boolean | Whether a fallback engine was used |
| `originalExtractor` | string | Originally requested engine (if fallback used) |
| `fetchedFromCache` | boolean | Whether content was fetched from Google Cache |
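Because the extended fields depend on the engine, consumer code should not assume they are present. The following sketch shows one way to read a dataset item defensively. The `Article` dataclass and `parse_item` helper are hypothetical, and the sample item is made up purely to exercise the code. Only the field names come from the tables above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Article:
    url: str
    title: str
    text: str
    engine: str
    authors: list = field(default_factory=list)
    published: Optional[str] = None
    fallback_used: bool = False


def parse_item(item: dict) -> Article:
    """Build an Article from one dataset item, tolerating missing extended fields."""
    return Article(
        url=item["url"],
        title=item.get("title", ""),
        text=item.get("text", ""),
        engine=item.get("extractorEngine", ""),
        authors=list(item.get("author") or []),   # engine-dependent field
        published=item.get("publishedDate"),      # engine-dependent field
        fallback_used=bool(item.get("fallbackUsed", False)),
    )


# Made-up item for illustration; it carries no author or publishedDate.
sample = {
    "url": "https://example.com/article",
    "title": "Example headline",
    "text": "Example body text.",
    "extractorEngine": "boilerpy3",
}
article = parse_item(sample)
print(article.engine, article.authors, article.published)
```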
Example Outputs
Anti-Blocking Features
This Actor includes advanced anti-blocking technology to maximize success rates:
Browser Fingerprint Generation
Uses browserforge to generate realistic browser fingerprints including:
- Chrome, Firefox, Safari, and Edge user agents
- Proper `sec-ch-ua` client hints
- Consistent platform and viewport data
- Session-based fingerprint persistence
Proxy Rotation
- Automatic proxy rotation on 403/429 errors
- Support for residential, datacenter, and custom proxies
- Domain-specific proxy strategies
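A minimal sketch of rotating proxies on 403/429 responses follows. It is illustrative only: the `fetch_with_rotation` helper, the injected `fetch` callable, and the proxy names are assumptions, and the Actor's real rotation logic is not shown in this README. The transport is passed in as a function so the example runs without network access.

```python
import itertools
from typing import Callable, Iterable

BLOCKED_STATUSES = {403, 429}


def fetch_with_rotation(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], tuple[int, str]],
    max_retries: int = 15,
) -> tuple[int, str, str]:
    """Call fetch(url, proxy), switching to the next proxy after a 403 or 429.

    Returns (status, body, proxy_used) for the last attempt made.
    """
    pool = itertools.cycle(proxies)  # the proxy list must not be empty
    proxy = next(pool)
    status, body = fetch(url, proxy)
    for _ in range(max_retries):
        if status not in BLOCKED_STATUSES:
            break
        proxy = next(pool)  # rotate only when the response looks like a block
        status, body = fetch(url, proxy)
    return status, body, proxy


# Fake transport for demonstration: the first proxy is always blocked.
def fake_fetch(url: str, proxy: str) -> tuple[int, str]:
    return (403, "") if proxy == "proxy-a" else (200, "<html>ok</html>")


outcome = fetch_with_rotation("https://example.com/article", ["proxy-a", "proxy-b"], fake_fetch)
print(outcome)
```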
Intelligent Rate Limiting
- Per-domain concurrency control
- Adaptive delays based on site response
- Strict mode for heavily protected sites
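Per-domain delays can be sketched with a small limiter that remembers when each domain was last requested. This is an assumption about how such a limiter might look, not the Actor's implementation. The `DomainRateLimiter` class and its `min_delay` parameter are hypothetical, and the clock and sleep functions are injectable so the demo does not actually pause.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's domain may be requested again; return seconds waited."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self._last_request[domain] = now + delay
        return delay


# Frozen clock and no-op sleep so the demo runs instantly.
limiter = DomainRateLimiter(min_delay=2.0, clock=lambda: 100.0, sleep=lambda seconds: None)
print(limiter.wait("https://example.com/a"))  # first request to the domain: no wait
print(limiter.wait("https://example.com/b"))  # same domain again: waits the full delay
print(limiter.wait("https://other.org/x"))    # different domain: no wait
```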
CloudScraper Integration
- Bypasses Cloudflare browser verification
- Handles JavaScript challenges
- Automatic cookie management
Google Cache Fallback
When direct access fails after all retries, the Actor attempts to retrieve content from Google's cache as a last resort.
Performance Tips
For Maximum Speed
```json
{
  "extractorEngine": "boilerpy3",
  "maxRetries": 5,
  "useFallbackExtractors": false
}
```
For Maximum Success Rate
```json
{
  "extractorEngine": "newspaper4k",
  "maxRetries": 15,
  "useFallbackExtractors": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
For Rich Metadata
```json
{
  "extractorEngine": "newspaper4k",
  "saveArticleHtml": true,
  "useFallbackExtractors": true
}
```
Integrations
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "startUrls": ["https://www.example.com/article"],
    "extractorEngine": "newspaper4k",
}

# Start the Actor and wait for the run to finish
run = client.actor("YOUR_USERNAME/article-extractor-news-scraper").call(run_input=run_input)

# Read the extracted articles from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Text: {item['text'][:200]}...")
```
JavaScript/Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('YOUR_USERNAME/article-extractor-news-scraper').call({
    startUrls: ['https://www.example.com/article'],
    extractorEngine: 'newspaper4k',
});

// Read the extracted articles from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(`Title: ${item.title}`);
    console.log(`Text: ${item.text.substring(0, 200)}...`);
});
```
Webhooks
Configure webhooks to receive results automatically:
```json
{
  "webhooks": [
    {
      "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
      "requestUrl": "https://your-server.com/webhook"
    }
  ]
}
```
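On the receiving side, the webhook handler mainly needs the dataset ID of the finished run. The sketch below assumes Apify's default webhook payload, in which `eventType` names the event and `resource` holds the run object, including `defaultDatasetId`. If a custom payload template is configured, the keys will differ. The `dataset_id_from_webhook` helper and the sample body are hypothetical.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: bytes) -> Optional[str]:
    """Return the run's default dataset ID for a succeeded run, else None."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    # Assumes the default payload template, where "resource" is the run object.
    return (payload.get("resource") or {}).get("defaultDatasetId")


# Made-up request body shaped like the assumed default payload.
sample_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"id": "run-id-placeholder", "defaultDatasetId": "dataset-id-placeholder"},
}).encode()
print(dataset_id_from_webhook(sample_body))
```

The returned ID can then be passed to `client.dataset(...)` as in the Python integration example.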
Zapier / Make (Integromat)
Use the Apify integration in Zapier or Make to connect extracted articles to:
- Google Sheets
- Notion databases
- Slack notifications
- Email newsletters
- CRM systems
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Empty text output | Anti-bot blocking | Enable residential proxies, reduce concurrency |
| 403/429 errors | Rate limiting | Increase maxRetries |
| Timeout errors | Slow server response | Increase timeout, try Google Cache |
| Missing metadata | Engine limitation | Switch to a different extraction engine |
| Garbled text | Encoding issues | Try trafilatura or newspaper4k |
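Several of the fixes above amount to re-running the weak results with different settings. One way to automate that is sketched below, assuming dataset items shaped like the Output Format section. The `urls_to_retry` and `retry_input` helpers and the 200-character threshold are hypothetical choices, not Actor behavior.

```python
def urls_to_retry(items, min_chars=200):
    """Return URLs whose extracted text is missing or shorter than min_chars."""
    return [
        item["url"]
        for item in items
        if len((item.get("text") or "").strip()) < min_chars
    ]


def retry_input(items, engine="trafilatura"):
    """Build a second run input that targets only the weak results."""
    return {
        "startUrls": urls_to_retry(items),
        "extractorEngine": engine,  # switch engines, as the table suggests
        "useFallbackExtractors": True,
    }


# Made-up results: one empty extraction and one healthy one.
results = [
    {"url": "https://example.com/a", "text": ""},
    {"url": "https://example.com/b", "text": "x" * 500},
]
print(retry_input(results))
```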
Reporting Issues
If you encounter persistent issues:
- Check if the URL works in a regular browser
- Try different extraction engines
- Open an issue with:
  - The problematic URL
  - Your input configuration
Frequently Asked Questions
Changelog
v1.0.0 (December 2025)
- Initial public release
- 7 extraction engines: Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, JusText
- Advanced anti-blocking with browserforge fingerprinting
- Automatic fallback extraction
- Google Cache fallback for blocked pages
- Multiple dataset views (Overview, Content, Metadata)
- Configurable concurrency and retry settings
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Built with ❤️ for the data extraction community
Keywords: article extractor, news scraper, web scraping, content extraction, newspaper4k, trafilatura, apify actor, python scraper, text extraction, metadata extraction, NLP, news monitoring, content aggregation