Article Extractor & News Scraper

Pricing: $19.00/month + usage

Extract articles from any news site, blog, or webpage. Get title, full text, author, date, images & metadata using 7 extraction engines (Newspaper4k, Trafilatura, Goose3). Anti-bot bypass, proxy rotation, automatic fallback. Perfect for news monitoring, NLP datasets & content aggregation.

Rating: 5.0 (2)
Developer: Web Harvester (maintained by community)
Actor stats: 3 bookmarks, 23 total users, 0 monthly active users, last modified 3 days ago

Apify Actor Python 3.12 License: MIT

Extract articles, news content, and blog posts from any website. Get clean, structured data with title, full text, authors, publication date, images, keywords, and metadata, powered by 7 specialized extraction engines.

🔗 Run this Actor on Apify | 📖 API Documentation


✨ Features

Core Capabilities

  • 🔍 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
  • 🌐 Universal Website Compatibility: Works with news sites, blogs, magazines, and any article-based content
  • 📊 Complete Content Extraction: Captures title, description, full text, authors, publication date, images, keywords, and metadata
  • 🔄 Smart Fallback System: Automatically tries alternative extractors if the primary one fails

Anti-Blocking Technology

  • 🎭 Browser Fingerprint Generation β€” Uses browserforge for realistic browser headers
  • πŸ”€ Proxy Rotation β€” Automatic proxy rotation with support for residential proxies
  • ⏱️ Intelligent Rate Limiting β€” Domain-specific delays and concurrency control
  • ☁️ CloudScraper Integration β€” Bypasses Cloudflare and similar protections
  • πŸ“¦ Google Cache Fallback β€” Retrieves content from Google's cache when direct access fails

Output Options

  • 📝 Plain Text: Clean, extracted article text
  • 🔖 Article HTML: Preserved formatting with links and media
  • 📄 Full Page HTML: Complete webpage source for custom processing
  • 📋 Structured JSON: All metadata in a standardized format

🎯 Use Cases

IndustryApplication
Media MonitoringTrack news coverage, brand mentions, and competitor activity
Research & AcademiaCollect data for NLP, sentiment analysis, and content studies
Content AggregationBuild news feeds, curated content platforms, and newsletters
SEO AnalysisAnalyze competitor content, keywords, and publishing patterns
Market IntelligenceMonitor industry news, trends, and market developments
Web ArchivingPreserve article content with full metadata
AI/ML TrainingGenerate training datasets for language models

⚙️ How It Works

graph LR
A[Input URLs] --> B[Fetch Pages]
B --> C{Anti-Bot Check}
C -->|Blocked| D[Rotate Proxy/Headers]
D --> B
C -->|Success| E[Extract Content]
E --> F{Extraction OK?}
F -->|No| G[Try Fallback Engine]
G --> E
F -->|Yes| H[Output JSON]

  1. Input Processing: Accepts a list of article URLs
  2. Smart Fetching: Uses randomized browser headers and proxy rotation
  3. Anti-Bot Evasion: Detects and bypasses blocking with CloudScraper and fingerprint rotation
  4. Content Extraction: Applies the selected extraction engine
  5. Fallback Logic: Automatically tries alternative engines if extraction fails
  6. Output Generation: Returns structured JSON with all extracted data
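
The fallback step above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's actual code: the `Extractor` callable type, the `MIN_TEXT_LENGTH` threshold, and the `extract_with_fallback` helper are all assumptions made for the example, and the two engines are stand-in lambdas rather than real library calls.

```python
from typing import Callable, Dict, Optional

# Hypothetical engine interface: takes raw HTML, returns extracted text
# (or None / "" when nothing usable was found). Not the Actor's real API.
Extractor = Callable[[str], Optional[str]]

MIN_TEXT_LENGTH = 200  # assumed threshold for the "Extraction OK?" check


def extract_with_fallback(
    html: str,
    primary: str,
    engines: Dict[str, Extractor],
    use_fallback: bool = True,
) -> Optional[dict]:
    """Run the primary engine, then the remaining engines in dict order."""
    order = [primary]
    if use_fallback:
        order += [name for name in engines if name != primary]

    for name in order:
        try:
            text = engines[name](html)
        except Exception:
            text = None  # treat a crashing or unknown engine as a failed extraction
        if text and len(text.strip()) >= MIN_TEXT_LENGTH:
            used_fallback = name != primary
            return {
                "text": text.strip(),
                "extractorEngine": name,
                "fallbackUsed": used_fallback,
                "originalExtractor": primary if used_fallback else None,
            }
    return None  # every engine failed


# Stand-in engines; real ones would wrap newspaper4k, trafilatura, and so on.
engines = {
    "newspaper4k": lambda html: "",             # simulates a failed extraction
    "trafilatura": lambda html: "word " * 100,  # simulates a usable extraction
}
result = extract_with_fallback("<html>...</html>", "newspaper4k", engines)
print(result["extractorEngine"], result["fallbackUsed"])
```

The returned keys mirror the `fallbackUsed` and `originalExtractor` output fields, so a caller can tell which engine actually produced the text.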

📊 Extraction Engines Comparison

| Engine | Best For | Speed | Metadata | NLP Features | Special Capabilities |
|---|---|---|---|---|---|
| Newspaper4k | General news | ⚡⚡⚡ | ✅ Full | ✅ Yes | Summary, keywords, NER |
| Trafilatura | News & blogs | ⚡⚡⚡⚡ | ✅ Full | ❌ No | Language detection, categories |
| Boilerpy3 | Simple articles | ⚡⚡⚡⚡⚡ | ⚠️ Basic | ❌ No | Text density metrics |
| News-Please | Rich metadata | ⚡⚡ | ✅ Full | ❌ No | Multiple fallback methods |
| Goose3 | Image extraction | ⚡⚡⚡ | ✅ Full | ❌ No | Top image detection |
| Article Parser | HTML/Markdown | ⚡⚡⚡ | ⚠️ Basic | ❌ No | Multiple output formats |
| JusText | Boilerplate removal | ⚡⚡⚡⚡ | ⚠️ Basic | ❌ No | Language-aware filtering |

  • 📰 News Sites → Newspaper4k or Trafilatura
  • 📝 Blog Posts → Trafilatura or Goose3
  • 📚 Long-form Articles → Newspaper4k (with NLP for summarization)
  • 🖼️ Image-heavy Content → Goose3
  • ⚡ High-volume Scraping → Boilerpy3 or Trafilatura
  • 🔤 Non-English Content → JusText (40+ languages supported)
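
If you pick the engine programmatically, the recommendations above fit in a small lookup table. The sketch below is illustrative only: the content-type keys and the `pick_engine` helper are made up for this example, and the engine identifiers are assumed to match the lowercase names used elsewhere in this README.

```python
# Recommendations from the list above, in order of preference.
# The content-type keys are hypothetical labels chosen for this example.
RECOMMENDED_ENGINES = {
    "news": ["newspaper4k", "trafilatura"],
    "blog": ["trafilatura", "goose3"],
    "long_form": ["newspaper4k"],
    "image_heavy": ["goose3"],
    "high_volume": ["boilerpy3", "trafilatura"],
    "non_english": ["justext"],
}


def pick_engine(content_type: str, default: str = "newspaper4k") -> str:
    """Return the first recommended engine, or the default for unknown types."""
    return RECOMMENDED_ENGINES.get(content_type, [default])[0]


run_input = {
    "startUrls": ["https://example.com/article"],
    "extractorEngine": pick_engine("blog"),
}
print(run_input["extractorEngine"])
```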

🚀 Quick Start

Run on Apify Platform

{
    "startUrls": [
        "https://www.nytimes.com/2024/01/15/technology/ai-developments.html",
        "https://www.theguardian.com/world/2024/jan/15/breaking-news"
    ],
    "extractorEngine": "newspaper4k"
}

Run Locally with Apify CLI

# Install Apify CLI
npm install -g apify-cli
# Clone and run
apify pull article-extractor-news-scraper
cd article-extractor-news-scraper
apify run --input='{"startUrls": ["https://example.com/article"]}'

Call via API

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~article-extractor-news-scraper/runs" \
    -H "Authorization: Bearer YOUR_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "startUrls": ["https://www.bbc.com/news/world-12345"],
        "extractorEngine": "newspaper4k"
    }'

📝 Input Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | List of article URLs to extract |
| extractorEngine | string | newspaper4k | Extraction engine to use |
| useFallbackExtractors | boolean | true | Try alternative engines if primary fails |
| saveHtml | boolean | false | Include full page HTML in output |
| saveArticleHtml | boolean | false | Include cleaned article HTML |
| maxRetries | integer | 15 | Retry attempts for failed requests |
| useHeaderGenerator | boolean | true | Generate realistic browser headers |
| headerGeneratorOptions | object | {} | Browser/device emulation settings |
| customHeaders | object | {} | Additional HTTP headers |
| proxyConfiguration | object | residential | Proxy settings |

Full Input Example

{
    "startUrls": [
        "https://www.nytimes.com/2024/01/15/world/article.html",
        "https://www.theguardian.com/world/2024/jan/15/story",
        "https://www.bbc.com/news/world-12345678"
    ],
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": true,
    "saveHtml": false,
    "saveArticleHtml": true,
    "maxRetries": 15,
    "useHeaderGenerator": true,
    "headerGeneratorOptions": {
        "browsers": ["chrome", "firefox", "safari", "edge"],
        "devices": ["desktop"]
    },
    "customHeaders": {},
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}
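
When the input is assembled in code, it can help to apply the documented defaults and catch obvious mistakes before starting a run. The following is a rough sketch under stated assumptions: `build_input` and `DEFAULTS` are hypothetical helpers for this README, the defaults are copied from the parameter table, and the Actor performs its own validation regardless.

```python
import copy
from urllib.parse import urlparse

# Defaults copied from the parameter table; proxyConfiguration is left to the
# caller because the table only describes its default as "residential".
DEFAULTS = {
    "extractorEngine": "newspaper4k",
    "useFallbackExtractors": True,
    "saveHtml": False,
    "saveArticleHtml": False,
    "maxRetries": 15,
    "useHeaderGenerator": True,
    "headerGeneratorOptions": {},
    "customHeaders": {},
}


def build_input(start_urls, **overrides):
    """Merge overrides into the documented defaults and sanity-check the URLs."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    for url in start_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    unknown = set(overrides) - set(DEFAULTS) - {"proxyConfiguration"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # deepcopy so callers never share the mutable default dicts
    return {"startUrls": list(start_urls), **copy.deepcopy(DEFAULTS), **overrides}


config = build_input(
    ["https://example.com/article"],
    maxRetries=5,
    proxyConfiguration={"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]},
)
print(config["extractorEngine"], config["maxRetries"])
```

Rejecting unknown keys is a design choice for the sketch: a misspelled parameter such as `maxRetry` would otherwise be silently ignored.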

📤 Output Format

Each extracted article produces a JSON object with the following fields:

Common Fields (All Engines)

| Field | Type | Description |
|---|---|---|
| url | string | Original article URL |
| title | string | Article headline |
| text | string | Full article text (cleaned) |
| sourceDomain | string | Website domain |
| extractorEngine | string | Engine used for extraction |
| extractedAt | string | ISO 8601 timestamp |

Extended Fields (Engine-Dependent)

| Field | Type | Available In |
|---|---|---|
| description | string | newspaper4k, goose3, news-please |
| author | array | newspaper4k, news-please |
| publishedDate | string | newspaper4k, trafilatura, news-please |
| image | string | newspaper4k, goose3, news-please |
| keywords | array | newspaper4k, goose3 |
| summary | string | newspaper4k |
| language | string | newspaper4k, trafilatura, justext |
| categories | array | trafilatura |
| tags | array | trafilatura |
| allImages | array | newspaper4k |
| metaData | object | newspaper4k |
| siteName | string | newspaper4k |
| favicon | string | newspaper4k |

Metadata Fields

| Field | Type | Description |
|---|---|---|
| fallbackUsed | boolean | Whether a fallback engine was used |
| originalExtractor | string | Originally requested engine (if fallback used) |
| fetchedFromCache | boolean | Whether content was fetched from Google Cache |
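
Because the extended fields depend on the engine, downstream code should read them defensively. Here is a small sketch of how a consumer might inspect one dataset item. The `check_item` helper and the sample item are invented for illustration; only the field names come from the tables above.

```python
COMMON_FIELDS = ("url", "title", "text", "sourceDomain", "extractorEngine", "extractedAt")


def check_item(item: dict) -> dict:
    """Summarize one output item without assuming engine-dependent fields exist."""
    fallback = bool(item.get("fallbackUsed"))
    return {
        "url": item.get("url"),
        "missingCommonFields": [f for f in COMMON_FIELDS if not item.get(f)],
        "engineUsed": item.get("extractorEngine"),
        # originalExtractor is only meaningful when a fallback engine was used
        "engineRequested": item.get("originalExtractor") if fallback else item.get("extractorEngine"),
        "fromCache": bool(item.get("fetchedFromCache")),
        "authors": item.get("author") or [],  # extended field, may be absent
    }


# Made-up item for illustration; real values come from the Actor's dataset.
sample_item = {
    "url": "https://example.com/article",
    "title": "Example headline",
    "text": "Example body text.",
    "sourceDomain": "example.com",
    "extractorEngine": "trafilatura",
    "extractedAt": "2025-12-01T12:00:00Z",
    "fallbackUsed": True,
    "originalExtractor": "newspaper4k",
    "fetchedFromCache": False,
}
report = check_item(sample_item)
print(report["engineUsed"], report["engineRequested"], report["missingCommonFields"])
```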

📋 Example Outputs


🛡️ Anti-Blocking Features

This Actor includes advanced anti-blocking technology to maximize success rates:

Browser Fingerprint Generation

Uses browserforge to generate realistic browser fingerprints including:

  • Chrome, Firefox, Safari, and Edge user agents
  • Proper sec-ch-ua client hints
  • Consistent platform and viewport data
  • Session-based fingerprint persistence

Proxy Rotation

  • Automatic proxy rotation on 403/429 errors
  • Support for residential, datacenter, and custom proxies
  • Domain-specific proxy strategies
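
The idea of rotating on 403/429 responses can be illustrated with a short, library-free sketch. It is not the Actor's implementation: the `fetch` callable, the proxy names, and `fetch_with_rotation` itself are hypothetical, and a real version would plug in an HTTP client and the configured proxy pool.

```python
from itertools import cycle
from typing import Callable, Sequence, Tuple

BLOCK_STATUSES = {403, 429}

# Hypothetical fetch callable: (url, proxy) -> (status_code, body).
Fetch = Callable[[str, str], Tuple[int, str]]


def fetch_with_rotation(url: str, proxies: Sequence[str], fetch: Fetch, max_retries: int = 15):
    """Retry a request, switching to the next proxy whenever it is blocked."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    proxy = next(pool)
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        status, body = fetch(url, proxy)
        if status not in BLOCK_STATUSES:
            return status, body, proxy
        proxy = next(pool)  # blocked: move on to the next proxy
    raise RuntimeError(f"still blocked after {max_retries} retries: {url}")


def fake_fetch(url: str, proxy: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP call: the first proxy is always blocked."""
    return (403, "") if proxy == "proxy-a" else (200, "<html>ok</html>")


status, body, used = fetch_with_rotation(
    "https://example.com/article", ["proxy-a", "proxy-b"], fake_fetch
)
print(status, used)
```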

Intelligent Rate Limiting

  • Per-domain concurrency control
  • Adaptive delays based on site response
  • Strict mode for heavily protected sites
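
Per-domain adaptive delays could look roughly like the sketch below. The class name, the doubling and halving policy, and the default values are assumptions for illustration; the Actor's real limits and its strict mode are not documented here. The clock and sleep functions are injectable so the example runs instantly.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Keeps a separate delay per domain and adapts it to the responses seen."""

    def __init__(self, base_delay=1.0, max_delay=60.0, clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._delay = {}  # domain -> current delay in seconds
        self._last = {}   # domain -> time of the last request

    def wait(self, url):
        """Block until the domain's current delay has passed since its last request."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(domain)
        if last is not None:
            remaining = last + self._delay.get(domain, self.base_delay) - now
            if remaining > 0:
                self._sleep(remaining)
                now += remaining
        self._last[domain] = now

    def record(self, url, status):
        """Double the delay after a block, halve it (down to the base) otherwise."""
        domain = urlparse(url).netloc
        current = self._delay.get(domain, self.base_delay)
        if status in (403, 429):
            self._delay[domain] = min(current * 2, self.max_delay)
        else:
            self._delay[domain] = max(current / 2, self.base_delay)


# Demo with a frozen clock and a recording sleep, so nothing actually waits.
slept = []
limiter = DomainRateLimiter(base_delay=1.0, clock=lambda: 0.0, sleep=slept.append)
limiter.wait("https://example.com/a")         # first request: no wait
limiter.record("https://example.com/a", 429)  # blocked: delay doubles to 2.0
limiter.wait("https://example.com/b")         # same domain: waits 2.0 seconds
print(slept)
```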

CloudScraper Integration

  • Bypasses Cloudflare browser verification
  • Handles JavaScript challenges
  • Automatic cookie management

Google Cache Fallback

When direct access fails after all retries, the Actor attempts to retrieve content from Google's cache as a last resort.


⚡ Performance Tips

For Maximum Speed

{
    "extractorEngine": "boilerpy3",
    "maxRetries": 5,
    "useFallbackExtractors": false
}

For Maximum Success Rate

{
    "extractorEngine": "newspaper4k",
    "maxRetries": 15,
    "useFallbackExtractors": true,
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}

For Rich Metadata

{
    "extractorEngine": "newspaper4k",
    "saveArticleHtml": true,
    "useFallbackExtractors": true
}

🔌 Integrations

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

run_input = {
    "startUrls": ["https://www.example.com/article"],
    "extractorEngine": "newspaper4k",
}

# Start the Actor and wait for the run to finish
run = client.actor("YOUR_USERNAME/article-extractor-news-scraper").call(run_input=run_input)

# Iterate over the extracted articles in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Text: {item['text'][:200]}...")

JavaScript/Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_USERNAME/article-extractor-news-scraper').call({
    startUrls: ['https://www.example.com/article'],
    extractorEngine: 'newspaper4k',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(`Title: ${item.title}`);
    console.log(`Text: ${item.text.substring(0, 200)}...`);
});

Webhooks

Configure webhooks to receive results automatically:

{
    "webhooks": [
        {
            "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
            "requestUrl": "https://your-server.com/webhook"
        }
    ]
}
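
On the receiving side, the webhook request body has to be parsed to find the dataset that holds the results. The sketch below assumes Apify's default payload template, in which the run object appears under `resource` and carries `defaultDatasetId`; if you configure a custom payload template, the shape will differ. The helper name and the sample IDs are made up.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: bytes) -> Optional[str]:
    """Return the run's default dataset ID for a succeeded run, else None."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    # Assumes the default payload template, where `resource` is the run object.
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")


# Made-up payload with placeholder IDs, shaped like the assumed default template.
sample_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "ACTOR_ID", "actorRunId": "RUN_ID"},
    "resource": {"id": "RUN_ID", "status": "SUCCEEDED", "defaultDatasetId": "DATASET_ID"},
}).encode("utf-8")

print(dataset_id_from_webhook(sample_body))
```

The returned ID can then be passed to `client.dataset(...)` exactly as in the Python integration example.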

Zapier / Make (Integromat)

Use the Apify integration in Zapier or Make to connect extracted articles to:

  • Google Sheets
  • Notion databases
  • Slack notifications
  • Email newsletters
  • CRM systems

🔧 Troubleshooting

Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Empty text output | Anti-bot blocking | Enable residential proxies, reduce concurrency |
| 403/429 errors | Rate limiting | Increase maxRetries |
| Timeout errors | Slow server response | Increase timeout, try Google Cache |
| Missing metadata | Engine limitation | Switch to a different extraction engine |
| Garbled text | Encoding issues | Try trafilatura or newspaper4k |
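
Several of these fixes can be applied automatically after a run. As one possible approach, the sketch below collects URLs whose `text` came back empty and builds a follow-up input that uses residential proxies and a different engine. The `plan_retry` helper and the sample items are hypothetical, and choosing trafilatura for the second pass is only one option from the table.

```python
from typing import List, Optional


def plan_retry(items: List[dict], retry_engine: str = "trafilatura") -> Optional[dict]:
    """Build a second-pass input for items with empty text, or None if all succeeded."""
    failed = [item["url"] for item in items if not (item.get("text") or "").strip()]
    if not failed:
        return None
    return {
        "startUrls": failed,
        "extractorEngine": retry_engine,
        "useFallbackExtractors": True,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


# Made-up dataset items for illustration.
items = [
    {"url": "https://example.com/ok", "text": "Some extracted text."},
    {"url": "https://example.com/blocked", "text": ""},
    {"url": "https://example.com/missing"},
]
retry_input = plan_retry(items)
print(retry_input["startUrls"])
```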

Reporting Issues

If you encounter persistent issues:

  1. Check if the URL works in a regular browser
  2. Try different extraction engines
  3. Open an issue with:
    • The problematic URL
    • Your input configuration

❓ Frequently Asked Questions


📝 Changelog

v1.0.0 (December 2025)

  • ✨ Initial public release
  • 🔍 7 extraction engines: Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, JusText
  • 🛡️ Advanced anti-blocking with browserforge fingerprinting
  • 🔄 Automatic fallback extraction
  • ☁️ Google Cache fallback for blocked pages
  • 📊 Multiple dataset views (Overview, Content, Metadata)
  • ⚙️ Configurable concurrency and retry settings

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Built with ❤️ for the data extraction community

Keywords: article extractor, news scraper, web scraping, content extraction, newspaper4k, trafilatura, apify actor, python scraper, text extraction, metadata extraction, NLP, news monitoring, content aggregation