2 hours trial then $25.00/month - No credit card required now

Ultimate Article Extractor

web.harvester/ultimate-article-extractor

A modular web scraping tool that extracts content from webpages, articles, and news sites. It returns clean, structured data using a choice of extraction engines, browser-like header generation to reduce bot detection, and proxy support.

Developer

Web Harvester

Actor Metrics

1 monthly user
5.0 / 5 (2)
2 bookmarks
Created in Mar 2025
Modified a day ago

Ultimate Article Extractor: Advanced Scraping & Content Extraction Tool

Overview

Ultimate Article Extractor offers a choice of seven extraction engines for pulling the main content out of a webpage. It is designed for data scientists, researchers, journalists, and developers who need to analyze web content at scale.

Perfect for:

Content aggregation
News monitoring
Research data collection
SEO analysis
Topic modeling and NLP projects
Web archiving
Market intelligence

Key Features

7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
Universal Website Compatibility: Works with any website regardless of structure or layout
Complete Content Extraction: Captures title, description, full text, authors, publication date, images, and metadata
Smart Fallback System: Automatically tries alternative extraction methods if the primary one fails
Advanced Header Generation: Generates realistic browser headers for configurable browsers and devices to reduce the chance of being flagged by anti-bot measures
Proxy Support: Integrates with residential proxies to prevent IP blocking
Domain-Specific Rate Limiting: Automatically manages request rates per domain to avoid detection
Customizable Output: Save article HTML, full page HTML, plaintext, or structured JSON
Parallel Processing: Process multiple URLs concurrently with optimized resource usage
State Persistence: Handles interruptions gracefully by saving progress
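
The Smart Fallback System listed above can be pictured with a short sketch. This is not the Actor's actual code: the `Extractor` signature, the `extract_with_fallback` helper, and the three stand-in engines are all hypothetical, and the rule "a result counts only if it has non-empty text" is an assumption made for illustration.

```python
from typing import Callable, Optional

# Hypothetical extractor signature: takes raw HTML and returns a dict of
# fields, returns None/empty on failure, or raises.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(html: str, extractors: list[tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors = {}
    for name, extractor in extractors:
        try:
            result = extractor(html)
        except Exception as exc:  # one failing engine should not stop the chain
            errors[name] = repr(exc)
            continue
        if result and result.get("text"):
            # Record which engine actually produced the output.
            return {**result, "extractorEngine": name}
        errors[name] = "empty result"
    raise RuntimeError(f"all extractors failed: {errors}")


# Stand-in engines for demonstration only (not real library calls).
def broken_engine(html: str) -> dict:
    raise ValueError("parse error")


def empty_engine(html: str) -> dict:
    return {"text": ""}


def simple_engine(html: str) -> dict:
    return {"text": html.strip()}


result = extract_with_fallback(
    "  hello  ",
    [("a", broken_engine), ("b", empty_engine), ("c", simple_engine)],
)
print(result)
```

Keeping the per-engine errors makes it possible to report why every engine failed when none of them returns text.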

Extraction Methods Compared

Extractor	Best For	Key Strengths	Output Fields
Newspaper4k	General news articles	NLP capabilities, metadata extraction	Title, text, authors, publish date, keywords, summary
Trafilatura	News & blog content	Optimized for news, metadata support	Title, text, author, date, language, categories, tags
Boilerpy3	Simple article extraction	Fast, efficient text extraction	Title, text, text density metrics
News-Please	Comprehensive extraction	Rich metadata, fallback capabilities	Title, text, authors, publish date, language, images
Goose3	Article content & images	Image extraction, metadata support	Title, text, authors, images, keywords
Article Parser	HTML & markdown output	Multiple output formats	Title, HTML content, markdown content
JusText	Boilerplate removal	Focuses on main content	Text, paragraphs count, language
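
One way to use the comparison table is to pick an engine by the fields you need. The sketch below transcribes the "Output Fields" column into a dictionary; the field labels are the table's prose labels (lowercased), not the JSON keys the Actor emits, and `engines_providing` is a hypothetical helper.

```python
# Output fields per engine, transcribed from the comparison table above.
ENGINE_FIELDS = {
    "newspaper4k": {"title", "text", "authors", "publish date", "keywords", "summary"},
    "trafilatura": {"title", "text", "author", "date", "language", "categories", "tags"},
    "boilerpy3": {"title", "text", "text density metrics"},
    "news-please": {"title", "text", "authors", "publish date", "language", "images"},
    "goose3": {"title", "text", "authors", "images", "keywords"},
    "article-parser": {"title", "html content", "markdown content"},
    "justext": {"text", "paragraphs count", "language"},
}


def engines_providing(required: set[str]) -> list[str]:
    """Return engines whose listed output fields include every required field."""
    return [name for name, fields in ENGINE_FIELDS.items() if required <= fields]


print(engines_providing({"title", "images"}))
print(engines_providing({"keywords"}))
```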

Input Configuration

The application accepts the following input parameters:

{
  "startUrls": [
    "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"
  ],
  "extractorEngine": "newspaper4k",
  "saveHtml": false,
  "saveArticleHtml": false,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  },
  "maxRetries": 15
}
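
To send this input from a script, a minimal sketch could look like the following. It assumes the third-party `apify-client` Python package is installed, that the Actor is addressable by the listing slug `web.harvester/ultimate-article-extractor`, and that `call()` returns a dict containing `defaultDatasetId` (as in apify-client 1.x). The helpers `build_run_input` and `run_actor` are hypothetical names, and `run_actor` needs a valid API token and network access, so it is defined but not called here.

```python
import json

# Assumption: the Actor ID matches the slug shown in the listing.
ACTOR_ID = "web.harvester/ultimate-article-extractor"


def build_run_input(urls, engine="newspaper4k", max_retries=15):
    """Assemble an input object with the same keys as the JSON example above."""
    return {
        "startUrls": list(urls),
        "extractorEngine": engine,
        "saveHtml": False,
        "saveArticleHtml": False,
        "useHeaderGenerator": True,
        "headerGeneratorOptions": {
            "browsers": ["chrome", "firefox", "safari", "edge"],
            "devices": ["desktop"],
        },
        "customHeaders": {},
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
        "maxRetries": max_retries,
    }


def run_actor(token, run_input):
    """Start the Actor and return its dataset items (needs network and a token)."""
    from apify_client import ApifyClient  # lazy import: third-party package

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(
    ["https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"]
)
print(json.dumps(run_input, indent=2))
```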

Input Parameters Explained

startUrls (required): Array of article URLs to extract content from
extractorEngine (optional): Choose your preferred extraction library:
- newspaper4k - Best all-around extractor with NLP capabilities (default)
- trafilatura - Optimized for news content
- boilerpy3 - Fast and efficient text extraction
- news-please - Rich metadata extraction
- goose3 - Good for extracting images and article content
- article-parser - Supports multiple output formats
- justext - Focused on boilerplate removal
saveHtml (optional): When true, saves the complete HTML of the webpage
saveArticleHtml (optional): When true, saves the extracted article HTML (for supported extractors)
useHeaderGenerator (optional): When true, generates realistic browser-like request headers to reduce bot detection
headerGeneratorOptions (optional): Configure which browsers and devices to emulate
customHeaders (optional): Set custom HTTP headers for requests
proxyConfiguration (optional): Configure proxy settings to avoid IP blocking
maxRetries (optional): Maximum number of retry attempts for failed requests (default: 15)
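
The rules in this list can be checked before a run is started. The sketch below is a hypothetical client-side validator, not part of the Actor: it enforces only what the list states (`startUrls` is required, `extractorEngine` must be one of the seven names and defaults to `newspaper4k`, `maxRetries` defaults to 15). Rejecting negative retry counts is an added assumption.

```python
# Engine names exactly as listed under extractorEngine above.
ENGINES = {
    "newspaper4k", "trafilatura", "boilerpy3", "news-please",
    "goose3", "article-parser", "justext",
}


def normalize_input(raw: dict) -> dict:
    """Validate the input and fill in the two defaults the list above states."""
    urls = raw.get("startUrls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("startUrls is required and must be a non-empty array")

    engine = raw.get("extractorEngine", "newspaper4k")
    if engine not in ENGINES:
        raise ValueError(f"unknown extractorEngine: {engine!r}")

    max_retries = raw.get("maxRetries", 15)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_retries, bool) or not isinstance(max_retries, int) or max_retries < 0:
        raise ValueError("maxRetries must be a non-negative integer")

    # Other optional keys are passed through untouched.
    return {**raw, "extractorEngine": engine, "maxRetries": max_retries}


print(normalize_input({"startUrls": ["https://example.com/article"]}))
```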

Example Outputs by Extractor

Newspaper4k Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "The authorities said there was no immediate indication of foul play in the substation fire...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["airport", "heathrow", "power outage", "london"],
  "summary": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "extractorEngine": "newspaper4k"
}

Trafilatura Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "categories": ["world", "europe"],
  "tags": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "trafilatura"
}

Boilerpy3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure - The New York Times",
  "text": "SKIP ADVERTISEMENT\nFlights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "textDensity": 0.85,
  "markupToTextRatio": 0.32,
  "extractorUsed": "ArticleExtractor",
  "extractorEngine": "boilerpy3"
}

Goose3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday as one of the world's busiest air travel hubs began to rumble back to life...",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "goose3"
}

JusText Example

{
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "paragraphsCount": 15,
  "languageUsed": "English",
  "extractorEngine": "justext"
}

Article Parser Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "articleHtml": "<div><p>Heathrow Airport in London resumed some flight departures and arrivals late Friday...</p></div>",
  "text": "# Flights Resume at Heathrow After Fire Forced Its Closure\n\nHeathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "extractorEngine": "article-parser"
}

News-Please Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "extractorEngine": "news-please"
}
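
Because each engine returns a different set of keys, downstream code may want to map items onto one shape. The sketch below is a hypothetical post-processing helper based only on the key names in the examples above; the target schema is an assumption. Note that `language` ("en") and JusText's `languageUsed` ("English") use different formats, and the helper passes them through without converting.

```python
def normalize_record(record: dict) -> dict:
    """Map one output item onto a small common schema; missing fields become None."""
    authors = record.get("author") or []
    if isinstance(authors, str):  # tolerate a single author given as a string
        authors = [authors]
    return {
        "url": record.get("url"),
        "title": record.get("title"),  # absent in the JusText example
        "text": record.get("text", ""),
        "authors": list(authors),
        "publishedDate": record.get("publishedDate"),
        # JusText reports its language under a different key.
        "language": record.get("language") or record.get("languageUsed"),
        "engine": record.get("extractorEngine"),
    }


justext_item = {
    "text": "Flights Resume at Heathrow After Fire Forced Its Closure...",
    "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
    "paragraphsCount": 15,
    "languageUsed": "English",
    "extractorEngine": "justext",
}
print(normalize_record(justext_item))
```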
