
Ultimate Article Extractor
2 hours trial then $25.00/month - No credit card required now
A powerful and modular web scraping tool designed to extract content from any webpage, article, or news site. Get clean, structured data from any website with optimized extraction algorithms, anti-bot detection avoidance, and proxy support.
Actor Metrics
1 monthly user
5.0 / 5 (2)
2 bookmarks
Created in Mar 2025
Modified a day ago
Ultimate Article Extractor: Advanced Scraping & Content Extraction Tool
Overview
Ultimate Article Extractor combines seven specialized extraction engines to pull the meaningful content out of a webpage. It is designed for data scientists, researchers, journalists, and developers who need to analyze web content at scale.
Perfect for:
- Content aggregation
- News monitoring
- Research data collection
- SEO analysis
- Topic modeling and NLP projects
- Web archiving
- Market intelligence
Key Features
- 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
- Universal Website Compatibility: Works with any website regardless of structure or layout
- Complete Content Extraction: Captures title, description, full text, authors, publication date, images, and metadata
- Smart Fallback System: Automatically tries alternative extraction methods if the primary one fails
- Advanced Header Generation: Uses sophisticated browser fingerprinting to bypass anti-bot measures
- Proxy Support: Integrates with residential proxies to prevent IP blocking
- Domain-Specific Rate Limiting: Automatically manages request rates per domain to avoid detection
- Customizable Output: Save article HTML, full page HTML, plaintext, or structured JSON
- Parallel Processing: Process multiple URLs concurrently with optimized resource usage
- State Persistence: Handles interruptions gracefully by saving progress
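The smart fallback behavior described above can be pictured as trying engines in order until one returns usable text. The following is a minimal sketch of that idea, not the Actor's actual implementation: `extract_with_fallback` and the three stub extractors are hypothetical names, and treating an empty `text` field as a failure is an assumption.

```python
from typing import Callable, Optional

# An extractor takes raw HTML and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(html: str, extractors: list[tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order; return the first usable result."""
    errors: dict[str, str] = {}
    for name, extract in extractors:
        try:
            result = extract(html)
        except Exception as exc:  # one failing engine should not abort the chain
            errors[name] = str(exc)
            continue
        # Assumption: a result without non-empty text counts as a failure.
        if result and (result.get("text") or "").strip():
            return {**result, "extractorEngine": name}
        errors[name] = "empty result"
    raise RuntimeError(f"all extractors failed: {errors}")


# Hypothetical stand-ins for real engines, used only to exercise the chain.
def broken_engine(html: str) -> Optional[dict]:
    raise ValueError("parse error")


def empty_engine(html: str) -> Optional[dict]:
    return {"title": "Example", "text": "   "}


def working_engine(html: str) -> Optional[dict]:
    return {"title": "Example", "text": "Body text"}


chain = [
    ("newspaper4k", broken_engine),
    ("trafilatura", empty_engine),
    ("goose3", working_engine),
]
article = extract_with_fallback("<html></html>", chain)
```

In this sketch the returned record carries the name of the engine that succeeded, which mirrors the `extractorEngine` field in the example outputs.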
Extraction Methods Compared
Extractor | Best For | Key Strengths | Output Fields |
---|---|---|---|
Newspaper4k | General news articles | NLP capabilities, metadata extraction | Title, text, authors, publish date, keywords, summary |
Trafilatura | News & blog content | Optimized for news, metadata support | Title, text, author, date, language, categories, tags |
Boilerpy3 | Simple article extraction | Fast, efficient text extraction | Title, text, text density metrics |
News-Please | Comprehensive extraction | Rich metadata, fallback capabilities | Title, text, authors, publish date, language, images |
Goose3 | Article content & images | Image extraction, metadata support | Title, text, authors, images, keywords |
Article Parser | HTML & markdown output | Multiple output formats | Title, HTML content, markdown content |
JusText | Boilerplate removal | Focuses on main content | Text, paragraphs count, language |
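One way to use the comparison table is to select engines by the output fields a project needs. The sketch below encodes the "Output Fields" column as sets and filters on them. The field spellings are lowercased copies of the table entries rather than the Actor's JSON keys, and `engines_covering` is a hypothetical helper.

```python
# Output fields per engine, transcribed from the comparison table above.
ENGINE_FIELDS = {
    "newspaper4k": {"title", "text", "authors", "publish date", "keywords", "summary"},
    "trafilatura": {"title", "text", "author", "date", "language", "categories", "tags"},
    "boilerpy3": {"title", "text", "text density metrics"},
    "news-please": {"title", "text", "authors", "publish date", "language", "images"},
    "goose3": {"title", "text", "authors", "images", "keywords"},
    "article-parser": {"title", "html content", "markdown content"},
    "justext": {"text", "paragraphs count", "language"},
}


def engines_covering(required: set[str]) -> list[str]:
    """Return the engines whose listed output fields include every required field."""
    return [name for name, fields in ENGINE_FIELDS.items() if required <= fields]


image_capable = engines_covering({"title", "text", "images"})
text_only = engines_covering({"text"})
```

Here `image_capable` contains only the engines the table lists with image output, while `text_only` contains every engine except Article Parser, whose row lists HTML and markdown content instead of plain text.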
Input Configuration
The application accepts the following input parameters:
    {
      "startUrls": [
        "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"
      ],
      "extractorEngine": "newspaper4k",
      "saveHtml": false,
      "saveArticleHtml": false,
      "useHeaderGenerator": true,
      "headerGeneratorOptions": {
        "browsers": ["chrome", "firefox", "safari", "edge"],
        "devices": ["desktop"]
      },
      "customHeaders": {},
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
          "RESIDENTIAL"
        ]
      },
      "maxRetries": 15
    }
Input Parameters Explained
- startUrls (required): Array of article URLs to extract content from
- extractorEngine (optional): Choose your preferred extraction library:
  - newspaper4k: Best all-around extractor with NLP capabilities (default)
  - trafilatura: Optimized for news content
  - boilerpy3: Fast and efficient text extraction
  - news-please: Rich metadata extraction
  - goose3: Good for extracting images and article content
  - article-parser: Supports multiple output formats
  - justext: Focused on boilerplate removal
- saveHtml (optional): When true, saves the complete HTML of the webpage
- saveArticleHtml (optional): When true, saves the extracted article HTML (for supported extractors)
- useHeaderGenerator (optional): When true, generates realistic browser headers for each request to reduce the chance of bot detection
- headerGeneratorOptions (optional): Configure which browsers and devices to emulate
- customHeaders (optional): Set custom HTTP headers for requests
- proxyConfiguration (optional): Configure proxy settings to avoid IP blocking
- maxRetries (optional): Maximum number of retry attempts for failed requests (default: 15)
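Before starting a run, it can help to validate the input on the client side. The sketch below builds an input object from the parameters listed above. `build_run_input` is a hypothetical helper, and the checks it performs (non-empty URL list, http/https scheme, known engine name, non-negative retry count) are assumptions about what a sensible input looks like, not rules published for the Actor. Only the two documented defaults are applied; every other option is left out unless passed explicitly.

```python
from urllib.parse import urlparse

# Engine identifiers as listed under extractorEngine.
ALLOWED_ENGINES = {
    "newspaper4k", "trafilatura", "boilerpy3", "news-please",
    "goose3", "article-parser", "justext",
}


def build_run_input(start_urls, extractor_engine="newspaper4k", max_retries=15, **options):
    """Return an input dict with the documented defaults, after basic checks."""
    urls = list(start_urls)
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if extractor_engine not in ALLOWED_ENGINES:
        raise ValueError(f"unknown extractorEngine: {extractor_engine!r}")
    if max_retries < 0:
        raise ValueError("maxRetries must not be negative")
    run_input = {
        "startUrls": urls,
        "extractorEngine": extractor_engine,
        "maxRetries": max_retries,
    }
    # Optional settings such as saveHtml or proxyConfiguration pass through unchanged.
    run_input.update(options)
    return run_input


run_input = build_run_input(
    ["https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"],
    extractor_engine="trafilatura",
    saveHtml=True,
)
```

The resulting dictionary can be serialized with `json.dumps` and supplied as the run input.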
Example Outputs by Extractor
Newspaper4k Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
      "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
      "text": "The authorities said there was no immediate indication of foul play in the substation fire...",
      "author": ["Michael Levenson", "Andrew Das"],
      "publishedDate": "2025-03-21T04:09:20",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "language": "en",
      "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
      "keywords": ["airport", "heathrow", "power outage", "london"],
      "summary": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
      "extractorEngine": "newspaper4k"
    }
Trafilatura Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
      "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "language": "en",
      "categories": ["world", "europe"],
      "tags": ["heathrow", "airport", "power outage", "london"],
      "extractorEngine": "trafilatura"
    }
Boilerpy3 Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure - The New York Times",
      "text": "SKIP ADVERTISEMENT\nFlights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "textDensity": 0.85,
      "markupToTextRatio": 0.32,
      "extractorUsed": "ArticleExtractor",
      "extractorEngine": "boilerpy3"
    }
Goose3 Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
      "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
      "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday as one of the world's busiest air travel hubs began to rumble back to life...",
      "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
      "keywords": ["heathrow", "airport", "power outage", "london"],
      "extractorEngine": "goose3"
    }
JusText Example
    {
      "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "paragraphsCount": 15,
      "languageUsed": "English",
      "extractorEngine": "justext"
    }
Article Parser Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
      "articleHtml": "<div><p>Heathrow Airport in London resumed some flight departures and arrivals late Friday...</p></div>",
      "text": "# Flights Resume at Heathrow After Fire Forced Its Closure\n\nHeathrow Airport in London resumed some flight departures and arrivals late Friday...",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "extractorEngine": "article-parser"
    }
News-Please Example
    {
      "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
      "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
      "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
      "author": ["Michael Levenson", "Andrew Das"],
      "publishedDate": "2025-03-21T04:09:20",
      "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
      "language": "en",
      "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
      "extractorEngine": "news-please"
    }
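Because each engine returns a different set of keys, downstream code that mixes results from several engines may want a common record shape. The sketch below is one possible approach, based only on the keys visible in the example outputs: `normalize_record` and `COMMON_FIELDS` are hypothetical names, missing fields become `None`, and JusText's `languageUsed` is copied into `language` unchanged (so it stays "English" rather than a code such as "en").

```python
# Keys that appear under the same name in several of the example outputs.
COMMON_FIELDS = (
    "title", "description", "text", "url",
    "language", "publishedDate", "image", "extractorEngine",
)


def normalize_record(record: dict) -> dict:
    """Map an engine-specific result onto one shared set of keys."""
    normalized = {field: record.get(field) for field in COMMON_FIELDS}
    # The examples store authors as a list under "author"; default to an empty list.
    normalized["authors"] = list(record.get("author") or [])
    # JusText reports its language under "languageUsed" instead of "language".
    if normalized["language"] is None and "languageUsed" in record:
        normalized["language"] = record["languageUsed"]
    return normalized


justext_record = {
    "text": "Flights Resume at Heathrow After Fire Forced Its Closure...",
    "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
    "paragraphsCount": 15,
    "languageUsed": "English",
    "extractorEngine": "justext",
}
normalized = normalize_record(justext_record)
```

Engine-specific extras such as `paragraphsCount` or `textDensity` are dropped by this sketch. Keeping them under a separate key would be an equally reasonable choice.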