Ultimate Article Extractor

2-hour trial, then $25.00/month. No credit card required.

web.harvester/ultimate-article-extractor

Developer
Maintained by Community

Actor Metrics

  • 1 monthly user

  • 5.0 / 5 (2)

  • 2 bookmarks

  • Created in Mar 2025

  • Modified a day ago

Ultimate Article Extractor: Advanced Scraping & Content Extraction Tool

A powerful and modular web scraping tool designed to extract content from any webpage, article, or news site. Get clean, structured data from any website with optimized extraction algorithms, anti-bot detection avoidance, and proxy support.

Overview

Ultimate Article Extractor combines seven specialized extraction engines to pull the meaningful content (title, text, authors, dates, images, and metadata) out of a web page. It is designed for data scientists, researchers, journalists, and developers who need to analyze web content at scale.

Perfect for:

  • Content aggregation
  • News monitoring
  • Research data collection
  • SEO analysis
  • Topic modeling and NLP projects
  • Web archiving
  • Market intelligence

Key Features

  • 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
  • Universal Website Compatibility: Works with any website regardless of structure or layout
  • Complete Content Extraction: Captures title, description, full text, authors, publication date, images, and metadata
  • Smart Fallback System: Automatically tries alternative extraction methods if the primary one fails
  • Advanced Header Generation: Generates realistic browser headers for configurable browsers and devices to reduce blocking by anti-bot measures
  • Proxy Support: Integrates with residential proxies to prevent IP blocking
  • Domain-Specific Rate Limiting: Automatically manages request rates per domain to avoid detection
  • Customizable Output: Save article HTML, full page HTML, plaintext, or structured JSON
  • Parallel Processing: Process multiple URLs concurrently with optimized resource usage
  • State Persistence: Handles interruptions gracefully by saving progress
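
The Smart Fallback System listed above can be pictured as an ordered chain of engines. The following is a minimal sketch of that idea, not the Actor's actual implementation: the `Extractor` callable type, the `extract_with_fallback` helper, and the `min_text_length` threshold are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: takes raw HTML, returns a dict
# that should contain at least a "text" key.
Extractor = Callable[[str], Dict[str, object]]


def extract_with_fallback(
    html: str,
    extractors: List[Tuple[str, Extractor]],
    min_text_length: int = 200,  # assumed threshold for a "usable" result
) -> Optional[Dict[str, object]]:
    """Try each (name, extractor) pair in order; return the first usable result."""
    for name, extractor in extractors:
        try:
            result = extractor(html)
        except Exception:
            # A failing engine should not abort the run; try the next one.
            continue
        text = str(result.get("text") or "")
        if len(text.strip()) >= min_text_length:
            # Record which engine produced the result, mirroring the
            # "extractorEngine" field in the example outputs.
            return {**result, "extractorEngine": name}
    # Every engine failed or returned too little text.
    return None
```

Passing the preferred engine first and the alternatives after it gives the "primary, then fallback" behavior; a return value of `None` signals that no engine produced enough text.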

Extraction Methods Compared

Extractor | Best For | Key Strengths | Output Fields
Newspaper4k | General news articles | NLP capabilities, metadata extraction | Title, text, authors, publish date, keywords, summary
Trafilatura | News & blog content | Optimized for news, metadata support | Title, text, author, date, language, categories, tags
Boilerpy3 | Simple article extraction | Fast, efficient text extraction | Title, text, text density metrics
News-Please | Comprehensive extraction | Rich metadata, fallback capabilities | Title, text, authors, publish date, language, images
Goose3 | Article content & images | Image extraction, metadata support | Title, text, authors, images, keywords
Article Parser | HTML & markdown output | Multiple output formats | Title, HTML content, markdown content
JusText | Boilerplate removal | Focuses on main content | Text, paragraphs count, language
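
One way to use the comparison above is to pick engines by the fields a project needs. The sketch below transcribes the Output Fields column into a lookup table; the field labels are lower-cased, and Trafilatura's "author" and "date" are assumed to correspond to "authors" and "publish date" so they can be compared. The names `ENGINE_FIELDS` and `engines_covering` are hypothetical.

```python
from typing import Iterable, List

# Output fields per engine, transcribed from the comparison table
# (lower-cased; "author"/"date" normalized to "authors"/"publish date").
ENGINE_FIELDS = {
    "newspaper4k": {"title", "text", "authors", "publish date", "keywords", "summary"},
    "trafilatura": {"title", "text", "authors", "publish date", "language", "categories", "tags"},
    "boilerpy3": {"title", "text", "text density metrics"},
    "news-please": {"title", "text", "authors", "publish date", "language", "images"},
    "goose3": {"title", "text", "authors", "images", "keywords"},
    "article-parser": {"title", "html content", "markdown content"},
    "justext": {"text", "paragraphs count", "language"},
}


def engines_covering(required: Iterable[str]) -> List[str]:
    """Return, in alphabetical order, the engines whose fields include all of `required`."""
    needed = set(required)
    return sorted(name for name, fields in ENGINE_FIELDS.items() if needed <= fields)


print(engines_covering({"title", "text", "images"}))  # ['goose3', 'news-please']
```

The table describes what each engine can return in principle; the example outputs for a given page may contain fewer fields.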

Input Configuration

The Actor accepts a JSON input such as the following:

{
  "startUrls": [
    "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"
  ],
  "extractorEngine": "newspaper4k",
  "saveHtml": false,
  "saveArticleHtml": false,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  },
  "maxRetries": 15
}
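
The same input can be built and submitted from Python. This is a sketch under a few assumptions: it uses the `apify-client` package, it assumes results are written to the run's default dataset, and the helper names `build_run_input` and `run_actor` are hypothetical. The Actor ID is the one shown on this page.

```python
from typing import Any, Dict, List

ACTOR_ID = "web.harvester/ultimate-article-extractor"


def build_run_input(urls: List[str], engine: str = "newspaper4k") -> Dict[str, Any]:
    """Build an input dict matching the JSON example above."""
    return {
        "startUrls": list(urls),
        "extractorEngine": engine,
        "saveHtml": False,
        "saveArticleHtml": False,
        "useHeaderGenerator": True,
        "headerGeneratorOptions": {
            "browsers": ["chrome", "firefox", "safari", "edge"],
            "devices": ["desktop"],
        },
        "customHeaders": {},
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
        "maxRetries": 15,
    }


def run_actor(run_input: Dict[str, Any], token: str) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    # Assumption: extracted articles are stored in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_actor(build_run_input(["https://example.com/article"]), token)` needs a valid Apify API token and consumes platform credits.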

Input Parameters Explained

  • startUrls (required): Array of article URLs to extract content from
  • extractorEngine (optional): Choose your preferred extraction library:
    • newspaper4k - Best all-around extractor with NLP capabilities (default)
    • trafilatura - Optimized for news content
    • boilerpy3 - Fast and efficient text extraction
    • news-please - Rich metadata extraction
    • goose3 - Good for extracting images and article content
    • article-parser - Supports multiple output formats
    • justext - Focused on boilerplate removal
  • saveHtml (optional): When true, saves the complete HTML of the webpage
  • saveArticleHtml (optional): When true, saves the extracted article HTML (for supported extractors)
  • useHeaderGenerator (optional): Enables sophisticated header generation to bypass detection
  • headerGeneratorOptions (optional): Configure which browsers and devices to emulate
  • customHeaders (optional): Set custom HTTP headers for requests
  • proxyConfiguration (optional): Configure proxy settings to avoid IP blocking
  • maxRetries (optional): Maximum number of retry attempts for failed requests (default: 15)
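
The rules above (one required field, a fixed set of engine names, two documented defaults) can be checked before a run is submitted. The following is an illustrative client-side check, not the Actor's own validation; `validate_input` is a hypothetical helper, and the requirement that each URL is an http(s) string is an assumption based on the input example.

```python
from typing import Any, Dict

SUPPORTED_ENGINES = (
    "newspaper4k", "trafilatura", "boilerpy3", "news-please",
    "goose3", "article-parser", "justext",
)

# Only the defaults stated in the parameter list are filled in here.
DEFAULTS = {"extractorEngine": "newspaper4k", "maxRetries": 15}


def validate_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `raw` with documented defaults applied, or raise ValueError."""
    urls = raw.get("startUrls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("startUrls is required and must be a non-empty list")
    for url in urls:
        # Assumption: URLs are plain http(s) strings, as in the input example.
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")

    config = {**DEFAULTS, **raw}
    if config["extractorEngine"] not in SUPPORTED_ENGINES:
        raise ValueError(f"unknown extractorEngine: {config['extractorEngine']!r}")

    retries = config["maxRetries"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(retries, bool) or not isinstance(retries, int) or retries < 0:
        raise ValueError("maxRetries must be a non-negative integer")
    return config
```

Catching a misspelled engine name locally avoids spending a run on an input that cannot succeed.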

Example Outputs by Extractor

Newspaper4k Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "The authorities said there was no immediate indication of foul play in the substation fire...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["airport", "heathrow", "power outage", "london"],
  "summary": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "extractorEngine": "newspaper4k"
}

Trafilatura Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "categories": ["world", "europe"],
  "tags": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "trafilatura"
}

Boilerpy3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure - The New York Times",
  "text": "SKIP ADVERTISEMENT\nFlights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "textDensity": 0.85,
  "markupToTextRatio": 0.32,
  "extractorUsed": "ArticleExtractor",
  "extractorEngine": "boilerpy3"
}

Goose3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday as one of the world's busiest air travel hubs began to rumble back to life...",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "goose3"
}

JusText Example

{
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "paragraphsCount": 15,
  "languageUsed": "English",
  "extractorEngine": "justext"
}

Article Parser Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "articleHtml": "<div><p>Heathrow Airport in London resumed some flight departures and arrivals late Friday...</p></div>",
  "text": "# Flights Resume at Heathrow After Fire Forced Its Closure\n\nHeathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "extractorEngine": "article-parser"
}

News-Please Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "extractorEngine": "news-please"
}
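
Because the examples above differ by engine (JusText reports "languageUsed": "English" where others report "language": "en", and some engines omit title, author, or url), downstream code may want a common shape. The sketch below shows one possible normalization; `normalize_record` and the `LANGUAGE_NAMES` mapping are hypothetical and cover only the variations visible in these examples.

```python
from typing import Any, Dict

# Hypothetical mapping from language names to codes; only "English"
# appears in the examples, so extend this as needed.
LANGUAGE_NAMES = {"English": "en"}


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one output item from any engine onto a common set of keys."""
    language = record.get("language")
    if language is None:
        # JusText reports a language name under "languageUsed".
        used = record.get("languageUsed")
        language = LANGUAGE_NAMES.get(used, used)

    authors = record.get("author") or []
    if isinstance(authors, str):
        # Tolerate a single author given as a plain string.
        authors = [authors]

    return {
        "url": record.get("url"),
        "title": record.get("title"),
        "text": record.get("text", ""),
        "authors": list(authors),
        "publishedDate": record.get("publishedDate"),
        "language": language,
        "engine": record.get("extractorEngine"),
    }
```

Missing fields come back as `None` (or an empty list for authors) rather than raising, so items from different engines can be stored in one table. Note that Article Parser's "text" is markdown in the example above, so it may need separate handling.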