Universal Article Scraper

Developed by Michael Novak · Maintained by Community

Pricing: Pay per usage

Last modified: 13 days ago

Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.

A powerful web scraper that can extract articles from multiple websites simultaneously. This scraper intelligently identifies and extracts article content, metadata, and structured data from news sites, blogs, and other content platforms.

Features

  • Multi-website scraping - Process multiple websites in parallel
  • Smart article detection - Automatically identifies article content using various heuristics
  • URL pattern filtering - Include/exclude URLs based on patterns
  • Proxy support - Built-in proxy rotation for reliable scraping
  • Structured output - Extracts title, content, metadata, and publication details
  • Rate limiting - Configurable request limits to respect website policies
  • Error handling - Robust error handling with retry mechanisms

How it works

The scraper processes multiple websites concurrently, following these steps for each site:

  1. URL Discovery: Starts from provided seed URLs and discovers article links
  2. Content Extraction: Uses Cheerio to parse HTML and extract article content
  3. Data Structuring: Formats extracted data into a consistent schema
  4. Storage: Saves results to Apify dataset for easy access

Key components:

  • Smart content detection: Identifies main article content using semantic HTML tags and heuristics
  • Metadata extraction: Pulls publication dates, authors, categories, and other structured data
  • URL filtering: Respects include/exclude patterns to focus on relevant content
  • Concurrent processing: Handles multiple websites simultaneously for efficiency

Input Configuration

The scraper accepts a JSON input with the following structure:

{
  "websites": [
    {
      "topic": "techcrunch",
      "urls": ["https://techcrunch.com/"],
      "patterns": ["**/2024/**", "**/article/**"],
      "ignoreUrls": [
        "https://techcrunch.com/author*",
        "https://techcrunch.com/category*",
        "https://techcrunch.com/tag*"
      ]
    },
    {
      "topic": "bbc-news",
      "urls": ["https://www.bbc.com/news"],
      "patterns": ["**/news/**"],
      "ignoreUrls": ["**/live/**", "**/weather/**"]
    },
    {
      "topic": "theverge",
      "urls": ["https://www.theverge.com/"],
      "patterns": [],
      "ignoreUrls": []
    }
  ],
  "maxRequestsPerCrawl": 100
}

Configuration Fields

websites (required)

An array of website objects to scrape. Each website object contains:

  • topic (string, required): A unique identifier for the website (used for labeling results)
  • urls (array, required): Starting URLs to begin crawling from
  • patterns (array, optional): URL patterns to include (glob patterns supported)
    • Example: ["**/article/**", "**/news/**"] - only scrape URLs containing "/article/" or "/news/"
    • Leave empty [] to include all discovered URLs
  • ignoreUrls (array, optional): URL patterns to exclude (glob patterns supported)
    • Example: ["**/author/**", "**/category/**"] - skip author pages and category pages
    • Useful for avoiding non-article pages like navigation, archives, etc.

maxRequestsPerCrawl (number, optional)

Maximum number of requests per website (default: 100). Caps how many pages are scraped from each website, preventing unbounded crawls.

Output

Scraped articles are stored in the Apify dataset. Each article contains:

Core Fields

  • url - Source URL where the article was scraped from
  • loadedUrl - Final loaded URL (may differ from the original URL due to redirects)
  • baseUrl - Base URL of the website
  • articleText - Main article content (minimum 300 characters required)
  • title - Article headline
  • topic - Website topic identifier from input configuration

Metadata Fields

  • publishDate - Publication date as Date object (parsed from publishDateString)
  • publishDateString - Raw publication date string as found on the page
  • modifiedDate - Last modified date as Date object (if available)
  • author - Author name
  • description - Article description/summary
  • canonicalUrl - Canonical URL specified by the page

Content Classification

  • type - Content type (e.g., "article")
  • section - Article section/category
  • tags - Array of article tags
  • keywords - Article keywords

Media & SEO

  • imageUrl - Featured image URL
  • imageAlt - Alt text for featured image
  • robots - Robots meta tag value

Note: Empty fields are automatically removed from the output. Articles shorter than 300 characters are filtered out.