Universal Article Scraper

Developed by Michael Novak · Maintained by Community

Pricing: Pay per usage

Last modified: 13 days ago

Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.

A powerful web scraper that can extract articles from multiple websites simultaneously. This scraper intelligently identifies and extracts article content, metadata, and structured data from news sites, blogs, and other content platforms.

Features

  • Multi-website scraping - Process multiple websites in parallel
  • Smart article detection - Automatically identifies article content using various heuristics
  • URL pattern filtering - Include/exclude URLs based on patterns
  • Proxy support - Built-in proxy rotation for reliable scraping
  • Structured output - Extracts title, content, metadata, and publication details
  • Rate limiting - Configurable request limits to respect website policies
  • Error handling - Robust error handling with retry mechanisms

How it works

The scraper processes multiple websites concurrently, following these steps for each site:

  1. URL Discovery: Starts from provided seed URLs and discovers article links
  2. Content Extraction: Uses Cheerio to parse HTML and extract article content
  3. Data Structuring: Formats extracted data into a consistent schema
  4. Storage: Saves results to Apify dataset for easy access

Key components:

  • Smart content detection: Identifies main article content using semantic HTML tags and heuristics
  • Metadata extraction: Pulls publication dates, authors, categories, and other structured data
  • URL filtering: Respects include/exclude patterns to focus on relevant content
  • Concurrent processing: Handles multiple websites simultaneously for efficiency

Input Configuration

The scraper accepts a JSON input with the following structure:

{
  "websites": [
    {
      "topic": "techcrunch",
      "urls": ["https://techcrunch.com/"],
      "patterns": ["**/2024/**", "**/article/**"],
      "ignoreUrls": [
        "https://techcrunch.com/author*",
        "https://techcrunch.com/category*",
        "https://techcrunch.com/tag*"
      ]
    },
    {
      "topic": "bbc-news",
      "urls": ["https://www.bbc.com/news"],
      "patterns": ["**/news/**"],
      "ignoreUrls": ["**/live/**", "**/weather/**"]
    },
    {
      "topic": "theverge",
      "urls": ["https://www.theverge.com/"],
      "patterns": [],
      "ignoreUrls": []
    }
  ],
  "maxRequestsPerCrawl": 100
}

Configuration Fields

websites (required)

An array of website objects to scrape. Each website object contains:

  • topic (string, required): A unique identifier for the website (used for labeling results)
  • urls (array, required): Starting URLs to begin crawling from
  • patterns (array, optional): URL patterns to include (glob patterns supported)
    • Example: ["**/article/**", "**/news/**"] - only scrape URLs containing "/article/" or "/news/"
    • Leave empty [] to include all discovered URLs
  • ignoreUrls (array, optional): URL patterns to exclude (glob patterns supported)
    • Example: ["**/author/**", "**/category/**"] - skip author pages and category pages
    • Useful for avoiding non-article pages like navigation, archives, etc.

maxRequestsPerCrawl (number, optional)

Maximum number of requests per website (default: 100). Caps how many pages are scraped from each website, preventing unbounded crawls.

Output

Scraped articles are stored in the Apify dataset. Each article contains:

Core Fields

  • url - Source URL where the article was scraped from
  • loadedUrl - Final loaded URL (may differ from the original URL due to redirects)
  • baseUrl - Base URL of the website
  • articleText - Main article content (minimum 300 characters required)
  • title - Article headline
  • topic - Website topic identifier from input configuration

Metadata Fields

  • publishDate - Publication date as Date object (parsed from publishDateString)
  • publishDateString - Raw publication date string as found on the page
  • modifiedDate - Last modified date as Date object (if available)
  • author - Author name
  • description - Article description/summary
  • canonicalUrl - Canonical URL specified by the page

Content Classification

  • type - Content type (e.g., "article")
  • section - Article section/category
  • tags - Array of article tags
  • keywords - Article keywords

Media & SEO

  • imageUrl - Featured image URL
  • imageAlt - Alt text for featured image
  • robots - Robots meta tag value

Note: Empty fields are automatically removed from the output. Articles shorter than 300 characters are filtered out.