Universal Article Scraper
Pricing
Pay per usage
Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.
Last modified
13 days ago
A powerful web scraper that can extract articles from multiple websites simultaneously. This scraper intelligently identifies and extracts article content, metadata, and structured data from news sites, blogs, and other content platforms.
Features
- Multi-website scraping - Process multiple websites in parallel
- Smart article detection - Automatically identifies article content using various heuristics
- URL pattern filtering - Include/exclude URLs based on patterns
- Proxy support - Built-in proxy rotation for reliable scraping
- Structured output - Extracts title, content, metadata, and publication details
- Rate limiting - Configurable request limits to respect website policies
- Error handling - Robust error handling with retry mechanisms
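The README does not say how retries are implemented. One common approach is exponential backoff, sketched below under that assumption. The names `fetch_with_retry`, `max_retries`, and `base_delay` are hypothetical and are not part of the Actor's input.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with exponential backoff.

    Hypothetical helper: `fetch` is any callable that raises on failure.
    `sleep` is injectable so the delays can be observed in tests.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
            if attempt < max_retries:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Doubling the delay between attempts also serves as a crude form of rate limiting toward a site that is already struggling.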
How it works
The scraper processes multiple websites concurrently, following these steps for each site:
- URL Discovery: Starts from provided seed URLs and discovers article links
- Content Extraction: Uses Cheerio to parse HTML and extract article content
- Data Structuring: Formats extracted data into a consistent schema
- Storage: Saves results to an Apify dataset for easy access
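The URL discovery step can be sketched roughly as follows. This uses Python's standard-library HTML parser as a stand-in for Cheerio. The function name `discover_links` and the same-host restriction are assumptions made for illustration, not documented behavior of the Actor.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_links(page_url, html):
    """Return absolute, de-duplicated links on the same host as page_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Assumption: only follow links that stay on the seed URL's host
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Each discovered link would then be checked against the include and exclude patterns before being queued for extraction.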
Key components:
- Smart content detection: Identifies main article content using semantic HTML tags and heuristics
- Metadata extraction: Pulls publication dates, authors, categories, and other structured data
- URL filtering: Respects include/exclude patterns to focus on relevant content
- Concurrent processing: Handles multiple websites simultaneously for efficiency
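To make the metadata extraction component concrete, here is a minimal sketch that reads common `<meta>` tags and the canonical `<link>`. The mapping from tag names (Open Graph, `author`, `robots`) to output fields is an assumption, since the README does not list which tags the Actor inspects. Python's standard-library parser stands in for Cheerio.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> name/property values and the canonical link."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content"):
                # Keep the first value seen for each key
                self.meta.setdefault(key.lower(), attr["content"])
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")


def extract_metadata(html):
    """Map collected tags onto the output field names (assumed mapping)."""
    parser = MetaCollector()
    parser.feed(html)
    m = parser.meta
    return {
        "title": m.get("og:title"),
        "publishDateString": m.get("article:published_time"),
        "author": m.get("author"),
        "description": m.get("og:description") or m.get("description"),
        "imageUrl": m.get("og:image"),
        "robots": m.get("robots"),
        "canonicalUrl": parser.canonical,
    }
```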
Input Configuration
The scraper accepts a JSON input with the following structure:
{
  "websites": [
    {
      "topic": "techcrunch",
      "urls": ["https://techcrunch.com/"],
      "patterns": ["**/2024/**", "**/article/**"],
      "ignoreUrls": [
        "https://techcrunch.com/author*",
        "https://techcrunch.com/category*",
        "https://techcrunch.com/tag*"
      ]
    },
    {
      "topic": "bbc-news",
      "urls": ["https://www.bbc.com/news"],
      "patterns": ["**/news/**"],
      "ignoreUrls": ["**/live/**", "**/weather/**"]
    },
    {
      "topic": "theverge",
      "urls": ["https://www.theverge.com/"],
      "patterns": [],
      "ignoreUrls": []
    }
  ],
  "maxRequestsPerCrawl": 100
}
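Before starting a run, it can help to check the input locally. The sketch below validates the required fields and fills in the documented defaults. The function `validate_input` is hypothetical, and rejecting empty `websites` or `urls` arrays and duplicate topics is an assumption about what "required" and "unique" mean here.

```python
import json


def validate_input(raw):
    """Parse the input JSON and check the documented required fields."""
    config = json.loads(raw)
    websites = config.get("websites")
    if not isinstance(websites, list) or not websites:
        raise ValueError("'websites' must be a non-empty array")
    seen_topics = set()
    for site in websites:
        topic = site.get("topic")
        if not isinstance(topic, str) or not topic:
            raise ValueError("each website needs a string 'topic'")
        if topic in seen_topics:
            raise ValueError(f"duplicate topic: {topic}")
        seen_topics.add(topic)
        urls = site.get("urls")
        if not isinstance(urls, list) or not urls:
            raise ValueError(f"{topic}: 'urls' must be a non-empty array")
        # Optional fields: an empty list means "no filter"
        site.setdefault("patterns", [])
        site.setdefault("ignoreUrls", [])
    # Documented default for the request limit
    config.setdefault("maxRequestsPerCrawl", 100)
    return config
```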
Configuration Fields
websites (required)
An array of website objects to scrape. Each website object contains:
- topic (string, required): A unique identifier for the website (used for labeling results)
- urls (array, required): Starting URLs to begin crawling from
- patterns (array, optional): URL patterns to include (glob patterns supported)
  - Example: ["**/article/**", "**/news/**"] scrapes only URLs containing "/article/" or "/news/"
  - Leave empty ([]) to include all discovered URLs
- ignoreUrls (array, optional): URL patterns to exclude (glob patterns supported)
  - Example: ["**/author/**", "**/category/**"] skips author pages and category pages
  - Useful for avoiding non-article pages like navigation, archives, etc.
maxRequestsPerCrawl (number, optional)
Maximum number of requests per website (default: 100). Controls how many pages to scrape from each website to prevent infinite crawling.
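The interplay of `patterns` and `ignoreUrls` can be approximated as below. This sketch uses Python's `fnmatch`, where `*` also matches `/`, so it is looser than the glob matching the Actor itself likely uses. Giving `ignoreUrls` precedence over `patterns` is an assumption, and `should_scrape` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def should_scrape(url, patterns, ignore_urls):
    """Decide whether a discovered URL passes the include/exclude filters."""
    # Assumption: an exclude match always wins over an include match
    if any(fnmatchcase(url, pattern) for pattern in ignore_urls):
        return False
    # An empty include list means every discovered URL is allowed
    if not patterns:
        return True
    return any(fnmatchcase(url, pattern) for pattern in patterns)
```

With the techcrunch entry from the example input, a URL under `/2024/` would be kept, while anything starting with `https://techcrunch.com/author` would be skipped.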
Output
Scraped articles are stored in the Apify dataset. Each article contains:
Core Fields
- url: Source URL where the article was scraped from
- loadedUrl: Final loaded URL (may differ from original due to redirects)
- baseUrl: Base URL of the website
- articleText: Main article content (minimum 300 characters required)
- title: Article headline
- topic: Website topic identifier from input configuration
Metadata Fields
- publishDate: Publication date as Date object (parsed from publishDateString)
- publishDateString: Raw publication date string as found on the page
- modifiedDate: Last modified date as Date object (if available)
- author: Author name
- description: Article description/summary
- canonicalUrl: Canonical URL specified by the page
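How `publishDateString` becomes `publishDate` is not specified. A plausible sketch is to try ISO 8601 first and fall back to the RFC 2822 style used in HTTP headers and feeds. Both the chosen formats and the name `parse_publish_date` are assumptions.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_publish_date(raw):
    """Best-effort parse of a raw date string; returns None if unparseable."""
    text = raw.strip()
    if not text:
        return None
    # Older Python versions reject a trailing "Z", so normalize it to an offset
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    try:
        # Handles strings such as "Wed, 01 May 2024 12:30:00 GMT"
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
```

Keeping the raw string alongside the parsed value, as the output schema does, lets consumers re-parse dates that a best-effort parser like this one misses.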
Content Classification
- type: Content type (e.g., "article")
- section: Article section/category
- tags: Array of article tags
- keywords: Article keywords
Media & SEO
- imageUrl: Featured image URL
- imageAlt: Alt text for featured image
- robots: Robots meta tag value
Note: Empty fields are automatically removed from the output. Articles shorter than 300 characters are filtered out.
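The two rules in the note can be expressed in a few lines. In this sketch, "empty" is taken to mean `None`, an empty string, or an empty list or object, which is an assumption. `clean_record` and `MIN_ARTICLE_LENGTH` are hypothetical names.

```python
MIN_ARTICLE_LENGTH = 300  # documented minimum length for articleText


def clean_record(record):
    """Return the record without empty fields, or None if the text is too short."""
    text = record.get("articleText") or ""
    if len(text) < MIN_ARTICLE_LENGTH:
        return None  # article is filtered out entirely
    # Assumption: these four values count as "empty"
    return {key: value for key, value in record.items()
            if value not in (None, "", [], {})}
```

Consumers of the dataset should therefore treat every field other than the core ones as optional and check for its presence before use.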