RSS / XML Scraper

Pricing

Pay per usage

The RSS / XML Scraper is an Actor for parsing RSS feeds and XML files. It extracts clean, structured data from feeds, including complex ones, and is intended for content aggregation, data monitoring, and content analysis.

Developer

Shahid Irfan

Maintained by Community

RSS/XML Scraper


📋 What does this Actor do?

This Actor scrapes RSS/Atom feeds and extracts structured data from feed entries. It can discover RSS feeds from website URLs when discoverFeeds is enabled, and can optionally extract full article content. All extracted data is stored in the Apify dataset for processing and analysis.

✨ Key Features

  • 📡 Feed Scraping: Extract data from RSS/Atom feeds
  • 🔍 Auto Discovery: Find RSS feeds automatically from websites
  • 📄 Full Content: Optional extraction of complete article content
  • ⚡ Fast Processing: Asynchronous processing for high performance
  • 🎯 Structured Data: Clean, structured output in JSON format
  • 🔧 Flexible Input: Support for multiple URL formats and input methods

📥 Input

The Actor accepts various input formats to accommodate different use cases.

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string or array | Required | - | RSS feed URLs, website URLs, or both. Supports multiple formats (listed below the table) |
| extractContent | boolean | Optional | false | Extract full article content from feed entry links |
| maxEntries | number | Optional | 0 | Maximum entries to process per feed (0 = all entries) |
| discoverFeeds | boolean | Optional | false | Automatically discover RSS feeds from website URLs |
| userAgent | string | Optional | - | Custom user agent string for HTTP requests |
| timeout | number | Optional | 30 | Request timeout in seconds |
| concurrency | number | Optional | 5 | Maximum number of feeds/websites processed in parallel |

Supported formats for urls:

  • Single URL: "https://example.com/feed.xml"
  • Multi-line: One URL per line
  • Comma-separated: "url1,url2,url3"
  • JSON array: ["url1", "url2"]
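
The urls formats listed above can all be reduced to one flat list before processing. The following is a minimal sketch of how that normalization might look; `normalize_urls` is a hypothetical helper written for illustration, not the Actor's actual code, and it assumes that URLs themselves contain no commas.

```python
import json
from typing import List, Union


def normalize_urls(urls: Union[str, List[str]]) -> List[str]:
    """Turn any of the documented `urls` formats into a flat list of URLs."""
    if isinstance(urls, list):
        candidates = urls
    else:
        text = urls.strip()
        if text.startswith("["):
            # JSON array passed as a string, e.g. '["url1", "url2"]'
            candidates = json.loads(text)
        else:
            # Multi-line and comma-separated input share one code path.
            # Assumption: a comma never appears inside a URL.
            candidates = text.replace(",", "\n").splitlines()

    # Drop blanks and duplicates while preserving the original order.
    seen = set()
    result = []
    for item in candidates:
        url = str(item).strip()
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Deduplicating at this stage keeps a feed that is listed twice from being fetched twice; whether the Actor does the same is not stated in this README.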

Legacy Parameters (for backward compatibility)

| Parameter | Type | Required | Description |
|---|---|---|---|
| rss_url | string | Optional | Single RSS feed URL (alternative to urls) |
| xml_url | string | Optional | Single XML feed URL (alternative to urls) |
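
One way the legacy keys could be folded into the newer urls parameter is sketched below. The precedence (urls first, then rss_url, then xml_url) is an assumption made for this example, and `collect_feed_urls` is a hypothetical name; the README does not say how the Actor resolves inputs that set several of these keys at once.

```python
from typing import Any, Dict, List


def collect_feed_urls(actor_input: Dict[str, Any]) -> List[str]:
    """Gather URLs from `urls` first, then from the legacy keys."""
    urls = actor_input.get("urls") or []
    if isinstance(urls, str):
        # Simplification: treat a string value as one URL.
        urls = [urls]
    collected = list(urls)

    # Assumed order of precedence for the legacy parameters.
    for legacy_key in ("rss_url", "xml_url"):
        value = actor_input.get(legacy_key)
        if value and value not in collected:
            collected.append(value)
    return collected
```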

📤 Output

The Actor writes one structured JSON record per feed entry to the Apify dataset.

Data Structure

Each processed entry contains the following fields:

{
  "feed_url": "https://example.com/feed.xml",
  "title": "Article Title",
  "link": "https://example.com/article",
  "description": "Article description or summary",
  "author": "John Doe",
  "published": "2025-11-08T10:30:00+00:00",
  "id": "unique-entry-identifier",
  "tags": ["tag1", "tag2"],
  "collected_at": "2025-11-08T12:00:00+00:00"
}
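
To make the field mapping concrete, here is a rough sketch that builds records of the shape above from an RSS 2.0 document using only the Python standard library. It is an illustration under stated assumptions: the sample feed and the `rss_items_to_records` helper are invented for this example, Atom feeds are not handled, and the Actor's real parser may map fields differently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Invented sample feed, used only to exercise the mapping.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>Article Title</title>
      <link>https://example.com/article</link>
      <description>Article description or summary</description>
      <author>John Doe</author>
      <pubDate>Sat, 08 Nov 2025 10:30:00 +0000</pubDate>
      <guid>unique-entry-identifier</guid>
      <category>tag1</category>
      <category>tag2</category>
    </item>
  </channel>
</rss>"""


def rss_items_to_records(xml_text: str, feed_url: str) -> list:
    """Map each RSS 2.0 <item> to a dict with the documented field names."""
    root = ET.fromstring(xml_text)
    collected_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    records = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        records.append({
            "feed_url": feed_url,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
            "author": item.findtext("author"),
            # RSS uses RFC 822 dates; convert them to ISO 8601.
            "published": parsedate_to_datetime(pub_date).isoformat() if pub_date else None,
            "id": item.findtext("guid"),
            "tags": [c.text for c in item.findall("category") if c.text],
            "collected_at": collected_at,
        })
    return records


records = rss_items_to_records(SAMPLE_RSS, "https://example.com/feed.xml")
print(records[0]["published"])  # 2025-11-08T10:30:00+00:00
```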

Additional Fields (when extractContent: true)

{
  "full_text": "Complete article text content...",
  "full_html": "<p>Complete article HTML...</p>",
  "keywords": ["keyword1", "keyword2"],
  "top_image": "https://example.com/image.jpg",
  "authors": ["Author Name"],
  "publish_date": "2025-11-08T10:30:00+00:00",
  "meta_description": "Article meta description"
}

Dataset Views

The dataset provides multiple views for different analysis needs:

  • 📊 Overview: Complete entry data with all fields
  • 📰 Feeds: Feed-level information and metadata
  • 📝 Articles: Article content and extracted data

🚀 Usage Examples

Basic RSS Feed Scraping

{
  "urls": "https://example.com/feed.xml"
}

Multiple Feeds

{
  "urls": [
    "https://blog1.com/feed.xml",
    "https://blog2.com/rss",
    "https://news.com/atom.xml"
  ]
}

Website Feed Discovery

{
  "urls": "https://example.com",
  "discoverFeeds": true
}
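
Feed discovery is commonly done by reading the `<link rel="alternate">` tags in a page's HTML head. The sketch below shows that general technique with the standard library; it is not confirmed to be what this Actor does internally, and `FeedLinkParser` and `discover_feeds` are names made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally advertise a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html: str, base_url: str) -> list:
    """Return absolute feed URLs found in the given HTML."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Pages that do not advertise their feeds this way would need a fallback, such as probing common paths, which this sketch leaves out.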

Full Content Extraction

{
  "urls": "https://tech-news.com/feed.xml",
  "extractContent": true,
  "maxEntries": 50
}

Advanced Configuration

{
  "urls": "https://example.com/feed.xml",
  "extractContent": true,
  "maxEntries": 100,
  "discoverFeeds": false,
  "userAgent": "Custom Bot/1.0",
  "timeout": 60
}
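
An input like the one above can also be submitted from Python. The sketch below assumes the `apify-client` package and a version of it in which `call()` returns the run as a dict. `ACTOR_ID` is a placeholder, since this README does not state the Actor's ID, and `build_run_input` and `run_scraper` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Placeholder: replace with the Actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"

# Optional parameters documented in the Input section.
ALLOWED_OPTIONS = {"extractContent", "maxEntries", "discoverFeeds",
                   "userAgent", "timeout", "concurrency"}


def build_run_input(urls: List[str], **options: Any) -> Dict[str, Any]:
    """Assemble the input object, rejecting undocumented parameters."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unknown input parameters: {sorted(unknown)}")
    return {"urls": urls, **options}


def run_scraper(token: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported here so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then look like `run_scraper(token, build_run_input(["https://example.com/feed.xml"], extractContent=True, maxEntries=100))`, with `token` being your Apify API token.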

Legacy Input Format

{
  "rss_url": "https://example.com/feed.xml",
  "extractContent": true
}

💰 Cost & Performance

Compute Units

  • Free: 1,000 entries per month
  • Paid: $0.25 per 1,000 entries
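
As a quick worked example of the rates above, the following sketch estimates a monthly bill. It assumes that the free allowance is deducted first and that the remainder is billed pro rata, which this README does not spell out; `estimate_monthly_cost` is an illustrative helper only.

```python
FREE_ENTRIES_PER_MONTH = 1_000
PRICE_PER_1000_ENTRIES_USD = 0.25


def estimate_monthly_cost(entries: int) -> float:
    """Estimate the monthly cost in USD for a given number of entries."""
    # Assumption: the free tier is used up before any entry is billed.
    billable = max(0, entries - FREE_ENTRIES_PER_MONTH)
    return round(billable / 1000 * PRICE_PER_1000_ENTRIES_USD, 2)


print(estimate_monthly_cost(500))     # 0.0
print(estimate_monthly_cost(21_000))  # 5.0
```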

Performance

  • Typical Speed: 100-500 entries per minute
  • Concurrent Processing: Multiple feeds processed simultaneously
  • Memory Usage: ~50MB base + ~10MB per active feed
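
Bounded concurrent processing of this kind is often implemented with an `asyncio.Semaphore`. The sketch below illustrates the pattern with a stand-in fetch function; `process_feeds` and `fake_fetch` are hypothetical, and the Actor's real scheduling may differ.

```python
import asyncio
from typing import Awaitable, Callable, List


async def process_feeds(
    urls: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    concurrency: int = 5,
) -> List[dict]:
    """Run `fetch` for every URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url: str) -> dict:
    """Stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)
    return {"feed_url": url, "entries": 0}


results = asyncio.run(process_feeds(["a", "b", "c"], fake_fetch, concurrency=2))
print([r["feed_url"] for r in results])  # ['a', 'b', 'c']
```

Going by the memory figures above, a run with five active feeds would need roughly 50 MB + 5 × 10 MB = 100 MB.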

⚠️ Limits & Quotas

  • Maximum URLs: 100 URLs per run
  • Maximum Entries: 10,000 entries per feed (configurable)
  • Request Timeout: 300 seconds maximum
  • Rate Limiting: Automatic handling of rate limits
  • File Size: No limit on extracted content
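
The limits above can be checked before a run is started. The following is a small sketch of such a pre-flight check; `check_limits` is a hypothetical helper, and how the Actor itself reacts to out-of-range values is not documented here.

```python
from typing import List

MAX_URLS_PER_RUN = 100
MAX_ENTRIES_PER_FEED = 10_000
MAX_TIMEOUT_SECONDS = 300


def check_limits(urls: List[str], max_entries: int = 0, timeout: int = 30) -> List[str]:
    """Return a list of problems; an empty list means the input is within limits."""
    problems = []
    if len(urls) > MAX_URLS_PER_RUN:
        problems.append(f"{len(urls)} URLs exceeds the limit of {MAX_URLS_PER_RUN} per run")
    if max_entries > MAX_ENTRIES_PER_FEED:
        problems.append(f"maxEntries {max_entries} exceeds {MAX_ENTRIES_PER_FEED} per feed")
    if not 0 < timeout <= MAX_TIMEOUT_SECONDS:
        problems.append(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    return problems
```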

🛠️ Troubleshooting

Common Issues

"No feeds found"

  • Check if the URL is accessible
  • Verify the URL points to a valid RSS/Atom feed
  • Use discoverFeeds: true for website URLs

"Content extraction failed"

  • Some websites block automated access
  • Try with a custom userAgent
  • Check if the article URL is still valid

"Timeout errors"

  • Increase the timeout parameter
  • Reduce maxEntries for large feeds
  • Check network connectivity

Error Handling

The Actor automatically handles:

  • Network timeouts and retries
  • Invalid URLs and feeds
  • Malformed content
  • Rate limiting from websites
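
Retrying after a timeout is usually done with exponential backoff. The sketch below shows that general pattern; the attempt count, the delays, and the `with_retries` helper are assumptions made for illustration and do not describe the Actor's actual retry policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on timeouts and connection errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use it defaults to `time.sleep`.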

📚 Resources