RSS / XML Scraper

Meet the RSS / XML Scraper: the most advanced actor for parsing any RSS feed or XML file. It effortlessly extracts clean, structured data from even the most complex sources. Your ultimate tool for content aggregation, data monitoring, and content analysis.

- Pricing: Pay per usage
- Rating: 5.0 (4 reviews)
- Developer: Shahid Irfan
- Actor stats: 5 bookmarks, 54 total users, 11 monthly active users
- Last modified: 10 days ago
RSS XML Feed Scraper
Extract RSS and Atom feed data into structured datasets for monitoring, research, and content pipelines. Collect feed metadata, article metadata, tags, media links, and optional full article text at scale. Built for reliable feed ingestion with clean output suitable for automation and analytics.
Features
- Feed URL list input — Add one or many feed URLs quickly with string-list input.
- Feed discovery mode — Discover valid feeds from website URLs when needed.
- Full article expansion — Expand snippet-only items into richer full text and HTML.
- Batch processing — Process feed entries in batches for faster and steadier runs.
- Batch dataset writes — Push extracted items in batches for better write throughput.
- Fallback feed support — Runs with a default BBC feed when no URL is provided.
- Proxy-ready input — Optional proxy configuration with Apify Proxy disabled by default.
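For intuition on what parsing an RSS feed involves, here is a minimal stdlib sketch (not the actor's actual implementation) that walks an RSS 2.0 document with `xml.etree.ElementTree` and pulls out the title/link pairs:

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First story</title><link>https://example.com/1</link></item>
  <item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""

def parse_rss(xml_text):
    """Extract a title/link dict for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return entries

print(parse_rss(SAMPLE_RSS))
```

Real-world feeds add namespaces, Atom variants, and malformed markup, which is exactly the complexity the actor handles for you.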
Use Cases
News Monitoring
Track breaking stories from multiple publishers in one scheduled run. Store normalized records for alerts, dashboards, and trend tracking.
Content Aggregation
Collect article headlines, descriptions, publish dates, and links for newsletters and curation workflows. Expand snippets into fuller article text when needed.
Competitive Intelligence
Monitor competitor blogs and media feeds continuously. Compare publishing frequency, topic clusters, and update cadence.
Research Datasets
Build structured datasets for NLP, topic modeling, and sentiment workflows. Export in JSON, CSV, Excel, or XML for downstream analysis.
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | Array | No | `["https://feeds.bbci.co.uk/news/rss.xml"]` | List of feed URLs or website URLs. |
| `extractContent` | Boolean | No | `false` | Forces full article extraction for entries. |
| `autoExpandSnippets` | Boolean | No | `true` | Expands snippet-only feed items to fuller content automatically. |
| `maxEntries` | Integer | No | `20` | Maximum entries per feed. Use `0` to process all entries. |
| `discoverFeeds` | Boolean | No | `false` | If `true`, website URLs are scanned for valid feeds. |
| `userAgent` | String | No | `""` | Optional custom user agent header. |
| `proxyConfiguration` | Object | No | `{ "useApifyProxy": false }` | Optional proxy settings. |
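The defaults in the table can be captured in a small helper when building run inputs programmatically. The `build_input` function below is a hypothetical convenience sketched in Python, not part of the actor:

```python
import json

# Documented defaults from the input parameter table above.
DEFAULT_INPUT = {
    "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "extractContent": False,
    "autoExpandSnippets": True,
    "maxEntries": 20,
    "discoverFeeds": False,
    "userAgent": "",
    "proxyConfiguration": {"useApifyProxy": False},
}

def build_input(**overrides):
    """Merge user overrides onto the documented defaults, with light validation."""
    payload = {**DEFAULT_INPUT, **overrides}
    if not isinstance(payload["urls"], list) or not payload["urls"]:
        raise ValueError("urls must be a non-empty list")
    if payload["maxEntries"] < 0:
        raise ValueError("maxEntries must be 0 (all entries) or a positive integer")
    return payload

print(json.dumps(build_input(maxEntries=50), indent=2))
```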
Output Data
Each dataset item contains feed-level or entry-level fields, depending on `item_type`.
| Field | Type | Description |
|---|---|---|
| `item_type` | String | Record type such as `feed_meta` or `entry`. |
| `feed_url` | String | Source feed URL. |
| `title` | String | Feed title or article title. |
| `link` | String | Feed homepage or article URL. |
| `description` | String | Description or teaser text. |
| `summary` | String | Summary text from feed payload. |
| `content` | String | Parsed content text from feed payload. |
| `author` | String | Primary author name when available. |
| `authors` | Array | Author list when available. |
| `published` | String | Published time in ISO format when available. |
| `updated` | String | Updated time in ISO format when available. |
| `tags` | Array | Categories or tags from feed entries. |
| `source_title` | String | Source/channel title for the entry when present. |
| `source_url` | String | Source/channel URL for the entry when present. |
| `enclosure_url` | String | Enclosure/media URL when present. |
| `image_url` | String | Best available image URL from feed fields. |
| `full_text` | String | Expanded full article text. |
| `full_html` | String | Cleaned full article HTML. |
| `meta_description` | String | Article meta description when available. |
| `top_image` | String | Primary article image when available. |
| `publish_date` | String | Article publish timestamp when available. |
| `content_source` | String | Indicates whether full content came from the feed payload or the article page. |
| `content_error` | String | Full-content extraction error details, if any. |
| `error` | String | Processing error details, if any. |
| `collected_at` | String | Collection timestamp in ISO format. |
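Because every record carries `item_type`, downstream code can split a downloaded dataset into feed metadata and article entries. A small illustrative helper (`split_records` is hypothetical, not provided by the actor):

```python
def split_records(items):
    """Partition dataset items into feed-level metadata and article entries."""
    feeds = [i for i in items if i.get("item_type") == "feed_meta"]
    entries = [i for i in items if i.get("item_type") == "entry"]
    return feeds, entries

# Illustrative sample of mixed dataset records.
sample = [
    {"item_type": "feed_meta", "title": "Example Feed"},
    {"item_type": "entry", "title": "Story A", "tags": ["world"]},
    {"item_type": "entry", "title": "Story B", "tags": []},
]
feeds, entries = split_records(sample)
print(len(feeds), len(entries))  # → 1 2
```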
Usage Examples
Basic Feed Collection
```json
{
  "urls": ["https://feeds.bbci.co.uk/news/rss.xml"]
}
```
Multiple Feeds
```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://www.theguardian.com/world/rss",
    "https://hnrss.org/frontpage"
  ],
  "maxEntries": 50
}
```
Force Full Article Extraction
```json
{
  "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
  "extractContent": true,
  "maxEntries": 25
}
```
Discover Feeds from Website URLs
```json
{
  "urls": ["https://example.com"],
  "discoverFeeds": true
}
```
Proxy Configuration
```json
{
  "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
  "proxyConfiguration": {"useApifyProxy": true}
}
```
Sample Output
```json
{
  "item_type": "entry",
  "feed_url": "https://feeds.bbci.co.uk/news/rss.xml",
  "title": "Example Article Title",
  "link": "https://www.example.com/articles/123",
  "description": "Short teaser from the feed.",
  "summary": "Extended summary text from feed payload.",
  "author": "Reporter Name",
  "published": "2026-04-01T12:20:51.000Z",
  "tags": ["world", "politics"],
  "source_title": "Example News",
  "full_text": "Expanded article text...",
  "content_source": "article_page",
  "collected_at": "2026-04-01T15:30:00.000Z"
}
```
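Timestamps like `published` and `collected_at` are ISO 8601 strings, so standard date tooling applies directly. A small illustrative stdlib sketch computing the lag between publication and collection for the sample record:

```python
import json
from datetime import datetime

# A trimmed-down version of the sample output record.
record = json.loads("""{
  "item_type": "entry",
  "title": "Example Article Title",
  "published": "2026-04-01T12:20:51.000Z",
  "collected_at": "2026-04-01T15:30:00.000Z"
}""")

# ISO timestamps here end in "Z"; rewrite it as "+00:00" so that
# datetime.fromisoformat accepts it on all supported Python versions.
published = datetime.fromisoformat(record["published"].replace("Z", "+00:00"))
collected = datetime.fromisoformat(record["collected_at"].replace("Z", "+00:00"))
lag = collected - published
print(lag)  # → 3:09:09
```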
Tips for Best Results
Use Stable Feed URLs
- Prefer canonical RSS/Atom URLs from publishers.
- Keep feed URL lists clean and deduplicated.
Start with Moderate Limits
- Use `maxEntries` around 20 to validate data quality quickly.
- Increase limits after verifying target feed behavior.
Enable Full Content Only When Needed
- Keep `extractContent` off for lightweight metadata pipelines.
- Turn it on for downstream NLP, summarization, or archiving workflows.
Handle Site Restrictions
- Use `proxyConfiguration` when targets rate-limit requests.
- Set a custom `userAgent` for sites with strict header checks.
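The "clean and deduplicated" advice above can be automated before submitting a run. A small illustrative helper (`clean_urls` is hypothetical, not part of the actor) that drops duplicates and non-HTTP(S) entries while preserving order:

```python
from urllib.parse import urlparse

def clean_urls(urls):
    """Drop non-http(s) URLs and duplicates, preserving first-seen order."""
    seen, cleaned = set(), []
    for url in urls:
        url = url.strip()
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip ftp://, file://, bare hostnames, etc.
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

print(clean_urls([
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://feeds.bbci.co.uk/news/rss.xml",  # duplicate
    "ftp://example.com/feed",                 # wrong scheme
]))
```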
Integrations
Connect extracted feed data with:
- Google Sheets — Build live monitoring sheets.
- Airtable — Create searchable editorial databases.
- Slack — Trigger alerts for new stories or keywords.
- Webhooks — Send records to internal services in real time.
- Make — Automate enrichment and routing flows.
- Zapier — Connect feeds to business tools without code.
Export Formats
- JSON — API and backend workflows.
- CSV — Spreadsheet analytics.
- Excel — Business reporting.
- XML — Legacy system integration.
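For the CSV path, a downloaded list of entry records can be flattened with the stdlib `csv` module. The sample data below is illustrative, not actual actor output:

```python
import csv
import io

# Illustrative entry records with a few of the documented fields.
entries = [
    {"title": "Story A", "link": "https://example.com/a", "published": "2026-04-01T12:00:00Z"},
    {"title": "Story B", "link": "https://example.com/b", "published": "2026-04-01T13:00:00Z"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link", "published"])
writer.writeheader()
writer.writerows(entries)
print(buf.getvalue())
```

In practice the Apify platform exports datasets to these formats directly; a manual conversion like this is only needed for custom pipelines.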
Frequently Asked Questions
What happens if I do not provide any feed URL?
The actor uses a default BBC feed and still produces data.
Can I scrape multiple feeds in one run?
Yes, provide multiple URLs in the `urls` list.
Does it collect feed metadata and entry data?
Yes, output includes feed-level metadata records and entry-level records.
Can it expand short snippets to full text?
Yes, set `extractContent: true` or keep `autoExpandSnippets: true`.
Will it fail if an article page blocks extraction?
No, it falls back to feed-provided summary/content when possible.
Does it support proxies?
Yes, pass `proxyConfiguration`. By default, Apify Proxy is disabled.
Can I use website URLs instead of direct feed URLs?
Yes, enable `discoverFeeds` to discover feed endpoints from website URLs.
Support
For issues or feature requests, use the Apify actor page discussion and support channels.
Legal Notice
This actor is intended for legitimate data collection. You are responsible for complying with website terms, robots policies, and applicable laws in your jurisdiction.
