RSS / XML Scraper

Meet the RSS / XML Scraper: the most advanced actor for parsing any RSS feed or XML file. It effortlessly extracts clean, structured data from even the most complex sources. Your ultimate tool for content aggregation, data monitoring, and content analysis.

Pricing: Pay per usage
Rating: 5.0 (4)
Developer: Shahid Irfan (Maintained by Community)

Actor stats

  • Bookmarked: 5
  • Total users: 54
  • Monthly active users: 11
  • Last modified: 10 days ago

RSS XML Feed Scraper

Extract RSS and Atom feed data into structured datasets for monitoring, research, and content pipelines. Collect feed metadata, article metadata, tags, media links, and optional full article text at scale. Built for reliable feed ingestion with clean output suitable for automation and analytics.

Features

  • Feed URL list input — Add one or many feed URLs quickly with string-list input.
  • Feed discovery mode — Discover valid feeds from website URLs when needed.
  • Full article expansion — Expand snippet-only items into richer full text and HTML.
  • Batch processing — Process feed entries in batches for faster and steadier runs.
  • Batch dataset writes — Push extracted items in batches for better write throughput.
  • Fallback feed support — Runs with a default BBC feed when no URL is provided.
  • Proxy-ready input — Optional proxy configuration with Apify Proxy disabled by default.
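
The kind of feed-to-record extraction listed above can be sketched with Python's standard library. This is an illustrative example of RSS 2.0 parsing in general, not the actor's internal implementation; the sample payload and field names are chosen to mirror the output schema documented below.

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 payload, standing in for a fetched feed body.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com</link>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
      <pubDate>Wed, 01 Apr 2026 12:20:51 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Return (feed_meta, entries) dicts from an RSS 2.0 string."""
    channel = ET.fromstring(xml_text).find("channel")
    meta = {
        "item_type": "feed_meta",
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
    }
    entries = [
        {
            "item_type": "entry",
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]
    return meta, entries

meta, entries = parse_rss(SAMPLE_RSS)
print(meta["title"])       # Example News
print(entries[0]["link"])  # https://example.com/1
```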

Use Cases

News Monitoring

Track breaking stories from multiple publishers in one scheduled run. Store normalized records for alerts, dashboards, and trend tracking.

Content Aggregation

Collect article headlines, descriptions, publish dates, and links for newsletters and curation workflows. Expand snippets into fuller article text when needed.

Competitive Intelligence

Monitor competitor blogs and media feeds continuously. Compare publishing frequency, topic clusters, and update cadence.

Research Datasets

Build structured datasets for NLP, topic modeling, and sentiment workflows. Export in JSON, CSV, Excel, or XML for downstream analysis.

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | Array | No | ["https://feeds.bbci.co.uk/news/rss.xml"] | List of feed URLs or website URLs. |
| extractContent | Boolean | No | false | Forces full article extraction for entries. |
| autoExpandSnippets | Boolean | No | true | Expands snippet-only feed items to fuller content automatically. |
| maxEntries | Integer | No | 20 | Maximum entries per feed. Use 0 to process all entries. |
| discoverFeeds | Boolean | No | false | If true, website URLs are scanned for valid feeds. |
| userAgent | String | No | "" | Optional custom user agent header. |
| proxyConfiguration | Object | No | { "useApifyProxy": false } | Optional proxy settings. |
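
The defaults in the table can be mirrored in a small helper that merges caller overrides onto them before starting a run. The `build_input` function below is a hypothetical convenience for client code, not part of the actor itself:

```python
# Defaults as documented in the input table above.
DEFAULT_INPUT = {
    "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "extractContent": False,
    "autoExpandSnippets": True,
    "maxEntries": 20,
    "discoverFeeds": False,
    "userAgent": "",
    "proxyConfiguration": {"useApifyProxy": False},
}

def build_input(**overrides):
    """Merge caller overrides onto the documented defaults."""
    run_input = dict(DEFAULT_INPUT)
    run_input.update(overrides)
    return run_input

run_input = build_input(maxEntries=50, urls=["https://hnrss.org/frontpage"])
print(run_input["maxEntries"])  # 50
```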

Output Data

Each dataset item contains feed-level or entry-level fields, depending on item_type.

| Field | Type | Description |
| --- | --- | --- |
| item_type | String | Record type such as feed_meta or entry. |
| feed_url | String | Source feed URL. |
| title | String | Feed title or article title. |
| link | String | Feed homepage or article URL. |
| description | String | Description or teaser text. |
| summary | String | Summary text from feed payload. |
| content | String | Parsed content text from feed payload. |
| author | String | Primary author name when available. |
| authors | Array | Author list when available. |
| published | String | Published time in ISO format when available. |
| updated | String | Updated time in ISO format when available. |
| tags | Array | Categories or tags from feed entries. |
| source_title | String | Source/channel title for the entry when present. |
| source_url | String | Source/channel URL for the entry when present. |
| enclosure_url | String | Enclosure/media URL when present. |
| image_url | String | Best available image URL from feed fields. |
| full_text | String | Expanded full article text. |
| full_html | String | Cleaned full article HTML. |
| meta_description | String | Article meta description when available. |
| top_image | String | Primary article image when available. |
| publish_date | String | Article publish timestamp when available. |
| content_source | String | Indicates whether full content came from feed payload or article page. |
| content_error | String | Full-content extraction error details if any. |
| error | String | Processing error details if any. |
| collected_at | String | Collection timestamp in ISO format. |
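
Because feed-level and entry-level records share one dataset, downstream code typically splits them on item_type before processing. A minimal sketch:

```python
def split_records(items):
    """Separate dataset items into feed metadata and article entries."""
    feeds = [it for it in items if it.get("item_type") == "feed_meta"]
    entries = [it for it in items if it.get("item_type") == "entry"]
    return feeds, entries

# Sample items shaped like the dataset records described above.
sample = [
    {"item_type": "feed_meta", "title": "Example News"},
    {"item_type": "entry", "title": "First story"},
    {"item_type": "entry", "title": "Second story"},
]
feeds, entries = split_records(sample)
print(len(feeds), len(entries))  # 1 2
```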

Usage Examples

Basic Feed Collection

{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ]
}

Multiple Feeds

{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://www.theguardian.com/world/rss",
    "https://hnrss.org/frontpage"
  ],
  "maxEntries": 50
}

Force Full Article Extraction

{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "extractContent": true,
  "maxEntries": 25
}

Discover Feeds from Website URLs

{
  "urls": [
    "https://example.com"
  ],
  "discoverFeeds": true
}
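
Feed discovery generally works by scanning a page's head for alternate-link tags that advertise RSS or Atom endpoints. The stdlib sketch below illustrates that general technique; it is not the actor's exact discovery logic, and the sample page is fabricated:

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link> tags that advertise RSS/Atom feeds."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            self.feeds.append(a.get("href"))

PAGE = """<html><head>
<link rel="alternate" type="application/rss+xml" href="https://example.com/feed.xml">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(PAGE)
print(finder.feeds)  # ['https://example.com/feed.xml']
```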

Proxy Configuration

{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Sample Output

{
  "item_type": "entry",
  "feed_url": "https://feeds.bbci.co.uk/news/rss.xml",
  "title": "Example Article Title",
  "link": "https://www.example.com/articles/123",
  "description": "Short teaser from the feed.",
  "summary": "Extended summary text from feed payload.",
  "author": "Reporter Name",
  "published": "2026-04-01T12:20:51.000Z",
  "tags": ["world", "politics"],
  "source_title": "Example News",
  "full_text": "Expanded article text...",
  "content_source": "article_page",
  "collected_at": "2026-04-01T15:30:00.000Z"
}
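
The published and collected_at fields are ISO 8601 strings. In Python they can be parsed with datetime.fromisoformat; replacing the trailing "Z" keeps this working on versions before 3.11, which do not accept the Z suffix:

```python
from datetime import datetime, timezone

def parse_iso(ts):
    """Parse an ISO 8601 timestamp, tolerating a trailing 'Z' on older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

published = parse_iso("2026-04-01T12:20:51.000Z")
print(published.isoformat())  # 2026-04-01T12:20:51+00:00
```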

Tips for Best Results

Use Stable Feed URLs

  • Prefer canonical RSS/Atom URLs from publishers.
  • Keep feed URL lists clean and deduplicated.

Start with Moderate Limits

  • Use maxEntries around 20 to validate data quality quickly.
  • Increase limits after verifying target feed behavior.

Enable Full Content Only When Needed

  • Keep extractContent off for lightweight metadata pipelines.
  • Turn it on for downstream NLP, summarization, or archiving workflows.

Handle Site Restrictions

  • Use proxyConfiguration when targets rate-limit requests.
  • Set a custom userAgent for sites with strict header checks.
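
The userAgent option maps to the standard User-Agent request header. For comparison, setting the same header on a manual stdlib request when debugging a feed looks like this (the bot identifier string is a made-up example):

```python
import urllib.request

URL = "https://feeds.bbci.co.uk/news/rss.xml"
UA = "MyFeedBot/1.0 (+https://example.com/bot)"  # hypothetical identifier

# Build the request with a custom User-Agent header attached.
req = urllib.request.Request(URL, headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # MyFeedBot/1.0 (+https://example.com/bot)
# urllib.request.urlopen(req) would perform the fetch (network call omitted here).
```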

Integrations

Connect extracted feed data with:

  • Google Sheets — Build live monitoring sheets.
  • Airtable — Create searchable editorial databases.
  • Slack — Trigger alerts for new stories or keywords.
  • Webhooks — Send records to internal services in real time.
  • Make — Automate enrichment and routing flows.
  • Zapier — Connect feeds to business tools without code.

Export Formats

  • JSON — API and backend workflows.
  • CSV — Spreadsheet analytics.
  • Excel — Business reporting.
  • XML — Legacy system integration.
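
Outside the platform's built-in exports, entry records can also be flattened to CSV with the stdlib. A quick sketch using in-memory sample records:

```python
import csv
import io

# Sample entry records shaped like the dataset output.
entries = [
    {"title": "First story", "link": "https://example.com/1", "published": "2026-04-01T12:20:51Z"},
    {"title": "Second story", "link": "https://example.com/2", "published": "2026-04-02T09:00:00Z"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link", "published"])
writer.writeheader()
writer.writerows(entries)

csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # title,link,published
```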

Frequently Asked Questions

What happens if I do not provide any feed URL?

The actor uses a default BBC feed and still produces data.

Can I scrape multiple feeds in one run?

Yes, provide multiple URLs in the urls list.

Does it collect feed metadata and entry data?

Yes, output includes feed-level metadata records and entry-level records.

Can it expand short snippets to full text?

Yes, use extractContent: true or keep autoExpandSnippets: true.

Will it fail if an article page blocks extraction?

No, it falls back to feed-provided summary/content when possible.

Does it support proxies?

Yes, pass proxyConfiguration. By default, Apify Proxy is disabled.

Can I use website URLs instead of direct feed URLs?

Yes, enable discoverFeeds to discover feed endpoints from websites.


Support

For issues or feature requests, use the Apify actor page discussion and support channels.


This actor is intended for legitimate data collection. You are responsible for complying with website terms, robots policies, and applicable laws in your jurisdiction.