Web Scraper

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (22)

727

Total users: 83.3k

Monthly users: 4.2k

Runs succeeded: >99%

Issue response: 43 days

Last modified: a month ago

RSS API on Readybot.io: The feed could not be parsed or is empty

Closed

changchiyou opened this issue
a year ago

I tried to follow the tutorial "How to turn any website into an RSS feed".

The data was scraped correctly from the target website and stored in the dataset, but Readybot.io's client shows the error "The feed could not be parsed or is empty" for the URL "https://api.apify.com/v2/actor-tasks/[TASK_ID]/runs/last/dataset/items?token=[YOUR_API_TOKEN]&format=rss". Readybot.io also notes that "The response body may have been truncated." and shows:

[{
"#error": false,
"#debug": {
"requestId": "0hfqY6MHbN19ofn",
"url": "https://wildrift.leagueoflegends.com/zh-tw/news/",
"loadedUrl": "https://wildrift.leagueoflegends.com/zh-tw/news/",
"method": "GET",
"retryCount": 0,
"errorMessages": [],
"statusCode": 200
}
},
{
"url": "https://wildrift.leagueoflegends.com/zh-tw/news/game-updates/wild-rift-patch-notes-5-0c/",
"title": "《激鬥峽谷》5.0c版本更新公告",
"date": "Wed, 13 Mar 2024 00:00:00 GMT",
"guid": "https://wildrift.leagueo

Have I missed something in setting up Web Scraper for RSS?

changchiyou

a year ago

I replaced "https://api.apify.com/v2/actor-tasks/%5BTASK_ID%5D/runs/last/dataset/items?token=%5BYOUR_API_TOKEN%5D&format=rss" with:

  1. https://api.apify.com/v2/actor-tasks/%5BTASK_ID%5D/runs/last/dataset/items?token=%5BYOUR_API_TOKEN%5D&format=rss&clean=true

    W3C - Feed Validation Service

  2. https://api.apify.com/v2/actor-tasks/%5BTASK_ID%5D/runs/last/dataset/items?token=%5BYOUR_API_TOKEN%5D&format=xml&clean=true

    W3C - Feed Validation Service

but it still didn't work. :(

jindrich.bar

Hello @changchiyou and thank you for your interest in this Actor!

The errors you're getting from ReadyBot.io are indeed weird - in the first one (with the "The response body may have been truncated" error message), it seems you're passing a JSON object there. Make sure you only use the URL with the format=rss query parameter whenever you want to get an RSS feed. Even though RSS feeds are technically XML documents, the .xml file you get from Apify when you pick format=xml is not valid RSS.

Now comes the strange part: I just tried making a ReadyBot.io bot with your dataset - and it worked just fine. Try clearing everything you have done until now and create a new ReadyBot.io feed bot with the following RSS URL:

https://api.apify.com/v2/actor-tasks/changchiyou~wildrift-news-zh-tw/runs/last/dataset/items?token=this_is_an_example_token&format=rss

Make sure to replace the token query parameter in the URL (this_is_an_example_token) with your actual Integration token - you'll find that one here in Apify Console - Settings (left menu) - Integrations (tab) - Personal API tokens.
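For reference, here is a minimal JavaScript sketch of how a feed URL of this shape could be assembled. The helper name buildRssFeedUrl and its clean option are hypothetical and only for illustration; the endpoint path and the token, format and clean query parameters are the ones that appear in the URLs in this thread.

```javascript
// Hypothetical helper (not part of any Apify library): builds the
// "last run" dataset URL for a task, requesting the RSS format.
function buildRssFeedUrl(taskId, token, { clean = false } = {}) {
  const url = new URL(
    `https://api.apify.com/v2/actor-tasks/${encodeURIComponent(taskId)}/runs/last/dataset/items`
  );
  // Query parameters as used in the URLs quoted in this thread.
  url.searchParams.set('token', token);
  url.searchParams.set('format', 'rss');
  if (clean) {
    url.searchParams.set('clean', 'true');
  }
  return url.toString();
}

// Example with the placeholder token from the reply above.
const feedUrl = buildRssFeedUrl(
  'changchiyou~wildrift-news-zh-tw',
  'this_is_an_example_token'
);
console.log(feedUrl);
```

With the placeholder values, this should reproduce the example URL given above; swap in your real token before pasting the result into Readybot.io.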

Hope this helps. Let us know if you run into any issues with this approach. Good luck! :)

changchiyou

a year ago

@jindrich.bar Thanks for your reply! But I have to apologize for forgetting to update this issue after I solved this problem.


I fixed the incorrect field names given in https://blog.apify.com/how-to-turn-any-website-into-an-rss-feed-a8f9f216e1b0/:

  1. url -> link
  2. date -> pubDate
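A minimal JavaScript sketch of those two renames, in case it helps others. The helper name toRssItem is hypothetical; the assumption is that each scraped record has url and date fields like the sample item in my first post, and that the date is already a string such as "Wed, 13 Mar 2024 00:00:00 GMT". The same renaming can be done directly in the object returned from the page function.

```javascript
// Hypothetical helper: renames the two fields listed above and keeps
// every other field (title, guid, ...) unchanged.
function toRssItem({ url, date, ...rest }) {
  return {
    ...rest,
    link: url,     // url  -> link
    pubDate: date, // date -> pubDate (passed through as-is)
  };
}

// Example record shaped like the sample item from the first post.
const item = toRssItem({
  url: 'https://wildrift.leagueoflegends.com/zh-tw/news/game-updates/wild-rift-patch-notes-5-0c/',
  title: '《激鬥峽谷》5.0c版本更新公告',
  date: 'Wed, 13 Mar 2024 00:00:00 GMT',
});
console.log(item);
```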

After that, the first replacement URL worked for me once I reran the task and got a new dataset (I haven't tried it without clean=true):

https://api.apify.com/v2/actor-tasks/%5BTASK_ID%5D/runs/last/dataset/items?token=%5BYOUR_API_TOKEN%5D&format=rss&clean=true

I believe I didn't wait long enough after first running the task, which is why I initially got the "The feed could not be parsed or is empty" error (the dataset was still empty).

Although I'm not sure which change was the actual fix (the field names or the URL parameter), the issue was resolved yesterday.