Facebook Pages Scraper

7 days trial then $20.00/month - No credit card required now

apify/facebook-pages-scraper

Facebook scraping tool to crawl and extract basic data from one or multiple Facebook Pages. Extract Facebook page name, page URL address, category, likes, check-ins, and other public data. Download data in JSON, CSV, Excel and use it in apps, spreadsheets, and reports.

Multiple retries instead of instant fail

Closed

Ernest Bursa (ernest) opened this issue
5 months ago

We found that the scraper stays in a retry loop instead of failing when given a bad URL.

2023-11-28T21:42:52.475Z ACTOR: Pulling Docker image of build n0sxaG0Cxh0Ctm7cr from repository.
2023-11-28T21:42:52.577Z ACTOR: Creating Docker container.
2023-11-28T21:42:52.645Z ACTOR: Starting Docker container.
2023-11-28T21:43:00.291Z INFO  System info {"apifyVersion":"3.1.13","apifyClientVersion":"2.8.4","crawleeVersion":"3.6.2","osType":"Linux","nodeVersion":"v16.20.2"}
2023-11-28T21:43:00.650Z INFO  Decoding 1 Facebook URLs
2023-11-28T21:43:00.700Z INFO  CheerioCrawler: Starting the crawler.
2023-11-28T21:43:06.108Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:06.109Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":1}
2023-11-28T21:43:11.778Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:11.787Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":2}
2023-11-28T21:43:16.725Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:16.731Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":3}
2023-11-28T21:43:21.457Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:21.458Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":4}
2023-11-28T21:43:27.441Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:27.442Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":5}
2023-11-28T21:43:31.913Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:31.915Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":6}
2023-11-28T21:43:37.175Z WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:37.177Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":7}
2023-11-28T21:43:37.364Z ACTOR: The Actor run has reached the timeout of 45 seconds, aborting it. You can increase the timeout in Settings > Run options.
User avatar

Hi! Currently the only detectable case is a page that was deliberately removed from Facebook (deleted by its admin or banned by Meta). In all other cases it's impossible to tell whether the page is unavailable because of random blocking or permanently. I'm going to close the issue now, but if there is anything else we can help with, please let us know.

User avatar

Sorry, I was not clear. I'd like the worker to fail once it detects that the page will not return any results, instead of retrying indefinitely and increasing the scraping cost. I hope you will have a good idea of how to address this. Thank you!

User avatar

As Alexey wrote above - this error does not mean that the request will not fetch any data, it could be a random blocking, or some other random issue, meaning that eventually, it WILL fetch the data. Failing the run for each error like this means failing both 'bad' and 'good' pages.

Basically, what I am trying to say - there's currently no certain way to detect whether the page will present the results or not.
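The reasoning above can be sketched as a small decision function. This is only an illustration: the outcome labels and the `decide` helper are hypothetical names, not the Actor's actual code, and they assume that a removed page is the one outcome that can be recognised as permanent.

```typescript
// Hypothetical outcome labels; the Actor's real error types are not shown in this thread.
type PageOutcome = "removed" | "access_error" | "ok";
type Action = "fail" | "retry" | "done";

// Map what was observed on one attempt to what the crawler should do next.
function decide(outcome: PageOutcome): Action {
    switch (outcome) {
        case "removed":
            return "fail"; // the only case that can be detected as permanent
        case "access_error":
            return "retry"; // may be random blocking, so a later attempt can still succeed
        case "ok":
            return "done";
    }
}

// A "good" page that is blocked twice before it finally loads.
const attemptsForGoodPage: PageOutcome[] = ["access_error", "access_error", "ok"];
const actions: Action[] = attemptsForGoodPage.map(decide);
console.log(actions);
```

If `access_error` mapped to `fail` instead, this page would have been dropped on the first attempt even though its data was reachable.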

User avatar

Could we expose a parameter for the maximum number of retries? For me, up to 3 retries makes sense; after that, the request should fail. I'm fine with a higher default, as long as I can override it as a user.
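A minimal sketch of what such an overridable retry limit could look like. The default of 10, the `RetryInput` shape, and both helper functions are assumptions made for illustration; the thread does not state the Actor's real default or how it is implemented.

```typescript
// Hypothetical default; the Actor's real default is not stated in this thread.
const DEFAULT_MAX_REQUEST_RETRIES = 10;

interface RetryInput {
    // Optional user override of the retry limit.
    maxRequestRetries?: number;
}

// Use the user's value when it is a non-negative integer, otherwise the default.
function resolveMaxRetries(input: RetryInput): number {
    const value = input.maxRequestRetries;
    if (value !== undefined && Number.isInteger(value) && value >= 0) {
        return value;
    }
    return DEFAULT_MAX_REQUEST_RETRIES;
}

// Call `attempt` until it succeeds or the retry budget is used up.
// One initial attempt plus `maxRetries` retries; rethrows the last error on failure.
function runWithRetries<T>(attempt: () => T, maxRetries: number): { result: T; attempts: number } {
    let lastError: unknown;
    for (let attempts = 1; attempts <= maxRetries + 1; attempts++) {
        try {
            return { result: attempt(), attempts };
        } catch (error) {
            lastError = error;
        }
    }
    throw lastError;
}
```

With a limit of 3, a page that keeps returning an access error is given up on after four attempts in total, which bounds the cost of a bad URL.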

User avatar

Hi! Done:

  • Added support for a custom maxRequestRetries in the JSON input
  • Changed the error message to "Page access was blocked or page is not available, retrying with new session"

Sample run: https://console.apify.com/view/runs/oIVV0Q5iaRzWqiSQV

I'm going to close the issue now, but if there is anything else we can help with, please let us know.

User avatar

I'm not sure the schema was updated to match the change. There is no way to specify maxRequestRetries in the UI.

User avatar

It's available in the JSON input. The default should work for most users, and there's a reason for keeping the number higher (based on testing, etc.), so the schema does not include it. But you can click 'Switch to JSON editor' on the Input tab, add this option, and start the run.

Alexey, correct me if I'm wrong. We should also make sure it's reflected in the Readme.
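As a rough illustration of the JSON editor step, the snippet below builds an input object and prints the text to paste in. Only `maxRequestRetries` is named in this thread; the `startUrls` field and its shape are assumptions and should be checked against the Actor's input documentation.

```typescript
// Assumed shape of the Actor input; only `maxRequestRetries` is confirmed by this thread.
interface ActorInput {
    startUrls: { url: string }[]; // assumed field name
    maxRequestRetries?: number;
}

const actorInput: ActorInput = {
    startUrls: [{ url: "https://www.facebook.com/WithlovefromLeah" }],
    maxRequestRetries: 3, // the limit requested earlier in the thread
};

// Text to paste into the JSON editor on the Input tab.
const inputJson: string = JSON.stringify(actorInput, null, 2);
console.log(inputJson);
```

Keep in mind the maintainers' caution in this thread that a low limit can lead to data loss, since a page that was only temporarily blocked is given up on sooner.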

User avatar

Andrey, confirmed: a custom maxRequestRetries is a special use case and should be kept out of the UI input form. A low retry limit might lead to data loss, which is why it should not be exposed in the visual input, and the JSON option should not be recommended in the Readme either.

Developer
Maintained by Apify
Actor metrics
  • 486 monthly users
  • 97.1% runs succeeded
  • 1.3 days response time
  • Created in Feb 2020
  • Modified about 17 hours ago