7 days trial then $20.00/month - No credit card required now
Facebook Pages Scraper
A Facebook scraping tool to crawl and extract basic data from one or multiple Facebook Pages. Extract the Facebook page name, page URL, category, likes, check-ins, and other public data. Download the data in JSON, CSV, or Excel and use it in apps, spreadsheets, and reports.
We found that the scraper stays in a retry loop instead of failing when provided with a bad URL.
2023-11-28T21:42:52.475Z ACTOR: Pulling Docker image of build n0sxaG0Cxh0Ctm7cr from repository.
2023-11-28T21:42:52.577Z ACTOR: Creating Docker container.
2023-11-28T21:42:52.645Z ACTOR: Starting Docker container.
2023-11-28T21:43:00.291Z INFO System info {"apifyVersion":"3.1.13","apifyClientVersion":"2.8.4","crawleeVersion":"3.6.2","osType":"Linux","nodeVersion":"v16.20.2"}
2023-11-28T21:43:00.650Z INFO Decoding 1 Facebook URLs
2023-11-28T21:43:00.700Z INFO CheerioCrawler: Starting the crawler.
2023-11-28T21:43:06.108Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:06.109Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":1}
2023-11-28T21:43:11.778Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:11.787Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":2}
2023-11-28T21:43:16.725Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:16.731Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":3}
2023-11-28T21:43:21.457Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:21.458Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":4}
2023-11-28T21:43:27.441Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:27.442Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":5}
2023-11-28T21:43:31.913Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:31.915Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":6}
2023-11-28T21:43:37.175Z WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Facebook access error
2023-11-28T21:43:37.177Z     at validatedUrlHandler (file:///usr/src/app/src/main.js:83:19) {"id":"WcE27lkdE8DXhVY","url":"https://www.facebook.com/WithlovefromLeah","retryCount":7}
2023-11-28T21:43:37.364Z ACTOR: The Actor run has reached the timeout of 45 seconds, aborting it. You can increase the timeout in Settings > Run options.
Hi! Currently the only detectable case is when a page was removed from Facebook by an action (deleted by the admin or banned by Meta). For all other cases it's impossible to tell whether a page is unavailable because of random blocking or permanently. I'm going to close the issue now, but if there is anything else we can help with, please let us know.
Sorry, I was not clear - I'd like the worker to fail after detecting that the page will not return any results, instead of retrying indefinitely and increasing the scraping cost. I hope you will have a good idea of how to address this. Thank you!
As Alexey wrote above - this error does not mean that the request will not fetch any data, it could be a random blocking, or some other random issue, meaning that eventually, it WILL fetch the data. Failing the run for each error like this means failing both 'bad' and 'good' pages.
Basically, what I am trying to say is that there's currently no reliable way to detect whether the page will return results or not.
Could we expose a parameter for how many times we would like to retry? For me, up to 3 retries makes sense; otherwise it should fail. But I'm okay with the default being higher, as long as I, as a user, am able to override it.
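The capped-retry behavior being requested here can be sketched as a small helper. This is an illustration only - `fetchWithRetries` and `fetchPage` are hypothetical names, not the actor's actual code:

```javascript
// Hypothetical sketch of the requested behavior: retry a failing page
// fetch up to a user-configurable maximum, then fail permanently
// instead of retrying until the run times out.
async function fetchWithRetries(fetchPage, url, maxRequestRetries = 3) {
  let lastError;
  // One initial attempt plus up to maxRequestRetries retries.
  for (let attempt = 0; attempt <= maxRequestRetries; attempt++) {
    try {
      return await fetchPage(url);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  // Retries exhausted: surface the error so the run can fail fast.
  throw lastError;
}
```

With `maxRequestRetries` set to 3, a permanently blocked page would fail after four attempts instead of looping until the 45-second timeout.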
Hi! Done:
- Allowed specifying a custom maxRequestRetries in the JSON input
- Changed the error prompt to "Page access was blocked or page is not available, retrying with new session"

Sample run: https://console.apify.com/view/runs/oIVV0Q5iaRzWqiSQV
I'm going to close the issue now, but if there is anything else we can help with, please let us know.
I'm not sure if you updated the schema to reflect the change. There is no way to specify maxRequestRetries in the UI.
It's updated in the JSON input. The default should work for most users, and there needs to be a reason to use a different number (after testing, etc.), so the schema does not include it. But you can 'Switch to JSON editor' on the Input tab, add this option, and start the run.
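For reference, an input along these lines should work in the JSON editor. Note that the shape of the `startUrls` field here is an assumption based on typical Apify actor inputs; check the actor's Input tab for the actual field names:

```json
{
  "startUrls": [{ "url": "https://www.facebook.com/WithlovefromLeah" }],
  "maxRequestRetries": 3
}
```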
Alexey, correct me if I am wrong; also, we should make sure it's reflected in the Readme.
Andrey, confirmed - custom maxRequestRetries is a special use case and should stay out of the UI input form. A low retry limit might lead to data loss, which is why it should not be exposed in the visual input, and the JSON option should not be recommended in the Readme.
- 486 monthly users
- 97.1% runs succeeded
- 1.3 days response time
- Created in Feb 2020
- Modified about 17 hours ago