Leboncoin extractor
3 days trial then $30.00/month - No credit card required now

anchor/leboncoin
Extract information from leboncoin.fr: no limitation, you get fast results in CSV, Excel... or API format. The best scraping tool for leboncoin.

Can't get any results with Leboncoin Extractor

Closed

BasileDataimo opened this issue
a year ago

Hi !

I just tried to retrieve an ad on LBC (leboncoin), but Apify responds with "No results". Is this actor still working, or has it been discontinued?

Best regards,

BasileDataimo

a year ago

Run id: https://console.apify.com/actors/xke8akCiaoyOQmnFg/runs/QueUCCpPl0zOFKuF5#output
Test url: https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
Log error:

2023-05-25T12:56:33.633Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: net::ERR_TOO_MANY_RETRIES at https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
2023-05-25T12:56:33.635Z     at navigate (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:235:23)
2023-05-25T12:56:33.637Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)
2023-05-25T12:56:33.640Z     at async Frame.goto (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:205:21)
2023-05-25T12:56:33.642Z     at async CDPPage.goto (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:1053:16)
2023-05-25T12:56:33.644Z     at async PuppeteerCrawler._handleNavigation (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:285:40)
2023-05-25T12:56:33.646Z     at async PuppeteerCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/browser/internals/browser-crawler.js:227:13)
2023-05-25T12:56:33.648Z     at async PuppeteerCrawler._runRequestHandler (/home/myuser/node_modules/@crawlee/puppeteer/internals/puppeteer-crawler.js:110:9)
2023-05-25T12:56:33.650Z     at async wrap (/home/myuser/node_modules/@apify/timeout/index.js:52:21) {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","method":"GET","uniqueKey":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm"}

guillim (anchor)

It works on my side, so let me try to debug your case. Could you send me the "INPUT" of your last run so that I can reproduce it?

BasileDataimo

a year ago

Hello Guillim,

I just started a run with the same settings as you (residential proxy), the default function from your documentation, and the search page URL: https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2 I don't get any results. Here is the input:

{
  "pageFunction": "async function pageFunction(context) {\n    let data = {}\n    let userData = context.request.userData\n    data.url = context.request.url\n    data.label = userData.label\n    // data.title = await context.page.title();\n    // context.log.info(data.title);\n\n    if(userData && userData.label === 'product'){   \n        context.log.info('label product.');     \n        data.img = await context.page.locator('[data-qa-id=adview_spotlight_container] img >> nth=0').getAttribute('src')\n        data.title = await context.page.locator('[data-qa-id=adview_title] >> nth=0').innerText()\n        data.price = await context.page.locator('[data-qa-id=adview_price] >> nth=0').innerText()\n        data.date = await context.page.locator('[data-qa-id=adview_date] >> nth=0').innerText()\n        data.description = await context.page.locator('[data-qa-id=adview_description_container] >> nth=0').innerText()\n        // data.link = userData.link\n    }else{\n        context.log.info('not label product, so search or pagination.');\n        let products = []\n        // we are looking for product to be queued, let's write it down\n        userData.label = 'product';\n        const elements = context.page.locator('[data-qa-id=aditem_container]');\n        const links = await elements.evaluateAll(elems => elems.map(elem => \"https://www.leboncoin.fr\"+elem.getAttribute('href')));\n        // await context.enqueueRequest('https://www.leboncoin.fr/recherche?category=21&text=got&price=17-50', {test : 'test'}, false);\n        links.forEach(async link => {\n            await context.enqueueRequest(link, userData , false);\n        })\n        // data.products = products\n    }\n    context.log.info(`function ended`);\n    return data;\n}\n",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "FR"
  },
  "startUrls": [
    {
      "url": "https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"
    }
  ]
}

Here are the logs:

2023-05-26T08:55:40.992Z ACTOR: Pulling Docker image from repository.
2023-05-26T08:55:41.697Z ACTOR: Creating Docker container.
2023-05-26T08:55:41.973Z ACTOR: Starting Docker container.
2023-05-26T08:55:43.277Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-26T08:55:43.281Z Executing main command
2023-05-26T08:55:45.155Z INFO  System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-26T08:55:45.873Z INFO  PuppeteerCrawler: Starting the crawl
2023-05-26T08:56:45.872Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60354,"retryHistogram":[]}
2023-05-26T08:56:45.880Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.077},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":0.905},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:56:46.354Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-26T08:57:45.875Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120354,"retryHistogram":[]}
2023-05-26T08:57:45.885Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:57:49.951Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-26T08:58:45.873Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180354,"retryHistogram":[]}
2023-05-26T08:58:45.889Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:58:54.367Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-26T08:59:45.874Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240355,"retryHistogram":[]}
2023-05-26T08:59:45.891Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:59:58.081Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-26T08:59:58.138Z INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-26T08:59:58.417Z INFO  PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60243,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60243,"requestsTotal":1,"crawlerRuntimeMillis":252898}
2023-05-26T08:59:58.418Z INFO  PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-26T08:59:58.420Z Crawler finished.
2023-05-26T08:59:58.421Z INFO  Actor finished successfully (exit code 0)
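
For readers trying to interpret the final statistics in this log: the `retryHistogram` value `[null,null,null,1]` can be read, as far as I understand Crawlee's statistics, as "index = number of retries, value = number of requests that needed that many retries". The sketch below rests on that assumption, and the helper name `describeRetryHistogram` is hypothetical.

```javascript
// Sketch: interpret a Crawlee-style retryHistogram.
// Assumption: index = retry count, value = number of requests with that count.
function describeRetryHistogram(histogram) {
  const rows = [];
  histogram.forEach((count, retries) => {
    // Unused slots show up as null in the JSON log output, so skip them.
    if (count) rows.push({ retries, requests: count });
  });
  return rows;
}

// Value copied from the "Crawl finished" line of the log.
const rows = describeRetryHistogram([null, null, null, 1]);
console.log(rows);
```

Under that reading, the single request in this run was retried three times before being marked as failed, which matches the three "Reclaiming failed request" warnings.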

Please find the screenshot attached.

Best regards,

BasileDataimo

a year ago

I also tried with a single ad URL: "https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm" Same result. Logs:

2023-05-26T08:53:19.241Z ACTOR: Pulling Docker image from repository.
2023-05-26T08:53:19.399Z ACTOR: Creating Docker container.
2023-05-26T08:53:19.723Z ACTOR: Starting Docker container.
2023-05-26T08:53:20.276Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-26T08:53:20.277Z Executing main command
2023-05-26T08:53:21.783Z INFO  System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-26T08:53:22.943Z INFO  PuppeteerCrawler: Starting the crawl
2023-05-26T08:53:58.542Z INFO  PuppeteerCrawler: handling: https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
2023-05-26T08:54:02.540Z INFO  PuppeteerCrawler: not label product, so search or pagination.
2023-05-26T08:54:02.541Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. TypeError: context.page.locator is not a function
2023-05-26T08:54:02.542Z     at pageFunction (evalmachine.<anonymous>:22:39)
2023-05-26T08:54:02.543Z     at file:///home/myuser/main.js:28:24
2023-05-26T08:54:02.544Z     at runMicrotasks (<anonymous>)
2023-05-26T08:54:02.544Z     at processTicksAndRejections (node:internal/process/task_queues:96:5)
2023-05-26T08:54:02.545Z     at async wrap (/home/myuser/node_modules/@apify/timeout/index.js:52:21) {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":1}
2023-05-26T08:54:22.943Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60360,"retryHistogram":[]}
2023-05-26T08:54:22.947Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.02},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.036},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:55:06.045Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":2}
2023-05-26T08:55:22.943Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120361,"retryHistogram":[]}
2023-05-26T08:55:22.948Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:56:10.133Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","retryCount":3}
2023-05-26T08:56:22.943Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180360,"retryHistogram":[]}
2023-05-26T08:56:22.953Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-26T08:57:13.378Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"oZ4vpWZEiS02ftY","url":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm","method":"GET","uniqueKey":"https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm"}
2023-05-26T08:57:13.447Z INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-26T08:57:13.861Z INFO  PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60129,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60129,"requestsTotal":1,"crawlerRuntimeMillis":231278}
2023-05-26T08:57:13.862Z INFO  PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-26T08:57:13.863Z Crawler finished.
2023-05-26T08:57:13.864Z INFO  Actor finished successfully (exit code 0)
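
The `TypeError: context.page.locator is not a function` in this log suggests that the custom function relies on a Playwright-style `locator` API (including the `>> nth=0` selector suffix), while the log shows the actor running a `PuppeteerCrawler` whose page object evidently did not expose `locator` in this run. Below is a minimal sketch of how the "product" branch could be written with Puppeteer's `page.$eval` instead. It assumes `context.page` is a Puppeteer `Page`; the helper names `firstText` and `extractAd` are hypothetical, and the selectors are taken from the posted input, so they may no longer match the site.

```javascript
// Hypothetical helper: read the trimmed text of the first element matching
// `selector` via Puppeteer's page.$eval. Returns null if the lookup fails
// (page.$eval rejects when the selector matches nothing).
async function firstText(page, selector) {
  try {
    return await page.$eval(selector, (el) => el.innerText.trim());
  } catch (err) {
    return null;
  }
}

// Sketch of the "product" branch, rewritten for a Puppeteer page.
async function extractAd(page) {
  return {
    title: await firstText(page, '[data-qa-id=adview_title]'),
    price: await firstText(page, '[data-qa-id=adview_price]'),
    date: await firstText(page, '[data-qa-id=adview_date]'),
    description: await firstText(page, '[data-qa-id=adview_description_container]'),
  };
}
```

Inside a page function this would be called as `await extractAd(context.page)`. Note that this only addresses the `TypeError`; the later "Navigation timed out" retries in the same log are a separate problem.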

BasileDataimo

a year ago

Can you give me your input so I can copy/paste it and test with the same data?

guillim (anchor)

I think I found out where the problem is. It comes from the customisation of the "Function" that runs on Leboncoin. Your configuration is almost correct.

What to do: On the Apify platform, click on your "Leboncoin extractor" Actor, open the "Source" tab, then the "Input" sub-tab. At the bottom of this page, you should be able to click "Restore default input", as in the attached screenshot. After clicking, you should see that the "Function" has changed. You will need to re-enter the URL you want to scrape ( https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2 )

If that does not work, copy the following and paste it into the "Function" editor:

async function pageFunction(context) {
    let data = {}
    let userData = context.request.userData
    data.url = context.request.url
    data.label = userData.label

    let items = await context.page.evaluate(() => {
        const item = $('[data-qa-id=aditem_container]')
        const itemInfo = item.map(function(i,elem) {
            let obj = {}
            obj.title = $(this).find('[data-qa-id=aditem_title]').text()
            obj.price = $(this).find('[data-test-id=price]').text()
            obj.location = $(this).find('span').filter(function() { return this.title.match(/[0-9]{5}/);}).text()
            obj.date = $(this).find('span').filter(function() { return this.title.match(/:/);}).text()
            obj.img = $(this).find('[data-test-id=adcard-consumer-goods-list] img').attr('src')
            obj.rank = i+1
            return obj
        }).get()
        return itemInfo
    })
    let itemsWithDataProp = items.map(obj => {
        for(const key of Object.keys(data) ){
            obj[key] = data[key]
        }
        return obj
    })
    return itemsWithDataProp;
}
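
The last step of this function copies the request-level fields (`url` and `label`) onto every scraped item. A standalone sketch of that step, runnable outside the browser, may make it easier to follow. The helper name `attachRequestFields` and the sample values are made up for illustration; unlike the loop in the function, this version returns new objects rather than modifying the items in place.

```javascript
// Hypothetical helper mirroring the final items.map(...) step of the function.
function attachRequestFields(items, data) {
  // Object spread copies every key of `data` onto a copy of each item;
  // keys from `data` win if both objects define the same key.
  return items.map((item) => ({ ...item, ...data }));
}

// Made-up sample values, for illustration only.
const requestData = {
  url: 'https://www.leboncoin.fr/recherche?category=9',
  label: undefined,
};
const scrapedItems = [{ title: 'Example ad', rank: 1 }];

console.log(attachRequestFields(scrapedItems, requestData));
```

Separately, the `$` calls inside `context.page.evaluate` run in the browser page, so the function presumably relies on jQuery being available there (an assumption; the thread does not say how it is injected).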

BasileDataimo

a year ago

Hello Guillim,

This morning I tried "Restore default input". I didn't get any results either. Here is the input:

{
  "pageFunction": "async function pageFunction(context) {\n    let data = {}\n    let userData = context.request.userData\n    data.url = context.request.url\n    data.label = userData.label\n\n    let items = await context.page.evaluate(() => {\n        const item = $('[data-qa-id=aditem_container]')\n        const itemInfo = item.map(function(i,elem) {\n            let obj = {}\n            obj.title = $(this).find('[data-qa-id=aditem_title]').text()\n            obj.price = $(this).find('[data-test-id=price]').text()\n            obj.location = $(this).find('span').filter(function() { return this.title.match(/[0-9]{5}/);}).text()\n            obj.date = $(this).find('span').filter(function() { return this.title.match(/:/);}).text()\n            obj.img = $(this).find('[data-test-id=adcard-consumer-goods-list] img').attr('src')\n            obj.rank = i+1\n            return obj\n        }).get()\n        return itemInfo\n    })\n    let itemsWithDataProp = items.map(obj => { \n        for(const key of Object.keys(data) ){\n            obj[key] = data[key]\n        }\n        return obj\n    })\n    return itemsWithDataProp;\n}\n",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "FR"
  },
  "startUrls": [
    {
      "url": "https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"
    }
  ]
}

And the log :

2023-05-30T08:53:29.483Z ACTOR: Pulling Docker image from repository.
2023-05-30T08:53:29.638Z ACTOR: Creating Docker container.
2023-05-30T08:53:29.780Z ACTOR: Starting Docker container.
2023-05-30T08:53:30.508Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-30T08:53:30.511Z Executing main command
2023-05-30T08:53:32.233Z INFO  System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-30T08:53:32.923Z INFO  PuppeteerCrawler: Starting the crawl
2023-05-30T08:54:32.923Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60380,"retryHistogram":[]}
2023-05-30T08:54:32.927Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:54:33.349Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-30T08:55:32.923Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120380,"retryHistogram":[]}
2023-05-30T08:55:32.930Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:55:36.803Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-30T08:56:32.923Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180380,"retryHistogram":[]}
2023-05-30T08:56:32.935Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0.075},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:56:40.603Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-30T08:57:32.923Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240380,"retryHistogram":[]}
2023-05-30T08:57:32.939Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T08:57:44.058Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-30T08:57:44.119Z INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-30T08:57:44.374Z INFO  PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60273,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60273,"requestsTotal":1,"crawlerRuntimeMillis":251831}
2023-05-30T08:57:44.377Z INFO  PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-30T08:57:44.379Z Crawler finished.
2023-05-30T08:57:44.382Z INFO  Actor finished successfully (exit code 0)

BasileDataimo

a year ago

Then I tried with a copy-paste of your input, and I still get the same log:

2023-05-30T17:00:11.545Z ACTOR: Pulling Docker image from repository.
2023-05-30T17:00:12.313Z ACTOR: Creating Docker container.
2023-05-30T17:00:12.487Z ACTOR: Starting Docker container.
2023-05-30T17:00:16.156Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2023-05-30T17:00:16.160Z Executing main command
2023-05-30T17:00:19.200Z INFO  System info {"apifyVersion":"3.1.2","apifyClientVersion":"2.6.2","crawleeVersion":"3.2.2","osType":"Linux","nodeVersion":"v16.19.0"}
2023-05-30T17:00:20.807Z INFO  PuppeteerCrawler: Starting the crawl
2023-05-30T17:01:11.853Z INFO  PuppeteerCrawler: handling: https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2
2023-05-30T17:01:20.808Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":60679,"retryHistogram":[]}
2023-05-30T17:01:21.856Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Waiting for selector `h1` failed: Waiting failed: 10000ms exceeded
2023-05-30T17:01:21.859Z  {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":1}
2023-05-30T17:01:30.809Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:02:20.807Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":120679,"retryHistogram":[]}
2023-05-30T17:02:25.439Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":2}
2023-05-30T17:02:30.810Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.019},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:03:20.808Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":180679,"retryHistogram":[]}
2023-05-30T17:03:28.934Z WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","retryCount":3}
2023-05-30T17:03:30.812Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:04:20.808Z INFO  Statistics: PuppeteerCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":240680,"retryHistogram":[]}
2023-05-30T17:04:30.814Z INFO  PuppeteerCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2023-05-30T17:04:32.575Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. {"id":"6mcnX8ZTLKwk0bB","url":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2","method":"GET","uniqueKey":"https://www.leboncoin.fr/recherche?category=9&locations=r_12&real_estate_type=1%2C2"}
2023-05-30T17:04:32.628Z INFO  PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
2023-05-30T17:04:32.912Z INFO  PuppeteerCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":60091,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":60091,"requestsTotal":1,"crawlerRuntimeMillis":252783}
2023-05-30T17:04:32.914Z INFO  PuppeteerCrawler: Error analysis: {"totalErrors":1,"uniqueErrors":1,"mostCommonErrors":["1x: Navigation timed out after 60 seconds. (/home/myuser/node_modules/@crawlee/core/crawlers/crawler_utils.js:13:11)"]}
2023-05-30T17:04:32.916Z Crawler finished.
2023-05-30T17:04:32.949Z INFO  Actor finished successfully (exit code 0)

guillim (anchor)

OK. I think I will have to ping Apify on this one. I don't have enough information to help you right away. I will let you know when they get back to me.

BasileDataimo

a year ago

Thanks Guillim!

Don't hesitate if you want me to do more tests. Perhaps Apify can share our runs with you so you can inspect the data?

Best regards,

guillim (anchor)

10 months ago

Still working on it (I am not letting you down 😅)

guillim (anchor)

10 months ago

Apify and I found out that leboncoin was blocking your actor, and we don't really know why my runs are not blocked. So I strengthened the actor's ability to bypass anti-scraping protection, to give you a better chance of success. The new version of the actor has just been released. Could you give it a shot? (Make sure you are on the latest version of the actor before testing.)

Let me know if it works for you !

See attached what leboncoin triggers on your side

BasileDataimo

10 months ago

Hello Guillim,

It works, thank you! I managed to get results with a 100% success rate across my 2 runs.

But I have some questions for you:

  1. As I understand it, to get the details of an ad, I first need to scrape a list page (around 38 results per page) and then scrape the individual URL of each of the 38 ads. Am I correct?
  2. What is your recommendation regarding usage of your actor? In particular, do you recommend a crawl frequency? Do you recommend crawling multiple pages in the same call, or making multiple calls?
  3. In terms of time and cost: scraping a single page took 55s ($0.045), and scraping 3 other pages took 6m25s ($0.394). Where does the difference come from? Are the duration and price of a run always this variable?

Best regards,

guillim (anchor)

10 months ago

Good to hear that!

  1. Yes, exactly.
  2. To avoid a ban, it is best to avoid parallel crawling. But it's really up to you; you can test and see.
  3. It depends on how many tries the crawler needs before it bypasses leboncoin's anti-scraping protection. The better leboncoin's protection is, the longer it may take. Enjoy!

BasileDataimo

10 months ago

Hello Guillim,

Thanks for those details.

I will do some more tests, but I'm afraid the costs will be too high... My current estimate is around $200/day just to retrieve the new ads (estimated at around 2000/day). Then we need to add the cost of all searches (around $30/day for a search every 5 minutes), and the cost of refreshing old ads to see if they are still online (more than $400/day if we check them every 2 weeks).
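
As a rough sanity check on these figures, here is a minimal sketch of the arithmetic. It assumes the two per-page run costs reported earlier in this thread (55s for $0.045, and 3 pages for $0.394) bound the cost of a search run. The variable names are illustrative, and the implied ad stock is derived from the estimates above, not a number anyone measured.

```javascript
// Per-page run costs observed earlier in this thread (USD).
const cheapRun = 0.045;      // 1 page in 55 s
const costlyRun = 0.394 / 3; // 3 pages in 6 min 25 s, averaged per page

// One search every 5 minutes.
const searchesPerDay = (24 * 60) / 5; // 288

// Range for the daily search cost under the two observed per-page costs.
const searchCostRange = {
  low: searchesPerDay * cheapRun,
  high: searchesPerDay * costlyRun,
};

// Per-ad cost implied by the estimate of $200/day for about 2000 new ads.
const impliedCostPerAd = 200 / 2000;
const adsPerDollar = 2000 / 200;

// Number of ads that a $400/day refresh budget covers on a 14-day cycle,
// at the implied per-ad cost. Derived figure, not one stated in the thread.
const impliedAdStock = 400 * adsPerDollar * 14;

console.log({ searchesPerDay, searchCostRange, impliedCostPerAd, impliedAdStock });
```

Under these assumptions the $30/day search estimate falls between the low and high bounds, so the spread between runs matters more than the number of searches.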

Do you see any way to decrease these costs by crawling multiple ads in a row in the same run? Or any other means you can think of?

Best regards,

guillim (anchor)

10 months ago

I agree with you: if your calculation is correct, that’s way too much.

There are different tricks I had to set up to bypass Leboncoin's anti-scraping protection. One of them is launching a real browser instead of just a simulation, and it costs more. But it’s one of the ways to make sure it works.

Depending on how you balance “make sure it works” against “reduce cost”, I could remove this feature.

Honestly, it’s really hard to find a cheap solution when fighting anti-scraping. If a website on the web tells you otherwise, it’s probably a scam 😅

BasileDataimo

10 months ago

Hello Guillim,

Thanks for your concern. I understand that LBC's protections are quite advanced. If running Selenium (or an equivalent) is the only way to bypass them, so be it. I understand that LBC also has a private API for searches; I can see references to api.leboncoin.fr in the DOM. Would it be possible to call the API from inside the robot's web browser? If so, perhaps we could get more data from the same session and reduce costs?

I agree scraping LBC is not cheap... The only way to get a cheaper solution would be to share the data across multiple clients, so that it is crawled only once and retrieved several times. But I am not aware of any such service.

Regards,

guillim (anchor)

10 months ago

Hi,

There are not many options for bypassing anti-scraping protection; they require highly skilled scripts and heavy crawlers. The API is also protected, from what I could read here: https://github.com/tdurieux/leboncoin-api/blob/master/README.md

One last option would be to store the ad data and fetch only the new ads as they appear in search results. That requires some dev on your part, I guess.
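
A minimal sketch of that idea, assuming the ad id is the numeric part of the ad URL (as in the test URL posted earlier in this thread) and that this format stays stable. The function names, the in-memory `Set` standing in for a real store, and the second sample URL are all hypothetical.

```javascript
// Extract the numeric ad id from a leboncoin ad URL such as
// https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm
// Returns null when the URL does not look like an ad page.
function adIdFromUrl(url) {
  const match = new URL(url).pathname.match(/\/(\d+)\.htm$/);
  return match ? match[1] : null;
}

// Return only the URLs whose ad id is not yet in `seenIds`, and record them.
// `seenIds` stands in for whatever persistent store you already use.
function selectNewAds(searchResultUrls, seenIds) {
  const fresh = [];
  for (const url of searchResultUrls) {
    const id = adIdFromUrl(url);
    if (id !== null && !seenIds.has(id)) {
      seenIds.add(id);
      fresh.push(url);
    }
  }
  return fresh;
}

// Usage: one ad already stored, one new ad in the latest search results.
const seen = new Set(['2316274360']);
const toCrawl = selectNewAds(
  [
    'https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm',
    'https://www.leboncoin.fr/ventes_immobilieres/2316274361.htm', // made-up id
  ],
  seen,
);
console.log(toCrawl);
```

Only the URLs returned by `selectNewAds` would then need a detail-page run, which is where most of the per-ad cost goes.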

But yes, you would significantly reduce costs by sharing the scraped data between your customers. That's what most websites do, even though they say the opposite. There is no service doing exactly what you want, so you could develop your own solution, or maybe try some combination of Zapier and Xano.

BasileDataimo

10 months ago

Hi,

Yes, the API must be heavily protected because it exposes data that is already formatted, very quickly and in batches... the dream! But I don't know how the protection works... they can't have fingerprints or user agents. Perhaps we will try to limit the ads we crawl to reduce the costs. But this is not ideal.

On our side, we have already developed a solution to store ad data and avoid repeated crawls. We have other website sources and we use IP proxies to avoid detection. It works well, except for Leboncoin.fr and Seloger.com because of DataDome. By the way, would it be possible to crawl Seloger.com with your actor?

We used Apify for Leboncoin only because we have neither the knowledge nor the internal resources to bypass DataDome. But it's more expensive than expected... it's getting close to the cost of an external contractor. You don't do freelance work by any chance? ;)

Best regards,

guillim (anchor)

10 months ago

Hi

No, sorry, I don’t do freelance work anymore. Apify actors are what is left of my side projects. I cannot dedicate more time than some actor maintenance or development.

If you can assure me you will be a paying client for this actor, I can spend a few days developing “seloger” and release it on Apify, at the same rates as Leboncoin.

To be honest, you will never find any contractor or technology that really competes with Apify actor pricing for real crawlers. That’s only my opinion of course, and definitely biased.

If you find a better deal, let me know, I would use it as well!

BasileDataimo

10 months ago

Hi,

Noted, no freelancing anymore, thanks Guillim. I'd like to test the LBC actor at real scale and see the final costs before committing to Seloger. If we are at 300€/month/actor, we need to ensure it's profitable. But it's good to know you can add it!

I think Apify has good prices. It's just that we also have proxies and servers internally, so we pay for 2 services instead of one. Paying around $50/month for each actor/marketplace on Apify would be too expensive. For a big marketplace like Leboncoin it's possible though. I saw other similar services like RapidAPI, but never tested them...

One last question before I start implementing a solution with your actor. Is it possible to retrieve the whole HTML through the API and do the extraction on our side? I think it could lower the processing time and therefore the costs. Something like this:

async function pageFunction(context) {
    // Assumes context.page is a Puppeteer Page; content() resolves to the full HTML string
    const html = await context.page.content();
    return {
        url: context.request.url,
        html,
        userData: context.request.userData,
    };
}

Best regards,

guillim (anchor)

10 months ago

Sure, it would be possible, but it would have no effect on the costs. JS execution at this stage is incredibly fast. It’s really emulating Chrome when starting an actor that is expensive.

BasileDataimo

10 months ago

Yes, you are right, the processing time in JS won't be very different. But regarding the costs, my experience is that if we can retrieve and store the HTML on our side, we can fix any scraping issue and re-run the extraction without having to do a new run.
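
To illustrate what I mean, here is a minimal sketch, assuming each run returns a record with `url` and `html` fields. The `Map` stands in for our real storage, the sample HTML is made up, and the regex-based `extractTitle` is only a placeholder for a real parser.

```javascript
// In-memory stand-in for real storage (database, files, object store...).
const htmlStore = new Map();

// Save the raw page returned by a run, keyed by URL.
function savePage({ url, html }) {
  htmlStore.set(url, { html, fetchedAt: new Date().toISOString() });
}

// Example extractor: reads the <title> tag. A regex is used here only to
// keep the sketch dependency-free; real extraction would use an HTML parser.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Re-run any extractor over every stored page without crawling again.
function reExtract(extractor) {
  return [...htmlStore].map(([url, { html }]) => ({ url, data: extractor(html) }));
}

// Usage with a made-up page body.
savePage({
  url: 'https://www.leboncoin.fr/ventes_immobilieres/2316274360.htm',
  html: '<html><head><title> Sample ad </title></head><body></body></html>',
});
console.log(reExtract(extractTitle));
```

If an extractor turns out to be wrong, we fix it and call `reExtract` again on the stored pages, with no new actor run.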

I ran some tests again to extract the whole HTML, but I didn't get any results (see attachment). Did you make any changes? Looking at the logs, it seems to be the protection blocking again?

2023-06-16T06:46:17.414Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: net::ERR_TOO_MANY_RETRIES at https://www.leboncoin.fr/recherche?category=9&locations=r_12&owner_type=private&real_estate_type=1%2C2
2023-06-16T06:46:17.415Z     at navigate (/home/myuser/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Frame.js:111:23)

guillim (anchor)

10 months ago

I didn’t change anything. Might be some random blocking that the crawler couldn’t bypass in 15 tries. Happens sometimes.

BasileDataimo

10 months ago

It worked after 4 tries (see attachment). But it took a total of 37 minutes and $1.30 to get 38 ads (without the detail pages). I'll continue integrating the actor to test the solution for 1 month and see the real cost. But if you can find ways to reduce time/cost, it would be greatly appreciated!

guillim (anchor)

10 months ago

Ok, I'll check it out.
