seloger mass products scraper (by search URL) ⚡

azzouzana/seloger-mass-products-scraper-by-search-url

3 days trial then $25.00/month - No credit card required now

🔥 Very simple! Enter the link to the search page and get the results! ⚡ Quickly extract detailed property information (title, description, photos, energy ratings, price, contacts, transport and more) at low cost, with export to JSON, CSV, HTML, EXCEL...

Start Issue

Closed

xo7 opened this issue a month ago

Hi, it seems that since the last version (0.0.172) the actor crashes at startup with my URL:

https://www.seloger.com/list.htm?projects=2,5&types=2,12,11,1&natures=1,2,4&places=[{%22subDivisions%22:[%2275%22]}]&surface=NaN/45&sort=d_dt_crea&mandatorycommodities=0&enterprise=0&qsVersion=1.0&m=search_refine-redirection-search_results

with log:

2024-12-13T18:27:13.713Z ACTOR: Pulling Docker image of build jZtjIWkrVu7xa6wDJ from repository.
2024-12-13T18:27:14.412Z ACTOR: Creating Docker container.
2024-12-13T18:27:14.511Z ACTOR: Starting Docker container.
2024-12-13T18:27:18.781Z INFO System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.1","osType":"Linux","nodeVersion":"v20.18.1"}
2024-12-13T18:27:18.930Z bypassing bot protection... Please be patient :)
2024-12-13T18:28:16.408Z WARN Request: We've encountered a POST Request with a payload. This is fine. Just letting you know that if your requests point to the same URL and differ only in method and payload, you should see the "useExtendedUniqueKey" option of Request constructor.
2024-12-13T18:28:30.094Z pass through....

and it finishes without any result or query. Can you fix that?

thanks
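
The WARN line in the log above is incidental to the crash: Crawlee is pointing out that requests to the same URL are deduplicated by URL alone unless useExtendedUniqueKey is set on the Request, which folds the method and payload into the unique key. A minimal sketch, with a placeholder URL and payload (not the actor's actual request):

import { RequestQueue, Request } from 'crawlee';

const queue = await RequestQueue.open();

// Without useExtendedUniqueKey, two POST requests to the same URL with
// different payloads would be treated as duplicates and only one would run.
await queue.addRequest(new Request({
    url: 'https://example.com/search',        // placeholder URL
    method: 'POST',
    payload: JSON.stringify({ page: 1 }),     // placeholder payload
    useExtendedUniqueKey: true,
}));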

azzouzana

Hi!

checking this...

azzouzana

Well, it seems the Apify FR proxies were flagged and blocked by DataDome. I'll set it up to use my own proxies (which I pay for) within the hour, and later I'll change it so the Actor's users can input their own proxies (something I had avoided so far to keep it as simple as possible for users).
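
A rough sketch of what that proxy switch can look like with the Apify SDK; the proxyUrls input field and the RESIDENTIAL group are assumptions for illustration, not necessarily the actor's actual setup:

import { Actor } from 'apify';

await Actor.init();
const input = await Actor.getInput();

// If the user supplies their own proxy URLs (hypothetical 'proxyUrls' input
// field), use those; otherwise fall back to an Apify proxy configuration.
const proxyConfiguration = input?.proxyUrls?.length
    ? await Actor.createProxyConfiguration({ proxyUrls: input.proxyUrls })
    : await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'] });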

azzouzana

I have a question: I noticed that you're always trying to scrape all of the 7K listings. Is that needed? (I'm asking to understand your use case)

azzouzana

I'll update it to use proxies from more EU countries, not only France; that should help. Will let you know.

azzouzana

Should be up again! Could you please confirm? And please reach out so we can discuss your specific use case and see whether improvements/adjustments could be made :) My Discord username is @azzouzana

jeremy.xo7

a month ago

I'm trying to fetch new listings with specific filters once a day.

Thanks for your help, I will test it.

azzouzana

I can plan to work on a mode that would definitely help you out: based on the outcome of previous executions, the actor would only scrape new listings and return delisted items.

azzouzana

Hi, I’ve released the Delta Mode feature, which, based on a checkbox input, instructs the actor to return only new or delisted ads since its last run. To use it, please use version 0.1. Test it out with a small listing count first and let me know how it works. Thank you!
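
A minimal sketch of how such a delta comparison can work, assuming each ad has a stable id; the SEEN-ADS store name and the field names are illustrative, not necessarily what the actor uses:

import { Actor } from 'apify';

// currentAds: array of ads scraped from the listing pages in this run
// (assumed shape { id, ... }).
const store = await Actor.openKeyValueStore('SEEN-ADS'); // illustrative store name
const previousIds = new Set((await store.getValue('ids')) ?? []);

const currentIds = new Set(currentAds.map((ad) => ad.id));

const newAds = currentAds.filter((ad) => !previousIds.has(ad.id));
const delistedIds = [...previousIds].filter((id) => !currentIds.has(id));

// Persist the current snapshot for the next run.
await store.setValue('ids', [...currentIds]);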

xo7

a month ago

I tried the new version but it fails with the message: "2024-12-18T04:36:22.055Z Not paying user, only handling first 50 results. To get all results, please subscribe" (can you check it?)

Some points, if this can help you in the future:

  • An easy way to improve performance and avoid "caching" could be an input param to select a date, so the actor only scrapes details for ads published after that date (this can be more efficient than fetching everything); see the sketch after this list.
  • Another quick tip: limit the number of ads to fetch.

Thanks
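
A sketch of the date-cutoff idea from the first point above; the publishedAfter input field and the publicationDate property are hypothetical names for illustration:

import { Actor } from 'apify';

await Actor.init();
const input = await Actor.getInput();

// 'ads' stands for the listings collected from the search pages.
// Only ads published after the cutoff get the expensive detail scrape.
const cutoff = input?.publishedAfter ? new Date(input.publishedAfter) : null;
const adsToScrape = cutoff
    ? ads.filter((ad) => new Date(ad.publicationDate) > cutoff)
    : ads;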

azzouzana

Thanks a lot for the feedback.

I've just pushed an attempt at the isPaying check, please let me know how it goes. (If you're a paid Apify user and you're still facing that, please let me know; it's most likely something with the Apify platform, but I believe it should be good.) Regardless, to test the new monitoring mode, could you please try it with a listing that has fewer than 50 results and let me know your feedback.

  • An input param to select a date and only scrape details for ads published after this date => Thanks, definitely makes sense! Noted!
  • Limit the number of ads to fetch => I previously worked on this but it didn't work well with monitoring mode enabled; I'll have to think about this again. They probably have to be mutually exclusive.

xo7

a month ago

Hi,

I'm progressing in my testing. It seems better, but I encountered an error:

2024-12-20T01:29:01.023Z /usr/src/app/node_modules/@crawlee/core/storages/dataset.js:41
2024-12-20T01:29:01.026Z         throw new Error(`Data item${s}is too large (size: ${bytes} bytes, limit: ${limitBytes} bytes)`);
2024-12-20T01:29:01.028Z               ^
2024-12-20T01:29:01.030Z
2024-12-20T01:29:01.032Z Error: Data item is too large (size: 71529285 bytes, limit: 9436240 bytes)
2024-12-20T01:29:01.034Z     at checkAndSerialize (/usr/src/app/node_modules/@crawlee/core/storages/dataset.js:41:15)
2024-12-20T01:29:01.036Z     at Dataset.pushData (/usr/src/app/node_modules/@crawlee/core/storages/dataset.js:206:29)
2024-12-20T01:29:01.038Z     at Actor.pushData (/usr/src/app/node_modules/apify/actor.js:527:24)
2024-12-20T01:29:01.040Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-12-20T01:29:01.043Z     at async file:///usr/src/app/src/main.js:86:5

Can you catch this error and continue the process?

Best
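
A minimal sketch of catching the oversized-item error so the run can continue; the real fix later was to split the output into smaller items, but a guard like this avoids the hard crash:

import { Actor } from 'apify';

try {
    // 'result' stands for whatever aggregated object the run was about to push.
    await Actor.pushData(result);
} catch (err) {
    // Dataset items over ~9 MB are rejected by the API; log and keep going
    // instead of letting the whole run fail.
    console.warn(`Skipping item that could not be pushed: ${err.message}`);
}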

azzouzana

Something with the dataset size limit. Please share the run and I'll check this first thing tomorrow. (Also, did you confirm monitoring is OK with a not-so-large search result set?)

xo7

a month ago

You can find the run here : https://console.apify.com/organization/TzEYl4RGm5rKPyOU5/actors/dqFjeUv7Nrv7lRatk/runs/ZMRSNKZwQFCvDOmOs

Regarding the documentation (if this can help):

The size of the data is limited by the receiving API and therefore pushData() will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size.

The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates.

Regarding monitoring mode: with a small base it seems to work as expected.

Thanks

xo7

a month ago

I think this is related, as you now try to send 1 line with everything in "newsAds" (and for a big result set you reach the 9MB limit).

I think you should use the same output as before (1 line per ad) and maybe add a "state": "new" or "state": "delisted" field in the row; this would be more useful for debugging and checking results in the console.
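
A sketch of that output shape, one dataset item per ad with a status field; the arrays and the apify_monitoring_status field name (the one the actor later exposed) are assumptions here, not the actor's actual code:

import { Actor } from 'apify';

// One dataset item per ad keeps every item far below the 9 MB limit, and
// pushData() chunks the array into individual items automatically.
await Actor.pushData([
    ...newAds.map((ad) => ({ ...ad, apify_monitoring_status: 'new' })),
    ...delistedAds.map((ad) => ({ ...ad, apify_monitoring_status: 'delisted' })),
]);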

azzouzana

Thanks for the feedback!

For the size limitation, that's definitely it. Will work on it this weekend.

xo7

23 days ago

Hello, any news?

azzouzana

Hey 👋 I've worked on adjusting the delta mode, and there's a field "apify_monitoring_status" which signifies whether the ad is new or delisted. Could you test it out with a small listing and let me know? Thanks!

xo7

20 days ago

Hey, it seems this works (yay), but I have an issue regarding monitoring mode.

Between each execution, it seems monitoring mode detects all URLs as "new" (and so crawls the whole list). Can you share how you identify an ad as "new"? Can you confirm whether this is based on the permalink without parameters?

Thanks
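
One common way to build such a "seen before" key is to strip the query string and hash from the permalink before comparing; whether the actor does exactly this is not confirmed here, but a sketch:

// Reduce a listing URL to a stable key by dropping query parameters and hash,
// so tracking parameters don't make the same ad look "new" on every run.
const normalizePermalink = (permalink) => {
    const url = new URL(permalink);
    return `${url.origin}${url.pathname}`;
};

// Illustrative URL only:
normalizePermalink('https://www.seloger.com/annonces/achat/appartement/paris/123456.htm?m=search_results');
// => 'https://www.seloger.com/annonces/achat/appartement/paris/123456.htm'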

xo7

20 days ago

If this can help you: it seems monitoring detection occurs after fetching, because at the end I have a dataset output only with "new" and "delisted". Can you update the code to only "deep scrape" ads that are "new"?

Best
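
A sketch of the ordering being requested here: run the delta comparison on the listing results first, then enqueue detail ("deep scrape") requests only for ads classified as new. It reuses the previousIds set and normalizePermalink helper from the earlier sketches and assumes a Crawlee crawler instance and listingAds / delistedAds arrays; none of this is the actor's actual code:

// Compare against the previous run before any detail page is fetched,
// so only genuinely new ads cost a deep scrape.
const newAds = listingAds.filter(
    (ad) => !previousIds.has(normalizePermalink(ad.permalink)),
);

// Only new ads get a detail request enqueued.
await crawler.addRequests(
    newAds.map((ad) => ({ url: ad.permalink, label: 'DETAIL' })),
);

// Delisted ads need no fetching; push them straight to the dataset.
await Actor.pushData(
    delistedAds.map((ad) => ({ ...ad, apify_monitoring_status: 'delisted' })),
);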

azzouzana

Hi!

Thanks for the feedback and for catching that bug, and thanks also for bearing with me :) Definitely worth it. Working on it today. I'll let you know.

xo7

13 days ago

Hi,

Happy new year! I'm coming back to you about the issue (can you fix it?).

Thanks

xo7

13 days ago

It seems OK now... Thanks for your help.

azzouzana

Hi & happy new year!

This has been OK since last week, but I forgot to follow up here. Thanks for your feedback! I'm closing this issue now.

Developer
Maintained by Community

Actor Metrics

  • 6 monthly users

  • 2 stars

  • 99% runs succeeded

  • 1.6 hours response time

  • Created in Jul 2024

  • Modified 20 days ago