Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.


page scrolling

Open

kempt_trophy opened this issue
6 days ago

The data I need appears when I scroll, how do I customize this in this Actor?

jiri.spilka

Hi, thank you for using Website Content Crawler.

I reached out internally, and all credit goes to @jindrichbar for finding a possible solution.

He made a few specific changes to the settings:

"crawlerType": "playwright:firefox" (was previously Chrome)
"dynamicContentWaitSecs": 20
"htmlTransformer": "none"
"removeCookieWarnings": false
"removeElementsCssSelector": ".i.want.everything"
"requestTimeoutSecs": 60
"useSitemaps": false
"waitForSelector": "[data-icon=\"clipboard\"]"
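For reference, the settings above can be assembled into a single Actor input. The following is only a minimal sketch: the helper name `build_run_input`, the placeholder start URL, and the `startUrls` field are assumptions added for illustration and were not part of the settings listed in this thread.

```python
import json


def build_run_input(start_url: str, wait_for_selector: str) -> dict:
    """Assemble an Actor input from the settings quoted in this thread.

    `build_run_input` is a hypothetical helper; `startUrls` is assumed here
    as the field that carries the page to crawl.
    """
    return {
        "startUrls": [{"url": start_url}],  # assumption, not listed in the thread
        "crawlerType": "playwright:firefox",
        "dynamicContentWaitSecs": 20,
        "htmlTransformer": "none",
        "removeCookieWarnings": False,
        "removeElementsCssSelector": ".i.want.everything",
        "requestTimeoutSecs": 60,
        "useSitemaps": False,
        "waitForSelector": wait_for_selector,
    }


# Placeholder URL; replace it with the page you actually want to crawl.
run_input = build_run_input("https://example.com/some-page", '[data-icon="clipboard"]')

# json.dumps escapes the inner quotes, giving the same form as in the list above.
print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the Actor's JSON input editor. Note that the selector has to match an element that really exists on the target page, otherwise the wait will not succeed.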

Please see his example run here.

The ideal approach would be to retrieve the data using the Perplexity AI Actor. However, I noticed that you’ve already raised an issue there without a solution yet.

I hope this solution works for you for now. I’ll go ahead and close this issue, but please feel free to reach out with any other questions.

kempt_trophy

5 days ago

Thanks for the reply. If you don't mind, can you help me figure out why I'm not getting the results I want with your settings?

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/uq3mwiAXydfARqGCz

jiri.spilka

For this URL, I had to change the selector to "waitForSelector": "[data-icon=\"arrow-up\"]" (since [data-icon="clipboard"] was not present on the page). Please see my example run for reference.

I agree this solution is quite brittle, and using the Perplexity API would likely be a more convenient approach. Another option could be to use the RAG Web Browser Actor paired with an LLM Actor (though this one hasn’t been released yet).
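One way to make the wait condition a little less brittle might be to pass a CSS selector list, which matches as soon as any one of its members is present. This is a sketch only: the helper `any_of` is hypothetical, and it assumes the Actor hands the `waitForSelector` value to the browser unchanged, which has not been verified in this thread.

```python
def any_of(*selectors: str) -> str:
    """Join CSS selectors into a selector list that matches any of them."""
    return ", ".join(s.strip() for s in selectors if s.strip())


# The two selectors that worked for the two URLs discussed in this thread.
candidates = ['[data-icon="clipboard"]', '[data-icon="arrow-up"]']

# Hypothetical value for the "waitForSelector" setting.
wait_for_selector = any_of(*candidates)
print(wait_for_selector)
```

This still depends on the page's icon names, so it reduces the breakage rather than removing it.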

I’d love to hear more about your use case if you don’t mind sharing it!

Please feel free to ask any additional questions.

kempt_trophy

5 days ago

I send a query to the AI and get the data back in its reply; I only need to retrieve that data. Since the Actor I need is not working, I decided to use this one to get the results. I would like to get only the required result.

Also my link tends to be much longer. For example, like here:

https://console.apify.com/view/runs/qFOgaOMpqntQZpuzs

jiri.spilka

Thank you for sharing! The URL length shouldn’t be an issue.

kempt_trophy

3 days ago

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/w09Th2kXq6cF7Hjtx#output

The length of the link does seem to make a difference: I don't get the Perplexity response data. Please help me understand why.

Developer
Maintained by Apify
Actor metrics
  • 3.8k monthly users
  • 635 stars
  • 100.0% runs succeeded
  • 2.7 days response time
  • Created in Mar 2023
  • Modified 7 days ago