Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawling not returning the text from pages

Closed

cgoul opened this issue
a month ago

I am trying to grab the content of "https://ung.no/oss/sEEF4x6B9VDmdLLCPcVaoQ". Sometimes it works, but most often it does not and returns "404 - Fant ikke siden..." (page not found). When I open the URL in a browser, it shows "404 - Fant ikke siden" for a second before showing its real content. Any suggestions?

Oscardz

Hello, this website takes a long time to load, so you can try increasing "dynamicContentWaitSecs" to 20 seconds. Hope this works. Best regards,
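
For readers following along, the suggestion above could look roughly like this as an Actor run input. This is a minimal sketch, not an official example: the helper name `build_wait_input` is made up for illustration, and it assumes the `startUrls` and `dynamicContentWaitSecs` input fields behave as described in the Actor's input schema.

```python
import json


def build_wait_input(urls, wait_secs=20):
    """Build a run input that gives slow pages more time to render.

    `build_wait_input` is a hypothetical helper; only the dictionary keys
    are meant to match the Actor's input fields.
    """
    return {
        # One entry per page to crawl.
        "startUrls": [{"url": u} for u in urls],
        # Wait time for dynamic content, as suggested in the reply above.
        "dynamicContentWaitSecs": wait_secs,
    }


wait_input = build_wait_input(["https://ung.no/oss/sEEF4x6B9VDmdLLCPcVaoQ"])
print(json.dumps(wait_input, indent=2))
```

The resulting JSON can be pasted into the Actor's input editor or passed as the run input through the API.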

cgoul

a month ago

Not really. When I use the API with a list of 1000 URLs, some of them still fail. But when I test them in the apify-client, they work and the content is there.
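
With a large URL list, one way to find the pages that failed is to scan the results for the placeholder text quoted earlier in this thread. The sketch below is only an illustration: `split_results` is a hypothetical helper, and it assumes each dataset item is a dictionary with `url` and `text` keys.

```python
# Placeholder text quoted in the original report.
NOT_FOUND_MARKER = "404 - Fant ikke siden"


def split_results(items):
    """Separate usable items from those that captured the 404 placeholder.

    Assumes each item has "url" and "text" keys (an assumption about the
    dataset shape, not something confirmed in this thread).
    """
    ok, failed_urls = [], []
    for item in items:
        text = (item.get("text") or "").strip()
        # Treat empty text or an early placeholder match as a failed page.
        if not text or NOT_FOUND_MARKER in text[:200]:
            failed_urls.append(item["url"])
        else:
            ok.append(item)
    return ok, failed_urls


sample_items = [
    {"url": "https://example.com/a", "text": "Real article content."},
    {"url": "https://example.com/b", "text": "404 - Fant ikke siden..."},
    {"url": "https://example.com/c", "text": ""},
]
ok_items, failed_urls = split_results(sample_items)
print(failed_urls)
```

The failed URLs can then be fed into a second run with more conservative settings.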

cgoul

a month ago

Issue solved by using waitForSelector and setting maxConcurrency = 2.
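
The fix described above might be expressed as a run input like the one below. Treat it as a sketch under assumptions: the thread does not say which selector was used, so `"main"` is only a placeholder, and the helper names `build_selector_input` and `run_crawl` are hypothetical. The `run_crawl` function relies on the `apify-client` Python package and is not called here because it needs an API token.

```python
def build_selector_input(urls, selector, max_concurrency=2):
    """Build a run input that waits for real content and limits parallelism."""
    return {
        "startUrls": [{"url": u} for u in urls],
        # Wait until this CSS selector appears before extracting content.
        "waitForSelector": selector,
        # Fewer parallel pages, as reported in the fix above.
        "maxConcurrency": max_concurrency,
    }


def run_crawl(token, run_input):
    """Start the Actor and return its dataset items (requires apify-client)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


selector_input = build_selector_input(
    ["https://ung.no/oss/sEEF4x6B9VDmdLLCPcVaoQ"],
    selector="main",  # placeholder selector, not taken from the thread
)
print(selector_input)
```

Pick a selector that only exists once the real page content has rendered, so the temporary "404 - Fant ikke siden" screen does not satisfy it.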

Developer
Maintained by Apify