Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

can't crawl the whole website

Closed

visable opened this issue
10 days ago

It only crawls one page. I tried increasing the max depth to 10 and also tried different crawler types (adaptive and Firefox).

jiri.spilka

Hi, thank you for using Website Content Crawler!

The issue is caused by the scope of your startURLs. Many customers choose to crawl only the specified startURLs and their sub-pages. In your case, with https://www.****startseite.html, the scope is limited to this page and its sub-pages. Since there are no sub-pages, the crawler doesn't proceed further.

To fix this, please remove startseite.html from the URL so the crawler can access other pages as well. You can see an example run, which I aborted early to avoid wasting resources.
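
The scoping behavior described above can be illustrated with a small sketch. This is not the Actor's actual implementation, only a simplified prefix check written under the assumption that a page counts as "in scope" when its URL starts with the start URL. The function names and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit


def scope_prefix(start_url: str) -> str:
    """Return the prefix that in-scope URLs must share with the start URL.

    Assumption: scope is scheme + host + path of the start URL.
    """
    parts = urlsplit(start_url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def in_scope(start_url: str, candidate: str) -> bool:
    """Simplified check: a candidate is in scope if it starts with the prefix."""
    return candidate.startswith(scope_prefix(start_url))


# Hypothetical URLs for illustration only.
narrow_start = "https://www.example.com/startseite.html"
broad_start = "https://www.example.com/"
other_page = "https://www.example.com/kontakt.html"

# With the file name in the start URL, sibling pages fall outside the scope.
print(in_scope(narrow_start, other_page))
# With the file name removed, the same page is inside the scope.
print(in_scope(broad_start, other_page))
```

Under this assumption, a start URL that ends in a specific file such as startseite.html matches almost nothing else on the site, which is consistent with the crawl stopping after a single page.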

I hope this helps! I'll close this issue now, but feel free to ask any additional questions or raise a new issue.

visable

9 days ago

Thank you for the quick reply, but now we get this error: https://console.apify.com/organization/PljXs4KlVGTIiQCKc/actors/runs/CON38M0HYT7BwZNuE#output

2024-12-05T14:49:22.447Z INFO AdaptiveCrawler: Running browser request handler for https://www.pema-tec.multiscreensite.com/
2024-12-05T14:49:25.726Z ERROR AdaptiveCrawler: Request failed and reached maximum retries. page.goto: SSL_ERROR_UNKNOWN

Any idea why this is happening?

jiri.spilka

I'm glad I could help.

The site https://www.pema-tec.multiscreensite.com/ is not reachable. When I attempted to access it, I encountered a Site not found error.

However, it appears that the issue lies with the www prefix. If you use https://pema-tec.multiscreensite.com/ (without www) as the start URL, the site works correctly.
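
If you have several start URLs to adjust, a small helper along these lines could derive the variant without the www prefix. This is only a sketch using the Python standard library. The function name is hypothetical, and whether the non-www host is actually reachable still has to be verified for each site.

```python
from urllib.parse import urlsplit, urlunsplit


def without_www(url: str) -> str:
    """Return the URL with a leading 'www.' removed from the host, if present.

    Simplification: assumes the host part carries no user info before it.
    """
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# The start URL from this thread, rewritten to the variant that worked.
print(without_www("https://www.pema-tec.multiscreensite.com/"))
```

URLs that already lack the www prefix are returned unchanged, so the helper can be applied to a mixed list of start URLs.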

jiri.spilka

I’ll go ahead and close this issue now, but please feel free to ask additional questions or raise a new issue.

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 15 hours ago