Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Various questions about operation and optimization of website content crawler

Closed

David Haddad (davhad) opened this issue
24 days ago

Hi, I have an issue with the following actor and run:

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/i7Onl59rEebKNG1aT#output

  1. In the run shared with you, the website clearly has multiple pages under the same domain and the Actor was configured with a maximum of 25 pages, so I'm unsure why only one page shows up in the output.

  2. I'm getting multiple instances of an Actor run using up many resources but then retrieving no pages. Is there a way to set up the Actor so that it aborts if a few seconds pass and no pages have been found or retrieved? I don't see the point of paying 25 cents for a website that doesn't generate any pages.

  3. I'm using the Actor in two different use cases, with each run passing a different starting URL. I can see the integration webhooks, but I'd like a separate webhook for each use case, and I only see a way to add multiple webhooks that are all triggered on every run. How can I solve this? The process that the webhook triggers on crawl completion is completely different for each use case.

  4. I'd like to use fewer resources during each run. Would setting saveScreenshots to false make a big difference, and if so, what savings could I expect?

jiri.spilka avatar

Hi, thank you for your interest in this Actor. I checked your run, and there seems to be an issue with handling canonical URLs. I’ll need to take a closer look.

Regarding your other questions, they’re all great points. Please give me a bit more time, and I’ll get back to you in a day or two with explanations. Then we can discuss how to address runs that yield 0 results.

jiri.spilka avatar

Apologies for a slower response.

Here are my answers:

  1. The site is reporting incorrect canonical URLs, which causes pages to be skipped.

Please set ignore canonical URLs to true:

If enabled, the Actor will ignore the canonical URL reported by the page and use the actual URL instead. This feature is helpful for websites that report invalid canonical URLs, as it prevents the Actor from skipping those pages in the results.

  2. I understand your concern, and I apologize for the inconvenience. It’s challenging to determine when to abort the Actor on runs with no results. I’ll discuss this internally, including with customer support, to address issues around your empty runs and how to remedy them. I'll let you know.

  3. I’m not entirely clear on your use cases. Have you considered using a task to handle them? For example, you could create a different task for each use case.

  4. Regarding costs, it’s difficult to provide a one-size-fits-all solution as each domain may require a slightly different approach.

For example, on https://www.v****ay.io/, using the Cheerio browser, which doesn’t render JavaScript, is significantly faster—about 44 seconds compared to the default Adaptive browser (~2 minutes) and Playwright (~3 minutes 45 seconds). I checked the results, and the content appears accurate.

If JavaScript rendering is needed, you can use Playwright but reduce the waitForDynamicContent time from 10 seconds to say 5 seconds. This speeds up the crawl,... [trimmed]
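The settings discussed in this reply could be combined into a run input along the following lines. This is only a sketch: `build_run_input` is a hypothetical helper, and the field names (`startUrls`, `maxCrawlPages`, `crawlerType`, `ignoreCanonicalUrl`, `dynamicContentWaitSecs`) and crawler type values are assumptions based on the options named in the thread, so they should be checked against the Actor's input page before use.

```python
import json


def build_run_input(start_url, needs_javascript, max_pages=25):
    """Build a run-input dict for the crawler.

    All field names and values are assumptions (see the note above),
    not a confirmed input schema.
    """
    run_input = {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": max_pages,
        # Use the actual page URL when the site reports bad canonical URLs.
        "ignoreCanonicalUrl": True,
        # Screenshots are not needed for text extraction.
        "saveScreenshots": False,
    }
    if needs_javascript:
        # Browser-based crawl, with a shorter wait for dynamic content
        # (5 seconds instead of the 10 mentioned in the thread).
        run_input["crawlerType"] = "playwright:firefox"
        run_input["dynamicContentWaitSecs"] = 5
    else:
        # Plain HTTP crawl without JavaScript rendering.
        run_input["crawlerType"] = "cheerio"
    return run_input


if __name__ == "__main__":
    example = build_run_input("https://example.com/", needs_javascript=False)
    print(json.dumps(example, indent=2))
```

The resulting dictionary can be pasted into the Actor's JSON input editor or saved as the input of a task, one task per use case.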

davhad avatar

Hi Jiri, thanks for your detailed feedback. Will check and respond🙏

davhad avatar

Hi Jiri, clear on 1 & 3. Still awaiting 2. For 4, the websites are heterogeneous and there is no way of knowing ahead of time which approach each one needs. Is there any way for the Actor's logic to be adaptive on your end, or maybe it already is?

jiri.spilka avatar

Hi David, Thank you again for using the Actor. I understand that configuring the Actor can be complex, as is web scraping.

Starting with point 4 – crawling speed: Yes, you’re right; if the websites are heterogeneous, you can’t simply use Cheerio. The default setting uses an adaptive crawler. In the example above (previous comment), the adaptive crawler took around 2 minutes (slower than Cheerio but faster than Playwright).

From the documentation:

The crawler automatically switches between Cheerio and Playwright for dynamic pages to maximize performance wherever possible.

Regarding point 2 – runs with 0 results: I checked a few of your runs, but I couldn’t access some websites, such as http://www.ve******ty.dk/ and http://www.ve****tier.lk/ (Bad Gateway). In these cases, the crawler stops early, after about 15 seconds.

Problematic runs are those taking around 6 minutes without results (e.g. 56spqGx8Ryo9ij8qS). Again, I can’t access the site (http://www.ha*****nd.com/). When you check the run, you’ll see the crawler is retrying with different settings.

There are two variables controlling this:

    "maxRequestRetries": 5,
    "requestTimeoutSecs": 60

This setup means it takes about 300 (5 * 60) seconds to give up, with an additional 60 seconds for some overhead. You could try lowering maxRequestRetries and requestTimeoutSecs, but there’s a risk that content won’t load for slower sites.
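The estimate above can be turned into a small calculation to compare settings before changing them. This is a rough sketch that follows the approximation given in this reply (retries times timeout, plus about 60 seconds of overhead); `worst_case_give_up_seconds` is a hypothetical helper, and the real time to give up may differ.

```python
def worst_case_give_up_seconds(max_request_retries, request_timeout_secs,
                               overhead_secs=60):
    """Approximate time before the crawler gives up on an unreachable site.

    Uses the estimate from the thread: retries * timeout + fixed overhead.
    The 60-second default overhead is the figure quoted above.
    """
    return max_request_retries * request_timeout_secs + overhead_secs


if __name__ == "__main__":
    for retries, timeout in [(5, 60), (1, 60), (1, 30)]:
        total = worst_case_give_up_seconds(retries, timeout)
        print(f"maxRequestRetries={retries}, requestTimeoutSecs={timeout}: ~{total}s")
```

With the defaults this gives about 360 seconds, which matches the roughly 6-minute runs mentioned above. Lowering either value shortens the wait, at the cost of possibly missing slow sites.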

I apologize for not having a foolproof solution. W... [trimmed]

davhad avatar

Hi Jiri, thanks so much for this. This is very helpful to understand and you've been spot on.

I'll reduce the retries to 1. The adaptive approach seems logical to me.

Appreciate your help, and we can mark this as closed.

Are you managing under your responsibility other actors on apify as well? If so which ones?

Thanks.

jiri.spilka avatar

Hi David, I'm glad I could help.

Are you managing under your responsibility other actors on apify as well? If so which ones?

The Website Content Crawler is our flagship tool, and I've contributed to integrations within the AI ecosystem around it—for example, the OpenAI Assistant, Pinecone vector database integration, and others.

Recently, we developed the RAG-Web-Browser, which lets you crawl and extract content based on Google search results.

I'll close this issue now. Please don’t hesitate to reach out with any further questions. It was a pleasure to work on this issue.

Developer
Maintained by Apify
Actor metrics
  • 3.8k monthly users
  • 636 stars
  • 100.0% runs succeeded
  • 2.7 days response time
  • Created in Mar 2023
  • Modified 7 days ago