Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Hi, I have an issue with the following actor and run:
https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/i7Onl59rEebKNG1aT#output
-
In the run I shared with you, the website clearly has multiple pages under the same domain, and the Actor was configured with a maximum of 25 pages, so I'm unsure why only one page shows up in the output.
-
I'm getting multiple instances of an Actor run using up many resources but then retrieving no pages. Is there a way to set up the Actor so that if a few seconds pass and no pages have been found or retrieved, it aborts? I don't see the point of paying 25 cents for a website that doesn't generate any pages.
-
I'm using the Actor in two different use cases, with each run passing a different start URL. I can see the integration webhooks, but I'd like to create a separate webhook for each use case. However, I only see a way to add multiple webhooks that are all triggered on every run. How can I solve this? The process that the webhook triggers on crawl completion is completely different for each use case.
-
I'm also wondering how to use fewer resources during each run. Would setting saveScreenshots to false make a big difference, and if so, what savings could I expect?
Hi, thank you for your interest in this Actor. I checked your run, and there seems to be an issue with handling canonical URLs. I’ll need to take a closer look.
Regarding your other questions, they’re all great points. Please give me a bit more time, and I’ll get back to you in a day or two with explanations. Then we can discuss how to address runs that yield 0 results.
Apologies for the slower response.
Here are my answers:
- The site is reporting incorrect canonical URLs, which causes pages to be skipped.
Please set ignore canonical URLs to true:
If enabled, the Actor will ignore the canonical URL reported by the page and use the actual URL instead. This feature is helpful for websites that report invalid canonical URLs, as it prevents the Actor from skipping those pages in the results.
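For illustration, a minimal run input with that option enabled might look like the sketch below. The key names follow my reading of the Actor's JSON input (I'm assuming the option appears as ignoreCanonicalUrl); please double-check the exact names in the input schema in the console.

{
  "startUrls": [{ "url": "https://example.com/" }],
  "maxCrawlPages": 25,
  "ignoreCanonicalUrl": true
}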
-
I understand your concern, and I apologize for the inconvenience. It's challenging to determine when the Actor should abort a run with no results. I'll discuss this internally, including with customer support, to work out how to remedy your empty runs, and I'll let you know.
-
I’m not entirely clear on your use cases. Have you considered using tasks to handle them? For example, you could create a separate task for each use case and attach a different webhook to each, as sketched below.
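Roughly, a webhook created via the Apify webhooks API can be scoped to a single task through its condition. A sketch with placeholder task IDs and target URLs:

{
  "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
  "condition": { "actorTaskId": "<task-for-use-case-A>" },
  "requestUrl": "https://example.com/handle-use-case-a"
}

A second webhook would point actorTaskId at the task for use case B and requestUrl at the other endpoint, so each crawl completion triggers only its own process.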
-
Regarding costs, it’s difficult to provide a one-size-fits-all solution as each domain may require a slightly different approach.
For example, on https://www.v****ay.io/, using the Cheerio crawler, which doesn’t render JavaScript, is significantly faster (about 44 seconds) compared to the default adaptive crawler (~2 minutes) and Playwright (~3 minutes 45 seconds). I checked the results, and the content appears accurate.
If JavaScript rendering is needed, you can use Playwright but reduce the waitForDynamicContent time from 10 seconds to, say, 5 seconds. This speeds up the crawl,... [trimmed]
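A sketch of that setup (I'm reusing waitForDynamicContent as named above, and the crawlerType value is illustrative; verify both the exact key and the available crawler-type values in the Actor's input schema):

{
  "startUrls": [{ "url": "https://example.com/" }],
  "crawlerType": "playwright:firefox",
  "waitForDynamicContent": 5
}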
Hi Jiri, thanks for your detailed feedback. Will check and respond🙏
Hi Jiri, clear on 1 & 3. Still awaiting 2. For 4, the websites are heterogeneous, and there's no way of knowing ahead of time. Is there any way for the Actor's logic to be adaptive on your end, or maybe it already is?
Hi David, thank you again for using the Actor. I understand that configuring the Actor can be complex, as is web scraping in general.
Starting with point 4 – crawling speed: Yes, you’re right; if the websites are heterogeneous, you can’t simply use Cheerio. The default setting uses an adaptive crawler. In the example above (previous comment), the adaptive crawler took around 2 minutes (slower than Cheerio but faster than Playwright).
From the documentation:
The crawler automatically switches between Cheerio and Playwright for dynamic pages to maximize performance wherever possible.
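For completeness, pinning the crawler type explicitly would look roughly like this (the enum value is my best recollection of the schema; the console's JSON editor shows the exact options):

{
  "crawlerType": "playwright:adaptive"
}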
Regarding point 2 – runs with 0 results: I checked a few of your runs, but I couldn’t access some websites, such as http://www.ve******ty.dk/ and http://www.ve****tier.lk/ (Bad Gateway). In these cases, the crawler stops early, after about 15 seconds.
Problematic runs are those taking around 6 minutes without results (e.g. 56spqGx8Ryo9ij8qS).
Again, I can’t access the site (http://www.ha*****nd.com/). When you check the run, you’ll see the crawler is retrying with different settings.
There are two variables controlling this:
1"maxRequestRetries": 5, 2"requestTimeoutSecs": 60
This setup means it takes about 300 (5 * 60) seconds to give up, with an additional 60 seconds for some overhead.
You could try lowering maxRequestRetries and requestTimeoutSecs, but there’s a risk that content won’t load for slower sites.
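A sketch of a more aggressive give-up policy along those lines (the values are illustrative; by the arithmetic above, an unreachable site would then be abandoned after roughly 1 * 30 = 30 seconds, plus some overhead):

{
  "maxRequestRetries": 1,
  "requestTimeoutSecs": 30
}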
I apologize for not having a foolproof solution. W... [trimmed]
Hi Jiri, thanks so much for this. This is very helpful to understand and you've been spot on.
I'll reduce the retries to 1. As for the adaptive approach, that makes sense to me.
Appreciate your help, and we can mark this as closed.
Are you also responsible for other Actors on Apify? If so, which ones?
Thanks.
Hi David, I'm glad I could help.
Are you also responsible for other Actors on Apify? If so, which ones?
The Website Content Crawler is our flagship tool, and I've contributed to integrations within the AI ecosystem around it—for example, the OpenAI Assistant, Pinecone vector database integration, and others.
Recently, we developed the RAG-Web-Browser, which lets you crawl and extract content based on Google search results.
I'll close this issue now. Please don’t hesitate to reach out with any further questions. It was a pleasure to work on this issue.