Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.0 (41)

Pricing: Pay per usage

Total users: 62K

Monthly users: 8.2K

Runs succeeded: >99%

Issues response: 7.9 days

Last modified: 2 hours ago


Predict the number of pages before running the actor

Closed

alex_simas opened this issue a year ago

Hello,

I'm testing your crawler, and its content extraction capability is really fantastic. However, would it be possible to know the number of pages on a website before running the actor? I searched the documentation but couldn't find anything.

jindrich.bar

Hello Alex, and thank you for your interest in this Actor!

Unfortunately, Website Content Crawler cannot do this. Like any other web crawler, WCC discovers pages only by following the links on pages it has already visited, one new page at a time, and it stops only once every link leads to an already discovered page.

The only idea I have right now is sitemap discovery: by parsing the website's sitemap, you should get a list of all its URLs, and therefore a count of all its pages. However, Website Content Crawler doesn't do this, as we haven't seen the need for it yet.
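If you want to experiment with the sitemap route yourself, a minimal sketch along these lines could work. This is separate from the Actor; the sitemap URL is an assumption, and many sites either don't publish a sitemap or split it across nested sitemap index files:

```python
# Minimal sketch: count the pages a website advertises in its sitemap.
# The sitemap URL is an assumption - adjust it per site, and note that the
# sitemap may be incomplete or missing entirely.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_sitemap_urls(sitemap_url: str) -> int:
    with urllib.request.urlopen(sitemap_url) as response:
        root = ET.fromstring(response.read())
    # A <sitemapindex> points to nested sitemaps; a <urlset> lists pages directly.
    if root.tag.endswith("sitemapindex"):
        return sum(
            count_sitemap_urls(loc.text)
            for loc in root.findall("sm:sitemap/sm:loc", NS)
        )
    return len(root.findall("sm:url/sm:loc", NS))

print(count_sitemap_urls("https://example.com/sitemap.xml"))
```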

Could you share your ideas about this feature - most importantly, what's your use case for this? Thanks again!


alex_simas

a year ago

Hello Jindrich Bär, thanks for the quick response.

I imagined it would be like this. Your question about the use case is in fact the most important one, and the next time I use this space I will make sure to explain up front why I am asking.

Use case: estimating the cost and duration of scraping with a given set of infrastructure resources.

Maybe this should be my question instead: is there any way to predict the cost and execution time of scraping with 1 GB of CPU and 4 GB of memory?

I know there are more variables that can affect this forecast, but since I am going to offer the service to end customers, I wanted some predictability so that I can price it more accurately.

jindrich.bar

Alright, now it makes perfect sense!

The cost and scraping time depend on the following three variables:

  • The website size (more pages = longer and more expensive crawl)
    • this is quite hard to estimate, see above
  • The amount of RAM available (more RAM = more room for concurrent processes, so faster processing, but you pay more per hour)
  • The crawler type (Firefox and Chrome take some time to load each page; Cheerio loads pages almost instantly, but doesn't execute client-side JS)
    • you want to go for Cheerio whenever possible, but it isn't always an option (e.g. on pages that load content dynamically)

On our testing account, the average price per result is $0.011 (or 12 seconds) and the median is $0.006 (or 3 seconds). Most of the runs behind these statistics use Firefox with 8 GB of RAM and come from reasonably long crawls (the Actor takes a few seconds to spin up, so short runs are less efficient).
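As a back-of-the-envelope illustration, you can combine a page-count estimate with those per-result averages. The page count below is a made-up example (it could come from a sitemap count like the one sketched above), and the per-result figures already reflect the concurrency of typical runs, so treat the result as a ballpark only:

```python
# Rough order-of-magnitude estimate built from the averages quoted above.
estimated_pages = 5_000          # hypothetical site size (e.g. from a sitemap count)
price_per_result_usd = 0.011     # average price per result quoted above
seconds_per_result = 12          # average time per result quoted above

estimated_cost_usd = estimated_pages * price_per_result_usd
estimated_hours = estimated_pages * seconds_per_result / 3600

print(f"~${estimated_cost_usd:,.0f} and roughly {estimated_hours:.1f} hours")
```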

You can always check the "Monitoring" tab (https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/monitoring), which shows statistics about your runs. In the Stats per run section, you can find the Duration/Cost per result chart, which shows the average price for crawling one website (based on your last 200 runs).
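If you prefer to pull similar numbers programmatically, a rough sketch with the Apify API client for Python could look like the one below. The field names (usageTotalUsd, defaultDatasetId, itemCount) are my reading of the run and dataset objects, so double-check them against the current API documentation:

```python
# Sketch: estimate your own average cost per result from recent runs of the
# Actor. Field names such as "usageTotalUsd" and "itemCount" are assumptions
# and should be verified against the current Apify API documentation.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
runs = client.actor("apify/website-content-crawler").runs().list(limit=200, desc=True).items

total_cost_usd = 0.0
total_results = 0
for run in runs:
    dataset = client.dataset(run["defaultDatasetId"]).get() or {}
    total_cost_usd += run.get("usageTotalUsd") or 0.0
    total_results += dataset.get("itemCount") or 0

if total_results:
    print(f"Average cost per result: ${total_cost_usd / total_results:.4f}")
```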

Feel free to ask any additional questions. Thanks!


alex_simas

a year ago

Thank you very much for your answer, Jindřich Bär!

In fact, it was quite enlightening for me. I have only been working with this technology for a short time, and I admit I have some technical gaps to close before I can make the best use of the crawler.

The service's documentation is very rich, and I'm going to dive deeper into this incredible tool. I've set out to create several website scenarios that I intend to scrape so I can explore my use case empirically, vary the settings, and then evaluate the statistics for each scenario.