Pricing

Pay per usage

Website Content Crawler

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.9 (41)

1544

Total users

60K

Monthly users

7.8K

Runs succeeded

>99%

Issues response

7.9 days

Last modified

3 days ago

Various questions about operation and optimization of website content crawler

Closed

David Haddad (davhad) opened this issue

Hi, I have an issue with the following actor and run:

https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/runs/i7Onl59rEebKNG1aT#output

1. In the run I shared, the website clearly has multiple pages under the same domain, and the actor was configured with a maximum of 25 pages, so I'm unsure why only one page shows up in the output.
2. I'm getting multiple actor runs that use up many resources but then retrieve no pages. Is there a way to set up the actor so that it aborts if a few seconds pass and no pages have been found or retrieved? I don't see the point of paying 25 cents for a website that doesn't generate any pages.
3. I'm using the actor in two different use cases, with each run passing a different starting URL. I can see the integration webhooks, but I'd like a separate webhook for each use case, and I only see a way to add multiple webhooks that are all triggered on every run. How can I solve this? The process the webhook triggers on crawl completion is completely different for each use case.
4. I'd like to use fewer resources during each run. Would setting saveScreenshots to false make a big difference in the resources used, and if so, what savings could I expect?

Jiří Spilka (jiri.spilka)

Hi, thank you for your interest in this Actor. I checked your run, and there seems to be an issue with handling canonical URLs. I’ll need to take a closer look.

Regarding your other questions, they’re all great points. Please give me a bit more time, and I’ll get back to you in a day or two with explanations. Then we can discuss how to address runs that yield 0 results.

Jiří Spilka (jiri.spilka)

Apologies for the slow response.

Here are my answers:

1. The site is reporting incorrect canonical URLs, which causes pages to be skipped.

Please set ignore canonical URLs to true. From the documentation:

If enabled, the Actor will ignore the canonical URL reported by the page and use the actual URL instead. This feature is helpful for websites that report invalid canonical URLs, as it prevents the Actor from skipping those pages in the results.

2. I understand your concern, and I apologize for the inconvenience. It’s challenging to determine when to abort the Actor on runs with no results. I’ll discuss this internally, including with customer support, to address your empty runs and how to remedy them. I'll let you know.
3. I’m not entirely clear on your use cases. Have you considered using a task to handle them? For example, you could create a different task for each use case.
4. Regarding costs, it’s difficult to provide a one-size-fits-all solution, as each domain may require a slightly different approach.

For example, on https://www.v****ay.io/, using the Cheerio crawler, which doesn’t render JavaScript, is significantly faster: about 44 seconds, compared to about 2 minutes for the default Adaptive crawler and about 3 minutes 45 seconds for Playwright. I checked the results, and the content appears accurate.

If JavaScript rendering is needed, you can use Playwright but reduce the waitForDynamicContent time from 10 seconds to say 5 seconds. This speeds up the crawl,... [trimmed]
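The settings discussed in answers 1 and 4 could be combined into a run input along the following lines. This is only a sketch: the field names `startUrls`, `maxCrawlPages`, `ignoreCanonicalUrl`, `crawlerType`, and `dynamicContentWaitSecs`, as well as the values `"cheerio"` and `"playwright:firefox"`, are assumptions that should be checked against the Actor's input schema, and `build_run_input` is a hypothetical helper, not part of any Apify library.

```python
import json


def build_run_input(start_url: str, needs_javascript: bool = False) -> dict:
    """Build a run input for one start URL (hypothetical helper).

    All keys below are assumed names; verify them against the Actor's
    input schema before use.
    """
    run_input = {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": 25,          # the 25-page limit mentioned in the issue
        "ignoreCanonicalUrl": True,   # answer 1: do not skip pages with bad canonical URLs
    }
    if needs_javascript:
        # Answer 4: keep the browser, but wait less for dynamic content.
        run_input["crawlerType"] = "playwright:firefox"
        run_input["dynamicContentWaitSecs"] = 5
    else:
        # Answer 4: raw HTTP crawling without JavaScript rendering.
        run_input["crawlerType"] = "cheerio"
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_run_input("https://example.com/"), indent=2))
```

The resulting dictionary could be pasted into the Actor's JSON input editor or saved as the input of a task, one task per use case as suggested in answer 3.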

David Haddad (davhad)

Hi Jiri, thanks for your detailed feedback. Will check and respond🙏

David Haddad (davhad)

Hi Jiri, clear on 1 & 3. Still awaiting 2. For 4, the websites are heterogeneous and there is no way of knowing ahead of time which setting fits. Is there any way for the actor's logic to be adaptive on your end, or maybe it already is?

Jiří Spilka (jiri.spilka)

Hi David, Thank you again for using the Actor. I understand that configuring the Actor can be complex, as is web scraping.

Starting with point 4 – crawling speed: Yes, you’re right; if the websites are heterogeneous, you can’t simply use Cheerio. The default setting uses an adaptive crawler. In the example above (previous comment), the adaptive crawler took around 2 minutes (slower than Cheerio but faster than Playwright).

From the documentation:

The crawler automatically switches between Cheerio and Playwright for dynamic pages to maximize performance wherever possible.

Regarding point 2 – runs with 0 results: I checked a few of your runs, but I couldn’t access some websites, such as http://www.ve******ty.dk/ and http://www.ve****tier.lk/ (Bad Gateway). In these cases, the crawler stops early, after about 15 seconds.

Problematic runs are those taking around 6 minutes without results (e.g. 56spqGx8Ryo9ij8qS). Again, I can’t access the site (http://www.ha*****nd.com/). When you check the run, you’ll see the crawler is retrying with different settings.

There are two variables controlling this:

"maxRequestRetries": 5,
"requestTimeoutSecs": 60

This setup means it takes about 300 (5 * 60) seconds to give up, with an additional 60 seconds for some overhead. You could try lowering maxRequestRetries and requestTimeoutSecs, but there’s a risk that content won’t load for slower sites.
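The estimate above can be written as a small calculation, which makes it easier to see what lowering either setting buys. This is a sketch of the arithmetic in this comment only: `seconds_until_give_up` is a hypothetical helper, and treating the 60 seconds of overhead as a fixed constant is an assumption, since the thread only states it for the default settings.

```python
def seconds_until_give_up(
    max_request_retries: int,
    request_timeout_secs: int,
    overhead_secs: int = 60,  # assumed constant, taken from the estimate above
) -> int:
    """Rough time an unreachable site keeps a run alive, per the estimate above."""
    return max_request_retries * request_timeout_secs + overhead_secs


if __name__ == "__main__":
    # Default settings from the thread: 5 * 60 + 60 seconds.
    print(seconds_until_give_up(5, 60))
    # A single retry with the same timeout: 1 * 60 + 60 seconds.
    print(seconds_until_give_up(1, 60))
```

With the defaults this gives 360 seconds, which matches the roughly 6-minute empty runs mentioned above. A lower value shortens empty runs at the cost of possibly giving up on slow but reachable sites.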

I apologize for not having a foolproof solution. W... [trimmed]

David Haddad (davhad)

Hi Jiri, thanks so much for this. This is very helpful to understand and you've been spot on.

I'll reduce the retries to 1. The adaptive approach seems logical to me.

Appreciate your help, and we can mark this as closed.

Are you managing under your responsibility other actors on apify as well? If so which ones?

Thanks.

Jiří Spilka (jiri.spilka)

Hi David, I'm glad I could help.

Are you managing under your responsibility other actors on apify as well? If so which ones?

The Website Content Crawler is our flagship tool, and I've contributed to integrations within the AI ecosystem around it—for example, the OpenAI Assistant, Pinecone vector database integration, and others.

Recently, we developed the RAG-Web-Browser, which lets you crawl and extract content based on Google search results.

I'll close this issue now. Please don’t hesitate to reach out with any further questions. It was a pleasure to work on this issue.

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

1.2K

4.4

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

5.0

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

5.0

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

591

3.8

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

430

4.1

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. Json output includes: * text: Normalized page content * links: Extracted sub-URLs

wallnut.ai

Web Scraper

apify/web-scraper

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Apify

88K

4.5

HTML/Website Media Scraper

aweworkz/html-web-media-scraper

The Website Media scraper extracts all media files, i.e., images, videos, audio, and other related media elements, from multiple websites. It then provides the corresponding descriptions or the alt="" content. You'll need to use proxies to run this actor for some websites with bot-blocking features.

aweworkz

165

AI Web Scraper - Powered by Crawl4AI

raizen/ai-web-scraper

A blazing-fast AI web scraper powered by Crawl4AI. Perfect for LLMs, AI agents, AI automation, model training, sentiment analysis, and content generation. Supports deep crawling, multiple extraction strategies and flexible output (Markdown/JSON). Seamlessly integrates with Make.com, n8n, and Zapier.

Raizen Technology

144

1.0

Web Scraping API

zeeb0t/web-scraping-api---scrape-any-website

Web Scraping API that quickly and reliably scrapes any website—no selectors required. Premium proxies, CAPTCHA solving, JavaScript rendering, and automated structured data extraction are all included. It’s just $2 per 1,000 web pages scraped, with no minimum spend.