Pricing

Pay per usage

Website Content Crawler

apify/website-content-crawler

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.6 (38)

1.1k

Monthly users

Runs succeeded

>99%

Response time

2.3 days

Last modified

7 days ago

AI Developer tools

Crawler doesn't follow any links in Start URL

Closed

nhanna opened this issue

I am trying to get Website Content Crawler running with "https://portfoliocharts.com/commentary-all/" as the Start URL; all other settings are default. The crawler simply visits that page and doesn't follow any of the links on it, even though it's a very straightforward page with clear links to follow. I don't understand why this very basic use case isn't working.

nhanna

This is the task: https://console.apify.com/view/runs/mcmoAPYoVLxnNIVYs

Jakub Kopecký (jakub.kopecky)

Hi, thanks for using the Website Content Crawler.

The issue here is that the URLs of the blog posts are not in the same path as your Start URL, which means they weren't matching and thus weren't queued for crawling. You can resolve this by adding a glob pattern in includeUrlGlobs under "Crawler settings" -> "Include URLs (globs)".

For example, in this run, I added the glob https://portfoliocharts.com/[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/* to match the blog post URLs, and now the pages are being crawled: https://console.apify.com/view/runs/sGoTobrZYg08vFYxc
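To see why the default scope skips the blog posts while the glob above picks them up, here is a minimal Python sketch. It is not the crawler's own matching code: the prefix check is a simplified stand-in for the default Start URL scoping, and Python's fnmatch only approximates the glob (its `*` also matches across `/`). The post URL is hypothetical and made up for illustration; only its date-based shape matters.

```python
from fnmatch import fnmatchcase

START_URL = "https://portfoliocharts.com/commentary-all/"
INCLUDE_GLOB = (
    "https://portfoliocharts.com/"
    "[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/*"
)

# Hypothetical blog post URL (an assumption, not taken from the site).
POST_URL = "https://portfoliocharts.com/2024/01/15/example-post/"


def under_start_path(url: str, start_url: str = START_URL) -> bool:
    """Simplified stand-in for default scoping: keep URLs below the Start URL path."""
    return url.startswith(start_url)


def matches_include_glob(url: str, glob: str = INCLUDE_GLOB) -> bool:
    """Approximate glob check; fnmatch's '*' is looser than typical URL globs."""
    return fnmatchcase(url, glob)


if __name__ == "__main__":
    for url in (START_URL, POST_URL):
        print(url)
        print("  under Start URL path:", under_start_path(url))
        print("  matches include glob:", matches_include_glob(url))
```

The date-based post URL fails the prefix check because it does not sit under /commentary-all/, but it does match the glob, which is the mismatch described above.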

Please try to run the Actor again and let me know if you encounter any issues.
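If you start runs through the API rather than the Console, the same setting can be passed in the Actor input. The sketch below only builds the input object. The includeUrlGlobs field name comes from this thread, but the exact item shape ({"glob": ...}) and the helper build_run_input are assumptions, so check them against the Actor's input schema before relying on them.

```python
import json

ACTOR_ID = "apify/website-content-crawler"


def build_run_input(start_url: str, include_globs: list[str]) -> dict:
    """Hypothetical helper: assemble an Actor input with extra include globs.

    The {"glob": ...} item shape is an assumption about the input schema.
    """
    return {
        "startUrls": [{"url": start_url}],
        "includeUrlGlobs": [{"glob": glob} for glob in include_globs],
    }


run_input = build_run_input(
    "https://portfoliocharts.com/commentary-all/",
    ["https://portfoliocharts.com/[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/*"],
)

if __name__ == "__main__":
    print(json.dumps(run_input, indent=2))
    # With the apify-client package installed and an API token available,
    # the run could then be started roughly like this:
    #   from apify_client import ApifyClient
    #   ApifyClient("<YOUR_API_TOKEN>").actor(ACTOR_ID).call(run_input=run_input)
```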

Jakub Kopecky

nhanna

Thank you, rookie mistake!

Pricing

Pricing model

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

475

5.0

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

224

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

460

/llms.txt Generator

jakub.kopecky/llmstxt-generator

The /llms.txt Generator 🕸️📄 extracts website content to create an llms.txt file for AI apps 🤖✨ like LLM fine-tuning and indexing. Output is available 📥 in the Key-Value Store for easy download and integration into workflows. 🚀

Jakub Kopecký

5.0

Webpage Singer 🎶

josef.prochazka/webpage-singer

Ever wondered what a website would sound like as a song? This Actor takes any webpage, turns its content into lyrics, and transforms it into a track in your favorite genre. Just drop in a URL, pick a style, and let the AI do the rest.

Josef Procházka

5.0

Backlink Opportunity Finder

easyapi/backlink-opportunity-finder

🔍 Discover high-quality backlink opportunities to boost your domain authority and search rankings. Extract valuable data about potential websites for building authoritative backlinks, including domain metrics, relevance analysis, and estimated SEO impact.

EasyApi

Example Website Screenshot Crawler

dz_omar/example-website-screenshot-crawler

Automated website screenshot crawler using Pyppeteer and Apify. This open-source actor captures screenshots from specified URLs, uploads them to the Apify Key-Value Store, and provides easy access to the results, making it ideal for monitoring website changes and archiving web content.

Abdlhakim hefaia

Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Apify

6.9k

4.7

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

5.9k

5.0

Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Apify

79.2k

4.5