Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Hi, is there any way to also crawl links on subdomains if they are linked from the main site? For example, blog.site.com?
Hello, and thank you for your interest in the Actor! Yes, this is possible. The easiest way to achieve it is with the "Include URLs (globs)" setting: put both site.com/** and blog.site.com/** there. The setting is called includeUrlGlobs in the JSON configuration, if you use that. See https://apify.com/apify/website-content-crawler/input-schema#includeUrlGlobs for details.
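As a rough illustration of the JSON configuration mentioned above, here is a minimal Python sketch that builds such an input. The helper name build_run_input is hypothetical, and the object shapes used for startUrls ({"url": ...}) and includeUrlGlobs ({"glob": ...}), as well as the https:// prefix on each glob, are assumptions that should be checked against the input schema page linked above.

```python
import json


def build_run_input(domain: str, subdomains: list[str]) -> dict:
    """Build a crawler input that allows the main site plus the named subdomains.

    Hypothetical helper: the field shapes below are assumptions, not taken
    from the thread. Verify them against the Actor's input schema.
    """
    hosts = [domain] + [f"{sub}.{domain}" for sub in subdomains]
    return {
        # Assumed shape: a list of objects with a "url" key.
        "startUrls": [{"url": f"https://{domain}/"}],
        # Assumed shape: a list of objects with a "glob" key, one per host.
        "includeUrlGlobs": [{"glob": f"https://{host}/**"} for host in hosts],
    }


if __name__ == "__main__":
    # Print the configuration so it can be pasted into the JSON input editor.
    print(json.dumps(build_run_input("site.com", ["blog"]), indent=2))
```

The printed JSON lists one glob for site.com and one for blog.site.com, mirroring the two patterns suggested in the reply.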
I won't know the subdomain ahead of time. Does "*.site.com" work?
It should!
Well, it looks like you need to do something like https://*.site.com/**
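To see why the bare "*.site.com" pattern falls short while https://*.site.com/** works, here is a small Python sketch of glob matching against full URLs. It is a simplified approximation written for illustration, not the Actor's actual matcher: it assumes "*" matches any characters except "/", "**" matches anything including "/", and the pattern must cover the whole URL. The function names are hypothetical.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob into a regex (simplified model, for illustration only)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")      # "**" crosses "/" boundaries
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")   # "*" stays within one segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def matches(glob: str, url: str) -> bool:
    """Return True if the glob covers the entire URL under the simplified model."""
    return glob_to_regex(glob).fullmatch(url) is not None


if __name__ == "__main__":
    url = "https://blog.site.com/post/1"
    print(matches("*.site.com", url))             # bare host pattern
    print(matches("https://*.site.com/**", url))  # scheme + wildcard host + path
    print(matches("https://*.site.com/**", "https://site.com/about"))  # apex domain
```

Under this model, the bare host pattern fails because it cannot span the scheme and path, and the wildcard pattern does not cover the apex domain site.com itself, so keeping a separate site.com glob alongside it is likely still necessary.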
Actor Metrics
4k monthly users
839 stars
>99% runs succeeded
1 day response time
Created in Mar 2023
Modified 17 hours ago