Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I'm struggling to get this Actor to grab all the links from https://www.ung.no/oss/. It shows 10 links to articles starting with https://www.ung.no/oss/ (e.g. https://www.ung.no/oss/77N221s2AQ8C5Ms95a86kT). Any suggestions?
Hello, and thank you for your interest in this Actor!
Website Content Crawler is really good at following links, but not that good at pressing buttons. From what I understand, you want to click the "Se flere spørsmål og svar" ("See more questions and answers") button and scrape the links that appear. Unfortunately, the way the page implements this is not really standard, so WCC is not ready for this.
Fortunately, some time ago, we added sitemap parsing to WCC. This means the Actor can load the website's sitemap.xml file and enqueue the URLs from it. This includes all the links hidden behind the JS buttons (approximately 430,000 URLs in total). If you want to scrape all the questions and answers from the website, this is the best option. All you have to do is set the useSitemaps input option to true. Note that this approach does not scrape the results in the order they appear on the page; it roughly follows the sitemap order instead.
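For anyone starting the run from code rather than the Console, here is a minimal sketch of what that input could look like. It assumes the `apify-client` Python package and the Actor ID `apify/website-content-crawler`; the helper names `build_run_input` and `run_crawler` are made up for illustration, and everything other than `startUrls` and `useSitemaps` is left at its defaults.

```python
from typing import Any


def build_run_input(start_url: str) -> dict[str, Any]:
    """Build Actor input that turns on sitemap-based URL discovery."""
    return {
        "startUrls": [{"url": start_url}],
        # Load the site's sitemap.xml and enqueue the URLs it lists.
        "useSitemaps": True,
    }


def run_crawler(token: str, start_url: str) -> str:
    """Start the Actor and return the ID of the dataset holding the results.

    Hypothetical helper: assumes the apify-client package is installed
    and that `token` is a valid Apify API token.
    """
    # Imported lazily so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(start_url)
    )
    return run["defaultDatasetId"]
```

Calling `run_crawler("<YOUR_API_TOKEN>", "https://www.ung.no/oss/")` would start a crawl with sitemap discovery enabled. With roughly 430,000 URLs in the sitemap, it is probably worth capping the run size while testing.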
Did this answer your question? I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!
Thank you, yes it answers the question.
- 3.8k monthly users
- 636 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago