
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 3.9 (41)
Pricing: Pay per usage
1537
Total users: 59K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago
Scraper only returns 6 news items
Closed
When scraping the Darnu Group news page (https://darnugroup.lt/lt/naujienos-2/), the Actor only returns 6 news texts, even though there are more news texts available on the site.
Hello, and thank you for your interest in this Actor!
The initial page you're scraping loads its links dynamically as you scroll down. While WCC has some features for capturing lazy-loaded content, they do not cover every case.
You can load the other URLs on the website by consuming its sitemaps. Enable the Load URLs from Sitemaps input option (see the attached screenshot), which will try to find the matching URLs in the sitemap.xml files on the website.
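To illustrate what sitemap-based URL discovery does under the hood, here is a minimal sketch that parses a sitemap.xml document and collects every <loc> entry. The XML sample is illustrative, not the real darnugroup.lt sitemap, and the helper name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap.xml snippet (not the real darnugroup.lt sitemap).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/lt/naujienos-2/</loc></url>
  <url><loc>https://example.com/lt/naujiena-1/</loc></url>
</urlset>"""

# The sitemap protocol puts every element in this namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

print(extract_urls(SITEMAP_XML))
```

A crawler that consumes sitemaps this way sees every listed page directly, so it does not depend on scrolling or lazy-loading to discover links.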
Note that we recently rewrote the sitemap-processing logic from scratch, which significantly improved its performance. I strongly advise using the beta build of the Actor; you can switch to it under Run options > Build.
You can check my example run here; the only two differences are the beta build and the Load URLs from Sitemaps option.
I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!