Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.9 (41)

1537

Total users: 59K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago


Scraper only returns 6 news items

Closed

kristupas opened this issue
3 days ago

When scraping the Darnu Group news page (https://darnugroup.lt/lt/naujienos-2/), the Actor returns only 6 news articles, even though more are available on the site.

jindrich.bar

Hello, and thank you for your interest in this Actor!

The initial page you're scraping loads the links dynamically as you scroll down. While WCC has some features for capturing lazy-loaded content, they do not work in every case.

You can load the other URLs on the website by consuming the sitemaps. Enable the Load URLs from Sitemaps input option (see the attached screenshot), which will try to find matching URLs in the website's sitemap.xml files. Note that we recently rewrote the sitemap processing logic completely to improve its performance, so I strongly advise using the beta build of the Actor. You can switch to it in Run options > Build.

You can check my example run here. The only two differences from your run are the beta build and the enabled Load URLs from Sitemaps option.
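For anyone starting runs through the API rather than the Console, the same two changes might look roughly like the sketch below. It assumes the official `apify-client` Python package, and it assumes that the Load URLs from Sitemaps option is named `useSitemaps` in the Actor input; check the Actor's input schema before relying on that name. The helper names `build_run_input` and `run_crawler` are made up for this example.

```python
from typing import Any

ACTOR_ID = "apify/website-content-crawler"


def build_run_input(start_url: str) -> dict[str, Any]:
    """Build an Actor input with sitemap discovery turned on."""
    return {
        "startUrls": [{"url": start_url}],
        # Assumption: API name of the "Load URLs from Sitemaps" option.
        "useSitemaps": True,
    }


def run_crawler(token: str, start_url: str) -> list[dict[str, Any]]:
    """Run the Actor on the beta build and return its dataset items."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(
        run_input=build_run_input(start_url),
        build="beta",  # equivalent to picking beta in Run options > Build
    )
    if run is None:
        raise RuntimeError("The Actor run did not finish")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_crawler("<YOUR_API_TOKEN>", "https://darnugroup.lt/lt/naujienos-2/")` would start a run against the news page from this issue; the token placeholder is yours to fill in.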

I'll close this issue now, but feel free to ask if you have any further questions. Cheers!