Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

scraped data is redundant

Closed
visable opened this issue
a month ago

Can you please explain the scraped data and how to scrape this website properly?

jiri.spilka

Hi (again),
That’s a very good question. For the targeted page, you need to select a country first; otherwise, the crawler won’t be able to retrieve the content.

You can achieve this by first using your browser to select the country, then copying the cookies. For example, you can use the Copy Cookies Google Chrome extension.

Once you have the cookies, paste them into the Website Content Crawler settings under the initialCookies field.
Here’s my example run, which I aborted after confirming that the crawling was working.
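
To make the cookie step concrete, here is a minimal sketch of turning the exported cookies into Actor input. It assumes the extension exports a JSON array of objects with `name`, `value`, and optionally `domain` and `path` keys. The helper names `build_run_input` and `start_run` are hypothetical, the URL and cookie in the usage note are placeholders, and `start_run` assumes the `apify-client` Python package is installed.

```python
import json


def build_run_input(start_url, exported_cookies_json):
    """Build a Website Content Crawler input with initialCookies.

    Assumption: exported_cookies_json is a JSON array of cookie objects
    as copied from the browser (name, value, optional domain and path).
    """
    cookies = json.loads(exported_cookies_json)
    if not isinstance(cookies, list):
        raise ValueError("expected a JSON array of cookie objects")

    initial_cookies = []
    for exported in cookies:
        # Keep a conservative subset of fields; extra browser-specific
        # keys (for example storeId) are dropped in this sketch.
        cookie = {"name": exported["name"], "value": exported["value"]}
        for key in ("domain", "path"):
            if key in exported:
                cookie[key] = exported[key]
        initial_cookies.append(cookie)

    return {
        "startUrls": [{"url": start_url}],
        "initialCookies": initial_cookies,
    }


def start_run(api_token, run_input):
    """Start the Actor through the API (requires the apify-client package)."""
    from apify_client import ApifyClient  # imported lazily; optional dependency

    client = ApifyClient(api_token)
    return client.actor("apify/website-content-crawler").call(run_input=run_input)
```

For example, `build_run_input("https://example.com/", '[{"name": "country", "value": "US"}]')` returns a dictionary that can be pasted into the Actor input as JSON or passed to `start_run`.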

As with the previous issue, I noticed that the data is structured. You might get better results using a custom Web Scraper if you have some coding experience.
I hope this helps! Please let me know if it works for you. Jiri

jiri.spilka

I'll go ahead and close this issue now, but feel free to ask any questions or raise a new issue.

Developer
Maintained by Apify