Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

How to grab all the URLs from https://www.ung.no/oss/

Closed

cgoul opened this issue
2 months ago

I'm struggling to get this Actor to grab all the links from https://www.ung.no/oss/. The page shows 10 links to articles starting with https://www.ung.no/oss/ (e.g. https://www.ung.no/oss/77N221s2AQ8C5Ms95a86kT). Any suggestions?

jindrich.bar

Hello, and thank you for your interest in this Actor!

Website Content Crawler (WCC) is really good at following links, but not that good at pressing buttons. From what I understand, you want to click the "Se flere spørsmål og svar" ("See more questions and answers") button and scrape the links that appear. Unfortunately, the page implements this button in a non-standard way, so WCC cannot handle it.

Fortunately, some time ago, we added sitemap parsing to WCC. This means the Actor can load the website's sitemap.xml file and enqueue the URLs from it. This includes all the links hidden behind the JS buttons (approximately 430,000 URLs in total). If you want to scrape all the questions and answers from the website, this is the best option. All you have to do is set the useSitemaps input option to true. Note that with this approach the results are not scraped in the order they appear on the page; they roughly follow the sitemap order instead.
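
For illustration only, here is a rough Python sketch of what this amounts to. The useSitemaps option comes from the advice above. The startUrls field shape, the helper names, and the sample sitemap (including the /other/page URL) are assumptions made up for the example, and this is not WCC's actual implementation.

```python
import json
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def build_actor_input(start_url):
    """Build a minimal Actor input with sitemap parsing switched on.

    `useSitemaps` is the option named in this thread; the `startUrls`
    shape is an assumption about the input schema.
    """
    return {"startUrls": [{"url": start_url}], "useSitemaps": True}


def urls_from_sitemap(xml_text, prefix=""):
    """Return the <loc> URLs of a sitemap, in sitemap order, filtered by prefix.

    Hypothetical helper: it only illustrates why links hidden behind
    JS buttons are still reachable through the sitemap.
    """
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [u for u in urls if u.startswith(prefix)]


# Tiny made-up sitemap; the real one reportedly lists about 430,000 URLs.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.ung.no/oss/77N221s2AQ8C5Ms95a86kT</loc></url>
  <url><loc>https://www.ung.no/other/page</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(json.dumps(build_actor_input("https://www.ung.no/oss/"), indent=2))
    print(urls_from_sitemap(SAMPLE_SITEMAP, prefix="https://www.ung.no/oss/"))
```

The prefix filter keeps only the question pages under /oss/, and the list order mirrors the sitemap rather than the order on the page, which matches the caveat above.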

Did this answer your question? I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

cgoul

2 months ago

Thank you, yes it answers the question.

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 18 hours ago