Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
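As a rough illustration of the crawl-and-extract workflow described above, the sketch below calls the Actor through the `apify-client` Python package. It is a minimal sketch, not the Actor's documented usage: the input field names (`startUrls`, `maxCrawlDepth`, `maxCrawlPages`) and the helper names `build_run_input` and `crawl` are assumptions made for this example, so check them against the Actor's input schema before relying on them.

```python
from typing import Any


def build_run_input(start_url: str, max_depth: int = 0, max_pages: int = 10) -> dict[str, Any]:
    """Build an input object for the Actor.

    The field names used here are assumptions for illustration;
    verify them against the Actor's input schema.
    """
    return {
        "startUrls": [{"url": start_url}],  # pages where the crawl begins
        "maxCrawlDepth": max_depth,         # 0 is meant to crawl only the start URLs
        "maxCrawlPages": max_pages,         # upper bound on pages, to cap cost
    }


def crawl(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # call() starts the run and blocks until it finishes.
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    # Results are stored in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

With an API token at hand, `crawl(token, build_run_input("https://example.com"))` would return one record per crawled page. Keeping the input builder separate from the network call makes the input easy to inspect or log before a run is started.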

set the depth to 0

Opened 6 days ago by environmental_mammal, last comment 6 days ago by Jiří Spilka (jiri.spilka)

Few requests

Opened 16 days ago by nimble_caretaker, last comment 10 days ago by nimble_caretaker

TypeError: Cannot read properties of undefined (reading 'content-type')

Opened 23 days ago by sevcik, last comment 11 days ago by Jiří Spilka (jiri.spilka)

Poor CPU utilization due to low usage limit

Opened 3 months ago by write2souvik, last comment 3 months ago by write2souvik

Crawling takes longer when calling API vs on site

Opened 4 months ago by adi-kamaraj, last comment 4 months ago by Jan Buchar (janbuchar)

Unable to crawl https://openai.com/index/extracting-concepts-from-gpt-4/

Opened 4 months ago by imda_peckyoke, last comment 4 months ago by Jindřich Bär (jindrich.bar)

My Runs do not end

Opened 4 months ago by matthias.amberg, last comment 4 months ago by matthias.amberg

Parsing website with CloudFlare protection

Opened 4 months ago by sash2s, last comment 4 months ago by sash2s

Unable to crawl the whole website

Opened 4 months ago by simpleworks, last comment 4 months ago by Jan Buchar (janbuchar)

Automating Web Content Crawling for Real-Time Updates

Opened 4 months ago by glovebubble, last comment 4 months ago by Jan Buchar (janbuchar)

Getting duplicate URLs in web crawling

Opened 4 months ago by simpleworks, last comment 4 months ago by Jan Buchar (janbuchar)

Memory limit control

Opened 5 months ago by vitthalrao.lavate, last comment 4 months ago by intriguing_game

Treat hash URLs as separate pages to crawl

Opened 5 months ago by civic-roundtable, last comment 5 months ago by civic-roundtable

Crawling claims to succeed, but crawls nothing and returns no results

Opened 5 months ago by chrislrobert, last comment 5 months ago by chrislrobert

Chrome+Playwright crawler is deprecated and not working anymore

Opened 5 months ago by sestek, last comment 5 months ago by Jindřich Bär (jindrich.bar)

Actor run timed out

Opened 6 months ago by david_conveyor, last comment 6 months ago by Ivan Vasilev (ivanvia)

Adaptive crawler is failing to crawl the start URL

Opened 7 months ago by embrace_ai, last comment 7 months ago by embrace_ai

bug: iframe contents don't get extracted properly

Opened 7 months ago by bllndman, last comment 6 months ago by bllndman

Can we limit the number of pages inside a child?

Opened 7 months ago by sai_sampath, last comment 7 months ago by Jindřich Bär (jindrich.bar)

Failed Crawling for G2 web pages

Opened 8 months ago by motivated_leaflet, last comment 8 months ago by motivated_leaflet

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 18 hours ago