Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Why does this job take so long?

Opened 3 days ago by grossjo, last comment 3 days ago by Dušan Vystrčil (dusan.vystrcil)

Strictly Necessary cookies are absolutely necessary for core functions such as navigating the page or accessing secure areas.

Opened 3 days ago by lei.niu.bigco, last comment 3 days ago by Dušan Vystrčil (dusan.vystrcil)

Can't get globs to work

Opened 4 days ago by matt333, last comment 4 days ago by matt333

Scraping only a few elements on the page and saving them in separate fields

Opened 7 days ago by igocza, last comment 6 days ago by igocza

not the whole content is scraped

Opened 7 days ago by igocza, last comment 6 days ago by Jiří Spilka (jiri.spilka)

can't crawl the whole website

Opened 9 days ago by visable, last comment 5 days ago by Jiří Spilka (jiri.spilka)

No urls to crawl, runs forever

Opened 11 days ago by betterbrain-dev, last comment 5 days ago by Jiří Spilka (jiri.spilka)

The subpages are not crawled

Opened 11 days ago by digitrans, last comment 11 days ago by Jiří Spilka (jiri.spilka)

File extension prevents crawler from accessing pages

Opened 12 days ago by russet_backpack, last comment 12 days ago by Jiří Spilka (jiri.spilka)

Actor timeout

Opened 12 days ago by MavenAGI, last comment 5 days ago by Jiří Spilka (jiri.spilka)

Diacritics error

Opened 15 days ago by kata.hrus, last comment 15 days ago by Jiří Spilka (jiri.spilka)

Testing the actor but get this result

Opened 15 days ago by davidsen.anders, last comment 14 days ago by Jiří Spilka (jiri.spilka)

Doesn't work

Opened 16 days ago by topaz_frog, last comment 16 days ago by Jiří Spilka (jiri.spilka)

This run seems to be having a big problem

Opened 20 days ago by pure_chipmunk, last comment 19 days ago by Jiří Spilka (jiri.spilka)

Can I limit the max number of pages to crawl for each start URL?

Opened 21 days ago by pai911, last comment 19 days ago by Jiří Spilka (jiri.spilka)

Actor is not parsing the rest of the pages

Opened 21 days ago by Custombizio, last comment 14 days ago by Jiří Spilka (jiri.spilka)

Keep the urls in text or markdown?

Opened 22 days ago by cgoul, last comment 20 days ago by cgoul

How do I set up pagination with a URL?

Opened 22 days ago by sacdrexelmba, last comment 18 days ago by optimal_valuation

Crawling only first page

Opened 23 days ago by Atish, last comment 23 days ago by Dušan Vystrčil (dusan.vystrcil)

Combining startUrl and includeUrlGlob

Opened a month ago by nauticallygreat, last comment 25 days ago by Jiří Spilka (jiri.spilka)

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 13 hours ago