Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawler misses text content

Closed

innovum_admin opened this issue
3 months ago

For the page https://hbk.ch/board/ it misses the part with:

"""
Chairman: Marwan Naja
Vice-chairman: Frédéric Berney
Managing Director: Marc Nadas
"""

It has happened on other pages as well. There are no errors in the logs; it just silently misses content. Is there a way to reduce the chances of it happening? Why does it happen?

jindrich.bar

Hello and thank you for your interest in this Actor!

The (default) readableText text extractor is sometimes too eager to clean the web page content.

If you always want the original content of the page, switch HTML Processing > HTML Transformer to None. If that leaves extra content in your dataset that you don't want, use the Remove HTML elements (CSS selector) option to strip those elements by their CSS selectors.
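For runs started through the API rather than the UI, the same two settings can be expressed as Actor input. The snippet below is a minimal sketch, not an official recipe: it assumes the input fields are named `startUrls`, `htmlTransformer` (with the value `"none"`), and `removeElementsCssSelector`, so check them against the Actor's input schema before relying on them. The helper `build_run_input` is a hypothetical name used only for illustration.

```python
import json


def build_run_input(url, remove_selectors=()):
    """Build an input dict that disables the HTML transformer.

    The field names are assumptions about the Actor's input schema;
    verify them in the Actor's input documentation.
    """
    run_input = {
        "startUrls": [{"url": url}],
        # Assumed API equivalent of "HTML Transformer: None" in the UI.
        "htmlTransformer": "none",
    }
    if remove_selectors:
        # Assumed API equivalent of "Remove HTML elements (CSS selector)".
        # Left unset when empty, so the Actor's own default stays in effect.
        run_input["removeElementsCssSelector"] = ", ".join(remove_selectors)
    return run_input


if __name__ == "__main__":
    example = build_run_input("https://hbk.ch/board/", ["nav", "footer"])
    print(json.dumps(example, indent=2))
```

With the `apify-client` package, a dict like this could be passed as `run_input` to `ApifyClient(token).actor("apify/website-content-crawler").call(...)`. The selectors `nav` and `footer` are placeholders; pick whatever elements turn out to be noise on your pages.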

Ultimately, the choice is yours and depends on your use case: if you're looking for the cleanest possible data, go with readableText; if you need your data to be complete, pick None. Check my example run on your webpage with the None text extractor. It seems to retain all the data in the result.
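Because missing content produces no log errors, one way to compare the two settings is to check the crawled items for phrases you know should be on the page. The following is a rough sketch under assumptions: it presumes each dataset item is a dict with `url` and `text` keys, and `find_missing_phrases` is a hypothetical helper, not part of the Actor. The sample items are made up for illustration.

```python
def _normalize(value):
    """Collapse whitespace and ignore case so line breaks don't cause false misses."""
    return " ".join((value or "").split()).casefold()


def find_missing_phrases(items, expected_phrases):
    """Return {url: [phrases not found]} for every item lacking a phrase.

    Assumes each item is a dict with "url" and "text" keys.
    """
    missing = {}
    for item in items:
        text = _normalize(item.get("text"))
        absent = [p for p in expected_phrases if _normalize(p) not in text]
        if absent:
            missing[item.get("url", "<unknown>")] = absent
    return missing


if __name__ == "__main__":
    # Made-up sample items standing in for two runs of the same page.
    sample_items = [
        {"url": "run-with-none", "text": "Board\nChairman:  Marwan Naja"},
        {"url": "run-with-readableText", "text": "Board"},
    ]
    print(find_missing_phrases(sample_items, ["Chairman: Marwan Naja"]))
```

Running a check like this on a few representative pages before a large crawl can show whether readableText is dropping text you need.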

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

Developer
Maintained by Apify
Actor metrics
  • 2.8k monthly users
  • 434 stars
  • 99.9% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 3 days ago