Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.
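
For orientation, here is a minimal sketch of driving the Actor from Python with the official apify-client package; the token and start URL are placeholders, and the output fields read below (url, markdown, text) are assumed from the Actor's typical dataset items.

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start a run of the Actor and wait for it to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://example.com/"}],
    }
)

# Each dataset item holds the cleaned text/Markdown for one crawled page,
# ready to be chunked and loaded into a vector database or RAG pipeline.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"])
    print((item.get("markdown") or item.get("text", ""))[:300])
```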

Content incorrectly removed from page

Closed

civic-roundtable opened this issue 2 months ago

When I scrape https://elections.ri.gov/elections/risk-limiting-audit-center the bulk of the text is incorrectly removed.

I have tried the following settings (combined into the input sketch below):

  • "aggressivePrune": false
  • "removeCookieWarnings": false
  • "removeElementsCssSelector": "dummy_keep_everything"

But the majority of the page content ends up in removedElementsHtmlUrl and not in the final output HTML, Markdown, or Text.

If I set "htmlTransformer": "none", then the page content appears. But extractus, readableText, and readableTextIfPossible all incorrectly strip the content. I'd like to keep the htmlTransformer in place for all of the other pages being scraped (as well as removing text like "Skip to main content").

Any idea what's going on here, or how I can debug the site to understand where the transformer is going wrong?

jindrich.bar

Hello, and thank you for your interest in this Actor!

You've correctly figured out that the transformers sometimes remove too much content. Unfortunately, they work as black boxes and can't be easily debugged.

To remove the extra fluff while using htmlTransformer: none, you can try the following:

  • Use removeElementsCssSelector to manually remove additional unwanted content when using the none transformer. You can start by reverting this option to its default value.
  • Ensure aggressivePrune is set to true; this feature removes repeating content from the scraped data, even across pages. Both options are shown in the input sketch below.
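
Putting those together, an input for the none transformer might look roughly like this (a sketch; the removeElementsCssSelector value shown is only illustrative, not the Actor's exact default):

```python
# Illustrative input sketch for the "none" transformer; the Actor's real
# default removeElementsCssSelector is longer than what is shown here.
run_input = {
    "startUrls": [{"url": "https://elections.ri.gov/elections/risk-limiting-audit-center"}],
    "htmlTransformer": "none",
    "aggressivePrune": True,
    # Manually strip obvious page chrome instead of relying on a transformer:
    "removeElementsCssSelector": "nav, header, footer, aside, script, style, noscript",
}
```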

If you truly need the htmlTransformer for other pages, the easiest way out might be handling this page separately, in its own Run.
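
For example, a rough sketch of such a separate run via the Python apify-client (the token is a placeholder; maxCrawlDepth: 0 is shown only to keep the sketch to this single page):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# One-off run for the problematic page only, with the transformer disabled.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://elections.ri.gov/elections/risk-limiting-audit-center"}],
        "maxCrawlDepth": 0,  # assumption: depth 0 limits the crawl to the start URL
        "htmlTransformer": "none",
        "aggressivePrune": True,
    }
)

# The page text should now survive into the dataset items.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("text", "")))
```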

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

Developer: Maintained by Apify
Actor metrics
  • 2.8k monthly users
  • 434 stars
  • 99.9% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 3 days ago