Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
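To illustrate how crawler output typically feeds an LLM pipeline, here is a minimal sketch that maps crawled items to document records suitable for a vector database or RAG loader. The item field names (`text`, `url`) are assumptions about the Actor's output shape, not taken from its documentation; adjust them to match your actual dataset items.

```python
def item_to_document(item: dict) -> dict:
    """Map one crawler result item to a generic document record.

    Assumes each item carries the extracted page text under "text"
    and the page address under "url" (hypothetical field names).
    """
    return {
        "page_content": item.get("text", ""),
        "metadata": {"source": item.get("url", "")},
    }


# Example: convert a batch of crawled items for downstream ingestion.
items = [
    {"url": "https://example.com/about", "text": "About us ..."},
    {"url": "https://example.com/pricing", "text": "Pricing ..."},
]
documents = [item_to_document(it) for it in items]
```

A mapping function like this is also the shape LangChain-style dataset loaders expect: you hand them a callable that turns each raw item into a document object.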

Rating: 4.5 (39)

Pricing: Pay per usage


Total users: 49.9k

Monthly users: 7k

Runs succeeded: >99%

Issue response: 5 days

Last modified: 18 hours ago


Issue: Text content missing from crawling results (Closed)

motivated_leaflet opened this issue a year ago

I noticed that a lot of valid text can be missing from crawled page results.

Example url: https://grata.com/company/about-us

Crawling result: Despite making up nearly half of the US economy, private companies are notoriously difficult to find and engage even for the most innovative dealmakers. In the past, finding the right small and mid-market companies for a deal has required a massive effort, and a bit of luck. This hurts target companies too, who often miss out on game changing opportunities as they are practically invisible. So, what if there was a different, better way to do business? A solution that gave innovative dealmakers an edge? A tool to help them find the right company - at the right time? That's why we created Grata - to make that difference. Our deal sourcing platform streamlines the process of finding the best companies to target: making it easier and faster to gain visibility into the entire market, gather relevant intelligence on target companies, build bespoke lists, and find similar companies to uncover more of the best opportunities in less time.

Expected results: The page contains a lot more text.

Is there a CSS selector or crawler setting that needs to be modified to capture the additional text? Thanks!

jindrich.bar replied:

Hello - and thank you for your interest in this Actor!

This is caused by the HTML post-processing setup - sometimes the HTML cleaning removes too much content (not only the navbars and footers but also some of the page's actual content).

You can fix this by setting HTML Processing > HTML Transformer to None - this way, the extracted content is only stripped of the specified CSS selectors and is not cleaned by any other tools.
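The fix above can be sketched as a run input for the Actor. This is a hedged example, not a copy of the official schema: the field names (`startUrls`, `htmlTransformer`, `removeElementsCssSelector`) and the `"none"` value are assumptions about how the UI options map to the JSON input, so verify them against the Actor's input documentation before use.

```python
def build_run_input(start_url: str) -> dict:
    """Build a Website Content Crawler run input with HTML cleaning disabled.

    Field names are assumed, not authoritative. "htmlTransformer": "none"
    mirrors the HTML Processing > HTML Transformer = None setting: only
    the CSS selectors listed below are stripped, nothing else is cleaned.
    """
    return {
        "startUrls": [{"url": start_url}],
        "htmlTransformer": "none",
        # Still remove obvious chrome so the extracted text stays readable.
        "removeElementsCssSelector": "nav, footer, script, style",
    }


run_input = build_run_input("https://grata.com/company/about-us")

# With the apify-client package and an API token, the Actor could then be
# started along these lines (shown for illustration only, not executed here):
# client = ApifyClient("<YOUR_API_TOKEN>")
# run = client.actor("apify/website-content-crawler").call(run_input=run_input)
```

Copying this input and re-running the crawl should return the page text that the default cleaning step was discarding.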

Check out my example run with the fixed input (feel free to copy the input from there verbatim).

I'll close this issue now, but feel free to ask additional questions if you have any. Thanks again!