
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
4.5 (39)
Pricing
Pay per usage
Total users: 49.9k
Monthly users: 7k
Runs succeeded: >99%
Issue response: 5 days
Last modified: 18 hours ago
Text content missing from crawling results
Closed
I noticed that a lot of valid text can be missing from crawled page results.
Example url: https://grata.com/company/about-us
Crawling result: Despite making up nearly half of the US economy, private companies are notoriously difficult to find and engage even for the most innovative dealmakers. In the past, finding the right small and mid-market companies for a deal has required a massive effort, and a bit of luck. This hurts target companies too, who often miss out on game changing opportunities as they are practically invisible. So, what if there was a different, better way to do business? A solution that gave innovative dealmakers an edge? A tool to help them find the right company - at the right time? That's why we created Grata - to make that difference. Our deal sourcing platform streamlines the process of finding the best companies to target: making it easier and faster to gain visibility into the entire market, gather relevant intelligence on target companies, build bespoke lists, and find similar companies to uncover more of the best opportunities in less time.
Expected results: The page contains a lot more text.
Is there a CSS selector or crawler setting that needs to be modified to capture the additional text? Thanks!
Hello - and thank you for your interest in this Actor!
This is caused by the HTML post-processing setup: sometimes the HTML cleaning removes too much content (not only the navbars and footers but also some of the actual content of the page).
You can fix this by setting HTML Processing > HTML Transformer to None. This way, the extracted content will only be stripped of the specified CSS selectors and won't be cleaned by any other tools.
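For reference, a minimal sketch of what the corrected Actor input might look like, assuming the input fields are named `startUrls`, `htmlTransformer`, and `removeElementsCssSelector` (check the Actor's input schema for the exact field names; the CSS selector shown is only an illustrative example):

```json
{
  "startUrls": [
    { "url": "https://grata.com/company/about-us" }
  ],
  "htmlTransformer": "none",
  "removeElementsCssSelector": "nav, footer"
}
```

With `htmlTransformer` set to `"none"`, only the elements matching `removeElementsCssSelector` are stripped from the page, so content that the automatic cleaning would otherwise discard is kept in the results.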
Check out my example run with the fixed input (feel free to copy the input from there verbatim).
I'll close this issue now, but feel free to ask additional questions if you have any. Thanks again!