Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.7 (41)

1526

Total users: 59K
Monthly users: 7.8K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 3 days ago

Include HTML Element (instead of exclude)

Closed

matthias.amberg opened this issue
a year ago

A lot of CMS web pages have a marker that defines the actual content of the page (excluding navigation and stuff like that), for instance an HTML element that wraps the actual content.

A feature where you can explicitly include certain HTML Query paths instead of excluding would be nice in the HTML processing settings.

jindrich.bar

Hello @matthias.amberg and thank you for your interest in this Actor!

I see why you might want something like this, but at the same time, it feels like our current extraction setup is robust enough that there should be no need for it. In other words, our HTML extractors should do this step automatically.

Either way, I'm very curious about this - we'll be happy to hear about your use case for this! Do you have an example of a page where this feature would help you? Thanks!

jiri.spilka

Hi, this issue has been inactive for a long time. The Actor includes configuration parameters that allow users to specify keepElementsCssSelector (a selector to choose which elements to keep) and removeElementsCssSelector (a selector to specify which elements to remove). You can find more details in the input schema description.

I’ll close this issue for now, but feel free to ask any questions or raise a new issue.
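
For anyone landing on this thread later, here is a rough sketch of what an input using the two parameters mentioned above might look like. Only keepElementsCssSelector and removeElementsCssSelector come from this thread; the startUrls shape, the example URL, the selector values, and the build_run_input helper are assumptions made for illustration, so please check them against the input schema description before relying on them.

```python
import json


def build_run_input(start_url, keep_selector=None, remove_selector=None):
    """Build a hypothetical input object for the Actor.

    keepElementsCssSelector / removeElementsCssSelector are the parameter
    names mentioned in this thread; the startUrls shape is an assumption.
    """
    run_input = {"startUrls": [{"url": start_url}]}
    if keep_selector:
        # Keep only the elements that match this CSS selector.
        run_input["keepElementsCssSelector"] = keep_selector
    if remove_selector:
        # Drop elements that match this CSS selector.
        run_input["removeElementsCssSelector"] = remove_selector
    return run_input


# Illustrative values only: keep the main content wrapper, drop nav and footer.
example_input = build_run_input(
    "https://example.com/docs",
    keep_selector="main, #content",
    remove_selector="nav, footer",
)
print(json.dumps(example_input, indent=2))
```

The resulting object could be pasted into the Actor's JSON input, or, assuming you use the apify-client Python package, passed as run_input to ApifyClient(token).actor("apify/website-content-crawler").call(...).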