Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
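For reference, a minimal sketch of running the crawler from Python with the `apify-client` package and reading its dataset - the token and start URL below are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder API token

# Start the Website Content Crawler Actor and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.example.com"}]},  # hypothetical start URL
)

# Read the crawled pages; most RAG pipelines only need `url` plus `text` or `markdown`.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("text") or ""))
```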
Going through the documentation, I could not find anything related to the expected schema (with types) of the returned dataset, except for some examples. This leaves me unsure which fields are required and which can be None. I am also unsure about the types of some fields, e.g. is `loadedTime` expected to be a strict str or a datetime?
Hello and thank you for your interest in this Actor.
You are right - there is no documentation of the output format for this Actor. I guess we've never seen the need for one, as the Actor's main purpose was providing data for RAG use cases (when feeding data to an LLM, you usually only need the `text` or `markdown` field for the content and the `url` field for the source). Most of the other fields were meant primarily for debugging. However, this is a great point - proper documentation is something this Actor needs, and we've already started working on it. It's hard to give estimates because of the upcoming holiday season, but we'll let you know as soon as it's out.
Regarding your second question - the `loadedTime` field will always be a string (ISO 8601, to be precise, so something like `2023-12-20T13:08:06Z`). This is because JSON has no dedicated type for storing dates. From the type names in your question, I assume you are using Python? I'm not too sure about our Python client library, but I would be surprised if it converted those ISO 8601 date-time strings into actual `datetime` objects (so I'm assuming you'll be getting strings and will have to parse them yourself). This is certainly the case if you are downloading the dataset items (and parsing the JSON) without the Client.
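For illustration, a minimal sketch of parsing those strings yourself, assuming you already have the dataset items as plain dicts (e.g. from a downloaded JSON file):

```python
from datetime import datetime, timezone

def parse_loaded_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2023-12-20T13:08:06Z'."""
    # datetime.fromisoformat() in older Python versions does not accept the
    # trailing 'Z', so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)

item = {"url": "https://docs.example.com", "loadedTime": "2023-12-20T13:08:06Z"}  # sample item
loaded_at = parse_loaded_time(item["loadedTime"])
print(loaded_at.isoformat())  # 2023-12-20T13:08:06+00:00
```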
I'll keep this issue open until we add the documentation - in the meantime, feel free to ask any additional question... [trimmed]
Thanks a lot for the response. I have another question: since this Actor is for RAG use cases, are there any recommendations on which chunking methods are best suited for it? Since it no longer returns the HTML tags, would it still be possible to chunk the scraped documents according to the HTML structure, e.g. for documentation sites?
Hello again (and sorry for the delay - we've all been on vacation during the holiday period).
You actually can tell the Actor to store the original (or preprocessed) HTML in the output - simply check the Output settings > Save HTML option, and you'll find the processed HTML in the `html` field of the dataset. You can modify the preprocessing logic on the Input tab under HTML Processing - add or remove CSS selectors to be stripped, change the HTML transformer (`None` leaves the original content, minus the elements matched by Remove HTML elements (CSS selector)), etc. If this seems too complicated, don't worry - you usually don't have to modify this much, as we've picked the default values very carefully :)
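If you do keep the `html` field, a heading-based split is fairly easy to put together yourself - a rough sketch, assuming BeautifulSoup and a simple rule that every `<h1>`/`<h2>` starts a new chunk:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def split_html_by_headings(html: str) -> list[dict]:
    """Split the saved `html` field into chunks, one per h1/h2 section."""
    soup = BeautifulSoup(html, "html.parser")
    chunks, current = [], {"heading": None, "text": []}
    for element in soup.find_all(["h1", "h2", "p", "li", "pre"]):
        if element.name in ("h1", "h2"):
            # A new heading closes the previous chunk and starts the next one.
            if current["text"]:
                chunks.append(current)
            current = {"heading": element.get_text(strip=True), "text": []}
        else:
            current["text"].append(element.get_text(" ", strip=True))
    if current["text"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["text"])} for c in chunks]
```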
Aside from the `html` output, you can also tell the Actor to store the Markdown output (see the Output settings > Save Markdown toggle). Markdown seems like the more intuitive choice to me - it's a much more economical format (no tags or bloated syntax, so you're saving context size), but it still keeps the important formatting (headings, paragraphs, lists, etc.). I'm not sure about off-the-shelf support for processing Markdown content for RAG, though - but it shouldn't be too hard to implement a simple splitter on your own.
Either way, sorry for the wait again - both for this message and the dataset documentation. The documentation is still in the works, but I'll let you know as soon as it's out. Thank you!