Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Documentation for Dataset Schema

Open

precocious_clouds opened this issue
a year ago

Going through the documentation, I could not find anything about the expected schema (with types) of the returned dataset, apart from some examples. This leaves me unsure which fields are required and which can be None. I am also unsure about the types of some fields, e.g. is "loadedTime" expected to be strictly a str, or a datetime?

jindrich.bar

Hello and thank you for your interest in this Actor.

You are right - there is no format documentation for the output of this Actor. I guess we've never seen the need for one, as the Actor's main purpose was providing data for RAG use cases (when feeding data to an LLM, you usually only need the text or markdown field for the content and the url field for the source). Most of the other fields were meant primarily for debugging. However, this is a great point - proper documentation is something this Actor needs and we've already started working on it. It's hard to give estimates because of the upcoming holiday season, but we'll let you know as soon as it's out.

Regarding your second question - the loadedTime field will always be a string (in ISO 8601 format, to be precise, so something like 2023-12-20T13:08:06Z). This is because JSON has no dedicated type for storing dates. Now, from the names of the types in your question, I assume you are using Python? I'm not too sure about our Python client library, but I would be surprised if it converted those ISO 8601 date-time strings into actual datetime objects (so I'm assuming you'll be getting strings and will have to parse them yourself). This is definitely the case if you are downloading the dataset items (and parsing the JSON) without the client.
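
As a minimal sketch of the "parse them yourself" step, assuming a dataset item has already been loaded into a plain Python dict: the `item` value below is made-up sample data and `parse_loaded_time` is a hypothetical helper, not part of any Apify library. Only the loadedTime format is taken from the reply above.

```python
from datetime import datetime


def parse_loaded_time(value: str) -> datetime:
    """Parse an ISO 8601 string such as '2023-12-20T13:08:06Z' into an aware datetime."""
    # datetime.fromisoformat() accepts a trailing 'Z' only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset to stay compatible with older versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


# Hypothetical item; the URL is a placeholder.
item = {"url": "https://example.com/", "loadedTime": "2023-12-20T13:08:06Z"}
loaded = parse_loaded_time(item["loadedTime"])
print(loaded.isoformat())
```

The result is a timezone-aware datetime, so it can be compared safely with other aware timestamps.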

I'll keep this issue open until we add the documentation - in the meantime, feel free to ask any additional question... [trimmed]

precocious_clouds

a year ago

Thanks a lot for the response. I have another question: since this Actor is for RAG use cases, are there any ideas about which chunking methods are best suited for it? Since it no longer returns the HTML tags, would it still be possible to chunk the scraped documents according to the HTML structure, e.g. for documentation sites?

jindrich.bar

Hello again (and sorry for the delay - we've all been on vacation during the holiday period).

You can actually tell the Actor to store the original (or preprocessed) HTML code in the output - simply check the Output settings > Save HTML option, and you'll find the processed HTML in the html field of the dataset. You can modify the preprocessing logic on the Input tab under HTML Processing - add or remove CSS selectors for elements to be removed, change the HTML transformer (None leaves the original content, minus the elements matched by Remove HTML elements (CSS selector)), etc. If this seems too complicated, don't worry - you usually don't have to modify this too much, as we've picked the default values very carefully :)
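
For chunking by HTML structure, one possible approach is to split the contents of the html field at heading tags. The following is only a rough sketch using Python's standard library; `HeadingChunker` and `chunk_html_by_headings` are hypothetical names, the sample HTML is made up, and the code assumes the HTML has already been cleaned (it does not filter out script or style content, and it joins text fragments with single spaces).

```python
from html.parser import HTMLParser
from typing import Dict, List

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChunker(HTMLParser):
    """Collects text into chunks, starting a new chunk at every heading tag."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[Dict[str, str]] = []
        self._heading: List[str] = []
        self._text: List[str] = []
        self._in_heading = False

    def _flush(self) -> None:
        # Collapse runs of whitespace and store the chunk if it is non-empty.
        heading = " ".join(" ".join(self._heading).split())
        text = " ".join(" ".join(self._text).split())
        if heading or text:
            self.chunks.append({"heading": heading, "text": text})
        self._heading, self._text = [], []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()  # a new heading closes the previous chunk
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        (self._heading if self._in_heading else self._text).append(data)

    def close(self) -> None:
        super().close()
        self._flush()  # emit the final chunk


def chunk_html_by_headings(html: str) -> List[Dict[str, str]]:
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Intro</h1><p>Hello</p><h2>Setup</h2><p>Install it.</p>"
for chunk in chunk_html_by_headings(sample):
    print(chunk)
```

Each chunk keeps its heading next to its text, which can be useful as metadata when the chunks are stored in a vector database.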

Aside from the html output, you can also tell the Actor to store the Markdown output (see the Output settings > Save Markdown toggle). This seems like the kinda intuitive choice to me - it's a much more economical format (no tags or bloated syntax, so you're saving context size), but it still contains the important formatting (headings, paragraphs, lists, etc.). I'm not sure about the support for processing Markdown content for RAG, but I guess it shouldn't be too hard to implement a simple splitter on your own.
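
A simple splitter of the kind mentioned above could look roughly like this. It is only a sketch under a few assumptions: it handles ATX-style headings only (lines starting with `#`), it skips heading-like lines inside fenced code blocks, and `split_markdown_by_headings` is a hypothetical helper name rather than anything provided by the Actor.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_markdown_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at every ATX heading."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    lines: List[str] = []
    in_fence = False

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            body = "\n".join(lines).strip()
            if heading or body:
                chunks.append({"heading": heading, "content": body})
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)

    # Emit whatever is left after the last heading.
    body = "\n".join(lines).strip()
    if heading or body:
        chunks.append({"heading": heading, "content": body})
    return chunks


sample = "# Intro\nHello.\n\n## Setup\nInstall it.\n"
for chunk in split_markdown_by_headings(sample):
    print(chunk)
```

Splitting at headings keeps each chunk aligned with a section of the page; very long sections would still need a secondary size-based split.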

Either way, sorry for the wait again - both for this message and the dataset documentation. The documentation is still in the works, but I'll let you know as soon as it's out. Thank you!

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 711 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 2 hours ago