Website Content Crawler

apify/website-content-crawler
Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Documentation for Dataset Schema

Open

precocious_clouds opened this issue
4 months ago

Going through the documentation, I could not find anything related to the expected schema (with types) of the returned dataset, except for some examples. This makes me unsure about which fields are required and which can be None. I am also unsure about the types of some fields, e.g. is "loadedTime" expected to be a plain str or a datetime?

Hello and thank you for your interest in this Actor.

You are right - there is no formal documentation for the output of this Actor. I guess we've never seen the need for one, as the Actor's main purpose is providing data for RAG use cases (when feeding data to an LLM, you usually only need the text or markdown field for the content and the url field for the source). Most of the other fields were meant primarily for debugging. However, this is a great point - proper documentation is something this Actor needs, and we've already started working on it. It's hard to give estimates because of the upcoming holiday season, but we'll let you know as soon as it's out.

Regarding your second question - the loadedTime field will always be a string (ISO 8601 to be precise, so something like 2023-12-20T13:08:06Z). This is because there is no dedicated type for storing dates in JSON. Now, from the names of the types in your question, I assume you are using Python? I'm not too sure about our Python client library, but I would be surprised if it converted those ISO 8601 date-time strings into actual datetime objects (so I'm assuming you'll be getting strings and have to parse them yourself). This is definitely the case if you are downloading the dataset items (and parsing the JSON) without the Client.
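For illustration, a minimal sketch of parsing such a loadedTime string in Python with the standard library (the sample value is the one from the reply above; note that `datetime.fromisoformat()` only accepts the trailing "Z" from Python 3.11 onward, so replacing it with an explicit offset keeps the snippet portable):

```python
from datetime import datetime, timezone

# loadedTime arrives as an ISO 8601 string, e.g. "2023-12-20T13:08:06Z".
# The "Z" suffix means UTC.
loaded_time = "2023-12-20T13:08:06Z"

# datetime.fromisoformat() rejects the "Z" suffix before Python 3.11,
# so swap it for an explicit UTC offset first.
parsed = datetime.fromisoformat(loaded_time.replace("Z", "+00:00"))

print(parsed.year, parsed.tzinfo)  # → 2023 UTC
```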

I'll keep this issue open until we add the documentation - in the meantime, feel free to ask any additional questions, regarding the Actor, its outputs, or the Apify Platform itself. Cheers!

precocious_clouds

4 months ago

Thanks a lot for the response. I have another question: since this Actor is for RAG use cases, are there any ideas about which chunking methods are best suited for it? Since it does not return the HTML tags anymore, would it still be possible to chunk the scraped documents according to the HTML structure, e.g. for documentation sites?

Hello again (and sorry for the delay - we've all been on vacation during the holiday period).

You actually can tell the Actor to store the original (or preprocessed) HTML code in the output - simply check the Output settings > Save HTML option, and you'll find your processed HTML in the html field of the dataset. You can modify the preprocessing logic on the Input tab under HTML Processing - add/remove some CSS selectors to be removed, change the HTML transformer (None leaves the original content - minus the removed elements from the Remove HTML elements (CSS selector)), etc. If this seems too complicated, don't worry - you usually don't have to modify this too much, as we've picked the default values very carefully :)
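To give a rough idea of heading-based chunking over that html field, here is a hedged sketch using only Python's standard-library HTML parser (the chunking strategy is an illustration, not something the Actor provides; the sample HTML is made up):

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Split an HTML document into text chunks, starting a new chunk at each heading."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = [[]]  # list of text fragments per chunk

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.chunks.append([])  # every heading opens a new chunk

    def handle_data(self, data):
        if data.strip():
            self.chunks[-1].append(data.strip())

def chunk_html(html: str) -> list[str]:
    parser = HeadingChunker()
    parser.feed(html)
    return [" ".join(chunk) for chunk in parser.chunks if chunk]

sample = "<h1>Intro</h1><p>First part.</p><h2>Setup</h2><p>Second part.</p>"
print(chunk_html(sample))  # → ['Intro First part.', 'Setup Second part.']
```

A real pipeline would likely also cap chunk sizes and carry the page URL alongside each chunk for source attribution.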

Aside from the html output, you can also tell the Actor to store the Markdown output (see Output settings > Save Markdown toggle). This seems kinda intuitive to me - it's a much more economical format (no tags or bloated syntax, so you're saving context size), but it still contains the important formatting (headings, paragraphs, lists, etc.). I'm not sure about the support for processing Markdown content for RAG, though - but I guess it shouldn't be too hard to implement a simple splitter on your own.
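Since the reply suggests a simple Markdown splitter shouldn't be hard to write, here is one possible sketch (the splitting rule - break at ATX headings, merge sections up to a size budget - is an assumption about what a reasonable RAG splitter would do, not an Apify API):

```python
import re

def split_markdown(md: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks at heading lines, merging short sections."""
    # Break before any line starting with 1-6 '#' characters (ATX headings).
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    chunks, current = [], ""
    for section in sections:
        if not section.strip():
            continue
        # Start a new chunk once the size budget would be exceeded.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current.strip())
            current = section
        else:
            current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks

md = "# Intro\nSome text.\n\n## Setup\nMore text.\n"
print(split_markdown(md, max_chars=20))
```

With the tiny budget above, each heading's section ends up in its own chunk; with a realistic budget, consecutive short sections get merged.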

Either way, sorry for the wait again - both for this message and the dataset documentation. The documentation is still in the works, but I'll let you know as soon as it's out. Thank you!

Developer
Maintained by Apify
Actor metrics
  • 2k monthly users
  • 99.7% runs succeeded
  • 2.9 days response time
  • Created in Mar 2023
  • Modified 13 days ago