Pricing

$30.00 / 1,000 results

AI Website Content Markdown Scraper

Developed by

AI_Builder

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

4.6 (3)

Total users

624

Monthly users

Runs succeeded

>99%

Last modified

a month ago

Automation

📄 Apify Actor: Markdown Website Crawler

🧠 Overview

This Apify Actor crawls a website starting from a list of given URLs, performs a search using a selected search engine to find more relevant URLs within the same domain, scrapes and cleans the main content of the pages, and outputs the result in Markdown format.

It uses Selenium with a headless Chrome browser to accurately render JavaScript-heavy websites and extract readable content. Unwanted scripts, ads, headers, footers, and cookie banners are removed to ensure clean and focused output.

⚙️ Input Schema

The Actor accepts the following input fields:

| Field | Type | Description |
| --- | --- | --- |
| start_urls | Array | Array of objects with a url key. These are the starting points of the crawl. |
| max_depth | Integer | Maximum crawl depth (how far it should follow links from the start page). |
| max_urls | Integer | Maximum number of pages to scrape in total. |
| search_engine | String | (Optional) Which search engine to use to find additional URLs. One of: Google, Bing, or DuckDuckGo. Default: Google. |

Example input (JSON):

    {
      "start_urls": [ { "url": "https://apify.com" } ],
      "max_depth": 1,
      "max_urls": 10,
      "search_engine": "Google"
    }

📤 Output Format

Each result pushed to the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| url | String | The URL of the scraped page. |
| title | String | The page's title (as seen in the browser tab). |
| content | String | The cleaned Markdown version of the main page content. |
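To make the input and output shapes concrete, here is a minimal Python sketch of how the fields above could be parsed and how a result record could be assembled. The names `ActorInput`, `parse_input`, and `make_record` are hypothetical, and treating `max_depth` and `max_urls` as required is an assumption; the listing only states a default for `search_engine`.

```python
from dataclasses import dataclass

# Engines named in the input schema above.
ALLOWED_ENGINES = ("Google", "Bing", "DuckDuckGo")


@dataclass
class ActorInput:
    """Hypothetical container for the Actor's input fields."""
    start_urls: list
    max_depth: int
    max_urls: int
    search_engine: str = "Google"


def parse_input(raw: dict) -> ActorInput:
    """Validate a raw input dict and flatten start_urls to plain strings."""
    urls = [item["url"] for item in raw.get("start_urls", []) if item.get("url")]
    if not urls:
        raise ValueError("start_urls must contain at least one object with a url key")
    engine = raw.get("search_engine") or "Google"  # documented default
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"search_engine must be one of {ALLOWED_ENGINES}")
    # Assumption: max_depth and max_urls are required (KeyError if missing).
    return ActorInput(urls, int(raw["max_depth"]), int(raw["max_urls"]), engine)


def make_record(url: str, title: str, content: str) -> dict:
    """Build one dataset item with the three documented output fields."""
    return {"url": url, "title": title, "content": content}
```

Calling `parse_input` on the example input above would yield a single start URL, `https://apify.com`, with `Google` as the search engine.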

🔍 Functionality

Search Engine Discovery

Uses Google, Bing, or DuckDuckGo to search for the domain.

Extracts links that belong to the same root domain.

Adds those links to the crawl queue.
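The same-root-domain filter described above can be sketched with the standard library alone. This is an illustration, not the Actor's actual code: `root_domain` and `same_domain_links` are hypothetical names, and taking the last two host labels as the root domain is a simplifying assumption that breaks for suffixes such as `co.uk` (a public suffix list would be needed there).

```python
from urllib.parse import urljoin, urlparse


def root_domain(url: str) -> str:
    """Naive root domain: the last two labels of the host name (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def same_domain_links(start_url: str, hrefs: list) -> list:
    """Resolve hrefs against start_url and keep http(s) links on the same root domain."""
    target = root_domain(start_url)
    kept, seen = [], set()
    for href in hrefs:
        absolute = urljoin(start_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if root_domain(absolute) == target and absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

Under this rule, subdomains such as `docs.apify.com` count as part of `apify.com`, while links to other sites are dropped before they reach the crawl queue.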

Crawling & Scraping

Opens each valid page.

Strips unwanted elements: scripts, headers, footers, styles, iframes, videos, cookie banners.

Extracts main, article, section, and div content.

Converts the HTML to Markdown using markdownify.
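As a rough illustration of the stripping step, the sketch below drops the content of unwanted elements using only Python's built-in `html.parser`. It is a stand-in under stated assumptions: the class name `VisibleTextExtractor` and the exact tag set are hypothetical, it returns plain text rather than Markdown, and it does not reproduce the markdownify conversion or the cookie-banner detection the Actor performs.

```python
from html.parser import HTMLParser

# Tags whose content is discarded (taken from the list above; cookie banners
# are not covered because they have no dedicated tag).
UNWANTED_TAGS = {"script", "style", "header", "footer", "iframe", "video"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside any unwanted element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside an unwanted element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in UNWANTED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in UNWANTED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In the Actor itself, the cleaned HTML would be handed to markdownify instead of being flattened to text, so headings, links, and lists survive the conversion.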

Cleaning Markdown

Removes broken or irrelevant Markdown syntax.

Filters out image tags, inline SVGs, tracking text, and known cookie policy messages.

Trims and normalizes white space.
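The cleaning pass above could look roughly like the following regex-based sketch. The patterns and the `COOKIE_HINTS` phrases are illustrative assumptions; the Actor's real filter list is not published in this listing.

```python
import re

# Markdown image syntax: ![alt](src)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Inline SVG markup that survived HTML-to-Markdown conversion.
SVG_RE = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)
# Hypothetical phrases used to spot cookie-policy lines.
COOKIE_HINTS = ("we use cookies", "cookie policy", "accept all cookies")


def clean_markdown(md: str) -> str:
    """Remove images, inline SVGs, and cookie notices, then normalize whitespace."""
    md = SVG_RE.sub("", md)
    md = IMAGE_RE.sub("", md)
    kept = []
    for line in md.splitlines():
        if any(hint in line.lower() for hint in COOKIE_HINTS):
            continue  # drop the whole line carrying a cookie notice
        kept.append(line.rstrip())
    text = "\n".join(kept)
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()
```

Dropping whole lines on a phrase match is deliberately blunt; a page that legitimately discusses cookie policies would lose those lines too.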

🛑 Limitations

The scraper is designed to stay within the same root domain as the starting URL.

Heavy JavaScript pages may still fail if they block bots or detect automation.

Search engine interaction is subject to changes in their HTML structure and may break over time.

🧪 Development Notes

Browser automation is powered by Selenium and ChromeDriver.

Designed for use in Apify's headless actor environment with Chromium.

Requests are tracked using Apify's RequestQueue with deduplication.
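To show how deduplication interacts with the `max_depth` and `max_urls` limits, here is a small in-memory sketch. `CrawlFrontier` is a hypothetical stand-in, not Apify's RequestQueue, and the URL normalization (dropping the fragment and a trailing slash) is an assumption about what counts as a duplicate.

```python
from collections import deque
from urllib.parse import urldefrag


class CrawlFrontier:
    """In-memory breadth-first queue with deduplication and crawl limits."""

    def __init__(self, max_depth: int, max_urls: int):
        self.max_depth = max_depth
        self.max_urls = max_urls
        self._queue = deque()
        self._seen = set()

    @staticmethod
    def _key(url: str) -> str:
        # Assumption: URLs differing only by fragment or trailing slash are duplicates.
        without_fragment, _ = urldefrag(url)
        return without_fragment.rstrip("/")

    def add(self, url: str, depth: int = 0) -> bool:
        """Enqueue url unless it is too deep, already seen, or over the page budget."""
        key = self._key(url)
        if depth > self.max_depth or key in self._seen or len(self._seen) >= self.max_urls:
            return False
        self._seen.add(key)
        self._queue.append((url, depth))
        return True

    def pop(self):
        """Return the next (url, depth) pair, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Links found on a page popped at depth `d` would be added with depth `d + 1`, so the depth check in `add` is what stops the crawl from following links past `max_depth`.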

🧼 Cleanup

The browser is closed gracefully at the end of the run by calling driver.quit().

Requests are marked as handled after processing.
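One common way to guarantee that `driver.quit()` runs even when a page fails is a context manager wrapped around the browser's lifetime. The sketch below is an assumption about how such cleanup could be structured, not the Actor's code; `managed_driver` and `StubDriver` are hypothetical names, and the stub stands in for a real Selenium driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs on success and on exceptions alike


class StubDriver:
    """Stand-in for a browser driver; only records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    with managed_driver(StubDriver) as driver:
        pass  # scraping work would happen here
    print(driver.closed)
```

With Selenium installed, the factory would be something like a function that builds a headless Chrome driver; the cleanup logic stays the same.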

🚀 Usage

This Actor is ideal for:

Archiving or monitoring content changes.

SEO content extraction.

Research on company websites or competitor analysis.

🔥fireScraper AI Prompt Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-prompt-Website-Content-Markdown-Scraper

fireScrape AI is an advanced web scraper built with Crawlee and Puppeteer. It crawls websites, extracts meaningful content, converts it into Markdown, then runs your custom prompt on the extracted text. It is ideal for generating enriched datasets, summaries, or analyses for LLMs and AI pipelines.

mohamed el hadi msaid

5.0

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

3.8

Website to MarkDown (AI-Ready)

mintii/website-to-markdown-ai-ready

Use this to scrape webpages and use for AI Tools and LLMs.

Martin from Mintii

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

5.0

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

1.4K

4.6

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

460

4.4

Ai Ready Web Page To Markdown Converter

mustafa.irshaid.113/ai-ready-web-page-to-markdown-converter

Convert any webpage into structured Markdown and HTML using just a URL. Get the page title, link, and content—perfect for SEO, devs, and AI crawlers. Fast, clean, and ideal for repurposing or analysis. Start turning websites into Markdown instantly.

Mustafa Irshaid

Website extract

mrahil/my-actor

It is a website extractor.

Mohammed Rahil

Dynamic Markdown Scraper

louisdeconinck/dynamic-markdown-scraper

Effortlessly feed LLM AIs with clean Markdown using our advanced web scraper. Seamlessly scrape dynamic, JavaScript-rendered websites while preserving original formatting. Ideal for AI training, documentation, and content migration.

Louis Deconinck

5.0

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.