Website Content to Markdown Scraper for LLM Training

Deprecated

Pricing

$19.99/month + usage

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

Rating

0.0

(0)

Developer

EasyApi

Maintained by Community

Actor stats

Bookmarked

109

Total users

Monthly active users

14 days ago

Last modified

📘 Website Content to Markdown Scraper for LLM Training

This powerful Apify Actor transforms web content into clean, readable Markdown format, perfect for training Large Language Models (LLMs). It's an essential tool for AI researchers, data scientists, and developers working on natural language processing tasks.

✨ Features

🌐 Scrape content from multiple web pages
📝 Convert HTML to clean Markdown format
🧠 Generate high-quality training data for LLMs
🔍 Intelligent main content extraction
🕸️ Customizable crawling depth
🏠 Option to stay within the same domain
🚀 Fast and efficient with concurrent scraping
🕵️‍♂️ Stealth mode to avoid detection

📥 Input

Configure your scraping job with these options:

startUrls: List of URLs to start scraping from
maxDepth: Maximum depth of links to follow (default: 1)
sameDomain: Whether to stay on the same domain while crawling (default: true)
maxResults: Maximum number of pages to scrape (default: 100)
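
The options above can be assembled programmatically. The following is a minimal Python sketch, not part of the Actor itself: `build_input` is a hypothetical helper, and its validation rules (absolute http/https URLs, `maxDepth >= 0`, `maxResults >= 1`) are assumptions rather than documented behavior. Only the option names and defaults come from the list above.

```python
from urllib.parse import urlparse

# Defaults as documented in the option list above.
DEFAULTS = {"maxDepth": 1, "sameDomain": True, "maxResults": 100}


def build_input(start_urls, **overrides):
    """Return an input dict with the documented defaults filled in.

    Hypothetical helper: the validation below is an assumption about
    what a sensible input looks like, not the Actor's own checks.
    """
    if not start_urls:
        raise ValueError("startUrls must contain at least one URL")
    for url in start_urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")

    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")

    # Overrides win over defaults; startUrls is copied into a fresh list.
    run_input = {"startUrls": list(start_urls), **DEFAULTS, **overrides}
    if run_input["maxDepth"] < 0 or run_input["maxResults"] < 1:
        raise ValueError("maxDepth must be >= 0 and maxResults >= 1")
    return run_input


if __name__ == "__main__":
    print(build_input(["https://apify.com"], maxResults=10))
```

Calling `build_input(["https://apify.com"], maxResults=10)` yields the same four keys as the JSON input example further down this page.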

📤 Output

For each scraped page, you'll get:

🔗 URL of the page
📌 Page title
📄 Main content in Markdown format, ideal for LLM training

💡 Use Cases

🤖 LLM Training: Prepare web content as high-quality training data for language models
📚 Content Aggregation: Collect articles and blog posts for research or curation
📊 Web Analysis: Extract text content for sentiment analysis or topic modeling
📑 Documentation: Convert web-based documentation into Markdown for easy integration
🔍 SEO Analysis: Extract and analyze content from competitor websites

🚀 Getting Started

1. Set your input parameters in the Apify console or via API
2. Run the Actor and watch as it transforms web content into Markdown
3. Access your results in JSON format, with Markdown content ready for LLM training or further processing
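
For the "via API" route, a run can be started with a POST request to the Apify API's `/v2/acts/{actorId}/runs` endpoint, with the Actor input as the JSON body. The sketch below only builds the request with Python's standard library and does not send it. `build_run_request` is a hypothetical helper, and the Actor ID and token shown are placeholders to replace with your own values.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an Actor run.

    actor_id uses the API form "username~actor-name"; the value passed
    in the example below is a placeholder, not a real Actor.
    """
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_run_request(
        "username~actor-name",   # placeholder Actor ID
        "YOUR_API_TOKEN",        # placeholder token
        {"startUrls": ["https://apify.com"], "maxDepth": 1,
         "sameDomain": True, "maxResults": 10},
    )
    print(req.get_method(), req.full_url)
    # To actually start the run: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before spending any usage credit.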

🆘 Support

If you encounter any issues or have questions, please reach out through Apify's support channels.

Transform web content into clean, LLM-ready Markdown with just a few clicks! 🚀📝🧠

Input Example

An example input in JSON:

{
    "maxDepth": 1,
    "maxResults": 10,
    "sameDomain": true,
    "startUrls": [
        "https://apify.com"
    ]
}

Output sample

The results are stored in a dataset, which you can always find in the Storage tab. Below is an excerpt, in JSON, of the data you would get with the input parameters above. You can choose the format in which to download your data: JSON, JSONL, Excel spreadsheet, HTML table, CSV, or XML.

[
	{
		"url": "https://apify.com",
		"title": "Apify: Full-stack web scraping and data extraction platform",
		"markdown": "powering the world's top data-driven teams\n\n#### \n\nSimplify scraping with\n\n![Crawlee](/img/icons/crawlee-mark.svg)Crawlee\n\nGive your crawlers an unfair advantage with Crawlee, our popular library for building reliable scrapers in Node.js.\n\n  \n\nimport\n\n{\n\n \n\nPuppeteerCrawler,\n\n \n\nDataset\n\n}\n\n \n\nfrom 'crawlee';\n\nconst crawler = new PuppeteerCrawler(\n\n{\n\n    \n\nasync requestHandler(\n\n{\n\n \n\nrequest, page,\n\n \n\nenqueueLinks\n\n}\n\n) \n\n{\n\nurl: request.url,\n\ntitle: await page.title(),\n\nawait enqueueLinks();\n\nawait crawler.run(\\['https://crawlee.dev'\\]);\n\n![Simplify scraping example](/img/homepage/develop_headstart.svg)\n\n#### Use your favorite libraries\n\nApify works great with both Python and JavaScript, with Playwright, Puppeteer, Selenium, Scrapy, or any other library.\n\n[Start with our code templates](/templates)\n\nfrom scrapy.spiders import CrawlSpider, Rule\n\nclass Scraper(CrawlSpider):\n\nname = \"scraper\"\n\nstart\\_urls = \\[\"https://the-coolest-store.com/\"\\]\n\ndef parse\\_item(self, response):\n\nitem = Item()\n\nitem\\[\"price\"\\] = response.css(\".price\\_color::text\").get()\n\nreturn item\n\n#### Turn your code into an Apify Actor\n\nActors are serverless microapps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go.\n\n[Learn more about Actors](/actors)\n\nimport\n\n{ Actor\n\n}\n\n from 'apify'\n\nawait Actor.init();\n\n![Turn code into Actor example](/img/homepage/deploy_code.svg)\n\n#### Deploy to the cloud\n\nNo config required. 
Use a single CLI command or build directly from GitHub.\n\n[Deploy to Apify](https://console.apify.com/actors/new)\n\n\\> apify push\n\nInfo: Deploying Actor 'computer-scraper' to Apify.\n\nRun: Updated version 0.0 for scraper Actor.\n\nRun: Building Actor scraper\n\nACTOR: Pushing Docker image to repository.\n\nACTOR: Build finished.\n\nActor build detail -> https://console.apify.com/actors#/builds/0.0.2\n\nSuccess: Actor was deployed to Apify cloud and built there.\n\n![Deploy to cloud example](/img/homepage/deploy_cloud.svg)\n\n#### Run your Actors\n\nStart from Apify Console, CLI, via API, or schedule your Actor to start at any time. It’s your call.\n\n    POST/v2/acts/4cT0r1D/runs\n\nRun object\n\n    {\n        \"id\": \"seHnBnyCTfiEnXft\",\n        \"startedAt\": \"2022-12-01T13:42:00.364Z\",\n        \"finishedAt\": null,\n        \"status\": \"RUNNING\",\n        \"options\": {\n            \"build\": \"version-3\",\n            \"timeoutSecs\": 3600,\n            \"memoryMbytes\": 4096\n        },\n        \"defaultKeyValueStoreId\": \"EiGjhZkqseHnBnyC\",\n        \"defaultDatasetId\": \"vVh7jTthEiGjhZkq\",\n        \"defaultRequestQueueId\": \"TfiEnXftvVh7jTth\"\n    }\n\n![Run Actors example](/img/homepage/code_start.svg)\n\n#### Never get blocked\n\nUse our large pool of datacenter and residential proxies. Rely on smart IP address rotation with human-like browser fingerprints.\n\n[Learn more about Apify Proxy](/proxy)\n\nawait Actor.createProxyConfiguration(\n\n{\n\ncountryCode: 'US',\n\ngroups: \\['RESIDENTIAL'\\],\n\n![Never get blocked example](/img/homepage/code_blocked.svg)\n\n#### Store and share crawling results\n\nUse distributed queues of URLs to crawl. Store structured data or binary files. 
Export datasets in CSV, JSON, Excel or other formats.\n\n[Learn more about Apify Storage](/storage)\n\n    GET/v2/datasets/d4T453t1D/items\n\nDataset items\n\n    [\n        {\n            \"title\": \"myPhone 99 Super Max\",\n            \"description\": \"Such phone, max 99, wow!\",\n            \"price\": 999\n        },\n        {\n            \"title\": \"myPad Hyper Thin\",\n            \"description\": \"So thin it's 2D.\",\n            \"price\": 1499\n        }\n    ]\n\n![Store example](/img/homepage/code_store.svg)\n\n#### Monitor performance over time\n\nInspect all Actor runs, their logs, and runtime costs. Listen to events and get custom automated alerts.\n\n![Performance tooltip](/img/homepage/performance-tooltip.svg)\n\n#### Integrations. Everywhere.\n\nConnect to hundreds of apps right away using ready-made integrations, or set up your own with webhooks and our API.\n\n[See all integrations](/integrations)\n\n[\n\nCrawls arbitrary websites using the Chrome browser and extracts data from pages using JavaScript code. The Actor supports both recursive crawling and lists of URLs and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.\n\n](/apify/web-scraper)[\n\nExtract data from hundreds of Google Maps locations and businesses. Get Google Maps data including reviews, images, contact info, opening hours, location, popular times, prices & more. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools.\n\n](/compass/crawler-google-places)[\n\nCrawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using a Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.\n\n](/apify/cheerio-scraper)[\n\nYouTube crawler and video scraper. 
Alternative YouTube API with no limits or quotas. Extract and download channel name, likes, number of views, and number of subscribers.\n\n](/streamers/youtube-scraper)[\n\nCrawls websites with the headless Chrome and Puppeteer library using a provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and list of URLs. Supports login to website.\n\n](/apify/puppeteer-scraper)[\n\nScrape Booking with this hotels scraper and get data about accommodation on Booking.com. You can crawl by keywords or URLs for hotel prices, ratings, addresses, number of reviews, stars. You can also download all that room and hotel data from Booking.com with a few clicks: CSV, JSON, HTML, and Excel\n\n](/voyager/booking-scraper)[\n\nUse this Amazon scraper to collect data based on URL and country from the Amazon website. Extract product information without using the Amazon API, including reviews, prices, descriptions, and Amazon Standard Identification Numbers (ASINs). Download data in various structured formats.\n\n](/junglee/Amazon-crawler)[\n\nScrape tweets from any Twitter user profile. Top Twitter API alternative to scrape Twitter hashtags, threads, replies, followers, images, videos, statistics, and Twitter history. Export scraped data, run the scraper via API, schedule and monitor runs or integrate with other tools.\n\n](/quacker/twitter-scraper)\n\n[Browse 2,000+ Actors](/store)"
	},
	{
		"url": "https://apify.com/actors",
		"title": "Actors - fast and easy scraping in the cloud · Apify",
		"markdown": "Actors are serverless cloud programs that run on the Apify platform and do computing jobs. They are called Actors because, like human actors, they perform actions based on a script.\n\n![](https://cdn-cms.apify.com/actor_with_border_f3508ec394.svg)\n\n### Long-running serverless jobs[](#long-running-serverless-jobs)\n\nApify Actors can perform time-consuming jobs that are longer than the lifespan of a single HTTP transaction.\n\n![](https://cdn-cms.apify.com/Serverless_jobs_54794c9759.svg)\n\n### Publish your Actor[](#publish-your-actor)\n\nJoin hundreds of developers who share their Actors on Apify Store and earn money from coding.\n\n[Go to Apify Store](/store)\n\n![](https://cdn-cms.apify.com/Publish_your_Actor_8e239d4ed0.svg)\n\n### Auto-generated user interface[](#auto-generated-user-interface)\n\nActors can easily define a user interface for their input configuration. Take advantage of lower-level features and settings, or run Actors using our API.\n\n[Learn about Input Schema](https://docs.apify.com/academy/deploying-your-code/input-schema)\n\n![](https://cdn-cms.apify.com/Auto_generated_user_interface_7019512533.svg)\n\n![](https://cdn-cms.apify.com/GH_3bc8f59fdc.svg)\n\nHost code anywhere\n\nEdit your code on our platform, fetch from a Git repository, or push from your machine.\n\n![](https://cdn-cms.apify.com/Docker_support_cf77c5d57b.svg)\n\nDocker support\n\nActors run inside Docker containers on Apify servers. Use a custom Dockerfile.\n\n![](https://cdn-cms.apify.com/Ready_for_scale_925788ef57.svg)\n\nReady for scale\n\nRun as many Actors as you need. The Apify platform provisions the necessary resources.\n\n![](https://cdn-cms.apify.com/Custom_memory_and_CPU_178b540e7d.svg)\n\nCustom memory and CPU\n\nAssign each Actor any RAM volume needed. 
CPU share is allocated automatically.\n\n![](https://cdn-cms.apify.com/Command_line_tool_4d7d12cd5e.svg)\n\nCommand-line tool\n\nDevelop and test your Actors locally, push them to the Apify platform when you're ready.\n\n![](https://cdn-cms.apify.com/Logging_2034cc75a0.svg)\n\nLogging\n\nView and download logs to debug your code and monitor performance on production.\n\n![](https://cdn-cms.apify.com/Full_support_for_Scrapy_2_ceae73e8b1.png)\n\nActorize your Scrapy spiders[](#actorize-your-scrapy-spiders)\n-------------------------------------------------------------\n\nDeploy your Scrapy code to the cloud with just a few commands. Turn your Scrapy projects into Actors, run, schedule, monitor and monetize them.\n\n[Learn more](/run-scrapy-in-cloud)"
	},
    ...
]
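
Items shaped like the sample above (`url`, `title`, `markdown`) can be turned into one-record-per-line JSONL for a training pipeline. The following is a minimal Python sketch under stated assumptions: `to_jsonl` is a hypothetical helper, the output field name `text` and the `min_chars` threshold are illustrative choices, and nothing here is part of the Actor.

```python
import json


def to_jsonl(items, min_chars=1):
    """Serialize dataset items to JSONL, skipping pages without usable Markdown.

    Hypothetical helper: "text" as the output field name and the
    min_chars cutoff are assumptions made for illustration.
    """
    lines = []
    for item in items:
        text = (item.get("markdown") or "").strip()
        if len(text) < min_chars:
            continue  # drop empty or near-empty pages
        record = {
            "url": item.get("url"),
            "title": item.get("title"),
            "text": text,
        }
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"url": "https://apify.com", "title": "Apify", "markdown": "#### Run your Actors\n"},
        {"url": "https://apify.com/empty", "title": "Empty", "markdown": ""},
    ]
    print(to_jsonl(sample))
```

Raising `min_chars` is one simple way to filter out pages where main-content extraction returned only a heading or a stray link.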

📄 Article Content Extractor - Extract clean article content and metadata from any web page with structured output.
🔍 Keyword Density Checker - Analyze webpage content for keyword density and frequency with precise calculations.
🤖 AI-powered Search - Transform search queries into structured AI-powered summaries with references.
📚 arXiv Search Scraper - Extract comprehensive research paper data with detailed metadata.
🔬 Nature Search Results Scraper - Extract research article data with comprehensive metadata.
📚 Medium Posts Search Scraper - Extract detailed article data from Medium's search results.
📚 Substack Posts Scraper - Scrape Substack posts and articles with comprehensive content data.
🌐 URL Metadata Crawler - Extract comprehensive metadata from web pages including meta tags and Open Graph data.
📝 YouTube Description Extractor - Extract complete descriptions from YouTube videos automatically.
📚 WikiHow Article Scraper - Scrape WikiHow articles with detailed step-by-step content.
🔍 Google News Scraper - Collect up to 5000 news articles with flexible search options.
📚 PubMed Search Scraper - Scrape research papers and academic articles with comprehensive metadata.
📚 Goodreads Book Scraper - Extract comprehensive book data and content from Goodreads.
📚 Medium User Posts Scraper - Extract detailed post data from Medium user profiles.
📚 Substack Publications Scraper - Scrape detailed publication information from Substack.

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

EasyApi

207

5.0

(2)

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

205

3.8

(3)

Article Content Extractor 📄

easyapi/article-content-extractor

Extract clean article content, metadata and structured information from any web page. Supports multiple URLs and returns well-formatted JSON with title, description, content, author, publish date and more. 🔍📄

EasyApi

5.0

(1)

ExpiredDomains.net Scraper 🔍

easyapi/expireddomains-net-scraper

Scrape expired domains data from ExpiredDomains.net. Extract detailed domain information including domain status, backlinks, creation date, and availability across multiple TLDs.

EasyApi

119

3.2

(4)

Udemy Course Scraper 📚

easyapi/udemy-course-scraper

Extract detailed course information from Udemy.com with this powerful scraper. Collect comprehensive data about online courses, including ratings, content details, instructors, and pricing. Perfect for market research, content aggregation, and educational platform development.

EasyApi

5.0

(1)

AI Content Topic Generator 🎯

easyapi/ai-content-topic-generator

🚀 Generate trending content ideas and topics based on keywords! Get AI-powered suggestions with SEO benefits analysis and relevance explanations. Perfect for content creators, marketers, and SEO specialists looking to boost engagement and search rankings. ✨

EasyApi

AI Content Detector 🔍

easyapi/ai-content-detector

🤖 Analyze text content to determine if it's AI-generated with high accuracy. Get detailed probability analysis and authoritative conclusions about content authenticity. Perfect for content verification, academic integrity, and digital publishing quality control.

EasyApi

5.0

(1)

AI Text Summarizer 📝

easyapi/ai-text-summarizer

🤖 Transform long texts into concise, meaningful summaries with AI! Support multiple languages, customizable summary lengths, and different summary styles. Perfect for content creators, researchers, and professionals who need quick, accurate text summarization.

EasyApi

🎯 Google Play Keywords Discovery Tool

easyapi/google-play-keywords-discovery-tool

🔍 Discover untapped keywords and search suggestions from Google Play search engines in real-time. Get comprehensive insights into search trends, user intent, and long-tail opportunities to supercharge your keyword research.

EasyApi

5.0

(1)

Backlink Opportunity Finder

easyapi/backlink-opportunity-finder

🔍 Discover high-quality backlink opportunities to boost your domain authority and search rankings. Extract valuable data about potential websites for building authoritative backlinks, including domain metrics, relevance analysis, and estimated SEO impact.

EasyApi

103

4.2

(2)

Pricing Page Analyzer 💰

easyapi/pricing-page-analyzer

🔍 Analyze any pricing page and get actionable insights to optimize conversion rates. Get detailed recommendations on pricing structure, visual hierarchy, feature presentation, and user experience to enhance your pricing strategy.

EasyApi

5.0

(1)