Dynamic Markdown Scraper

louisdeconinck/dynamic-markdown-scraper

2 hours trial then $19.00/month - No credit card required now

Effortlessly feed LLMs with clean Markdown using our advanced web scraper. Seamlessly scrape dynamic, JavaScript-rendered websites while preserving original formatting. Ideal for AI training, documentation, and content migration.

A powerful web scraper that converts difficult-to-scrape web pages into clean, well-formatted Markdown content. This scraper crawls websites and automatically transforms their HTML content into Markdown while maintaining the original structure and formatting. It handles dynamic content and JavaScript-rendered pages with ease.
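The actor's own conversion pipeline is not shown here; as a rough illustration of the idea, a minimal (and far less capable) HTML-to-Markdown pass might look like the sketch below. This is purely illustrative, not the actor's implementation, which handles full DOM parsing and JavaScript rendering.

```typescript
// Minimal illustration of HTML-to-Markdown conversion.
// NOT the actor's implementation -- just a sketch of the idea using
// naive regex replacements; real converters parse the DOM instead.
function htmlToMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gs, "# $1\n\n")
    .replace(/<h2[^>]*>(.*?)<\/h2>/gs, "## $1\n\n")
    .replace(/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gs, "[$2]($1)")
    .replace(/<li[^>]*>(.*?)<\/li>/gs, "- $1\n")
    .replace(/<\/?(p|ul|ol|div)[^>]*>/g, "\n") // block tags become breaks
    .replace(/<[^>]+>/g, "")                   // strip any remaining tags
    .replace(/\n{3,}/g, "\n\n")                // collapse extra blank lines
    .trim();
}

const sample =
  '<h1>Apify Storage</h1><p>See <a href="https://docs.apify.com">the docs</a>.</p>';
console.log(htmlToMarkdown(sample));
// → "# Apify Storage\n\nSee [the docs](https://docs.apify.com)."
```

A production converter would also handle nested lists, tables, code blocks, and malformed HTML, which regex substitution cannot do reliably.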

Features

  • Crawls websites and converts content to Markdown format
  • Maintains proper heading structure, lists, and code blocks
  • Handles dynamic content and JavaScript-rendered pages
  • Handles images and links correctly
  • Restricts crawling to the same domain as the start URL
  • Filters out unwanted content (navigation, footers, etc.)
  • Configurable maximum crawl limits
  • Smart content extraction focusing on main article content
  • Built with TypeScript for better maintainability
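The same-domain restriction above can be expressed with the standard `URL` API. A minimal sketch of such a filter (a hypothetical helper, not the actor's actual code):

```typescript
// Sketch of a same-domain filter for discovered links.
// Hypothetical helper, not part of the actor's public API.
function isSameDomain(link: string, baseUrl: string): boolean {
  try {
    // Resolves relative links against the base URL before comparing hosts.
    return new URL(link, baseUrl).hostname === new URL(baseUrl).hostname;
  } catch {
    return false; // unparseable links are skipped
  }
}

console.log(isSameDomain("https://apify.com/storage", "https://apify.com")); // true
console.log(isSameDomain("https://example.com/page", "https://apify.com")); // false
console.log(isSameDomain("/pricing", "https://apify.com"));                 // true (relative link)
```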

Use Cases

  • Feed website content to LLM AI for further processing
  • Extract content from websites for documentation, blog posts, or technical writing
  • Scrape and convert web pages for use in static sites, blogs, or other projects
  • Automate content migration from legacy systems to modern platforms

Input Configuration

The scraper accepts the following input parameters:

  • startUrls: Array of URLs where the crawler should begin (required)
  • maxRequestsPerCrawl: Maximum number of pages to crawl (optional, defaults to unlimited)

Example input:

{
    "startUrls": [
        { "url": "https://apify.com" }
    ],
    "maxRequestsPerCrawl": 100
}
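When constructing this input programmatically, it can help to validate it before starting a run. A small sketch follows; the `ScraperInput` name and `buildInput` helper are our own, and only the field names come from the input schema above.

```typescript
// Shape of the actor's input (field names from the docs above;
// the interface name itself is just for this sketch).
interface ScraperInput {
  startUrls: { url: string }[];
  maxRequestsPerCrawl?: number; // optional; unlimited when omitted
}

// Hypothetical helper that validates URLs and assembles the input object.
function buildInput(urls: string[], maxRequests?: number): ScraperInput {
  if (urls.length === 0) {
    throw new Error("startUrls is required and must not be empty");
  }
  const input: ScraperInput = {
    // new URL() throws on malformed URLs, catching typos early.
    startUrls: urls.map((url) => ({ url: new URL(url).toString() })),
  };
  if (maxRequests !== undefined) input.maxRequestsPerCrawl = maxRequests;
  return input;
}

console.log(JSON.stringify(buildInput(["https://apify.com"], 100), null, 2));
```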

Output Format

The scraper saves the following data for each processed page:

  • url: The URL of the scraped page
  • title: Page title
  • markdown: Converted Markdown content
  • capturedAt: Timestamp of when the page was scraped

Example output:

{
	"url": "https://apify.com/storage",
	"title": "Storage optimized for scraping · Apify",
	"markdown": "# Apify Storage\n\nScalable and reliable cloud data storage designed for web scraping and automation workloads.\n\n[View documentation](https://docs.apify.com/platform/storage)\n\nBenefits\n\n## Specialized storage from Apify[](https://apify.com/storage#specialized-storage-from-apify)\n\n![Enterprise_grade_reliability_performance_and_scalability_9890860f85.svg](https://cdn-cms.apify.com/Enterprise_grade_reliability_performance_and_scalability_9890860f85.svg)\n\n### Enterprise-grade reliability, performance, and scalability[](https://apify.com/storage#enterprise-grade-reliability-performance-and-scalability)\n\nStore a few records or a few hundred million, with the same low latency and high reliability. We use Amazon Web Services for the underlying data storage, giving you high availability and peace of mind.\n\n### Low-cost storage for web scraping and crawling[](https://apify.com/storage#low-cost-storage-for-web-scraping-and-crawling)\n\nApify provides low-cost storage carefully designed for the large workloads typical of web scraping and crawling operations.\n\n![Low_cost_storage_for_web_scraping_and_crawling_b313f7d95e.svg](https://cdn-cms.apify.com/Low_cost_storage_for_web_scraping_and_crawling_b313f7d95e.svg)\n\n![Easy_to_use_634e40ae76.svg](https://cdn-cms.apify.com/Easy_to_use_634e40ae76.svg)\n\n### Easy to use[](https://apify.com/storage#easy-to-use)\n\nData can be viewed on the web, giving you a quick way to review and share it with other people. The Apify [API](https://docs.apify.com/api/v2) and [SDK](https://docs.apify.com/sdk/js/) makes it easy to integrate our storage into your apps.\n\nFeatures\n\n## We’ve got you covered[](https://apify.com/storage#weve-got-you-covered)\n\n[![Dataset_78dfe4e3a4.svg](https://cdn-cms.apify.com/Dataset_78dfe4e3a4.svg)\n\n**Dataset**  \nStore results from your web scraping, crawling or data processing jobs into Apify datasets and export them to various formats like JSON, CSV, XML, RSS, Excel or HTML.\n\n\n\n\n\n](https://docs.apify.com/platform/storage/dataset)[![Request_queue_9e9602319e.svg](https://cdn-cms.apify.com/Request_queue_9e9602319e.svg)\n\n**Request queue**  \nMaintain a queue of URLs of web pages in order to recursively crawl websites, starting from initial URLs and adding new links as they are found while skipping duplicates.\n\n\n\n\n\n](https://docs.apify.com/platform/storage/request-queue)[![Key_value_store_bc65220b7d.svg](https://cdn-cms.apify.com/Key_value_store_bc65220b7d.svg)\n\n**Key-value store**  \nStore arbitrary data records along with their MIME content type. The records are accessible under a unique name and can be written and read at a rapid rate.\n\n\n\n\n\n](https://docs.apify.com/platform/storage/key-value-store)\n\n## Ready to build your first Actor?[](https://apify.com/storage#ready-to-build-your-first-actor)\n\n[Start developing](https://apify.com/templates)",
	"capturedAt": "2025-01-23T14:01:21.956Z"
}
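Downstream code can type each record and post-process the dataset, for example keeping only pages captured after a cutoff date. In the sketch below, the `ScrapedPage` name and `capturedSince` helper are our own; only the four field names come from the actor's output format.

```typescript
// Shape of one output record (field names from the docs above).
interface ScrapedPage {
  url: string;
  title: string;
  markdown: string;
  capturedAt: string; // ISO-8601 timestamp
}

// Keep only records captured on or after a given cutoff.
function capturedSince(pages: ScrapedPage[], cutoff: Date): ScrapedPage[] {
  return pages.filter((p) => new Date(p.capturedAt) >= cutoff);
}

const pages: ScrapedPage[] = [
  { url: "https://apify.com/storage", title: "Storage", markdown: "# Apify Storage", capturedAt: "2025-01-23T14:01:21.956Z" },
  { url: "https://apify.com/pricing", title: "Pricing", markdown: "# Pricing", capturedAt: "2024-12-01T09:00:00.000Z" },
];
console.log(capturedSince(pages, new Date("2025-01-01")).map((p) => p.url));
// → [ 'https://apify.com/storage' ]
```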
Developer

Maintained by Community

Actor Metrics

  • 1 monthly user
  • 1 star
  • Created in Jan 2025
  • Modified 11 hours ago