AI Website Content Markdown Scraper

Pricing

$30.00 / 1,000 results


Developed by

AI_Builder

Maintained by Community

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

3.8 (3)

24

Total users: 591

Monthly users: 40

Runs succeeded: 98%

Last modified: 2 days ago

📄 Apify Actor: Markdown Website Crawler

🧠 Overview

This Apify Actor crawls a website starting from a list of given URLs, performs a search using a selected search engine to find more relevant URLs within the same domain, scrapes and cleans the main content of the pages, and outputs the result in Markdown format.

It uses Selenium with a headless Chrome browser to accurately render JavaScript-heavy websites and extract readable content. Unwanted scripts, ads, headers, footers, and cookie banners are removed to ensure clean and focused output.

⚙️ Input Schema

The Actor accepts the following input fields:

| Field | Type | Description |
| --- | --- | --- |
| start_urls | Array | Array of objects with a url key. These are the starting points of the crawl. |
| max_depth | Integer | Maximum crawl depth (how far it should follow links from the start page). |
| max_urls | Integer | Maximum number of pages to scrape in total. |
| search_engine | String | (Optional) Which search engine to use to find additional URLs. One of: Google, Bing, or DuckDuckGo. Default: Google |

Example input

    {
      "start_urls": [
        { "url": "https://apify.com" }
      ],
      "max_depth": 1,
      "max_urls": 10,
      "search_engine": "Google"
    }

📤 Output Format

Each result pushed to the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| url | String | The URL of the scraped page. |
| title | String | The page's title (as seen in the browser tab). |
| content | String | The cleaned Markdown version of the main page content. |
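To make the input and output shapes above concrete, here is a minimal Python sketch of how they could be validated and represented. The names `parse_input`, `PageResult`, and `ALLOWED_ENGINES` are hypothetical and not taken from the Actor's code. The sketch also assumes `max_depth` and `max_urls` are required, which the listing does not state.

```python
from dataclasses import dataclass, asdict

# Search engines named in the input schema; "Google" is the documented default.
ALLOWED_ENGINES = {"Google", "Bing", "DuckDuckGo"}


def parse_input(raw: dict) -> dict:
    """Validate the documented input fields (hypothetical helper)."""
    # start_urls is an array of objects, each with a "url" key.
    urls = [item["url"] for item in raw.get("start_urls", []) if item.get("url")]
    if not urls:
        raise ValueError("start_urls must contain at least one object with a 'url' key")
    engine = raw.get("search_engine", "Google")
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"search_engine must be one of {sorted(ALLOWED_ENGINES)}")
    return {
        "start_urls": urls,
        # Assumption: both limits are required integers.
        "max_depth": int(raw["max_depth"]),
        "max_urls": int(raw["max_urls"]),
        "search_engine": engine,
    }


@dataclass
class PageResult:
    """One dataset item, mirroring the documented output fields."""
    url: str
    title: str
    content: str
```

Calling `asdict(PageResult(...))` yields a plain dictionary with the `url`, `title`, and `content` keys, which is the shape a dataset item would need.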

🔍 Functionality

  1. Search Engine Discovery

Uses Google, Bing, or DuckDuckGo to search for the domain.

Extracts links that belong to the same root domain.

Adds those links to the crawl queue.
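The same-root-domain filter in this step could look roughly like the sketch below. It is not the Actor's code: `root_domain` and `same_domain_links` are hypothetical names, and the root domain is approximated as the last two labels of the hostname. That approximation is wrong for suffixes such as `co.uk`; a production version would need a public-suffix list.

```python
from urllib.parse import urljoin, urlparse


def root_domain(url: str) -> str:
    """Naive root domain: the last two labels of the hostname (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def same_domain_links(start_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against start_url and keep unique http(s) links on the same root domain."""
    target = root_domain(start_url)
    kept = []
    seen = set()
    for href in hrefs:
        # Make relative links absolute and drop the fragment part.
        absolute = urljoin(start_url, href).split("#", 1)[0]
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if root_domain(absolute) == target and absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

With this heuristic, subdomains such as `docs.apify.com` count as part of `apify.com`. Whether the Actor treats subdomains that way is not stated in the listing.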

  2. Crawling & Scraping

Opens each valid page.

Strips unwanted elements: scripts, headers, footers, styles, iframes, videos, cookie banners.

Extracts content from the main, article, section, and div elements.

Converts the HTML to Markdown using markdownify.
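The element-stripping part of this step can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not the Actor's implementation, which works on pages rendered by Selenium. `Stripper` and `strip_unwanted` are hypothetical names. Cookie banners are not handled here because they are usually matched by class or id rather than by tag name.

```python
import html
from html.parser import HTMLParser

# Tag names from the list above whose whole subtree is dropped.
UNWANTED = {"script", "style", "header", "footer", "iframe", "video"}


class Stripper(HTMLParser):
    """Re-emit HTML while dropping UNWANTED elements and everything inside them."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_tag = None  # name of the unwanted element currently being skipped
        self.depth = 0        # nesting count of that same tag name

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth += 1
        elif tag in UNWANTED:
            self.skip_tag, self.depth = tag, 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags open no subtree, so they never start a skip.
        if not self.skip_tag and tag not in UNWANTED:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth -= 1
                if self.depth == 0:
                    self.skip_tag = None
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_tag:
            # The parser unescapes character references, so escape again on output.
            self.out.append(html.escape(data, quote=False))


def strip_unwanted(markup: str) -> str:
    parser = Stripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The cleaned HTML returned by `strip_unwanted` would then be handed to markdownify for the HTML-to-Markdown conversion described above.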

  3. Cleaning Markdown

Removes broken or irrelevant Markdown syntax.

Filters out image tags, inline SVGs, tracking text, and known cookie policy messages.

Trims and normalizes white space.
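A rough sketch of this kind of Markdown cleanup is shown below. The regular expressions and the `COOKIE_HINTS` phrases are illustrative assumptions; the Actor's actual filter list is not published in this listing.

```python
import re

# Markdown image syntax: ![alt](src)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Inline SVG markup that survived the HTML-to-Markdown conversion.
SVG = re.compile(r"<svg\b.*?</svg>", re.IGNORECASE | re.DOTALL)
# Hypothetical phrases that mark a line as cookie-policy text.
COOKIE_HINTS = ("we use cookies", "cookie policy", "accept all cookies")


def clean_markdown(text: str) -> str:
    """Drop images, inline SVGs, and cookie notices, then normalize whitespace."""
    text = SVG.sub("", text)
    text = IMAGE.sub("", text)
    lines = []
    for line in text.splitlines():
        line = line.rstrip()  # trim trailing whitespace only, keep indentation
        if any(hint in line.lower() for hint in COOKIE_HINTS):
            continue
        lines.append(line)
    text = "\n".join(lines)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Leading indentation is left untouched so that nested lists and indented code in the Markdown keep their meaning.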

🛑 Limitations

The scraper is designed to stay within the same root domain as the starting URL.

JavaScript-heavy pages may still fail if the site blocks bots or detects automation.

Search engine discovery depends on the HTML structure of the search results pages and may break when that structure changes.

🧪 Development Notes

Browser automation is powered by Selenium and ChromeDriver.

Designed for use in Apify's headless actor environment with Chromium.

Requests are tracked using Apify's RequestQueue with deduplication.
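To show how deduplication interacts with the `max_depth` and `max_urls` limits, here is a plain-Python stand-in for a deduplicating request queue. It does not use the Apify SDK, and `crawl_order` and `get_links` are hypothetical names. Breadth-first order and counting the start pages as depth 0 are both assumptions.

```python
from collections import deque


def crawl_order(start_urls, get_links, max_depth, max_urls):
    """Return the URLs a breadth-first crawl would handle under the given limits."""
    queue = deque()
    seen = set()
    for url in start_urls:
        if url not in seen:  # deduplicate the start URLs too
            seen.add(url)
            queue.append((url, 0))  # assumption: start pages are depth 0

    handled = []
    while queue and len(handled) < max_urls:
        url, depth = queue.popleft()
        handled.append(url)  # the page is processed, then marked as handled
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # each URL is enqueued at most once
                seen.add(link)
                queue.append((link, depth + 1))
    return handled
```

Here `get_links` stands for whatever function returns the same-domain links found on a page.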

🧼 Cleanup

The browser is closed gracefully at the end of the run by calling driver.quit().

Requests are marked as handled after processing.
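One common way to guarantee that the browser is closed even when scraping raises an error is a try/finally wrapper. The sketch below is an assumption about how such cleanup could be structured, not the Actor's code. `run_with_driver`, `make_driver`, and `work` are hypothetical names.

```python
def run_with_driver(make_driver, work):
    """Create a driver, run work(driver), and always call driver.quit() afterwards."""
    driver = make_driver()
    try:
        return work(driver)
    finally:
        # Runs on success and on error, so no browser process is left behind.
        driver.quit()
```

With Selenium, `make_driver` could be a function that returns `webdriver.Chrome(options=options)`, and `work` would contain the crawl loop.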

🚀 Usage

This Actor is ideal for:

Archiving or monitoring content changes.

SEO content extraction.

Research on company websites or competitor analysis.
