This actor aggregates Markdown documentation (.md and .mdx files) from specified GitHub repositories. It traverses the repository's file structure and downloads the matching files, which can then be used for training or fine-tuning models.

Features

Downloads .md and .mdx files from GitHub repositories.
Utilizes KeyValueStore to maintain coherence across concurrent executions.
Keeps the downloaded documentation consistent by skipping links to individual commits and to other branches.
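The branch-scoping feature above can be illustrated with a small sketch. This is not the actor's actual code: the helper names `parse_tree_url` and `in_scope` are hypothetical, and the sketch assumes the usual GitHub URL layout (`/owner/repo/tree/branch/path` and `/owner/repo/blob/branch/path`) with a branch name that contains no slashes.

```python
from __future__ import annotations

from urllib.parse import urlparse


def parse_tree_url(url: str) -> tuple[str, str, str, str]:
    """Split a GitHub tree URL into (owner, repo, branch, path)."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a GitHub tree URL: {url}")
    owner, repo, _, branch, *rest = parts
    return owner, repo, branch, "/".join(rest)


def in_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is a tree/blob page on the same repo
    and branch as start_url, at or below the start directory."""
    owner, repo, branch, base = parse_tree_url(start_url)
    parsed = urlparse(candidate_url)
    if parsed.netloc != "github.com":
        return False
    parts = parsed.path.strip("/").split("/")
    # Commit, pull request, and issue pages have a different third segment.
    if len(parts) < 4 or parts[2] not in ("tree", "blob"):
        return False
    # Assumption: the branch is a single path segment (no slashes).
    if parts[:2] != [owner, repo] or parts[3] != branch:
        return False
    path = "/".join(parts[4:])
    return base == "" or path == base or path.startswith(base + "/")
```

With the start URL from the example input, a link such as `https://github.com/apify/apify-docs/blob/master/README.md` would be kept, while links to a commit page or to a different branch would be skipped.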

Usage

Set startUrl to the root of the documentation directory in the GitHub repository, then run the actor.

Input Parameters

startUrl: The starting URL of the GitHub repository's documentation directory.
globPattern: Glob pattern to match files within the repository. Defaults to '**/*.{md,mdx}'.
maxConcurrency: The maximum number of requests processed concurrently. Default is 1000.
maxRequestsPerMinute: The maximum number of requests made per minute. Default is 600.
minConcurrency: The minimum number of concurrent requests during execution. Default is 5.
desiredConcurrency: The initially desired number of concurrent requests. Default is 15.
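To show what the default globPattern selects, here is a minimal sketch of glob matching with brace expansion. It is an approximation written for illustration: the actor's own matching library is not documented here and may differ in edge cases, and the function names `expand_braces`, `glob_to_regex`, and `matches` are hypothetical.

```python
from __future__ import annotations

import re


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} groups: '*.{md,mdx}' -> ['*.md', '*.mdx']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if not match:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    results: list[str] = []
    for option in match.group(1).split(","):
        results.extend(expand_braces(head + option + tail))
    return results


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a brace-free glob into a compiled regular expression."""
    out: list[str] = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")  # exactly one non-slash character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(path: str, pattern: str = "**/*.{md,mdx}") -> bool:
    """Return True if the repository-relative path matches the glob."""
    return any(glob_to_regex(p).fullmatch(path) for p in expand_braces(pattern))
```

Under these assumptions, the default pattern accepts both `README.md` and `docs/guide/intro.mdx`, while a narrower pattern such as `**/*.mdx` accepts only the latter.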

Output

The actor outputs each Markdown file's content into the default dataset. Each entry contains the file name and content.
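As one way of using the output for training, dataset entries could be converted to JSON Lines. This is only a sketch: the field names `fileName` and `content` are assumptions (the description above says only that each entry contains the file name and content), so they are passed as parameters and should be adjusted to the real dataset schema.

```python
from __future__ import annotations

import json


def to_jsonl(
    items: list[dict],
    name_key: str = "fileName",   # assumed field name, adjust as needed
    content_key: str = "content",  # assumed field name, adjust as needed
) -> str:
    """Convert dataset entries into JSON Lines, one document per line."""
    lines: list[str] = []
    for item in items:
        text = (item.get(content_key) or "").strip()
        if not text:
            continue  # skip entries with no usable content
        record = {"source": item.get(name_key), "text": text}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

The items themselves would come from the run's default dataset, for example via a dataset export or an API client.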

Example Input

{
  "startUrl": "https://github.com/apify/apify-docs/tree/master",
  "globPattern": "**/*.mdx",
  "crawlerOptions": {
    "maxConcurrency": 10
  }
}

Support

For support, contact info@fornace.it.

