GithubScraper

10 minutes trial then $20.00/month - No credit card required now

This Actor is under maintenance.

fornace/githubscraper

Automatically scrapes and downloads Markdown documentation from GitHub repositories, for easy AI finetuning.

GitHub Markdown Documentation Downloader

This Actor collects the .md and .mdx Markdown documentation files from the GitHub repositories you specify. It walks the repository's file tree and downloads every matching file, producing a corpus that can be used for training or fine-tuning models.

Features

  • Downloads .md and .mdx files from GitHub repositories.
  • Uses a KeyValueStore to keep state consistent across concurrent runs.
  • Keeps the documentation consistent by skipping links to individual commits and to other branches.
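
The branch rule in the last feature can be pictured as a URL filter that only accepts links on the same repository and branch as the start URL. The following is a minimal sketch of that idea, not the Actor's actual code: the function name `same_branch` is hypothetical, and it assumes GitHub URLs of the form `owner/repo/tree/<branch>/...` with a branch name that contains no slashes.

```python
from urllib.parse import urlparse


def same_branch(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is a tree/blob link on the same
    repository and branch as start_url (hypothetical helper)."""
    start = urlparse(start_url).path.strip("/").split("/")
    cand = urlparse(candidate_url).path.strip("/").split("/")
    # Expected path shape: owner/repo/<kind>/<branch>/...
    if len(start) < 4 or len(cand) < 4:
        return False
    same_repo = start[:2] == cand[:2]
    # Only directory (tree) and file (blob) views; commit links are rejected.
    kind_ok = cand[2] in ("tree", "blob")
    same_ref = start[3] == cand[3]
    return same_repo and kind_ok and same_ref
```

With this filter, a link such as `.../apify-docs/commit/<sha>` or `.../apify-docs/tree/<other-branch>` is never queued, so every downloaded file comes from one snapshot of the docs.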

Usage

Set startUrl to the root of the docs folder in the GitHub repository, then run the Actor.

Input Parameters

  • startUrl: The starting URL of the GitHub repository's documentation directory.
  • globPattern: Glob pattern to match files within the repository. Defaults to '**/*.{md,mdx}'.
  • maxConcurrency: The maximum number of requests processed concurrently. Default is 1000.
  • maxRequestsPerMinute: The maximum number of requests made per minute. Default is 600.
  • minConcurrency: The minimum number of concurrent requests during execution. Default is 5.
  • desiredConcurrency: The initially desired number of concurrent requests. Default is 15.
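
To illustrate what the default globPattern selects, here is a small sketch of brace expansion plus glob matching on repository-relative paths. It is an approximation written for illustration only: the Actor's real glob library is not stated here, the helper names are hypothetical, and only `{a,b}`, `*`, and a `**/` prefix segment are handled.

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand the first {a,b,...} group recursively into plain patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[: m.start()], pattern[m.end():]
    out = []
    for alt in m.group(1).split(","):
        out.extend(expand_braces(head + alt + tail))
    return out


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob into a regex anchored at both ends."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def matches(path: str, pattern: str = "**/*.{md,mdx}") -> bool:
    """Return True if path matches any brace-expanded form of pattern."""
    return any(glob_to_regex(p).match(path) for p in expand_braces(pattern))
```

Under these assumptions, `matches("docs/guide/intro.mdx")` and `matches("README.md")` are both true with the default pattern, while `matches("src/main.py")` is false.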

Output

The actor outputs each Markdown file's content into the default dataset. Each entry contains the file name and content.
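
Since the stated purpose is fine-tuning, dataset entries can be flattened into JSON Lines. The sketch below assumes each entry is a dict and that the file name and content live under the keys `fileName` and `content`; those key names are guesses, so check an actual dataset item and adjust the arguments.

```python
import json


def to_jsonl(items, name_key="fileName", content_key="content"):
    """Convert dataset entries into a JSON Lines string.

    name_key and content_key are assumed field names, not confirmed
    by the Actor's documentation.
    """
    lines = []
    for item in items:
        text = item.get(content_key, "")
        if not text.strip():
            continue  # skip empty files
        record = {"source": item.get(name_key, ""), "text": text}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Each output line is one standalone JSON object, which is the shape most fine-tuning pipelines accept as input.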

Example Input

{
  "startUrl": "https://github.com/apify/apify-docs/tree/master",
  "globPattern": "**/*.mdx",
  "crawlerOptions": {
    "maxConcurrency": 10
  }
}
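
The same input can be sent programmatically. The sketch below assumes the third-party `apify-client` Python package is installed and that you have an Apify API token; the helper names `build_run_input` and `run_actor` are hypothetical, and the input layout simply mirrors the example above.

```python
ACTOR_ID = "fornace/githubscraper"


def build_run_input(start_url, glob_pattern="**/*.{md,mdx}", max_concurrency=10):
    """Build an input object shaped like the example input."""
    return {
        "startUrl": start_url,
        "globPattern": glob_pattern,
        "crawlerOptions": {"maxConcurrency": max_concurrency},
    }


def run_actor(run_input, token):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so the input helper works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        return []  # no run information came back
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be `run_actor(build_run_input("https://github.com/apify/apify-docs/tree/master", "**/*.mdx"), token)`, with `token` read from your own environment or secrets store.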

Support

For support, contact info@fornace.it.

Developer
Maintained by Community
Actor metrics
  • 1 monthly user
  • 2 stars
  • Created in Dec 2023
  • Modified 7 months ago