GitHub Repository Scraper

Developed by VulnV · Maintained by Community

Scrape and extract GitHub repository data, metadata, statistics, stars, forks, issues, and project information from multiple repositories at once.

Rating: 5.0 (1 review)

Pricing: $10.00/month + usage

Last modified: 6 days ago

GitHub Repository Scraper - Extract Repository Data at Scale

Overview

The GitHub Repository Scraper is a powerful Apify Actor designed to extract comprehensive data from GitHub repositories efficiently. Perfect for competitive analysis, market research, developer insights, or building repository databases — this scraper provides detailed information about repositories, statistics, and project metadata.

✅ Bulk URL processing | ✅ Comprehensive repository data | ✅ Statistics extraction | ✅ Metadata analysis | ✅ Concurrent processing


Complete Repository Data Extraction

  • Basic Information — Repository name, description, owner, creation date
  • Statistics — Stars, forks, watchers, usage metrics
  • Technical Details — Programming languages, file counts, commit information
  • Project Metadata — Topics, license information, default branch
  • Enhanced Repository Data — GitHub IDs, clone URLs, file listings, branch info
  • Owner Information — Detailed owner profiles with avatars and organization status
  • Repository Structure — File counts, directory listings, README information
  • Access URLs — Multiple clone formats (HTTPS, SSH, GitHub CLI), download links

Key Features

  • Bulk Processing — Process multiple GitHub repository URLs in one run
  • Smart URL Parsing — Automatically extracts repository paths from full GitHub URLs
  • Proxy Support — Built-in Apify proxy integration for reliable scraping
  • Error Handling — Robust error handling with detailed status reporting
  • Clean JSON Output — Structured, ready-to-use data format
  • Concurrent Processing — Configurable concurrency for optimal performance
  • Format Flexibility — Accepts various URL formats and automatically normalizes them

🧾 Input Configuration

Submit an array of GitHub repository URLs via the input schema:

{
  "urls": [
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react",
    "https://github.com/nodejs/node",
    "https://github.com/torvalds/linux"
  ],
  "maxConcurrency": 5,
  "includeNotFound": false,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Input Parameters

  1. URLs (required):

    • Array of GitHub repository URLs to scrape
    • Supported formats: https://github.com/owner/repo, github.com/owner/repo
    • Invalid URLs will be automatically filtered out with warnings
  2. Max Concurrency (optional):

    • Number of concurrent requests for scraping (1-20)
    • Default: 5
    • Higher values mean faster processing but a greater chance of rate limiting
  3. Include Not Found (optional):

    • Whether to include repositories that return 404 (not found) in the results
    • Default: false
    • When enabled, includes error information for non-existent repositories
  4. Proxy Configuration (recommended):

    • Configure Apify proxy settings to avoid rate limiting
    • Recommended for bulk scraping operations
    • Format:
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    • Available proxy groups: RESIDENTIAL, DATACENTER, GOOGLE_SERP
    • Use RESIDENTIAL for best reliability when scraping GitHub
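A caller can pre-assemble and sanity-check the input object before starting a run. This is a minimal sketch; `build_input` is a hypothetical helper, and the bounds simply mirror the parameter descriptions above (the Actor itself also filters invalid URLs and logs warnings).

```python
def build_input(urls, max_concurrency=5, include_not_found=False, use_proxy=True):
    """Assemble an input object matching the schema described above."""
    valid = [u for u in urls if "github.com/" in u]  # crude pre-filter
    run_input = {
        "urls": valid,
        "maxConcurrency": max(1, min(20, max_concurrency)),  # schema range: 1-20
        "includeNotFound": include_not_found,
    }
    if use_proxy:
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        }
    return run_input
```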

Proxy Configuration Examples

For small-scale scraping (< 100 repositories):

"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["DATACENTER"]
}

For large-scale or production scraping (recommended):

"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}

No proxy (not recommended for bulk operations):

// Omit proxyConfiguration entirely - may result in rate limiting
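The rule of thumb above (DATACENTER under roughly 100 repositories, RESIDENTIAL for larger or production runs) can be encoded in a small helper when building inputs programmatically. The function names here are illustrative, not part of the Actor:

```python
def pick_proxy_group(repo_count: int) -> str:
    """DATACENTER for small batches, RESIDENTIAL for large-scale runs."""
    return "DATACENTER" if repo_count < 100 else "RESIDENTIAL"

def proxy_configuration(repo_count: int) -> dict:
    """Build the proxyConfiguration object shown in the examples above."""
    return {
        "useApifyProxy": True,
        "apifyProxyGroups": [pick_proxy_group(repo_count)],
    }
```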

📤 Output Format

Each GitHub repository returns comprehensive structured data including enhanced metadata extracted from GitHub's embedded data:

{
  "url": "https://github.com/microsoft/vscode",
  "repoPath": "microsoft/vscode",
  "success": true,
  "data": {
    "url": "https://github.com/microsoft/vscode",
    "type": "repo",
    "description": "Visual Studio Code",
    "website": "https://code.visualstudio.com",
    "forkedfrom": null,
    "tags": ["editor", "typescript", "electron", "ide"],
    "usedby": 250000,
    "watchers": 3200,
    "stars": 162000,
    "forks": 28500,
    "langs": [
      {"name": "TypeScript", "perc": "93.2%"},
      {"name": "JavaScript", "perc": "4.1%"},
      {"name": "CSS", "perc": "1.5%"}
    ],
    // Enhanced data from GitHub's embedded JSON
    "id": 41881900,
    "name": "vscode",
    "full_name": "microsoft/vscode",
    "owner": "microsoft",
    "default_branch": "main",
    "is_fork": false,
    "is_empty": false,
    "is_private": false,
    "is_org_owned": true,
    "created_at": "2015-09-03T20:23:30.000Z",
    "clone_url": "https://github.com/microsoft/vscode.git",
    "ssh_url": "git@github.com:microsoft/vscode.git",
    "api_url": "https://api.github.com/repos/microsoft/vscode",
    // Owner information
    "owner_info": {
      "login": "microsoft",
      "type": "Organization",
      "url": "https://github.com/microsoft",
      "avatar_url": "https://avatars.githubusercontent.com/u/6154722?v=4"
    },
    // File and repository structure
    "file_count": 15420,
    "files": [
      {"name": "README.md", "path": "README.md", "type": "file"},
      {"name": "package.json", "path": "package.json", "type": "file"},
      {"name": "src", "path": "src", "type": "directory"}
    ],
    // Clone and download URLs
    "clone_urls": {
      "https": "https://github.com/microsoft/vscode.git",
      "ssh": "git@github.com:microsoft/vscode.git",
      "github_cli": "gh repo clone microsoft/vscode"
    },
    "download_url": "/microsoft/vscode/archive/refs/heads/main.zip",
    // Branch and commit information
    "ref_info": {
      "name": "main",
      "type": "branch",
      "current_oid": "585acf48f88e399989d54f001029424b2b7c358a",
      "can_edit": false
    },
    "commit_count": "185,234",
    // README information
    "readme_info": {
      "displayName": "README.md",
      "repoName": "vscode",
      "refName": "main",
      "path": "README.md",
      "loaded": true
    },
    // Metadata
    "enriched_at": "2024-12-29T15:30:45.123Z",
    "data_source": "github_scraper_enhanced"
  }
}
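Once a run finishes, each dataset item has the shape shown above. A small post-processing sketch, assuming the items have already been downloaded as a list of dicts (field names are taken from the sample output; `summarize` is a hypothetical helper):

```python
def summarize(items):
    """Aggregate stars/forks across successful results and collect failures."""
    ok = [
        it["data"]
        for it in items
        if it.get("success") and "stars" in it.get("data", {})
    ]
    failed = [it["repoPath"] for it in items if not it.get("success")]
    return {
        "repos": len(ok),
        "total_stars": sum(d["stars"] for d in ok),
        "total_forks": sum(d["forks"] for d in ok),
        "failed": failed,
    }
```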

Error Handling

Failed repositories return structured error information:

{
  "url": "https://github.com/invalid/repo",
  "repoPath": "invalid/repo",
  "success": false,
  "error": "Repository not found or private"
}

When includeNotFound is enabled, 404 repositories return structured data:

{
  "url": "https://github.com/nonexistent/repo",
  "repoPath": "nonexistent/repo",
  "success": true,
  "data": {
    "exists": false,
    "error": "Repository not found",
    "statusCode": 404
  }
}
}

Common Error Cases:

  • Repository not found or private — Repository doesn't exist or is private
  • Network error — Connection issues or scraping errors
  • Invalid URLs are filtered out before processing with warning logs
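Note that with includeNotFound enabled, a 404 arrives as a success:true item whose data.exists is false, so downstream code should check both flags. A hypothetical classifier covering both error shapes shown above:

```python
def classify(item):
    """Return 'ok', 'not_found', or 'error' for a dataset item."""
    if not item.get("success"):
        return "error"  # hard failure: {"success": false, "error": ...}
    data = item.get("data", {})
    if data.get("exists") is False or data.get("statusCode") == 404:
        return "not_found"  # includeNotFound 404 record
    return "ok"
```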

💼 Common Use Cases

Competitive Analysis & Market Research

  • Analyze competitor repositories and project activity
  • Track technology trends through repository statistics
  • Research popular libraries and frameworks in specific domains
  • Monitor open source project adoption rates

Developer & Technology Research

  • Study programming language usage patterns
  • Analyze repository structures and best practices
  • Research active open source projects in specific technologies
  • Track development activity and contribution patterns

Portfolio & Investment Analysis

  • Research technology companies and their open source contributions
  • Analyze developer productivity and project health metrics
  • Track repository growth and community engagement
  • Identify trending projects and technologies

Academic & Educational Research

  • Study software development patterns and practices
  • Analyze open source community dynamics
  • Research programming language evolution
  • Track educational resource repositories

📊 Output & Export Options

Dataset Storage

  • All extracted data stored in Apify dataset
  • Each repository becomes one dataset item
  • Status tracking for successful and failed extractions

Export Formats

  • JSON — Raw structured data for API integration
  • CSV — Spreadsheet-compatible format for analysis
  • Excel — Formatted spreadsheet with repository data
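Besides the built-in exports, a flat CSV can also be produced locally from downloaded JSON items. A sketch using only the standard library; the column selection is an example, with field names following the sample output above:

```python
import csv
import io

def items_to_csv(items):
    """Flatten successful dataset items into CSV rows."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["repoPath", "stars", "forks", "watchers"])
    writer.writeheader()
    for it in items:
        if not it.get("success"):
            continue  # skip failed extractions
        d = it.get("data", {})
        writer.writerow({
            "repoPath": it.get("repoPath", ""),
            "stars": d.get("stars", ""),
            "forks": d.get("forks", ""),
            "watchers": d.get("watchers", ""),
        })
    return buf.getvalue()
```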

Data Processing

  • Clean, validated URLs
  • Structured error reporting
  • Comprehensive logging for troubleshooting

⚡ Quick Start Guide

  1. Configure Input:

    • Add GitHub repository URLs to the urls array
    • Set desired maxConcurrency (recommended: 5-10)
    • Configure proxyConfiguration with useApifyProxy: true and appropriate proxy groups for reliable scraping
  2. Run the Actor:

    • Execute through Apify Console or API
    • Monitor progress through real-time logs
    • Review extracted data in the dataset
  3. Export Results:

    • Download data in your preferred format
    • Integrate with your existing tools and workflows
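As a usage fragment, one way to run an Actor from the command line is Apify's standard run-sync endpoint; the placeholders below (`<ACTOR_ID>`, `<TOKEN>`) must be replaced with this Actor's ID and your own API token:

```shell
# Start a run and return the dataset items when it finishes.
curl -X POST \
  "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=<TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://github.com/microsoft/vscode"],
    "maxConcurrency": 5,
    "proxyConfiguration": {"useApifyProxy": true}
  }'
```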

🆘 Support & Feedback

For questions, feature requests, or technical support:

  • Visit the Apify Community Forum
  • Contact us through the Apify platform
  • Submit issues for improvements and bug reports

🌟 Explore More Actors

Need more scraping solutions? Discover additional Actors for comprehensive web automation and data extraction in the Apify Store.

📧 For inquiries or custom development, reach out at apify@vulnv.com.