GitHub Repository Scraper - Extract Repository Data at Scale
Overview
The GitHub Repository Scraper is a powerful Apify Actor designed to extract comprehensive data from GitHub repositories efficiently. Perfect for competitive analysis, market research, developer insights, or building repository databases — this scraper provides detailed information about repositories, statistics, and project metadata.
✅ Bulk URL processing | ✅ Comprehensive repository data | ✅ Statistics extraction | ✅ Metadata analysis | ✅ Concurrent processing
Complete Repository Data Extraction
- Basic Information — Repository name, description, owner, creation date
- Statistics — Stars, forks, watchers, usage metrics
- Technical Details — Programming languages, file counts, commit information
- Project Metadata — Topics, license information, default branch
- Enhanced Repository Data — GitHub IDs, clone URLs, file listings, branch info
- Owner Information — Detailed owner profiles with avatars and organization status
- Repository Structure — File counts, directory listings, README information
- Access URLs — Multiple clone formats (HTTPS, SSH, GitHub CLI), download links
Key Features
- Bulk Processing — Process multiple GitHub repository URLs in one run
- Smart URL Parsing — Automatically extracts repository paths from full GitHub URLs
- Proxy Support — Built-in Apify proxy integration for reliable scraping
- Error Handling — Robust error handling with detailed status reporting
- Clean JSON Output — Structured, ready-to-use data format
- Concurrent Processing — Configurable concurrency for optimal performance
- Format Flexibility — Accepts various URL formats and automatically normalizes them
🧾 Input Configuration
Submit an array of GitHub repository URLs via the input schema:
{"urls": ["https://github.com/microsoft/vscode","https://github.com/facebook/react","https://github.com/nodejs/node","https://github.com/torvalds/linux"],"maxConcurrency": 5,"includeNotFound": false,"proxyConfiguration": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Input Parameters
- URLs (required):
  - Array of GitHub repository URLs to scrape
  - Supported formats: https://github.com/owner/repo, github.com/owner/repo
  - Invalid URLs will be automatically filtered out with warnings (see the validation sketch after this list)
- Max Concurrency (optional):
  - Number of concurrent requests for scraping (1-20)
  - Default: 5
  - Higher values = faster processing but may increase the chance of rate limiting
- Include Not Found (optional):
  - Whether to include repositories that return 404 (not found) in the results
  - Default: false
  - When enabled, includes error information for non-existent repositories
- Proxy Configuration (recommended):
  - Configure Apify proxy settings to avoid rate limiting
  - Recommended for bulk scraping operations
  - Format: "proxyConfiguration": {"useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"]}
  - Available proxy groups: RESIDENTIAL, DATACENTER, GOOGLE_SERP
  - Use RESIDENTIAL for best reliability when scraping GitHub
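The Actor normalizes and filters URLs itself, but if you want to pre-validate a list before submitting it, a simple client-side check like the following works. This is an illustrative sketch, not the Actor's internal parser; the regular expression and helper name are assumptions.

import re

# Matches "https://github.com/owner/repo" or "github.com/owner/repo",
# optionally with a trailing slash or .git suffix (illustrative pattern only).
GITHUB_REPO_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/([^/\s]+)/([^/\s]+?)(?:\.git)?/?$"
)

def to_repo_path(url: str):
    """Return 'owner/repo' for a valid GitHub repository URL, else None."""
    match = GITHUB_REPO_RE.match(url.strip())
    if not match:
        return None
    owner, repo = match.groups()
    return f"{owner}/{repo}"

urls = [
    "https://github.com/microsoft/vscode",
    "github.com/facebook/react",
    "not-a-github-url",
]

valid = [u for u in urls if to_repo_path(u)]
print(valid)  # ['https://github.com/microsoft/vscode', 'github.com/facebook/react']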
Proxy Configuration Examples
For small-scale scraping (< 100 repositories):
"proxyConfiguration": {"useApifyProxy": true, "apifyProxyGroups": ["DATACENTER"]}
For large-scale or production scraping (recommended):
"proxyConfiguration": {"useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"]}
No proxy (not recommended for bulk operations): omit proxyConfiguration entirely, which may result in rate limiting.
📤 Output Format
Each GitHub repository returns comprehensive structured data including enhanced metadata extracted from GitHub's embedded data:
{"url": "https://github.com/microsoft/vscode","repoPath": "microsoft/vscode","success": true,"data": {"url": "https://github.com/microsoft/vscode","type": "repo","description": "Visual Studio Code","website": "https://code.visualstudio.com","forkedfrom": null,"tags": ["editor", "typescript", "electron", "ide"],"usedby": 250000,"watchers": 3200,"stars": 162000,"forks": 28500,"langs": [{"name": "TypeScript", "perc": "93.2%"},{"name": "JavaScript", "perc": "4.1%"},{"name": "CSS", "perc": "1.5%"}],// Enhanced data from GitHub's embedded JSON"id": 41881900,"name": "vscode","full_name": "microsoft/vscode","owner": "microsoft","default_branch": "main","is_fork": false,"is_empty": false,"is_private": false,"is_org_owned": true,"created_at": "2015-09-03T20:23:30.000Z","clone_url": "https://github.com/microsoft/vscode.git","ssh_url": "git@github.com:microsoft/vscode.git","api_url": "https://api.github.com/repos/microsoft/vscode",// Owner information"owner_info": {"login": "microsoft","type": "Organization","url": "https://github.com/microsoft","avatar_url": "https://avatars.githubusercontent.com/u/6154722?v=4"},// File and repository structure"file_count": 15420,"files": [{"name": "README.md", "path": "README.md", "type": "file"},{"name": "package.json", "path": "package.json", "type": "file"},{"name": "src", "path": "src", "type": "directory"}],// Clone and download URLs"clone_urls": {"https": "https://github.com/microsoft/vscode.git","ssh": "git@github.com:microsoft/vscode.git","github_cli": "gh repo clone microsoft/vscode"},"download_url": "/microsoft/vscode/archive/refs/heads/main.zip",// Branch and commit information"ref_info": {"name": "main","type": "branch","current_oid": "585acf48f88e399989d54f001029424b2b7c358a","can_edit": false},"commit_count": "185,234",// README information"readme_info": {"displayName": "README.md","repoName": "vscode","refName": "main","path": "README.md","loaded": true},// Metadata"enriched_at": "2024-12-29T15:30:45.123Z","data_source": "github_scraper_enhanced"}}
Error Handling
Failed repositories return structured error information:
{"url": "https://github.com/invalid/repo","repoPath": "invalid/repo","success": false,"error": "Repository not found or private"}
When includeNotFound is enabled, 404 repositories return structured data:
{"url": "https://github.com/nonexistent/repo","repoPath": "nonexistent/repo","success": true,"data": {"exists": false,"error": "Repository not found","statusCode": 404}}
Common Error Cases:
- Repository not found or private — the repository doesn't exist or is private
- Network error — connection issues or scraping errors
- Invalid URLs are filtered out before processing, with warning logs
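For bulk runs it is often useful to separate successes from failures before further processing. A minimal sketch, assuming items holds dataset items shaped as shown above:

def split_results(items):
    """Split dataset items into successful scrapes and failures."""
    successes, failures = [], []
    for item in items:
        data = item.get("data") or {}
        # Items with success=false, or success=true but exists=false
        # (includeNotFound mode), are treated as failures here.
        if item.get("success") and data.get("exists", True):
            successes.append(item)
        else:
            failures.append(item)
    return successes, failures

# Example: collect failed repository URLs for a retry run.
# successes, failures = split_results(items)
# retry_urls = [f["url"] for f in failures]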
💼 Common Use Cases
Competitive Analysis & Market Research
- Analyze competitor repositories and project activity
- Track technology trends through repository statistics
- Research popular libraries and frameworks in specific domains
- Monitor open source project adoption rates
Developer & Technology Research
- Study programming language usage patterns
- Analyze repository structures and best practices
- Research active open source projects in specific technologies
- Track development activity and contribution patterns
Portfolio & Investment Analysis
- Research technology companies and their open source contributions
- Analyze developer productivity and project health metrics
- Track repository growth and community engagement
- Identify trending projects and technologies
Academic & Educational Research
- Study software development patterns and practices
- Analyze open source community dynamics
- Research programming language evolution
- Track educational resource repositories
📊 Output & Export Options
Dataset Storage
- All extracted data stored in Apify dataset
- Each repository becomes one dataset item
- Status tracking for successful and failed extractions
Export Formats
- JSON — Raw structured data for API integration
- CSV — Spreadsheet-compatible format for analysis
- Excel — Formatted spreadsheet with repository data
Data Processing
- Clean, validated URLs
- Structured error reporting
- Comprehensive logging for troubleshooting
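Exports can be downloaded from the Apify Console, or fetched directly from the dataset items API endpoint by passing a format parameter. A minimal sketch using the requests library, with placeholder dataset ID and token:

import requests

DATASET_ID = "YOUR_DATASET_ID"    # placeholder
APIFY_TOKEN = "YOUR_APIFY_TOKEN"  # placeholder

# The dataset items endpoint can return different formats (e.g. json, csv, xlsx).
response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "csv", "token": APIFY_TOKEN},
    timeout=60,
)
response.raise_for_status()

with open("github_repos.csv", "wb") as f:
    f.write(response.content)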
⚡ Quick Start Guide
1. Configure Input:
   - Add GitHub repository URLs to the urls array
   - Set the desired maxConcurrency (recommended: 5-10)
   - Configure proxyConfiguration with useApifyProxy: true and appropriate proxy groups for reliable scraping
2. Run the Actor:
   - Execute through the Apify Console or API
   - Monitor progress through real-time logs
   - Review extracted data in the dataset
3. Export Results:
   - Download data in your preferred format
   - Integrate with your existing tools and workflows
🆘 Support & Feedback
For questions, feature requests, or technical support:
- Visit the Apify Community Forum
- Contact us through the Apify platform
- Submit issues for improvements and bug reports
🌟 Explore More Actors
✨ Need more scraping solutions? Discover additional Actors on Apify for comprehensive web automation and data extraction, and explore our full range of tools on the Apify platform.
📧 For inquiries or custom development, reach out at apify@vulnv.com.