GitHub Repository Intelligence - API-Based Data Scraper
Pricing
$20.00 / 1,000 results
Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.
Developer: ben
GitHub Repository Intelligence - API-Based Data & Documentation Scraper
Extract comprehensive repository data from GitHub using the official REST API.
Fetch repository metadata, README content, documentation, topics, language statistics, and more. Perfect for AI/LLM training data, developer research, competitive analysis, and tech stack discovery. Legal, stable, and fast API-based extraction.
Features
✅ Dual Scraping Modes
- Search Mode: Find repositories by keywords, language, stars
- Direct Mode: Fetch specific repositories by URL
✅ Comprehensive Data Extraction
- Repository metadata (stars, forks, watchers, issues)
- README content (perfect for LLM training)
- Programming language statistics
- Repository topics/tags
- License information
- Creation/update timestamps
- Owner information
✅ Official GitHub API
- Uses GitHub REST API v3 (100% legal)
- No browser automation required
- Stable and reliable
- Optional authentication for higher rate limits
✅ Built for AI & Research
- README extraction for LLM training
- Structured JSON output
- Rich metadata for analysis
- Topic and language classification
- Dataset export (CSV, JSON, Excel)
Use Cases
🤖 AI/LLM Training Data
- Extract README files for AI model training
- Gather documentation for vector databases
- Build RAG (Retrieval-Augmented Generation) pipelines
- Create code-to-text datasets
🔍 Developer Research
- Discover trending repositories
- Analyze tech stacks and tools
- Monitor open-source ecosystem
- Track language adoption trends
💼 Business Intelligence
- Competitive analysis
- Technology trend spotting
- Developer tool discovery
- Market research for dev tools
📊 Academic Research
- Software engineering studies
- Open-source collaboration analysis
- Programming language evolution
- Developer ecosystem research
Input
{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 50,
  "includeReadme": true,
  "includeTopics": true,
  "includeLanguages": true,
  "githubToken": "ghp_xxxxxxxxxxxx"
}
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | string | search | Scraping mode: search (find repos by query) or direct (fetch specific URLs) |
| searchQuery | string | "" | Search query (e.g., language:python stars:>1000). Uses GitHub search syntax. Used in search mode. |
| repositoryUrls | string | "" | Repository URLs, one per line, in the format https://github.com/owner/repo. Used in direct mode. |
| sortBy | string | stars | Sort search results by: stars, forks, updated, help-wanted-issues |
| maxResults | integer | 30 | Maximum number of repositories to fetch (search mode, 1-1000) |
| includeReadme | boolean | true | Extract README content (recommended for AI/LLM use) |
| includeTopics | boolean | true | Fetch repository topics/tags |
| includeLanguages | boolean | true | Fetch programming language statistics |
| githubToken | string | "" | Optional GitHub Personal Access Token (raises the rate limit from 60 to 5,000 requests/hour) |
| debugMode | boolean | false | Enable verbose logging |
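The parameter table can be mirrored in a small helper that fills in the documented defaults and rejects values outside the documented ranges before a run is started. This is a minimal sketch, not part of the actor: `build_input` is a hypothetical name, and the rule that each mode needs its matching field (`searchQuery` or `repositoryUrls`) to be non-empty is an assumption, not documented behavior.

```python
from typing import Any

# Defaults copied from the Input Parameters table.
DEFAULTS: dict[str, Any] = {
    "mode": "search",
    "searchQuery": "",
    "repositoryUrls": "",
    "sortBy": "stars",
    "maxResults": 30,
    "includeReadme": True,
    "includeTopics": True,
    "includeLanguages": True,
    "githubToken": "",
    "debugMode": False,
}
SORT_OPTIONS = {"stars", "forks", "updated", "help-wanted-issues"}


def build_input(**overrides: Any) -> dict[str, Any]:
    """Merge overrides into the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides}
    if cfg["mode"] not in {"search", "direct"}:
        raise ValueError("mode must be 'search' or 'direct'")
    if cfg["sortBy"] not in SORT_OPTIONS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_OPTIONS)}")
    if not 1 <= cfg["maxResults"] <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    # Assumption: each mode needs its own field to be non-empty.
    if cfg["mode"] == "search" and not cfg["searchQuery"].strip():
        raise ValueError("search mode needs a searchQuery")
    if cfg["mode"] == "direct" and not cfg["repositoryUrls"].strip():
        raise ValueError("direct mode needs repositoryUrls")
    return cfg


run_input = build_input(searchQuery="language:python stars:>1000", maxResults=50)
```

Catching a typo such as `maxresults` locally avoids paying for a run that silently falls back to defaults.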
Output
Each repository returns comprehensive data:
{
  "name": "tensorflow",
  "full_name": "tensorflow/tensorflow",
  "owner": {
    "login": "tensorflow",
    "type": "Organization",
    "url": "https://github.com/tensorflow"
  },
  "description": "An Open Source Machine Learning Framework for Everyone",
  "url": "https://github.com/tensorflow/tensorflow",
  "homepage": "https://www.tensorflow.org",
  "language": "C++",
  "stars": 185000,
  "forks": 74000,
  "watchers": 185000,
  "open_issues": 1850,
  "size": 285000,
  "topics": ["machine-learning", "deep-learning", "tensorflow", "python"],
  "license": "Apache License 2.0",
  "created_at": "2015-11-07T01:19:20Z",
  "updated_at": "2025-01-12T10:30:45Z",
  "pushed_at": "2025-01-12T09:15:22Z",
  "is_fork": false,
  "is_archived": false,
  "is_private": false,
  "default_branch": "master",
  "readme": {
    "name": "README.md",
    "path": "README.md",
    "content": "# TensorFlow...",
    "size": 12584,
    "html_url": "https://github.com/tensorflow/tensorflow/blob/master/README.md",
    "download_url": "https://raw.githubusercontent.com/tensorflow/tensorflow/master/README.md"
  },
  "languages": {
    "C++": 125847623,
    "Python": 45123456,
    "Java": 12345678
  },
  "scraped_at": "2025-01-12T15:30:00.000Z",
  "index": 1
}
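The `languages` object in the record above holds raw counts per language, which are easier to compare as percentages. The sketch below assumes the actor passes through the byte counts that GitHub's languages endpoint returns; `language_shares` is a hypothetical helper name.

```python
def language_shares(record: dict) -> dict[str, float]:
    """Convert the counts in record["languages"] to percentages of the total."""
    counts = record.get("languages") or {}
    total = sum(counts.values())
    if total == 0:
        # Languages were not fetched, or the repository has no detected code.
        return {}
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


record = {
    "full_name": "tensorflow/tensorflow",
    "languages": {"C++": 125847623, "Python": 45123456, "Java": 12345678},
}
for lang, share in language_shares(record).items():
    print(f"{lang}: {share}%")
```

Records fetched with `includeLanguages` set to false simply yield an empty result here.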
Example Usage
Search for Python Repositories
{"mode": "search","searchQuery": "language:python stars:>1000","sortBy": "stars","maxResults": 100}
Search for AI/ML Projects
{"mode": "search","searchQuery": "machine learning stars:>5000","sortBy": "updated","maxResults": 50,"includeReadme": true}
Fetch Specific Repositories
{"mode": "direct","repositoryUrls": "https://github.com/facebook/react\nhttps://github.com/tensorflow/tensorflow\nhttps://github.com/microsoft/vscode","includeReadme": true,"includeTopics": true}
With GitHub Token (Higher Rate Limits)
{"mode": "search","searchQuery": "language:rust stars:>500","maxResults": 200,"githubToken": "ghp_yourtoken","includeReadme": true}
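In direct mode, `repositoryUrls` is a single string with one URL per line, so a list of URLs has to be joined with newlines first. A minimal sketch, assuming the `https://github.com/owner/repo` format from the parameter table; `direct_input` and the URL pattern are illustrative, not part of the actor.

```python
import re

# Matches https://github.com/owner/repo with an optional trailing slash.
REPO_URL = re.compile(r"^https://github\.com/[\w.-]+/[\w.-]+/?$")


def direct_input(urls: list[str], **options) -> dict:
    """Build a direct-mode input from a list of repository URLs."""
    cleaned = []
    for url in urls:
        url = url.strip()
        if not REPO_URL.match(url):
            raise ValueError(f"not a repository URL: {url!r}")
        cleaned.append(url.rstrip("/"))
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(cleaned))
    return {"mode": "direct", "repositoryUrls": "\n".join(unique), **options}


run_input = direct_input(
    [
        "https://github.com/facebook/react",
        "https://github.com/tensorflow/tensorflow",
        "https://github.com/microsoft/vscode",
    ],
    includeReadme=True,
    includeTopics=True,
)
```

Removing duplicates up front keeps a pay-per-result run from fetching the same repository twice.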
GitHub Search Query Syntax
Search by Language
language:python
language:javascript
language:rust
Search by Stars/Forks
stars:>1000
stars:1000..5000
forks:>500
Search by Topics
topic:machine-learning
topic:web-development
Search by Organization or User
org:google
org:microsoft
user:torvalds
Combine Multiple Criteria
language:python stars:>1000 topic:machine-learning
language:go stars:>500 forks:>100
Full Documentation: GitHub Search Syntax
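The qualifiers above are plain `name:value` tokens separated by spaces, so a `searchQuery` can be assembled programmatically. The helper below is a hypothetical sketch; it does no quoting, so values containing spaces would need to be quoted by the caller.

```python
def build_query(*keywords: str, **qualifiers) -> str:
    """Join free-text keywords and name:value qualifiers into one query string."""
    parts = list(keywords)
    for name, value in qualifiers.items():
        # A list or tuple repeats the qualifier once per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.extend(f"{name}:{v}" for v in values)
    return " ".join(parts)


query = build_query(language="go", stars=">500", forks=">100")
print(query)  # language:go stars:>500 forks:>100
```

The resulting string goes into the `searchQuery` input parameter unchanged.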
Rate Limits & Authentication
Without GitHub Token
- 60 requests per hour
- Good for: Testing, small batches (<30 repos)
- Unauthenticated access
With GitHub Token (Recommended)
- 5,000 requests per hour
- Good for: Production, large batches (100s of repos)
- Required for: Frequent usage
Creating a GitHub Token
- Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
- Click "Generate new token (classic)"
- Select scopes: public_repo (read public repositories)
- Copy the token and use it in the githubToken parameter
Note: Tokens are optional but highly recommended for production use.
Pricing (Pay-Per-Result)
$0.015 per repository ($15 per 1,000 repositories)
Example Cost Calculation:
Fetching 1,000 repositories:
- Repository metadata: 1,000 × $0.015 = $15.00
💡 No browser costs, no proxy costs - just lightweight API calls!
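The calculation above is a single multiplication, sketched here with `Decimal` so that cents are not lost to floating-point rounding. The default rate is the $0.015 per repository quoted in this section; `estimate_cost` is a hypothetical helper, and a different rate can be passed if the listed price differs.

```python
from decimal import Decimal


def estimate_cost(repositories: int, price_per_result: str = "0.015") -> Decimal:
    """Return the estimated cost in dollars, rounded to whole cents."""
    if repositories < 0:
        raise ValueError("repositories must not be negative")
    return (Decimal(price_per_result) * repositories).quantize(Decimal("0.01"))


print(estimate_cost(1000))  # 15.00
```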
Best Practices
Search Optimization
- Use Specific Queries: language:python stars:>1000 works better than a bare python
- Filter by Activity: pushed:>2024-01-01 for active projects
- Combine Criteria: Use stars, language, and topics together
- Sort Strategically: stars for popular, updated for active
README Extraction for AI/LLM
- Enable README Fetching: Always set includeReadme: true
- Filter Quality: Focus on repos with stars:>100
- Language Filtering: Target specific tech stacks
- Documentation Rich: Search for topic:documentation
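Once a run has produced records, the README text usually needs filtering and splitting before it goes into a vector database or training set. The sketch below uses the `stars`, `full_name`, and `readme.content` fields from the sample output; the function name, the character-based chunking, and the default sizes are illustrative assumptions, not something the actor does itself.

```python
from typing import Iterable, Iterator


def readme_chunks(
    records: Iterable[dict],
    min_stars: int = 100,
    chunk_chars: int = 1000,
    overlap: int = 100,
) -> Iterator[tuple[str, str]]:
    """Yield (full_name, chunk) pairs for sufficiently starred repos with a README."""
    if not 0 <= overlap < chunk_chars:
        raise ValueError("overlap must be >= 0 and smaller than chunk_chars")
    step = chunk_chars - overlap
    for rec in records:
        readme = rec.get("readme") or {}
        text = (readme.get("content") or "").strip()
        if not text or rec.get("stars", 0) < min_stars:
            continue  # skip repos without README text or below the star threshold
        start = 0
        while True:
            yield rec["full_name"], text[start:start + chunk_chars]
            if start + chunk_chars >= len(text):
                break  # the chunk just emitted reached the end of the text
            start += step


sample = [{"full_name": "tensorflow/tensorflow", "stars": 185000,
           "readme": {"content": "# TensorFlow..."}}]
for name, chunk in readme_chunks(sample):
    print(name, len(chunk))
```

Overlapping chunks keep sentences that straddle a boundary retrievable from either side; token-based splitting would be a reasonable alternative for LLM pipelines.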
Rate Limit Management
- Use Authentication: Get a GitHub token for 5,000 requests/hour
- Batch Requests: Plan your searches to minimize API calls
- Monitor Limits: Check rate limit in actor logs
- Schedule Runs: Spread large jobs across hours
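Spreading a large job across hours comes down to dividing the hourly request budget by the number of API requests each repository costs. The sketch below assumes three requests per repository (for example metadata, README, and languages); that figure is a guess for illustration, since the actor's real request count per repository is not documented here, and `plan_hourly_batches` is a hypothetical name.

```python
def plan_hourly_batches(
    total_repos: int,
    hourly_limit: int = 5000,
    calls_per_repo: int = 3,  # assumption, not a documented figure
) -> list[int]:
    """Split total_repos into per-hour batch sizes that fit the request budget."""
    per_hour = hourly_limit // calls_per_repo
    if per_hour < 1:
        raise ValueError("hourly_limit is too small for even one repository")
    batches = []
    remaining = total_repos
    while remaining > 0:
        size = min(per_hour, remaining)
        batches.append(size)
        remaining -= size
    return batches


# Without a token (60 requests/hour), 50 repositories need three hourly runs.
print(plan_hourly_batches(50, hourly_limit=60))  # [20, 20, 10]
```

Each batch size can then be used as `maxResults` for one scheduled run.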
FAQ
Q: Is this legal?
A: Yes! Uses GitHub's official REST API with proper permissions.
Q: Do I need a GitHub account?
A: No for basic usage (60 requests/hour). Yes for higher limits (5,000 requests/hour with token).
Q: What's the rate limit without a token?
A: 60 requests per hour (unauthenticated). 5,000 with a token.
Q: Can I extract private repositories?
A: Only public repositories. Private repos require different permissions.
Q: How do I get README content for AI training?
A: Set includeReadme: true and use search mode to find relevant repositories.
Q: Can I search by multiple languages?
A: Use language:python OR language:javascript in search query.
Q: What happens if the rate limit is exceeded?
A: The actor will log a warning. Add a GitHub token to increase limits.
Why Use This Actor?
| Feature | This Actor (GitHub API) | Web Scraping |
|---|---|---|
| Legal | ✅ Official API | ❌ Violates ToS |
| Stable | ✅ API rarely changes | ❌ HTML breaks often |
| Fast | ✅ Direct API calls | ❌ Browser overhead |
| Cost | ✅ $15 per 1k repos | ❌ $30+ per 1k |
| Authentication | ✅ Optional (higher limits) | ❌ Complex login |
| README Access | ✅ Direct API endpoint | ❌ Requires parsing |
| Maintenance | ✅ Minimal | ❌ Constant updates |
Output Use Cases
AI/LLM Training
- Feed README content into vector databases
- Build code documentation datasets
- Create programming Q&A pairs
- Extract technical writing samples
Developer Tools
- Tech stack analysis
- Framework popularity tracking
- Library comparison
- Documentation aggregation
Business Intelligence
- Competitor monitoring
- Technology trend analysis
- Open-source landscape mapping
- Developer ecosystem research
Legal & Ethics
✅ Legal Compliance:
- Official API: Uses GitHub REST API v3 with proper authentication
- Public Data Only: Accesses only publicly available repositories
- Rate Limits: Respects GitHub's rate limiting
- Terms of Service: Complies with GitHub's API ToS
- No Scraping: No HTML parsing or browser automation
This actor is 100% legal and ethical - uses official GitHub API with proper permissions.
Support
Need help? Have questions?
- 📧 Support: support@apify.com
- 📚 Apify Docs: https://docs.apify.com
- 💬 Community: https://discord.com/invite/jyEM2PRvMU
- 🐙 GitHub API Docs: https://docs.github.com/en/rest
Built with ❤️ using Apify and GitHub REST API
Perfect for:
- 🤖 AI/LLM training data collection
- 📊 Developer research and analytics
- 💼 Competitive intelligence
- 🔍 Technology trend analysis
- 📚 Documentation aggregation