GitHub Repository Intelligence - API-Based Data Scraper

Pricing

$20.00 / 1,000 results

Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.

Rating: 0.0 (0)
Developer: ben (Maintained by Community)
Actor stats: 0 bookmarked, 3 total users, 2 monthly active users
Last modified: 11 hours ago

GitHub Repository Intelligence - API-Based Data & Documentation Scraper

Extract comprehensive repository data from GitHub using the official REST API.

Fetch repository metadata, README content, documentation, topics, language statistics, and more. Perfect for AI/LLM training data, developer research, competitive analysis, and tech stack discovery. Legal, stable, and fast API-based extraction.

Features

Dual Scraping Modes

  • Search Mode: Find repositories by keywords, language, stars
  • Direct Mode: Fetch specific repositories by URL

Comprehensive Data Extraction

  • Repository metadata (stars, forks, watchers, issues)
  • README content (perfect for LLM training)
  • Programming language statistics
  • Repository topics/tags
  • License information
  • Creation/update timestamps
  • Owner information

Official GitHub API

  • Uses GitHub REST API v3 (100% legal)
  • No browser automation required
  • Stable and reliable
  • Optional authentication for higher rate limits

Built for AI & Research

  • README extraction for LLM training
  • Structured JSON output
  • Rich metadata for analysis
  • Topic and language classification
  • Dataset export (CSV, JSON, Excel)

Use Cases

🤖 AI/LLM Training Data

  • Extract README files for AI model training
  • Gather documentation for vector databases
  • Build RAG (Retrieval-Augmented Generation) pipelines
  • Create code-to-text datasets

🔍 Developer Research

  • Discover trending repositories
  • Analyze tech stacks and tools
  • Monitor open-source ecosystem
  • Track language adoption trends

💼 Business Intelligence

  • Competitive analysis
  • Technology trend spotting
  • Developer tool discovery
  • Market research for dev tools

📊 Academic Research

  • Software engineering studies
  • Open-source collaboration analysis
  • Programming language evolution
  • Developer ecosystem research

Input

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 50,
  "includeReadme": true,
  "includeTopics": true,
  "includeLanguages": true,
  "githubToken": "ghp_xxxxxxxxxxxx"
}

Input Parameters

Parameter | Type | Default | Description
mode | string | search | Scraping mode: search (find repos by query) or direct (specific URLs)
searchQuery | string | "" | Search query (e.g., language:python stars:>1000). Uses GitHub search syntax.
repositoryUrls | string | "" | Repository URLs (one per line). Format: https://github.com/owner/repo
sortBy | string | stars | Sort search results by: stars, forks, updated, help-wanted-issues
maxResults | integer | 30 | Maximum repositories to fetch (search mode, 1-1000)
includeReadme | boolean | true | Extract README content (recommended for AI/LLM)
includeTopics | boolean | true | Fetch repository topics/tags
includeLanguages | boolean | true | Fetch programming language statistics
githubToken | string | "" | Optional GitHub Personal Access Token (5,000 vs 60 requests/hour)
debugMode | boolean | false | Enable verbose logging
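
To make the parameter table concrete, here is a minimal Python sketch that assembles an input object and rejects values outside the documented ranges. The helper name build_actor_input and the client-side checks are assumptions for illustration; the actor's own validation is not documented here.

```python
from typing import Any, Dict, List, Optional

# Sort values listed in the parameter table.
ALLOWED_SORTS = {"stars", "forks", "updated", "help-wanted-issues"}


def build_actor_input(
    mode: str = "search",
    search_query: str = "",
    repository_urls: Optional[List[str]] = None,
    sort_by: str = "stars",
    max_results: int = 30,
    github_token: str = "",
) -> Dict[str, Any]:
    """Hypothetical helper: build an input dict using the documented parameter names."""
    if mode not in ("search", "direct"):
        raise ValueError("mode must be 'search' or 'direct'")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    if mode == "search" and not search_query.strip():
        raise ValueError("search mode needs a non-empty searchQuery")
    if mode == "direct" and not repository_urls:
        raise ValueError("direct mode needs at least one repository URL")

    return {
        "mode": mode,
        "searchQuery": search_query,
        # repositoryUrls is documented as a single string, one URL per line.
        "repositoryUrls": "\n".join(repository_urls or []),
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeReadme": True,
        "includeTopics": True,
        "includeLanguages": True,
        "githubToken": github_token,
        "debugMode": False,
    }
```

Calling build_actor_input(search_query="language:python stars:>1000", max_results=50) yields a dict that can be serialized with json.dumps and pasted into the actor's input editor.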

Output

Each repository is returned as one record with the following fields:

{
  "name": "tensorflow",
  "full_name": "tensorflow/tensorflow",
  "owner": {
    "login": "tensorflow",
    "type": "Organization",
    "url": "https://github.com/tensorflow"
  },
  "description": "An Open Source Machine Learning Framework for Everyone",
  "url": "https://github.com/tensorflow/tensorflow",
  "homepage": "https://www.tensorflow.org",
  "language": "C++",
  "stars": 185000,
  "forks": 74000,
  "watchers": 185000,
  "open_issues": 1850,
  "size": 285000,
  "topics": ["machine-learning", "deep-learning", "tensorflow", "python"],
  "license": "Apache License 2.0",
  "created_at": "2015-11-07T01:19:20Z",
  "updated_at": "2025-01-12T10:30:45Z",
  "pushed_at": "2025-01-12T09:15:22Z",
  "is_fork": false,
  "is_archived": false,
  "is_private": false,
  "default_branch": "master",
  "readme": {
    "name": "README.md",
    "path": "README.md",
    "content": "# TensorFlow...",
    "size": 12584,
    "html_url": "https://github.com/tensorflow/tensorflow/blob/master/README.md",
    "download_url": "https://raw.githubusercontent.com/tensorflow/tensorflow/master/README.md"
  },
  "languages": {
    "C++": 125847623,
    "Python": 45123456,
    "Java": 12345678
  },
  "scraped_at": "2025-01-12T15:30:00.000Z",
  "index": 1
}
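
The languages field holds byte counts per language, which are easier to compare as percentages. The following is a small sketch of that conversion; the function name language_shares is an assumption, not part of the actor.

```python
from typing import Dict


def language_shares(languages: Dict[str, int]) -> Dict[str, float]:
    """Convert per-language byte counts into percentages, largest first."""
    total = sum(languages.values())
    if total == 0:
        return {}
    ordered = sorted(languages.items(), key=lambda item: item[1], reverse=True)
    return {name: round(100 * count / total, 1) for name, count in ordered}


# Byte counts copied from the sample record above.
sample = {"C++": 125847623, "Python": 45123456, "Java": 12345678}
shares = language_shares(sample)
```

Because each share is rounded independently, the percentages may not sum to exactly 100.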

Example Usage

Search for Python Repositories

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 100
}

Search for AI/ML Projects

{
  "mode": "search",
  "searchQuery": "machine learning stars:>5000",
  "sortBy": "updated",
  "maxResults": 50,
  "includeReadme": true
}

Fetch Specific Repositories

{
  "mode": "direct",
  "repositoryUrls": "https://github.com/facebook/react\nhttps://github.com/tensorflow/tensorflow\nhttps://github.com/microsoft/vscode",
  "includeReadme": true,
  "includeTopics": true
}
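
Since repositoryUrls is a single newline-separated string, it can help to check the URLs before starting a run. The sketch below splits the string and extracts owner/repo pairs; parse_repository_urls is a hypothetical helper, and how the actor itself parses this field is not documented.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def parse_repository_urls(repository_urls: str) -> List[Tuple[str, str]]:
    """Split a newline-separated URL string into (owner, repo) pairs."""
    repos = []
    for line in repository_urls.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parsed = urlparse(line)
        parts = [part for part in parsed.path.split("/") if part]
        if parsed.netloc.lower() != "github.com" or len(parts) < 2:
            raise ValueError(f"not a repository URL: {line!r}")
        owner, repo = parts[0], parts[1]
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        repos.append((owner, repo))
    return repos


urls = (
    "https://github.com/facebook/react\n"
    "https://github.com/tensorflow/tensorflow\n"
    "https://github.com/microsoft/vscode"
)
pairs = parse_repository_urls(urls)
```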

With GitHub Token (Higher Rate Limits)

{
  "mode": "search",
  "searchQuery": "language:rust stars:>500",
  "maxResults": 200,
  "githubToken": "ghp_yourtoken",
  "includeReadme": true
}

GitHub Search Query Syntax

Search by Language

language:python
language:javascript
language:rust

Search by Stars/Forks

stars:>1000
stars:1000..5000
forks:>500

Search by Topics

topic:machine-learning
topic:web-development

Search by Organization

org:google
org:microsoft
user:torvalds

Combine Multiple Criteria

language:python stars:>1000 topic:machine-learning
language:go stars:>500 forks:>100

Full Documentation: GitHub Search Syntax

Rate Limits & Authentication

Without GitHub Token

  • 60 requests per hour
  • Good for: Testing, small batches (<30 repos)
  • Unauthenticated access

With GitHub Token

  • 5,000 requests per hour
  • Good for: Production, large batches (100s of repos)
  • Required for: Frequent usage
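
The hourly limits translate directly into how long a job takes. The sketch below estimates the number of hourly windows a run needs. The default of two requests per repository is an assumption chosen to be consistent with the "<30 repos" guidance at 60 requests per hour; the actor's real request count per repository is not documented.

```python
import math


def hours_needed(repo_count: int, requests_per_repo: int = 2, hourly_limit: int = 60) -> int:
    """Estimate how many hourly rate-limit windows a run would span."""
    if repo_count < 0 or requests_per_repo < 1 or hourly_limit < 1:
        raise ValueError("arguments must be positive")
    total_requests = repo_count * requests_per_repo
    return math.ceil(total_requests / hourly_limit)


without_token = hours_needed(500, hourly_limit=60)
with_token = hours_needed(500, hourly_limit=5000)
```

Under these assumptions, 500 repositories fit in a single hour with a token but are spread over many hours without one.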

Creating a GitHub Token

  1. Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
  2. Click "Generate new token (classic)"
  3. Select scopes: public_repo (read public repositories)
  4. Copy the token and pass it in the githubToken parameter
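
To confirm a new token works before using it in a run, it can be sent to GitHub's rate-limit endpoint. The sketch below only prepares the request; the helper names are hypothetical, and how the actor attaches the token internally is an assumption.

```python
import urllib.request
from typing import Dict


def github_headers(token: str = "") -> Dict[str, str]:
    """Build request headers; Authorization is added only when a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def rate_limit_request(token: str = "") -> urllib.request.Request:
    """Prepare (but do not send) a request to GitHub's rate-limit endpoint."""
    return urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers=github_headers(token),
    )
```

Passing the prepared request to urllib.request.urlopen returns a JSON document with the current quota, which should show the higher limit when the token is valid.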

Note: Tokens are optional but highly recommended for production use.

Pricing (Pay-Per-Result)

$0.015 per repository ($15 per 1,000 repositories)

Example Cost Calculation:

Fetching 1,000 repositories:

  • Repository metadata: 1,000 × $0.015 = $15.00

💡 No browser costs, no proxy costs - just lightweight API calls!
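
The same calculation for other batch sizes can be sketched as follows, using the $0.015 per-repository rate quoted in this section. The rate is a parameter because the price shown on the store listing may differ; estimate_cost is a hypothetical helper.

```python
from decimal import Decimal

# Per-repository price quoted in this section.
PRICE_PER_REPOSITORY = Decimal("0.015")


def estimate_cost(repositories: int, price: Decimal = PRICE_PER_REPOSITORY) -> Decimal:
    """Return the estimated charge in dollars, rounded to cents."""
    if repositories < 0:
        raise ValueError("repositories must not be negative")
    return (price * repositories).quantize(Decimal("0.01"))


cost_1000 = estimate_cost(1000)
cost_250 = estimate_cost(250)
```

Decimal is used instead of float so that per-result prices multiply without binary rounding noise.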

Best Practices

Search Optimization

  1. Use Specific Queries: language:python stars:>1000 returns more relevant results than a bare python
  2. Filter by Activity: pushed:>2024-01-01 for active projects
  3. Combine Criteria: Use stars, language, topics together
  4. Sort Strategically: stars for popular, updated for active

README Extraction for AI/LLM

  1. Enable README Fetching: Always set includeReadme: true
  2. Filter Quality: Focus on repos with stars:>100
  3. Language Filtering: Target specific tech stacks
  4. Documentation-Rich Repos: Search for topic:documentation
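
Once records are exported, README text usually needs to be split into smaller pieces before it goes into a vector database. Here is a minimal sketch of that step, assuming records shaped like the sample output (stars, full_name, readme.content). The function names, the paragraph-based splitting, and the 1,000-character limit are illustrative choices, not the actor's behavior.

```python
from typing import Dict, Iterable, List


def split_readme(text: str, max_chars: int = 1000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph  # an oversized paragraph is kept whole
    if current:
        chunks.append(current)
    return chunks


def readme_documents(records: Iterable[Dict], min_stars: int = 100, max_chars: int = 1000) -> List[Dict]:
    """Turn records with more than min_stars stars into chunked README documents."""
    documents = []
    for record in records:
        content = (record.get("readme") or {}).get("content") or ""
        if record.get("stars", 0) <= min_stars or not content:
            continue
        for index, chunk in enumerate(split_readme(content, max_chars)):
            documents.append({"repo": record.get("full_name"), "chunk_index": index, "text": chunk})
    return documents
```

Each returned document carries the repository name and chunk index, so retrieved passages can be traced back to their source.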

Rate Limit Management

  1. Use Authentication: Get a GitHub token for 5,000 requests/hour
  2. Batch Requests: Plan your searches to minimize API calls
  3. Monitor Limits: Check rate limit in actor logs
  4. Schedule Runs: Spread large jobs across hours
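
For scripts that call the GitHub API directly alongside the actor, the remaining quota can be read from the X-RateLimit-Remaining and X-RateLimit-Reset response headers. The sketch below computes how long to pause once the quota is exhausted; the function name is hypothetical, and the format of the actor's own rate-limit log messages is not shown here.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, based on rate-limit headers."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    # x-ratelimit-reset is a UTC epoch timestamp in seconds.
    reset_at = int(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A caller would pass the result to time.sleep before retrying, which spreads a large job across rate-limit windows instead of failing partway through.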

FAQ

Q: Is this legal? A: Yes! Uses GitHub's official REST API with proper permissions.

Q: Do I need a GitHub account? A: No for basic usage (60 requests/hour). Yes for higher limits (5,000 requests/hour with token).

Q: What's the rate limit without a token? A: 60 requests per hour (unauthenticated). 5,000 with a token.

Q: Can I extract private repositories? A: Only public repositories. Private repos require different permissions.

Q: How do I get README content for AI training? A: Set includeReadme: true and use search mode to find relevant repositories.

Q: Can I search by multiple languages? A: Use language:python OR language:javascript in the search query.

Q: What happens if the rate limit is exceeded? A: The actor logs a warning. Add a GitHub token to increase limits.

Why Use This Actor?

Feature | This Actor (GitHub API) | Web Scraping
Legal | ✅ Official API | ❌ Violates ToS
Stable | ✅ API rarely changes | ❌ HTML breaks often
Fast | ✅ Direct API calls | ❌ Browser overhead
Cost | ✅ $15 per 1k repos | ❌ $30+ per 1k
Authentication | ✅ Optional (higher limits) | ❌ Complex login
README Access | ✅ Direct API endpoint | ❌ Requires parsing
Maintenance | ✅ Minimal | ❌ Constant updates

Output Use Cases

AI/LLM Training

  • Feed README content into vector databases
  • Build code documentation datasets
  • Create programming Q&A pairs
  • Extract technical writing samples

Developer Tools

  • Tech stack analysis
  • Framework popularity tracking
  • Library comparison
  • Documentation aggregation
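
As one way to approach tech stack analysis and popularity tracking, the exported records can be tallied by primary language and topic. This is a sketch assuming records with the language and topics fields from the sample output; tech_stack_summary is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tech_stack_summary(records: Iterable[Dict], top_n: int = 5) -> Dict[str, List[Tuple[str, int]]]:
    """Count primary languages and topics across records, most common first."""
    records = list(records)
    languages = Counter(r["language"] for r in records if r.get("language"))
    topics = Counter(topic for r in records for topic in (r.get("topics") or []))
    return {
        "languages": languages.most_common(top_n),
        "topics": topics.most_common(top_n),
    }
```

Running the same summary on datasets from different dates gives a simple view of how language and topic popularity shifts over time.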

Business Intelligence

  • Competitor monitoring
  • Technology trend analysis
  • Open-source landscape mapping
  • Developer ecosystem research

Legal Compliance

  1. Official API: Uses GitHub REST API v3 with proper authentication
  2. Public Data Only: Accesses only publicly available repositories
  3. Rate Limits: Respects GitHub's rate limiting
  4. Terms of Service: Complies with GitHub's API ToS
  5. No Scraping: No HTML parsing or browser automation

This actor is 100% legal and ethical: it uses the official GitHub API with proper permissions.

Support

Need help? Have questions?


Built with ❤️ using Apify and GitHub REST API

Perfect for:

  • 🤖 AI/LLM training data collection
  • 📊 Developer research and analytics
  • 💼 Competitive intelligence
  • 🔍 Technology trend analysis
  • 📚 Documentation aggregation