Pricing

$20.00 / 1,000 results

GitHub Repository Intelligence - API-Based Data Scraper

Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.

Developer: ben

Last modified: 2 days ago

GitHub Repository Intelligence - API-Based Data & Documentation Scraper

Extract comprehensive repository data from GitHub using the official REST API.

Fetch repository metadata, README content, documentation, topics, language statistics, and more. Perfect for AI/LLM training data, developer research, competitive analysis, and tech stack discovery. Legal, stable, and fast API-based extraction.

Features

✅ Dual Scraping Modes

Search Mode: Find repositories by keywords, language, stars
Direct Mode: Fetch specific repositories by URL

✅ Comprehensive Data Extraction

Repository metadata (stars, forks, watchers, issues)
README content (perfect for LLM training)
Programming language statistics
Repository topics/tags
License information
Creation/update timestamps
Owner information

✅ Official GitHub API

Uses GitHub REST API v3 (100% legal)
No browser automation required
Stable and reliable
Optional authentication for higher rate limits

✅ Built for AI & Research

README extraction for LLM training
Structured JSON output
Rich metadata for analysis
Topic and language classification
Dataset export (CSV, JSON, Excel)

Use Cases

🤖 AI/LLM Training Data

Extract README files for AI model training
Gather documentation for vector databases
Build RAG (Retrieval-Augmented Generation) pipelines
Create code-to-text datasets

🔍 Developer Research

Discover trending repositories
Analyze tech stacks and tools
Monitor open-source ecosystem
Track language adoption trends

💼 Business Intelligence

Competitive analysis
Technology trend spotting
Developer tool discovery
Market research for dev tools

📊 Academic Research

Software engineering studies
Open-source collaboration analysis
Programming language evolution
Developer ecosystem research

Input

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 50,
  "includeReadme": true,
  "includeTopics": true,
  "includeLanguages": true,
  "githubToken": "ghp_xxxxxxxxxxxx"
}

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | string | `search` | Scraping mode: `search` (find repos by query) or `direct` (specific URLs) |
| `searchQuery` | string | `""` | Search query (e.g., `language:python stars:>1000`). Uses GitHub search syntax. |
| `repositoryUrls` | string | `""` | Repository URLs (one per line). Format: `https://github.com/owner/repo` |
| `sortBy` | string | `stars` | Sort search results by: `stars`, `forks`, `updated`, `help-wanted-issues` |
| `maxResults` | integer | `30` | Maximum repositories to fetch (search mode, 1-1000) |
| `includeReadme` | boolean | `true` | Extract README content (recommended for AI/LLM) |
| `includeTopics` | boolean | `true` | Fetch repository topics/tags |
| `includeLanguages` | boolean | `true` | Fetch programming language statistics |
| `githubToken` | string | `""` | Optional GitHub Personal Access Token (5,000 vs 60 requests/hour) |
| `debugMode` | boolean | `false` | Enable verbose logging |
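
The parameter table can be turned into a small client-side check before a run is started. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper, and the rule that each mode needs its matching field (`searchQuery` or `repositoryUrls`) is an assumption drawn from the descriptions above rather than documented actor behavior.

```python
from typing import Any

ALLOWED_MODES = ("search", "direct")
ALLOWED_SORTS = ("stars", "forks", "updated", "help-wanted-issues")


def build_input(
    mode: str = "search",
    search_query: str = "",
    repository_urls: str = "",
    sort_by: str = "stars",
    max_results: int = 30,
    include_readme: bool = True,
    include_topics: bool = True,
    include_languages: bool = True,
    github_token: str = "",
    debug_mode: bool = False,
) -> dict[str, Any]:
    """Build an input dict using the parameter names and defaults from the table."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {ALLOWED_SORTS}")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    # Assumption: each mode needs its own field to be non-empty.
    if mode == "search" and not search_query.strip():
        raise ValueError("search mode needs a non-empty searchQuery")
    if mode == "direct" and not repository_urls.strip():
        raise ValueError("direct mode needs at least one URL in repositoryUrls")
    return {
        "mode": mode,
        "searchQuery": search_query,
        "repositoryUrls": repository_urls,
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeReadme": include_readme,
        "includeTopics": include_topics,
        "includeLanguages": include_languages,
        "githubToken": github_token,
        "debugMode": debug_mode,
    }
```

Calling `build_input(search_query="language:python stars:>1000", max_results=50)` yields a dict that can be serialized with `json.dumps` and pasted into the actor's input editor.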

Output

Each repository returns comprehensive data:

{
  "name": "tensorflow",
  "full_name": "tensorflow/tensorflow",
  "owner": {
    "login": "tensorflow",
    "type": "Organization",
    "url": "https://github.com/tensorflow"
  },
  "description": "An Open Source Machine Learning Framework for Everyone",
  "url": "https://github.com/tensorflow/tensorflow",
  "homepage": "https://www.tensorflow.org",
  "language": "C++",
  "stars": 185000,
  "forks": 74000,
  "watchers": 185000,
  "open_issues": 1850,
  "size": 285000,
  "topics": ["machine-learning", "deep-learning", "tensorflow", "python"],
  "license": "Apache License 2.0",
  "created_at": "2015-11-07T01:19:20Z",
  "updated_at": "2025-01-12T10:30:45Z",
  "pushed_at": "2025-01-12T09:15:22Z",
  "is_fork": false,
  "is_archived": false,
  "is_private": false,
  "default_branch": "master",
  "readme": {
    "name": "README.md",
    "path": "README.md",
    "content": "# TensorFlow...",
    "size": 12584,
    "html_url": "https://github.com/tensorflow/tensorflow/blob/master/README.md",
    "download_url": "https://raw.githubusercontent.com/tensorflow/tensorflow/master/README.md"
  },
  "languages": {
    "C++": 125847623,
    "Python": 45123456,
    "Java": 12345678
  },
  "scraped_at": "2025-01-12T15:30:00.000Z",
  "index": 1
}
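
The `languages` object holds byte counts per language, which are easier to compare as percentages. A minimal sketch of that conversion, assuming the record shape shown above (`language_shares` is a hypothetical helper name):

```python
def language_shares(languages: dict[str, int]) -> dict[str, float]:
    """Convert per-language byte counts into percentages, largest first."""
    total = sum(languages.values())
    if total == 0:
        return {}
    ranked = sorted(languages.items(), key=lambda item: item[1], reverse=True)
    return {name: round(100 * count / total, 1) for name, count in ranked}


# Values copied from the sample record above.
sample = {"C++": 125847623, "Python": 45123456, "Java": 12345678}
shares = language_shares(sample)
```

Rounding to one decimal place is a presentation choice; drop the `round` call if exact ratios are needed for analysis.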

Example Usage

Search for Python Repositories

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 100
}

Search for AI/ML Projects

{
  "mode": "search",
  "searchQuery": "machine learning stars:>5000",
  "sortBy": "updated",
  "maxResults": 50,
  "includeReadme": true
}

Fetch Specific Repositories

{
  "mode": "direct",
  "repositoryUrls": "https://github.com/facebook/react\nhttps://github.com/tensorflow/tensorflow\nhttps://github.com/microsoft/vscode",
  "includeReadme": true,
  "includeTopics": true
}
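
Since `repositoryUrls` is a single newline-separated string, it can help to normalize and de-duplicate a list of URLs before joining them. The sketch below is one way to do that; both function names are hypothetical, and how the actor itself treats trailing slashes or `.git` suffixes is not documented here.

```python
from urllib.parse import urlparse


def parse_repo_url(url: str) -> tuple[str, str]:
    """Return (owner, repo) for a https://github.com/owner/repo style URL."""
    parsed = urlparse(url.strip())
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc.lower() not in ("github.com", "www.github.com") or len(parts) < 2:
        raise ValueError(f"not a GitHub repository URL: {url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def to_repository_urls(urls: list[str]) -> str:
    """Normalize URLs, drop duplicates, and join them one per line."""
    seen: set[tuple[str, str]] = set()
    lines: list[str] = []
    for url in urls:
        owner, repo = parse_repo_url(url)
        key = (owner.lower(), repo.lower())
        if key not in seen:
            seen.add(key)
            lines.append(f"https://github.com/{owner}/{repo}")
    return "\n".join(lines)
```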

With GitHub Token (Higher Rate Limits)

{
  "mode": "search",
  "searchQuery": "language:rust stars:>500",
  "maxResults": 200,
  "githubToken": "ghp_yourtoken",
  "includeReadme": true
}
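
Any of the inputs above could also be submitted from Python instead of the web console. The following is a rough sketch using the call pattern of the `apify-client` package (`client.actor(...).call(run_input=...)` and `client.dataset(...).iterate_items()`); it assumes the run is returned as a dict with a `defaultDatasetId` key. The helper name `run_and_collect` and the actor ID placeholder are hypothetical, since this page does not state the actor's ID.

```python
from typing import Any


def run_and_collect(client: Any, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start an actor run, wait for it to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the actor run did not return a result")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage (not executed here; needs `pip install apify-client` and real credentials):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_APIFY_TOKEN>")
# repos = run_and_collect(
#     client,
#     "<username>/<actor-name>",  # placeholder: replace with this actor's ID
#     {"mode": "search", "searchQuery": "language:rust stars:>500", "maxResults": 200},
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no network access is available.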

GitHub Search Query Syntax

Search by Language

language:python
language:javascript
language:rust

Search by Stars/Forks

stars:>1000
stars:1000..5000
forks:>500

Search by Topics

topic:machine-learning
topic:web-development

Search by Organization

org:google
org:microsoft
user:torvalds

Combine Multiple Criteria

language:python stars:>1000 topic:machine-learning
language:go stars:>500 forks:>100
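
Queries that combine several qualifiers can be assembled programmatically. This is a minimal sketch with a hypothetical `build_query` helper; it only joins free-text keywords and `name:value` qualifiers with spaces and quotes values that contain whitespace, and it does not check that a qualifier is one GitHub actually supports.

```python
def build_query(keywords: str = "", **qualifiers: str) -> str:
    """Join free-text keywords and name:value qualifiers into one search string."""
    parts = [keywords.strip()] if keywords.strip() else []
    for name, value in qualifiers.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # values containing whitespace must be quoted
        parts.append(f"{name}:{text}")
    return " ".join(parts)


query = build_query(language="python", stars=">1000", topic="machine-learning")
```

The result can be passed directly as `searchQuery`.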

Full Documentation: GitHub Search Syntax

Rate Limits & Authentication

Without GitHub Token

60 requests per hour
Good for: Testing, small batches (<30 repos)
Unauthenticated access

With GitHub Token (Recommended)

5,000 requests per hour
Good for: Production, large batches (100s of repos)
Required for: Frequent usage

Creating a GitHub Token

Go to GitHub Settings → Tokens
Click "Generate new token (classic)"
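Select scopes: none are required to read public repositories (public_repo also works, but it additionally grants write access to public repositories)
Copy the token and pass it in the githubToken parameter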

Note: Tokens are optional but highly recommended for production use.
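
To see what a token changes, it can be checked against GitHub's `GET /rate_limit` endpoint before it is handed to the actor. The sketch below only builds the request; the header values follow GitHub's REST documentation, while the helper names and the `User-Agent` string are made up for illustration. How the actor itself sends the token is not described on this page.

```python
import urllib.request

API_ROOT = "https://api.github.com"


def github_headers(token: str = "") -> dict[str, str]:
    """Headers for a GitHub REST call; Authorization is added only with a token."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
        "User-Agent": "repo-intel-example",  # hypothetical client name
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def rate_limit_request(token: str = "") -> urllib.request.Request:
    """Build (but do not send) a request for the current rate-limit status."""
    return urllib.request.Request(f"{API_ROOT}/rate_limit", headers=github_headers(token))


# To send it: urllib.request.urlopen(rate_limit_request("ghp_yourtoken"))
```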

Pricing (Pay-Per-Result)

$0.015 per repository ($15 per 1,000 repositories)

Example Cost Calculation:

Fetching 1,000 repositories:

Repository metadata: 1,000 × $0.015 = $15.00

💡 No browser costs, no proxy costs - just lightweight API calls!
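
The same calculation for other batch sizes, as a small sketch. The default rate is the per-repository price quoted in this section; `estimate_cost` is a hypothetical helper, and a different rate can be passed if the listed price differs.

```python
from decimal import Decimal


def estimate_cost(repo_count: int, price_per_repo: str = "0.015") -> Decimal:
    """Return the pay-per-result cost in dollars, rounded to cents."""
    if repo_count < 0:
        raise ValueError("repo_count must not be negative")
    return (Decimal(price_per_repo) * repo_count).quantize(Decimal("0.01"))
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.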

Best Practices

Search Optimization

Use Specific Queries: prefer language:python stars:>1000 over a bare keyword such as python
Filter by Activity: pushed:>2024-01-01 for active projects
Combine Criteria: Use stars, language, topics together
Sort Strategically: stars for popular, updated for active

README Extraction for AI/LLM

Enable README Fetching: Always set includeReadme: true
Filter Quality: Focus on repos with stars:>100
Language Filtering: Target specific tech stacks
Documentation Rich: Search for topic:documentation
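
Before README text goes into a vector database it usually has to be split into chunks. The following is one simple way to do that, assuming the README arrives as a plain string in `readme.content`; the `chunk_readme` name and the 1,000-character default are illustrative choices, not something the actor provides.

```python
def chunk_readme(content: str, max_chars: int = 1000) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in content.split("\n\n")):
        if not paragraph:
            continue
        # A paragraph longer than the limit is cut into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps headings together with the paragraph that follows them when both fit in one chunk; a token-based splitter may be a better fit for a specific embedding model.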

Rate Limit Management

Use Authentication: Get a GitHub token for 5,000 requests/hour
Batch Requests: Plan your searches to minimize API calls
Monitor Limits: Check rate limit in actor logs
Schedule Runs: Spread large jobs across hours
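
When calling the GitHub API directly alongside the actor, the same limits can be tracked from the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that GitHub returns. A minimal sketch follows; `seconds_until_reset` is a hypothetical helper, and whether the actor pauses this way internally is not stated on this page.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return 0.0 while requests remain, else the seconds until the limit resets."""
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered.get("x-ratelimit-reset", "0"))  # Unix timestamp in seconds
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A caller could `time.sleep()` for the returned number of seconds before sending the next request.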

FAQ

Q: Is this legal? A: Yes! Uses GitHub's official REST API with proper permissions.

Q: Do I need a GitHub account? A: No for basic usage (60 requests/hour). Yes for higher limits (5,000 requests/hour with token).

Q: What's the rate limit without a token? A: 60 requests per hour (unauthenticated). 5,000 with a token.

Q: Can I extract private repositories? A: Only public repositories. Private repos require different permissions.

Q: How do I get README content for AI training? A: Set includeReadme: true and use search mode to find relevant repositories.

Q: Can I search by multiple languages? A: Use language:python OR language:javascript in the search query.

Q: What happens if the rate limit is exceeded? A: The actor logs a warning. Add a GitHub token to increase the limit.

Why Use This Actor?

| Feature | This Actor (GitHub API) | Web Scraping |
| --- | --- | --- |
| Legal | ✅ Official API | ❌ Violates ToS |
| Stable | ✅ API rarely changes | ❌ HTML breaks often |
| Fast | ✅ Direct API calls | ❌ Browser overhead |
| Cost | ✅ $15 per 1k repos | ❌ $30+ per 1k |
| Authentication | ✅ Optional (higher limits) | ❌ Complex login |
| README Access | ✅ Direct API endpoint | ❌ Requires parsing |
| Maintenance | ✅ Minimal | ❌ Constant updates |

Output Use Cases

AI/LLM Training

Feed README content into vector databases
Build code documentation datasets
Create programming Q&A pairs
Extract technical writing samples

Developer Tools

Tech stack analysis
Framework popularity tracking
Library comparison
Documentation aggregation

Business Intelligence

Competitor monitoring
Technology trend analysis
Open-source landscape mapping
Developer ecosystem research

Legal & Ethics

✅ Legal Compliance:

Official API: Uses GitHub REST API v3 with proper authentication
Public Data Only: Accesses only publicly available repositories
Rate Limits: Respects GitHub's rate limiting
Terms of Service: Complies with GitHub's API ToS
No Scraping: No HTML parsing or browser automation

This actor is 100% legal and ethical - uses official GitHub API with proper permissions.

Support

Need help? Have questions?

📧 Support: support@apify.com
📚 Apify Docs: https://docs.apify.com
💬 Community: https://discord.com/invite/jyEM2PRvMU
🐙 GitHub API Docs: https://docs.github.com/en/rest

Built with ❤️ using Apify and GitHub REST API

Perfect for:

🤖 AI/LLM training data collection
📊 Developer research and analytics
💼 Competitive intelligence
🔍 Technology trend analysis
📚 Documentation aggregation
