GitHub Scraper: Extract Trending Repos, Stars, Forks & Leads
Pricing
$6.99/month + usage
Stop manual tracking! Extract trending repos, star counts, forks, and repository details from GitHub in seconds. Perfect for recruiters finding top talent or researchers tracking tech trends. Get clean, structured data ready for your CRM or spreadsheet. Fast, reliable, and dev-friendly!
Developer: Scrape Pilot
# GitHub Repository Scraper | Search Repos, Issues, Commits & User Data
Scrape GitHub repositories, issues, commits, and user profiles using the official GitHub API. Search by keyword, language, or username; sort by stars, forks, or activity. No login required.
## 📋 Table of Contents
- What Does This Actor Do?
- Quick Start: 3 Steps
- Why This GitHub Scraper?
- Use Cases
- 4 Scraping Modes
- What Data You Get
- GitHub Token: Free vs Authenticated
- How It Works
- Input Parameters
- Example Input & Output
- Performance & Speed
- Cost Estimate
- Limitations
- Integrations
- FAQ
- Changelog
- Legal & Terms
## 🔍 What Does This Actor Do?
This GitHub repository scraper pulls structured data from the official GitHub API across four modes (repository search, repository details with optional issues/commits extraction, user repositories, and a default top-repositories pull) and returns clean JSON output ready for analysis, monitoring, or integration.
Run it once for a quick data pull or schedule it for ongoing GitHub monitoring:
- ✅ Search GitHub repositories by keyword, language, stars, or forks
- ✅ Get full repository details: stars, forks, topics, license, language, dates, README URL
- ✅ Scrape a user's repositories: all public repos for any GitHub username
- ✅ Extract issues: open, closed, and pull requests with labels and body text
- ✅ Extract commits: message, author, date, additions, and deletions
No login. No browser. Uses the official GitHub API directly: fast, reliable, and structured.
## ⚡ Quick Start: 3 Steps
### Step 1: Choose your mode and configure
```json
{
  "query": "machine learning",
  "language_filter": "python",
  "sort": "stars",
  "max_results": 30
}
```
### Step 2: Click Run
The actor queries the GitHub API, parses every repository record, and pushes structured JSON to the Dataset.
### Step 3: Get your GitHub data
```json
{
  "type": "repository",
  "name": "tensorflow/tensorflow",
  "stars": 183000,
  "forks": 74200,
  "language": "Python",
  "topics": ["machine-learning", "deep-learning", "neural-network"],
  "license": "Apache License 2.0",
  "created_at": "2015-11-07",
  "updated_at": "2024-10-30",
  "open_issues": 1842,
  "url": "https://github.com/tensorflow/tensorflow"
}
```
Your GitHub repository data is in the Dataset tab; export as JSON, CSV, or Excel instantly.
## 🏆 Why This GitHub Scraper?
| Feature | This Actor | Manual GitHub Search | GitHub API DIY |
|---|---|---|---|
| Repository search by keyword + language | ✅ Built-in | ⚠️ Limited UI | ✅ But requires coding |
| Issues extraction (open + closed + PRs) | ✅ All states | ❌ | ✅ But requires coding |
| Commit history with additions/deletions | ✅ Per commit | ❌ | ✅ But requires coding |
| User repository listing | ✅ Any username | ⚠️ Manual only | ✅ But requires coding |
| Sort by stars, forks, or activity | ✅ Built-in | ⚠️ Limited | ✅ |
| Language filter | ✅ Built-in | ⚠️ Limited | ✅ |
| Token support for 5,000 req/hr | ✅ Optional | ❌ | ✅ |
| README URL in output | ✅ Auto-generated | ❌ | ❌ |
| Structured JSON for all 4 data types | ✅ | ❌ | Requires custom parser |
| No coding required | ✅ | ✅ | ❌ |
All the power of the GitHub API, without writing a single line of code. Repository search, user data, issues, and commits in one actor.
## 🎯 Use Cases
### 🔬 Developer & Technology Research
Search GitHub repositories by keyword and language to map the open-source ecosystem in your technology area. Sort by stars to find the most widely adopted tools, frameworks, and libraries in any programming language.
### 📊 Competitive Intelligence for Dev Tools
Scrape star counts, fork activity, open issue counts, and update frequency for competitor or adjacent open-source projects. Track which tools are gaining momentum and which are going stale.
### 🤝 Open Source Contributor Discovery
List all repositories for a GitHub username or organization. Identify active contributors, check commit frequency, and assess project health before partnering, sponsoring, or integrating.
### 🐛 Issue Tracking & Bug Analysis
Extract all open issues from a repository, including labels, author, creation date, and full body text. Build a structured issue database for triage, prioritization, or customer feedback analysis.
### 📈 Commit History & Development Velocity
Pull commit logs from any public repository with message, author, date, additions, and deletions. Measure development velocity, identify active contributors, and analyze commit patterns over time.
### 🎓 Academic & Data Science Research
Build datasets of GitHub repositories, contributors, or issues for research in software engineering, open-source collaboration, or developer productivity. The structured JSON output is ready for pandas, R, or any analysis tool.
### 💼 Investor & VC Technology Due Diligence
Evaluate open-source developer traction for startups by scraping star growth trajectory, fork count, contributor diversity, and issue response rate, all from public GitHub data.
### 🧑‍💻 Talent & Recruitment Intelligence
Find active developers in a specific language or technology area by scraping popular repositories and their contributors. Identify prolific contributors to high-star projects as potential candidates.
### 📰 Tech Journalism & Trend Reports
Pull the top repositories by stars for any keyword or language to support articles on trending technologies, rising frameworks, or the most popular developer tools of the year.
## 🔀 4 Scraping Modes
This GitHub scraper operates in four distinct modes, selected automatically based on which input parameters you provide:
### Mode 1: Repository Search (`query`)
Search GitHub's repository index by keyword with an optional language filter and sort order. Returns the top matching repositories ranked by stars, forks, or recent activity.
```json
{ "query": "react dashboard", "language_filter": "javascript", "sort": "stars" }
```
### Mode 2: Repository Details + Issues/Commits (`repo` + `mode`)
Provide a specific `owner/repo` path to get full details for that repository. Add `"mode": "issues"` or `"mode": "commits"` to extract issues or commit history instead of repo metadata.
```json
{ "repo": "facebook/react", "mode": "issues" }
```
### Mode 3: User Repositories (`username`)
List all public repositories for any GitHub username or organization, sorted by most recently updated.
```json
{ "username": "torvalds" }
```
### Mode 4: Default Top Repositories (no input)
Run with no input to get the top repositories on GitHub by star count (`stars:>1000`). Useful for a quick pulse check on GitHub's most popular projects.
```json
{ "max_results": 50, "sort": "stars" }
```
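The mode-selection precedence can be pictured as a small function. This is an illustrative sketch of the documented priority (`repo`, then `username`, then `query`, then the default), not the actor's actual source; the returned mode names are hypothetical labels:

```javascript
// Sketch of the documented mode precedence -- names are illustrative.
function detectMode(input) {
  if (input.repo) {
    // "mode" only applies when a repo is given; default is repo details.
    return input.mode === "issues" || input.mode === "commits"
      ? input.mode
      : "repo_details";
  }
  if (input.username) return "user_repos"; // Mode 3
  if (input.query) return "search";        // Mode 1
  return "top_repos";                      // Mode 4: stars:>1000 search
}
```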
## 📊 What Data You Get
### Repository Fields
| Field | Type | Description | Example |
|---|---|---|---|
| `type` | string | Record type identifier | `"repository"` |
| `name` | string | Full repo name (owner/repo) | `"tensorflow/tensorflow"` |
| `description` | string | Repository description | `"An Open Source ML library"` |
| `url` | string | GitHub repository URL | `"https://github.com/..."` |
| `stars` | integer | Star count | `183000` |
| `forks` | integer | Fork count | `74200` |
| `watchers` | integer | Watcher count | `183000` |
| `language` | string | Primary programming language | `"Python"` |
| `topics` | array | Repository topic tags | `["machine-learning", "deep-learning"]` |
| `license` | string | License name | `"Apache License 2.0"` |
| `created_at` | string | Creation date (YYYY-MM-DD) | `"2015-11-07"` |
| `updated_at` | string | Last metadata update | `"2024-10-30"` |
| `pushed_at` | string | Last code push date | `"2024-10-30"` |
| `open_issues` | integer | Open issue count | `1842` |
| `size_kb` | integer | Repository size in KB | `1284200` |
| `default_branch` | string | Default branch name | `"main"` |
| `is_fork` | boolean | Whether this is a fork | `false` |
| `homepage` | string | Project homepage URL | `"https://tensorflow.org"` |
| `owner` | string | Repository owner username | `"tensorflow"` |
| `owner_type` | string | Owner type | `"Organization"` |
| `readme_url` | string | Direct raw README.md URL | `"https://raw.githubusercontent.com/..."` |
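For illustration, `readme_url` follows the standard raw-content URL pattern built from the repo's full name and default branch. This helper is a hypothetical reconstruction of that pattern, not the actor's code:

```javascript
// Hypothetical helper: rebuild the raw README URL the way the
// readme_url field appears to be derived (full name + default branch).
function buildReadmeUrl(fullName, defaultBranch) {
  return `https://raw.githubusercontent.com/${fullName}/${defaultBranch}/README.md`;
}
```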
### Issue Fields
| Field | Type | Description | Example |
|---|---|---|---|
| `type` | string | Record type | `"issue"` |
| `number` | integer | Issue number | `1234` |
| `title` | string | Issue title | `"Bug: memory leak in..."` |
| `author` | string | GitHub username of issue author | `"octocat"` |
| `state` | string | Issue state | `"open"` / `"closed"` |
| `body` | string | Full issue description (up to 3,000 chars) | `"When running..."` |
| `labels` | array | Issue labels | `["bug", "help wanted"]` |
| `created_at` | string | Creation date | `"2024-09-15"` |
| `closed_at` | string | Closure date (if closed) | `"2024-10-01"` |
| `comments` | integer | Number of comments | `12` |
| `url` | string | Issue URL | `"https://github.com/.../issues/1234"` |
| `is_pr` | boolean | Whether this is a pull request | `false` |
### Commit Fields
| Field | Type | Description | Example |
|---|---|---|---|
| `type` | string | Record type | `"commit"` |
| `sha` | string | First 8 chars of commit hash | `"a1b2c3d4"` |
| `message` | string | Commit message (up to 500 chars) | `"Fix: resolve null pointer in..."` |
| `author` | string | Commit author name or username | `"gvanrossum"` |
| `date` | string | Commit date (YYYY-MM-DD) | `"2024-10-15"` |
| `url` | string | GitHub commit URL | `"https://github.com/.../commit/..."` |
| `additions` | integer | Lines added | `47` |
| `deletions` | integer | Lines deleted | `12` |
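A quick example of putting the commit fields to work downstream, e.g. measuring churn and net line change across a run's commit records. The field names match the table above; the helper itself is ours:

```javascript
// Summarize a run's commit records: total churn (lines touched)
// and net line change, using the additions/deletions fields.
function summarizeCommits(commits) {
  return commits.reduce(
    (acc, c) => ({
      churn: acc.churn + c.additions + c.deletions,
      net: acc.net + c.additions - c.deletions,
    }),
    { churn: 0, net: 0 }
  );
}
```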
## 🔑 GitHub Token: Free vs Authenticated
The GitHub API has two rate limit tiers:
| Mode | Rate Limit | Best For |
|---|---|---|
| No token (default) | 60 requests/hour | Quick lookups, small datasets, testing |
| With GitHub token | 5,000 requests/hour | Large searches, bulk issues/commits, production runs |
Without a token, the actor can comfortably handle up to 30 repository search results, a single repo's details, or about 50–60 issues/commits before hitting the hourly limit.
With a token, the actor handles thousands of records per run without interruption, ideal for extracting full issue histories, large commit logs, or bulk repository searches.
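As a rough planning aid, you can sanity-check whether a run fits the hourly budget by assuming about one API request per page of 100 records (actual request counts vary by mode, so treat this as an estimate, not the actor's accounting):

```javascript
// Rough check: does a planned pull fit in the hourly rate limit?
// Assumes ~1 request per page of 100 records (an approximation).
function fitsRateLimit(maxResults, hasToken) {
  const limit = hasToken ? 5000 : 60; // requests per hour
  const requests = Math.ceil(maxResults / 100);
  return requests <= limit;
}
```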
### How to Get a Free GitHub Token
1. Go to github.com/settings/tokens
2. Click Generate new token (classic)
3. Select no scopes (public data needs zero permissions)
4. Copy the token and paste it into the `github_token` input field
Your token is used only for this actor run and is never stored or shared.
## ⚙️ How It Works
### Step 1: Mode Detection
The actor detects which mode to run based on your input: `repo` (specific repository), `username` (user repos), `query` (search), or no input (top repos by stars).
### Step 2: GitHub API Request
All requests go directly to `api.github.com` using the official REST API v3. Repository search uses `/search/repositories` with Lucene-style query syntax. Issues use `/repos/{owner}/{repo}/issues`. Commits use `/repos/{owner}/{repo}/commits`.
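The URL construction can be sketched as follows. `searchUrl` and `issuesUrl` are illustrative names of ours; the `language:` qualifier is GitHub's documented search syntax:

```javascript
// Illustrative endpoint builders for the REST API v3 paths named above.
const API = "https://api.github.com";

function searchUrl(query, language) {
  let q = query;
  if (language) q += ` language:${language}`; // GitHub search qualifier
  return `${API}/search/repositories?q=${encodeURIComponent(q)}`;
}

function issuesUrl(repo) {
  // repo is an "owner/repo" path, e.g. "facebook/react"
  return `${API}/repos/${repo}/issues`;
}
```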
### Step 3: Pagination
For issues and commits, the actor automatically paginates through results (100 per page) until `max_results` is reached or the endpoint returns fewer results than the page size.
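The pagination rule above can be sketched as a loop. `fetchPage` is an injected stand-in for the real API call, so this shows the shape of the logic rather than the actor's code:

```javascript
// Paginate until max_results is reached or a short page signals the end.
// fetchPage(page, perPage) stands in for the real GitHub API request.
async function paginate(fetchPage, maxResults, perPage = 100) {
  const records = [];
  for (let page = 1; records.length < maxResults; page++) {
    const batch = await fetchPage(page, perPage);
    records.push(...batch);
    if (batch.length < perPage) break; // short page = no more data
  }
  return records.slice(0, maxResults);
}
```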
### Step 4: Parsing & Output
Every record is parsed into a consistent typed structure. The `type` field (`"repository"`, `"issue"`, or `"commit"`) identifies the record kind so mixed-mode datasets can be easily filtered downstream.
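Downstream, that `type` discriminator makes a mixed export trivial to split, for example:

```javascript
// Split a mixed dataset export into per-type buckets using the
// "type" field present on every record.
function splitByType(records) {
  const out = { repository: [], issue: [], commit: [] };
  for (const r of records) (out[r.type] ??= []).push(r);
  return out;
}
```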
## ⚙️ Input Parameters
```json
{
  "query": "web scraping",
  "repo": "",
  "username": "",
  "mode": "search",
  "language_filter": "python",
  "sort": "stars",
  "max_results": 30,
  "github_token": "ghp_xxxxxxxxxxxx",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | `""` | Keyword search for GitHub repositories (e.g., `"machine learning"`, `"react dashboard"`) |
| `repo` | string | `""` | Specific repository in `owner/repo` format (e.g., `"facebook/react"`) |
| `username` | string | `""` | GitHub username or organization to list all public repos |
| `mode` | string | `"search"` | Used when `repo` is set: `"search"` (repo details), `"issues"` (issue list), or `"commits"` (commit history) |
| `language_filter` | string | `""` | Filter search results by programming language (e.g., `"python"`, `"javascript"`, `"rust"`) |
| `sort` | string | `"stars"` | Sort repository search results by `"stars"`, `"forks"`, or `"updated"` |
| `max_results` | integer | `30` | Maximum number of records to return |
| `github_token` | string | `""` | Optional GitHub personal access token; raises the rate limit from 60 to 5,000 req/hr |
| `proxyConfiguration` | object | Off | Apify proxy configuration |
## 📦 Example Input & Output
### Example 1: Search Top Python ML Repositories
Input:
```json
{
  "query": "machine learning",
  "language_filter": "python",
  "sort": "stars",
  "max_results": 10
}
```
Output (one record):
```json
{
  "type": "repository",
  "name": "scikit-learn/scikit-learn",
  "description": "scikit-learn: machine learning in Python",
  "url": "https://github.com/scikit-learn/scikit-learn",
  "stars": 58200,
  "forks": 25100,
  "watchers": 58200,
  "language": "Python",
  "topics": ["machine-learning", "python", "statistics", "data-science"],
  "license": "BSD 3-Clause New or Revised License",
  "created_at": "2010-08-17",
  "updated_at": "2024-10-30",
  "pushed_at": "2024-10-30",
  "open_issues": 1892,
  "size_kb": 135600,
  "default_branch": "main",
  "is_fork": false,
  "homepage": "https://scikit-learn.org",
  "owner": "scikit-learn",
  "owner_type": "Organization",
  "readme_url": "https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/README.md"
}
```
### Example 2: Scrape All Issues from a Repository
Input:
```json
{
  "repo": "vercel/next.js",
  "mode": "issues",
  "max_results": 50
}
```
Output (one record):
```json
{
  "type": "issue",
  "number": 68421,
  "title": "Image optimization fails on custom domain with CORS headers",
  "author": "dev_user_123",
  "state": "open",
  "body": "When deploying to a custom domain with strict CORS headers...",
  "labels": ["bug", "needs triage"],
  "created_at": "2024-10-18",
  "updated_at": "2024-10-25",
  "closed_at": null,
  "comments": 7,
  "url": "https://github.com/vercel/next.js/issues/68421",
  "is_pr": false,
  "repo": "vercel/next.js"
}
```
### Example 3: List All Repositories for a User
Input:
```json
{ "username": "torvalds", "max_results": 20 }
```
Output: All public repositories for the given username, sorted by most recently updated, each with full star, fork, language, and date metadata.
### Example 4: Commit History with Code Changes
Input:
```json
{ "repo": "django/django", "mode": "commits", "max_results": 30 }
```
Output (one record):
```json
{
  "type": "commit",
  "sha": "a1b2c3d4",
  "message": "Fixed #35123 -- Corrected queryset evaluation in async context",
  "author": "django-core",
  "date": "2024-10-28",
  "url": "https://github.com/django/django/commit/a1b2c3d4e5f6...",
  "additions": 23,
  "deletions": 8,
  "repo": "django/django"
}
```
## ⚡ Performance & Speed
| Mode | Records | Estimated Time |
|---|---|---|
| Repository search, 30 results | 30 repos | ~10–20 seconds |
| Repository search, 100 results | 100 repos | ~20–40 seconds |
| Issues, 50 records | 50 issues | ~15–30 seconds |
| Commits, 100 records | 100 commits | ~20–40 seconds |
| User repos, 50 repositories | 50 repos | ~15–25 seconds |
All modes include a 0.5–1 second delay between paginated requests to stay within GitHub API rate limits. With a GitHub token, rate limits are rarely a concern even for large runs.
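The timing figures above follow from pages × (request time + delay). Here's a back-of-the-envelope estimator; the ~0.4 s per-request time is our assumption, not a measured figure:

```javascript
// Rough runtime estimate: pages of 100, a courtesy delay between pages,
// plus an assumed per-request time (0.4 s is a guess, not measured).
function estimateSeconds(records, requestSeconds = 0.4, delaySeconds = 0.75) {
  const pages = Math.ceil(records / 100);
  return pages * requestSeconds + Math.max(0, pages - 1) * delaySeconds;
}
```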
## 💰 Cost Estimate
Subscription: $6.99/month · Free trial: 2 hours (no credit card required)
| Run Type | Apify Compute Units | Approx. Compute Cost |
|---|---|---|
| Search, 30 repos | ~0.01–0.02 CU | < $0.01 |
| Issues, 100 records | ~0.02–0.04 CU | < $0.01 |
| Commits, 100 records | ~0.02–0.04 CU | < $0.01 |
| Large run, 500 records | ~0.1–0.2 CU | ~$0.01 |
| Scheduled daily (30-day month) | ~0.5–1.5 CU/month | ~$0.04–$0.12 |
This is one of the most compute-efficient actors on Apify: GitHub's API is fast and the actor has minimal overhead. The $6.99 subscription covers unlimited runs with negligible compute costs for most users.
## ⚠️ Limitations
Being transparent about what this actor cannot do:
- ❌ Private repositories: Only publicly accessible GitHub repositories, issues, and commits are supported. Private repos require authentication that this actor does not support.
- ❌ 60 req/hr without a token: Without a GitHub token, the free rate limit allows roughly 50–60 records before hitting the hourly cap. For larger pulls, always add a token.
- ❌ More than 1,000 search results: GitHub's search API returns a maximum of 1,000 results per query, regardless of `max_results`. For broader coverage, run multiple searches with different keywords.
- ❌ Commit diff/patch content: Commit addition and deletion counts are included, but full code diffs are not extracted (these require one API call per commit and would exhaust rate limits quickly).
- ❌ GitHub Actions, releases, or packages: This actor covers repositories, issues, and commits only. Actions workflow data, releases, and package registries are not included in v1.
- ❌ Real-time webhook data: This actor fetches data on demand or on a schedule. It is not a live event stream.
- ❌ GraphQL API: This actor uses GitHub's REST API v3. GraphQL-only data fields are not available.
## 🔌 Integrations
### Google Sheets: Repository Tracker
Export repository data to Google Sheets after each run. Track star growth, fork counts, and update frequency across your watchlist of repositories over time.
### Airtable: Issue Database
Push GitHub issues into Airtable with label, state, author, and body fields. Build a structured issue tracker or customer feedback database from public GitHub repositories.
### Apify API: Programmatic Integration
```javascript
// Trigger a GitHub scraper run via the Apify API
const run = await fetch("https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_APIFY_TOKEN"
  },
  body: JSON.stringify({
    query: "kubernetes",
    language_filter: "go",
    sort: "stars",
    max_results: 50,
    github_token: "ghp_xxxxxxxxxxxx"
  })
});
```
### Scheduled Monitoring
Use Apify's built-in Schedule feature to run weekly repository searches. Track rising star counts, new topics, and activity changes in your technology area automatically.
### n8n / Make / Zapier
Connect this GitHub scraper to downstream tools. Push new high-star repositories to a Slack channel, sync issues into a project management tool, or trigger alerts when a competitor repository crosses a star threshold.
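As one concrete trigger condition for such an automation, here is a hypothetical predicate that flags repositories crossing a star threshold between runs (field names match the repository output schema; the snapshot bookkeeping is up to your workflow tool):

```javascript
// Flag repository records that crossed a star threshold since the
// last run. previousStars maps repo name -> star count from the
// prior snapshot (how you store that snapshot is up to you).
function crossedThreshold(records, previousStars, threshold) {
  return records.filter(
    (r) => r.stars >= threshold && (previousStars[r.name] ?? 0) < threshold
  );
}
```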
## ❓ FAQ
Q: Do I need a GitHub account to use this actor?
A: No. The GitHub API allows unauthenticated access at 60 requests/hour. You can run small queries without any account or token. For larger pulls, a free GitHub token raises the limit to 5,000 req/hr.
Q: How do I get a GitHub token?
A: Go to github.com/settings/tokens, generate a classic token with no scopes selected (public data needs zero permissions), and paste it into the `github_token` input field. It's free and takes under a minute.
Q: What is the maximum number of search results I can get?
A: GitHub's search API caps results at 1,000 per query. If you need more, run multiple searches with different keywords or language filters and combine the results.
Q: Can I get issues AND repo details in one run?
A: Not in one run; each run targets one mode. Run twice: once with `"mode": "search"` for repo details, and once with `"mode": "issues"` for issue data. Both outputs can be exported and joined by repo name.
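Joining the two exports can be as simple as keying issues by their `repo` field (present in the issue output, as in Example 2) against the `name` field of the details export; a minimal sketch:

```javascript
// Attach issue records to their repository record, keyed by repo name.
// repos: details export (has "name"); issues: issues export (has "repo").
function attachIssues(repos, issues) {
  const byRepo = new Map(repos.map((r) => [r.name, { ...r, issues: [] }]));
  for (const i of issues) byRepo.get(i.repo)?.issues.push(i);
  return [...byRepo.values()];
}
```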
Q: Can I scrape repositories for a GitHub organization?
A: Yes. Use `"username": "ORG_NAME"`; the actor handles both personal accounts and organizations identically.
Q: What does `readme_url` in the output give me?
A: It's a direct URL to the raw README.md file for that repository on the default branch. You can use this URL to fetch and read the README content in a downstream step.
Q: Why does the actor slow down on large issue/commit pulls?
A: GitHub's API requires pagination (100 records per page). A 0.5–1 second delay between pages is built in to avoid rate limiting. For 500 issues, expect 5 pages × ~0.75 s delay ≈ 4 seconds of wait time on top of request time.
Q: Is the is_pr field reliable?
A: Yes. GitHub's issues API returns pull requests in the same feed. The `is_pr` field is `true` when the record is a pull request and `false` when it is a standard issue, letting you filter accurately.
## 📝 Changelog
### v1.0.0 (Current)
- ✅ Repository search via GitHub `/search/repositories` with keyword, language filter, and sort
- ✅ Repository detail mode for specific `owner/repo` paths
- ✅ Issues mode: all states (open, closed, PRs) with labels, body, and comment count
- ✅ Commits mode: message, author, date, additions, and deletions per commit
- ✅ User/organization repository listing sorted by most recently updated
- ✅ Optional GitHub token support: 5,000 req/hr vs 60 req/hr unauthenticated
- ✅ Automatic pagination for issues and commits
- ✅ `type` field on every record for easy downstream filtering
- ✅ `readme_url` auto-generated for every repository record
- ✅ 0.5–1 second delay between paginated requests for rate-limit safety
- ✅ `403` rate-limit detection with a clear log message prompting token addition
- ✅ Proxy support via `curl_cffi` Chrome 110 impersonation
- 🔜 Coming next: stars/forks history, releases, GitHub Actions data
## ⚖️ Legal & Terms
This actor accesses publicly available data from the official GitHub REST API (api.github.com), the same data visible to any user browsing GitHub without logging in.
Please use responsibly:
- Only scrape public repositories, issues, and commits
- Respect GitHub's Terms of Service and API Rate Limits
- Do not use extracted data to build unauthorized competing developer tools or data products for resale without consent
- Issue and commit body text may contain personal information; handle with appropriate care under GDPR and applicable data protection laws
- This actor is intended for research, analytics, competitive intelligence, and legitimate developer workflows
## 🤝 Support
- Rate limit errors? Add a `github_token`; it raises your limit from 60 to 5,000 req/hr instantly
- Need releases, Actions, or GraphQL data? Drop a feature request on the Apify actor page
- Works well for your research? A ⭐ review on the Apify Store helps others find this GitHub scraper and keeps it actively maintained
GitHub Repository Scraper · Built on Apify
Repositories · Issues · Commits · Users · Official API · No Login · Token Optional