GitHub Scraper — Repos, Issues, PRs & Code
Pricing
Pay per event
Scrape GitHub deeply — repos, issues, PRs, code search, contributors, releases, READMEs, commits, users, trending. 11 modes in one actor for AI coding agents (Claude Code, Cursor, Copilot). Optional PAT for 5K req/hr. MCP-ready, flat JSON output.
Developer
Khadin Akbar
Maintained by Community
GitHub Scraper is the GitHub data layer for AI coding agents. One actor, 11 modes, official GitHub REST API (v3). Built for Claude Code, Cursor, GitHub Copilot, Aider, and any agent that needs rich GitHub context in a single tool call instead of stitching five narrow scrapers together.
What it does
Scrapes GitHub deeply through 11 selectable modes:
| Mode | Returns | Required input |
|---|---|---|
| `repo` | Full repo metadata: 50+ fields (stars, forks, languages, topics, license, default branch, latest release) | `repo` |
| `repo-search` | Repos matching a search query; supports full GitHub search qualifier syntax (`language:`, `stars:>1000`, `user:`, `topic:`) | `query` |
| `issues` | Issues with labels, assignees, milestones, optional comment thread | `repo` |
| `prs` | Pull requests with reviewers, files changed, mergeable status, optional reviews + comments | `repo` |
| `code-search` | Code search across all of GitHub: file path, repo, sha, text matches (requires `GITHUB_TOKEN`) | `query` |
| `contributors` | Contributors with login, contribution count, full profile (email, company, location, bio) | `repo` |
| `releases` | Releases with assets, download counts, release notes, prerelease flag | `repo` |
| `readme` | Full README: raw markdown + GFM-rendered HTML | `repo` |
| `commits` | Commit history with author, message, SHA, parents, optional file diffs + stats | `repo` |
| `user` | User or organization profile + their repos + organizations + social accounts | `user` |
| `trending` | Trending repos by language and timeframe (daily / weekly / monthly) | none |
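
The mode table above maps naturally onto a small pre-flight check on the caller's side. The sketch below is illustrative only: `REQUIRED_INPUT` and `build_input` are hypothetical helper names, not part of the actor, and the mapping is simply transcribed from the table.

```python
# Required input field per mode, transcribed from the mode table.
# (Hypothetical client-side helper; the actor does its own validation.)
REQUIRED_INPUT = {
    "repo": "repo",
    "repo-search": "query",
    "issues": "repo",
    "prs": "repo",
    "code-search": "query",
    "contributors": "repo",
    "releases": "repo",
    "readme": "repo",
    "commits": "repo",
    "user": "user",
    "trending": None,  # no required input
}


def build_input(mode: str, **options) -> dict:
    """Return an actor input dict, failing early if the mode's required field is missing."""
    if mode not in REQUIRED_INPUT:
        raise ValueError(f"unknown mode: {mode!r}")
    required = REQUIRED_INPUT[mode]
    if required is not None and not options.get(required):
        raise ValueError(f"mode {mode!r} requires the {required!r} input")
    return {"mode": mode, **options}
```

Calling `build_input("repo", repo="facebook/react")` yields the same shape as the first example input further down, while `build_input("issues")` raises before any run is started.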
When to use it
- AI coding agents that need to ground answers in real repo state — issue history, PR reviews, contributor expertise, recent commits.
- OSS research — analyze hundreds of repos for tech stack, activity, bus factor, dependency drift.
- DevRel sourcing — find maintainers, contributors, and active issue commenters for partnership outreach.
- Recruiter pipelines — identify high-signal devs by contribution patterns and language depth.
- Competitive intelligence — track competitor open-source releases, issue volume, PR velocity.
When not to use it
- Private repositories or GitHub Enterprise Server (use the official `gh` CLI with auth instead).
- Real-time GitHub events: use GitHub webhooks for that.
- LinkedIn-style enrichment of GitHub users: see `linkedin-profile-email-scraper` for that.
Price
- `apify-actor-start`: $0.00005 per run start.
- `result`: $0.005 per record returned for `repo`, `repo-search`, `issues` (without comments), `prs` (without reviews/comments), `contributors`, `releases`, `readme`, `user`, `trending`.
- `deep-result`: $0.01 per record for heavier modes: `code-search`, `commits` with `includeFiles=true`, PRs with `includeReviews=true` or `includeComments=true`, issues with `includeComments=true`.
A typical agent call (50 repos in repo-search) costs about $0.25 (50 × $0.005). A deep run (100 PRs with reviews + comments) costs about $1.00 (100 × $0.01). The actor stops at maxResults (default 50, hard cap 1000), so even a maxed-out deep run costs about $10 (1,000 × $0.01), the x402 default prepay limit for agentic payments.
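The arithmetic behind those figures can be sketched as a worst-case estimator. This is a rough illustration that assumes the event prices and the deep/standard split exactly as listed in this section; `is_deep` and `estimate_max_cost` are hypothetical names, and actual billing depends on how many records a run really returns.

```python
from decimal import Decimal

START_FEE = Decimal("0.00005")     # apify-actor-start, charged once per run
RESULT_FEE = Decimal("0.005")      # result, per record
DEEP_RESULT_FEE = Decimal("0.01")  # deep-result, per record
DEFAULT_MAX_RESULTS = 50
HARD_CAP = 1000


def is_deep(run_input: dict) -> bool:
    """True if the input falls under the deep-result price per the list above."""
    mode = run_input.get("mode")
    if mode == "code-search":
        return True
    if mode == "commits":
        return bool(run_input.get("includeFiles"))
    if mode == "prs":
        return bool(run_input.get("includeReviews") or run_input.get("includeComments"))
    if mode == "issues":
        return bool(run_input.get("includeComments"))
    return False


def estimate_max_cost(run_input: dict) -> Decimal:
    """Upper bound in USD, assuming the run returns maxResults records."""
    records = min(int(run_input.get("maxResults", DEFAULT_MAX_RESULTS)), HARD_CAP)
    per_record = DEEP_RESULT_FEE if is_deep(run_input) else RESULT_FEE
    return START_FEE + records * per_record
```

For the 50-repo `repo-search` call this gives 0.25005, and for 100 PRs with `includeReviews` it gives 1.00005; the extra 0.00005 is the run-start event.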
Authentication
Without a token, GitHub's REST API allows 60 requests/hour. With a token: 5,000/hour. Code search requires authentication.
Set the GITHUB_TOKEN environment variable in Apify Console → Settings → Environment variables → Add → Secret:
- Create a fine-grained Personal Access Token at https://github.com/settings/tokens?type=beta
- Under Repository access, choose "Public Repositories (read-only)"; that's enough for everything public. (`public_repo` is a classic-token scope and is not needed here.)
- Paste the token as `GITHUB_TOKEN` in the actor's environment variables. Apify masks it automatically.
The actor never logs the token — apify/log auto-redacts.
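Before storing a token, it can help to confirm locally that it actually lifts the limit. The sketch below is not the actor's code: it is a minimal standard-library check against GitHub's `/rate_limit` endpoint, and `github_request` and `fetch_core_limit` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def github_request(path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GitHub REST request; the Authorization header is added only if a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}{path}", headers=headers)


def fetch_core_limit(token: Optional[str] = None) -> dict:
    """Return the 'core' rate-limit bucket (limit, remaining, reset) for this caller."""
    with urllib.request.urlopen(github_request("/rate_limit", token), timeout=10) as resp:
        return json.load(resp)["resources"]["core"]
```

With a valid token, `fetch_core_limit(token)["limit"]` should report the authenticated quota rather than the anonymous one.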
Example inputs
Full metadata for one repo
{"mode": "repo","repo": "facebook/react"}
Top 50 TypeScript MCP repos
{"mode": "repo-search","query": "language:typescript stars:>500 mcp","maxResults": 50}
Open issues with full conversation threads
{"mode": "issues","repo": "apify/actors-mcp-server","state": "open","includeComments": true,"maxResults": 100}
Recently closed PRs (merged or not) with reviews
{"mode": "prs","repo": "vercel/next.js","state": "closed","includeReviews": true,"since": "2026-04-01","maxResults": 200}
Code search across GitHub
{"mode": "code-search","query": "StreamableHTTPServerTransport language:typescript","maxResults": 100}
Daily trending Rust repos
{"mode": "trending","language": "rust","timeframe": "daily"}
Last 30 days of commits with file diffs
{"mode": "commits","repo": "anthropics/claude-code","since": "2026-04-28","includeFiles": true,"maxResults": 200}
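Any of the inputs above can be sent from a script. The following is a sketch using the `apify-client` Python package (`pip install apify-client`); the actor ID is an assumption taken from the MCP URL later in this README, and `run_github_scraper` is a hypothetical wrapper name.

```python
# Assumption: actor ID as it appears in the MCP URL in this README.
ACTOR_ID = "khadinakbar/github-deep-scraper"


def run_github_scraper(client, run_input: dict) -> list:
    """Start one run, wait for it to finish, and return its dataset items.

    `client` is expected to be an apify_client.ApifyClient instance.
    """
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a run object")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (needs a real Apify token and network access):
#   from apify_client import ApifyClient
#   items = run_github_scraper(
#       ApifyClient("<APIFY_TOKEN>"),
#       {"mode": "trending", "language": "rust", "timeframe": "daily"},
#   )
```

Passing the client in, rather than constructing it inside the function, keeps the token out of this module and makes the wrapper easy to exercise with a stub.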
Output shape
Every record has mode, type, url, and scrapedAt (ISO 8601 UTC). Mode-specific fields follow. Items are flat, nulls are explicit, dates are ISO 8601. Average item size is under 500 tokens — built to fit inside an agent's context window when sampling 3-20 results.
Sample repo record:
{"mode": "repo","type": "repo","owner": "facebook","name": "react","fullName": "facebook/react","description": "The library for web and native user interfaces.","url": "https://github.com/facebook/react","homepage": "https://react.dev","language": "JavaScript","topics": ["react", "frontend", "javascript", "library"],"stars": 229000,"forks": 46900,"watchers": 6700,"openIssues": 980,"license": "MIT","archived": false,"defaultBranch": "main","createdAt": "2013-05-24T16:15:54Z","updatedAt": "2026-05-28T11:20:01Z","pushedAt": "2026-05-28T03:42:11Z","languages": { "JavaScript": 8294122, "TypeScript": 311024, "HTML": 24189 },"latestRelease": { "tagName": "v18.3.1", "publishedAt": "2025-04-26T17:42:00Z" },"scrapedAt": "2026-05-28T18:14:32Z"}
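Because every record carries the same four common fields, a consumer can check them before looking at mode-specific data. The sketch below assumes only what this section states (the field names and ISO 8601 UTC timestamps); `parse_common` is a hypothetical helper.

```python
from datetime import datetime, timezone

COMMON_FIELDS = ("mode", "type", "url", "scrapedAt")


def parse_common(record: dict) -> dict:
    """Return the common fields of a record, with scrapedAt parsed to an aware datetime."""
    missing = [field for field in COMMON_FIELDS if field not in record]
    if missing:
        raise KeyError(f"record is missing common fields: {missing}")
    common = {field: record[field] for field in COMMON_FIELDS}
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    common["scrapedAt"] = datetime.fromisoformat(record["scrapedAt"].replace("Z", "+00:00"))
    return common
```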
Use with MCP
The Apify MCP server exposes the actor as the tool khadinakbar--github-deep-scraper (the Actor's full name with the slash replaced by a double dash). Hit it from any MCP client:
https://mcp.apify.com?tools=khadinakbar/github-deep-scraper
From Claude Code or Cursor, configure the Apify MCP server with your Apify token, then the tool is discoverable through standard MCP list_tools calls. Anthropic agents budget per call — typical agent runs stay under the $1 x402 default prepay limit when maxResults is set sensibly.
Reliability and rate limits
- Built-in retry with exponential backoff for 5xx errors.
- 429 backoff respects GitHub's `Retry-After` header.
- 403 rate-limit responses wait until `X-RateLimit-Reset` (or fail clean if the wait exceeds 60 s).
- Latest rate-limit state is persisted to the actor's key-value store (`RATELIMIT-latest`) for inspection.
- ETag headers are surfaced via the KV store collection `ETAG` to enable downstream conditional caching.
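The retry rules in the list above can be sketched as a single decision function. This is an illustration, not the actor's implementation: the backoff base and 30 s cap are assumptions, header lookup is assumed to use the exact casing shown, and `retry_delay` is a hypothetical name.

```python
import time
from typing import Mapping, Optional

MAX_RESET_WAIT_S = 60.0  # from the list above: give up if the reset is further away


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                now: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retrying, or None if the request should not be retried."""
    now = time.time() if now is None else now
    if 500 <= status < 600:
        # Exponential backoff; the base of 2 and the 30 s cap are assumptions.
        return min(2.0 ** attempt, 30.0)
    if status == 429:
        retry_after = headers.get("Retry-After", "")
        # Retry-After can also be an HTTP date; this sketch handles only the seconds form.
        return float(retry_after) if retry_after.isdigit() else min(2.0 ** attempt, 30.0)
    if status == 403 and headers.get("X-RateLimit-Remaining") == "0":
        reset = headers.get("X-RateLimit-Reset")  # UTC epoch seconds
        if reset is None:
            return None
        wait = max(0.0, float(reset) - now)
        return wait if wait <= MAX_RESET_WAIT_S else None
    return None
```

Keeping the decision separate from the sleep and the HTTP call makes each rule checkable in isolation, for example by passing a fixed `now`.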
FAQ
Why one actor with 11 modes instead of 11 separate actors?
Agents call tools by name. Having one tool that covers all GitHub surfaces means the agent picks correctly the first time. Eleven separate actors mean eleven tool-description shootouts and eleven chances to pick the wrong one.
Do I need a GitHub token?
For most modes, no — but you'll be capped at 60 requests/hour. Set GITHUB_TOKEN to lift that to 5,000/hour. For code-search, a token is required (GitHub's API forces this).
Does it work for private repositories?
No. Use the official GitHub CLI (gh) for private repos. This actor is designed for public data only.
What about GitHub GraphQL?
The actor uses REST v3 for stable, paginated endpoints. A future version may switch heavy multi-field reads to GraphQL to halve API quota usage.
How fresh is the data?
Real-time. Every run hits GitHub's live API.
Can I run multiple modes in one call?
No, one mode per run. Chain runs from your orchestrator (Apify task, n8n, Zapier, agent loop) when you need composite data.
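Chaining one-mode runs from an orchestrator can be as simple as a loop. The sketch below is a hypothetical helper, not part of the actor: `run_one` stands for whatever function starts a single run and returns its records (for example, a thin wrapper around the Apify client).

```python
from typing import Callable, Dict, Iterable, List


def collect(run_one: Callable[[dict], List[dict]],
            inputs: Iterable[dict]) -> Dict[str, List[dict]]:
    """Run each input as its own actor run and group the returned records by mode."""
    by_mode: Dict[str, List[dict]] = {}
    for run_input in inputs:
        by_mode.setdefault(run_input["mode"], []).extend(run_one(run_input))
    return by_mode
```

An agent that needs repo metadata plus open issues would pass two inputs and read `result["repo"]` and `result["issues"]`; each input is still billed as a separate run.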
Legal
This actor accesses GitHub's official REST API and the public github.com/trending HTML page. All endpoints used are public; no authentication is required to access them, except code search, which GitHub only serves to authenticated requests (a token is also recommended for higher rate limits). You are responsible for complying with GitHub's Terms of Service and Acceptable Use Policy. The actor does not scrape private repositories or any data behind login. It does not bypass any technical protection measure. Personal data extracted (contributor names, emails published on profiles) must be handled in accordance with applicable data-protection law (GDPR, CCPA). Apify and the actor author make no warranty about data accuracy, completeness, or fitness for any purpose.
Changelog
- 2026-05-29 v0.2: Reliability hardening. Graceful 404/422/451/409 → structured `not-found`/`search-error` records instead of failed runs. Canary check at run start. Pre-flight validation. Trending mode now flags invalid language slugs explicitly. 36/36 brutal test matrix green.
- 2026-05-28 v0.1: Initial release. 11 modes. REST v3. PAT-aware rate-limit handling. Premium PPE pricing.
Related actors
- Google Patents Scraper — search patents, citations & inventor portfolios across USPTO/EPO/WIPO when you need IP context alongside open-source.
- Hugging Face Scraper — models, datasets & Spaces for AI research and benchmarking work that pairs with GitHub repo intel.
- Y Combinator Scraper — YC company profiles, founders & jobs to enrich repo-author identification and dev-tool competitive scans.
- Google SERP Scraper — Google search results for any keyword when you need to combine repo signal with web-wide ranking.
- ImportYeti Scraper — US Customs import/supplier graph for industrial intelligence beyond the software layer.
Built by @khadinakbar.