GitHub Scraper — Repos, Issues, PRs & Code

Pricing

Pay per event

Scrape GitHub deeply — repos, issues, PRs, code search, contributors, releases, READMEs, commits, users, trending. 11 modes in one actor for AI coding agents (Claude Code, Cursor, Copilot). Optional PAT for 5K req/hr. MCP-ready, flat JSON output.

Developer

Khadin Akbar

Maintained by Community

GitHub Scraper is the GitHub data layer for AI coding agents. One actor, 11 modes, backed by GitHub's official REST API. Built for Claude Code, Cursor, GitHub Copilot, Aider, and any agent that needs rich GitHub context in a single tool call instead of stitching five narrow scrapers together.

What it does

Scrapes GitHub deeply through 11 selectable modes:

| Mode | Returns | Required input |
|---|---|---|
| repo | Full repo metadata: 50+ fields (stars, forks, languages, topics, license, default branch, latest release) | repo |
| repo-search | Repos matching a search query, with full GitHub search qualifier syntax (language:, stars:>1000, user:, topic:) | query |
| issues | Issues with labels, assignees, milestones, optional comment thread | repo |
| prs | Pull requests with reviewers, files changed, mergeable status, optional reviews + comments | repo |
| code-search | Code search across all of GitHub: file path, repo, sha, text matches (requires GITHUB_TOKEN) | query |
| contributors | Contributors with login, contribution count, full profile (email, company, location, bio) | repo |
| releases | Releases with assets, download counts, release notes, prerelease flag | repo |
| readme | Full README: raw markdown + GFM-rendered HTML | repo |
| commits | Commit history with author, message, SHA, parents, optional file diffs + stats | repo |
| user | User or organization profile + their repos + organizations + social accounts | user |
| trending | Trending repos by language and timeframe (daily / weekly / monthly) | none |
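
The mode table can be turned into a small pre-flight check on the caller's side. The following is a minimal sketch, not the actor's own validation code: the mapping is copied from the table above, while the names REQUIRED_INPUT and missing_input are hypothetical helpers introduced for illustration.

```python
from typing import Optional

# Required input field per mode, copied from the mode table.
# None means the mode has no required input.
REQUIRED_INPUT = {
    "repo": "repo",
    "repo-search": "query",
    "issues": "repo",
    "prs": "repo",
    "code-search": "query",
    "contributors": "repo",
    "releases": "repo",
    "readme": "repo",
    "commits": "repo",
    "user": "user",
    "trending": None,
}


def missing_input(run_input: dict) -> Optional[str]:
    """Return the name of the missing required field, or None if nothing is missing."""
    mode = run_input.get("mode")
    if mode not in REQUIRED_INPUT:
        raise ValueError(f"unknown mode: {mode!r}")
    field = REQUIRED_INPUT[mode]
    if field is not None and not run_input.get(field):
        return field
    return None
```

Checking an input before starting a run avoids paying the start fee for a call that cannot succeed.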

When to use it

  • AI coding agents that need to ground answers in real repo state — issue history, PR reviews, contributor expertise, recent commits.
  • OSS research — analyze hundreds of repos for tech stack, activity, bus factor, dependency drift.
  • DevRel sourcing — find maintainers, contributors, and active issue commenters for partnership outreach.
  • Recruiter pipelines — identify high-signal devs by contribution patterns and language depth.
  • Competitive intelligence — track competitor open-source releases, issue volume, PR velocity.

When not to use it

  • Private repositories or GitHub Enterprise Server (use the official gh CLI with auth instead).
  • Real-time GitHub events — use GitHub webhooks for that.
  • LinkedIn-style enrichment of GitHub users — see linkedin-profile-email-scraper for that.

Price

  • apify-actor-start — $0.00005 per run start.
  • result — $0.005 per record returned for: repo, repo-search, issues (without comments), prs (without reviews/comments), contributors, releases, readme, user, trending.
  • deep-result — $0.01 per record for heavier modes: code-search, commits with includeFiles=true, PRs with includeReviews=true or includeComments=true, issues with includeComments=true.

A typical agent call (50 repos in repo-search) costs about $0.25. A deep run (100 PRs with reviews + comments) costs about $1.00. The actor stops at maxResults (default 50, hard cap 1000), so even a maxed-out deep run costs about $10, the x402 default prepay limit for agentic payments.
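
The arithmetic behind these figures can be sketched as a worst-case estimator. This is an illustration under stated assumptions: the three fees and the maxResults default and cap come from the text above, while the function names and the assumption that every record in a run is billed at a single rate are mine.

```python
START_FEE = 0.00005  # apify-actor-start, charged once per run
RESULT_FEE = 0.005   # result, per record
DEEP_FEE = 0.01      # deep-result, per record

DEFAULT_MAX_RESULTS = 50
HARD_CAP = 1000


def is_deep(run_input: dict) -> bool:
    """True when the input falls under the deep-result price tier."""
    mode = run_input.get("mode")
    if mode == "code-search":
        return True
    if mode == "commits":
        return bool(run_input.get("includeFiles"))
    if mode == "prs":
        return bool(run_input.get("includeReviews") or run_input.get("includeComments"))
    if mode == "issues":
        return bool(run_input.get("includeComments"))
    return False


def estimate_max_cost(run_input: dict) -> float:
    """Upper bound in USD, assuming the run returns maxResults records."""
    records = min(run_input.get("maxResults", DEFAULT_MAX_RESULTS), HARD_CAP)
    fee = DEEP_FEE if is_deep(run_input) else RESULT_FEE
    return START_FEE + records * fee
```

An agent can call estimate_max_cost before a run and lower maxResults when the bound exceeds its budget.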

Authentication

Without a token, GitHub's REST API allows 60 requests/hour. With a token: 5,000/hour. Code search requires authentication.

Set the GITHUB_TOKEN environment variable in Apify Console → Settings → Environment variables → Add → Secret:

  1. Create a fine-grained Personal Access Token at https://github.com/settings/tokens?type=beta
  2. Under Repository access, choose "Public Repositories (read-only)"; that's enough for everything public.
  3. Paste the token as GITHUB_TOKEN in the actor's environment variables. Apify masks it automatically.

The actor never logs the token — apify/log auto-redacts.
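
For readers who want to see what the token changes on the wire, here is a minimal sketch of how a client could build GitHub REST request headers with and without a token. It is not taken from the actor's source; github_headers and hourly_limit are hypothetical names, and the limits are the ones quoted above.

```python
import os
from typing import Optional


def github_headers(token: Optional[str] = None) -> dict:
    """Build GitHub REST API headers; falls back to the GITHUB_TOKEN env var."""
    if token is None:
        token = os.environ.get("GITHUB_TOKEN", "")
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        # Only authenticated requests get the higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


def hourly_limit(token: Optional[str]) -> int:
    """Requests per hour quoted in this README: 60 anonymous, 5,000 with a token."""
    return 5000 if token else 60
```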

Example inputs

Full metadata for one repo

{
  "mode": "repo",
  "repo": "facebook/react"
}

Top 50 TypeScript MCP repos

{
  "mode": "repo-search",
  "query": "language:typescript stars:>500 mcp",
  "maxResults": 50
}

Open issues with full conversation threads

{
  "mode": "issues",
  "repo": "apify/actors-mcp-server",
  "state": "open",
  "includeComments": true,
  "maxResults": 100
}

Recently closed PRs with reviews

{
  "mode": "prs",
  "repo": "vercel/next.js",
  "state": "closed",
  "includeReviews": true,
  "since": "2026-04-01",
  "maxResults": 200
}

Code search across GitHub

{
  "mode": "code-search",
  "query": "StreamableHTTPServerTransport language:typescript",
  "maxResults": 100
}

Daily trending Rust repos

{
  "mode": "trending",
  "language": "rust",
  "timeframe": "daily"
}

Last 30 days of commits with file diffs

{
  "mode": "commits",
  "repo": "anthropics/claude-code",
  "since": "2026-04-28",
  "includeFiles": true,
  "maxResults": 200
}
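
Any of the inputs above can be sent to the actor through the Apify API. The sketch below uses only the Python standard library and the synchronous run-sync-get-dataset-items endpoint. The actor ID is an assumption inferred from the MCP URL later in this README (with "/" written as "~", as Apify API paths expect), and build_run_request and run_actor are hypothetical helper names.

```python
import json
import urllib.request

# Assumed actor ID, inferred from the MCP URL in this README.
ACTOR_ID = "khadinakbar~github-deep-scraper"


def build_run_request(run_input: dict, apify_token: str) -> urllib.request.Request:
    """Prepare a POST that starts a run and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {apify_token}",
        },
        method="POST",
    )


def run_actor(run_input: dict, apify_token: str) -> list:
    """Send the request and decode the JSON array of records (needs network access)."""
    with urllib.request.urlopen(build_run_request(run_input, apify_token)) as resp:
        return json.load(resp)
```

The synchronous endpoint waits for the run to finish, so it suits small maxResults values; larger runs are usually better started asynchronously and read from the dataset afterwards.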

Output shape

Every record has mode, type, url, and scrapedAt (ISO 8601 UTC). Mode-specific fields follow. Items are flat, nulls are explicit, dates are ISO 8601. Average item size is under 500 tokens — built to fit inside an agent's context window when sampling 3-20 results.

Sample repo record:

{
  "mode": "repo",
  "type": "repo",
  "owner": "facebook",
  "name": "react",
  "fullName": "facebook/react",
  "description": "The library for web and native user interfaces.",
  "url": "https://github.com/facebook/react",
  "homepage": "https://react.dev",
  "language": "JavaScript",
  "topics": ["react", "frontend", "javascript", "library"],
  "stars": 229000,
  "forks": 46900,
  "watchers": 6700,
  "openIssues": 980,
  "license": "MIT",
  "archived": false,
  "defaultBranch": "main",
  "createdAt": "2013-05-24T16:15:54Z",
  "updatedAt": "2026-05-28T11:20:01Z",
  "pushedAt": "2026-05-28T03:42:11Z",
  "languages": { "JavaScript": 8294122, "TypeScript": 311024, "HTML": 24189 },
  "latestRelease": { "tagName": "v18.3.1", "publishedAt": "2025-04-26T17:42:00Z" },
  "scrapedAt": "2026-05-28T18:14:32Z"
}
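
Because records are flat and timestamps are ISO 8601, downstream checks stay short. The following sketch validates the four common fields and derives a simple freshness signal from the sample record. The helper names are hypothetical, and the "Z" replacement is there because datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 onward.

```python
from datetime import datetime
from typing import Optional

# Fields the README says every record carries.
COMMON_FIELDS = ("mode", "type", "url", "scrapedAt")


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-05-28T18:14:32Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_record(record: dict) -> datetime:
    """Raise if a common field is missing; return the parsed scrapedAt."""
    missing = [name for name in COMMON_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing common fields: {missing}")
    return parse_timestamp(record["scrapedAt"])


def days_since_push(record: dict) -> Optional[float]:
    """Days between the last push and the scrape, or None when pushedAt is null."""
    pushed = record.get("pushedAt")
    if pushed is None:
        return None
    delta = check_record(record) - parse_timestamp(pushed)
    return delta.total_seconds() / 86400
```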

Use with MCP

The actor is exposed as khadinakbar--github-deep-scraper in the Apify MCP server. Reach it from any MCP client:

https://mcp.apify.com?tools=khadinakbar/github-deep-scraper

From Claude Code or Cursor, configure the Apify MCP server with your Apify token; the tool is then discoverable through standard MCP list_tools calls. Agents that budget per call can cap spend with maxResults: typical runs stay under $1, within the x402 default prepay limit.

Reliability and rate limits

  • Built-in retry with exponential backoff for 5xx errors.
  • 429 backoff respects GitHub's Retry-After header.
  • 403 rate-limit responses wait until X-RateLimit-Reset (or fail cleanly if the wait exceeds 60 s).
  • Latest rate-limit state is persisted to the actor's key-value store (RATELIMIT-latest) for inspection.
  • ETag headers are surfaced via the KV store collection ETAG to enable downstream conditional caching.
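
The retry rules in the list above can be sketched as one decision function. This is an illustration of the described behavior, not the actor's code: retry_delay is a hypothetical name, the 60 s ceiling comes from the list, and the backoff base, the 30 s cap, the fallback for a 429 without Retry-After, and the assumption that Retry-After holds a number of seconds are all mine.

```python
import time
from typing import Mapping, Optional

MAX_RESET_WAIT = 60.0  # from the README: give up if the reset is further away
BACKOFF_CAP = 30.0     # assumed ceiling for exponential backoff


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                now: Optional[float] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None to stop retrying."""
    now = time.time() if now is None else now
    backoff = min(float(2 ** attempt), BACKOFF_CAP)

    if status == 429:
        # Prefer the server's hint; fall back to backoff if it is absent.
        retry_after = headers.get("Retry-After")
        return float(retry_after) if retry_after is not None else backoff

    if status == 403 and headers.get("X-RateLimit-Remaining") == "0":
        reset = headers.get("X-RateLimit-Reset")  # Unix epoch seconds
        if reset is None:
            return None
        wait = max(float(reset) - now, 0.0)
        return wait if wait <= MAX_RESET_WAIT else None

    if 500 <= status < 600:
        return backoff

    return None  # other statuses are not retried in this sketch
```

Passing now explicitly keeps the function deterministic, which makes the 60 s cutoff easy to unit-test.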

FAQ

Why one actor with 11 modes instead of 11 separate actors? Agents call tools by name. Having one tool that covers all GitHub surfaces means the agent picks correctly the first time. Eleven separate actors mean eleven tool-description shootouts and eleven chances to pick the wrong one.

Do I need a GitHub token? For most modes, no — but you'll be capped at 60 requests/hour. Set GITHUB_TOKEN to lift that to 5,000/hour. For code-search, a token is required (GitHub's API forces this).

Does it work for private repositories? No. Use the official GitHub CLI (gh) for private repos. This actor is designed for public data only.

What about GitHub GraphQL? The actor uses REST v3 for stable, paginated endpoints. A future version may switch heavy multi-field reads to GraphQL to halve API quota usage.

How fresh is the data? Real-time. Every run hits GitHub's live API.

Can I run multiple modes in one call? No — one mode per run. Chain runs from your orchestrator (Apify task, n8n, Zapier, agent loop) when you need composite data.

This actor accesses GitHub's official REST API and the public github.com/trending HTML page. All endpoints used are public; apart from code search, which requires a token, none needs authentication (a token is still recommended for higher rate limits). You are responsible for complying with GitHub's Terms of Service and Acceptable Use Policy. The actor does not scrape private repositories or any data behind login. It does not bypass any technical protection measure. Personal data extracted (contributor names, emails published on profiles) must be handled in accordance with applicable data-protection law (GDPR, CCPA). Apify and the actor author make no warranty about data accuracy, completeness, or fitness for any purpose.

Changelog

  • 2026-05-29 v0.2 — Reliability hardening. Graceful 404/422/451/409 → structured not-found/search-error records instead of failed runs. Canary check at run start. Pre-flight validation. Trending mode now flags invalid language slugs explicitly. 36/36 brutal test matrix green.
  • 2026-05-28 v0.1 — Initial release. 11 modes. REST v3. PAT-aware rate-limit handling. Premium PPE pricing.

Related actors

  • Google Patents Scraper — search patents, citations & inventor portfolios across USPTO/EPO/WIPO when you need IP context alongside open-source.
  • Hugging Face Scraper — models, datasets & Spaces for AI research and benchmarking work that pairs with GitHub repo intel.
  • Y Combinator Scraper — YC company profiles, founders & jobs to enrich repo-author identification and dev-tool competitive scans.
  • Google SERP Scraper — Google search results for any keyword when you need to combine repo signal with web-wide ranking.
  • ImportYeti Scraper — US Customs import/supplier graph for industrial intelligence beyond the software layer.

Built by @khadinakbar.