GitHub Repository Intelligence
Fetch rich metadata (stars, forks, README, languages, topics, license) from GitHub repositories. Search by query or provide direct URLs. Optional GitHub token for 80x higher rate limit.
Pricing: from $1.00 / 1,000 results
Developer: Crawler Bros
Pull rich, structured metadata about any public GitHub repository — stars, forks, languages, topics, license, README, and full timestamps — either by search query or from a list of repository URLs.
What this actor does
GitHub's REST API exposes a deep well of per-repository metadata, but stitching it together across many repos (metadata + README + language breakdown + topics) is awkward to do by hand. This actor does that stitching for you and returns one clean, denormalised record per repository, ready for a dataset, a CSV export, or a downstream analytics job.
It has two modes. In search mode you hand it any GitHub search query — stars:>10000 language:python, topic:llm, user:apify, or whatever qualifier expression you like — and it returns up to 1000 matching repos, each fully enriched. In direct mode you hand it a list of https://github.com/owner/repo URLs and it fetches them in parallel.
For each repository the actor returns full metadata (stars, forks, open issues, watchers, size, homepage, default branch, timestamps, archived / template / disabled flags), the README text (base64-decoded), the list of topics, a byte-level language breakdown, and a normalised license block with SPDX identifier.
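The stitching step can be sketched as follows, assuming the three raw GitHub REST responses (GET /repos/{owner}/{repo}, .../readme, and .../languages) are already in hand. The helper name stitch_record and the exact field selection are illustrative, not the actor's actual code:

```python
import base64

def stitch_record(meta: dict, readme: dict, languages: dict) -> dict:
    """Merge the three raw GitHub API responses into one flat record.

    meta:      response of GET /repos/{owner}/{repo}
    readme:    response of GET /repos/{owner}/{repo}/readme (base64 content)
    languages: response of GET /repos/{owner}/{repo}/languages (bytes per language)
    """
    record = {
        "fullName": meta.get("full_name"),
        "stars": meta.get("stargazers_count"),
        "forks": meta.get("forks_count"),
        "topics": meta.get("topics"),
        "languages": languages or None,
        "readme": base64.b64decode(readme["content"]).decode("utf-8", "replace")
                  if readme.get("content") else None,
    }
    # Zero-null output: drop empty fields instead of emitting null.
    return {k: v for k, v in record.items() if v not in (None, [], {})}
```

The same pattern extends to the remaining fields (timestamps, status flags, license) by mapping more keys from the metadata response.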
Key features
- Two modes — search (any GitHub search query, up to 1000 results) or direct (explicit repository URLs)
- Rich per-repo metadata — identity, owner, stars, forks, open issues, watchers, size, homepage, default branch, timestamps, and status flags
- README — fetched, decoded, and included inline (truncated at 500 KB)
- Topics — the tags the repository owner has assigned
- Language breakdown — byte counts per language
- License block — SPDX id, full name, and URL
- Optional githubToken — lifts the rate limit from 60 to 5000 requests per hour (roughly 80x), letting you scale up to thousands of repositories per run
- Resilient — automatic retry on transient errors, rate-limit detection, per-repo error sentinels so one bad repo doesn't kill the batch
- Zero-setup — no browser, no cookies, no proxy required
- Zero-null output — empty fields are omitted from each record
Input
| Field | Type | Default | Description |
|---|---|---|---|
| mode | enum | search | search to run a GitHub search query, or direct to fetch a list of URLs. |
| searchQuery | string | — | GitHub search query (used when mode=search). Examples: stars:>10000 language:python, topic:llm, user:apify. |
| repositoryUrls | string[] | — | List of repository URLs (used when mode=direct). Example: https://github.com/apify/crawlee. |
| sortBy | enum | stars | Search result ordering: stars, forks, updated, or help-wanted-issues. |
| maxResults | integer | 100 | Cap on the number of repositories returned (1-1000). |
| includeReadme | boolean | true | Fetch and include each repository's README. |
| includeTopics | boolean | true | Include the repository's topic tags. |
| includeLanguages | boolean | true | Include the byte-level language breakdown. |
| githubToken | string (secret) | — | Optional personal access token. Highly recommended for runs that touch more than ~20 repositories. |
Example input — search mode
{"mode": "search","searchQuery": "topic:llm stars:>500 language:python","sortBy": "stars","maxResults": 200,"includeReadme": true,"includeTopics": true,"includeLanguages": true,"githubToken": "ghp_xxx"}
Example input — direct mode
{"mode": "direct","repositoryUrls": ["https://github.com/apify/crawlee","https://github.com/apify/actor-templates"]}
Output
One record per repository. Fields without data are omitted.
{"id": 169257834,"name": "crawlee","fullName": "apify/crawlee","owner": {"login": "apify","id": 24586296,"avatarUrl": "https://avatars.githubusercontent.com/u/24586296?v=4","htmlUrl": "https://github.com/apify","type": "Organization"},"description": "Crawlee — a web scraping and browser automation library...","htmlUrl": "https://github.com/apify/crawlee","homepage": "https://crawlee.dev","primaryLanguage": "TypeScript","stars": 15000,"forks": 1000,"watchers": 120,"openIssues": 80,"size": 25000,"topics": ["scraping", "crawler", "automation"],"license": {"spdxId": "Apache-2.0","name": "Apache License 2.0","url": "https://api.github.com/licenses/apache-2.0"},"createdAt": "2019-02-05T12:00:00Z","updatedAt": "2024-05-01T00:00:00Z","pushedAt": "2024-05-02T00:00:00Z","isFork": false,"isArchived": false,"isDisabled": false,"isTemplate": false,"defaultBranch": "master","readme": "# Crawlee\n...","languages": {"TypeScript": 987654, "JavaScript": 4321},"scrapedAt": "2026-04-24T12:00:00+00:00"}
Field descriptions
- Identity — id, name, fullName, htmlUrl, owner (login / id / avatar / URL / type)
- Description — description, homepage, primaryLanguage
- Engagement — stars, forks, watchers, openIssues, size (in KB)
- Tags and license — topics array, license block with SPDX identifier
- Timestamps — createdAt, updatedAt, pushedAt
- Status flags — isFork, isArchived, isDisabled, isTemplate
- Branch — defaultBranch
- Content — readme (decoded, truncated to 500 KB), languages (bytes per language)
- scrapedAt — ISO-8601 timestamp of this run
Error record — emitted per repository when the fetch fails:
{"type": "github_repo_intelligence_error","reason": "rate_limit","message": "GitHub API rate limit exceeded. Supply `githubToken` to lift the limit.","repoIdentifier": "apify/crawlee","scrapedAt": "2026-04-24T12:00:00+00:00"}
Reason codes: rate_limit, not_found, invalid_url, search_failed, no_results, fetch_failed.
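When consuming the resulting dataset, error sentinels can be separated from repo records by their type field and grouped by reason code. A small illustrative helper (not part of the actor):

```python
def split_results(items: list[dict]) -> tuple[list[dict], dict[str, list[dict]]]:
    """Separate repo records from per-repo error sentinels, grouping errors by reason."""
    repos: list[dict] = []
    errors: dict[str, list[dict]] = {}
    for item in items:
        if item.get("type") == "github_repo_intelligence_error":
            errors.setdefault(item.get("reason", "unknown"), []).append(item)
        else:
            repos.append(item)
    return repos, errors
```

A batch of rate_limit errors, for example, is the signal to re-run with a githubToken rather than to retry blindly.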
Use cases
- Ecosystem mapping — enumerate every repo tagged with topic:llm or topic:web3 to build a competitive map
- Leaderboards and dashboards — rank a set of repos by stars, forks, or recent activity for internal dashboards
- Due diligence — gather license, archived status, push cadence, and README for a list of dependencies
- Market research — study language mix, topic distribution, and growth across a population of related projects
- Downstream code analysis — seed an ML / static-analysis pipeline with enriched repo metadata plus READMEs
FAQ
Do I need a GitHub token?
For very small runs (under ~20 repositories an hour, total) you can run without one. For anything larger, supply a personal access token — the token lifts your hourly request budget from 60 to 5000 (roughly 80x). That's because GitHub enforces a low anonymous-request cap (60 requests per IP per hour) and a much higher token-based cap (5000 requests per token per hour). Create a classic token at https://github.com/settings/tokens with the public_repo scope.
Why does each repository use multiple API calls?
Fetching the full metadata, the README, and the language breakdown takes three API calls per repository. If you disable includeReadme, includeTopics, and includeLanguages you drop back to one call per repo — useful for big shallow scans.
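A back-of-envelope budget check follows from those numbers, assuming one metadata call per repo plus one extra call each for the README and the language breakdown (the helper names here are hypothetical):

```python
def calls_needed(repo_count: int, include_readme: bool = True,
                 include_languages: bool = True) -> int:
    """Estimate total GitHub API calls for a run."""
    per_repo = 1 + include_readme + include_languages  # bools count as 0/1
    return repo_count * per_repo

def needs_token(repo_count: int, **opts) -> bool:
    # Anonymous budget is 60 calls/hour; with a token it is 5000.
    return calls_needed(repo_count, **opts) > 60
```

By this estimate 20 fully enriched repos exactly exhaust the anonymous budget, which is where the "~20 repositories" guidance above comes from.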
Do I need a proxy?
No. GitHub's API isn't rate-limited by IP; it's limited by the anonymous/token budget described above. Apify datacenter IPs are fine.
Can I scrape private repositories?
Yes, as long as the githubToken you supply has access to them. A classic token with repo scope covers all private repos your account can read.
Why are some fields missing in my output?
The actor omits empty fields to keep records compact and meaningful — a repo without a homepage simply won't have a homepage key, instead of reporting null.
How large can a README get?
READMEs are truncated at 500 KB of decoded UTF-8, with a ...[truncated] marker appended. This keeps dataset rows well-behaved; in practice almost all READMEs fall far below the limit, so content loss is rare.
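The truncation rule can be approximated like this. truncate_readme is an illustrative sketch, assuming the 500 KB limit applies to the UTF-8 byte length of the decoded text:

```python
_README_LIMIT = 500 * 1024  # 500 KB of decoded UTF-8

def truncate_readme(text: str) -> str:
    """Cut an oversized README at the byte limit and append the marker."""
    if len(text.encode("utf-8")) <= _README_LIMIT:
        return text
    # Trim on the byte boundary, then drop any half-encoded trailing character.
    clipped = text.encode("utf-8")[:_README_LIMIT].decode("utf-8", "ignore")
    return clipped + "...[truncated]"
```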
Will search return more than 1000 results?
No — that's GitHub's hard cap on search. For a bigger universe, slice the space with additional qualifiers (e.g. stars:1000..5000, then stars:500..999, then stars:100..499) and run once per slice.
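The slicing trick above can be automated. This hypothetical helper turns a descending list of star cutoffs into non-overlapping qualifiers, one per run:

```python
def star_slices(cutoffs: list[int]) -> list[str]:
    """Descending cutoffs -> non-overlapping stars:lo..hi search qualifiers.

    Each slice spans from one cutoff down to the next; after the first
    slice, the upper bound is shifted by 1 so ranges don't overlap.
    """
    out = []
    for i, (hi, lo) in enumerate(zip(cutoffs, cutoffs[1:])):
        upper = hi if i == 0 else hi - 1
        out.append(f"stars:{lo}..{upper}")
    return out
```

Combine each slice with the rest of your query (e.g. language:python) and run the actor once per qualifier.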
What happens when a single repo fails?
The actor emits a per-repo error record and continues with the rest of the batch. One deleted or private repository never kills the whole run.
Known limitations
- 1000-result search cap. GitHub itself caps any search query at 1000 results. For larger spaces, slice the query into ranges.
- Anonymous rate limit is tight. Without a githubToken you get about 60 API calls per hour. Each enriched repo is up to 3 calls, so runs over ~20 repositories need a token.
- README truncation at 500 KB. Very long READMEs (rare) are cut at 500 KB with a marker.
- No commit, issue, or PR data. This actor focuses on repository-level metadata; commit history, issues, and pull requests are out of scope.
- Private repos need explicit access. The githubToken must have repo scope for private repositories; public-only tokens return a not_found error record for private URLs.