GitHub Repository Intelligence

Fetch rich metadata (stars, forks, README, languages, topics, license) from GitHub repositories. Search by query or provide direct URLs. Optional GitHub token for 80x higher rate limit.

Pricing: from $1.00 / 1,000 results
Developer: Crawler Bros (Maintained by Community)

Pull rich, structured metadata about any public GitHub repository — stars, forks, languages, topics, license, README, and full timestamps — either by search query or from a list of repository URLs.

What this actor does

GitHub's REST API exposes a deep well of per-repository metadata, but stitching it together across many repos (metadata + README + language breakdown + topics) is awkward to do by hand. This actor does that stitching for you and returns one clean, denormalised record per repository, ready for a dataset, a CSV export, or a downstream analytics job.

It has two modes. In search mode you hand it any GitHub search query — stars:>10000 language:python, topic:llm, user:apify, or whatever qualifier expression you like — and it returns up to 1000 matching repos, each fully enriched. In direct mode you hand it a list of https://github.com/owner/repo URLs and it fetches them in parallel.

For each repository the actor returns full metadata (stars, forks, open issues, watchers, size, homepage, default branch, timestamps, archived / template / disabled flags), the README text (base64-decoded), the list of topics, a byte-level language breakdown, and a normalised license block with SPDX identifier.

Key features

  • Two modes — search (any GitHub search query, up to 1000 results) or direct (explicit repository URLs)
  • Rich per-repo metadata — identity, owner, stars, forks, open issues, watchers, size, homepage, default branch, timestamps, and status flags
  • README — fetched, decoded, and included inline (truncated at 500 KB)
  • Topics — the tags the repository owner has assigned
  • Language breakdown — byte counts per language
  • License block — SPDX id, full name, and URL
  • Optional githubToken — lifts the rate limit from 60 to 5000 requests per hour (roughly 80x), letting you scale up to thousands of repositories per run
  • Resilient — automatic retry on transient errors, rate-limit detection, per-repo error sentinels so one bad repo doesn't kill the batch
  • Zero-setup — no browser, no cookies, no proxy required
  • Zero-null output — empty fields are omitted from each record
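
To make the README handling concrete: GitHub's API returns README content base64-encoded, and the actor decodes it and truncates at 500 KB. The helper below is an illustrative reconstruction under those assumptions (decode_readme is not the actor's code, and the exact truncation boundary is a guess):

```python
import base64

README_LIMIT = 500 * 1024  # the 500 KB cap described above

def decode_readme(payload: dict) -> str:
    """Decode GitHub's base64 README payload and apply the truncation the
    actor describes. A sketch of assumed behaviour, not the actor's code."""
    raw = base64.b64decode(payload["content"])
    if len(raw) > README_LIMIT:
        # Truncate the raw bytes, then decode; append the documented marker.
        return raw[:README_LIMIT].decode("utf-8", errors="ignore") + "...[truncated]"
    return raw.decode("utf-8", errors="replace")
```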

Input

| Field | Type | Default | Description |
|---|---|---|---|
| mode | enum | search | search to run a GitHub search query, or direct to fetch a list of URLs. |
| searchQuery | string | | GitHub search query (used when mode=search). Examples: stars:>10000 language:python, topic:llm, user:apify. |
| repositoryUrls | string[] | | List of repository URLs (used when mode=direct). Example: https://github.com/apify/crawlee. |
| sortBy | enum | stars | Search result ordering: stars, forks, updated, or help-wanted-issues. |
| maxResults | integer | 100 | Cap on the number of repositories returned (1-1000). |
| includeReadme | boolean | true | Fetch and include each repository's README. |
| includeTopics | boolean | true | Include the repository's topic tags. |
| includeLanguages | boolean | true | Include the byte-level language breakdown. |
| githubToken | string (secret) | | Optional personal access token. Highly recommended for runs that touch more than ~20 repositories. |

Example input — search mode

    {
      "mode": "search",
      "searchQuery": "topic:llm stars:>500 language:python",
      "sortBy": "stars",
      "maxResults": 200,
      "includeReadme": true,
      "includeTopics": true,
      "includeLanguages": true,
      "githubToken": "ghp_xxx"
    }

Example input — direct mode

    {
      "mode": "direct",
      "repositoryUrls": [
        "https://github.com/apify/crawlee",
        "https://github.com/apify/actor-templates"
      ]
    }
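
If you drive the actor programmatically, the input object follows the schema above. A hedged sketch that assembles it in Python, omitting unset optional fields (build_actor_input is a hypothetical convenience helper, not part of the actor):

```python
def build_actor_input(mode, *, search_query=None, repository_urls=None,
                      sort_by="stars", max_results=100,
                      include_readme=True, include_topics=True,
                      include_languages=True, github_token=None):
    """Assemble the actor's input object per the schema above.
    Hypothetical helper for illustration; not shipped with the actor."""
    if mode == "search" and not search_query:
        raise ValueError("search mode requires searchQuery")
    if mode == "direct" and not repository_urls:
        raise ValueError("direct mode requires repositoryUrls")
    payload = {
        "mode": mode,
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeReadme": include_readme,
        "includeTopics": include_topics,
        "includeLanguages": include_languages,
    }
    if search_query:
        payload["searchQuery"] = search_query
    if repository_urls:
        payload["repositoryUrls"] = list(repository_urls)
    if github_token:
        payload["githubToken"] = github_token  # keep this a secret value
    return payload
```

The resulting dict can be passed as the run input via the Apify API or client of your choice.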

Output

One record per repository. Fields without data are omitted.

    {
      "id": 169257834,
      "name": "crawlee",
      "fullName": "apify/crawlee",
      "owner": {
        "login": "apify",
        "id": 24586296,
        "avatarUrl": "https://avatars.githubusercontent.com/u/24586296?v=4",
        "htmlUrl": "https://github.com/apify",
        "type": "Organization"
      },
      "description": "Crawlee — a web scraping and browser automation library...",
      "htmlUrl": "https://github.com/apify/crawlee",
      "homepage": "https://crawlee.dev",
      "primaryLanguage": "TypeScript",
      "stars": 15000,
      "forks": 1000,
      "watchers": 120,
      "openIssues": 80,
      "size": 25000,
      "topics": ["scraping", "crawler", "automation"],
      "license": {
        "spdxId": "Apache-2.0",
        "name": "Apache License 2.0",
        "url": "https://api.github.com/licenses/apache-2.0"
      },
      "createdAt": "2019-02-05T12:00:00Z",
      "updatedAt": "2024-05-01T00:00:00Z",
      "pushedAt": "2024-05-02T00:00:00Z",
      "isFork": false,
      "isArchived": false,
      "isDisabled": false,
      "isTemplate": false,
      "defaultBranch": "master",
      "readme": "# Crawlee\n...",
      "languages": {"TypeScript": 987654, "JavaScript": 4321},
      "scrapedAt": "2026-04-24T12:00:00+00:00"
    }
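
The byte-level languages map is easy to post-process downstream. For example, converting byte counts into fractional shares per language (illustrative consumer code, not part of the actor's output):

```python
def language_shares(languages: dict) -> dict:
    """Turn the record's byte-level `languages` breakdown into fractional
    shares. Hypothetical post-processing helper for illustration."""
    total = sum(languages.values())
    if not total:
        return {}
    return {lang: count / total for lang, count in languages.items()}
```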

Field descriptions

  • Identity — id, name, fullName, htmlUrl, owner (login / id / avatar / URL / type)
  • Description — description, homepage, primaryLanguage
  • Engagement — stars, forks, watchers, openIssues, size (in KB)
  • Tags and license — topics array, license block with SPDX identifier
  • Timestamps — createdAt, updatedAt, pushedAt
  • Status flags — isFork, isArchived, isDisabled, isTemplate
  • Branch — defaultBranch
  • Content — readme (decoded, truncated to 500 KB), languages (bytes per language)
  • scrapedAt — ISO-8601 timestamp of this run

Error record — emitted per repository when the fetch fails:

    {
      "type": "github_repo_intelligence_error",
      "reason": "rate_limit",
      "message": "GitHub API rate limit exceeded. Supply `githubToken` to lift the limit.",
      "repoIdentifier": "apify/crawlee",
      "scrapedAt": "2026-04-24T12:00:00+00:00"
    }

Reason codes: rate_limit, not_found, invalid_url, search_failed, no_results, fetch_failed.
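
Because error records land in the same dataset as successful ones, downstream code typically splits them on the type field. A small sketch (split_records is a hypothetical helper, not part of the actor):

```python
def split_records(items):
    """Separate successful repo records from per-repo error records,
    identified by the `type` field shown above."""
    repos, errors = [], []
    for item in items:
        if item.get("type") == "github_repo_intelligence_error":
            errors.append(item)
        else:
            repos.append(item)
    return repos, errors
```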

Use cases

  • Ecosystem mapping — enumerate every repo tagged with topic:llm or topic:web3 to build a competitive map
  • Leaderboards and dashboards — rank a set of repos by stars, forks, or recent activity for internal dashboards
  • Due diligence — gather license, archived status, push cadence, and README for a list of dependencies
  • Market research — study language mix, topic distribution, and growth across a population of related projects
  • Downstream code analysis — seed an ML / static-analysis pipeline with enriched repo metadata plus READMEs

FAQ

Do I need a GitHub token? For very small runs (under ~20 repositories an hour, total) you can run without one. For anything larger, supply a personal access token — the token lifts your hourly request budget from 60 to 5000 (roughly 80x). That's because GitHub enforces a low anonymous-request cap (60 requests per IP per hour) and a much higher token-based cap (5000 requests per token per hour). Create a classic token at https://github.com/settings/tokens with the public_repo scope.

Why does each repository use multiple API calls? Fetching full metadata, the README, and the language breakdown is three API calls per repo. If you disable includeReadme, includeTopics, and includeLanguages you drop back to one call per repo — useful for big shallow scans.
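
That per-repo call count makes it easy to budget a run up front. A rough estimator, assuming (per the answer above) one metadata call per repo plus one extra call each for the README and the language breakdown, with topics riding along with the metadata call — an assumption, not documented behaviour:

```python
def estimate_api_calls(repo_count, include_readme=True, include_languages=True):
    """Rough GitHub API budget for a run. Assumes 1 metadata call per repo
    plus 1 each for README and languages when enabled (topics assumed free)."""
    per_repo = 1 + int(include_readme) + int(include_languages)
    return repo_count * per_repo
```

Compare the result against your hourly budget (60 anonymous, 5000 with a token) to decide whether you need a githubToken.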

Do I need a proxy? No. GitHub's API enforces the anonymous and token budgets described above rather than the kind of bot detection a proxy works around; with a token, the limit is per token, not per IP. Apify datacenter IPs are fine.

Can I scrape private repositories? Yes, as long as the githubToken you supply has access to them. A classic token with repo scope covers all private repos your account can read.

Why are some fields missing in my output? The actor omits empty fields to keep records compact and meaningful — a repo without a homepage simply won't have a homepage key, instead of reporting null.

How large can a README get? READMEs are truncated at 500 KB of decoded UTF-8 with a ...[truncated] marker. This keeps dataset rows well-behaved without losing the vast majority of actual README content.

Will search return more than 1000 results? No — that's GitHub's hard cap on search. For a bigger universe, slice the space with additional qualifiers (e.g. stars:1000..5000, then stars:500..999, then stars:100..499) and run once per slice.
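
Generating those slices is easy to script. A sketch that partitions a star range into stars:A..B qualifiers (star_slices is illustrative; choose the step so each slice stays under the 1000-result cap, and shrink it where repositories are dense):

```python
def star_slices(low, high, step):
    """Partition the inclusive star range [low, high] into stars:A..B
    search qualifiers of at most `step` stars each."""
    slices = []
    start = low
    while start <= high:
        end = min(start + step - 1, high)
        slices.append(f"stars:{start}..{end}")
        start = end + 1
    return slices
```

Run the actor once per slice, appending the qualifier to your base query.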

What happens when a single repo fails? The actor emits a per-repo error record and continues with the rest of the batch. One deleted or private repository never kills the whole run.

Known limitations

  • 1000-result search cap. GitHub itself caps any search query at 1000 results. For larger spaces, slice the query into ranges.
  • Anonymous rate limit is tight. Without a githubToken you get about 60 API calls per hour. Each enriched repo is up to 3 calls, so runs over ~20 repositories need a token.
  • README truncation at 500 KB. Very long READMEs (rare) are cut at 500 KB with a marker.
  • No commit, issue, or PR data. This actor focuses on repository-level metadata; commit history, issues, and pull requests are out of scope.
  • Private repos need explicit access. The githubToken must have repo scope for private repositories; public-only tokens return a not_found error record for private URLs.