GitHub Repository Intelligence

Fetch rich metadata (stars, forks, README, languages, topics, license) from GitHub repositories. Search by query or provide direct URLs. Optional GitHub token for 80x higher rate limit.

Pricing: from $1.00 / 1,000 results
Developer: Crawler Bros (Maintained by Community)

Pull rich, structured metadata about any public GitHub repository — stars, forks, languages, topics, license, README, and full timestamps — either by search query or from a list of repository URLs.

What this actor does

GitHub's REST API exposes a deep well of per-repository metadata, but stitching it together across many repos (metadata + README + language breakdown + topics) is awkward to do by hand. This actor does that stitching for you and returns one clean, denormalised record per repository, ready for a dataset, a CSV export, or a downstream analytics job.

It has two modes. In search mode you hand it any GitHub search query — stars:>10000 language:python, topic:llm, user:apify, or whatever qualifier expression you like — and it returns up to 1000 matching repos, each fully enriched. In direct mode you hand it a list of https://github.com/owner/repo URLs and it fetches them in parallel.

For each repository the actor returns full metadata (stars, forks, open issues, watchers, size, homepage, default branch, timestamps, archived / template / disabled flags), the README text (base64-decoded), the list of topics, a byte-level language breakdown, and a normalised license block with SPDX identifier.

Key features

  • Two modes — search (any GitHub search query, up to 1000 results) or direct (explicit repository URLs)
  • Rich per-repo metadata — identity, owner, stars, forks, open issues, watchers, size, homepage, default branch, timestamps, and status flags
  • README — fetched, decoded, and included inline (truncated at 500 KB)
  • Topics — the tags the repository owner has assigned
  • Language breakdown — byte counts per language
  • License block — SPDX id, full name, and URL
  • Optional githubToken — lifts the rate limit from 60 to 5000 requests per hour (roughly 80x), letting you scale up to thousands of repositories per run
  • Resilient — automatic retry on transient errors, rate-limit detection, per-repo error sentinels so one bad repo doesn't kill the batch
  • Zero-setup — no browser, no cookies, no proxy required
  • Zero-null output — empty fields are omitted from each record
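
To make the README handling concrete: GitHub's API returns README content base64-encoded, and the actor decodes it and truncates at 500 KB. The helper below is an illustrative reconstruction under those assumptions (decode_readme is not the actor's code, and the exact truncation boundary is a guess):

```python
import base64

README_LIMIT = 500 * 1024  # the 500 KB cap described above

def decode_readme(payload: dict) -> str:
    """Decode GitHub's base64 README payload and apply the truncation the
    actor describes. A sketch of assumed behaviour, not the actor's code."""
    raw = base64.b64decode(payload["content"])
    if len(raw) > README_LIMIT:
        # Truncate the raw bytes, then decode; append the documented marker.
        return raw[:README_LIMIT].decode("utf-8", errors="ignore") + "...[truncated]"
    return raw.decode("utf-8", errors="replace")
```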

Input

| Field | Type | Default | Description |
|---|---|---|---|
| mode | enum | search | search to run a GitHub search query, or direct to fetch a list of URLs. |
| searchQuery | string | | GitHub search query (used when mode=search). Examples: stars:>10000 language:python, topic:llm, user:apify. |
| repositoryUrls | string[] | | List of repository URLs (used when mode=direct). Example: https://github.com/apify/crawlee. |
| sortBy | enum | stars | Search result ordering: stars, forks, updated, or help-wanted-issues. |
| maxResults | integer | 100 | Cap on the number of repositories returned (1-1000). |
| includeReadme | boolean | true | Fetch and include each repository's README. |
| includeTopics | boolean | true | Include the repository's topic tags. |
| includeLanguages | boolean | true | Include the byte-level language breakdown. |
| githubToken | string (secret) | | Optional personal access token. Highly recommended for runs that touch more than ~20 repositories. |

Example input — search mode

    {
      "mode": "search",
      "searchQuery": "topic:llm stars:>500 language:python",
      "sortBy": "stars",
      "maxResults": 200,
      "includeReadme": true,
      "includeTopics": true,
      "includeLanguages": true,
      "githubToken": "ghp_xxx"
    }

Example input — direct mode

    {
      "mode": "direct",
      "repositoryUrls": [
        "https://github.com/apify/crawlee",
        "https://github.com/apify/actor-templates"
      ]
    }
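
If you drive the actor programmatically, the input object follows the schema above. A hedged sketch that assembles it in Python, omitting unset optional fields (build_actor_input is a hypothetical convenience helper, not part of the actor):

```python
def build_actor_input(mode, *, search_query=None, repository_urls=None,
                      sort_by="stars", max_results=100,
                      include_readme=True, include_topics=True,
                      include_languages=True, github_token=None):
    """Assemble the actor's input object per the schema above.
    Hypothetical helper for illustration; not shipped with the actor."""
    if mode == "search" and not search_query:
        raise ValueError("search mode requires searchQuery")
    if mode == "direct" and not repository_urls:
        raise ValueError("direct mode requires repositoryUrls")
    payload = {
        "mode": mode,
        "sortBy": sort_by,
        "maxResults": max_results,
        "includeReadme": include_readme,
        "includeTopics": include_topics,
        "includeLanguages": include_languages,
    }
    if search_query:
        payload["searchQuery"] = search_query
    if repository_urls:
        payload["repositoryUrls"] = list(repository_urls)
    if github_token:
        payload["githubToken"] = github_token  # keep this a secret value
    return payload
```

The resulting dict can be passed as the run input via the Apify API or client of your choice.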

Output

One record per repository. Fields without data are omitted.

    {
      "id": 169257834,
      "name": "crawlee",
      "fullName": "apify/crawlee",
      "owner": {
        "login": "apify",
        "id": 24586296,
        "avatarUrl": "https://avatars.githubusercontent.com/u/24586296?v=4",
        "htmlUrl": "https://github.com/apify",
        "type": "Organization"
      },
      "description": "Crawlee — a web scraping and browser automation library...",
      "htmlUrl": "https://github.com/apify/crawlee",
      "homepage": "https://crawlee.dev",
      "primaryLanguage": "TypeScript",
      "stars": 15000,
      "forks": 1000,
      "watchers": 120,
      "openIssues": 80,
      "size": 25000,
      "topics": ["scraping", "crawler", "automation"],
      "license": {
        "spdxId": "Apache-2.0",
        "name": "Apache License 2.0",
        "url": "https://api.github.com/licenses/apache-2.0"
      },
      "createdAt": "2019-02-05T12:00:00Z",
      "updatedAt": "2024-05-01T00:00:00Z",
      "pushedAt": "2024-05-02T00:00:00Z",
      "isFork": false,
      "isArchived": false,
      "isDisabled": false,
      "isTemplate": false,
      "defaultBranch": "master",
      "readme": "# Crawlee\n...",
      "languages": {"TypeScript": 987654, "JavaScript": 4321},
      "scrapedAt": "2026-04-24T12:00:00+00:00"
    }
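
The byte-level languages map is easy to post-process downstream. For example, converting byte counts into fractional shares per language (illustrative consumer code, not part of the actor's output):

```python
def language_shares(languages: dict) -> dict:
    """Turn the record's byte-level `languages` breakdown into fractional
    shares. Hypothetical post-processing helper for illustration."""
    total = sum(languages.values())
    if not total:
        return {}
    return {lang: count / total for lang, count in languages.items()}
```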

Field descriptions

  • Identity — id, name, fullName, htmlUrl, owner (login / id / avatar / URL / type)
  • Description — description, homepage, primaryLanguage
  • Engagement — stars, forks, watchers, openIssues, size (in KB)
  • Tags and license — topics array, license block with SPDX identifier
  • Timestamps — createdAt, updatedAt, pushedAt
  • Status flags — isFork, isArchived, isDisabled, isTemplate
  • Branch — defaultBranch
  • Content — readme (decoded, truncated to 500 KB), languages (bytes per language)
  • scrapedAt — ISO-8601 timestamp of this run

Error record — emitted per repository when the fetch fails:

    {
      "type": "github_repo_intelligence_error",
      "reason": "rate_limit",
      "message": "GitHub API rate limit exceeded. Supply `githubToken` to lift the limit.",
      "repoIdentifier": "apify/crawlee",
      "scrapedAt": "2026-04-24T12:00:00+00:00"
    }

Reason codes: rate_limit, not_found, invalid_url, search_failed, no_results, fetch_failed.
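
Because error records land in the same dataset as successful ones, downstream code typically splits them on the type field. A small sketch (split_records is a hypothetical helper, not part of the actor):

```python
def split_records(items):
    """Separate successful repo records from per-repo error records,
    identified by the `type` field shown above."""
    repos, errors = [], []
    for item in items:
        if item.get("type") == "github_repo_intelligence_error":
            errors.append(item)
        else:
            repos.append(item)
    return repos, errors
```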

Use cases

  • Ecosystem mapping — enumerate every repo tagged with topic:llm or topic:web3 to build a competitive map
  • Leaderboards and dashboards — rank a set of repos by stars, forks, or recent activity for internal dashboards
  • Due diligence — gather license, archived status, push cadence, and README for a list of dependencies
  • Market research — study language mix, topic distribution, and growth across a population of related projects
  • Downstream code analysis — seed an ML / static-analysis pipeline with enriched repo metadata plus READMEs

FAQ

Do I need a GitHub token? For very small runs (under ~20 repositories an hour, total) you can run without one. For anything larger, supply a personal access token — the token lifts your hourly request budget from 60 to 5000 (roughly 80x). That's because GitHub enforces a low anonymous-request cap (60 requests per IP per hour) and a much higher token-based cap (5000 requests per token per hour). Create a classic token at https://github.com/settings/tokens with the public_repo scope.

Why does each repository use multiple API calls? Fetching full metadata, the README, and the language breakdown is three API calls per repo. If you disable includeReadme, includeTopics, and includeLanguages you drop back to one call per repo — useful for big shallow scans.
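
That per-repo call count makes it easy to budget a run up front. A rough estimator, assuming (per the answer above) one metadata call per repo plus one extra call each for the README and the language breakdown, with topics riding along with the metadata call — an assumption, not documented behaviour:

```python
def estimate_api_calls(repo_count, include_readme=True, include_languages=True):
    """Rough GitHub API budget for a run. Assumes 1 metadata call per repo
    plus 1 each for README and languages when enabled (topics assumed free)."""
    per_repo = 1 + int(include_readme) + int(include_languages)
    return repo_count * per_repo
```

Compare the result against your hourly budget (60 anonymous, 5000 with a token) to decide whether you need a githubToken.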

Do I need a proxy? No. GitHub's API enforces the anonymous and token budgets described above rather than the kind of bot detection a proxy works around; with a token, the limit is per token, not per IP. Apify datacenter IPs are fine.

Can I scrape private repositories? Yes, as long as the githubToken you supply has access to them. A classic token with repo scope covers all private repos your account can read.

Why are some fields missing in my output? The actor omits empty fields to keep records compact and meaningful — a repo without a homepage simply won't have a homepage key, instead of reporting null.

How large can a README get? READMEs are truncated at 500 KB of decoded UTF-8 with a ...[truncated] marker. This keeps dataset rows well-behaved without losing the vast majority of actual README content.

Will search return more than 1000 results? No — that's GitHub's hard cap on search. For a bigger universe, slice the space with additional qualifiers (e.g. stars:1000..5000, then stars:500..999, then stars:100..499) and run once per slice.
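
Generating those slices is easy to script. A sketch that partitions a star range into stars:A..B qualifiers (star_slices is illustrative; choose the step so each slice stays under the 1000-result cap, and shrink it where repositories are dense):

```python
def star_slices(low, high, step):
    """Partition the inclusive star range [low, high] into stars:A..B
    search qualifiers of at most `step` stars each."""
    slices = []
    start = low
    while start <= high:
        end = min(start + step - 1, high)
        slices.append(f"stars:{start}..{end}")
        start = end + 1
    return slices
```

Run the actor once per slice, appending the qualifier to your base query.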

What happens when a single repo fails? The actor emits a per-repo error record and continues with the rest of the batch. One deleted or private repository never kills the whole run.

Known limitations

  • 1000-result search cap. GitHub itself caps any search query at 1000 results. For larger spaces, slice the query into ranges.
  • Anonymous rate limit is tight. Without a githubToken you get about 60 API calls per hour. Each enriched repo is up to 3 calls, so runs over ~20 repositories need a token.
  • README truncation at 500 KB. Very long READMEs (rare) are cut at 500 KB with a marker.
  • No commit, issue, or PR data. This actor focuses on repository-level metadata; commit history, issues, and pull requests are out of scope.
  • Private repos need explicit access. The githubToken must have repo scope for private repositories; public-only tokens return a not_found error record for private URLs.