GitHub Repository Scraper
Pricing
Pay per event
Fetch full GitHub repository metadata for one or many repos in one call — stars, forks, languages, topics, license, default branch, latest release, contributor count. Free GitHub REST API, optional token for higher rate limits.
Developer: DevilScrapes (Maintained by Community)
🎯 What this scrapes
GitHub publishes nearly every public repo via its REST API. This Actor takes a list of owner/repo slugs (or full URLs), fans them out in parallel, and writes one richly-typed dataset row per repo — name, description, stars, forks, watchers, open issues, languages, topics, license, default branch, fork status, archive status, latest release tag, and contributor count.
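The parallel fan-out described above can be sketched with a semaphore that caps the number of in-flight requests. This is a minimal illustration, not the Actor's actual code: `fetch_repo` is a hypothetical stand-in for the real GitHub API call, and `fan_out` is a name invented for this example.

```python
import asyncio


async def fetch_repo(slug: str) -> dict:
    # Hypothetical stand-in for the real HTTP request to the GitHub REST API.
    await asyncio.sleep(0)
    return {"full_name": slug}


async def fan_out(slugs: list[str], concurrency: int = 6) -> list[dict]:
    # The semaphore caps how many fetches run at the same time.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(slug: str) -> dict:
        async with semaphore:
            return await fetch_repo(slug)

    # gather() returns results in the same order as the input slugs.
    return await asyncio.gather(*(bounded(s) for s in slugs))


rows = asyncio.run(
    fan_out(["apify/apify-sdk-python", "apify/crawlee-python"], concurrency=2)
)
```

Each slug yields one row, and the `concurrency` argument plays the same role as the `concurrency` input field.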
🔥 What we handle for you
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
- 🌐 Residential proxy rotation via Apify Proxy: fresh session and exit IP on every block.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per page, `Retry-After` honoured.
- 🧱 Rate-limit-aware pacing: when the target pushes back, we slow down instead of getting banned.
- 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
- 💰 Pay-Per-Event pricing: you only pay for results that hit your dataset. No data, no charge.
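As a rough illustration of the retry behaviour listed above, the sketch below retries on 408 / 429 / 5xx, makes at most 5 attempts, and prefers a `Retry-After` value over its own exponential schedule. The function names, the 1-second base delay, and the 60-second cap are assumptions made for this example. It only handles `Retry-After` given in seconds, and `send` is a hypothetical callable standing in for the real HTTP request.

```python
import time
from typing import Callable, Optional

RETRYABLE = {408, 429} | set(range(500, 600))


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None and retry_after.isdigit():
        # A Retry-After header in seconds overrides the computed schedule.
        return min(float(retry_after), cap)
    # Exponential schedule: base, 2*base, 4*base, ... up to the cap.
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(send: Callable[[], tuple[int, dict]],
                       max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> tuple[int, dict]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status, headers = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        status, headers = send()
    return status, headers
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.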
💡 Use cases
- Open-source watch list — track stars/forks across competitor projects and pipe to Slack.
- Dependency intel — feed the list of repos your stack depends on and flag archived/unmaintained ones.
- Hiring research — pull metadata for repos a candidate contributed to.
- M&A diligence — quantify the open-source surface area of a target company.
- Newsletter automation — ingest a curated list of repos weekly, diff stars.
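For the watch-list and newsletter cases, diffing stars between two exports could look roughly like this. `star_deltas` is a name invented here, and the star counts are made-up sample values. The rows use the `full_name` and `stargazers_count` output fields.

```python
def star_deltas(previous: list[dict], current: list[dict]) -> dict[str, int]:
    """Map full_name to the change in stargazers_count between two exports."""
    before = {row["full_name"]: row["stargazers_count"] for row in previous}
    return {
        row["full_name"]: row["stargazers_count"] - before[row["full_name"]]
        for row in current
        if row["full_name"] in before  # repos new to the list have no baseline
    }


# Made-up sample rows, trimmed to the two fields the diff needs.
last_week = [{"full_name": "apify/apify-sdk-python", "stargazers_count": 400}]
this_week = [
    {"full_name": "apify/apify-sdk-python", "stargazers_count": 415},
    {"full_name": "apify/crawlee-python", "stargazers_count": 10},
]
deltas = star_deltas(last_week, this_week)
```

Repos that appear only in the newer export are skipped because there is nothing to compare them against.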
⚙️ How to use it
- Click Try for free at the top of the page.
- Fill in the input form — most fields have sensible defaults.
- Click Start. Output streams into the run's dataset.
- Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
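For the API route, one possible shape using the `apify-client` Python package is sketched below. The Actor ID is a placeholder because this page does not state it, the `APIFY_TOKEN` environment variable name is an assumption, and both helper functions are invented for this example.

```python
import os


def build_run_input(repos: list[str], concurrency: int = 4) -> dict:
    """Assemble an input object using the Actor's documented field names."""
    return {
        "repos": repos,
        "includeLanguages": True,
        "includeLatestRelease": True,
        "concurrency": concurrency,
        "proxyConfiguration": {"useApifyProxy": False},
    }


def run_and_fetch(actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(["apify/apify-sdk-python"])
# rows = run_and_fetch("<username>/<actor-name>", run_input)  # placeholder Actor ID
```

The last line stays commented out because it needs a real Actor ID and an Apify API token.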
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `repos` | array | yes | `["apify/apify-sdk-python", "apify/crawlee-python"]` | List of repos in `owner/repo` form (e.g. `apify/apify-sdk-python`) or as full GitHub URLs. Each in |
| `githubToken` | string | no | none | Personal access token. Without one you get 60 requests/hour; with one, 5 000/hour. Read-only access to public repos is s |
| `includeLanguages` | boolean | no | `true` | Adds a `languages` map (language → bytes) per repo. One extra API call per repo. |
| `includeLatestRelease` | boolean | no | `true` | Adds `latest_release_tag` and `latest_release_published_at`. One extra API call per repo. |
| `concurrency` | integer | no | `6` | Parallel API requests. 8 is comfortable with a token; 2-3 without. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Apify Proxy is optional because the GitHub API does not block based on IP. Use proxy only if your enterprise routing requires |
Example input
{
  "repos": ["apify/apify-sdk-python"],
  "includeLanguages": true,
  "includeLatestRelease": true,
  "concurrency": 4,
  "proxyConfiguration": {"useApifyProxy": false}
}
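Because `repos` accepts both `owner/repo` slugs and full GitHub URLs, it can help to normalise a list before submitting it. The helper below is a hypothetical pre-processing step, not the Actor's own parser, and its acceptance rules are assumptions: it accepts only `github.com` hosts and strips a trailing `.git`.

```python
from urllib.parse import urlparse


def normalize_repo(value: str) -> str:
    """Return 'owner/repo' for a slug or a full GitHub URL, else raise ValueError."""
    text = value.strip()
    if "://" in text:
        parsed = urlparse(text)
        if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
            raise ValueError(f"not a GitHub URL: {value!r}")
        text = parsed.path
    # Drop empty segments left by leading or trailing slashes.
    parts = [p for p in text.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected owner/repo, got {value!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


slugs = [normalize_repo(v) for v in [
    "apify/apify-sdk-python",
    "https://github.com/apify/crawlee-python/",
]]
```

Any path segments after the repository name, such as `/tree/master`, are ignored.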
📤 Output
Every row is one dataset item.
| Field | Type | Notes |
|---|---|---|
| `owner` | string | Owner / organisation login. |
| `name` | string | Repository name (without owner prefix). |
| `full_name` | string | Full slug, `owner/name`. |
| `html_url` | string | Canonical GitHub URL. |
| `description` | string or null | Repository tagline. |
| `fork` | boolean | `true` if this is a fork of another repo. |
| `archived` | boolean | `true` if the repo is archived (read-only). |
| `disabled` | boolean | `true` if the repo is disabled. |
| `stargazers_count` | integer | Star count at scrape time. |
| `forks_count` | integer | Fork count. |
| `watchers_count` | integer | Watcher count (subscribers). |
| `open_issues_count` | integer | Open issues + PRs. |
| `size_kb` | integer | Repository size in kilobytes. |
| `language` | string or null | Primary language (GitHub's classification). |
| `languages` | object or null | Map of language → bytes, when `includeLanguages=true`. |
| `topics` | array | Repository topics (tags). |
| `license` | string or null | SPDX license identifier (e.g. `MIT`, `Apache-2.0`). |
| `default_branch` | string | Default branch name (usually `main`). |
| `homepage` | string or null | User-supplied homepage URL. |
| `created_at` | string | Repo creation timestamp (ISO-8601). |
| `updated_at` | string | Last metadata update timestamp. |
| `pushed_at` | string | Last commit push timestamp. |
| `latest_release_tag` | string or null | Tag of the latest GitHub release (when `includeLatestRelease=true`). |
| `latest_release_published_at` | string or null | Publish timestamp of the latest release. |
| `scraped_at` | string | When this row was recorded (ISO-8601 UTC). |
Example output
{
  "owner": "apify",
  "name": "apify-sdk-python",
  "full_name": "apify/apify-sdk-python",
  "html_url": "https://github.com/apify/apify-sdk-python",
  "description": "The Apify SDK for Python.",
  "fork": false,
  "archived": false,
  "stargazers_count": 415,
  "forks_count": 41,
  "language": "Python",
  "topics": ["apify", "scraping", "sdk"],
  "license": "Apache-2.0",
  "default_branch": "main",
  "latest_release_tag": "v3.4.0"
}
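Rows in this shape are easy to post-process. As one example, for the dependency-intel use case, the sketch below flags repos that are archived or have had no push for a while. The function name and the 365-day threshold are assumptions, and it expects `pushed_at` to be an ISO-8601 string with a `Z` suffix or an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone


def needs_attention(row: dict, now: datetime, stale_after_days: int = 365) -> bool:
    """True if the repo is archived or has had no push within `stale_after_days`."""
    if row.get("archived"):
        return True
    pushed_at = row.get("pushed_at")
    if not pushed_at:
        return False
    # Swap a trailing 'Z' for an explicit offset so that fromisoformat()
    # also accepts the value on Python versions before 3.11.
    pushed = datetime.fromisoformat(pushed_at.replace("Z", "+00:00"))
    return now - pushed > timedelta(days=stale_after_days)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
flag = needs_attention({"archived": False, "pushed_at": "2022-01-01T00:00:00Z"}, now)
```

`now` is passed in explicitly so the check is reproducible. In a real script it would be `datetime.now(timezone.utc)`.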
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.002 | Per dataset item |
Example: a single run that returns 1 000 results costs 1 000 × $0.002 + $0.005 = $2.005, or roughly $2.00. No subscription, no minimum, no card to start. Apify gives every new account $5 of free credit.
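The arithmetic can be checked in a few lines using the two event prices from the table. `run_cost` is a helper invented for this example, and `Decimal` is used to avoid floating-point rounding in the total.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per dataset item


def run_cost(results: int, runs: int = 1) -> Decimal:
    """Total charge in USD for `results` dataset items spread over `runs` runs."""
    return runs * ACTOR_START_USD + results * RESULT_USD


print(run_cost(1000))  # 2.005
```

Splitting the same 1 000 results across more runs only adds one `actor-start` charge per extra run.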
🚧 Limitations
Private repos require a token with the right scopes — we don't probe them. README content, code, and commit graphs are out of scope for this Actor (use GitHub's own search or a dedicated commits scraper).
❓ FAQ
Do I need a GitHub token?
Not for small batches — the unauthenticated API gives you 60 requests/hour. Provide a token to lift that to 5 000/hour.
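To judge what counts as a small batch, a back-of-the-envelope estimate is sketched below. It assumes one base API call per repo plus one for each enabled extra, as the input notes describe. It ignores retries and any further calls the Actor may make, for example for the contributor count, so treat the result as an upper bound.

```python
def repos_per_hour(has_token: bool, include_languages: bool = True,
                   include_latest_release: bool = True) -> int:
    """Rough upper bound on repos that fit in one hour of GitHub rate limit."""
    hourly_limit = 5000 if has_token else 60
    # One call for the repo itself, plus one per enabled extra.
    calls_per_repo = 1 + int(include_languages) + int(include_latest_release)
    return hourly_limit // calls_per_repo


print(repos_per_hour(has_token=False))  # 20
print(repos_per_hour(has_token=True))   # 1666
```

With both extras switched off, the unauthenticated budget stretches to 60 repos per hour under the same assumptions.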
Can I scrape private repos?
No. A token grants access to whatever repos that token can see, but this Actor is designed for public-data extraction. Don't reuse production tokens here.
Why are stars/forks slightly off?
GitHub caches some counts for a few minutes. Compare runs ≥5 min apart to see real movement.
What if a repo doesn't exist?
The Actor logs a warning and skips that repo. The rows for the remaining repos are still written to the dataset.
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab in Apify Console. We ship fixes weekly and we read every report.