GitHub Repository Scraper

Fetch full GitHub repository metadata for one or many repos in one call — stars, forks, languages, topics, license, default branch, latest release, contributor count. Free GitHub REST API, optional token for higher rate limits.

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

GitHub publishes nearly every public repo via its REST API. This Actor takes a list of owner/repo slugs (or full URLs), fans them out in parallel, and writes one richly-typed dataset row per repo — name, description, stars, forks, watchers, open issues, languages, topics, license, default branch, fork status, archive status, latest release tag, and contributor count.
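
Since the input accepts both owner/repo slugs and full URLs, the first step is presumably to normalise everything to one form. The Actor's own parsing is not documented here, so the following is only a minimal sketch under that assumption; `normalize_repo` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def normalize_repo(value: str) -> str:
    """Return 'owner/repo' for a slug or a full GitHub URL (hypothetical helper)."""
    text = value.strip()
    if "://" in text:
        # Full URL: keep only the path, e.g. /apify/crawlee-python/tree/master
        text = urlparse(text).path
    parts = [p for p in text.split("/") if p]
    # Tolerate scheme-less URLs such as github.com/owner/repo.
    if parts and parts[0].lower() in {"github.com", "www.github.com"}:
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"not an owner/repo slug or GitHub URL: {value!r}")
    owner, repo = parts[0], parts[1]
    # Clone URLs often end in .git; the API slug does not.
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{owner}/{repo}"


print(normalize_repo("https://github.com/apify/crawlee-python"))
```

Extra path segments (branches, file paths) are simply dropped, so a link copied from anywhere inside a repo still resolves to that repo's slug.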

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing — when the target pushes back, we slow down instead of getting banned.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge.
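
The retry behaviour described in the list above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured) can be sketched roughly as follows. This is an illustration of the general technique, not the Actor's actual code; `backoff_delay`, `fetch_with_retries`, the 1-second base and the 60-second cap are all assumptions.

```python
import time
from typing import Callable, Optional, Tuple


def is_retryable(status: int) -> bool:
    # 408 (timeout), 429 (rate limited) and any 5xx are worth retrying.
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return min(retry_after, cap)  # the server's Retry-After hint wins
    return min(cap, base * 2 ** (attempt - 1))  # 1s, 2s, 4s, 8s, ...


def fetch_with_retries(send: Callable[[], Tuple[int, Optional[float]]],
                       max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` is a stand-in for one HTTP request; it returns
    (status_code, retry_after_seconds_or_None).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(1, max_attempts + 1):
        status, retry_after = send()
        if not is_retryable(status) or attempt == max_attempts:
            break
        sleep(backoff_delay(attempt, retry_after))
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; production code would usually also add random jitter to the delay.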

💡 Use cases

  • Open-source watch list — track stars/forks across competitor projects and pipe to Slack.
  • Dependency intel — feed the list of repos your stack depends on and flag archived/unmaintained ones.
  • Hiring research — pull metadata for repos a candidate contributed to.
  • M&A diligence — quantify the open-source surface area of a target company.
  • Newsletter automation — ingest a curated list of repos weekly, diff stars.
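
As an illustration of the dependency-intel use case, the sketch below flags archived or long-unpushed repos from exported dataset rows. The field names come from the Output table; the helper name `flag_risky` and the 365-day staleness threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    # GitHub timestamps end in 'Z'; older Pythons need '+00:00' instead.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def flag_risky(rows, now, stale_after_days=365):
    """Return (full_name, reason) pairs for archived or long-unpushed repos."""
    cutoff = now - timedelta(days=stale_after_days)
    flagged = []
    for row in rows:
        if row.get("archived"):
            flagged.append((row["full_name"], "archived"))
        elif parse_ts(row["pushed_at"]) < cutoff:
            flagged.append((row["full_name"], "stale"))
    return flagged


# Made-up sample rows, for illustration only.
rows = [
    {"full_name": "x/a", "archived": True, "pushed_at": "2024-12-01T00:00:00Z"},
    {"full_name": "x/b", "archived": False, "pushed_at": "2023-06-01T00:00:00Z"},
]
print(flag_risky(rows, now=datetime(2025, 1, 1, tzinfo=timezone.utc)))
```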

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults.
  3. Click Start. Output streams into the run's dataset.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
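
For step 4, fetching via the API could look like the sketch below, which uses only the standard library and Apify's dataset items endpoint. The dataset ID and token are placeholders you would take from your own run; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json", token: str = "") -> str:
    """Build the URL for downloading a dataset's items in the given format."""
    params = {"format": fmt}
    if token:
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id: str, token: str = "") -> list:
    """Download all items as parsed JSON (performs a network request)."""
    with urlopen(dataset_items_url(dataset_id, "json", token)) as resp:
        return json.load(resp)


# "abc123" is a placeholder dataset ID, not a real one.
print(dataset_items_url("abc123", "csv"))
```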

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| repos | array | yes | ["apify/apify-sdk-python", "apify/crawlee-python"] | List of repos in owner/repo form (e.g. apify/apify-sdk-python) or as full GitHub URLs. Each in… |
| githubToken | string | no | (none) | Personal access token. Without one you get 60 requests/hour; with one, 5 000/hour. Read-only access to public repos is s… |
| includeLanguages | boolean | no | true | Adds a languages map (language → bytes) per repo. One extra API call per repo. |
| includeLatestRelease | boolean | no | true | Adds latest_release_tag and latest_release_published_at. One extra API call per repo. |
| concurrency | integer | no | 6 | Parallel API requests. 8 is comfortable with a token; 2-3 without. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Apify Proxy is optional: the GitHub API does not block based on IP. Use proxy only if your enterprise routing requires… |

Example input

{
  "repos": [
    "apify/apify-sdk-python"
  ],
  "includeLanguages": true,
  "includeLatestRelease": true,
  "concurrency": 4,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
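
The input options determine how many API requests a run makes, which matters against the 60/hour and 5 000/hour limits quoted in the table. A rough budget check might look like this; it assumes one base metadata call per repo plus one per enabled option, and it ignores any extra calls the Actor may make (for example, for the contributor count).

```python
UNAUTH_LIMIT = 60    # requests/hour without a token (from the Input table)
AUTH_LIMIT = 5000    # requests/hour with a token (from the Input table)


def requests_needed(n_repos, include_languages=True, include_latest_release=True):
    # Assumption: 1 base call per repo, +1 for each enabled option.
    per_repo = 1 + int(include_languages) + int(include_latest_release)
    return n_repos * per_repo


def fits_in_one_hour(n_repos, has_token, **options):
    limit = AUTH_LIMIT if has_token else UNAUTH_LIMIT
    return requests_needed(n_repos, **options) <= limit


print(requests_needed(20))          # 20 repos x 3 calls each
print(fits_in_one_hour(21, False))  # one repo too many without a token
```

Under these assumptions, roughly 20 repos fit in an hour without a token when both options are on; disabling the options or supplying a token raises that ceiling.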

📤 Output

Every row is one dataset item.

| Field | Type | Notes |
| --- | --- | --- |
| owner | string | Owner / organisation login. |
| name | string | Repository name (without owner prefix). |
| full_name | string | Full slug: owner/name. |
| html_url | string | Canonical GitHub URL. |
| description | string or null | Repository tagline. |
| fork | boolean | True if this is a fork of another repo. |
| archived | boolean | True if the repo is archived (read-only). |
| disabled | boolean | True if the repo is disabled. |
| stargazers_count | integer | Star count at scrape time. |
| forks_count | integer | Fork count. |
| watchers_count | integer | Watcher count (subscribers). |
| open_issues_count | integer | Open issues + PRs. |
| size_kb | integer | Repository size in kilobytes. |
| language | string or null | Primary language (GitHub's classification). |
| languages | object or null | Map of language → bytes, when includeLanguages=true. |
| topics | array | Repository topics (tags). |
| license | string or null | SPDX license identifier (e.g. MIT, Apache-2.0). |
| default_branch | string | Default branch name (usually main). |
| homepage | string or null | User-supplied homepage URL. |
| created_at | string | Repo creation timestamp (ISO-8601). |
| updated_at | string | Last metadata update timestamp. |
| pushed_at | string | Last commit push timestamp. |
| latest_release_tag | string or null | Tag of the latest GitHub release (when includeLatestRelease=true). |
| latest_release_published_at | string or null | Publish timestamp of the latest release. |
| scraped_at | string | When this row was recorded (ISO-8601 UTC). |

Example output

{
  "owner": "apify",
  "name": "apify-sdk-python",
  "full_name": "apify/apify-sdk-python",
  "html_url": "https://github.com/apify/apify-sdk-python",
  "description": "The Apify SDK for Python.",
  "fork": false,
  "archived": false,
  "stargazers_count": 415,
  "forks_count": 41,
  "language": "Python",
  "topics": [
    "apify",
    "scraping",
    "sdk"
  ],
  "license": "Apache-2.0",
  "default_branch": "main",
  "latest_release_tag": "v3.4.0"
}
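
When consuming rows downstream, it can help to load them into a small typed structure that tolerates the nullable fields. The sketch below uses a standard-library dataclass covering only a subset of the documented fields; `RepoRow` is a hypothetical name and is not the Actor's internal model.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class RepoRow:
    full_name: str
    stargazers_count: int
    forks_count: int
    archived: bool = False
    language: Optional[str] = None
    license: Optional[str] = None
    latest_release_tag: Optional[str] = None
    topics: List[str] = field(default_factory=list)

    @classmethod
    def from_item(cls, item: dict) -> "RepoRow":
        # Keep only the fields this sketch models; ignore everything else.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in names})


item = json.loads(
    '{"owner": "apify", "full_name": "apify/apify-sdk-python",'
    ' "stargazers_count": 415, "forks_count": 41, "topics": ["apify"]}'
)
row = RepoRow.from_item(item)
print(row.full_name, row.stargazers_count, row.latest_release_tag)
```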

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Per dataset item |

Example: 1 000 results cost 1 000 × $0.002 = $2.00, plus the one-off $0.005 actor-start charge, for $2.005 in total for a single run. No subscription, no minimum, no card to start. Apify gives every new account $5 of free credit.
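
The arithmetic can be reproduced with a few lines of Python. This sketch uses the two event prices from the table and `Decimal` to avoid floating-point rounding; `estimate_cost` is a hypothetical helper, and it assumes every run is charged one actor-start event.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # per run, from the pricing table
PER_RESULT = Decimal("0.002")   # per dataset item, from the pricing table


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for `results` dataset items spread over `runs` runs."""
    return ACTOR_START * runs + PER_RESULT * results


print(estimate_cost(1000))          # one run with 1 000 results
print(estimate_cost(100, runs=4))   # e.g. a weekly job with 25 repos per run
```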

🚧 Limitations

Private repos require a token with the right scopes — we don't probe them. README content, code, and commit graphs are out of scope for this Actor (use GitHub's own search or a dedicated commits scraper).

❓ FAQ

Do I need a GitHub token?

Not for small batches — the unauthenticated API gives you 60 requests/hour. Provide a token to lift that to 5 000/hour.
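
For reference, a token changes only the request headers sent to the GitHub REST API. The sketch below builds headers in the form GitHub documents for its REST API; how the Actor itself sets them is not stated here, so treat `github_headers` as an illustrative helper.

```python
from typing import Dict, Optional


def github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Headers for a GitHub REST API request, with optional authentication."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        # Authenticated requests get the higher hourly rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


print(github_headers())  # unauthenticated: no Authorization header
```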

Can I scrape private repos?

No. A token grants access to whatever repos that token can see, but this Actor is designed for public-data extraction. Don't reuse production tokens here.

Why are stars/forks slightly off?

GitHub caches some counts for a few minutes. Compare runs ≥5 min apart to see real movement.
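
One way to apply that advice when diffing two runs is to refuse to compare rows scraped less than five minutes apart. The sketch below relies on the scraped_at and stargazers_count fields from the Output table; `star_delta` is a hypothetical helper and the sample rows are made up.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=5)


def _parse(value: str) -> datetime:
    # Accept both a trailing 'Z' and an explicit '+00:00' offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def star_delta(earlier: dict, later: dict):
    """Return the star change, or None when the rows are under 5 minutes apart."""
    if _parse(later["scraped_at"]) - _parse(earlier["scraped_at"]) < MIN_GAP:
        return None
    return later["stargazers_count"] - earlier["stargazers_count"]


# Made-up sample rows, for illustration only.
a = {"scraped_at": "2025-01-01T00:00:00Z", "stargazers_count": 415}
b = {"scraped_at": "2025-01-01T00:10:00Z", "stargazers_count": 420}
print(star_delta(a, b))
```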

What if a repo doesn't exist?

The Actor logs a warning and skips that repo; rows for the remaining repos are still written, so you get a partial dataset.
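
Because skipped repos only show up in the log, you may want to detect them from the dataset itself. A minimal sketch, assuming the requested entries are already in owner/repo form and comparing case-insensitively (GitHub returns the canonical casing in full_name); `missing_repos` is a hypothetical helper.

```python
def missing_repos(requested, rows):
    """Return requested slugs that have no matching row in the dataset."""
    found = {row["full_name"].lower() for row in rows}
    return [slug for slug in requested if slug.lower() not in found]


requested = ["apify/apify-sdk-python", "apify/does-not-exist"]
rows = [{"full_name": "apify/apify-sdk-python"}]
print(missing_repos(requested, rows))
```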

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab in Apify Console. We ship fixes weekly and read every report.