GitHub Repository Scraper

Pricing

from $0.99 / 1,000 results

Scrape GitHub repositories by search query - stars, forks, language, topics, owner, license and activity dates. Track trending projects, competitor repos or developer activity.

Rating

0.0

(0)

Developer

Logiover

Maintained by Community

Actor stats

0

Bookmarked

10

Total users

2

Monthly active users

2 days ago

Last modified

🐙 GitHub Repository Scraper - Repo Search, Stars, Topics & Trending Data to JSON & CSV

Bulk-scrape public GitHub repositories via the official GitHub Search API: by query, qualifier (language, topic, stars, forks, created date), sort and order. Get back a flat, structured dataset of every matching repo: full name, owner (user or organization), description, homepage, primary language, topics, stars, forks, open issues, watchers, license, archived flag, created date, updated date and last-push date.

Built for VCs scouting open-source momentum, security teams tracking vulnerable dependencies, recruiters sourcing engineers by stack, OSS maintainers monitoring competitive projects, devtool product teams measuring adoption, and data teams powering ecosystem dashboards.

🟒 No login. No API key. No proxy. No browser. Public GitHub Search API only.


🚀 Why this scraper

GitHub is the planet's largest open-source codebase and one of the highest-signal datasets in tech. Star count is the closest thing the industry has to a market-cap for libraries. Topics tell you what categories are growing. Languages reveal where teams are betting their stack. Created dates surface new entrants before they trend on Hacker News. Activity timestamps separate alive projects from abandoned ones.

But pulling this data well at scale means:

  • Crafting GitHub Search query strings with qualifiers (language:, topic:, stars:>, created:>)
  • Working around GitHub Search's 1,000-result hard cap per query (and knowing when to split)
  • Paginating cleanly through ?page=1..N with per_page=100
  • Handling secondary rate limits and abuse signals
  • Flattening the nested owner / license / topics objects into spreadsheet rows
  • Persisting a clean schema downstream BI / warehouses / ML pipelines can consume

This Actor does all of that. Hand it any GitHub search query and get back a flat dataset of the matching repositories (up to GitHub's 1,000 results per query) with 20 columns, ready for Excel, your database, your security pipeline, or your VC deal-flow tracker.


✨ Key features

| Feature | What it gives you |
| --- | --- |
| 🔌 Official GitHub Search API | Stable, well-documented, structured - no HTML scraping or anti-bot games |
| 🔎 Full GitHub search syntax | Any qualifier GitHub supports: language:, topic:, stars:>N, forks:>N, created:>YYYY-MM-DD, pushed:>YYYY-MM-DD, archived:false, is:public, free text, and combinations |
| 🔢 Flexible sort & order | Sort by stars, forks, updated or help-wanted-issues, ascending or descending |
| ⭐ Rich repo metadata | 20 fields per repo: stars, forks, open issues, watchers, language, topics, license, archived flag, dates, owner type and more |
| 📅 Activity tracking | createdAt, updatedAt and pushedAt to separate active from abandoned repos |
| 👀 Owner identity | owner login + ownerType (User vs Organization) for downstream segmentation |
| ♾️ Unlimited mode | maxRepos=0 pulls every result the query allows (GitHub Search caps at 1,000 per query) |
| 🧱 Flat schema | No nested JSON to wrangle - drop straight into a warehouse |
| 📦 All export formats | JSON, CSV, Excel, HTML, XML, JSONL via the Apify Dataset |
| ⏱️ Schedule-friendly | Deterministic and idempotent - great for daily ecosystem tracking |
| 🔓 No auth required | Anonymous GitHub Search API access |
| 🧰 Built-in Overview view | Pre-configured Apify Dataset view with the most-useful columns visible by default |

🎯 Built for these use cases

1. Open-source intelligence (OSINT) for VCs & investors

Run "topic:ai stars:>500 created:>2026-01-01" weekly. Each run returns the AI projects created since that date that currently have more than 500 stars; compare it with the previous week's snapshot to see which projects crossed the threshold in the last seven days. Filter by language, by owner type (organization-led projects may monetize faster), or by activity recency.
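A single run only shows the current state, so spotting repos that crossed the star threshold since the last run means comparing two snapshots. The sketch below assumes each snapshot is a list of dataset rows with the `id` and `stars` fields from the output schema; the function name and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List


def newly_over_threshold(
    previous: Iterable[dict], current: Iterable[dict], threshold: int = 500
) -> List[dict]:
    """Rows in `current` above `threshold` stars that were missing from,
    or at/below the threshold in, the `previous` snapshot."""
    # Index last run's star counts by the stable numeric repo id.
    old_stars: Dict[int, int] = {row["id"]: row["stars"] for row in previous}
    crossed = []
    for row in current:
        before = old_stars.get(row["id"])
        if row["stars"] > threshold and (before is None or before <= threshold):
            crossed.append(row)
    return crossed


# Illustrative rows only; real rows come from two scheduled runs.
last_week = [
    {"id": 1, "fullName": "acme/agent", "stars": 480},
    {"id": 2, "fullName": "acme/rag", "stars": 900},
]
this_week = [
    {"id": 1, "fullName": "acme/agent", "stars": 530},
    {"id": 2, "fullName": "acme/rag", "stars": 950},
    {"id": 3, "fullName": "newco/llm-tool", "stars": 610},
]
print([r["fullName"] for r in newly_over_threshold(last_week, this_week)])
# prints ['acme/agent', 'newco/llm-tool']
```

Keying on `id` rather than `fullName` keeps the comparison stable when a repository is renamed or transferred.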

2. Dependency & supply-chain tracking

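Pull every repo tagged with a topic you care about (topic:kubernetes, topic:llm). Topics are labels set by repo owners, so the result shows who is building around a technology rather than a resolved dependency graph. It is still a useful seed list for security, partnership and product-roadmap signals.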

3. Security & vulnerability research

Find public repos associated with a specific framework or language using topic:, language: or free-text terms. When a CVE drops, you have a pre-built candidate list to scan and notify. Sort by stars to prioritize widely deployed code.

4. Technical talent sourcing

language:rust stars:>100 returns significant Rust projects together with their owner logins, a starting shortlist for recruiters looking for senior Rust engineers. Filter on pushedAt to keep only repos that received a push this month.

5. Devtool adoption & competitive intelligence

You sell a devtool. Search for repos that mention your competitor's library in their name, description or topics, then count them, compare growth over months, and segment by language and topic. Feed the results to your sales/CS teams.

6. Package popularity tracking

For library maintainers: which forks have meaningful stars (add fork:only to the query to search forks)? Which competing packages are growing faster? Use weekly snapshots + growth deltas to inform OSS strategy.

7. Trending newsletters & dashboards

Build "best of GitHub this week / month" newsletters, dashboards, podcasts - automated from a single scheduled scrape.

8. Ecosystem mapping for research & journalism

Map every project in a niche (LLM agents, web3 infra, climate tech, edtech). Cross-reference languages, owners, dates, license types. Publish industry reports.


📥 Inputs

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| searchQuery | string | No | Any GitHub Search query string. Free text and qualifiers both work: language:python, topic:llm, stars:>1000, created:>2026-01-01, pushed:>2026-04-01, archived:false, is:public, org:openai, user:torvalds. Default stars:>1000. |
| sort | enum | No | Sort field: stars, forks, updated, help-wanted-issues. Default stars. |
| order | enum | No | Sort direction: desc or asc. Default desc. |
| maxRepos | integer | No | Hard cap on rows. 0 = pull every available result (GitHub Search caps at 1,000 per query). |

Example inputs

Trending AI repos in Python:

{
  "searchQuery": "topic:ai language:python stars:>500",
  "sort": "stars",
  "order": "desc",
  "maxRepos": 0
}

Active Rust devtools created this year:

{
  "searchQuery": "language:rust topic:cli created:>2026-01-01 stars:>50 archived:false",
  "sort": "updated",
  "order": "desc",
  "maxRepos": 500
}

Owner-specific (everything by an org):

{
  "searchQuery": "org:vercel",
  "sort": "stars",
  "order": "desc",
  "maxRepos": 0
}

Recently-pushed Kubernetes-related repos:

{
  "searchQuery": "topic:kubernetes pushed:>2026-04-01 stars:>100",
  "sort": "updated",
  "order": "desc",
  "maxRepos": 1000
}
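Inputs like these can also be sent programmatically. The following is a minimal sketch using the `apify-client` Python package; the Actor ID `logiover/github-repository-scraper` is a guess based on the developer name and is not confirmed by this page, and the `APIFY_TOKEN` environment variable and both helper names are assumptions for illustration.

```python
import os

ALLOWED_SORTS = {"stars", "forks", "updated", "help-wanted-issues"}
ALLOWED_ORDERS = {"desc", "asc"}


def build_run_input(search_query="stars:>1000", sort="stars", order="desc", max_repos=0):
    """Check values against the input table and return the Actor input dict."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort: {sort!r}")
    if order not in ALLOWED_ORDERS:
        raise ValueError(f"unsupported order: {order!r}")
    if max_repos < 0:
        raise ValueError("maxRepos must be 0 (no cap) or a positive integer")
    return {"searchQuery": search_query, "sort": sort, "order": order, "maxRepos": max_repos}


def run_actor(run_input, actor_id="logiover/github-repository-scraper"):
    """Start a run, wait for it to finish, and return the dataset rows.

    The default actor_id is hypothetical; copy the real one from the Actor page.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    rows = run_actor(build_run_input("topic:kubernetes pushed:>2026-04-01 stars:>100",
                                     sort="updated", max_repos=1000))
    print(len(rows), "repositories")
```

Validating `sort` and `order` locally fails fast on typos before a run is started.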

📤 Output

One Apify dataset row per repository. Sample:

{
  "id": 65600975,
  "fullName": "openai/whisper",
  "name": "whisper",
  "owner": "openai",
  "ownerType": "Organization",
  "description": "Robust Speech Recognition via Large-Scale Weak Supervision",
  "url": "https://github.com/openai/whisper",
  "homepage": "",
  "language": "Python",
  "topics": ["audio", "speech-recognition", "deep-learning", "pytorch"],
  "stars": 88231,
  "forks": 10241,
  "openIssues": 102,
  "watchers": 88231,
  "license": "MIT",
  "isArchived": false,
  "createdAt": "2022-09-21T16:35:42.000Z",
  "updatedAt": "2026-05-15T09:11:00.000Z",
  "pushedAt": "2026-05-12T14:22:18.000Z",
  "scrapedAt": "2026-05-16T10:00:00.000Z"
}

Full field reference

| Field | Type | Meaning |
| --- | --- | --- |
| id | integer | GitHub numeric repository ID |
| fullName | string | owner/repo, the canonical identifier |
| name | string | Repository name (without owner) |
| owner | string | Owner login (user or organization name) |
| ownerType | string | User or Organization |
| description | string | Repo short description (as shown under the name) |
| url | string | Canonical github.com URL |
| homepage | string | Project homepage URL, if set |
| language | string | Primary programming language detected by GitHub |
| topics | array | Repo topics (manually set by the owner) |
| stars | integer | Current star count |
| forks | integer | Current fork count |
| openIssues | integer | Current open-issues count (includes PRs) |
| watchers | integer | The API's watchers_count, which mirrors the star count (it is not the number of notification subscribers) |
| license | string | License identifier (e.g. MIT, Apache-2.0, GPL-3.0) |
| isArchived | boolean | True if the repo has been archived (read-only) |
| createdAt | string | ISO 8601 timestamp of repo creation |
| updatedAt | string | ISO 8601 timestamp of the last metadata update |
| pushedAt | string | ISO 8601 timestamp of the last code push (best activity signal) |
| scrapedAt | string | ISO 8601 timestamp of the scrape |

⚙️ How it works

  1. Parses input: query string, sort, order, max cap.
  2. Calls api.github.com/search/repositories with q={searchQuery}, sort, order, per_page=100, page=1.
  3. Paginates through page=1..N until the cap or 1,000-row API ceiling is reached.
  4. Respects rate limits: observes X-RateLimit-Remaining and Retry-After headers; sleeps and retries gracefully.
  5. Flattens the nested response: extracts owner.login → owner, owner.type → ownerType, license.spdx_id → license, normalizes timestamps to ISO 8601.
  6. Streams each repo as one flat row into the Apify Dataset.
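Steps 2, 3 and 5 can be sketched in a few lines of Python. This is an illustration of the described flow, not the Actor's source code: the helper names are hypothetical, while the snake_case keys on the right-hand side (`full_name`, `stargazers_count`, `html_url` and so on) are the field names GitHub's search response uses.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def search_url(query: str, sort: str, order: str, page: int) -> str:
    """Build the request URL for one page of up to 100 results (step 2)."""
    params = {"q": query, "sort": sort, "order": order, "per_page": 100, "page": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def pages_needed(max_repos: int) -> int:
    """Upper bound on pages (step 3): 0 means no cap, limited to 1,000 rows."""
    wanted = 1000 if max_repos == 0 else min(max_repos, 1000)
    return -(-wanted // 100)  # ceiling division


def flatten_repo(item: dict, scraped_at: str) -> dict:
    """Map one search result item onto the flat output schema (step 5)."""
    owner = item.get("owner") or {}
    license_info = item.get("license") or {}  # null when no license is detected
    return {
        "id": item["id"],
        "fullName": item["full_name"],
        "name": item["name"],
        "owner": owner.get("login"),
        "ownerType": owner.get("type"),
        "description": item.get("description"),
        "url": item["html_url"],
        "homepage": item.get("homepage") or "",
        "language": item.get("language"),
        "topics": item.get("topics", []),
        "stars": item["stargazers_count"],
        "forks": item["forks_count"],
        "openIssues": item["open_issues_count"],
        "watchers": item["watchers_count"],
        "license": license_info.get("spdx_id"),
        "isArchived": item["archived"],
        # Timestamps are passed through unchanged in this sketch.
        "createdAt": item["created_at"],
        "updatedAt": item["updated_at"],
        "pushedAt": item["pushed_at"],
        "scrapedAt": scraped_at,
    }
```

A real loop would also stop early when a page comes back with fewer than 100 items, since that means the result set is exhausted.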

The Actor uses ONLY the official, publicly-documented GitHub Search API (api.github.com/search/repositories). No HTML scraping, no headless browser, no proxy, no anti-bot bypass.
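The rate-limit handling described in step 4 boils down to deciding how long to sleep before the next request. A possible sketch follows, assuming response headers arrive as a plain dict with the exact header names shown and that `Retry-After` carries a number of seconds; `X-RateLimit-Reset` (epoch seconds) is a GitHub header the text above does not mention, and the function name is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before the next GitHub request."""
    now = time.time() if now is None else now

    # Secondary rate limits: GitHub tells us explicitly how long to back off.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return max(0.0, float(retry_after))

    # Primary limit exhausted: wait until the window resets.
    if headers.get("X-RateLimit-Remaining") == "0":
        reset_at = float(headers.get("X-RateLimit-Reset", now))
        return max(0.0, reset_at - now)

    return 0.0


print(seconds_to_wait({"Retry-After": "30"}))
# prints 30.0
print(seconds_to_wait({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}, now=1000))
# prints 60.0
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to unit-test.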


⚑ Performance

| Workload | Approx time | API calls |
| --- | --- | --- |
| Top 100 repos for a tight query | ~3 seconds | 1 |
| 500 repos for one query | ~10 seconds | 5 |
| 1,000 repos (full query cap) | ~20 seconds | 10 |
| Multi-query sweep (10 queries × 1,000) | ~3 minutes | ~100 |
| Daily ecosystem dashboard | ~minutes | bounded |

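GitHub's unauthenticated Search API allows 10 requests per minute per IP address. A full 1,000-repo query needs exactly 10 calls, so a single run fits inside one rate-limit window, and the Actor's built-in pacing keeps it within the limit.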
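The call counts follow directly from per_page=100, and the 10-requests-per-minute limit gives a lower bound on waiting time when many calls share one IP address. The sketch below works through that arithmetic; whether consecutive runs actually share an IP is not stated here, so treat the wait figure as a worst-case assumption. The function name is hypothetical.

```python
import math


def estimate(total_repos: int, per_page: int = 100, requests_per_minute: int = 10):
    """Return (api_calls, minimum_wait_seconds) for fetching `total_repos` rows.

    The wait is the time spent waiting for fresh rate-limit windows when all
    calls come from one IP; request latency itself is not included.
    """
    calls = math.ceil(total_repos / per_page)
    # The first window is free; every further window costs a 60-second wait.
    extra_windows = max(0, math.ceil(calls / requests_per_minute) - 1)
    return calls, extra_windows * 60


print(estimate(1_000))
# prints (10, 0)
print(estimate(10_000))
# prints (100, 540)
```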


💰 Cost model

Pay-Per-Result. You pay only for repository rows saved to the dataset. Queries that match nothing are not billed.

Typical costs (rough order):

  • Daily monitor of one query (~50 new repos) → tiny
  • Weekly ecosystem sweep (1,000 repos × 5 topics) → small
  • Backfill of an industry-wide niche (10,000 repos via slicing) → moderate
  • Multi-query talent / dependency mapping pipeline → bounded and predictable

🔄 Schedule for continuous monitoring

Schedule patterns we see in real deployments:

  • Hourly for "new trending repo" alerts in a fast-moving topic (e.g. topic:llm)
  • Daily for ecosystem dashboards and VC deal-flow trackers
  • Weekly for newsletter generation and dependency reports
  • Monthly for "state of the X ecosystem" reports

Push new rows into Slack, Discord, Notion, Airtable, Sheets, Postgres, BigQuery, your CRM or your custom API via Apify Webhooks.


🛠️ FAQ

Is using this GitHub scraper allowed? Yes. It uses the official public GitHub Search API and reads only publicly visible repository metadata. Use the data responsibly under GitHub's terms.

Do I need a GitHub account, login or token? No. The Actor works against GitHub's public Search API without authentication. (Authenticated requests would raise the rate-limit ceiling; if you need that, request a custom build.)

How many repos can I get per run? GitHub Search caps any single query at 1,000 results. Set maxRepos=0 to pull every available result for your query. To go beyond 1,000, split your query into narrower windows (by date, star band, language or topic); each window has its own 1,000-row budget.
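Splitting by creation date is one way to generate those narrower windows. The sketch below builds one query per date range using GitHub's `created:START..END` range syntax; the function name and the window size are illustrative choices, not part of the Actor.

```python
from datetime import date, timedelta
from typing import List


def date_window_queries(base_query: str, start: date, end: date,
                        days_per_window: int = 30) -> List[str]:
    """Split `base_query` into consecutive, non-overlapping created: ranges."""
    queries = []
    window_start = start
    while window_start <= end:
        # Both ends of a GitHub range are inclusive, hence the "- 1".
        window_end = min(window_start + timedelta(days=days_per_window - 1), end)
        queries.append(
            f"{base_query} created:{window_start.isoformat()}..{window_end.isoformat()}"
        )
        window_start = window_end + timedelta(days=1)
    return queries


for q in date_window_queries("topic:llm stars:>10", date(2026, 1, 1), date(2026, 3, 1), 31):
    print(q)
# prints:
# topic:llm stars:>10 created:2026-01-01..2026-01-31
# topic:llm stars:>10 created:2026-02-01..2026-03-01
```

Each generated string can be used as a separate `searchQuery`. If a window still returns 1,000 rows it is probably truncated, so shrink `days_per_window` and rerun that range; deduplicate on `id` when merging the results.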

What search syntax does it support? Any GitHub Search qualifier and free-text combination. Common qualifiers: language:, topic:, stars:, forks:, size:, created:, pushed:, archived:, is:public, org:, user:. Comparators: >N, <N, >=N, <=N, N..M. Use exactly the same syntax you'd type into the github.com search bar.
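When queries are generated by code rather than typed by hand, a small helper keeps the qualifier syntax consistent. This is a sketch only: `build_query` is a hypothetical name, and it simply joins free-text terms and `qualifier:value` pairs with spaces, quoting values that contain a space.

```python
def build_query(*terms: str, **qualifiers: str) -> str:
    """Join free-text terms and qualifier:value pairs into one search string."""
    parts = list(terms)
    for name, value in qualifiers.items():
        value = str(value)
        if " " in value:
            value = f'"{value}"'  # multi-word values must be quoted
        parts.append(f"{name}:{value}")
    return " ".join(parts)


print(build_query("agent", language="python", stars=">500", archived="false"))
# prints agent language:python stars:>500 archived:false
print(build_query(topic="llm", created="2026-01-01..2026-03-31"))
# prints topic:llm created:2026-01-01..2026-03-31
```

Because Python keyword arguments must be unique, repeating a qualifier (for example two `topic:` filters) needs the extra one passed as a plain term such as `"topic:rag"`.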

Can I sort by anything I want? The GitHub Search API supports sort=stars, forks, updated and help-wanted-issues, ascending or descending. For other sorts (e.g. by pushed), sort downstream in your spreadsheet, SQL or pandas.

What's the difference between updatedAt and pushedAt? updatedAt = any metadata change (description edit, star count change). pushedAt = an actual code push to a branch. Use pushedAt to identify truly active development.
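As a concrete use of that distinction, the following sketch labels rows as active or stale based on `pushedAt`. The 30-day cutoff and the function names are arbitrary choices for illustration; timestamps are assumed to look like the ones in the sample output.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp such as 2026-05-12T14:22:18.000Z into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_active(row: dict, now: datetime, max_idle_days: int = 30) -> bool:
    """True if the repo received a code push within the last `max_idle_days`."""
    return now - parse_ts(row["pushedAt"]) <= timedelta(days=max_idle_days)


now = datetime(2026, 5, 16, 10, 0, tzinfo=timezone.utc)
rows = [
    {"fullName": "openai/whisper", "pushedAt": "2026-05-12T14:22:18.000Z"},
    {"fullName": "example/old-lib", "pushedAt": "2025-01-01T00:00:00.000Z"},  # made-up row
]
print([r["fullName"] for r in rows if is_active(r, now)])
# prints ['openai/whisper']
```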

How do I get every repo in an organization? Use org:openai (replace with the org slug) as your searchQuery. GitHub Search leaves forks out of the results by default; to include them as well, use org:openai fork:true.

Does this get README content, file trees or commit history? No. This Actor focuses on repository metadata. For READMEs, file lists, contributors or commit history, request the matching companion Actor.

Are forks included by default? No. GitHub repository search omits forks unless you add fork:true (originals plus forks) or fork:only (forks alone) to your query.

Can I filter by license? Yes. Add license:mit or license:apache-2.0 (etc.) to your searchQuery.

Is the data fresh? Yes. Every run queries GitHub's live search index, which can lag briefly behind changes on github.com; the scrapedAt field records when each row was captured.

What output formats are supported? JSON, CSV, Excel, HTML, XML and JSONL via the Apify Dataset, plus REST API and webhooks for live integrations.

How is this different from GitHub Trending? GitHub Trending is a single rolling list visible only on github.com/trending. This Actor lets you build any trending-style list you want, by any query, with any sort, into a structured dataset.


Adjacent data sources in the social/dev/content suite:

| Scraper | Purpose |
| --- | --- |
| github-repository-scraper | You are here. Public GitHub repo metadata by search query. |
| stack-exchange-questions-scraper | Q&A across 170+ Stack Exchange sites by tag/site/sort. |
| hacker-news-search-scraper | HN stories/comments/Show HN/Ask HN/front page by keyword. |
| hacker-news-who-is-hiring-scraper | Monthly HN "Who is hiring?" thread parsed by company/role/stack. |
| reddit-subreddit-scraper | Posts from any subreddit by sort and time window. |
| reddit-historical-archive-scraper | Years of subreddit history at scale. |
| devto-articles-scraper | Dev.to articles by tag, author, latest feed. |
| product-hunt-daily-launches-scraper | Today's Product Hunt launches with votes and makers. |
| linkedin-top-content-scraper | Top-performing LinkedIn posts by keyword/author. |
| linkedin-ad-library-scraper | LinkedIn Ad Library: competitor ad creative & spend signals. |
| letterboxd-film-review-scraper | Film reviews from Letterboxd for culture/sentiment work. |
| instagram-media-downloader | Reels/Posts/Stories HD download URLs in bulk. |

🔑 Keyword cloud

Core: github scraper, github repo scraper, github repository scraper, github search api, github repo stats api, github trending scraper, github metadata scraper, github stars api, github stars scraper, github topics scraper, github bulk repo export, github repo to csv.

Niche: github search by topic, github search by language, github search by stars, github search by date, github fork:false, github archived:false, github org search, github user search, github pushed_at scraper, github created_at scraper, github license scraper, github watchers tracker, github open issues scraper.

Use case: open source intelligence, oss market research, vc deal flow github, devtool adoption tracking, dependency tracking, supply chain security research, cve target list, security research, talent sourcing for engineers, technical recruiting, package popularity tracking, oss competitive intelligence, ecosystem mapping, trending projects newsletter, github dashboard data, devrel ecosystem health, llm fine tuning corpus selection, code dataset selection, github research dataset.

Audience: vcs, investors, founders, devtool product managers, devrel teams, security researchers, technical recruiters, oss maintainers, package authors, data engineers, ml engineers, ai researchers, ecosystem analysts, technical journalists, growth marketers for developer audiences, sales / cs teams targeting engineers.