GitHub Repository Scraper
Pricing: from $0.99 / 1,000 results
Scrape GitHub repositories by search query - stars, forks, language, topics, owner, license and activity dates. Track trending projects, competitor repos or developer activity.
Developer: Logiover
Maintained by Community
GitHub Repository Scraper: Repo Search, Stars, Topics & Trending Data to JSON & CSV

Bulk-scrape public GitHub repositories via the official GitHub Search API by query, qualifier (language, topic, stars, forks, created date), sort and order. Get back a flat, structured dataset of every matching repo: full name, owner (user or organization), description, homepage, primary language, topics, stars, forks, open issues, watchers, license, archived flag, created date, updated date and last-push date.
Built for VCs scouting open-source momentum, security teams tracking vulnerable dependencies, recruiters sourcing engineers by stack, OSS maintainers monitoring competitive projects, devtool product teams measuring adoption, and data teams powering ecosystem dashboards.
No login. No API key. No proxy. No browser. Public GitHub Search API only.
Why this scraper
GitHub is the planet's largest open-source codebase and one of the highest-signal datasets in tech. Star count is the closest thing the industry has to a market-cap for libraries. Topics tell you what categories are growing. Languages reveal where teams are betting their stack. Created dates surface new entrants before they trend on Hacker News. Activity timestamps separate alive projects from abandoned ones.
But pulling this data well at scale means:
- Crafting GitHub Search query strings with qualifiers (`language:`, `topic:`, `stars:>`, `created:>`)
- Working around GitHub Search's 1,000-result hard cap per query (and knowing when to split)
- Paginating cleanly through `?page=1..N` with `per_page=100`
- Handling secondary rate limits and abuse signals
- Flattening the nested `owner` / `license` / `topics` objects into spreadsheet rows
- Persisting a clean schema that downstream BI tools, warehouses and ML pipelines can consume
This Actor does all of that. Hand it any GitHub search query and get back a flat dataset of every matching repository with 20 useful columns, ready for Excel, your database, your security pipeline, or your VC deal-flow tracker.
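As a rough illustration of the first point in the list above, a query string with qualifiers can be assembled from keyword arguments. The helper name `build_query` is hypothetical and is not part of the Actor; it only shows how free text and `qualifier:value` pairs combine into one `searchQuery` string.

```python
def build_query(text="", **qualifiers):
    """Join free text and qualifier:value pairs into a GitHub search query string."""
    parts = [text] if text else []
    for name, value in qualifiers.items():
        if isinstance(value, bool):
            value = str(value).lower()  # archived=False becomes archived:false
        parts.append(f"{name}:{value}")
    return " ".join(parts)


query = build_query(topic="ai", language="python", stars=">500")
print(query)  # topic:ai language:python stars:>500
```

The resulting string can be pasted into the `searchQuery` input as-is.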
Key features
| Feature | What it gives you |
|---|---|
| Official GitHub Search API | Stable, well-documented, structured; no HTML scraping or anti-bot games |
| Full GitHub search syntax | Any qualifier GitHub supports: `language:`, `topic:`, `stars:>N`, `forks:>N`, `created:>YYYY-MM-DD`, `pushed:>YYYY-MM-DD`, `archived:false`, `is:public`, free text, and combinations |
| Flexible sort & order | Sort by `stars`, `forks`, `updated` or `help-wanted-issues`, ascending or descending |
| Rich repo metadata | 20 fields per repo: stars, forks, open issues, watchers, language, topics, license, archived flag, dates, owner type and more |
| Activity tracking | `createdAt`, `updatedAt` and `pushedAt` to separate active from abandoned repos |
| Owner identity | `owner` login + `ownerType` (User vs Organization) for downstream segmentation |
| Unlimited mode | `maxRepos=0` pulls every result the query allows (GitHub Search caps at 1,000 per query) |
| Flat schema | No nested JSON to wrangle; drop straight into a warehouse |
| All export formats | JSON, CSV, Excel, HTML, XML, JSONL via the Apify Dataset |
| Schedule-friendly | Deterministic and idempotent, which suits daily ecosystem tracking |
| No auth required | Anonymous GitHub Search API access |
| Built-in Overview view | Pre-configured Apify Dataset view with the most-useful columns visible by default |
Built for these use cases
1. Open-source intelligence (OSINT) for VCs & investors
Run "topic:ai stars:>500 created:>2026-01-01" weekly. Every Monday morning, you get the AI projects created since that date that have already passed 500 stars, a ready shortlist of potential portfolio companies. Filter by language, by owner type (organization-led projects may monetize faster), by activity recency.
2. Dependency & supply-chain tracking
Pull every repo using a topic / library you care about (topic:kubernetes, topic:llm). Build a database of who depends on what, which is critical for security, partnerships and product roadmap signals.
3. Security & vulnerability research
Find every public repo using a specific framework or language version. When a CVE drops, you have a pre-built target list to scan and notify. Sort by stars to prioritize widely-deployed code.
4. Technical talent sourcing
language:rust stars:>100 returns significant Rust projects together with their owners, a recruiter's shortlist for senior Rust engineers. Pair with pushedAt for active-this-month maintainers only.
5. Devtool adoption & competitive intelligence
You sell a devtool. Search every repo that uses your competitor's library β count, compare growth over months, segment by language and topic. Feed your sales/CS teams.
6. Package popularity tracking
For library maintainers: which forks have meaningful stars? Which competing packages are growing faster? Use weekly snapshots + growth deltas to inform OSS strategy.
7. Trending project discovery & curation
Build "best of GitHub this week / month" newsletters, dashboards, podcasts, all automated from a single scheduled scrape.
8. Ecosystem mapping for research & journalism
Map every project in a niche (LLM agents, web3 infra, climate tech, edtech). Cross-reference languages, owners, dates, license types. Publish industry reports.
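The weekly snapshots and growth deltas mentioned in use case 6 can be sketched as a comparison of two dataset exports keyed by repository `id`. The function name `star_deltas` and the tiny sample rows are made up for illustration; real rows would come from two runs of the same query.

```python
def star_deltas(previous, current):
    """Return (fullName, star gain) pairs for repos present in both snapshots, largest gain first."""
    before = {row["id"]: row["stars"] for row in previous}
    deltas = [
        (row["fullName"], row["stars"] - before[row["id"]])
        for row in current
        if row["id"] in before
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Hypothetical snapshots from two consecutive weekly runs.
last_week = [{"id": 1, "fullName": "a/b", "stars": 100}, {"id": 2, "fullName": "c/d", "stars": 50}]
this_week = [{"id": 1, "fullName": "a/b", "stars": 130}, {"id": 2, "fullName": "c/d", "stars": 90}]
print(star_deltas(last_week, this_week))  # [('c/d', 40), ('a/b', 30)]
```

Repos that appear only in the newer snapshot are skipped here; they could be reported separately as new entrants.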
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| `searchQuery` | string | No | Any GitHub Search query string. Free text and qualifiers both work: `language:python`, `topic:llm`, `stars:>1000`, `created:>2026-01-01`, `pushed:>2026-04-01`, `archived:false`, `is:public`, `org:openai`, `user:torvalds`. Default `stars:>1000`. |
| `sort` | enum | No | Sort field: `stars`, `forks`, `updated`, `help-wanted-issues`. Default `stars`. |
| `order` | enum | No | Sort direction: `desc` or `asc`. Default `desc`. |
| `maxRepos` | integer | No | Hard cap on rows. `0` = pull every available result (GitHub Search caps at 1,000 per query). |
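Before starting a run, it can help to check an input object against the table above. The sketch below applies the documented defaults and rejects values outside the documented enums. `normalize_input` is a hypothetical helper, and the fallback of `0` for `maxRepos` is an assumption, since the table does not state a default for that field.

```python
ALLOWED_SORT = {"stars", "forks", "updated", "help-wanted-issues"}
ALLOWED_ORDER = {"desc", "asc"}


def normalize_input(raw):
    """Apply the documented defaults and reject values outside the documented enums."""
    sort = raw.get("sort", "stars")
    order = raw.get("order", "desc")
    max_repos = raw.get("maxRepos", 0)  # assumption: no default is documented for maxRepos
    if sort not in ALLOWED_SORT:
        raise ValueError(f"unsupported sort: {sort!r}")
    if order not in ALLOWED_ORDER:
        raise ValueError(f"unsupported order: {order!r}")
    if not isinstance(max_repos, int) or max_repos < 0:
        raise ValueError("maxRepos must be a non-negative integer")
    return {
        "searchQuery": raw.get("searchQuery", "stars:>1000"),
        "sort": sort,
        "order": order,
        "maxRepos": max_repos,
    }


print(normalize_input({"searchQuery": "org:vercel"}))
```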
Example inputs
Trending AI repos in Python:
{"searchQuery": "topic:ai language:python stars:>500","sort": "stars","order": "desc","maxRepos": 0}
Active Rust devtools created this year:
{"searchQuery": "language:rust topic:cli created:>2026-01-01 stars:>50 archived:false","sort": "updated","order": "desc","maxRepos": 500}
Owner-specific (everything by an org):
{"searchQuery": "org:vercel","sort": "stars","order": "desc","maxRepos": 0}
Recently-pushed Kubernetes-related repos:
{"searchQuery": "topic:kubernetes pushed:>2026-04-01 stars:>100","sort": "updated","order": "desc","maxRepos": 1000}
Output
One Apify dataset row per repository. Sample:
{"id": 65600975,"fullName": "openai/whisper","name": "whisper","owner": "openai","ownerType": "Organization","description": "Robust Speech Recognition via Large-Scale Weak Supervision","url": "https://github.com/openai/whisper","homepage": "","language": "Python","topics": ["audio", "speech-recognition", "deep-learning", "pytorch"],"stars": 88231,"forks": 10241,"openIssues": 102,"watchers": 88231,"license": "MIT","isArchived": false,"createdAt": "2022-09-21T16:35:42.000Z","updatedAt": "2026-05-15T09:11:00.000Z","pushedAt": "2026-05-12T14:22:18.000Z","scrapedAt": "2026-05-16T10:00:00.000Z"}
Full field reference
| Field | Type | Meaning |
|---|---|---|
| `id` | integer | GitHub numeric repository ID |
| `fullName` | string | `owner/repo`, the canonical identifier |
| `name` | string | Repository name (without owner) |
| `owner` | string | Owner login (user or organization name) |
| `ownerType` | string | `User` or `Organization` |
| `description` | string | Repo short description (as shown under the name) |
| `url` | string | Canonical github.com URL |
| `homepage` | string | Project homepage URL, if set |
| `language` | string | Primary programming language detected by GitHub |
| `topics` | array | Repo topics (manually set by the owner) |
| `stars` | integer | Current star count |
| `forks` | integer | Current fork count |
| `openIssues` | integer | Current open-issues count (includes PRs) |
| `watchers` | integer | Watcher count as reported by the API; in GitHub's REST API this field mirrors the star count |
| `license` | string | License identifier (e.g. MIT, Apache-2.0, GPL-3.0) |
| `isArchived` | boolean | True if the repo has been archived (read-only) |
| `createdAt` | string | ISO 8601 timestamp of repo creation |
| `updatedAt` | string | ISO 8601 timestamp of the last metadata update |
| `pushedAt` | string | ISO 8601 timestamp of the last code push (best activity signal) |
| `scrapedAt` | string | ISO 8601 timestamp of the scrape |
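Every field in the reference above is a scalar except `topics`, which is an array. If you build a CSV yourself from the JSON export, one option is to join list values into a single cell. The sketch below does that with the standard library; `rows_to_csv` is a hypothetical helper and the `;` separator is an arbitrary choice, not necessarily what the Apify CSV export produces.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize dataset rows to CSV text, joining list values (such as topics) with ';'."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: ";".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        })
    return buffer.getvalue()


sample = [{"fullName": "openai/whisper", "stars": 88231, "topics": ["audio", "pytorch"]}]
print(rows_to_csv(sample))
```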
How it works
- Parses input: query string, sort, order, max cap.
- Calls `api.github.com/search/repositories` with `q={searchQuery}`, `sort`, `order`, `per_page=100`, `page=1`.
- Paginates through `page=1..N` until the cap or 1,000-row API ceiling is reached.
- Respects rate limits: observes `X-RateLimit-Remaining` and `Retry-After` headers; sleeps and retries gracefully.
- Flattens the nested response: extracts `owner.login` → `owner`, `owner.type` → `ownerType`, `license.spdx_id` → `license`, normalizes timestamps to ISO 8601.
- Streams each repo as one flat row into the Apify Dataset.
The Actor uses ONLY the official, publicly-documented GitHub Search API (api.github.com/search/repositories). No HTML scraping, no headless browser, no proxy, no anti-bot bypass.
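The call, paginate and flatten steps above can be sketched in plain Python. This is not the Actor's source code: the function names are hypothetical, only a subset of the 20 output fields is mapped, the `User-Agent` value is arbitrary, and rate-limit handling is left out for brevity.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/repositories"
PER_PAGE = 100
SEARCH_CEILING = 1000  # GitHub Search returns at most 1,000 results per query


def fetch_page(query, sort, order, page):
    """Fetch one page of search results and return the decoded JSON body."""
    params = urllib.parse.urlencode(
        {"q": query, "sort": sort, "order": order, "per_page": PER_PAGE, "page": page}
    )
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"Accept": "application/vnd.github+json", "User-Agent": "repo-search-sketch"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def flatten(repo):
    """Map the nested API object onto a few of the flat output columns."""
    owner = repo.get("owner") or {}
    license_info = repo.get("license") or {}  # license is null for unlicensed repos
    return {
        "fullName": repo.get("full_name"),
        "owner": owner.get("login"),
        "ownerType": owner.get("type"),
        "license": license_info.get("spdx_id"),
        "topics": repo.get("topics", []),
        "stars": repo.get("stargazers_count"),
        "pushedAt": repo.get("pushed_at"),
    }


def search(query, sort="stars", order="desc", max_repos=0, fetch=fetch_page):
    """Page through results until the cap, the 1,000-row ceiling, or a short page."""
    limit = SEARCH_CEILING if max_repos == 0 else min(max_repos, SEARCH_CEILING)
    rows, page = [], 1
    while len(rows) < limit:
        items = fetch(query, sort, order, page).get("items", [])
        rows.extend(flatten(item) for item in items)
        if len(items) < PER_PAGE:
            break  # a short page means there are no further results
        page += 1
    return rows[:limit]
```

Passing `fetch` as a parameter keeps the pagination logic testable without network access; a real run would call `search("topic:llm stars:>100")` with the default fetcher.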
Performance
| Workload | Approx time | API calls |
|---|---|---|
| Top 100 repos for a tight query | ~3 seconds | 1 |
| 500 repos for one query | ~10 seconds | 5 |
| 1,000 repos (full query cap) | ~20 seconds | 10 |
| Multi-query sweep (10 queries × 1,000) | ~10 minutes at the anonymous rate limit | ~100 |
| Daily ecosystem dashboard | ~minutes | bounded |
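GitHub's anonymous Search API allows ~10 requests/minute. The Actor stays within that limit with built-in pacing, so runs that need more than about 10 API calls are bounded by the rate limit rather than by network speed.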
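A back-of-the-envelope estimate of how the ~10 requests/minute limit shapes run time is sketched below. It assumes a simple fixed-window model (a set number of calls per 60-second window), which is a simplification; GitHub's actual accounting and its secondary limits may behave differently, and `rate_limit_wait_seconds` is a hypothetical name.

```python
import math


def api_calls_needed(total_repos, per_page=100):
    """Number of search requests required to page through total_repos rows."""
    return math.ceil(total_repos / per_page)


def rate_limit_wait_seconds(api_calls, requests_per_minute=10):
    """Lower bound on waiting time when each 60 s window allows a fixed number of calls."""
    if api_calls <= 0:
        return 0
    full_windows_before_last_call = (api_calls - 1) // requests_per_minute
    return full_windows_before_last_call * 60


print(rate_limit_wait_seconds(api_calls_needed(1000)))       # 0
print(rate_limit_wait_seconds(10 * api_calls_needed(1000)))  # 540
```

Under this model a single 1,000-repo query fits inside one window, while a 100-call sweep spends at least nine minutes waiting on the limit.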
Cost model
Pay-Per-Result. You pay only for repository rows saved to the dataset. Queries that match nothing are not billed.
Typical costs (rough order):
- Daily monitor of one query (~50 new repos): tiny
- Weekly ecosystem sweep (1,000 repos × 5 topics): small
- Backfill of an industry-wide niche (10,000 repos via slicing): moderate
- Multi-query talent / dependency mapping pipeline: bounded and predictable
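To put rough numbers on the workloads above, the listed starting price of $0.99 per 1,000 results can be applied to the row counts. This is only an estimate: the price is quoted as "from $0.99", so the effective rate on your plan may differ, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(rows, usd_per_1000=0.99):
    """Estimated charge in USD for a number of saved dataset rows."""
    return rows * usd_per_1000 / 1000


weekly_sweep = estimate_cost(1000 * 5)  # 1,000 repos for each of 5 topics
backfill = estimate_cost(10_000)        # 10,000 repos via query slicing
print(f"{weekly_sweep:.2f}")  # 4.95
print(f"{backfill:.2f}")      # 9.90
```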
Schedule for continuous monitoring
Schedule patterns we see in real deployments:
- Hourly for "new trending repo" alerts in a fast-moving topic (e.g. `topic:llm`)
- Daily for ecosystem dashboards and VC deal-flow trackers
- Weekly for newsletter generation and dependency reports
- Monthly for "state of the X ecosystem" reports
Push new rows into Slack, Discord, Notion, Airtable, Sheets, Postgres, BigQuery, your CRM or your custom API via Apify Webhooks.
FAQ
Is using this GitHub scraper allowed? Yes. It uses the official public GitHub Search API and reads only publicly visible repository metadata. Use the data responsibly under GitHub's terms.
Do I need a GitHub account, login or token? No. The Actor works against GitHub's public Search API without authentication. (Authenticated requests would raise the rate-limit ceiling; if you need that, request a custom build.)
How many repos can I get per run?
GitHub Search caps any single query at 1,000 results. Set maxRepos=0 to pull every available result for your query. To go beyond 1,000, split your query into narrower windows (by date, star band, language or topic); each window has its own 1,000-row budget.
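One way to split by date is to generate a `created:` range qualifier per window and run each resulting query separately. The sketch below assumes fixed-size windows and a hypothetical helper name, `split_by_created`; in practice you would shrink any window that still returns the full 1,000 rows, and deduplicate the merged rows by `id`.

```python
from datetime import date, timedelta


def split_by_created(base_query, start, end, days_per_window):
    """Return one query per date window, using GitHub's inclusive created:A..B range syntax."""
    queries = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days_per_window - 1), end)
        queries.append(f"{base_query} created:{cursor.isoformat()}..{window_end.isoformat()}")
        cursor = window_end + timedelta(days=1)
    return queries


for q in split_by_created("topic:llm", date(2026, 1, 1), date(2026, 1, 10), 5):
    print(q)
# topic:llm created:2026-01-01..2026-01-05
# topic:llm created:2026-01-06..2026-01-10
```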
What search syntax does it support?
Any GitHub Search qualifier and free-text combination. Common qualifiers: language:, topic:, stars:, forks:, size:, created:, pushed:, archived:, is:public, org:, user:. Comparators: >N, <N, >=N, <=N, N..M. Use exactly the same syntax you'd type into the github.com search bar.
Can I sort by anything I want?
The GitHub Search API supports sort=stars, forks, updated and help-wanted-issues, ascending or descending. For other sorts (e.g. by pushed), sort downstream in your spreadsheet, SQL or pandas.
What's the difference between updatedAt and pushedAt?
updatedAt = any metadata change (description edit, star count change). pushedAt = an actual code push to a branch. Use pushedAt to identify truly active development.
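A small sketch of using `pushedAt` as the activity signal: keep only rows whose last push falls within a recent window. The helper names are hypothetical, and the timestamp format is taken from the sample output (ISO 8601 with a trailing `Z`).

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2026-05-12T14:22:18.000Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def active_since(rows, days, now):
    """Keep rows whose pushedAt is within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [
        row for row in rows
        if row.get("pushedAt") and parse_timestamp(row["pushedAt"]) >= cutoff
    ]


rows = [
    {"fullName": "openai/whisper", "pushedAt": "2026-05-12T14:22:18.000Z"},
    {"fullName": "example/stale", "pushedAt": "2025-01-01T00:00:00.000Z"},  # made-up row
]
now = datetime(2026, 5, 16, 10, 0, tzinfo=timezone.utc)
print([row["fullName"] for row in active_since(rows, 30, now)])  # ['openai/whisper']
```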
How do I get every repo in an organization?
Use org:openai (replace with the org slug) as your searchQuery. Repository search leaves forks out unless you ask for them, so add fork:true if you also want the organization's forks: org:openai fork:true.
Does this get README content, file trees or commit history? No β this Actor focuses on repository metadata. For READMEs, file lists, contributors or commit history, request the matching companion Actor.
Are forks included by default?
No. GitHub repository search leaves forks out of the results unless you add fork:true (forks included) or fork:only (forks only) to your query.
Can I filter by license?
Yes. Add license:mit or license:apache-2.0 (etc.) to your searchQuery.
Is the data fresh? Yes. Every run queries the live API, so results reflect GitHub's current search index, which can lag slightly behind changes on github.com.
What output formats are supported? JSON, CSV, Excel, HTML, XML and JSONL via the Apify Dataset, plus REST API and webhooks for live integrations.
How is this different from GitHub Trending? GitHub Trending is a single rolling list visible only on github.com/trending. This Actor lets you build any trending-style list you want, by any query, with any sort, into a structured dataset.
Related scrapers
Adjacent data sources in the social/dev/content suite:
| Scraper | Purpose |
|---|---|
| github-repository-scraper | You are here. Public GitHub repo metadata by search query. |
| stack-exchange-questions-scraper | Q&A across 170+ Stack Exchange sites by tag/site/sort. |
| hacker-news-search-scraper | HN stories/comments/Show HN/Ask HN/front page by keyword. |
| hacker-news-who-is-hiring-scraper | Monthly HN "Who is hiring?" thread parsed by company/role/stack. |
| reddit-subreddit-scraper | Posts from any subreddit by sort and time window. |
| reddit-historical-archive-scraper | Years of subreddit history at scale. |
| devto-articles-scraper | Dev.to articles by tag, author, latest feed. |
| product-hunt-daily-launches-scraper | Today's Product Hunt launches with votes and makers. |
| linkedin-top-content-scraper | Top-performing LinkedIn posts by keyword/author. |
| linkedin-ad-library-scraper | LinkedIn Ad Library: competitor ad creative & spend signals. |
| letterboxd-film-review-scraper | Film reviews from Letterboxd for culture/sentiment work. |
| instagram-media-downloader | Reels/Posts/Stories HD download URLs in bulk. |
Keyword cloud
Core: github scraper, github repo scraper, github repository scraper, github search api, github repo stats api, github trending scraper, github metadata scraper, github stars api, github stars scraper, github topics scraper, github bulk repo export, github repo to csv.
Niche: github search by topic, github search by language, github search by stars, github search by date, github fork:false, github archived:false, github org search, github user search, github pushed_at scraper, github created_at scraper, github license scraper, github watchers tracker, github open issues scraper.
Use case: open source intelligence, oss market research, vc deal flow github, devtool adoption tracking, dependency tracking, supply chain security research, cve target list, security research, talent sourcing for engineers, technical recruiting, package popularity tracking, oss competitive intelligence, ecosystem mapping, trending projects newsletter, github dashboard data, devrel ecosystem health, llm fine tuning corpus selection, code dataset selection, github research dataset.
Audience: vcs, investors, founders, devtool product managers, devrel teams, security researchers, technical recruiters, oss maintainers, package authors, data engineers, ml engineers, ai researchers, ecosystem analysts, technical journalists, growth marketers for developer audiences, sales / cs teams targeting engineers.
