GitHub Repository Intelligence - Repos & READMEs
Pricing
from $20.00 / 1,000 results
Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.
Developer
ben
Maintained by Community
Actor stats
Bookmarked: 1
Total users: 57
Monthly active users: 12
Last modified: 5 days ago
GitHub Repository Intelligence: Repo Metadata, READMEs & Docs via Official API
Extract structured GitHub repository data, README content and documentation at scale using GitHub's official REST API, with no HTML scraping and no broken selectors. Search millions of repos with GitHub's own query syntax, or fetch specific repositories by URL, and get back clean metadata (stars, forks, language stats, topics, license) plus the full decoded README for each one. Perfect for LLM training data, developer research and competitive analysis. Export to JSON/CSV/Excel, run on a schedule, call via API, or connect to Make, Zapier or n8n.
What is GitHub Repository Intelligence?
It turns a GitHub search query, or a list of repo URLs, into a rich dataset built from the official GitHub API. For every repository it pulls core stats, topics, per-language byte counts and the decoded README text in one pass, fetching the extra details concurrently for speed. Because it uses the documented API rather than HTML scraping, results stay stable and complete, which makes it well suited to building AI training corpora, tech-stack analysis and ecosystem monitoring.
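To illustrate what "fetching the extra details concurrently" could look like, here is a minimal sketch, not the actor's actual implementation. It assumes the three extras come from GitHub's documented `/topics`, `/languages` and `/readme` repository endpoints; `detail_urls`, `fetch_details` and the injected `fetch_json` callable are hypothetical names used only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

API_ROOT = "https://api.github.com"


def detail_urls(full_name: str) -> Dict[str, str]:
    """Return the REST endpoints holding the extra details for one repo."""
    base = f"{API_ROOT}/repos/{full_name}"
    return {
        "topics": f"{base}/topics",
        "languages": f"{base}/languages",
        "readme": f"{base}/readme",
    }


def fetch_details(full_name: str, fetch_json: Callable[[str], dict]) -> Dict[str, dict]:
    """Request all extra endpoints for one repo in parallel threads.

    fetch_json is a hypothetical helper: any callable that takes a URL
    and returns the parsed JSON body.
    """
    urls = detail_urls(full_name)
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        # Submit every request first, then collect, so they overlap in time.
        futures = {key: pool.submit(fetch_json, url) for key, url in urls.items()}
        return {key: future.result() for key, future in futures.items()}
```

Passing the fetcher in as an argument keeps the sketch free of network code; in practice `fetch_json` might wrap `urllib.request.urlopen` and `json.load`.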
What data does it extract?
- full_name, name, owner (login, type, profile URL) and description
- stars, forks, watchers, open_issues and repo size
- language (primary) plus languages byte-count breakdown
- topics / tags and license name
- readme: decoded full content with name, path, size and URLs
- homepage and repository url
- Timestamps: created_at, updated_at, pushed_at
- Flags: is_fork, is_archived, is_private, has_wiki, has_issues, default_branch
- scraped_at timestamp on every record
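The field list above can be read as a mapping from GitHub's repository object to a flat record. The sketch below shows one plausible mapping; the pairing of API keys (such as `stargazers_count` or `fork`) with output names is an assumption based on the public REST API response shape, and `to_record` is a hypothetical helper, not the actor's own code.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_record(repo: Dict[str, Any]) -> Dict[str, Any]:
    """Map a GitHub REST API repository object to the record fields listed above."""
    owner = repo.get("owner") or {}
    license_info = repo.get("license") or {}  # the API returns null for unlicensed repos
    return {
        "full_name": repo.get("full_name"),
        "name": repo.get("name"),
        "owner": {
            "login": owner.get("login"),
            "type": owner.get("type"),
            "url": owner.get("html_url"),
        },
        "description": repo.get("description"),
        "url": repo.get("html_url"),
        "homepage": repo.get("homepage"),
        "language": repo.get("language"),
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
        "watchers": repo.get("watchers_count", 0),
        "open_issues": repo.get("open_issues_count", 0),
        "size": repo.get("size", 0),
        "topics": repo.get("topics", []),
        "license": license_info.get("name"),
        "created_at": repo.get("created_at"),
        "updated_at": repo.get("updated_at"),
        "pushed_at": repo.get("pushed_at"),
        "is_fork": repo.get("fork", False),
        "is_archived": repo.get("archived", False),
        "is_private": repo.get("private", False),
        "has_wiki": repo.get("has_wiki", False),
        "has_issues": repo.get("has_issues", False),
        "default_branch": repo.get("default_branch"),
        # Assumed format: UTC time of collection in ISO 8601.
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }
```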
Input
Run it in one of two ways, either searching repositories with GitHub query syntax or passing exact repo URLs:
| Field | Description |
|---|---|
| mode | search (by query) or direct (by URL) |
| searchQuery | GitHub search query, e.g. language:python stars:>1000, react, machine-learning |
| repositoryUrls | One URL per line in direct mode, e.g. https://github.com/facebook/react |
| sortBy | stars, forks, updated, or help-wanted-issues (search mode) |
| maxResults | Max repositories to fetch in search mode (1 to 1000) |
| includeReadme | Fetch and decode the full README content |
| includeTopics | Fetch repository topics/tags |
| includeLanguages | Fetch programming-language statistics |
| githubToken | Optional token for higher rate limits (5,000 vs 60 requests/hour) |
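For direct mode, each line of repositoryUrls has to be reduced to an owner and a repository name before the API can be called. A minimal sketch of that step follows; `parse_repo_url` and `parse_repo_urls` are hypothetical helpers, and the handling of trailing slashes and `.git` suffixes is an assumption about which URL variants are worth accepting.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def parse_repo_url(url: str) -> Tuple[str, str]:
    """Split https://github.com/<owner>/<repo> into (owner, repo)."""
    parsed = urlparse(url.strip())
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {url!r}")
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"expected https://github.com/<owner>/<repo>, got {url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):  # tolerate clone-style URLs
        repo = repo[: -len(".git")]
    return owner, repo


def parse_repo_urls(text: str) -> List[Tuple[str, str]]:
    """Parse a one-URL-per-line block, skipping blank lines."""
    return [parse_repo_url(line) for line in text.splitlines() if line.strip()]
```

Usage: `parse_repo_urls("https://github.com/facebook/react\n")` yields a list of `(owner, repo)` pairs that can be joined into `/repos/{owner}/{repo}` paths.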
Example input
{"mode": "search","searchQuery": "language:python stars:>1000","sortBy": "stars","maxResults": 30,"includeReadme": true,"includeTopics": true,"includeLanguages": true}
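As a rough illustration of how an input like this relates to GitHub's search endpoint, the sketch below turns it into a `GET /search/repositories` URL. The endpoint and its `q`, `sort`, `order`, `per_page` and `page` parameters are part of GitHub's documented API; the function name `build_search_url`, the fixed descending order and the validation rules are assumptions made for this example.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"
ALLOWED_SORTS = {"stars", "forks", "updated", "help-wanted-issues"}


def build_search_url(run_input: dict, page: int = 1) -> str:
    """Build a repository search URL from an actor-style input dict."""
    sort_by = run_input.get("sortBy", "stars")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sortBy: {sort_by!r}")
    max_results = int(run_input.get("maxResults", 30))
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    params = {
        "q": run_input["searchQuery"],
        "sort": sort_by,
        "order": "desc",  # assumption: highest values first
        "per_page": min(max_results, 100),  # the API caps pages at 100 items
        "page": page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```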
Output
Every repository is one clean row (view as a table, or export JSON / CSV / Excel):
{"name": "react","full_name": "facebook/react","owner": {"login": "facebook","type": "Organization","url": "https://github.com/facebook"},"description": "The library for web and native user interfaces.","url": "https://github.com/facebook/react","homepage": "https://react.dev","language": "JavaScript","stars": 228000,"forks": 46500,"watchers": 228000,"open_issues": 980,"size": 412000,"topics": ["javascript", "react", "frontend", "ui", "library"],"license": "MIT License","languages": { "JavaScript": 8214530, "TypeScript": 412300, "HTML": 12040 },"readme": {"name": "README.md","path": "README.md","content": "# React\n\nReact is a JavaScript library for building user interfaces...","size": 4120,"html_url": "https://github.com/facebook/react/blob/main/README.md","download_url": "https://raw.githubusercontent.com/facebook/react/main/README.md"},"created_at": "2013-05-24T16:15:54Z","updated_at": "2026-06-26T09:12:00Z","pushed_at": "2026-06-26T08:40:11Z","is_fork": false,"is_archived": false,"default_branch": "main","scraped_at": "2026-06-26T15:30:00.000000","index": 1}
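The languages field holds raw byte counts, which are easier to compare across repositories once converted to percentages. A small sketch of that conversion, assuming records shaped like the example above (`language_shares` is a hypothetical helper):

```python
from typing import Dict


def language_shares(languages: Dict[str, int]) -> Dict[str, float]:
    """Convert per-language byte counts into percentages, largest first."""
    total = sum(languages.values())
    if total == 0:
        return {}
    ranked = sorted(languages.items(), key=lambda item: item[1], reverse=True)
    return {name: round(100 * count / total, 2) for name, count in ranked}
```

Applied to an exported record, this would be called as `language_shares(record["languages"])`.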
Use cases
- AI & LLM training data: harvest README and documentation text plus metadata to build code and docs training corpora.
- Developer & OSS research: find and rank the top repos for any language, topic or keyword with GitHub's own search syntax.
- Tech-stack & competitive analysis: compare languages, stars, activity and licenses across competitors or an ecosystem.
- Ecosystem monitoring: track stars, forks, issues and recent pushes for a watchlist of repositories over time.
FAQ
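For the ecosystem-monitoring case, two exported runs can be compared record by record. The sketch below assumes each run is a list of records with the full_name, stars, forks and pushed_at fields shown in the output example; `activity_deltas` is a hypothetical helper and not part of the actor.

```python
from typing import Dict, List


def activity_deltas(previous: List[dict], current: List[dict]) -> List[Dict]:
    """Compare two runs and report per-repo changes, biggest star gain first."""
    before = {record["full_name"]: record for record in previous}
    deltas = []
    for record in current:
        old = before.get(record["full_name"])
        if old is None:
            continue  # repo is new to the watchlist, nothing to compare against
        deltas.append({
            "full_name": record["full_name"],
            "stars_delta": record["stars"] - old["stars"],
            "forks_delta": record["forks"] - old["forks"],
            "pushed_since": record["pushed_at"] != old["pushed_at"],
        })
    return sorted(deltas, key=lambda delta: delta["stars_delta"], reverse=True)
```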
How do I scrape GitHub repository data? Choose search mode with a query (or
direct mode with repo URLs), pick which extras to include, and Run. You get
metadata, topics, language stats and decoded README content per repo.
What search syntax can I use? Anything GitHub search supports, e.g.
language:python stars:>1000, topic:machine-learning, org:apify,
created:>2024-01-01. Sort by stars, forks, recent updates or help-wanted issues.
Does it include the full README? Yes. With includeReadme enabled it fetches
the README via the API and returns the fully decoded text plus its size and URLs.
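GitHub's README endpoint (`GET /repos/{owner}/{repo}/readme`) returns the file body base64-encoded in a `content` field, so "decoded" here means reversing that encoding. A minimal sketch of the decoding step, assuming the standard response shape; `decode_readme` is a hypothetical helper, and replacing undecodable bytes rather than failing is a choice made for this example.

```python
import base64


def decode_readme(payload: dict) -> dict:
    """Decode a GitHub README API response into plain text plus metadata."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # The API wraps the base64 text across lines; b64decode ignores the newlines.
    raw = base64.b64decode(payload["content"])
    return {
        "name": payload.get("name"),
        "path": payload.get("path"),
        "content": raw.decode("utf-8", errors="replace"),
        "size": payload.get("size"),
        "html_url": payload.get("html_url"),
        "download_url": payload.get("download_url"),
    }
```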
Do I need an API key? No key is required to start. Without a token GitHub allows
60 requests/hour; adding an optional githubToken raises that to 5,000/hour for
large runs. The token is stored as a secret input.
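The difference between the two limits comes down to whether requests carry an Authorization header. The sketch below shows one way to build headers with an optional token and to read GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; `build_headers` and `seconds_until_reset` are hypothetical names, and how the actor itself handles limits is not documented here.

```python
from typing import Dict, Mapping, Optional


def build_headers(github_token: Optional[str] = None) -> Dict[str, str]:
    """Headers for a GitHub REST API request, authenticated when a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if github_token:
        headers["Authorization"] = f"Bearer {github_token}"
    return headers


def seconds_until_reset(response_headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait before the next request, or 0.0 if quota remains.

    Assumes header names are spelled exactly as below; X-RateLimit-Reset is a
    UTC epoch timestamp in seconds.
    """
    remaining = int(response_headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(response_headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```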
How do I get a GitHub token? Create a personal access token at
github.com/settings/tokens (no special scopes needed for public repos) and paste it
into the githubToken field.
Can I fetch private repositories? Only if you supply a githubToken that has
access to them; otherwise the API returns only public repositories.
How many repositories can it return? Up to your maxResults cap in search mode
(it paginates automatically), or as many URLs as you provide in direct mode.
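GitHub's search API returns at most 100 items per page, so reaching a larger maxResults means requesting several pages. The sketch below works out which pages are needed and how many items to keep from each; `page_plan` is a hypothetical helper and the actor's real pagination logic may differ.

```python
from typing import List, Tuple


def page_plan(max_results: int, per_page: int = 100) -> List[Tuple[int, int]]:
    """Return (page_number, items_to_keep) pairs covering max_results.

    Every request should use the same per_page value so page offsets line up;
    the second element only says how many items to keep from that page.
    """
    if not 1 <= max_results <= 1000:
        raise ValueError("max_results must be between 1 and 1000")
    plan = []
    remaining = max_results
    page = 1
    while remaining > 0:
        keep = min(per_page, remaining)
        plan.append((page, keep))
        remaining -= keep
        page += 1
    return plan
```

For example, a cap of 250 needs three requests, with the last page truncated to its first 50 items.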
Can I run it on a schedule or via API? Yes. Schedule recurring runs in Apify, call it via the API/SDK, or connect it to Make, Zapier or n8n.
Is it legal? It uses GitHub's official, documented REST API to read public data, with no browser automation or ToS workarounds. Respect rate limits and the licenses on the content you collect.
You might also like
- GitHub Intelligence Scraper: broader GitHub data intelligence
- npm Package Scraper: npm package metadata & stats
- Tech Stack Detector: detect technologies a site is built with
- Dev.to Articles Intelligence: developer articles & content
Keywords: GitHub scraper, GitHub API, repository data, README extraction, GitHub documentation scraper, LLM training data, code dataset, developer research, tech stack analysis, open source intelligence, repository metadata, GitHub search API, stars and forks data, programming language stats, repo monitoring.