🐙 GitHub Repository Intelligence - Repos & READMEs

Pricing

from $20.00 / 1,000 results

Extract repository metadata, README content, and documentation from GitHub using the official REST API. Perfect for LLM training data, developer research, and competitive analysis. Search by keywords or fetch specific repositories.

Rating: 0.0 (0)

Developer: ben (Maintained by Community)

Actor stats

  • Bookmarked: 1
  • Total users: 57
  • Monthly active users: 12
  • Last modified: 5 days ago

🐙 GitHub Repository Intelligence - Repo Metadata, READMEs & Docs via Official API

Extract structured GitHub repository data, README content and documentation at scale using GitHub's official REST API, with no HTML scraping and no broken selectors. Search millions of repos with GitHub's own query syntax, or fetch specific repositories by URL, and get back clean metadata (stars, forks, language stats, topics, license) plus the full decoded README for each one. Perfect for LLM training data, developer research and competitive analysis. Export to JSON/CSV/Excel, run on a schedule, call via API, or connect to Make, Zapier or n8n.

📦 What is GitHub Repository Intelligence?

It turns a GitHub search query, or a list of repo URLs, into a rich dataset built from the official GitHub API. For every repository it pulls core stats, topics, per-language byte counts and the decoded README text in one pass, fetching the extra details concurrently for speed. Because it uses the documented API rather than HTML scraping, results stay stable and complete, which makes it well suited to building AI training corpora, tech-stack analysis and ecosystem monitoring.
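
To illustrate what "fetching the extra details concurrently" can look like, here is a minimal Python sketch. It is not the Actor's own code, which this page does not show. The three paths are GitHub's public REST endpoints for languages, topics and the README; the helper names `detail_urls` and `fetch_details`, and the injected `fetch` callable, are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

API_ROOT = "https://api.github.com"


def detail_urls(full_name: str) -> Dict[str, str]:
    """Map each extra detail to its GitHub REST endpoint for one repository."""
    base = f"{API_ROOT}/repos/{full_name}"
    return {
        "languages": f"{base}/languages",
        "topics": f"{base}/topics",
        "readme": f"{base}/readme",
    }


def fetch_details(full_name: str, fetch: Callable[[str], object]) -> Dict[str, object]:
    """Run fetch(url) for every extra detail in parallel threads.

    `fetch` is a hypothetical callable supplied by the caller, for example
    a thin wrapper around an HTTP client that returns parsed JSON.
    """
    urls = detail_urls(full_name)
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = {key: pool.submit(fetch, url) for key, url in urls.items()}
        return {key: future.result() for key, future in futures.items()}


# Stand-in fetcher so the sketch runs offline; swap in a real HTTP call.
details = fetch_details("facebook/react", lambda url: f"GET {url}")
```

Passing the fetcher in as an argument keeps the concurrency logic testable without network access.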

What data does it extract?

  • full_name, name, owner (login, type, profile URL) and description
  • stars, forks, watchers, open_issues and repo size
  • language (primary) plus languages byte-count breakdown
  • topics / tags and license name
  • readme: decoded full content with name, path, size and URLs
  • homepage and repository url
  • Timestamps: created_at, updated_at, pushed_at
  • Flags: is_fork, is_archived, is_private, has_wiki, has_issues, default_branch
  • scraped_at timestamp on every record
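
The languages field holds raw byte counts, so a common first step is to convert it to percentages. The sketch below shows one way to do that; `language_shares` is a hypothetical helper, not part of the Actor's output.

```python
from typing import Dict


def language_shares(languages: Dict[str, int]) -> Dict[str, float]:
    """Convert per-language byte counts into percentage shares (2 decimals)."""
    total = sum(languages.values())
    if total == 0:
        return {}  # repository with no detected languages
    return {name: round(100 * count / total, 2) for name, count in languages.items()}


# Byte counts in the shape of the `languages` output field.
shares = language_shares({"JavaScript": 8214530, "TypeScript": 412300, "HTML": 12040})
dominant = max(shares, key=shares.get)
```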

⬇️ Input

Run it in one of two ways, either searching repositories with GitHub query syntax or passing exact repo URLs:

| Field | Description |
| --- | --- |
| mode | search (by query) or direct (by URL) |
| searchQuery | GitHub search query, e.g. language:python stars:>1000, react, machine-learning |
| repositoryUrls | One URL per line in direct mode, e.g. https://github.com/facebook/react |
| sortBy | stars, forks, updated, or help-wanted-issues (search mode) |
| maxResults | Max repositories to fetch in search mode (1–1000) |
| includeReadme | Fetch and decode the full README content |
| includeTopics | Fetch repository topics/tags |
| includeLanguages | Fetch programming-language statistics |
| githubToken | Optional token for higher rate limits (5,000 vs 60 requests/hour) |
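
In direct mode each URL has to be reduced to an owner/repository pair before the API can be called. The sketch below shows one plausible way to do that parsing. The function names are hypothetical and the Actor's actual handling of unusual URLs is not documented here.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def parse_repo_url(url: str) -> Tuple[str, str]:
    """Split a GitHub repository URL into (owner, repo)."""
    parsed = urlparse(url.strip())
    if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
        raise ValueError(f"Not a GitHub URL: {url!r}")
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"Missing owner or repository name: {url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]  # tolerate clone-style URLs
    return owner, repo


def parse_repo_urls(text: str) -> List[Tuple[str, str]]:
    """Parse a block of text with one URL per line, skipping blank lines."""
    return [parse_repo_url(line) for line in text.splitlines() if line.strip()]


pairs = parse_repo_urls("https://github.com/facebook/react\n\nhttps://github.com/apify/crawlee.git\n")
```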

Example input

{
  "mode": "search",
  "searchQuery": "language:python stars:>1000",
  "sortBy": "stars",
  "maxResults": 30,
  "includeReadme": true,
  "includeTopics": true,
  "includeLanguages": true
}
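
Before starting a run from a script, it can help to check the input against the constraints listed in the table above. This is a minimal sketch. Which fields are strictly required is an assumption here, and `validate_input` is a hypothetical helper rather than anything the Actor exposes.

```python
from typing import Any, Dict

ALLOWED_MODES = {"search", "direct"}
ALLOWED_SORTS = {"stars", "forks", "updated", "help-wanted-issues"}


def validate_input(run_input: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError if the input breaks a documented constraint."""
    mode = run_input.get("mode")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if mode == "search":
        # Assumption: search mode needs a non-empty query.
        if not str(run_input.get("searchQuery") or "").strip():
            raise ValueError("searchQuery is required in search mode")
        sort_by = run_input.get("sortBy")
        if sort_by is not None and sort_by not in ALLOWED_SORTS:
            raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORTS)}")
        max_results = run_input.get("maxResults")
        if max_results is not None:
            if not isinstance(max_results, int) or not 1 <= max_results <= 1000:
                raise ValueError("maxResults must be an integer from 1 to 1000")
    else:
        # Assumption: direct mode needs at least one repository URL.
        if not run_input.get("repositoryUrls"):
            raise ValueError("repositoryUrls is required in direct mode")
    return run_input


checked = validate_input({
    "mode": "search",
    "searchQuery": "language:python stars:>1000",
    "sortBy": "stars",
    "maxResults": 30,
})
```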

⬆️ Output

Every repository is one clean row (view as a table, or export JSON / CSV / Excel):

{
  "name": "react",
  "full_name": "facebook/react",
  "owner": {
    "login": "facebook",
    "type": "Organization",
    "url": "https://github.com/facebook"
  },
  "description": "The library for web and native user interfaces.",
  "url": "https://github.com/facebook/react",
  "homepage": "https://react.dev",
  "language": "JavaScript",
  "stars": 228000,
  "forks": 46500,
  "watchers": 228000,
  "open_issues": 980,
  "size": 412000,
  "topics": ["javascript", "react", "frontend", "ui", "library"],
  "license": "MIT License",
  "languages": { "JavaScript": 8214530, "TypeScript": 412300, "HTML": 12040 },
  "readme": {
    "name": "README.md",
    "path": "README.md",
    "content": "# React\n\nReact is a JavaScript library for building user interfaces...",
    "size": 4120,
    "html_url": "https://github.com/facebook/react/blob/main/README.md",
    "download_url": "https://raw.githubusercontent.com/facebook/react/main/README.md"
  },
  "created_at": "2013-05-24T16:15:54Z",
  "updated_at": "2026-06-26T09:12:00Z",
  "pushed_at": "2026-06-26T08:40:11Z",
  "is_fork": false,
  "is_archived": false,
  "default_branch": "main",
  "scraped_at": "2026-06-26T15:30:00.000000",
  "index": 1
}
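
Because owner and readme are nested objects, a record like the one above usually needs flattening before it fits a spreadsheet-style table. The sketch below shows one way to do this with the standard library. The column selection and the names `flatten_repo` and `to_csv` are illustrative choices, not the Actor's export logic.

```python
import csv
import io
from typing import Any, Dict, Iterable


def flatten_repo(record: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a flat set of columns from one nested output record."""
    owner = record.get("owner") or {}
    readme = record.get("readme") or {}
    return {
        "full_name": record.get("full_name"),
        "owner_login": owner.get("login"),
        "owner_type": owner.get("type"),
        "stars": record.get("stars"),
        "forks": record.get("forks"),
        "language": record.get("language"),
        "license": record.get("license"),
        "topics": ";".join(record.get("topics") or []),
        "readme_chars": len(readme.get("content") or ""),
    }


def to_csv(records: Iterable[Dict[str, Any]]) -> str:
    """Render flattened records as CSV text with a header row."""
    rows = [flatten_repo(record) for record in records]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```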

💡 Use cases

  • 🤖 AI & LLM training data: harvest README and documentation text plus metadata to build code and docs training corpora.
  • 🔍 Developer & OSS research: find and rank the top repos for any language, topic or keyword with GitHub's own search syntax.
  • 📊 Tech-stack & competitive analysis: compare languages, stars, activity and licenses across competitors or an ecosystem.
  • 📈 Ecosystem monitoring: track stars, forks, issues and recent pushes for a watchlist of repositories over time.
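
As an example of the training-data use case, the sketch below turns output records into JSON Lines, keeping only repositories that pass a few filters. The filter rules (license allow-list, minimum README length, skipping forks and archived repos) and the name `to_corpus_lines` are assumptions chosen for illustration; adjust them to your own data policy.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set


def to_corpus_lines(
    records: Iterable[Dict[str, Any]],
    allowed_licenses: Set[str],
    min_chars: int = 200,
) -> Iterator[str]:
    """Yield one JSON line per README that passes the filters."""
    for record in records:
        text = (record.get("readme") or {}).get("content") or ""
        if record.get("is_archived") or record.get("is_fork"):
            continue  # skip stale copies and forks
        if record.get("license") not in allowed_licenses:
            continue  # respect the licenses you have decided to accept
        if len(text) < min_chars:
            continue  # drop near-empty READMEs
        yield json.dumps(
            {"source": record.get("url"), "license": record["license"], "text": text},
            ensure_ascii=False,
        )
```

Each yielded string can be written to a .jsonl file, one per line.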

❓ FAQ

How do I scrape GitHub repository data? Choose search mode with a query (or direct mode with repo URLs), pick which extras to include, and start the run. You get metadata, topics, language stats and decoded README content per repo.

What search syntax can I use? Anything GitHub search supports, e.g. language:python stars:>1000, topic:machine-learning, org:apify, created:>2024-01-01. Sort by stars, forks, recent updates or help-wanted issues.
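
For reference, this is roughly how such a query maps onto GitHub's public repository search endpoint. The sketch only builds the URL. Whether the Actor constructs its requests this way is an assumption, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(query: str, sort: str = "stars", order: str = "desc",
                     per_page: int = 30, page: int = 1) -> str:
    """Build a repository search URL with the query safely percent-encoded."""
    params = {"q": query, "sort": sort, "order": order,
              "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("language:python stars:>1000")
```

Encoding matters because qualifiers contain characters such as ":" and ">" that are not safe in a raw query string.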

Does it include the full README? Yes. With includeReadme enabled, it fetches the README via the API and returns the fully decoded text plus its size and URLs.
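
"Decoded" matters because GitHub's README endpoint returns the file body base64-encoded, with line breaks inside the encoded string. A minimal sketch of that decoding step, assuming a response payload with `content` and `encoding` keys as in GitHub's contents API (`decode_readme` itself is a hypothetical name):

```python
import base64
from typing import Any, Dict


def decode_readme(payload: Dict[str, Any]) -> str:
    """Decode the base64 `content` of a README API response into text."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"Unexpected encoding: {payload.get('encoding')!r}")
    # b64decode ignores the newlines embedded in the encoded string.
    raw = base64.b64decode(payload["content"])
    # Replace undecodable bytes rather than failing on odd READMEs.
    return raw.decode("utf-8", errors="replace")
```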

Do I need an API key? No key is required to start. Without a token GitHub allows 60 requests/hour; adding an optional githubToken raises that to 5,000/hour for large runs. The token is stored as a secret input.
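
To get a feel for what the two limits mean in practice, the sketch below builds request headers with an optional token and estimates how many hourly rate-limit windows a run would consume. The per-repository request count is an assumption you pass in, not a documented figure for this Actor, and the estimate ignores the separate limit GitHub applies to search endpoints.

```python
import math
from typing import Dict, Optional


def github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Headers for the GitHub REST API, authenticated only when a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def hours_needed(repos: int, requests_per_repo: int, token: Optional[str] = None) -> int:
    """Estimate hourly windows needed at 5,000 (token) or 60 (no token) requests/hour."""
    limit = 5000 if token else 60
    return math.ceil(repos * requests_per_repo / limit)


# Assumed workload: 1,000 repos at 3 detail requests each.
without_token = hours_needed(1000, 3)
with_token = hours_needed(1000, 3, token="example-token")
```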

How do I get a GitHub token? Create a personal access token at github.com/settings/tokens (no special scopes needed for public repos) and paste it into the githubToken field.

Can I fetch private repositories? Only if you supply a githubToken that has access to them; otherwise the API returns only public repositories.

How many repositories can it return? Up to your maxResults cap in search mode (it paginates automatically), or as many URLs as you provide in direct mode.
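
Automatic pagination can be pictured as splitting maxResults into pages. GitHub's API allows at most 100 items per page and caps search results at 1,000, which matches the maxResults range above. The sketch below is one way to plan the pages; `page_plan` is a hypothetical helper and the Actor's real page size is not documented here.

```python
import math
from typing import List, Tuple


def page_plan(max_results: int, per_page: int = 100) -> List[Tuple[int, int]]:
    """Return (page number, items to keep from that page) pairs."""
    if not 1 <= max_results <= 1000:
        raise ValueError("max_results must be between 1 and 1000")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    plan = []
    remaining = max_results
    for page in range(1, math.ceil(max_results / per_page) + 1):
        keep = min(per_page, remaining)  # the last page may be partial
        plan.append((page, keep))
        remaining -= keep
    return plan


plan = page_plan(250)
```

Every request should still use the same per_page value; the second number only says how many items of that page count toward the cap.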

Can I run it on a schedule or via API? Yes. Schedule recurring runs in Apify, call it via the API/SDK, or connect it to Make, Zapier or n8n.
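
A sketch of calling the Actor over HTTP with only the Python standard library is shown below. It targets Apify's run-sync-get-dataset-items endpoint, which runs an Actor and returns its dataset items in one response. The Actor ID `username~actor-name` is a placeholder, since this page does not state the real ID, and `build_run_request` is a hypothetical helper.

```python
import json
from typing import Any, Dict
from urllib.request import Request


def build_run_request(actor_id: str, api_token: str, run_input: Dict[str, Any]) -> Request:
    """Prepare a POST request that runs an Actor and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


# Placeholder Actor ID and token; replace both with your own values.
request = build_run_request(
    "username~actor-name",
    "YOUR_APIFY_TOKEN",
    {"mode": "search", "searchQuery": "topic:machine-learning", "maxResults": 10},
)
# To send it: urllib.request.urlopen(request) and json.load() the response.
```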

Is it legal? It uses GitHub's official, documented REST API to read public data, with no browser automation or ToS workarounds. Respect rate limits and the licenses on the content you collect.

Keywords: GitHub scraper, GitHub API, repository data, README extraction, GitHub documentation scraper, LLM training data, code dataset, developer research, tech stack analysis, open source intelligence, repository metadata, GitHub search API, stars and forks data, programming language stats, repo monitoring.