GitHub repos search scraper
Pricing
Pay per usage
Deprecated. Given a search query (e.g. "Apify"), scrapes all GitHub repos that contain the query in their title or description. Unlike the official API, it is not limited to the first 1000 results.
Last modified
3 years ago
About implementation
https://docs.github.com/en/free-pro-team@latest/rest/reference/search
The GitHub Search API returns at most 1000 results per search.
Getting past this limit requires a workaround, and even with it the results are not guaranteed to be complete.
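To make the 1000-result cap concrete, here is a minimal sketch of the request URLs a single search expands to. It assumes the public REST endpoint `https://api.github.com/search/repositories` with its `q`, `sort`, `order`, `per_page` and `page` parameters, and a page size of 100, so 10 pages cover everything one search can return. The helper name `search_page_urls` is hypothetical and is not taken from this actor's source.

```python
from urllib.parse import urlencode

API_URL = "https://api.github.com/search/repositories"
PER_PAGE = 100     # assumed page size (the largest the API accepts)
MAX_RESULTS = 1000  # cap per search, as described above


def search_page_urls(query: str) -> list:
    """Return the URLs for every page one search query can yield."""
    pages = MAX_RESULTS // PER_PAGE
    return [
        API_URL + "?" + urlencode({
            "q": query,
            "sort": "stars",   # sort by stars, highest first
            "order": "desc",
            "per_page": PER_PAGE,
            "page": page,
        })
        for page in range(1, pages + 1)
    ]


urls = search_page_urls("meteor")
print(len(urls))
print(urls[0])
```

Anything past the last of these pages is out of reach for that query, which is why a different query is needed to see further results.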
The workaround
- Sort results by stars
- Get first 1000 results
- Check number of stars of the last result, and use that number for filtering the next search query
- Repeat
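The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual code: `search` is a hypothetical callable that takes a query string and returns up to 1000 repos sorted by stars in descending order, each repo is a dict with `url` and `stars` keys, and the filter uses GitHub's `stars:<N` search qualifier.

```python
from typing import Any, Callable, Dict, List, Optional

MAX_RESULTS = 1000  # cap per search query

Repo = Dict[str, Any]
SearchFn = Callable[[str], List[Repo]]  # hypothetical: query -> repos, stars descending


def build_query(query: str, stars_below: Optional[int]) -> str:
    """Append a `stars:<N` qualifier once an upper bound is known."""
    if stars_below is None:
        return query
    return f"{query} stars:<{stars_below}"


def scrape_all(query: str, search: SearchFn) -> List[Repo]:
    seen: Dict[str, Repo] = {}
    stars_below: Optional[int] = None  # exclusive upper bound on stars
    while True:
        batch = search(build_query(query, stars_below))
        for repo in batch:
            seen.setdefault(repo["url"], repo)  # overlapping windows repeat repos
        if len(batch) < MAX_RESULTS:
            break  # the window was not full, so nothing is left below the bound
        # Keep the last star count inside the next window: some repos with
        # that count may not have fit into this one.
        next_below = batch[-1]["stars"] + 1
        if stars_below is not None and next_below >= stars_below:
            # A full window with a single star count: the rest of that
            # bucket cannot be reached, so skip past it.
            next_below = stars_below - 1
        if next_below <= 0:
            break
        stars_below = next_below
    return list(seen.values())
```

Deduplicating by URL matters because each new window deliberately overlaps the tail of the previous one.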
Limitation
If more than 1000 results have the same number of stars, there is no way to get them all.
Real example
Result counts for the query "meteor" (as of 2020-11-03)
| Stars | Results | Diff |
|---|---|---|
| no filter | 46 742 | 1000 |
| <26 | 45 759 | 983 |
| <11 | 44 851 | 908 |
| <7 | 44 120 | 731 |
| <5 | 43 401 | 719 |
| <4 | 42 839 | 562 |
| <3 | 41 971 | 868 |
| <2 | 40 415 | 1556 |
| <1 | 36 068 | 5347 |
Why is it not possible to sort by date?
https://stackoverflow.com/questions/37602893/github-search-limit-results#comment85767535_37639739
Output example
- owner: string, e.g. apify
- name: string, e.g. apify-js
- url: string, e.g. https://github.com/apify/apify-js
- fork: boolean
- description: string, e.g. Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js
- created_at: undefined, e.g. 2012-01-19T01:58:17Z
- updated_at: undefined, e.g. 2020-11-03T04:16:58Z
- pushed_at: undefined, e.g. 2020-10-31T16:21:04Z
- homepage: string, e.g. https://sdk.apify.com/
- size: number, e.g. 80509
- stars: number, e.g. 42034
- open_issues: number, e.g. 144
- forks: number, e.g. 5140
- language: string, e.g. JavaScript
- archived: boolean
- disabled: boolean
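Assembled from the example values listed above, a single output item might look like the record below. The three boolean values are assumptions, since the field list gives no examples for them. The timestamp fields have no declared type, but the examples look like ISO 8601 UTC strings, so the sketch parses one that way. The helper `parse_timestamp` is hypothetical.

```python
from datetime import datetime, timezone

item = {
    "owner": "apify",
    "name": "apify-js",
    "url": "https://github.com/apify/apify-js",
    "fork": False,  # assumed value, no example given
    "description": "Apify SDK — The scalable web scraping and crawling "
                   "library for JavaScript/Node.js",
    "created_at": "2012-01-19T01:58:17Z",
    "updated_at": "2020-11-03T04:16:58Z",
    "pushed_at": "2020-10-31T16:21:04Z",
    "homepage": "https://sdk.apify.com/",
    "size": 80509,
    "stars": 42034,
    "open_issues": 144,
    "forks": 5140,
    "language": "JavaScript",
    "archived": False,  # assumed value, no example given
    "disabled": False,  # assumed value, no example given
}


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2012-01-19T01:58:17Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


created = parse_timestamp(item["created_at"])
print(created.year, item["stars"])
```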
