
GitHub Repository Scraper
Pricing
$10.00/month + usage

This actor scrapes detailed information from GitHub repositories using reliable HTTP requests and HTML parsing. It extracts repository metadata including star counts, fork counts, topics/tags, license information, primary programming language, and last updated timestamps.
GitHub Repository Scraper for Apify
A Python-based Apify actor that scrapes GitHub repository information using requests and BeautifulSoup.
Features
- Extracts repository information including:
  - Full name (owner/repo)
  - Star count
  - Description
  - Primary programming language
  - Topics/tags
  - Last updated time
  - License information
  - Fork count
- Written in Python using requests and BeautifulSoup for reliable scraping
- Built for the Apify platform
Files
- apify_actor.py: The main actor code for Apify deployment
- requests_github_scraper.py: Standalone GitHub scraper (for local testing)
- INPUT_SCHEMA.json: Input schema for the Apify actor
- requirements.txt: Python dependencies
- package.json: Actor metadata for Apify
Local Testing
- Install dependencies:
pip install -r requirements.txt
- Run the local version:
python requests_github_scraper.py
- Check results in the apify_storage directory
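To inspect a local run, the results can be loaded back into Python. The following is a minimal sketch, not part of the actor: it assumes the scraper writes its records as JSON files somewhere under apify_storage (the exact sub-folder layout is not described here), so it walks the whole directory and keeps only objects that carry a fullName field. The helper name load_results is hypothetical.

```python
import json
from pathlib import Path


def load_results(storage_dir="apify_storage"):
    """Collect repository records from every *.json file under storage_dir."""
    records = []
    for path in sorted(Path(storage_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue  # skip unreadable or non-JSON files
        # Assumption: a file holds either one record or a list of records
        items = data if isinstance(data, list) else [data]
        # Keep only objects that look like scraper output (they have "fullName")
        records.extend(
            item for item in items if isinstance(item, dict) and "fullName" in item
        )
    return records
```

Calling load_results() from the project folder returns a list of dictionaries that can be printed or passed on to other tools.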
Deploying to Apify
Prerequisites
- Create an Apify account if you don't have one
- Install the Apify CLI:
npm install -g apify-cli
- Log in to your Apify account:
apify login
Deployment Steps
- Initialize your project folder (if you haven't already):
apify init github-scraper
- Modify the Dockerfile to use Python:

    FROM apify/actor-python:3.9

    # Copy source code
    COPY . ./

    # Install dependencies
    RUN pip install --no-cache-dir -r requirements.txt

    # Define how to run the actor
    CMD ["python3", "apify_actor.py"]

- Push your actor to Apify:
apify push
- After pushing, your actor will be available in the Apify Console.
Running on Apify
- Navigate to your actor in the Apify Console
- Click on "Run" in the top-right corner
- Enter the GitHub repository URLs you want to scrape in the Input form
- Click "Run" to start the actor
- Access the results in the "Dataset" tab once the run is complete
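The same run can also be started from code instead of the Console. The sketch below assumes the official apify-client Python package and uses its actor(...).call(run_input=...) and dataset(...).iterate_items() methods. The function name scrape_repos and the actor ID in the usage comment are placeholders, not values taken from this actor.

```python
def scrape_repos(client, actor_id, repo_urls, sleep_between_requests=3):
    """Start one actor run, wait for it to finish, and return its dataset items."""
    run_input = {
        "repoUrls": list(repo_urls),
        "sleepBetweenRequests": sleep_between_requests,
    }
    # call() blocks until the run finishes and returns the run record
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not return a run record")
    # Results are stored in the run's default dataset
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage with the official client (pip install apify-client).
# The token and actor ID below are placeholders:
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   repos = scrape_repos(client, "<username>/github-scraper",
#                        ["https://github.com/microsoft/playwright"])
```

Passing the client in as an argument keeps the function easy to exercise with a stub object before spending platform usage on a real run.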
Input Options
- repoUrls (required): Array of GitHub repository URLs to scrape
- sleepBetweenRequests (optional): Delay between requests in seconds (default: 3)
Example Input
    {
      "repoUrls": [
        "https://github.com/microsoft/playwright",
        "https://github.com/facebook/react",
        "https://github.com/tensorflow/tensorflow"
      ],
      "sleepBetweenRequests": 5
    }
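As an illustration of how these two options might be consumed, here is a minimal sketch of input validation and request pacing. It is not the actor's actual code: normalize_input and scrape_all are hypothetical names, rejecting an empty repoUrls array is an assumption, and scrape_one stands in for whatever function fetches and parses a single repository page.

```python
import time

DEFAULT_SLEEP_SECONDS = 3  # documented default for sleepBetweenRequests


def normalize_input(actor_input):
    """Validate the actor input and fill in the documented default delay."""
    urls = actor_input.get("repoUrls")
    # Assumption: an empty array is treated as invalid input
    if not isinstance(urls, list) or not urls:
        raise ValueError("repoUrls is required and must be a non-empty array")
    delay = actor_input.get("sleepBetweenRequests", DEFAULT_SLEEP_SECONDS)
    if isinstance(delay, bool) or not isinstance(delay, (int, float)) or delay < 0:
        raise ValueError("sleepBetweenRequests must be a non-negative number")
    return {
        "repoUrls": [str(u).strip() for u in urls],
        "sleepBetweenRequests": delay,
    }


def scrape_all(actor_input, scrape_one, sleep=time.sleep):
    """Call scrape_one(url) for each URL, pausing between consecutive requests."""
    config = normalize_input(actor_input)
    results = []
    for index, url in enumerate(config["repoUrls"]):
        if index > 0:
            # No pause before the first request, one pause before each later one
            sleep(config["sleepBetweenRequests"])
        results.append(scrape_one(url))
    return results
```

With the example input above, this loop would pause 5 seconds twice: once before the second URL and once before the third.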
Output Format
The actor provides clean, well-structured data for each GitHub repository in the following format:
    {
      "url": "https://github.com/microsoft/playwright",
      "name": "playwright",
      "owner": "microsoft",
      "fullName": "microsoft/playwright",
      "description": "Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.",
      "stats": {
        "stars": "71.2k",
        "forks": "4k"
      },
      "language": "TypeScript",
      "topics": [
        "electron",
        "javascript",
        "testing",
        "firefox",
        "chrome",
        "automation",
        "web",
        "test",
        "chromium",
        "test-automation",
        "testing-tools",
        "webkit",
        "end-to-end-testing",
        "e2e-testing",
        "playwright"
      ],
      "lastUpdated": "2025-03-17T17:00:47Z",
      "license": "Apache-2.0 license"
    }
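The url, name, owner and fullName values in this example can all be derived from the repository URL alone. The sketch below shows one way to do that with the standard library; repo_identity is a hypothetical helper, and accepting a www. prefix or a trailing .git suffix is an assumption rather than documented actor behavior.

```python
from urllib.parse import urlparse


def repo_identity(url):
    """Split a GitHub repository URL into the url, name, owner and fullName fields."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"Not a GitHub URL: {url}")
    # Keep only non-empty path segments, e.g. ["microsoft", "playwright"]
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>: {url}")
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[:-4]  # tolerate clone-style URLs
    return {
        "url": f"https://github.com/{owner}/{name}",
        "name": name,
        "owner": owner,
        "fullName": f"{owner}/{name}",
    }
```

Normalizing the URL this way also means that deeper links, such as a repository's issues page, resolve to the same record.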
Output Fields:
Field | Type | Description |
---|---|---|
url | String | The full URL of the GitHub repository |
name | String | Repository name (without owner) |
owner | String | Username or organization that owns the repository |
fullName | String | Complete repository identifier (owner/name) |
description | String | Repository description |
stats.stars | String | Star count as displayed on GitHub; may be abbreviated (e.g. "71.2k") |
stats.forks | String | Fork count as displayed on GitHub; may be abbreviated (e.g. "4k") |
language | String | Primary programming language |
topics | Array | List of topics/tags associated with the repository |
lastUpdated | String | ISO 8601 timestamp of the last update (e.g. "2025-03-17T17:00:47Z") |
license | String | Repository license information |
This structured output format makes it easy to:
- Display repository cards in your applications
- Create data visualizations
- Filter and sort repositories by various attributes
- Export to other data formats
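Because stats.stars and stats.forks are strings such as "71.2k", sorting or filtering by them requires converting them to numbers first. The following is a minimal sketch under the assumption that counts use only the "k" and "m" suffixes or plain digits with optional thousands separators; parse_count and sort_by_stars are hypothetical helpers, and the converted values are approximate because the abbreviated form has already been rounded.

```python
def parse_count(text):
    """Convert a display count such as "71.2k" or "1,234" to an approximate integer."""
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


def sort_by_stars(repos):
    """Return repositories ordered from most to fewest stars."""
    return sorted(
        repos,
        key=lambda repo: parse_count(repo["stats"]["stars"]),
        reverse=True,
    )
```

For the example record above, parse_count("71.2k") gives 71200 and parse_count("4k") gives 4000.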