
GitHub Repository Scraper
Pricing
$10.00/month + usage

This actor scrapes detailed information from GitHub repositories using reliable HTTP requests and HTML parsing. It extracts repository metadata including star counts, fork counts, topics/tags, license information, primary programming language, and last updated timestamps.
GitHub Repository Scraper for Apify
A Python-based Apify actor that scrapes GitHub repository information using requests and BeautifulSoup.
Features
- Extracts repository information including:
  - Full name (owner/repo)
  - Star count
  - Description
  - Primary programming language
  - Topics/tags
  - Last updated time
  - License information
  - Fork count
- Written in Python using requests and BeautifulSoup for reliable scraping
- Built for the Apify platform
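To make the output fields concrete, here is a minimal sketch of how a repository URL maps onto the identifier fields in the actor's output (`owner`, `name`, `fullName`). The helper `parse_repo_url` is a hypothetical illustration, not the actor's actual code:

```python
from urllib.parse import urlparse

def parse_repo_url(url: str) -> dict:
    """Split a GitHub repository URL into the identifier fields
    that appear in the actor's output (owner, name, fullName).
    Hypothetical helper for illustration only."""
    path = urlparse(url).path.strip("/")
    owner, name = path.split("/")[:2]
    return {
        "url": url,
        "owner": owner,
        "name": name,
        "fullName": f"{owner}/{name}",
    }
```

For example, `parse_repo_url("https://github.com/microsoft/playwright")` yields `owner` of `microsoft` and `fullName` of `microsoft/playwright`, matching the output shown below.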
Files
- `apify_actor.py` - The main actor code for Apify deployment
- `requests_github_scraper.py` - Standalone GitHub scraper (for local testing)
- `INPUT_SCHEMA.json` - Input schema for the Apify actor
- `requirements.txt` - Python dependencies
- `package.json` - Actor metadata for Apify
Local Testing
- Install dependencies: `pip install -r requirements.txt`
- Run the local version: `python requests_github_scraper.py`
- Check results in the `apify_storage` directory
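After a local run, the results can be read back out of the storage directory. Assuming the standard Apify local storage layout (dataset items written as individual JSON files under `apify_storage/datasets/default/`), a small helper like the following could collect them; `load_local_dataset` is a hypothetical sketch, not part of this repo:

```python
import json
from pathlib import Path

def load_local_dataset(storage_dir: str = "apify_storage") -> list:
    """Collect every JSON record a local run wrote to the default
    dataset (assumes one JSON file per pushed item, which is the
    usual Apify local storage layout)."""
    dataset = Path(storage_dir) / "datasets" / "default"
    records = []
    for item in sorted(dataset.glob("*.json")):
        records.append(json.loads(item.read_text()))
    return records
```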
Deploying to Apify
Prerequisites
- Create an Apify account if you don't have one
- Install the Apify CLI: `npm install -g apify-cli`
- Log in to your Apify account: `apify login`
Deployment Steps
- Initialize your project folder (if you haven't already): `apify init github-scraper`
- Modify the `Dockerfile` to use Python:

  ```dockerfile
  FROM apify/actor-python:3.9

  # Copy source code
  COPY . ./

  # Install dependencies
  RUN pip install --no-cache-dir -r requirements.txt

  # Define how to run the actor
  CMD ["python3", "apify_actor.py"]
  ```

- Push your actor to Apify: `apify push`
- After pushing, your actor will be available in the Apify Console.
Running on Apify
- Navigate to your actor in the Apify Console
- Click on "Run" in the top-right corner
- Enter the GitHub repository URLs you want to scrape in the Input form
- Click "Run" to start the actor
- Access the results in the "Dataset" tab once the run is complete
Input Options
- `repoUrls` (required): Array of GitHub repository URLs to scrape
- `sleepBetweenRequests` (optional): Delay between requests in seconds (default: 3)
Example Input
```json
{
  "repoUrls": [
    "https://github.com/microsoft/playwright",
    "https://github.com/facebook/react",
    "https://github.com/tensorflow/tensorflow"
  ],
  "sleepBetweenRequests": 5
}
```
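The `sleepBetweenRequests` input throttles how fast the actor hits GitHub. A minimal sketch of that pattern, assuming a hypothetical `scrape_all` loop where `fetch` stands in for the actual HTTP-and-parse step:

```python
import time

def scrape_all(repo_urls, fetch, sleep_between_requests=3):
    """Visit each repository URL in turn, pausing between requests
    to stay polite toward GitHub (mirrors the sleepBetweenRequests
    input). `fetch` is a stand-in for the real request/parse step."""
    results = []
    for i, url in enumerate(repo_urls):
        if i > 0:
            time.sleep(sleep_between_requests)
        results.append(fetch(url))
    return results
```

With the example input above, the actor would pause 5 seconds between each of the three repositories.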
Output Format
The actor provides clean, well-structured data for each GitHub repository in the following format:
```json
{
  "url": "https://github.com/microsoft/playwright",
  "name": "playwright",
  "owner": "microsoft",
  "fullName": "microsoft/playwright",
  "description": "Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.",
  "stats": {
    "stars": "71.2k",
    "forks": "4k"
  },
  "language": "TypeScript",
  "topics": ["electron", "javascript", "testing", "firefox", "chrome", "automation", "web", "test", "chromium", "test-automation", "testing-tools", "webkit", "end-to-end-testing", "e2e-testing", "playwright"],
  "lastUpdated": "2025-03-17T17:00:47Z",
  "license": "Apache-2.0 license"
}
```
Output Fields:

| Field | Type | Description |
|---|---|---|
| url | String | The full URL of the GitHub repository |
| name | String | Repository name (without owner) |
| owner | String | Username or organization that owns the repository |
| fullName | String | Complete repository identifier (owner/name) |
| description | String | Repository description |
| stats.stars | String | Number of stars the repository has |
| stats.forks | String | Number of forks the repository has |
| language | String | Primary programming language |
| topics | Array | List of topics/tags associated with the repository |
| lastUpdated | String | ISO timestamp of the last update |
| license | String | Repository license information |
This structured output format makes it easy to:
- Display repository cards in your applications
- Create data visualizations
- Filter and sort repositories by various attributes
- Export to other data formats
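Note that `stats.stars` and `stats.forks` are returned as GitHub's abbreviated display strings ("71.2k", "4k"), so sorting or filtering needs a numeric conversion first. A small sketch of one way to do this (`count_to_int` is a hypothetical helper, not part of the actor's output):

```python
def count_to_int(text: str) -> int:
    """Convert GitHub's abbreviated counts ("71.2k", "4k", "523")
    into integers so repositories can be filtered and sorted."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

# Example: sort scraped items by star count, highest first
repos = [
    {"fullName": "microsoft/playwright", "stats": {"stars": "71.2k"}},
    {"fullName": "facebook/react", "stats": {"stars": "223k"}},
]
repos.sort(key=lambda r: count_to_int(r["stats"]["stars"]), reverse=True)
# facebook/react sorts first (223,000 > 71,200)
```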