Gsoc Finder
Pricing
from $0.01 / 1,000 results
Developer

Sumeet Gond
12 days ago
Last modified
GSoC Organization Crawler
An Apify Actor that crawls Google Summer of Code organizations and extracts detailed information to help students find the perfect organization to contribute to.
Features • How It Works • Quick Start • Output • Integration
Features
| Feature | Description |
|---|---|
| Multi-Year Crawling | Scrapes organizations from multiple GSoC years (2022-2024+) |
| Technology Detection | Identifies programming languages, frameworks, and tools |
| Difficulty Assessment | Auto-determines difficulty (Beginner-Friendly, Intermediate, Advanced) |
| Acceptance Estimation | Estimates acceptance rates based on historical data |
| Topic Categorization | Extracts and categorizes topics for each organization |
| Data Merging | Combines data when organizations appear in multiple years |
How It Works
CRAWLER WORKFLOW

1. INPUT: Years to scrape, e.g. [2024, 2023]
2. CRAWL: Visit GSoC archive pages
3. EXTRACT: Parse org details
4. MERGE: Combine multi-year data
5. ENRICH: Add difficulty & acceptance
6. SAVE: Push to Apify Dataset
Step-by-Step Process
- Initialize - Actor starts with input parameters (years, max requests)
- Discover - Visits GSoC archive pages, waits for JavaScript to render
- Extract - Visits each organization page, extracts all details
- Merge - Combines data for organizations appearing in multiple years
- Enrich - Determines difficulty level and acceptance rate
- Save - Pushes all organizations to Apify Dataset
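The Merge step can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the `OrgRecord` shape, the `mergeOrganizations` name, and the rule of matching organizations by case-insensitive name are all assumptions made for the example.

```typescript
// Hypothetical record shape: only the fields needed to illustrate merging.
interface OrgRecord {
  name: string;
  url: string;
  technologies: string[];
  topics: string[];
  years: number[];
}

// Union two arrays while dropping duplicates (first occurrence wins).
function union<T>(a: T[], b: T[]): T[] {
  return Array.from(new Set([...a, ...b]));
}

// Most recent year in a list (-Infinity for an empty list).
function newest(years: number[]): number {
  return Math.max(...years);
}

// Merge per-year records into one record per organization.
// Assumption: organizations are matched by name, case-insensitively.
function mergeOrganizations(records: OrgRecord[]): OrgRecord[] {
  const byName = new Map<string, OrgRecord>();
  for (const record of records) {
    const key = record.name.trim().toLowerCase();
    const existing = byName.get(key);
    if (!existing) {
      // Copy the arrays so the caller's data is never mutated.
      byName.set(key, {
        ...record,
        technologies: [...record.technologies],
        topics: [...record.topics],
        years: [...record.years],
      });
      continue;
    }
    // Keep the URL that belongs to the most recent year seen so far.
    if (newest(record.years) > newest(existing.years)) {
      existing.url = record.url;
    }
    existing.technologies = union(existing.technologies, record.technologies);
    existing.topics = union(existing.topics, record.topics);
    existing.years = union(existing.years, record.years).sort((a, b) => b - a);
  }
  return Array.from(byName.values());
}
```

Sorting `years` in descending order mirrors the order used in the example output further down; any other stable order would work equally well.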
Quick Start
Prerequisites
- Node.js 18+ installed
- Apify CLI installed (`npm install -g apify-cli`)
- Apify Account (free tier available)
Local Development
    # 1. Navigate to the project
    cd gsoc
    # 2. Install dependencies
    npm install
    # 3. Run locally
    npm start
Deploy to Apify Cloud
    # 1. Log in to Apify (first time only)
    apify login
    # 2. Push to Apify Console
    apify push
    # 3. Run from Apify Console with input JSON
Input Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| years | number[] | [2024, 2023, 2022] | GSoC years to scrape |
| maxRequestsPerCrawl | number | 500 | Maximum HTTP requests per run |
Example Input
    {
      "years": [2024, 2023, 2022, 2021],
      "maxRequestsPerCrawl": 1000
    }
Tip: Start with fewer years (e.g., `[2024]`) for faster testing.
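As an illustration of how the defaults in the table could be applied, here is a small sketch. The `ActorInput` type and `normalizeInput` function are hypothetical names for this example; the Actor itself may rely on its input schema for validation instead.

```typescript
// Hypothetical input type mirroring the two documented fields.
interface ActorInput {
  years: number[];
  maxRequestsPerCrawl: number;
}

// Defaults taken from the input table above.
const DEFAULT_INPUT: ActorInput = {
  years: [2024, 2023, 2022],
  maxRequestsPerCrawl: 500,
};

// Fill in missing or invalid fields with the documented defaults.
function normalizeInput(raw?: Partial<ActorInput> | null): ActorInput {
  const input: Partial<ActorInput> = raw ?? {};

  // Deduplicate years, drop non-integers, newest first.
  const validYears = Array.from(new Set(input.years ?? []))
    .filter((year) => Number.isInteger(year))
    .sort((a, b) => b - a);

  const limit = input.maxRequestsPerCrawl;
  const validLimit =
    typeof limit === "number" && Number.isInteger(limit) && limit > 0;

  return {
    years: validYears.length > 0 ? validYears : [...DEFAULT_INPUT.years],
    maxRequestsPerCrawl: validLimit
      ? (limit as number)
      : DEFAULT_INPUT.maxRequestsPerCrawl,
  };
}
```

Calling `normalizeInput({ years: [2024] })` keeps the single requested year and falls back to the default request limit.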
Output Format
Each organization in the dataset includes:
    {
      "name": "TensorFlow",
      "description": "An end-to-end open source machine learning platform...",
      "url": "https://summerofcode.withgoogle.com/archive/2024/organizations/tensorflow",
      "technologies": ["Python", "C++", "JavaScript", "TensorFlow", "Keras"],
      "topics": ["Machine Learning", "Deep Learning", "AI"],
      "difficulty": "Advanced",
      "acceptanceRate": "Low",
      "years": [2024, 2023, 2022, 2021, 2020],
      "projectTypes": ["Library Development", "Documentation", "Testing"],
      "category": "Machine Learning",
      "ideaListUrl": "https://github.com/tensorflow/tensorflow/wiki/gsoc",
      "logoUrl": "https://..."
    }
Output Fields Explained
| Field | Type | Description |
|---|---|---|
| name | string | Organization's display name |
| description | string | Full description from GSoC page |
| url | string | Direct link to organization's GSoC page |
| technologies | string[] | Programming languages, frameworks, tools |
| topics | string[] | Project categories and domains |
| difficulty | string | Auto-determined: Beginner-Friendly, Intermediate, Advanced |
| acceptanceRate | string | Estimated: High, Medium, Low |
| years | number[] | Years the organization participated in GSoC |
| projectTypes | string[] | Types of projects available |
| category | string | Primary category classification |
| ideaListUrl | string | Link to project ideas page |
| logoUrl | string | Organization's logo URL |
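For consumers of the dataset, the field table above can be expressed as a TypeScript type with a runtime check. This is a sketch only: `GsocOrganization` and `isGsocOrganization` are hypothetical names, and the check assumes every field is always present, which the real dataset may not guarantee (for example, for `ideaListUrl` or `logoUrl`).

```typescript
// Allowed values, as listed in the field table.
const DIFFICULTIES = ["Beginner-Friendly", "Intermediate", "Advanced"] as const;
const ACCEPTANCE_RATES = ["High", "Medium", "Low"] as const;

// Hypothetical type for one dataset item.
interface GsocOrganization {
  name: string;
  description: string;
  url: string;
  technologies: string[];
  topics: string[];
  difficulty: (typeof DIFFICULTIES)[number];
  acceptanceRate: (typeof ACCEPTANCE_RATES)[number];
  years: number[];
  projectTypes: string[];
  category: string;
  ideaListUrl: string;
  logoUrl: string;
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Runtime check that an unknown value matches the documented shape.
function isGsocOrganization(value: unknown): value is GsocOrganization {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const stringFields = ["name", "description", "url", "category", "ideaListUrl", "logoUrl"];
  const listFields = ["technologies", "topics", "projectTypes"];
  const years = record["years"];
  return (
    stringFields.every((field) => typeof record[field] === "string") &&
    listFields.every((field) => isStringArray(record[field])) &&
    Array.isArray(years) &&
    years.every((year) => Number.isInteger(year)) &&
    (DIFFICULTIES as readonly unknown[]).indexOf(record["difficulty"]) !== -1 &&
    (ACCEPTANCE_RATES as readonly unknown[]).indexOf(record["acceptanceRate"]) !== -1
  );
}
```

A frontend could filter fetched items with `items.filter(isGsocOrganization)` before rendering them.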
Difficulty Classification
The crawler automatically determines difficulty based on technology complexity:
DIFFICULTY LEVELS

Beginner-Friendly
- HTML, CSS, JavaScript
- Python basics
- Documentation projects

Intermediate
- React, Vue, Angular
- Node.js, Django, Flask
- Database work (PostgreSQL, MongoDB)

Advanced
- C++, Rust, Go (systems programming)
- Kubernetes, Docker (infrastructure)
- TensorFlow, PyTorch (ML frameworks)
- Compiler/kernel development
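One way such a classification could work is a lookup against the technology lists above. The sketch below is an assumption about the approach, not the Actor's real logic: the function name `classifyDifficulty`, the "highest matching tier wins" rule, and the "Intermediate" fallback for unknown technologies are all illustrative choices.

```typescript
type Difficulty = "Beginner-Friendly" | "Intermediate" | "Advanced";

// Technology lists taken from the difficulty levels above (lowercased).
const ADVANCED_TECH = new Set([
  "c++", "rust", "go", "kubernetes", "docker", "tensorflow", "pytorch",
]);
const INTERMEDIATE_TECH = new Set([
  "react", "vue", "angular", "node.js", "django", "flask", "postgresql", "mongodb",
]);
const BEGINNER_TECH = new Set(["html", "css", "javascript", "python"]);

// Assumption: the most demanding technology decides the level.
function classifyDifficulty(technologies: string[]): Difficulty {
  const normalized = technologies.map((tech) => tech.trim().toLowerCase());
  const matches = (tier: Set<string>): boolean =>
    normalized.some((tech) => tier.has(tech));

  if (matches(ADVANCED_TECH)) return "Advanced";
  if (matches(INTERMEDIATE_TECH)) return "Intermediate";
  if (matches(BEGINNER_TECH)) return "Beginner-Friendly";
  // Assumption: unknown stacks default to the middle tier.
  return "Intermediate";
}
```

With this rule, the TensorFlow example from the output section (`["Python", "C++", "JavaScript", "TensorFlow", "Keras"]`) is classified as "Advanced", which matches the sample record.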
Integration Guide
Step 1: Deploy Actor
    cd gsoc
    apify login
    apify push
Step 2: Run Actor & Get Credentials
- Go to Apify Console
- Find your Actor → Click Start
- Configure input JSON and run
- After completion, note:
  - Actor ID: From the URL (e.g., `username/gsoc-crawler`)
  - Dataset ID: Storage tab → Dataset → Copy ID
Step 3: Configure Frontend
Create/update .env in your app root:
    # Required for Apify integration
    VITE_APIFY_API_TOKEN=apify_api_xxxxxxxxxxxx
    VITE_APIFY_ACTOR_ID=your-username/gsoc-crawler
    VITE_APIFY_DATASET_ID=xxxxxxxxxxxxxxxxxxxx
Step 4: Verify Integration
    # Run the frontend
    npm run dev
    # You should see:
    # ✓ "Live Data" badge in the UI
    # ✓ Last updated timestamp
    # ✓ Organizations loaded from Apify
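On the frontend side, loading the organizations could look roughly like the sketch below. It uses the Apify API's dataset items endpoint with a Bearer token; the helper names `buildDatasetItemsUrl` and `fetchOrganizations` are hypothetical, and the global `fetch` assumes a browser or Node.js 18+.

```typescript
// Build the URL for the Apify "get dataset items" endpoint.
function buildDatasetItemsUrl(datasetId: string, limit?: number): string {
  const url = new URL(
    `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
  );
  url.searchParams.set("format", "json");
  url.searchParams.set("clean", "true"); // skip empty items and hidden fields
  if (limit !== undefined) {
    url.searchParams.set("limit", String(limit));
  }
  return url.toString();
}

// Fetch all items from the dataset; the token is sent as a Bearer header
// so that it does not end up in the request URL.
async function fetchOrganizations(
  datasetId: string,
  token: string,
): Promise<unknown[]> {
  const response = await fetch(buildDatasetItemsUrl(datasetId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!response.ok) {
    throw new Error(`Apify API responded with status ${response.status}`);
  }
  return (await response.json()) as unknown[];
}
```

In a Vite app the two arguments would typically come from `import.meta.env.VITE_APIFY_DATASET_ID` and `import.meta.env.VITE_APIFY_API_TOKEN`. Vite embeds `VITE_`-prefixed variables in the client bundle, so a token stored this way is visible to anyone who loads the app.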
Project Structure
    gsoc/
    ├── .actor/
    │   ├── actor.json            # Actor metadata & configuration
    │   ├── dataset_schema.json   # Output data structure definition
    │   ├── input_schema.json     # Input validation schema
    │   └── Dockerfile            # Playwright container config
    ├── src/
    │   └── main.ts               # Main PlaywrightCrawler logic
    ├── storage/                  # Local development storage
    │   ├── datasets/default/
    │   ├── key_value_stores/default/
    │   └── request_queues/default/
    ├── package.json
    ├── tsconfig.json
    └── README.md
Console Output Preview
When the crawler runs, you'll see formatted output like this:
    GSoC CRAWLER RESULTS
    Total Organizations: 248
    Years Crawled: 2024, 2023, 2022
    Unique Technologies: 156

    DIFFICULTY BREAKDOWN
    Level          Count   Percentage
    Beginner          45   18.1%
    Intermediate     142   57.3%
    Advanced          61   24.6%

    TOP 10 TECHNOLOGIES
    1. Python ............... 187 orgs (75.4%)
    2. JavaScript ........... 134 orgs (54.0%)
    3. C++ .................. 89 orgs (35.9%)
    4. Java ................. 76 orgs (30.6%)
    5. TypeScript ........... 67 orgs (27.0%)
    ...
    ✓ Crawling complete! Data saved to Apify Dataset.
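The percentages in the difficulty breakdown are each level's count divided by the total number of organizations. A small sketch of that calculation follows; `difficultyBreakdown` and `BreakdownRow` are hypothetical names, and the Actor's own reporting code may differ.

```typescript
interface BreakdownRow {
  level: string;
  count: number;
  percentage: string;
}

// Count how often each difficulty level occurs and express it as a
// percentage of all organizations, rounded to one decimal place.
function difficultyBreakdown(levels: string[]): BreakdownRow[] {
  const counts = new Map<string, number>();
  for (const level of levels) {
    counts.set(level, (counts.get(level) ?? 0) + 1);
  }
  const total = levels.length;
  return Array.from(counts.entries()).map(([level, count]) => ({
    level,
    count,
    percentage: `${((count / total) * 100).toFixed(1)}%`,
  }));
}
```

For the sample run above, 45 of 248 organizations gives 18.1%, 142 of 248 gives 57.3%, and 61 of 248 gives 24.6%.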
Troubleshooting
Common Issues
| Issue | Solution |
|---|---|
| 0 organizations found | GSoC uses JavaScript rendering - we use PlaywrightCrawler to handle this |
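| Timeout errors | Check network connectivity. Note that `maxRequestsPerCrawl` only caps the number of requests per run and does not extend timeouts; if pages load slowly, raising the crawler's timeout settings in `src/main.ts` may help |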
| Memory issues | Reduce the number of years in input |
| Rate limiting | Actor automatically handles retries with exponential backoff |
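For readers unfamiliar with the term, exponential backoff means waiting longer after each failed attempt, typically doubling the delay. The sketch below illustrates the general idea only; it is not taken from the Actor or from Crawlee, and the names `backoffDelayMs` and `retryWithBackoff` as well as the default values are assumptions.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at maxDelayMs so that waits do not grow without bound.
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number = 30000,
): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt), maxDelayMs);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `task`, retrying up to `maxRetries` times with growing delays.
// The last error is rethrown once the retries are used up.
async function retryWithBackoff<T>(
  task: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 500,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries) throw error;
      await sleep(backoffDelayMs(attempt, baseDelayMs));
      attempt += 1;
    }
  }
}
```

With a base delay of 500 ms, the waits before the first three retries are 500 ms, 1000 ms, and 2000 ms.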
Debug Mode
Enable detailed logging by setting environment variable:
    $ DEBUG=1 npm start
Resources
Apify Documentation
- Apify SDK - Actor development toolkit
- Crawlee - Web scraping library
- PlaywrightCrawler - Browser automation
- Input Schema - Configuration validation
Tutorials
Integrations
License
MIT License - Feel free to use, modify, and distribute.
Built with ❤️ for GSoC Students
Getting started
For complete information see this article. To run the Actor use the following command:
    $ apify run
Deploy to Apify
Connect Git repository to Apify
If you've created a Git repository for the project, you can easily connect to Apify:
- Go to Actor creation page
- Click on Link Git Repository button
Push project on your local machine to Apify
You can also deploy the project from your local machine to Apify without a Git repository.
- Log in to Apify. You will need to provide your Apify API Token to complete this action.

      $ apify login

- Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.

      $ apify push