Under maintenance

Pricing

from $0.01 / 1,000 results

Gsoc Finder

Developer

Sumeet Gond

Last modified: 7 months ago

🎓 GSoC Organization Crawler

An Apify Actor that crawls Google Summer of Code organizations and extracts detailed information to help students find the perfect organization to contribute to.

Features • How It Works • Quick Start • Output • Integration

✨ Features

| Feature | Description |
| --- | --- |
| 🔄 Multi-Year Crawling | Scrapes organizations from multiple GSoC years (2022-2024+) |
| 🔧 Technology Detection | Identifies programming languages, frameworks, and tools |
| 📊 Difficulty Assessment | Auto-determines difficulty (Beginner-Friendly, Intermediate, Advanced) |
| 📈 Acceptance Estimation | Estimates acceptance rates based on historical data |
| 🏷️ Topic Categorization | Extracts and categorizes topics for each organization |
| 🔗 Data Merging | Combines data when organizations appear in multiple years |

🔄 How It Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                           CRAWLER WORKFLOW                                   │
└─────────────────────────────────────────────────────────────────────────────┘

     ┌──────────────┐         ┌──────────────┐         ┌──────────────┐
     │   1. INPUT   │         │   2. CRAWL   │         │  3. EXTRACT  │
     │              │         │              │         │              │
     │ Years to     │────────▶│ Visit GSoC   │────────▶│ Parse org    │
     │ scrape       │         │ archive      │         │ details      │
     │ [2024,2023]  │         │ pages        │         │              │
     └──────────────┘         └──────────────┘         └──────────────┘
                                                              │
     ┌──────────────┐         ┌──────────────┐                │
     │   6. SAVE    │         │  5. ENRICH   │                │
     │              │         │              │                ▼
     │ Push to      │◀────────│ Add          │◀───────┌──────────────┐
     │ Apify        │         │ difficulty & │        │  4. MERGE    │
     │ Dataset      │         │ acceptance   │        │              │
     └──────────────┘         └──────────────┘        │ Combine      │
                                                      │ multi-year   │
                                                      │ data         │
                                                      └──────────────┘

Step-by-Step Process

🚀 Initialize - Actor starts with input parameters (years, max requests)
🔍 Discover - Visits GSoC archive pages, waits for JavaScript to render
📄 Extract - Visits each organization page, extracts all details
🔗 Merge - Combines data for organizations appearing in multiple years
📊 Enrich - Determines difficulty level and acceptance rate
💾 Save - Pushes all organizations to Apify Dataset
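
The Merge step could look roughly like the sketch below. It is an illustration, not the Actor's actual code: the `OrgRecord` shape and the choice of the normalized organization name as the merge key are assumptions, and the field names are borrowed from the Output Format section.

```typescript
// Hypothetical per-year record; field names follow the Output Format section.
interface OrgRecord {
  name: string;
  technologies: string[];
  topics: string[];
  years: number[];
}

// Concatenate two lists and drop duplicates, keeping first-seen order.
function union<T>(a: T[], b: T[]): T[] {
  return Array.from(new Set([...a, ...b]));
}

// Merge records that refer to the same organization.
// Assumption: the trimmed, lower-cased name identifies an organization.
function mergeOrganizations(records: OrgRecord[]): OrgRecord[] {
  const byName = new Map<string, OrgRecord>();
  for (const record of records) {
    const key = record.name.trim().toLowerCase();
    const existing = byName.get(key);
    if (!existing) {
      // Copy the arrays so later merges do not mutate the caller's data.
      byName.set(key, {
        name: record.name,
        technologies: [...record.technologies],
        topics: [...record.topics],
        years: [...record.years],
      });
      continue;
    }
    existing.technologies = union(existing.technologies, record.technologies);
    existing.topics = union(existing.topics, record.topics);
    // Newest year first, matching the order used in the example output.
    existing.years = union(existing.years, record.years).sort((a, b) => b - a);
  }
  return Array.from(byName.values());
}
```

The first record seen for an organization supplies the display name; later records only contribute technologies, topics, and years.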

🚀 Quick Start

Prerequisites

Node.js 18+ installed
Apify CLI installed (npm install -g apify-cli)
Apify Account (free tier available)

Local Development

# 1. Navigate to the project
cd gsoc

# 2. Install dependencies
npm install

# 3. Run locally
npm start

Deploy to Apify Cloud

# 1. Login to Apify (first time only)
apify login

# 2. Push to Apify Console
apify push

# 3. Run from Apify Console with input JSON

📥 Input Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `years` | `number[]` | `[2024, 2023, 2022]` | GSoC years to scrape |
| `maxRequestsPerCrawl` | `number` | `500` | Maximum HTTP requests per run |

Example Input

{
  "years": [2024, 2023, 2022, 2021],
  "maxRequestsPerCrawl": 1000
}

💡 Tip: Start with fewer years (e.g., [2024]) for faster testing
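
As a rough sketch of how the input might be resolved before crawling, the snippet below applies the documented defaults and builds one archive URL per year. `resolveInput` and `archiveUrl` are hypothetical helpers, and the listing URL pattern is an assumption inferred from the organization URL shown in the Output Format example.

```typescript
interface CrawlerInput {
  years: number[];
  maxRequestsPerCrawl: number;
}

// Defaults copied from the Input Configuration table.
const DEFAULT_INPUT: CrawlerInput = {
  years: [2024, 2023, 2022],
  maxRequestsPerCrawl: 500,
};

// Fill in missing fields and drop duplicate or non-integer years.
function resolveInput(raw: Partial<CrawlerInput> | null | undefined): CrawlerInput {
  const requested = raw?.years ?? DEFAULT_INPUT.years;
  const years = requested.filter(
    (year, index, all) => Number.isInteger(year) && all.indexOf(year) === index,
  );
  return {
    // Fall back to the default years if nothing valid was supplied.
    years: years.length > 0 ? years : [...DEFAULT_INPUT.years],
    maxRequestsPerCrawl: raw?.maxRequestsPerCrawl ?? DEFAULT_INPUT.maxRequestsPerCrawl,
  };
}

// Assumption: yearly listing pages share the prefix of the organization URLs.
function archiveUrl(year: number): string {
  return `https://summerofcode.withgoogle.com/archive/${year}/organizations`;
}
```

For example, `resolveInput({ years: [2024] }).years.map(archiveUrl)` yields a single start URL, which matches the tip above about testing with one year.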

📤 Output Format

Each organization in the dataset includes:

{
  "name": "TensorFlow",
  "description": "An end-to-end open source machine learning platform...",
  "url": "https://summerofcode.withgoogle.com/archive/2024/organizations/tensorflow",
  "technologies": ["Python", "C++", "JavaScript", "TensorFlow", "Keras"],
  "topics": ["Machine Learning", "Deep Learning", "AI"],
  "difficulty": "Advanced",
  "acceptanceRate": "Low",
  "years": [2024, 2023, 2022, 2021, 2020],
  "projectTypes": ["Library Development", "Documentation", "Testing"],
  "category": "Machine Learning",
  "ideaListUrl": "https://github.com/tensorflow/tensorflow/wiki/gsoc",
  "logoUrl": "https://..."
}

Output Fields Explained

| Field | Type | Description |
| --- | --- | --- |
| `name` | `string` | Organization's display name |
| `description` | `string` | Full description from GSoC page |
| `url` | `string` | Direct link to organization's GSoC page |
| `technologies` | `string[]` | Programming languages, frameworks, tools |
| `topics` | `string[]` | Project categories and domains |
| `difficulty` | `string` | Auto-determined: Beginner-Friendly, Intermediate, Advanced |
| `acceptanceRate` | `string` | Estimated: High, Medium, Low |
| `years` | `number[]` | Years the organization participated in GSoC |
| `projectTypes` | `string[]` | Types of projects available |
| `category` | `string` | Primary category classification |
| `ideaListUrl` | `string` | Link to project ideas page |
| `logoUrl` | `string` | Organization's logo URL |
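
Consumers of the dataset may want a typed view of these fields. The sketch below mirrors the table as a TypeScript interface with a runtime check. The names `GsocOrganization` and `isGsocOrganization` are hypothetical, and treating every field as required is an assumption, since the README does not say which fields can be missing.

```typescript
type Difficulty = "Beginner-Friendly" | "Intermediate" | "Advanced";
type AcceptanceRate = "High" | "Medium" | "Low";

interface GsocOrganization {
  name: string;
  description: string;
  url: string;
  technologies: string[];
  topics: string[];
  difficulty: Difficulty;
  acceptanceRate: AcceptanceRate;
  years: number[];
  projectTypes: string[];
  category: string;
  ideaListUrl: string;
  logoUrl: string;
}

const DIFFICULTIES: string[] = ["Beginner-Friendly", "Intermediate", "Advanced"];
const ACCEPTANCE_RATES: string[] = ["High", "Medium", "Low"];
const STRING_FIELDS = ["name", "description", "url", "category", "ideaListUrl", "logoUrl"];
const STRING_ARRAY_FIELDS = ["technologies", "topics", "projectTypes"];

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime check that an unknown dataset item has the documented shape.
function isGsocOrganization(value: unknown): value is GsocOrganization {
  if (typeof value !== "object" || value === null) return false;
  const item = value as Record<string, unknown>;
  const years = item["years"];
  return (
    STRING_FIELDS.every((field) => typeof item[field] === "string") &&
    STRING_ARRAY_FIELDS.every((field) => isStringArray(item[field])) &&
    Array.isArray(years) &&
    years.every((year) => typeof year === "number") &&
    DIFFICULTIES.indexOf(item["difficulty"] as string) !== -1 &&
    ACCEPTANCE_RATES.indexOf(item["acceptanceRate"] as string) !== -1
  );
}
```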

📊 Difficulty Classification

The crawler automatically determines difficulty based on technology complexity:

┌─────────────────────────────────────────────────────────────────┐
│                    DIFFICULTY LEVELS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  🟢 BEGINNER-FRIENDLY                                          │
│  ├── HTML, CSS, JavaScript                                     │
│  ├── Python basics                                             │
│  └── Documentation projects                                    │
│                                                                 │
│  🟡 INTERMEDIATE                                               │
│  ├── React, Vue, Angular                                       │
│  ├── Node.js, Django, Flask                                    │
│  └── Database work (PostgreSQL, MongoDB)                       │
│                                                                 │
│  🔴 ADVANCED                                                   │
│  ├── C++, Rust, Go (systems programming)                       │
│  ├── Kubernetes, Docker (infrastructure)                       │
│  ├── TensorFlow, PyTorch (ML frameworks)                       │
│  └── Compiler/kernel development                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
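
A minimal sketch of how such a classification could work is shown below. The keyword lists come from the diagram above, but the matching rules are assumptions rather than the Actor's actual logic: the highest matching tier wins, and stacks with no recognized technology fall back to Intermediate.

```typescript
type Difficulty = "Beginner-Friendly" | "Intermediate" | "Advanced";

// Keyword lists taken from the difficulty diagram (lower-cased for matching).
const ADVANCED_TECH = ["c++", "rust", "go", "kubernetes", "docker", "tensorflow", "pytorch"];
const INTERMEDIATE_TECH = [
  "react", "vue", "angular", "node.js", "django", "flask", "postgresql", "mongodb",
];
const BEGINNER_TECH = ["html", "css", "javascript", "python"];

function classifyDifficulty(technologies: string[]): Difficulty {
  const normalized = technologies.map((tech) => tech.trim().toLowerCase());
  const matches = (tier: string[]): boolean =>
    normalized.some((tech) => tier.indexOf(tech) !== -1);

  // Assumption: the most demanding technology decides the level.
  if (matches(ADVANCED_TECH)) return "Advanced";
  if (matches(INTERMEDIATE_TECH)) return "Intermediate";
  if (matches(BEGINNER_TECH)) return "Beginner-Friendly";
  // Assumption: unrecognized stacks fall back to the middle tier.
  return "Intermediate";
}
```

Under these rules the TensorFlow example from the Output Format section is classified as Advanced, because C++ and TensorFlow outrank Python and JavaScript.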

🔗 Integration Guide

Step 1: Deploy Actor

cd gsoc
apify login
apify push

Step 2: Run Actor & Get Credentials

Go to Apify Console
Find your actor → Click Start
Configure input JSON and run
After completion:
- Actor ID: From the URL (e.g., username/gsoc-crawler)
- Dataset ID: Storage tab → Dataset → Copy ID

Step 3: Configure Frontend

Create/update .env in your app root:

# Required for Apify integration
VITE_APIFY_API_TOKEN=apify_api_xxxxxxxxxxxx
VITE_APIFY_ACTOR_ID=your-username/gsoc-crawler
VITE_APIFY_DATASET_ID=xxxxxxxxxxxxxxxxxxxx

Step 4: Verify Integration

# Run the frontend
npm run dev

# You should see:
# ✅ "Live Data" badge in the UI
# ✅ Last updated timestamp
# ✅ Organizations loaded from Apify
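
For reference, loading the organizations from the dataset could look like the sketch below. It calls the Apify API's dataset items endpoint with the token in an `Authorization` header. `ApifyConfig`, `datasetItemsUrl`, and `fetchOrganizations` are hypothetical names, and how the actual frontend reads the data is not shown in this README.

```typescript
interface ApifyConfig {
  token: string;
  datasetId: string;
}

// Build the URL of the dataset items endpoint for a given dataset ID.
function datasetItemsUrl(datasetId: string): string {
  const url = new URL(
    `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
  );
  url.searchParams.set("format", "json");
  // clean=true skips empty items and hidden fields.
  url.searchParams.set("clean", "true");
  return url.toString();
}

// Fetch all items from the dataset; throws on HTTP errors or unexpected payloads.
async function fetchOrganizations(config: ApifyConfig): Promise<unknown[]> {
  const response = await fetch(datasetItemsUrl(config.datasetId), {
    headers: { Authorization: `Bearer ${config.token}` },
  });
  if (!response.ok) {
    throw new Error(`Apify API responded with status ${response.status}`);
  }
  const items: unknown = await response.json();
  if (!Array.isArray(items)) {
    throw new Error("Expected a JSON array of dataset items");
  }
  return items;
}
```

In a Vite app the config would come from `import.meta.env.VITE_APIFY_API_TOKEN` and `import.meta.env.VITE_APIFY_DATASET_ID`. Vite embeds `VITE_`-prefixed variables in the client bundle, so a token configured this way is visible to anyone who loads the page.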

📁 Project Structure

gsoc/
├── .actor/
│   ├── actor.json           # Actor metadata & configuration
│   ├── dataset_schema.json  # Output data structure definition
│   ├── input_schema.json    # Input validation schema
│   └── Dockerfile           # Playwright container config
├── src/
│   └── main.ts              # Main PlaywrightCrawler logic
├── storage/                 # Local development storage
│   ├── datasets/default/
│   ├── key_value_stores/default/
│   └── request_queues/default/
├── package.json
├── tsconfig.json
└── README.md

🖥️ Console Output Preview

When the crawler runs, you'll see formatted output like this:

╔════════════════════════════════════════════════════════════════╗
║                    🎓 GSoC CRAWLER RESULTS                     ║
╠════════════════════════════════════════════════════════════════╣
║  Total Organizations: 248                                      ║
║  Years Crawled: 2024, 2023, 2022                              ║
║  Unique Technologies: 156                                      ║
╚════════════════════════════════════════════════════════════════╝

📊 DIFFICULTY BREAKDOWN
┌────────────────────┬───────┬────────────┐
│ Level              │ Count │ Percentage │
├────────────────────┼───────┼────────────┤
│ 🟢 Beginner        │    45 │      18.1% │
│ 🟡 Intermediate    │   142 │      57.3% │
│ 🔴 Advanced        │    61 │      24.6% │
└────────────────────┴───────┴────────────┘

🔧 TOP 10 TECHNOLOGIES
 1. Python ............... 187 orgs (75.4%)
 2. JavaScript ........... 134 orgs (54.0%)
 3. C++ .................. 89 orgs (35.9%)
 4. Java ................. 76 orgs (30.6%)
 5. TypeScript ........... 67 orgs (27.0%)
 ...

✅ Crawling complete! Data saved to Apify Dataset.
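
The difficulty breakdown in this preview is a count per level with its share of the total. A sketch of that arithmetic, using a hypothetical `difficultyBreakdown` helper rather than the Actor's own reporting code:

```typescript
type Difficulty = "Beginner-Friendly" | "Intermediate" | "Advanced";

interface BreakdownRow {
  level: Difficulty;
  count: number;
  percentage: string;
}

// Count organizations per difficulty level and format each share to one decimal.
function difficultyBreakdown(orgs: { difficulty: Difficulty }[]): BreakdownRow[] {
  const levels: Difficulty[] = ["Beginner-Friendly", "Intermediate", "Advanced"];
  return levels.map((level) => {
    const count = orgs.filter((org) => org.difficulty === level).length;
    // Avoid dividing by zero when the dataset is empty.
    const share = orgs.length === 0 ? 0 : (count / orgs.length) * 100;
    return { level, count, percentage: `${share.toFixed(1)}%` };
  });
}
```

With the numbers from the preview, 45 of 248 organizations gives 18.1%, 142 of 248 gives 57.3%, and 61 of 248 gives 24.6%.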

🛠️ Troubleshooting

Common Issues

| Issue | Solution |
| --- | --- |
| 0 organizations found | GSoC pages are rendered with JavaScript; the Actor uses PlaywrightCrawler to handle this |
| Timeout errors | Check network connectivity and retry with fewer years. Note that `maxRequestsPerCrawl` caps the number of requests and does not extend timeouts |
| Memory issues | Reduce the number of years in input |
| Rate limiting | Actor automatically handles retries with exponential backoff |
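
For readers unfamiliar with the term, exponential backoff means doubling the wait between retries up to a cap. The sketch below is a generic illustration, not the Actor's or Crawlee's retry code; `backoffDelayMs`, `withRetries`, and their default values are hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `task`, retrying on failure with growing delays; rethrows the last error.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No need to wait after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```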

Debug Mode

Enable detailed logging by setting environment variable:

DEBUG=1 npm start

📚 Resources

Apify Documentation

Apify SDK - Actor development toolkit
Crawlee - Web scraping library
PlaywrightCrawler - Browser automation
Input Schema - Configuration validation

📄 License

MIT License - Feel free to use, modify, and distribute.

Built with ❤️ for GSoC Students

Getting started

For complete information see this article. To run the Actor use the following command:

apify run

Deploy to Apify

Connect Git repository to Apify

If you've created a Git repository for the project, you can easily connect to Apify:

Go to Actor creation page
Click on Link Git Repository button

Push project on your local machine to Apify

You can also deploy the project from your local machine to Apify without a Git repository.

Log in to Apify. You will need to provide your Apify API Token to complete this action.
```
apify login
```
Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.
```
apify push
```