Gsoc Finder
Under maintenance

Pricing

from $0.01 / 1,000 results

Developer

Sumeet Gond

Maintained by Community

🎓 GSoC Organization Crawler

An Apify Actor that crawls Google Summer of Code organizations and extracts detailed information to help students find the perfect organization to contribute to.

Features • How It Works • Quick Start • Output • Integration


✨ Features

| Feature | Description |
| --- | --- |
| 🔄 Multi-Year Crawling | Scrapes organizations from multiple GSoC years (2022-2024+) |
| 🔧 Technology Detection | Identifies programming languages, frameworks, and tools |
| 📊 Difficulty Assessment | Auto-determines difficulty (Beginner-Friendly, Intermediate, Advanced) |
| 📈 Acceptance Estimation | Estimates acceptance rates based on historical data |
| 🏷️ Topic Categorization | Extracts and categorizes topics for each organization |
| 🔗 Data Merging | Combines data when organizations appear in multiple years |

🔄 How It Works

┌────────────────────────────────────────────────────────────────┐
│                        CRAWLER WORKFLOW                        │
└────────────────────────────────────────────────────────────────┘
┌──────────────┐         ┌──────────────┐         ┌──────────────┐
│   1. INPUT   │         │   2. CRAWL   │         │  3. EXTRACT  │
│              │         │              │         │              │
│   Years to   │────────▶│  Visit GSoC  │────────▶│  Parse org   │
│    scrape    │         │   archive    │         │   details    │
│ [2024,2023]  │         │    pages     │         │              │
└──────────────┘         └──────────────┘         └──────────────┘
                                                         │
┌──────────────┐         ┌──────────────┐                │
│   6. SAVE    │         │  5. ENRICH   │                │
│              │         │              │                ▼
│   Push to    │◀────────│     Add      │◀────────┌──────────────┐
│    Apify     │         │ difficulty & │         │   4. MERGE   │
│   Dataset    │         │  acceptance  │         │              │
└──────────────┘         └──────────────┘         │   Combine    │
                                                  │  multi-year  │
                                                  │     data     │
                                                  └──────────────┘

Step-by-Step Process

  1. 🚀 Initialize - Actor starts with input parameters (years, max requests)
  2. 🔍 Discover - Visits GSoC archive pages, waits for JavaScript to render
  3. 📄 Extract - Visits each organization page, extracts all details
  4. 🔗 Merge - Combines data for organizations appearing in multiple years
  5. 📊 Enrich - Determines difficulty level and acceptance rate
  6. 💾 Save - Pushes all organizations to Apify Dataset
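
The merge step (step 4) can be pictured as grouping records by organization and taking the union of their list fields. The following is a minimal sketch under stated assumptions, not the Actor's actual code: the names CrawledOrg, dedupe, and mergeOrganizations are hypothetical, the record shape is a subset of the fields in the Output Format section, and matching organizations by case-insensitive name is an assumption.

```typescript
// Hypothetical record shape: a subset of the documented output fields.
interface CrawledOrg {
  name: string;
  years: number[];
  technologies: string[];
  topics: string[];
}

// Remove duplicates while keeping first-seen order.
function dedupe<T>(items: T[]): T[] {
  return Array.from(new Set(items));
}

// Assumption: two records describe the same organization when their
// names match case-insensitively. The real Actor may use another key.
function mergeOrganizations(records: CrawledOrg[]): CrawledOrg[] {
  const byName = new Map<string, CrawledOrg>();
  for (const record of records) {
    const key = record.name.trim().toLowerCase();
    const existing = byName.get(key);
    if (existing === undefined) {
      byName.set(key, {
        name: record.name,
        years: dedupe(record.years),
        technologies: dedupe(record.technologies),
        topics: dedupe(record.topics),
      });
    } else {
      existing.years = dedupe([...existing.years, ...record.years]);
      existing.technologies = dedupe([...existing.technologies, ...record.technologies]);
      existing.topics = dedupe([...existing.topics, ...record.topics]);
    }
  }
  // Sort years newest first, as in the example output.
  return Array.from(byName.values()).map((org) => ({
    ...org,
    years: [...org.years].sort((a, b) => b - a),
  }));
}
```

The first record seen for an organization supplies its display name; later records only contribute years, technologies, and topics.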

🚀 Quick Start

Prerequisites

Node.js with npm (for the local commands) and the Apify CLI (for login and deployment).

Local Development

# 1. Navigate to the project
cd gsoc
# 2. Install dependencies
npm install
# 3. Run locally
npm start

Deploy to Apify Cloud

# 1. Login to Apify (first time only)
apify login
# 2. Push to Apify Console
apify push
# 3. Run from Apify Console with input JSON

📥 Input Configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| years | number[] | [2024, 2023, 2022] | GSoC years to scrape |
| maxRequestsPerCrawl | number | 500 | Maximum HTTP requests per run |

Example Input

{
  "years": [2024, 2023, 2022, 2021],
  "maxRequestsPerCrawl": 1000
}

💡 Tip: Start with fewer years (e.g., [2024]) for faster testing
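
To illustrate how the documented defaults could be applied when a field is missing, here is a small sketch. The names ActorInput, DEFAULT_INPUT, and resolveInput are hypothetical, and the validation rules (integer years, positive request limit) are assumptions rather than the Actor's confirmed behavior.

```typescript
// Input shape taken from the Input Configuration table.
interface ActorInput {
  years: number[];
  maxRequestsPerCrawl: number;
}

// Defaults as documented in the table above.
const DEFAULT_INPUT: ActorInput = {
  years: [2024, 2023, 2022],
  maxRequestsPerCrawl: 500,
};

// Assumed rules: keep only integer years, drop duplicates, and fall
// back to the defaults when a field is missing or invalid.
function resolveInput(raw?: Partial<ActorInput> | null): ActorInput {
  const rawYears = raw?.years ?? [];
  const years = Array.from(new Set(rawYears.filter((y) => Number.isInteger(y))));
  const max = raw?.maxRequestsPerCrawl;
  const maxIsValid = typeof max === "number" && Number.isInteger(max) && max > 0;
  return {
    years: years.length > 0 ? years : [...DEFAULT_INPUT.years],
    maxRequestsPerCrawl: maxIsValid ? (max as number) : DEFAULT_INPUT.maxRequestsPerCrawl,
  };
}
```

Calling resolveInput({ years: [2024] }) yields a single-year run with the default request limit, which matches the quick-test setup suggested in the tip.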


📤 Output Format

Each organization in the dataset includes:

{
  "name": "TensorFlow",
  "description": "An end-to-end open source machine learning platform...",
  "url": "https://summerofcode.withgoogle.com/archive/2024/organizations/tensorflow",
  "technologies": ["Python", "C++", "JavaScript", "TensorFlow", "Keras"],
  "topics": ["Machine Learning", "Deep Learning", "AI"],
  "difficulty": "Advanced",
  "acceptanceRate": "Low",
  "years": [2024, 2023, 2022, 2021, 2020],
  "projectTypes": ["Library Development", "Documentation", "Testing"],
  "category": "Machine Learning",
  "ideaListUrl": "https://github.com/tensorflow/tensorflow/wiki/gsoc",
  "logoUrl": "https://..."
}

Output Fields Explained

| Field | Type | Description |
| --- | --- | --- |
| name | string | Organization's display name |
| description | string | Full description from GSoC page |
| url | string | Direct link to organization's GSoC page |
| technologies | string[] | Programming languages, frameworks, tools |
| topics | string[] | Project categories and domains |
| difficulty | string | Auto-determined: Beginner-Friendly, Intermediate, Advanced |
| acceptanceRate | string | Estimated: High, Medium, Low |
| years | number[] | Years the organization participated in GSoC |
| projectTypes | string[] | Types of projects available |
| category | string | Primary category classification |
| ideaListUrl | string | Link to project ideas page |
| logoUrl | string | Organization's logo URL |
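
For consumers of the dataset, the fields above map naturally onto a TypeScript interface. The sketch below is an illustration only: GsocOrganization, OrgFilter, and findOrganizations are hypothetical names, and narrowing difficulty and acceptanceRate to string unions assumes the Actor emits only the values listed in the table.

```typescript
// Assumed to be the only values the Actor emits for these two fields.
type Difficulty = "Beginner-Friendly" | "Intermediate" | "Advanced";
type AcceptanceRate = "High" | "Medium" | "Low";

// One dataset item, following the Output Fields table.
interface GsocOrganization {
  name: string;
  description: string;
  url: string;
  technologies: string[];
  topics: string[];
  difficulty: Difficulty;
  acceptanceRate: AcceptanceRate;
  years: number[];
  projectTypes: string[];
  category: string;
  ideaListUrl: string;
  logoUrl: string;
}

// Hypothetical filter: both criteria are optional.
interface OrgFilter {
  difficulty?: Difficulty;
  technology?: string;
}

// Return organizations matching every criterion that is set.
// Technology names are compared case-insensitively.
function findOrganizations(orgs: GsocOrganization[], filter: OrgFilter): GsocOrganization[] {
  const wantedTech = filter.technology?.toLowerCase();
  return orgs.filter((org) => {
    const difficultyOk = filter.difficulty === undefined || org.difficulty === filter.difficulty;
    const techOk =
      wantedTech === undefined || org.technologies.some((t) => t.toLowerCase() === wantedTech);
    return difficultyOk && techOk;
  });
}
```

A student looking for an approachable Python project could call findOrganizations(items, { difficulty: "Beginner-Friendly", technology: "python" }).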

📊 Difficulty Classification

The crawler automatically determines difficulty based on technology complexity:

┌────────────────────────────────────────────────────────────────┐
│                       DIFFICULTY LEVELS                        │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  🟢 BEGINNER-FRIENDLY                                          │
│  ├── HTML, CSS, JavaScript                                     │
│  ├── Python basics                                             │
│  └── Documentation projects                                    │
│                                                                │
│  🟡 INTERMEDIATE                                               │
│  ├── React, Vue, Angular                                       │
│  ├── Node.js, Django, Flask                                    │
│  └── Database work (PostgreSQL, MongoDB)                       │
│                                                                │
│  🔴 ADVANCED                                                   │
│  ├── C++, Rust, Go (systems programming)                       │
│  ├── Kubernetes, Docker (infrastructure)                       │
│  ├── TensorFlow, PyTorch (ML frameworks)                       │
│  └── Compiler/kernel development                               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
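
One way to read the chart above is as a keyword lookup with precedence: any advanced technology makes the organization Advanced, otherwise any intermediate one makes it Intermediate. The sketch below encodes that reading. The precedence rule, the exact keyword sets, and the name classifyDifficulty are assumptions; the Actor's real heuristic may weigh technologies differently.

```typescript
type DifficultyLevel = "Beginner-Friendly" | "Intermediate" | "Advanced";

// Keywords copied from the chart, lowercased for matching.
const ADVANCED_TECH = new Set([
  "c++", "rust", "go", "kubernetes", "docker", "tensorflow", "pytorch",
]);
const INTERMEDIATE_TECH = new Set([
  "react", "vue", "angular", "node.js", "django", "flask", "postgresql", "mongodb",
]);

// Assumed rule: the hardest matching tier wins; with no match
// (or an empty list) the result is Beginner-Friendly.
function classifyDifficulty(technologies: string[]): DifficultyLevel {
  const normalized = technologies.map((t) => t.trim().toLowerCase());
  if (normalized.some((t) => ADVANCED_TECH.has(t))) {
    return "Advanced";
  }
  if (normalized.some((t) => INTERMEDIATE_TECH.has(t))) {
    return "Intermediate";
  }
  return "Beginner-Friendly";
}
```

Under this rule the TensorFlow example from the Output Format section is classified as Advanced, because its technology list contains C++ and TensorFlow.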

🔗 Integration Guide

Step 1: Deploy Actor

cd gsoc
apify login
apify push

Step 2: Run Actor & Get Credentials

  1. Go to Apify Console
  2. Find your actor → Click Start
  3. Configure input JSON and run
  4. After completion:
    • Actor ID: From the URL (e.g., username/gsoc-crawler)
    • Dataset ID: Storage tab → Dataset → Copy ID

Step 3: Configure Frontend

Create/update .env in your app root:

# Required for Apify integration
VITE_APIFY_API_TOKEN=apify_api_xxxxxxxxxxxx
VITE_APIFY_ACTOR_ID=your-username/gsoc-crawler
VITE_APIFY_DATASET_ID=xxxxxxxxxxxxxxxxxxxx
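
With these values set, a frontend can read the crawl results from the Apify API's dataset items endpoint. The sketch below only builds the request URL; buildDatasetItemsUrl and DatasetQuery are hypothetical names, not part of this project. Keep in mind that Vite embeds VITE_-prefixed variables in the client bundle, so a token stored this way is visible to anyone who loads the app. Sending it in an Authorization: Bearer header instead of the query string at least keeps it out of URLs and logs.

```typescript
const APIFY_API_BASE = "https://api.apify.com/v2";

// Optional paging and cleanup parameters of the dataset items endpoint.
interface DatasetQuery {
  limit?: number;
  offset?: number;
  clean?: boolean;
}

// Build the URL for listing dataset items as JSON.
// The token is deliberately not part of the URL.
function buildDatasetItemsUrl(datasetId: string, query: DatasetQuery = {}): string {
  const params: string[] = ["format=json"];
  if (query.clean === true) {
    params.push("clean=true");
  }
  if (query.limit !== undefined) {
    params.push("limit=" + String(query.limit));
  }
  if (query.offset !== undefined) {
    params.push("offset=" + String(query.offset));
  }
  const id = encodeURIComponent(datasetId);
  return APIFY_API_BASE + "/datasets/" + id + "/items?" + params.join("&");
}
```

The resulting URL can be passed to fetch together with an Authorization header that carries the API token, and the JSON response is an array of organization records.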

Step 4: Verify Integration

# Run the frontend
npm run dev
# You should see:
# ✅ "Live Data" badge in the UI
# ✅ Last updated timestamp
# ✅ Organizations loaded from Apify

πŸ“ Project Structure

gsoc/
β”œβ”€β”€ .actor/
β”‚ β”œβ”€β”€ actor.json # Actor metadata & configuration
β”‚ β”œβ”€β”€ dataset_schema.json # Output data structure definition
β”‚ β”œβ”€β”€ input_schema.json # Input validation schema
β”‚ └── Dockerfile # Playwright container config
β”œβ”€β”€ src/
β”‚ └── main.ts # Main PlaywrightCrawler logic
β”œβ”€β”€ storage/ # Local development storage
β”‚ β”œβ”€β”€ datasets/default/
β”‚ β”œβ”€β”€ key_value_stores/default/
β”‚ └── request_queues/default/
β”œβ”€β”€ package.json
β”œβ”€β”€ tsconfig.json
└── README.md

🖥️ Console Output Preview

When the crawler runs, you'll see formatted output like this:

╔════════════════════════════════════════════════════════════════╗
║                    🎓 GSoC CRAWLER RESULTS                     ║
╠════════════════════════════════════════════════════════════════╣
║  Total Organizations: 248                                      ║
║  Years Crawled: 2024, 2023, 2022                               ║
║  Unique Technologies: 156                                      ║
╚════════════════════════════════════════════════════════════════╝
📊 DIFFICULTY BREAKDOWN
┌────────────────────┬───────┬────────────┐
│ Level              │ Count │ Percentage │
├────────────────────┼───────┼────────────┤
│ 🟢 Beginner        │    45 │      18.1% │
│ 🟡 Intermediate    │   142 │      57.3% │
│ 🔴 Advanced        │    61 │      24.6% │
└────────────────────┴───────┴────────────┘
🔧 TOP 10 TECHNOLOGIES
 1. Python ............... 187 orgs (75.4%)
 2. JavaScript ........... 134 orgs (54.0%)
 3. C++ .................. 89 orgs (35.9%)
 4. Java ................. 76 orgs (30.6%)
 5. TypeScript ........... 67 orgs (27.0%)
 ...
✅ Crawling complete! Data saved to Apify Dataset.
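
The percentages in this preview are each count divided by the total number of organizations, rounded to one decimal place (for example, 45 / 248 ≈ 18.1%). A minimal sketch of that arithmetic follows; formatBreakdown is a hypothetical helper and its plain output format is not the Actor's actual table rendering.

```typescript
// Turn [label, count] pairs into "label: count (xx.x%)" lines.
// Percentages are relative to the sum of all counts.
function formatBreakdown(counts: Array<[string, number]>): string[] {
  const total = counts.reduce((sum, entry) => sum + entry[1], 0);
  return counts.map((entry) => {
    const label = entry[0];
    const count = entry[1];
    const percent = total === 0 ? 0 : (count / total) * 100;
    return label + ": " + String(count) + " (" + percent.toFixed(1) + "%)";
  });
}

// Figures taken from the difficulty breakdown in the preview above.
const previewLines = formatBreakdown([
  ["Beginner", 45],
  ["Intermediate", 142],
  ["Advanced", 61],
]);
console.log(previewLines.join("\n"));
```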

πŸ› οΈ Troubleshooting

Common Issues

| Issue | Solution |
| --- | --- |
| 0 organizations found | GSoC uses JavaScript rendering - we use PlaywrightCrawler to handle this |
| Timeout errors | Increase maxRequestsPerCrawl or check network connectivity |
| Memory issues | Reduce the number of years in input |
| Rate limiting | Actor automatically handles retries with exponential backoff |
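
Exponential backoff, as mentioned for rate limiting, means the wait between retries doubles after each failed attempt, up to a cap. The sketch below shows the general pattern only; the base delay, the cap, and the name backoffDelayMs are assumptions and do not describe the Actor's or Crawlee's actual retry schedule.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt,
// capped at maxMs. Negative or fractional attempts are clamped.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  const safeAttempt = Math.max(0, Math.floor(attempt));
  return Math.min(maxMs, baseMs * Math.pow(2, safeAttempt));
}

// With the assumed defaults, the first five delays double each time.
const delays = [0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt));
console.log(delays.join(", "));
```

Production retry logic often adds random jitter to these delays so that many clients do not retry at the same moment.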

Debug Mode

Enable detailed logging by setting environment variable:

DEBUG=1 npm start


📄 License

MIT License - Feel free to use, modify, and distribute.


Built with ❤️ for GSoC Students

Getting started

For complete information see this article. To run the Actor use the following command:

apify run

Deploy to Apify

Connect Git repository to Apify

If you've created a Git repository for the project, you can easily connect to Apify:

  1. Go to the Actor creation page
  2. Click the Link Git Repository button

Push project on your local machine to Apify

You can also deploy the project from your local machine to Apify without a Git repository.

  1. Log in to Apify. You will need to provide your Apify API Token to complete this action.

    apify login
  2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.

    apify push
