# edX Course Scraper (`glassventures/edx-course-scraper`) Actor

Scrape courses from edX. Extract title, provider, price, instructors, subjects, level, and more. Export to JSON, CSV, Excel.

- **URL**: https://apify.com/glassventures/edx-course-scraper.md
- **Developed by:** [Glass Ventures](https://apify.com/glassventures) (community)
- **Categories:** Education, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## edX Course Scraper

Scrape course data from edX, the leading online learning platform. Extract titles, providers, prices, instructors, subjects, difficulty levels, and more.

### What does edX Course Scraper do?

edX Course Scraper extracts structured data from edX's online course catalog. It uses edX's public catalog APIs to efficiently gather comprehensive course information without needing a browser.

The Actor supports searching by keywords, filtering by subjects, and scraping specific course URLs. It handles pagination and deduplication automatically, so you can build datasets of thousands of courses without extra post-processing.
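
The Actor deduplicates within a run, but its internal logic is not documented here. If you merge datasets from several runs yourself, a minimal sketch of the same idea could look like the following. It assumes the `url` field identifies a course; `dedupe_courses` is a hypothetical helper, not part of the Actor or the Apify client.

```python
from typing import Iterable


def dedupe_courses(items: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each course, keyed by its normalized URL."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        # Assumption: the course URL is a stable identifier.
        # Case and a trailing slash are ignored when comparing.
        key = str(item.get("url", "")).strip().rstrip("/").lower()
        if not key or key in seen:
            # Items without a URL cannot be compared and are skipped.
            continue
        seen.add(key)
        unique.append(item)
    return unique
```

Pass it the concatenated `items` lists from several runs to get one list with each course appearing once.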

Whether you're researching the online education market, comparing course offerings across universities, or building a course recommendation system, this Actor provides the data you need in a clean, structured format.

### Use Cases

- **Market researchers** -- analyze the online education landscape, track course offerings and pricing trends across universities
- **Data analysts** -- build datasets of courses for analysis, compare subjects, pricing, and enrollment across providers
- **EdTech companies** -- monitor competitor course offerings, identify gaps in the market
- **Developers** -- integrate edX course data into apps, build course comparison tools or recommendation engines

### Features

- Search courses by keyword or subject area
- Scrape specific edX course URLs directly
- Extract rich metadata: title, provider, price, instructors, level, duration, effort
- Automatic pagination through large result sets
- Deduplication of courses across multiple searches
- Proxy support with automatic rotation
- Exports to JSON, CSV, Excel, or connect via API

### How much will it cost?

| Results | Estimated Cost |
|---------|---------------|
| 100     | ~$0.10        |
| 1,000   | ~$0.50        |
| 10,000  | ~$3.00        |

| Cost Component | Per 1,000 Results |
|----------------|-------------------|
| Platform compute | ~$0.25 |
| Proxy (datacenter) | ~$0.25 |
| **Total** | **~$0.50** |

edX has a public API, so scraping is very efficient and inexpensive. Datacenter proxies work well.
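
For budgeting, the per-1,000-results breakdown can be turned into a rough linear estimate. This is only a sketch: the rates are the approximate figures from the table above, the function name is hypothetical, and the first table shows that real costs do not scale strictly linearly, so treat the result as an order-of-magnitude figure.

```python
# Approximate USD rates per 1,000 results, copied from the cost table.
COMPUTE_PER_1000 = 0.25
PROXY_PER_1000 = 0.25


def estimate_cost_usd(results: int) -> float:
    """Linear estimate: results / 1000 * (compute + datacenter proxy)."""
    if results < 0:
        raise ValueError("results must be non-negative")
    return round(results / 1000 * (COMPUTE_PER_1000 + PROXY_PER_1000), 2)


print(estimate_cost_usd(1000))  # 0.5
```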

### How to use

1. Go to the edX Course Scraper page on Apify Store
2. Click "Start" or "Try for free"
3. Enter search terms (e.g., "machine learning") or edX course URLs
4. Set the maximum number of courses to scrape
5. Click "Start" and wait for the results

### Input parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| startUrls | array | edX course URLs to scrape | - |
| searchTerms | array | Search queries to find courses | - |
| subjects | array | Filter by subject area | - |
| maxItems | integer | Max courses to return (0 or empty for unlimited) | 100 |
| maxConcurrency | integer | Parallel requests (1-100) | 10 |
| debugMode | boolean | Verbose logging for troubleshooting | false |
| extendOutputFunction | string | JavaScript function to customize each output item | - |
| proxyConfig | object | Proxy settings | Apify Proxy |
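
When building the input in code, it can help to check values against the bounds in the input schema before starting a run. The sketch below is one way to do that; `build_input` is a hypothetical helper, and the limits (`maxItems` at least 0, `maxConcurrency` between 1 and 100, `startUrls` as objects with a `url` key) come from the Actor input schema in this document.

```python
from typing import Optional, Sequence


def build_input(
    search_terms: Optional[Sequence[str]] = None,
    start_urls: Optional[Sequence[str]] = None,
    subjects: Optional[Sequence[str]] = None,
    max_items: int = 100,
    max_concurrency: int = 10,
) -> dict:
    """Assemble an Actor input dict, validating the numeric bounds."""
    if max_items < 0:
        raise ValueError("maxItems must be >= 0 (0 means unlimited)")
    if not 1 <= max_concurrency <= 100:
        raise ValueError("maxConcurrency must be between 1 and 100")

    run_input: dict = {
        "maxItems": max_items,
        "maxConcurrency": max_concurrency,
        "proxyConfig": {"useApifyProxy": True},
    }
    # Optional arrays are only included when provided.
    if search_terms:
        run_input["searchTerms"] = list(search_terms)
    if start_urls:
        # The schema expects objects with a "url" key, not bare strings.
        run_input["startUrls"] = [{"url": u} for u in start_urls]
    if subjects:
        run_input["subjects"] = list(subjects)
    return run_input
```

The returned dict can be passed as `run_input` to the Python client's `call()` method.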

### Output

The Actor produces a dataset in which each item has the following fields:

```json
{
    "url": "https://www.edx.org/learn/machine-learning/stanford-university-machine-learning",
    "title": "Machine Learning",
    "provider": "Stanford University",
    "shortDescription": "Learn about the most effective machine learning techniques...",
    "fullDescription": "This course provides a broad introduction to machine learning...",
    "subjects": ["Computer Science", "Data Analysis & Statistics"],
    "level": "Intermediate",
    "language": "en-us",
    "startDate": "2024-01-15T00:00:00Z",
    "endDate": null,
    "enrollmentCount": 4500000,
    "price": 79,
    "isFree": true,
    "instructors": ["Andrew Ng"],
    "imageUrl": "https://prod-discovery.edx-cdn.org/media/course/image/...",
    "duration": "11 weeks",
    "effort": "5-7 hours per week",
    "scrapedAt": "2024-01-15T10:30:00.000Z"
}
````

| Field | Type | Description |
|-------|------|-------------|
| url | string | Course page URL |
| title | string | Course title |
| provider | string | University or institution |
| shortDescription | string | Brief course description |
| fullDescription | string | Detailed course description |
| subjects | array | Subject areas |
| level | string | Introductory, Intermediate, or Advanced |
| language | string | Course language code |
| startDate | string | Course start date (ISO 8601) |
| endDate | string or null | Course end date (ISO 8601); may be `null`, as in the example above |
| enrollmentCount | number | Number of enrolled students |
| price | number | Price in USD for verified certificate |
| isFree | boolean | Whether audit track is free |
| instructors | array | List of instructor names |
| imageUrl | string | Course thumbnail image |
| duration | string | Course duration (e.g., "6 weeks") |
| effort | string | Weekly time commitment |
| scrapedAt | string | ISO 8601 scrape timestamp |
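
Downstream code usually wants typed values rather than raw strings. The following is a minimal sketch of mapping a dataset item onto a typed record; the `Course` class and `course_from_item` helper are hypothetical names, only a subset of fields is shown, and any field other than `url` and `title` is treated as possibly missing.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; return None for null or empty values."""
    if not value:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Course:
    url: str
    title: str
    provider: Optional[str]
    price: Optional[float]
    is_free: Optional[bool]
    start_date: Optional[datetime]
    end_date: Optional[datetime]
    instructors: list[str]


def course_from_item(item: dict) -> Course:
    """Convert one dataset item (a dict) into a Course record."""
    return Course(
        url=item["url"],
        title=item["title"],
        provider=item.get("provider"),
        price=item.get("price"),
        is_free=item.get("isFree"),
        start_date=parse_iso(item.get("startDate")),
        end_date=parse_iso(item.get("endDate")),
        instructors=list(item.get("instructors") or []),
    )
```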

### Integrations

Connect edX Course Scraper with other tools:

- **Apify API** -- REST API for programmatic access
- **Webhooks** -- get notified when a run finishes
- **Zapier / Make** -- connect to 5,000+ apps
- **Google Sheets** -- export directly to spreadsheets

#### API Example (Node.js)

```javascript
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('glassventures/edx-course-scraper').call({
    searchTerms: ['machine learning', 'python programming'],
    maxItems: 100,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
```

#### API Example (Python)

```python
from apify_client import ApifyClient
client = ApifyClient('YOUR_TOKEN')
run = client.actor('glassventures/edx-course-scraper').call(run_input={
    'searchTerms': ['machine learning', 'python programming'],
    'maxItems': 100,
})
items = client.dataset(run['defaultDatasetId']).list_items().items
```

#### API Example (cURL)

```bash
curl "https://api.apify.com/v2/acts/glassventures~edx-course-scraper/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"searchTerms": ["machine learning"], "maxItems": 100}'
```

### Tips and tricks

- Start with a small `maxItems` (10-20) to test before running large scrapes
- edX's API is public and fast -- datacenter proxies work well, no need for residential
- Use `searchTerms` for broad discovery and `startUrls` for specific courses
- The Actor tries the discovery API first for richer data (instructors, price, enrollment count)

### FAQ

**Q: Does this Actor require login credentials?**
A: No. edX has public catalog APIs that don't require authentication.

**Q: How fast is the scraping?**
A: Approximately 50-200 courses per minute depending on API response times and concurrency settings.
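
As a quick planning aid, the stated throughput range translates into a best-case and worst-case run time. This is simple arithmetic on the 50-200 courses per minute figure above, not a measured benchmark; the function name is hypothetical.

```python
def estimate_run_minutes(
    courses: int, slow_rate: float = 50, fast_rate: float = 200
) -> tuple[float, float]:
    """Return (best_case, worst_case) minutes for a given number of courses."""
    if courses < 0:
        raise ValueError("courses must be non-negative")
    return courses / fast_rate, courses / slow_rate


print(estimate_run_minutes(1000))  # (5.0, 20.0)
```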

**Q: What should I do if I get blocked?**
A: Switch to residential proxies in the Proxy Configuration settings. However, edX's API rarely blocks requests.

**Q: Does this scrape course content/videos?**
A: No. This Actor only scrapes course metadata (title, description, instructors, etc.), not the actual course content.

### Is it legal to scrape edX?

Web scraping of publicly available data is generally considered legal, based on precedents such as hiQ Labs v. LinkedIn. This Actor only accesses publicly available API endpoints that do not require authentication. Always review and respect the target site's Terms of Service and robots.txt. For more information, see [Apify's blog on web scraping legality](https://blog.apify.com/is-web-scraping-legal/).

### Related Actors

- [Coursera Course Scraper](https://apify.com/store) -- Scrape courses from Coursera
- [Udemy Course Scraper](https://apify.com/store) -- Scrape courses from Udemy

### Limitations

- Course enrollment counts may not be available for all courses
- Some course details (full description, instructors) depend on the discovery API being accessible
- The Actor extracts metadata only, not course content or video materials
- Pricing information reflects the verified certificate track; audit is typically free
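
Because some fields may be absent, it can be worth checking how complete a dataset is before analysis. The sketch below counts items where a field is missing or empty; `missing_field_counts` is a hypothetical helper, and treating `None`, an empty string, and an empty list as "missing" is an assumption about how the Actor represents unavailable data.

```python
from typing import Iterable, Sequence


def missing_field_counts(items: Iterable[dict], fields: Sequence[str]) -> dict[str, int]:
    """Count, per field, how many items lack a usable value."""
    counts = {field: 0 for field in fields}
    for item in items:
        for field in fields:
            value = item.get(field)
            # Assumption: None, "" and [] all mean the value was unavailable.
            if value is None or value == "" or value == []:
                counts[field] += 1
    return counts
```

For example, `missing_field_counts(items, ["enrollmentCount", "instructors", "fullDescription"])` shows how many courses lack the fields named in the limitations above.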

### Changelog

- **v0.1** (2026-04-23) -- Initial release

# Actor input Schema

## `startUrls` (type: `array`):

List of edX course URLs to scrape directly (e.g., https://www.edx.org/learn/machine-learning/...).

## `searchTerms` (type: `array`):

Search queries to find courses on edX. The actor will search the edX catalog API for each term.

## `subjects` (type: `array`):

Filter courses by subject area (e.g., Computer Science, Data Science, Business).

## `maxItems` (type: `integer`):

Maximum number of courses to scrape. Use 0 or leave empty for unlimited.

## `maxConcurrency` (type: `integer`):

Maximum number of requests processed in parallel.

## `debugMode` (type: `boolean`):

Enables verbose logging for troubleshooting.

## `extendOutputFunction` (type: `string`):

A JavaScript function to customize each output item. Receives { data }.

## `proxyConfig` (type: `object`):

Select proxies to be used. Datacenter proxies work well for edX.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.edx.org/learn/machine-learning"
    }
  ],
  "searchTerms": [
    "machine learning",
    "python programming"
  ],
  "maxItems": 100,
  "maxConcurrency": 10,
  "debugMode": false,
  "extendOutputFunction": "async ({ data }) => {\n    return data;\n}",
  "proxyConfig": {
    "useApifyProxy": true
  }
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.edx.org/learn/machine-learning"
        }
    ],
    "searchTerms": [
        "machine learning",
        "python programming"
    ],
    // Passed as a string: the input schema types this field as `string`
    "extendOutputFunction": `async ({ data }) => {
    return data;
}`,
    "proxyConfig": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("glassventures/edx-course-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://www.edx.org/learn/machine-learning" }],
    "searchTerms": [
        "machine learning",
        "python programming",
    ],
    "extendOutputFunction": """async ({ data }) => {
    return data;
}""",
    "proxyConfig": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("glassventures/edx-course-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.edx.org/learn/machine-learning"
    }
  ],
  "searchTerms": [
    "machine learning",
    "python programming"
  ],
  "extendOutputFunction": "async ({ data }) => {\\n    return data;\\n}",
  "proxyConfig": {
    "useApifyProxy": true
  }
}' |
apify call glassventures/edx-course-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=glassventures/edx-course-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "edX Course Scraper",
        "description": "Scrape courses from edX. Extract title, provider, price, instructors, subjects, level, and more. Export to JSON, CSV, Excel.",
        "version": "0.1",
        "x-build-id": "YwLISKfR7n4dmFxTz"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/glassventures~edx-course-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-glassventures-edx-course-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/glassventures~edx-course-scraper/runs": {
            "post": {
                "operationId": "runs-sync-glassventures-edx-course-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/glassventures~edx-course-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-glassventures-edx-course-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "List of edX course URLs to scrape directly (e.g., https://www.edx.org/learn/machine-learning/...).",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "searchTerms": {
                        "title": "Search Terms",
                        "type": "array",
                        "description": "Search queries to find courses on edX. The actor will search the edX catalog API for each term.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "subjects": {
                        "title": "Subjects",
                        "type": "array",
                        "description": "Filter courses by subject area (e.g., Computer Science, Data Science, Business).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of courses to scrape. Use 0 or leave empty for unlimited.",
                        "default": 100
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum number of requests processed in parallel.",
                        "default": 10
                    },
                    "debugMode": {
                        "title": "Debug Mode",
                        "type": "boolean",
                        "description": "Enables verbose logging for troubleshooting.",
                        "default": false
                    },
                    "extendOutputFunction": {
                        "title": "Extend Output Function",
                        "type": "string",
                        "description": "A JavaScript function to customize each output item. Receives { data }."
                    },
                    "proxyConfig": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Select proxies to be used. Datacenter proxies work well for edX."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
