edX Online Course Scraper
tugkan/edx-scraper
- Readme
- API
- Input
- Source code
Actor - edx.org Scraper
edx.org Scraper is an Apify actor for extracting data of courses from edx.org. It allows you to search for keywords and pick a language. It is build on top of Apify SDK and you can run it both on Apify platform and locally.
- edx.org Scraper Input Parameters
- edx.org Scraper Input Example
- edx.org Scraper Course Output
- Important Notes for the Output
- Compute Unit Consumption
- During The Run
- edx.org Export
edx.org Scraper Input Parameters
The input of this scraper should be JSON containing the list of pages on edx that should be visited. Required fields are:
Field | Type | Description |
---|---|---|
search | String | (optional) |
language | Array | (optional) List of languages that edx provides. You can fetch all courses of a language with it |
startUrls | Array | (optional) List of edx URLs. You should provide only course detail URL or URLs from https://www.edx.org/course |
maxItems | Integer | (optional) Maximum number of items that output will contain |
proxyConfig | Object | Proxy configuration |
This solution requires the use of Proxy servers, either your own proxy servers or you can use Apify Proxy.
edx Scraper Input example
{
"language": "Spanish",
"maxItems": 7,
"proxyConfig": {
"useApifyProxy": true,
"apifyProxyGroups": [
"SHADER"
]
}
}
edx Course Output
The structure of each item in edx courses looks like this:
{
"courseTitle": "Comunicarnos sin daño para la reconciliación y la salud mental",
"subject": "Medicine",
"code": "course-v1:JaverianaX+CRSx+1T2020",
"url": "https://courses.edx.org/courses/course-v1:JaverianaX+CRSx+1T2020/course/."
}
Important Notes for the Output
When you search for a query which doesn't have any courses, edx just returns all the courses it got. That's why you should check the website for the best result.
Compute Unit Consumption
The actor optimized to run blazing fast and scrape many product as possible. Therefore, it forefronts all product detail requests. If actor doesn't block very often it'll scrape XXX products in XXX minutes with XXX compute units.
During the Run
During the run, the actor will output messages letting you know what is going on. Each message always contains a short label specifying which page from the provided list is currently specified. When items are loaded from the page, you should see a message about this event with a loaded item count and total item count for each page.
If you provide incorrect input to the actor, it will immediately stop with failure state and output an explanation of what is wrong.
edx Export
During the run, the actor stores results into a dataset. Each item is a separate item in the dataset.
You can manage the results in any languague (Python, PHP, Node JS/NPM). See the FAQ or our API reference to learn more about getting results from this edx actor.