Economist Category Scraper
Deprecated

mtrunkat/economist-category-scraper

Developed by Marek Trunkát
Maintained by Community

Example implementation of an economist.com scraper built using the apify/web-scraper actor. It crawls the latest updates from a given economist.com category.


Monthly users: 2
Last modified: 2 years ago
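
Even though the actor is deprecated, it still illustrates the usual way such an actor is invoked: you pass a category and read the scraped articles from the run's default dataset. A minimal sketch using the apify-client NPM package (the client version and the token placeholder are assumptions, not part of this actor's code):

const { ApifyClient } = require('apify-client'); // assumes apify-client v2+

// Placeholder token; use your real API token from the Apify console.
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

(async () => {
    // Start the actor and wait for the run (which metamorphs into
    // apify/web-scraper) to finish.
    const run = await client.actor('mtrunkat/economist-category-scraper').call({
        category: 'briefing', // any economist.com category slug
    });

    // Scraped articles land in the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.dir(items);
})();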

Dockerfile

# Dockerfile contains instructions on how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/

# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
#  apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
#  apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
#  apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
# This actor only prepares input and metamorphs into apify/web-scraper,
# so the basic image without Chrome is sufficient.
FROM apify/actor-node-basic

# Second, copy just package.json since it should be the only file
# that affects the NPM install in the next step.
COPY package.json ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
    && npm install --only=prod --no-optional \
    && echo "Installed NPM packages:" \
    && npm list \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after the NPM install, quick builds will be really fast
# for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
# CMD npm start

INPUT_SCHEMA.json

{
    "title": "My input schema",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "category": {
            "title": "Category",
            "type": "string",
            "description": "Economist.com category to be scraped",
            "editor": "textfield",
            "prefill": "briefing"
        }
    }
}
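
Since category is the schema's only property, a complete run input is just a one-field JSON object; the value below simply mirrors the schema's prefill:

{
    "category": "briefing"
}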

main.js

// This is the main Node.js source code file of your actor.
// It is referenced from the "scripts" section of the package.json file.

const Apify = require('apify');

Apify.main(async () => {
    // Get the input of the actor. Input fields can be modified in the INPUT_SCHEMA.json file.
    // For more information, see https://apify.com/docs/actor/input-schema
    const input = await Apify.getInput();
    console.log('Input:');
    console.dir(input);

    // Here you can prepare your input for the actor apify/web-scraper.
    // This input is based on the actor task you used as the starting point.
    const metamorphInput = {
        "startUrls": [
            {
                "url": `https://www.economist.com/${input.category}/?page=1`,
                "method": "GET"
            }
        ],
        "useRequestQueue": true,
        // In a pseudo-URL, the bracketed part is a regular expression,
        // so this matches the category listing at any page number.
        "pseudoUrls": [
            {
                "purl": `https://www.economist.com/${input.category}/?page=[\\d+]`,
                "method": "GET"
            }
        ],
        "linkSelector": "a",
        "pageFunction": async function pageFunction(context) {
            // request is an instance of Apify.Request (https://sdk.apify.com/docs/api/request)
            // $ is an instance of jQuery (http://jquery.com/)
            const request = context.request;
            const $ = context.jQuery;
            const pageNum = parseInt(request.url.split('?page=').pop());

            context.log.info(`Scraping ${context.request.url}`);

            // Extract all articles.
            const articles = [];
            $('article').each((index, articleEl) => {
                const $articleEl = $(articleEl);

                // H3 contains 2 child elements where the first one is the topic
                // and the second one is the article title.
                const $h3El = $articleEl.find('h3');

                // Extract additional info and push it to the results array.
                articles.push({
                    pageNum,
                    topic: $h3El.children().first().text(),
                    title: $h3El.children().last().text(),
                    url: $articleEl.find('a')[0].href,
                    teaser: $articleEl.find('.teaser__text').text(),
                });
            });

            // Return the results.
            return articles;
        },
        "proxyConfiguration": {
            "useApifyProxy": true
        },
        "debugLog": false,
        "browserLog": false,
        "injectJQuery": true,
        "injectUnderscore": false,
        "downloadMedia": false,
        "downloadCss": false,
        "ignoreSslErrors": false
    };

    // Now let's metamorph into the actor apify/web-scraper using the created input.
    await Apify.metamorph('apify/web-scraper', metamorphInput);
});
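
The pseudo-URL in the input above is what keeps the crawl on the category's pagination: text outside the brackets is matched literally, while the bracketed \d+ behaves as a regular expression. A rough equivalent of that check in plain Node.js, with the category hard-coded to "briefing" for illustration:

// The literal prefix/suffix of the pseudo-URL are matched verbatim;
// only the bracketed part behaves as a regular expression.
const purlRegex = /^https:\/\/www\.economist\.com\/briefing\/\?page=\d+$/;

console.log(purlRegex.test('https://www.economist.com/briefing/?page=2'));  // true  -> enqueued
console.log(purlRegex.test('https://www.economist.com/briefing/?page=15')); // true  -> enqueued
console.log(purlRegex.test('https://www.economist.com/leaders/?page=2'));   // false -> ignored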

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.5"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}

Pricing

Pricing model: Pay per usage

This Actor is billed per platform usage: the Actor itself is free, and you pay only for the Apify platform resources its runs consume.