Economist Category Scraper

This Actor is deprecated and is unavailable because the developer has decided to discontinue it.

mtrunkat/economist-category-scraper

Example implementation of an economist.com scraper built using the apify/web-scraper actor. It crawls the latest updates from a given Economist category.
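For illustration, the actor could be invoked from another actor or script using the same Apify SDK it depends on. This is only a sketch: the token comes from the standard APIFY_TOKEN environment variable, and reading the dataset this way assumes the script itself runs on the Apify platform, so that Apify.openDataset resolves the cloud dataset of the finished run.

const Apify = require('apify');

Apify.main(async () => {
    // Call the scraper with a category and wait for the run to finish.
    const run = await Apify.call('mtrunkat/economist-category-scraper', {
        category: 'briefing',
    });

    // After the metamorph, the web-scraper run stores one item per
    // article in the run's default dataset.
    const dataset = await Apify.openDataset(run.defaultDatasetId);
    const { items } = await dataset.getData();
    console.log(`Scraped ${items.length} articles.`);
});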

Dockerfile

# Dockerfile contains instructions on how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/

# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
#  apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
#  apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
#  apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
FROM apify/actor-node-basic

# Second, copy just package.json since it should be the only file
# that affects the NPM install in the next step.
COPY package.json ./

# Install NPM packages, skipping optional and development dependencies
# to keep the image small. Avoid logging too much and print the
# dependency tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after the NPM install, a quick rebuild will be really
# fast for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the source code using the command specified in the
# "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
# CMD npm start
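To verify that the image builds, the standard Docker commands below should work; the image tag is just an illustrative name, and note that a container started outside the Apify platform does not receive the actor input automatically.

docker build -t economist-category-scraper .
docker run economist-category-scraper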

INPUT_SCHEMA.json

{
    "title": "My input schema",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "category": {
            "title": "Category",
            "type": "string",
            "description": "Economist.com category to be scraped",
            "editor": "textfield",
            "prefill": "briefing"
        }
    }
}
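An input accepted by this schema is a plain JSON object with a single category field. The "briefing" value below matches the schema's prefill; any other Economist section slug should work the same way:

{
    "category": "briefing"
}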

main.js

// This is the main Node.js source code file of your actor.
// It is referenced from the "scripts" section of the package.json file.

const Apify = require('apify');

Apify.main(async () => {
    // Get the input of the actor. Input fields can be modified in the INPUT_SCHEMA.json file.
    // For more information, see https://apify.com/docs/actor/input-schema
    const input = await Apify.getInput();
    console.log('Input:');
    console.dir(input);

    // Here you can prepare the input for the apify/web-scraper actor.
    // This input is based on an actor task used as the starting point.
    const metamorphInput = {
        "startUrls": [
            {
                "url": `https://www.economist.com/${input.category}/?page=1`,
                "method": "GET"
            }
        ],
        "useRequestQueue": true,
        "pseudoUrls": [
            {
                "purl": `https://www.economist.com/${input.category}/?page=[\\d+]`,
                "method": "GET"
            }
        ],
        "linkSelector": "a",
        "pageFunction": async function pageFunction(context) {
            // request is an instance of Apify.Request (https://sdk.apify.com/docs/api/request)
            // $ is an instance of jQuery (http://jquery.com/)
            const request = context.request;
            const $ = context.jQuery;
            const pageNum = parseInt(request.url.split('?page=').pop(), 10);

            context.log.info(`Scraping ${context.request.url}`);

            // Extract all articles.
            const articles = [];
            $('article').each((index, articleEl) => {
                const $articleEl = $(articleEl);

                // The H3 element contains two children: the first is the topic
                // and the second is the article title.
                const $h3El = $articleEl.find('h3');

                // Extract additional info and push it to the results array.
                articles.push({
                    pageNum,
                    topic: $h3El.children().first().text(),
                    title: $h3El.children().last().text(),
                    url: $articleEl.find('a')[0].href,
                    teaser: $articleEl.find('.teaser__text').text(),
                });
            });

            // Return the results.
            return articles;
        },
        "proxyConfiguration": {
            "useApifyProxy": true
        },
        "debugLog": false,
        "browserLog": false,
        "injectJQuery": true,
        "injectUnderscore": false,
        "downloadMedia": false,
        "downloadCss": false,
        "ignoreSslErrors": false
    };

    // Now let's metamorph into the apify/web-scraper actor using the input created above.
    await Apify.metamorph('apify/web-scraper', metamorphInput);
});
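Since the pageFunction only runs inside apify/web-scraper's browser, its jQuery logic can be hard to iterate on. Below is a hypothetical local test harness using cheerio, a jQuery-like HTML parser for Node.js, with invented sample markup; note one deliberate difference: cheerio reads attributes via .attr('href'), whereas the browser-side pageFunction above can rely on the DOM's .href property.

const cheerio = require('cheerio');

// Invented sample markup mimicking one article listing on economist.com.
const html = `
  <article>
    <h3><span>Topic</span><span>Title of the piece</span></h3>
    <a href="https://www.economist.com/briefing/example"></a>
    <p class="teaser__text">A short teaser.</p>
  </article>`;

const $ = cheerio.load(html);
const articles = [];
$('article').each((index, articleEl) => {
    const $articleEl = $(articleEl);

    // Same extraction logic as the pageFunction above.
    const $h3El = $articleEl.find('h3');
    articles.push({
        topic: $h3El.children().first().text(),
        title: $h3El.children().last().text(),
        url: $articleEl.find('a').attr('href'),
        teaser: $articleEl.find('.teaser__text').text(),
    });
});

console.log(articles);
// => [{ topic: 'Topic', title: 'Title of the piece',
//       url: 'https://www.economist.com/briefing/example',
//       teaser: 'A short teaser.' }]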

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.5"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}