
Economist Category Scraper

mtrunkat/economist-category-scraper

An example implementation of an economist.com scraper built using the apify/web-scraper actor. It crawls the latest articles from a given Economist category.

Author: Marek Trunkát

Dockerfile

# Dockerfile contains instructions on how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/

# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
#  apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
#  apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
#  apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
FROM apify/actor-node-basic

# Second, copy just package.json since it should be the only file
# that affects NPM install in the next step
COPY package.json ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version
 
# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, a quick build will be really fast
# for most source file changes.
COPY . ./

# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:  
# CMD npm start
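
For local testing, the image can be built and run with standard Docker commands. This is just a sketch (the my-actor tag is an illustrative name, and on the Apify platform the build happens automatically); since the base image's CMD runs npm start, docker run launches the actor directly:

docker build -t my-actor .
docker run my-actor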

INPUT_SCHEMA.json

{
    "title": "My input schema",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "category": {
            "title": "Category",
            "type": "string",
            "description": "Economist.com category to be scraped",
            "editor": "textfield",
            "prefill": "briefing"
        }
    }
}
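
Given this schema, the actor input is a JSON object with a single category field. The value briefing below is just the prefilled default; any valid economist.com category slug should work:

{
    "category": "briefing"
}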

README.md

# Economist category scraper

An example implementation of an economist.com scraper built using the
apify/web-scraper actor. It crawls the latest articles from a given
Economist category.

main.js

// This is the main Node.js source code file of your actor.
// It is referenced from the "scripts" section of the package.json file.

const Apify = require('apify');

Apify.main(async () => {
    // Get input of the actor. Input fields can be modified in INPUT_SCHEMA.json file.
    // For more information, see https://apify.com/docs/actor/input-schema
    const input = await Apify.getInput();
    console.log('Input:');
    console.dir(input);

    // Here you can prepare the input for the apify/web-scraper actor. This input
    // is based on the actor task you used as the starting point.
    const metamorphInput = {
        "startUrls": [
            {
                "url": `https://www.economist.com/${input.category}/?page=1`,
                "method": "GET"
            }
        ],
        "useRequestQueue": true,
        "pseudoUrls": [
            {
                "purl": `https://www.economist.com/${input.category}/?page=[\\d+]`,
                "method": "GET"
            }
        ],
        "linkSelector": "a",
        "pageFunction": async function pageFunction(context) {
            // request is an instance of Apify.Request (https://sdk.apify.com/docs/api/request)
            // $ is an instance of jQuery (http://jquery.com/)
            const request = context.request;
            const $ = context.jQuery;
            const pageNum = parseInt(request.url.split('?page=').pop(), 10);

            context.log.info(`Scraping ${request.url}`);
        
            // Extract all articles.
            const articles = [];
            $('article').each((index, articleEl) => {
                const $articleEl = $(articleEl);
        
                // The H3 element contains two child elements: the first is the
                // topic and the second is the article title.
                const $h3El = $articleEl.find('h3');

                // Extract additional info and push it to the articles array.
                articles.push({
                    pageNum,
                    topic: $h3El.children().first().text(),
                    title: $h3El.children().last().text(),
                    url: $articleEl.find('a')[0].href,
                    teaser: $articleEl.find('.teaser__text').text(),
                });
            });
        
            // Return results.
            return articles;
        },
        "proxyConfiguration": {
            "useApifyProxy": false
        },
        "debugLog": false,
        "browserLog": false,
        "injectJQuery": true,
        "injectUnderscore": false,
        "downloadMedia": false,
        "downloadCss": false,
        "ignoreSslErrors": false
    };

    // Now let's metamorph into actor apify/web-scraper using the created input.
    await Apify.metamorph('apify/web-scraper', metamorphInput);
});
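
Each object returned from pageFunction is stored as an item in the run's default dataset. A single result item looks roughly like this (all values below are illustrative, not real scraped data):

{
    "pageNum": 1,
    "topic": "Briefing",
    "title": "An example article title",
    "url": "https://www.economist.com/briefing/example-article",
    "teaser": "A short teaser paragraph shown in the category listing."
}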

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.5"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}
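
To try the actor locally, one option is the Apify CLI (the apify-cli package on npm). Assuming it is installed, the actor can be run from the project directory with:

apify run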