Metadata Extractor

jancurn/extract-metadata

A small, efficient actor that loads a web page, parses its HTML using the Cheerio library, and extracts metadata from the <head> element, such as the page title, description, and author.
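As a rough illustration of the input the actor accepts, the sketch below mirrors the check at the top of main.js: the input must be an object with a string `url` field. The helper name `isValidInput` and the example URL are hypothetical and not part of the actor.

```javascript
// Hypothetical helper mirroring the input check in main.js:
// the input must exist and carry a string `url` field.
function isValidInput(input) {
    return Boolean(input) && typeof input.url === 'string';
}

// Example input object (the URL is only a placeholder).
const exampleInput = { url: 'https://www.example.com' };

console.log(isValidInput(exampleInput)); // true
console.log(isValidInput({}));           // false
console.log(isValidInput(null));         // false
```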

Dockerfile

# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node-basic:v0.21.10

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "0.21.10",
        "request-promise": "latest",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}

main.js

const Apify = require('apify');
const request = require('request-promise');
const cheerio = require('cheerio');

Apify.main(async () => {
    // Get the input of the act; it must contain a string 'url' field
    const input = await Apify.getValue('INPUT');
    if (!input || typeof input.url !== 'string') {
        throw new Error("Invalid input, it needs to contain 'url' field.");
    }

    // Load the web page and parse its HTML
    console.log(`Opening ${input.url}`);
    const html = await request(input.url);

    const $ = cheerio.load(html);

    // Collect every <meta> tag in <head> that has a 'name' attribute;
    // a tag without 'content' is stored as null
    const meta = {};
    $('head meta').each(function () {
        const name = $(this).attr('name');
        const content = $(this).attr('content');
        if (name) meta[name] = content ? content.trim() : null;
    });

    const result = {
        url: input.url,
        title: ($('head title').text() || '').trim(),
        meta,
    };

    // Show and save result
    console.log('Result:');
    console.dir(result);
    await Apify.setValue('OUTPUT', result);
});
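
To make the shape of the `meta` object in the output more concrete, here is a minimal, dependency-free sketch of the same collection rule used in main.js. It does not use Cheerio: the `collectMeta` function and the `sampleTags` array are hypothetical stand-ins for the `<meta>` elements the actor would read from a real page.

```javascript
// Hypothetical, dependency-free mirror of the loop in main.js.
// Each entry in `tags` stands in for the attributes of one <meta> element.
function collectMeta(tags) {
    const meta = {};
    for (const { name, content } of tags) {
        // Tags without a 'name' attribute are skipped, as in main.js;
        // a named tag without 'content' is stored as null.
        if (name) meta[name] = content ? content.trim() : null;
    }
    return meta;
}

// Made-up sample data for illustration only.
const sampleTags = [
    { name: 'description', content: '  An example page.  ' },
    { name: 'author', content: 'Jane Doe' },
    { name: 'robots' },                       // no content -> null
    { property: 'og:title', content: 'Hi' },  // no name -> skipped
];

console.log(collectMeta(sampleTags));
```

Under this rule, tags that identify themselves with `property` rather than `name` (as Open Graph tags do) would not appear in `meta`, and a later tag with the same `name` overwrites an earlier one.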
Developer
Maintained by Community
Actor metrics
  • 22 monthly users
  • 7 stars
  • 100.0% runs succeeded
  • Created in Feb 2018
  • Modified 10 months ago