Article Text Extractor

mtrunkat/article-text-extractor

Extracts the article text and other metadata from a given URL. Uses node-unfluff (https://github.com/ageitgey/node-unfluff), a Node.js implementation of python-goose (https://github.com/grangier/python-goose).
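As a rough illustration of the data flow, the sketch below builds the input record the actor reads (an object with a `url` field, as in main.js) and condenses an extraction result into a short summary. The `title` and `text` fields are assumed from the result shape described in the node-unfluff README; the helper names `buildInput` and `summarizeOutput` and the sample record are hypothetical and not part of the actor.

```javascript
// Hypothetical helpers for illustration; not part of the actor's source.

// Build the INPUT record. The actor only reads the `url` field.
function buildInput(url) {
    return { url };
}

// Condense an extraction result into a short summary.
// Assumes the `title` and `text` fields; either may be missing.
function summarizeOutput(output) {
    const text = (output.text || '').trim();
    const words = text === '' ? [] : text.split(/\s+/);
    return {
        title: output.title || '(untitled)',
        wordCount: words.length,
        preview: words.slice(0, 5).join(' '),
    };
}

// Sample record made up for this example.
const sample = { title: 'Hello', text: 'One two three four five six seven.' };

console.log(JSON.stringify(buildInput('https://example.com/article')));
console.log(summarizeOutput(sample));
```

A summary like this can be handy when checking many results at once, since the full `text` field may be long.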

Dockerfile

# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node-chrome:v0.21.10

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY --chown=myuser:myuser . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "0.21.10",
        "request-promise": "latest",
        "unfluff": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}

main.js

const Apify = require('apify');
const request = require('request-promise');
const extractor = require('unfluff');

Apify.main(async () => {
    // Read the actor input. It may be missing entirely, so guard before reading `url`.
    const input = await Apify.getValue('INPUT');
    const url = input && input.url;

    if (!url) throw new Error('INPUT.url must be provided!!!');

    console.log('Opening browser ...');
    const browser = await Apify.launchPuppeteer();

    try {
        console.log('Loading url ...');
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        const html = await page.evaluate(() => document.documentElement.outerHTML);

        // Store the raw HTML next to the result so a run can be inspected later.
        await Apify.setValue('page.html', html, { contentType: 'text/html' });

        console.log('Extracting article data and saving results to key-value store ...');
        await Apify.setValue('OUTPUT', extractor(html));
    } finally {
        // Close the browser even if loading or extraction fails.
        await browser.close();
    }

    console.log('Done!');
});
Developer
Maintained by Community
Actor metrics
  • 22 monthly users
  • 8 stars
  • 99.7% runs succeeded
  • 7.3 hours response time
  • Created in Mar 2018
  • Modified 10 months ago