Article Text Extractor
mtrunkat/article-text-extractor
Extracts the article text and other metadata from a given URL. It uses https://github.com/ageitgey/node-unfluff, a Node.js implementation of https://github.com/grangier/python-goose.
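As a rough illustration of the input the actor expects, the sketch below validates an input object with a single `url` field before any page is loaded. The `url` field name comes from main.js; the `validateInput` helper and its restriction to http/https are assumptions made for this example and are not part of the actor.

```javascript
// Hypothetical helper (not part of the actor): checks that the input
// object carries a usable `url` field before a browser is launched.
function validateInput(input) {
    if (!input || typeof input.url !== 'string' || input.url.trim() === '') {
        throw new Error('INPUT.url must be provided');
    }
    let parsed;
    try {
        // The WHATWG URL class is a Node.js global; it throws on malformed input.
        parsed = new URL(input.url.trim());
    } catch (err) {
        throw new Error(`INPUT.url is not a valid URL: ${input.url}`);
    }
    // Assumption: only web pages make sense for article extraction.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported protocol: ${parsed.protocol}`);
    }
    return parsed.href;
}

console.log(validateInput({ url: 'https://example.com/article' }));
```

Failing early like this avoids launching a browser for an input that could never load.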
Dockerfile
# This is a template for a Dockerfile used to run acts in the Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node-chrome:v0.21.10

# First, copy just package.json and package-lock.json, since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "0.21.10",
        "request-promise": "latest",
        "unfluff": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}
main.js
const Apify = require('apify');
const request = require('request-promise'); // Imported by the original code but not used below.
const extractor = require('unfluff');

Apify.main(async () => {
    // getValue() resolves to null when no INPUT record exists, so guard before reading `url`.
    const input = await Apify.getValue('INPUT');
    const url = input && input.url;

    if (!url) throw new Error('INPUT.url must be provided!');

    console.log('Opening browser ...');
    const browser = await Apify.launchPuppeteer();

    try {
        console.log('Loading url ...');
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        const html = await page.evaluate(() => document.documentElement.outerHTML);

        // Keep the raw HTML next to the result for debugging.
        await Apify.setValue('page.html', html, { contentType: 'text/html' });

        console.log('Extracting article data and saving results to key-value store ...');
        await Apify.setValue('OUTPUT', extractor(html));
    } finally {
        // Close the browser even if loading or extraction fails.
        await browser.close();
    }

    console.log('Done!');
});
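The object stored under `OUTPUT` is whatever `unfluff` returns for the page HTML. A minimal sketch of trimming such a result down to a few fields follows. The `summarizeArticle` helper is hypothetical, and the field names (`title`, `author`, `date`, `lang`, `text`) are assumed from unfluff's documented output, so they should be checked against the installed version.

```javascript
// Hypothetical helper: reduces an unfluff-style result to a compact summary.
// Field names are assumptions based on unfluff's documented output.
function summarizeArticle(article, previewLength = 200) {
    const text = typeof article.text === 'string' ? article.text : '';
    return {
        title: article.title || null,
        authors: Array.isArray(article.author) ? article.author : [],
        date: article.date || null,
        lang: article.lang || null,
        // Count whitespace-separated tokens, ignoring empty strings.
        wordCount: text.split(/\s+/).filter(Boolean).length,
        preview: text.slice(0, previewLength),
    };
}

// Usage with a hand-written object standing in for real extractor output.
const sample = {
    title: 'Example headline',
    author: ['Jane Doe'],
    lang: 'en',
    text: 'First paragraph of the article. Second sentence.',
};
console.log(summarizeArticle(sample));
```

Missing fields fall back to `null` or an empty array, which keeps the summary shape stable for pages where the extractor finds little.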
Developer
Maintained by Community
Actor Metrics
22 monthly users
10 stars
>99% runs succeeded
Created in Mar 2018
Modified a year ago