Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, a Node.js reimplementation of https://github.com/grangier/python-goose.

- Users: 498
- Runs: 105,372
Dockerfile
# Dockerfile contains instructions on how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/
# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
# apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
# apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
# apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
FROM apify/actor-node-chrome
# Second, copy just package.json since it should be the only file
# that affects NPM install in the next step
COPY package.json ./
# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version
# Next, copy the remaining files and directories with the source code.
# Since we do this after NPM install, quick build will be really fast
# for most source file changes.
COPY . ./
# Optionally, specify how to launch the source code of your actor.
# By default, Apify's base Docker images define the CMD instruction
# that runs the source code using the command specified
# in the "scripts.start" section of the package.json file.
# In short, the instruction looks something like this:
# CMD npm start
INPUT_SCHEMA.json
{
    "title": "Article text extractor input",
    "description": "",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "Article URL",
            "type": "string",
            "description": "Enter the URL of the article from which you want to extract data.",
            "prefill": "https://www.bbc.com/news/world-asia-china-48659073",
            "editor": "textfield"
        }
    },
    "required": ["url"]
}
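For reference, the INPUT record produced by this schema is a plain JSON object with a single `url` field; the value shown below is just the schema's prefill example:

```json
{
    "url": "https://www.bbc.com/news/world-asia-china-48659073"
}
```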
README.md
Extracts the article text and other metadata from a given URL.
Uses https://github.com/ageitgey/node-unfluff, a Node.js reimplementation of https://github.com/grangier/python-goose.
Also check out [lukaskrivka/article-extractor-smart](https://apify.com/lukaskrivka/article-extractor-smart).
The output is saved to the default key-value store under the `OUTPUT` key. The HTML of the page is stored under the `page.html` key.
Example output:
```json
{
    "title": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
    "softTitle": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
    "date": "16/06/2019 22:03",
    "author": [
        "Madrid"
    ],
    "publisher": "La Vanguardia",
    "copyright": "La Vanguardia Ediciones Todos los derechos reservados",
    "favicon": "https://www.lavanguardia.com/rsc/images/ico/favicon.ico",
    "description": "El PSOE ganó el pasado 26 de mayo las elecciones municipales y autonómicas de manera 'clara y rotunda', según celebró el propio Pedro Sánchez aquella misma noche. Aunque la victoria socialista se tiñó...",
    "lang": "es",
    "canonicalLink": "https://www.lavanguardia.com/politica/20190617/462906149711/psoe-pedro-sanchez-elecciones-26m-alcaldias-gobiernos-espana.html",
    "tags": [],
    "image": "https://www.lavanguardia.com/r/GODO/LV/p6/WebSite/2019/06/17/Recortada/20190614-636961455890161857_20190614215051428-kvhE-U462903686315FDE-992x558@LaVanguardia-Web.jpg",
    "videos": [],
    "links": [],
    "text": "..."
}
```
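If you want to run the extractor from another actor or script, a minimal sketch along these lines should work with the Apify SDK's `Apify.call()` helper; the actor name `someuser/article-text-extractor` below is a placeholder for the name this actor is published under:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Start the extractor, wait for it to finish and fetch its OUTPUT record
    // from the run's default key-value store (Apify.call does this by default).
    const run = await Apify.call('someuser/article-text-extractor', {
        url: 'https://www.bbc.com/news/world-asia-china-48659073',
    });

    // run.output.body holds the parsed article data shown in the example above.
    console.log(run.output.body.title);
});
```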
main.js
const Apify = require('apify');
const extractor = require('unfluff');

Apify.main(async () => {
    const { url } = await Apify.getValue('INPUT');
    if (!url) throw new Error('INPUT.url must be provided!');

    console.log('Opening browser ...');
    const browser = await Apify.launchPuppeteer();

    console.log('Loading URL ...');
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });

    // Grab the rendered HTML and store it in the key-value store for later inspection.
    const html = await page.evaluate(() => document.documentElement.outerHTML);
    await Apify.setValue('page.html', html, { contentType: 'text/html' });

    console.log('Extracting article data and saving results to key-value store ...');
    await Apify.setValue('OUTPUT', extractor(html));

    await browser.close();
    console.log('Done!');
});
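Note that `unfluff` also offers a lazy mode that parses each field only when it is requested, which can help if you need just a couple of fields. The snippet below is a rough sketch based on the unfluff README and is not part of this actor; `html` stands for the page HTML obtained as in `main.js` above:

```js
const extractor = require('unfluff');

// `html` is the page's HTML, e.g. the string saved to `page.html` by main.js.
const html = '<html>...</html>';

// Lazy extraction: each field is parsed only when its accessor is called.
const data = extractor.lazy(html, 'es');
console.log(data.title());
console.log(data.date());
console.log(data.text());
```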
package.json
{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.14.15",
        "request-promise": "latest",
        "unfluff": "latest"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}