
Article Text Extractor
- mtrunkat/article-text-extractor
- Users 534
- Runs 108.5k
- Created by Marek Trunkát
Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, which is a Node.js implementation of https://github.com/grangier/python-goose. See also lukaskrivka/article-extractor-smart.
The output is saved to the default key-value store under the OUTPUT key. The HTML of the given page is stored under the page.html key.
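As a rough illustration of reading those two records back, the sketch below builds record URLs and downloads the OUTPUT record. It assumes the public Apify API v2 URL layout for key-value store records. The store ID is a hypothetical placeholder you would take from your own run, and authentication (if your store requires it) is not shown.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the public Apify API (v2).
API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the URL of one record in a key-value store."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_output(store_id: str) -> dict:
    """Download and decode the JSON stored under the OUTPUT key."""
    with urlopen(record_url(store_id, "OUTPUT")) as resp:
        return json.load(resp)


def fetch_page_html(store_id: str) -> str:
    """Download the raw HTML stored under the page.html key."""
    with urlopen(record_url(store_id, "page.html")) as resp:
        # Assumption: the stored HTML is UTF-8; undecodable bytes are replaced.
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_output("<your-store-id>")` would then return a dictionary shaped like the example output, provided the run has finished and the store is reachable.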
Example output:
{
  "title": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "softTitle": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "date": "16/06/2019 22:03",
  "author": [
    "Madrid"
  ],
  "publisher": "La Vanguardia",
  "copyright": "La Vanguardia Ediciones Todos los derechos reservados",
  "favicon": "https://www.lavanguardia.com/rsc/images/ico/favicon.ico",
  "description": "El PSOE ganó el pasado 26 de mayo las elecciones municipales y autonómicas de manera 'clara y rotunda', según celebró el propio Pedro Sánchez aquella misma noche. Aunque la victoria socialista se tiñó...",
  "lang": "es",
  "canonicalLink": "https://www.lavanguardia.com/politica/20190617/462906149711/psoe-pedro-sanchez-elecciones-26m-alcaldias-gobiernos-espana.html",
  "tags": [],
  "image": "https://www.lavanguardia.com/r/GODO/LV/p6/WebSite/2019/06/17/Recortada/20190614-636961455890161857_20190614215051428-kvhE-U462903686315FDE-992x558@LaVanguardia-Web.jpg",
  "videos": [],
  "links": [],
  "text": "..."
}
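A consumer of this record will usually want a few normalized fields. The sketch below is one possible way to do that, not part of the actor itself. The date format is inferred from this single sample only, so other sites may well produce other formats, and the parser returns `None` in that case. The helper names `parse_date` and `summarize` are hypothetical.

```python
import json
from datetime import datetime
from typing import Optional

# Assumption: day/month/year format, as seen in the sample record only.
SAMPLE_DATE_FORMAT = "%d/%m/%Y %H:%M"


def parse_date(raw: Optional[str]) -> Optional[datetime]:
    """Parse the 'date' field; return None if it is missing or in another format."""
    if not raw:
        return None
    try:
        return datetime.strptime(raw, SAMPLE_DATE_FORMAT)
    except ValueError:
        return None


def summarize(record: dict) -> dict:
    """Pick a few fields from an OUTPUT record, tolerating missing keys."""
    return {
        "title": record.get("title"),
        "published": parse_date(record.get("date")),
        "authors": record.get("author") or [],
        "lang": record.get("lang"),
    }


# Shortened copy of the example output above.
record = json.loads("""
{
  "title": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "date": "16/06/2019 22:03",
  "author": ["Madrid"],
  "lang": "es",
  "text": "..."
}
""")

summary = summarize(record)
print(summary["title"], summary["published"], summary["authors"])
```

Note that in this sample the `author` field holds "Madrid", which looks like a dateline rather than a person, so extracted metadata is worth validating before relying on it.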