Article Text Extractor


mtrunkat/article-text-extractor

Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, which is a Node.js implementation of https://github.com/grangier/python-goose. See also lukaskrivka/article-extractor-smart.

The output is saved to the default key-value store under the OUTPUT key. The HTML of the given page is stored under the page.html key.
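As a rough illustration of where those two records end up, the sketch below builds the Apify API URLs for the OUTPUT and page.html keys. The `recordUrl` helper and the `YOUR_STORE_ID` placeholder are assumptions made for this example, not part of the actor; the real store ID comes from the actor run, and a non-public store would also need an API token.

```javascript
// Base URL of the Apify API (v2).
const API_BASE = "https://api.apify.com/v2";

// Hypothetical helper: builds the URL of one record in a key-value store.
function recordUrl(storeId, key) {
  const store = encodeURIComponent(storeId);
  const record = encodeURIComponent(key);
  return `${API_BASE}/key-value-stores/${store}/records/${record}`;
}

// Placeholder only: replace with the default key-value store ID of your run.
const storeId = "YOUR_STORE_ID";

// The two keys named in the description above.
const outputUrl = recordUrl(storeId, "OUTPUT");
const htmlUrl = recordUrl(storeId, "page.html");

console.log(outputUrl);
console.log(htmlUrl);
```

Fetching `outputUrl` should return the JSON document shown in the example output, and `htmlUrl` the raw page HTML.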

Example output:

{
  "title": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "softTitle": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "date": "16/06/2019 22:03",
  "author": [
    "Madrid"
  ],
  "publisher": "La Vanguardia",
  "copyright": "La Vanguardia Ediciones Todos los derechos reservados",
  "favicon": "https://www.lavanguardia.com/rsc/images/ico/favicon.ico",
  "description": "El PSOE ganó el pasado 26 de mayo las elecciones municipales y autonómicas de manera 'clara y rotunda', según celebró el propio Pedro Sánchez aquella misma noche. Aunque la victoria socialista se tiñó...",
  "lang": "es",
  "canonicalLink": "https://www.lavanguardia.com/politica/20190617/462906149711/psoe-pedro-sanchez-elecciones-26m-alcaldias-gobiernos-espana.html",
  "tags": [],
  "image": "https://www.lavanguardia.com/r/GODO/LV/p6/WebSite/2019/06/17/Recortada/20190614-636961455890161857_20190614215051428-kvhE-U462903686315FDE-992x558@LaVanguardia-Web.jpg",
  "videos": [],
  "links": [],
  "text": "..."
}
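
Several fields in the example are arrays that may be empty (`tags`, `videos`, `links`), and `author` is a list rather than a string, so code that consumes the output may want to read it defensively. The following is a minimal sketch of that idea; `summarizeArticle` is a hypothetical helper, and the `text` value in the sample is placeholder text, since the real article text is elided above.

```javascript
// Hypothetical helper: reduces the actor's OUTPUT object to a small summary.
// Field names follow the example output; missing fields fall back to
// empty values instead of throwing.
function summarizeArticle(output) {
  const authors = Array.isArray(output.author) ? output.author.join(", ") : "";
  const text = typeof output.text === "string" ? output.text.trim() : "";
  return {
    // Prefer the full title, then the shortened softTitle.
    title: output.title || output.softTitle || "",
    authors,
    publisher: output.publisher || "",
    lang: output.lang || "",
    // Whitespace-separated word count; 0 when there is no text.
    wordCount: text === "" ? 0 : text.split(/\s+/).length,
    tagCount: Array.isArray(output.tags) ? output.tags.length : 0,
  };
}

// Trimmed copy of the example output; "text" is a placeholder, not real data.
const sample = {
  title: "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  author: ["Madrid"],
  publisher: "La Vanguardia",
  lang: "es",
  tags: [],
  text: "placeholder article text",
};

const summary = summarizeArticle(sample);
console.log(summary);
```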
Developer
Maintained by Community

Actor Metrics

  • 22 monthly users

  • 10 stars

  • >99% runs succeeded

  • Created in Mar 2018

  • Modified a year ago

Categories