Article Text Extractor

  • mtrunkat/article-text-extractor
  • Modified
  • Users 534
  • Runs 108.5k
  • Created by Marek Trunkát

Extracts the article text and other metadata from a given URL. Uses https://github.com/ageitgey/node-unfluff, which is a Node.js implementation of https://github.com/grangier/python-goose. See also lukaskrivka/article-extractor-smart.

The output is saved to the default key-value store under the OUTPUT key. The HTML of the given page is stored under the page.html key.
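The two records described above can be read back over HTTP once a run has finished. The following is a minimal Python sketch, not part of the actor itself. It assumes the public Apify API v2 record endpoint and a bearer token for authentication. The function names, the store ID and the token are hypothetical placeholders.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: base URL of the public Apify API, version 2.
API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the URL of a single record in a key-value store."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_record(store_id: str, key: str, token: str) -> bytes:
    """Download the raw bytes of one record (performs a network request)."""
    request = urllib.request.Request(
        record_url(store_id, key),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_article(store_id: str, token: str) -> dict:
    """Load the extracted article stored under the OUTPUT key."""
    return json.loads(fetch_record(store_id, "OUTPUT", token))


def fetch_page_html(store_id: str, token: str) -> str:
    """Load the stored page HTML; UTF-8 encoding is assumed here."""
    return fetch_record(store_id, "page.html", token).decode("utf-8", errors="replace")
```

The store ID is that of the run's default key-value store, which the run details should report. Treat this as something to verify against the platform documentation.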

Example output:

{
  "title": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "softTitle": "Sánchez no logra extender su poder territorial pese al triunfo del 26-M",
  "date": "16/06/2019 22:03",
  "author": [
    "Madrid"
  ],
  "publisher": "La Vanguardia",
  "copyright": "La Vanguardia Ediciones Todos los derechos reservados",
  "favicon": "https://www.lavanguardia.com/rsc/images/ico/favicon.ico",
  "description": "El PSOE ganó el pasado 26 de mayo las elecciones municipales y autonómicas de manera 'clara y rotunda', según celebró el propio Pedro Sánchez aquella misma noche. Aunque la victoria socialista se tiñó...",
  "lang": "es",
  "canonicalLink": "https://www.lavanguardia.com/politica/20190617/462906149711/psoe-pedro-sanchez-elecciones-26m-alcaldias-gobiernos-espana.html",
  "tags": [],
  "image": "https://www.lavanguardia.com/r/GODO/LV/p6/WebSite/2019/06/17/Recortada/20190614-636961455890161857_20190614215051428-kvhE-U462903686315FDE-992x558@LaVanguardia-Web.jpg",
  "videos": [],
  "links": [],
  "text": "..."
}
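A record of this shape can be consumed with a few lines of Python. The sketch below is illustrative only. The `summarize` helper is hypothetical, and the day/month/year date format is an assumption based on this single example. The date field appears to be taken from the page as-is, so other sites may use other formats.

```python
import json
from datetime import datetime

# Assumption: the date format seen in the example ("16/06/2019 22:03").
EXAMPLE_DATE_FORMAT = "%d/%m/%Y %H:%M"


def summarize(record_json: str) -> dict:
    """Pick a few fields out of an OUTPUT record and normalize them."""
    record = json.loads(record_json)

    raw_date = record.get("date")
    try:
        published = (
            datetime.strptime(raw_date, EXAMPLE_DATE_FORMAT) if raw_date else None
        )
    except ValueError:
        # The date did not match the assumed format; keep it unparsed.
        published = None

    return {
        # Fall back to softTitle when title is missing or empty.
        "title": record.get("title") or record.get("softTitle"),
        "published": published,
        "authors": record.get("author") or [],
        "word_count": len((record.get("text") or "").split()),
    }
```

Missing fields yield `None`, an empty list or a zero word count instead of raising, which fits pages where the extractor finds only part of the metadata.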
