PDF to HTML Converter

jancurn/pdf-to-html

Converts a PDF document to HTML using the pdf2htmlEX tool.
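
As a rough illustration of the conversion step, the sketch below builds the argument list for the command that main.js runs (`pdf2htmlEX --zoom 1.3 temp.pdf`) and derives the name of the HTML file it then reads. Both helper names (`buildPdf2HtmlArgs`, `expectedHtmlPath`) are hypothetical and not part of this Actor. The output naming rule (`temp.pdf` becomes `temp.html` in the working directory) is an assumption inferred from main.js reading `temp.html`.

```javascript
const path = require('path');

// Hypothetical helper: build the argument list for pdf2htmlEX.
// Only --zoom is handled, because it is the only option main.js passes.
function buildPdf2HtmlArgs(pdfPath, zoom = 1.3) {
    if (typeof pdfPath !== 'string' || pdfPath.length === 0) {
        throw new TypeError('pdfPath must be a non-empty string');
    }
    if (!Number.isFinite(zoom) || zoom <= 0) {
        throw new RangeError('zoom must be a positive number');
    }
    return ['--zoom', String(zoom), pdfPath];
}

// Hypothetical helper: the HTML file name expected after conversion.
// Assumption: the tool writes "<input base name>.html" to the working directory,
// which is what main.js relies on when it reads temp.html.
function expectedHtmlPath(pdfPath) {
    const { name } = path.parse(pdfPath);
    return `${name}.html`;
}
```

Passing such an argument array to `child_process.execFile('pdf2htmlEX', args)` instead of a shell string would avoid shell quoting issues if the file name ever came from user input.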

Dockerfile

FROM debian:jessie

RUN apt-get update --fix-missing \
 && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl ca-certificates pdf2htmlex \
 && curl -sL https://deb.nodesource.com/setup_10.x | bash - \
 && apt-get install -y nodejs \
 && node -v \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /pdf/kv-store-dev

WORKDIR /pdf

# Copy the application source and package manifest into the Docker image
COPY main.js package.json ./

# Install NPM packages, skip optional and development dependencies to keep the image small,
# avoid logging too much, and show the dependency tree
RUN npm install --quiet --only=prod --no-optional \
 && npm list \
 && pwd \
 && ls -l

# Define the start command
CMD [ "node", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "PDF to HTML input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "description": "URL that links to a PDF file",
            "editor": "textfield",
            "prefill": "https://apify.com/ext/ycf_application.pdf"
        }
    },
    "required": ["url"]
}
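
The schema above requires a single string field, `url`. A minimal sketch of an equivalent runtime check follows. The function name `validateInput` is hypothetical, and the restriction to `http:` and `https:` URLs is an added assumption that goes beyond what the schema itself states.

```javascript
// Hypothetical validator mirroring INPUT_SCHEMA.json:
// "url" is required and must be a string.
function validateInput(input) {
    if (input === null || typeof input !== 'object') {
        throw new Error('Input must be an object');
    }
    if (typeof input.url !== 'string' || input.url.trim() === '') {
        throw new Error('Input field "url" is required and must be a non-empty string');
    }

    let parsed;
    try {
        // The WHATWG URL constructor throws on strings it cannot parse.
        parsed = new URL(input.url);
    } catch (err) {
        throw new Error(`Input field "url" is not a valid URL: ${input.url}`);
    }

    // Assumption (not in the schema): only web URLs make sense for a download.
    if (!['http:', 'https:'].includes(parsed.protocol)) {
        throw new Error(`Unsupported URL protocol: ${parsed.protocol}`);
    }
    return { url: parsed.href };
}
```

For example, `validateInput({ url: 'https://apify.com/ext/ycf_application.pdf' })` returns an object with the same URL, while `validateInput({})` throws.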

main.js

const fs = require('fs');
const util = require('util');
const exec = util.promisify(require('child_process').exec);
const Apify = require('apify');
const requestPromise = require('request-promise');

Apify.main(async () => {
    // Fetch the input and check it has a valid format
    // You don't need to check the input, but it's a good practice.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('Received invalid input');

    console.log(`Downloading PDF file: ${input.url}`);
    const options = {
        url: input.url,
        encoding: null // set to `null`, if you expect binary data.
    };
    const response = await requestPromise(options);
    const buffer = Buffer.from(response);

    const tmpTarget = 'temp.pdf';
    console.log('Saving PDF file to: ' + tmpTarget);
    fs.writeFileSync(tmpTarget, buffer);

    const { stdout, stderr } = await exec('pdf2htmlEX --zoom 1.3 temp.pdf');
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);

    const htmlBuffer = fs.readFileSync('temp.html');

    console.log(`Saving HTML (size: ${htmlBuffer.length} bytes) to output...`);
    await Apify.setValue('OUTPUT', htmlBuffer, { contentType: 'text/html' });

    const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;

    // NOTE: Adding disableRedirect=1 param, because for some reason Chrome doesn't allow pasting URLs to PDF
    // that redirect into the browser address bar (yeah, wtf...)
    console.log('HTML file has been stored to:');
    console.log(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT`);
});
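
The NOTE comment in main.js mentions a `disableRedirect=1` parameter, yet the URL it logs does not include one. The sketch below shows one way the record URL could be built with that parameter as an option. The helper name `buildRecordUrl` is hypothetical, and whether the API honors `disableRedirect` is an assumption taken only from that comment.

```javascript
// Hypothetical helper: build the URL of a key-value store record,
// using the same URL pattern that main.js logs.
function buildRecordUrl(storeId, key = 'OUTPUT', { disableRedirect = false } = {}) {
    if (!storeId) {
        throw new Error('storeId is required');
    }
    const url = new URL(
        `https://api.apify.com/v2/key-value-stores/${encodeURIComponent(storeId)}`
        + `/records/${encodeURIComponent(key)}`,
    );
    // Assumption: the query parameter name comes from the NOTE comment in main.js.
    if (disableRedirect) {
        url.searchParams.set('disableRedirect', '1');
    }
    return url.href;
}
```

Inside the Actor this could be called as `buildRecordUrl(process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID)`, which also makes a missing store ID fail loudly instead of logging a URL containing `undefined`.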

package.json

{
    "name": "act-pdf-to-html",
    "version": "0.0.1",
    "private": true,
    "dependencies": {
        "apify": "^0.15.2",
        "request-promise": "^4.2.4"
    },
    "devDependencies": {},
    "scripts": {
        "test-local": "APIFY_DEV_KEY_VALUE_STORE_DIR=./kv-store-dev/ node main.js"
    }
}

Developer
Maintained by Community
Actor stats
  • 348 users
  • 213.8k runs
  • Modified 2 months ago