PDF to HTML Converter

jancurn/pdf-to-html

Converts a PDF document to HTML using the pdf2htmlEX tool.

Dockerfile

FROM debian:jessie

RUN apt-get update --fix-missing \
 && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl ca-certificates pdf2htmlex \
 && curl -sL https://deb.nodesource.com/setup_10.x | bash - \
 && apt-get install -y nodejs \
 && node -v \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /pdf/kv-store-dev

WORKDIR /pdf

# Copy the actor source and the package manifest into the Docker image
COPY main.js package.json ./

# Install NPM packages, skipping optional and development dependencies to keep the image small,
# avoid logging too much, and print the dependency tree
RUN npm install --quiet --only=prod --no-optional \
 && npm list \
 && pwd \
 && ls -l

# Define the start command
CMD [ "node", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "PDF to HTML input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "description": "URL that links to a PDF file",
            "editor": "textfield",
            "prefill": "https://apify.com/ext/ycf_application.pdf"
        }
    },
    "required": ["url"]
}
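
The schema above only requires that `url` be present and be a string. As a minimal sketch of a stricter runtime check, the hypothetical helper below (`validateInput` is an illustrative name, not part of the actor) also verifies that the value parses as an HTTP or HTTPS URL. It assumes a Node.js version with the global `URL` class, which is available in the Node.js 10 line installed by the Dockerfile.

```javascript
// Hypothetical helper, not part of the original actor: checks the input
// object against the constraints implied by INPUT_SCHEMA.json and
// additionally requires an http(s) URL.
function validateInput(input) {
    if (!input || typeof input.url !== 'string' || input.url.trim() === '') {
        throw new Error('Received invalid input: "url" must be a non-empty string');
    }

    let parsed;
    try {
        // The URL constructor throws a TypeError for strings it cannot parse.
        parsed = new URL(input.url);
    } catch (err) {
        throw new Error(`Received invalid input: "${input.url}" is not a valid URL`);
    }

    // Restricting the protocol is an assumption of this sketch.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Received invalid input: unsupported protocol ${parsed.protocol}`);
    }

    return parsed.href;
}
```

A call such as `validateInput(await Apify.getValue('INPUT'))` could stand in for the simple truthiness check in main.js, at the cost of rejecting non-HTTP sources.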

main.js

const fs = require('fs');
const util = require('util');
const exec = util.promisify(require('child_process').exec);
const Apify = require('apify');
const requestPromise = require('request-promise');

Apify.main(async () => {
    // Fetch the input and check it has a valid format.
    // You don't need to check the input, but it's good practice.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('Received invalid input');

    console.log(`Downloading PDF file: ${input.url}`);
    const options = {
        url: input.url,
        encoding: null, // `null` makes the library return binary data as a Buffer
    };
    const response = await requestPromise(options);
    const buffer = Buffer.from(response);

    const tmpTarget = 'temp.pdf';
    console.log(`Saving PDF file to: ${tmpTarget}`);
    fs.writeFileSync(tmpTarget, buffer);

    // pdf2htmlEX writes temp.html next to the working directory's temp.pdf
    const { stdout, stderr } = await exec(`pdf2htmlEX --zoom 1.3 ${tmpTarget}`);
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);

    const htmlBuffer = fs.readFileSync('temp.html');

    console.log(`Saving HTML (size: ${htmlBuffer.length} bytes) to output...`);
    await Apify.setValue('OUTPUT', htmlBuffer, { contentType: 'text/html' });

    const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;

    // Print the API URL from which the stored HTML record can be fetched
    console.log('HTML file has been stored to:');
    console.log(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT`);
});
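
The script above passes a fixed command string to `exec`, which runs it through a shell. If the file name or zoom level ever came from user input, an argument array passed to `execFile` would avoid shell interpretation. The following is a minimal sketch of that variation, not the author's implementation: `buildConversion` and `convertPdf` are hypothetical names, and the sketch assumes, as main.js does, that pdf2htmlEX writes `<basename>.html` into the current working directory.

```javascript
const path = require('path');
const util = require('util');
const execFile = util.promisify(require('child_process').execFile);

// Hypothetical helper: builds the pdf2htmlEX argument list and the expected
// output file name for a given PDF path.
function buildConversion(pdfPath, zoom = 1.3) {
    const { name } = path.parse(pdfPath);
    return {
        // Same flag as in main.js; each argument is passed separately,
        // so spaces in the path need no quoting.
        args: ['--zoom', String(zoom), pdfPath],
        // Assumption: output lands in the working directory as <name>.html.
        htmlPath: `${name}.html`,
    };
}

// Hypothetical wrapper: runs the conversion without a shell and returns
// the path of the generated HTML file.
async function convertPdf(pdfPath, zoom) {
    const { args, htmlPath } = buildConversion(pdfPath, zoom);
    const { stdout, stderr } = await execFile('pdf2htmlEX', args);
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);
    return htmlPath;
}
```

With these helpers, `fs.readFileSync(await convertPdf(tmpTarget))` could replace the `exec` call and the hard-coded `temp.html` read in main.js.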

package.json

{
  "name": "act-pdf-to-html",
  "version": "0.0.1",
  "private": true,
  "dependencies": {
    "apify": "^0.15.2",
    "request-promise": "^4.2.4"
  },
  "devDependencies": {},
  "scripts": {
    "test-local": "APIFY_DEV_KEY_VALUE_STORE_DIR=./kv-store-dev/ node main.js"
  }
}
Developer
Maintained by Community
Actor metrics
  • 13 monthly users
  • 66.7% runs succeeded
  • 0.0 days response time
  • Created in Nov 2017
  • Modified 7 months ago