PDF to HTML Converter

This Actor is deprecated

This Actor is unavailable because the developer has decided to deprecate it.

jancurn/pdf-to-html

Converts a PDF document to HTML using the pdf2htmlEX tool.

Dockerfile

FROM debian:jessie

RUN apt-get update --fix-missing \
 && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl ca-certificates pdf2htmlex \
 && curl -sL https://deb.nodesource.com/setup_10.x | bash - \
 && apt-get install -y nodejs \
 && node -v \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /pdf/kv-store-dev

WORKDIR /pdf

# Copy the source file and the package manifest into the Docker image
COPY main.js package.json ./

# Install NPM packages, skipping optional and development dependencies to keep the image small.
# Keep the install log quiet, then print the dependency tree.
RUN npm install --quiet --only=prod --no-optional \
 && npm list \
 && pwd \
 && ls -l

# Define the start command
CMD [ "node", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "PDF to HTML input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "description": "URL that links to a PDF file",
            "editor": "textfield",
            "prefill": "https://apify.com/ext/ycf_application.pdf"
        }
    },
    "required": ["url"]
}
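
The schema above requires a single string property, `url`. A minimal sketch of an equivalent runtime check follows. `validateInput` is a hypothetical helper and not part of the Actor, and the restriction to http(s) URLs is an added assumption that the schema itself does not state.

```javascript
// Hypothetical helper (not part of the Actor): mirrors the schema's
// "required": ["url"] rule and the "type": "string" constraint.
function validateInput(input) {
    if (!input || typeof input !== 'object') {
        throw new Error('Input must be an object');
    }
    if (typeof input.url !== 'string' || input.url.length === 0) {
        throw new Error('Input field "url" must be a non-empty string');
    }
    // The WHATWG URL constructor throws a TypeError on malformed URLs.
    const parsed = new URL(input.url);
    // Assumption: only http(s) links are accepted; the schema does not say this.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported protocol: ${parsed.protocol}`);
    }
    return parsed.href;
}
```

Called with the schema's prefill value, the helper returns the URL unchanged; an empty object or an `ftp://` link makes it throw.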

main.js

const fs = require('fs');
const util = require('util');
const exec = util.promisify(require('child_process').exec);
const Apify = require('apify');
const requestPromise = require('request-promise');

Apify.main(async () => {
    // Fetch the input and check that it has a valid format.
    // Checking the input is optional, but it is good practice.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('Received invalid input');

    console.log(`Downloading PDF file: ${input.url}`);
    const options = {
        url: input.url,
        encoding: null // `null` makes the response body a Buffer, as needed for binary data
    };
    const response = await requestPromise(options);
    const buffer = Buffer.from(response);

    const tmpTarget = 'temp.pdf';
    console.log(`Saving PDF file to: ${tmpTarget}`);
    fs.writeFileSync(tmpTarget, buffer);

    // pdf2htmlEX writes its output next to the input, here as temp.html
    const { stdout, stderr } = await exec(`pdf2htmlEX --zoom 1.3 ${tmpTarget}`);
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);

    const htmlBuffer = fs.readFileSync('temp.html');

    console.log(`Saving HTML (size: ${htmlBuffer.length} bytes) to output...`);
    await Apify.setValue('OUTPUT', htmlBuffer, { contentType: 'text/html' });

    const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;

    // NOTE: Adding the disableRedirect=1 param, because for some reason Chrome doesn't allow
    // pasting URLs that redirect into the browser address bar
    console.log('HTML file has been stored to:');
    console.log(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT?disableRedirect=1`);
});
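
main.js runs pdf2htmlEX through `exec`, which hands the command string to a shell. That is fine for a fixed file name, but if the file name ever became dynamic, passing arguments as an array to `execFile` would avoid shell interpretation. The following is a sketch of that variant, not the Actor's code: `buildArgs`, `htmlPathFor`, and `convert` are hypothetical names, and it assumes pdf2htmlEX writes `<input basename>.html` to the working directory, as the `temp.pdf` and `temp.html` pair in main.js suggests.

```javascript
const path = require('path');
const util = require('util');
const execFile = util.promisify(require('child_process').execFile);

// Hypothetical helper: builds the argument array for pdf2htmlEX.
// The default zoom of 1.3 is the value used in main.js.
function buildArgs(pdfPath, zoom = 1.3) {
    return ['--zoom', String(zoom), pdfPath];
}

// Hypothetical helper: the output name assumed for a given input,
// i.e. the input's base name with an .html extension.
function htmlPathFor(pdfPath) {
    return `${path.basename(pdfPath, path.extname(pdfPath))}.html`;
}

// Runs the conversion without a shell; requires pdf2htmlEX on the PATH.
async function convert(pdfPath, zoom) {
    const { stdout, stderr } = await execFile('pdf2htmlEX', buildArgs(pdfPath, zoom));
    return { htmlPath: htmlPathFor(pdfPath), stdout, stderr };
}
```

With this in place, `await convert('temp.pdf')` would stand in for the `exec` call, and the returned `htmlPath` would replace the hard-coded `'temp.html'`.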

package.json

{
  "name": "act-pdf-to-html",
  "version": "0.0.1",
  "private": true,
  "dependencies": {
    "apify": "^0.15.2",
    "request-promise": "^4.2.4"
  },
  "devDependencies": {},
  "scripts": {
    "test-local": "APIFY_DEV_KEY_VALUE_STORE_DIR=./kv-store-dev/ node main.js"
  }
}
Developer
Maintained by Community