PDF to HTML Converter

jancurn/pdf-to-html

Converts a PDF document to HTML using the pdf2htmlEX tool.

Author: Jan Čurn
  • Modified
  • Users: 94
  • Runs: 119,838

Dockerfile


FROM debian:jessie

RUN apt-get update --fix-missing \
 && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl ca-certificates pdf2htmlex \
 && curl -sL https://deb.nodesource.com/setup_10.x | bash - \
 && apt-get install -y nodejs \
 && node -v \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /pdf/kv-store-dev

WORKDIR /pdf

# Copy the actor's source file and package manifest into the Docker image
COPY main.js package.json ./

# Install NPM packages, skip optional and development dependencies to keep the image small,
# avoid logging too much, and print the dependency tree
RUN npm install --quiet --only=prod --no-optional \
 && npm list \
 && pwd \
 && ls -l

# Define the start command
CMD [ "node", "main.js" ]

INPUT_SCHEMA.json

{
    "title": "PDF to HTML input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "description": "URL that links to a PDF file",
            "editor": "textfield",
            "prefill": "https://apify.com/ext/ycf_application.pdf"
        }
    },
    "required": ["url"]
}

README.md

A simple actor that fetches a PDF file from a given
URL and converts it to an HTML document.
The resulting HTML document is stored in the default key-value store
associated with the run, under the `OUTPUT` key.
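
The input and output contract described above can be sketched as two small helpers. This is only an illustration: `validateInput` and `outputRecordUrl` are hypothetical names that do not appear in the actor's source, the http(s) protocol check is an extra assumption beyond what `INPUT_SCHEMA.json` requires, and the record URL simply follows the base pattern that `main.js` logs at the end of a run.

```javascript
// Hypothetical helpers; the names are illustrative and not part of the actor's source.

// Mirrors INPUT_SCHEMA.json: `url` is required and must be a string.
// The http(s) check is an added assumption, not something the schema enforces.
function validateInput(input) {
    if (!input || typeof input.url !== 'string' || input.url.length === 0) {
        throw new Error('Received invalid input: "url" is required');
    }
    const parsed = new URL(input.url); // throws a TypeError on a malformed URL
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported URL protocol: ${parsed.protocol}`);
    }
    return parsed.href;
}

// Builds the OUTPUT record URL using the base pattern that main.js logs.
function outputRecordUrl(storeId) {
    const id = encodeURIComponent(storeId);
    return `https://api.apify.com/v2/key-value-stores/${id}/records/OUTPUT`;
}
```

Calling `validateInput({ url: '...' })` before downloading would fail fast on a missing or malformed URL, and `outputRecordUrl(process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID)` would give the address of the converted HTML for the current run.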

main.js

const fs = require('fs');
const util = require('util');
const exec = util.promisify(require('child_process').exec);
const Apify = require('apify');
const requestPromise = require('request-promise');

Apify.main(async () => {
    // Fetch the input and check it has a valid format
    // You don't need to check the input, but it's a good practice.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('Received invalid input');

    console.log(`Downloading PDF file: ${input.url}`);
    const options = {
        url: input.url,
        encoding: null // set to `null`, if you expect binary data.
    };
    const response = await requestPromise(options);
    const buffer = Buffer.from(response);

    const tmpTarget = 'temp.pdf';
    console.log('Saving PDF file to: ' + tmpTarget);
    fs.writeFileSync(tmpTarget, buffer);

    const { stdout, stderr } = await exec(`pdf2htmlEX --zoom 1.3 ${tmpTarget}`);
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);

    const htmlBuffer = fs.readFileSync('temp.html');

    console.log(`Saving HTML (size: ${htmlBuffer.length} bytes) to output...`);
    await Apify.setValue('OUTPUT', htmlBuffer, { contentType: 'text/html' });

    const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;

    // NOTE: Adding the disableRedirect=1 param, because Chrome doesn't allow pasting
    // redirecting URLs to PDF files into the browser address bar
    console.log('HTML file has been stored to:');
    console.log(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT?disableRedirect=1`);
});

package.json

{
  "name": "act-pdf-to-html",
  "version": "0.0.1",
  "private": true,
  "dependencies": {
    "apify": "^0.15.2",
    "request-promise": "^4.2.4"
  },
  "devDependencies": {},
  "scripts": {
    "test-local": "APIFY_DEV_KEY_VALUE_STORE_DIR=./kv-store-dev/ node main.js"
  }
}