PDF to HTML Converter
This Actor is unavailable because the developer has decided to deprecate it.
jancurn/pdf-to-html
Converts a PDF document to HTML using the pdf2htmlEX tool.
Dockerfile
FROM debian:jessie

RUN apt-get update --fix-missing \
 && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl ca-certificates pdf2htmlex \
 && curl -sL https://deb.nodesource.com/setup_10.x | bash - \
 && apt-get install -y nodejs \
 && node -v \
 && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /pdf/kv-store-dev

WORKDIR /pdf

# Copy main.js and package.json from the build directory to the Docker image
COPY main.js package.json ./

# Install NPM packages, skip optional and development dependencies to keep the image small,
# avoid logging too much, and show the dependency tree
RUN npm install --quiet --only=prod --no-optional \
 && npm list \
 && pwd \
 && ls -l

# Define the start command
CMD [ "node", "main.js" ]
INPUT_SCHEMA.json
{
    "title": "PDF to HTML input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "description": "URL that links to a PDF file",
            "editor": "textfield",
            "prefill": "https://apify.com/ext/ycf_application.pdf"
        }
    },
    "required": ["url"]
}
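The schema above declares a single required string field, url. A minimal sketch of checking an input object against it in plain JavaScript might look like the following. The validateInput helper is hypothetical and not part of the Actor; the non-empty check mirrors the `!input.url` test in main.js, and the http(s)-only check is an extra assumption that the schema itself does not state.

```javascript
// Hypothetical helper (not part of the Actor): checks an input object
// against the constraints the schema declares for `url`.
function validateInput(input) {
    if (input === null || typeof input !== 'object' || Array.isArray(input)) {
        return { valid: false, error: 'Input must be an object' };
    }
    // "required": ["url"] and "type": "string"; the non-empty check
    // mirrors the `!input.url` test in main.js.
    if (typeof input.url !== 'string' || input.url.length === 0) {
        return { valid: false, error: 'Field "url" is required and must be a string' };
    }
    // Assumption beyond the schema: only well-formed http(s) URLs are accepted.
    let parsed;
    try {
        parsed = new URL(input.url);
    } catch (err) {
        return { valid: false, error: 'Field "url" is not a valid URL' };
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        return { valid: false, error: 'Field "url" must use http or https' };
    }
    return { valid: true, error: null };
}

// Example with the prefill value from the schema.
console.log(validateInput({ url: 'https://apify.com/ext/ycf_application.pdf' }));
```

Returning a result object rather than throwing keeps the helper easy to reuse; the caller can decide whether an invalid input should abort the run.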
main.js
const fs = require('fs');
const util = require('util');
const exec = util.promisify(require('child_process').exec);
const Apify = require('apify');
const requestPromise = require('request-promise');

Apify.main(async () => {
    // Fetch the input and check it has a valid format
    // You don't need to check the input, but it's a good practice.
    const input = await Apify.getValue('INPUT');
    if (!input || !input.url) throw new Error('Received invalid input');

    console.log(`Downloading PDF file: ${input.url}`);
    const options = {
        url: input.url,
        encoding: null // set to `null`, if you expect binary data.
    };
    const response = await requestPromise(options);
    const buffer = Buffer.from(response);

    const tmpTarget = 'temp.pdf';
    console.log('Saving PDF file to: ' + tmpTarget);
    fs.writeFileSync(tmpTarget, buffer);

    const { stdout, stderr } = await exec('pdf2htmlEX --zoom 1.3 temp.pdf');
    console.log('stdout:', stdout);
    console.log('stderr:', stderr);

    const htmlBuffer = fs.readFileSync('temp.html');

    console.log(`Saving HTML (size: ${htmlBuffer.length} bytes) to output...`);
    await Apify.setValue('OUTPUT', htmlBuffer, { contentType: 'text/html' });

    const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;

    // NOTE: Adding disableRedirect=1 param, because for some reason Chrome doesn't allow pasting URLs to PDF
    // that redirect into the browser address bar (yeah, wtf...)
    console.log('HTML file has been stored to:');
    console.log(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT`);
});
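The conversion step in main.js hardcodes both the shell command string and the output file name. As a rough sketch, not the Actor's own code, the same invocation could be built as an argument array and the output name derived from the input path. The buildConversion helper is hypothetical, and the assumption that the HTML file lands in the current working directory under the input's base name is taken from the temp.pdf to temp.html behavior seen in main.js.

```javascript
const path = require('path');

// Hypothetical helper (not part of the Actor): builds the pdf2htmlEX
// invocation from main.js as a command plus argument array.
function buildConversion(pdfPath, zoom = 1.3) {
    // Strip the directory and extension: 'temp.pdf' -> 'temp'
    const baseName = path.basename(pdfPath, path.extname(pdfPath));
    return {
        command: 'pdf2htmlEX',
        args: ['--zoom', String(zoom), pdfPath],
        // Assumption: as with temp.pdf -> temp.html in main.js, the output is
        // written to the current working directory under the input's base name.
        htmlPath: `${baseName}.html`,
    };
}

const conversion = buildConversion('temp.pdf');
console.log(conversion.command, conversion.args.join(' '), '->', conversion.htmlPath);
// prints "pdf2htmlEX --zoom 1.3 temp.pdf -> temp.html"
```

The command and args could then be passed to a promisified child_process.execFile instead of exec. Because execFile does not spawn a shell, a file name containing spaces or shell metacharacters would be passed through as a single argument.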
package.json
{
    "name": "act-pdf-to-html",
    "version": "0.0.1",
    "private": true,
    "dependencies": {
        "apify": "^0.15.2",
        "request-promise": "^4.2.4"
    },
    "devDependencies": {},
    "scripts": {
        "test-local": "APIFY_DEV_KEY_VALUE_STORE_DIR=./kv-store-dev/ node main.js"
    }
}
Developer
Maintained by Community