
Download HTML from URLs

mtrunkat/url-list-download-html

This actor takes a list of URLs and downloads the HTML of each page.


Author: Marek Trunkát
  • Users: 419
  • Runs: 161,438

Dockerfile

FROM apify/actor-node-puppeteer-chrome:16

COPY package*.json ./

RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

COPY . ./

ENV APIFY_DISABLE_OUTDATED_WARNING=1
ENV npm_config_loglevel=silent

INPUT_SCHEMA.json

{
    "title": "Input",
    "type": "object",
    "description": "Use the following form to configure this scraper. The URL list is required and all other fields are optional.",
    "schemaVersion": 1,
    "properties": {
        "requestListSources": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources",
            "minItems": 1
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Choose to use no proxy, Apify Proxy, or provide custom proxy URLs.",
            "prefill": { "useApifyProxy": true },
            "default": {},
            "editor": "proxy"
        },
        "handlePageTimeoutSecs": {
            "title": "Page timeout",
            "type": "integer",
            "description": "Maximum time the scraper will spend processing one page.",
            "minimum": 1,
            "default": 60,
            "maximum": 360,
            "unit": "secs"
        },
        "maxRequestRetries": {
            "title": "Maximum request retries",
            "description": "How many retries before giving up.",
            "default": 1,
            "prefill": 1,
            "type": "integer",
            "editor": "number"
        },
        "useChrome": {
            "title": "Use Chrome",
            "type": "boolean",
            "description": "The scraper will use a real Chrome browser instead of a Chromium masking as Chrome. Using this option may help with bypassing certain anti-scraping protections, but risks that the scraper will be unstable or not work at all.",
            "default": false,
            "groupCaption": "Browser masking options",
            "groupDescription": "Settings that help mask as a real user and prevent scraper detection."
        }
    },
    "required": ["requestListSources", "proxyConfiguration"]
}
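
For illustration, here is an input object that sets every field defined by the schema above (the values are only examples, not recommendations):

```json
{
    "requestListSources": [
        { "url": "https://apify.com" },
        { "url": "https://example.com" }
    ],
    "proxyConfiguration": { "useApifyProxy": true },
    "handlePageTimeoutSecs": 60,
    "maxRequestRetries": 1,
    "useChrome": false
}
```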

README.md

This actor scrapes the full HTML of every given URL.
In the input you can define which proxies should be used.

Additionally, you can define a selector that the scraper will wait for on each
of the URLs by adding a `waitForSelector` field to the `userData` property
of the request:

```json
{
    "requestListSources": [{
        "url": "https://example.com",
        "userData": {
            "waitForSelector": ".class-i-want-to-wait-for"
        }
    }]
}
```
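
The same input can also be passed when calling the actor programmatically. Below is a minimal sketch using the `apify-client` NPM package; the actor ID matches this listing, while the token environment variable is a placeholder:

```js
// Sketch: run the actor via the Apify API and read its dataset.
// Assumes the `apify-client` package is installed and APIFY_TOKEN is set.
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

(async () => {
    // Start the actor run and wait for it to finish.
    const run = await client.actor('mtrunkat/url-list-download-html').call({
        requestListSources: [
            { url: 'https://example.com', userData: { waitForSelector: '.class-i-want-to-wait-for' } },
        ],
        proxyConfiguration: { useApifyProxy: true },
    });

    // Each dataset item contains the downloaded HTML (see main.js below).
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items.map((item) => item.url));
})();
```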

main.js

const Apify = require('apify');

Apify.main(async () => {
    const input = await Apify.getInput();
 
    const requestList = await Apify.openRequestList('my-list', input.requestListSources);
    const proxyConfiguration = await Apify.createProxyConfiguration(input.proxyConfiguration);

    // For each successfully loaded page: optionally wait for a selector, then store the page's HTML in the default dataset.
    const handlePageFunction = async ({ request, response, page }) => {
        const { waitForSelector } = request.userData;

        if (waitForSelector) {
            await page.waitForSelector(waitForSelector);
        }
    
        await Apify.pushData({
            url: request.url,
            finishedAt: new Date(),
            fullHtml: await page.content(),
            html: await page.evaluate(() => document.body.outerHTML),
            '#debug': Apify.utils.createRequestDebugInfo(request, response),
            '#error': false,
        });
    };
    
    // Requests that exhaust all retries are stored with '#error': true so failures remain visible in the dataset.
    const handleFailedRequestFunction = async ({ request }) => {
        await Apify.pushData({
            url: request.url,
            finishedAt: new Date(),
            '#debug': Apify.utils.createRequestDebugInfo(request),
            '#error': true,
        });
    };

    const puppeteerCrawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction,
        handleFailedRequestFunction,
        proxyConfiguration,
        useSessionPool: true,
        sessionPoolOptions: {
            sessionOptions: {
                maxErrorScore: 0.5, // low threshold: retire a proxy session after its first error
            },
        },
        browserPoolOptions: {
            useFingerprints: true, // generate browser fingerprints to reduce blocking
            retireBrowserAfterPageCount: 1, // retire each browser after a single page
            maxOpenPagesPerBrowser: 1, // one page per browser, so each page uses its own proxy session (IP)
        },
        persistCookiesPerSession: false,
        maxRequestRetries: typeof input.maxRequestRetries === 'number' ? input.maxRequestRetries : 1,
        handlePageTimeoutSecs: input.handlePageTimeoutSecs || 60,
        launchContext: {
            useChrome: input.useChrome || false,
            launchOptions: {
                headless: false, // headful browser (the Apify base image starts Xvfb)
                ignoreHTTPSErrors: true,
                args: ['--ignore-certificate-errors']
            }
        },
    });
    
    await puppeteerCrawler.run();
});
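
Based on the `pushData` calls above, a successful record in the default dataset looks roughly like the following (the `#debug` object is produced by `Apify.utils.createRequestDebugInfo` and its exact fields may vary):

```json
{
    "url": "https://example.com",
    "finishedAt": "2022-01-01T12:00:00.000Z",
    "fullHtml": "<!DOCTYPE html><html>...</html>",
    "html": "<body>...</body>",
    "#debug": { "requestId": "...", "url": "https://example.com", "retryCount": 0 },
    "#error": false
}
```

Failed requests produce a similar record with `"#error": true` and no HTML fields.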

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^2.3.2",
        "puppeteer": "^13"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}