
Download HTML from URLs

mtrunkat/url-list-download-html

This actor takes a list of URLs and downloads the HTML of each page.


Author: Marek Trunkát
  • Users: 419
  • Runs: 161,438

Dockerfile

FROM apify/actor-node-puppeteer-chrome:16

COPY package*.json ./

RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

COPY . ./

ENV APIFY_DISABLE_OUTDATED_WARNING=1
ENV npm_config_loglevel=silent

INPUT_SCHEMA.json

{
    "title": "Input",
    "type": "object",
    "description": "Use the following form to configure this scraper. The URL list is required and all other fields are optional.",
    "schemaVersion": 1,
    "properties": {
        "requestListSources": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources",
            "minItems": 1
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Choose to use no proxy, Apify Proxy, or provide custom proxy URLs.",
            "prefill": { "useApifyProxy": true },
            "default": {},
            "editor": "proxy"
        },
        "handlePageTimeoutSecs": {
            "title": "Page timeout",
            "type": "integer",
            "description": "Maximum time the scraper will spend processing one page.",
            "minimum": 1,
            "default": 60,
            "maximum": 360,
            "unit": "secs"
        },
        "maxRequestRetries": {
            "title": "Maximum request retries",
            "description": "How many retries before giving up.",
            "default": 1,
            "prefill": 1,
            "type": "integer",
            "editor": "number"
        },
        "useChrome": {
            "title": "Use Chrome",
            "type": "boolean",
            "description": "The scraper will use a real Chrome browser instead of a Chromium masking as Chrome. Using this option may help with bypassing certain anti-scraping protections, but risks that the scraper will be unstable or not work at all.",
            "default": false,
            "groupCaption": "Browser masking options",
            "groupDescription": "Settings that help mask as a real user and prevent scraper detection."
        }
    },
    "required": ["requestListSources", "proxyConfiguration"]
}
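
For illustration, here is an input object that sets every field defined by the schema above (the values are only examples, not recommendations):

```json
{
    "requestListSources": [
        { "url": "https://apify.com" },
        { "url": "https://example.com" }
    ],
    "proxyConfiguration": { "useApifyProxy": true },
    "handlePageTimeoutSecs": 60,
    "maxRequestRetries": 1,
    "useChrome": false
}
```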

README.md

This actor scrapes the full HTML of every given URL.
In the input you can define which proxies should be used.

Additionally, you can define a selector that the scraper will wait for on each
of the URLs by adding a `waitForSelector` field to the `userData` property
of the request:

```json
{
    "requestListSources": [{
        "url": "https://example.com",
        "userData": {
            "waitForSelector": ".class-i-want-to-wait-for"
        }
    }]
}
```
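
The same input can also be passed when calling the actor programmatically. Below is a minimal sketch using the `apify-client` NPM package; the actor ID matches this listing, while the token environment variable is a placeholder:

```js
// Sketch: run the actor via the Apify API and read its dataset.
// Assumes the `apify-client` package is installed and APIFY_TOKEN is set.
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

(async () => {
    // Start the actor run and wait for it to finish.
    const run = await client.actor('mtrunkat/url-list-download-html').call({
        requestListSources: [
            { url: 'https://example.com', userData: { waitForSelector: '.class-i-want-to-wait-for' } },
        ],
        proxyConfiguration: { useApifyProxy: true },
    });

    // Each dataset item contains the downloaded HTML (see main.js below).
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items.map((item) => item.url));
})();
```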

main.js

const Apify = require('apify');

Apify.main(async () => {
    const input = await Apify.getInput();
 
    const requestList = await Apify.openRequestList('my-list', input.requestListSources);
    const proxyConfiguration = await Apify.createProxyConfiguration(input.proxyConfiguration);

    // For each successfully loaded page: optionally wait for a selector, then store the page's HTML in the default dataset.
    const handlePageFunction = async ({ request, response, page }) => {
        const { waitForSelector } = request.userData;

        if (waitForSelector) {
            await page.waitForSelector(waitForSelector);
        }
    
        await Apify.pushData({
            url: request.url,
            finishedAt: new Date(),
            fullHtml: await page.content(),
            html: await page.evaluate(() => document.body.outerHTML),
            '#debug': Apify.utils.createRequestDebugInfo(request, response),
            '#error': false,
        });
    };
    
    // Requests that exhaust all retries are stored with '#error': true so failures remain visible in the dataset.
    const handleFailedRequestFunction = async ({ request }) => {
        await Apify.pushData({
            url: request.url,
            finishedAt: new Date(),
            '#debug': Apify.utils.createRequestDebugInfo(request),
            '#error': true,
        });
    };

    const puppeteerCrawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction,
        handleFailedRequestFunction,
        proxyConfiguration,
        useSessionPool: true,
        sessionPoolOptions: {
            sessionOptions: {
                maxErrorScore: 0.5, // low threshold: retire a proxy session after its first error
            },
        },
        browserPoolOptions: {
            useFingerprints: true, // generate browser fingerprints to reduce blocking
            retireBrowserAfterPageCount: 1, // retire each browser after a single page
            maxOpenPagesPerBrowser: 1, // one page per browser, so each page uses its own proxy session (IP)
        },
        persistCookiesPerSession: false,
        maxRequestRetries: typeof input.maxRequestRetries === 'number' ? input.maxRequestRetries : 1,
        handlePageTimeoutSecs: input.handlePageTimeoutSecs || 60,
        launchContext: {
            useChrome: input.useChrome || false,
            launchOptions: {
                headless: false, // headful browser (the Apify base image starts Xvfb)
                ignoreHTTPSErrors: true,
                args: ['--ignore-certificate-errors']
            }
        },
    });
    
    await puppeteerCrawler.run();
});
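
Based on the `pushData` calls above, a successful record in the default dataset looks roughly like the following (the `#debug` object is produced by `Apify.utils.createRequestDebugInfo` and its exact fields may vary):

```json
{
    "url": "https://example.com",
    "finishedAt": "2022-01-01T12:00:00.000Z",
    "fullHtml": "<!DOCTYPE html><html>...</html>",
    "html": "<body>...</body>",
    "#debug": { "requestId": "...", "url": "https://example.com", "retryCount": 0 },
    "#error": false
}
```

Failed requests produce a similar record with `"#error": true` and no HTML fields.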

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^2.3.2",
        "puppeteer": "^13"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}