Prazdne Domy Scraper

lukaskrivka/prazdne-domy

Simple scraper for https://prazdnedomy.cz, which gathers old, valuable, but uninhabited houses.

Author: Lukáš Křivka
  • Users: 6
  • Runs: 7

Dockerfile

# This Dockerfile contains instructions on how to build a Docker image that
# will contain all the code and configuration needed to run your actor.
# For a full Dockerfile reference,
# see https://docs.docker.com/engine/reference/builder/

# First, specify the base Docker image. Apify provides the following
# base images for your convenience:
#  apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast)
#  apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
#  apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
# For more information, see https://apify.com/docs/actor#base-images
# Note that you can use any other image from Docker Hub.
FROM apify/actor-node-basic

# Copy all files and directories with the source code
COPY . ./

# Install NPM packages, skipping optional and development dependencies to
# keep the image small. Avoid logging too much, and print the dependency
# tree for debugging.
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && npm list \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Specify how to run the source code
CMD npm start

README.md

# My beautiful actor

The README documents what your actor does and how to use it,
and is displayed in the app or library. It's always a good
idea to write a good README.md: in a few months, not even you
will remember all the details about the actor.

You can use [Markdown](https://www.markdownguide.org/cheat-sheet)
for rich formatting.

package.json

{
    "name": "my-actor",
    "version": "0.0.1",
    "dependencies": {
        "apify": "^0.13.7"
    },
    "scripts": {
        "start": "node main.js"
    },
    "author": "Me!"
}

INPUT_SCHEMA.json

{
    "title": "My input schema",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "myField": {
            "title": "My input field",
            "type": "string",
            "nullable": false,
            "description": "This is a customizable description providing help to the users of your actor.",
            "editor": "textarea"
        }
    }
}

main.js

// This is the main Node.js source code file of your actor.
// It is referenced from the "scripts" section of the package.json file.

const Apify = require('apify');

Apify.main(async () => {
    // Get input of the actor. Input fields can be modified in INPUT_SCHEMA.json file.
    // For more information, see https://apify.com/docs/actor/input-schema
    const input = await Apify.getInput();
    console.log('Input:');
    console.dir(input);

    // Here you can prepare the input for the actor apify/cheerio-scraper. This input is based on
    // the actor task you used as the starting point.
    const metamorphInput = {
        "startUrls": [
          {
            "url": "https://prazdnedomy.cz/domy/objekty/?paginator-page=1",
            "method": "GET"
          }
        ],
        "useRequestQueue": true,
        "pseudoUrls": [
          {
            "purl": "https://prazdnedomy.cz/domy/objekty/?paginator-page=[\\d+]",
            "method": "GET"
          }
        ],
        "linkSelector": "a",
        "pageFunction": "async function pageFunction(context) {\n    const { request, $ } = context;\n    let result = [];\n    $('.estates-list .estate').each(function(i) {\n        let typ = null;\n        let stav = null;\n        let gps = null;\n        let adresa = null;\n\n        $(this).find('.icons .icon').each(function() {\n            const maybeHtml = $(this).attr('title')\n            if (!maybeHtml) return\n            const maybeTyp = maybeHtml.match(/<td>Typ: <\\/td><td>(.+?)<\\/td>/)\n            if (maybeTyp) {\n                typ = maybeTyp[1]\n            }\n            const maybeStav = maybeHtml.match(/<td>Stav: <\\/td><td>(.+)<\\/td>/)\n            if (maybeStav) {\n                stav = maybeStav[1]\n                return\n            }\n            \n            const maybeGps = maybeHtml.match(/\\d+°.+''/)\n            if (maybeGps) {\n                gps = maybeGps[0]\n                adresa = maybeHtml.replace(gps, '')\n            }\n            \n        })\n\n        result.push({\n            title: $(this).find('.content .title').text().trim(),\n            url: 'https://prazdnedomy.cz' + $(this).find('a').attr('href'),\n            typ,\n            stav,\n            gps,\n            adresa,\n        })\n    })\n    return result;\n}",
        "proxyConfiguration": {
          "useApifyProxy": false
        },
        "debugLog": false,
        "ignoreSslErrors": false,
        "useCookieJar": false
      };

    // Now let's metamorph into actor apify/cheerio-scraper using the created input.
    await Apify.metamorph('apify/cheerio-scraper', metamorphInput);
});
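
Because the `pageFunction` above is stored as a single escaped string, its parsing logic is hard to read. The following is a minimal sketch of the per-listing icon parsing, pulled out into a standalone function with the same regular expressions. The helper name `parseIconTitles` and the sample `title` strings are hypothetical, written only for illustration; the real markup on prazdnedomy.cz may differ.

```javascript
// Hypothetical helper mirroring the icon-parsing step of the pageFunction.
// `titles` is the list of `title` attribute values of one listing's icons.
function parseIconTitles(titles) {
    const parsed = { typ: null, stav: null, gps: null, adresa: null };

    for (const title of titles) {
        // Icons without a title attribute carry no data.
        if (!title) continue;

        // Type of the building ("Typ"), captured lazily up to the first </td>.
        const typMatch = title.match(/<td>Typ: <\/td><td>(.+?)<\/td>/);
        if (typMatch) parsed.typ = typMatch[1];

        // Condition of the building ("Stav"). The capture is greedy, as in the
        // original, so it runs up to the LAST </td> in the string.
        const stavMatch = title.match(/<td>Stav: <\/td><td>(.+)<\/td>/);
        if (stavMatch) {
            parsed.stav = stavMatch[1];
            continue; // this icon held the type/condition table, not the address
        }

        // GPS coordinates: from the first "<digits>°" to the last "''".
        const gpsMatch = title.match(/\d+°.+''/);
        if (gpsMatch) {
            parsed.gps = gpsMatch[0];
            // Whatever remains after removing the coordinates is the address.
            parsed.adresa = title.replace(parsed.gps, '');
        }
    }

    return parsed;
}

// Sample input (assumed shape, not taken from the live site).
const sampleTitles = [
    '<table><tr><td>Typ: </td><td>vila</td></tr><tr><td>Stav: </td><td>chátrající</td></tr></table>',
    "Hlavní 1, Praha 50°5'15.0'' 14°25'17.0''",
    undefined,
];

const parsedSample = parseIconTitles(sampleTitles);
console.log(parsedSample);
```

Keeping this logic in a plain function makes it testable outside the scraper. Note that the address is not trimmed, matching the original behavior, so it may keep surrounding whitespace.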