Example Hacker News
This Actor is deprecated

This Actor is unavailable because the developer has decided to deprecate it.

mtrunkat/example-hacker-news

Example crawler for news.ycombinator.com built using Apify SDK.

Dockerfile

# This is a template for a Dockerfile used to run Actors on the Apify platform.
# The base image name below is set during the Actor build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Copy just package.json and package-lock.json first, since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have a fast build if only the source code changed
COPY --chown=myuser:myuser . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

main.js

const Apify = require('apify');

Apify.main(async () => {
    // Open the request queue and enqueue the first URL.
    const requestQueue = await Apify.openRequestQueue();
    const enqueueUrl = async (url) => requestQueue.addRequest({ url });
    await enqueueUrl('https://news.ycombinator.com/');

    // Create the crawler.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,

        launchPuppeteerOptions: {
            liveView: true,
        },

        // This function is executed for each request.
        // If a request fails, it is retried 3 times.
        // The page parameter is Puppeteer's page object with the page already loaded.
        handlePageFunction: async ({ page, request }) => {
            console.log(`Request ${request.url} succeeded!`);

            // Inject jQuery to make data extraction easier.
            await Apify.utils.puppeteer.injectJQuery(page);

            // Extract all posts. This function is executed inside the browser context.
            // $ is the jQuery variable that is actually defined in the browser itself,
            // so an editor warning about an undefined $ can be ignored.
            const data = await page.evaluate(() => {
                const posts = [];
                $('.athing').each(function () {
                    posts.push({
                        rank: Number($(this).find('.rank').text().replace('.', '').trim()),
                        title: $(this).find('.storylink').text().trim(),
                        link: $(this).find('.storylink').attr('href'),
                        domain: $(this).find('.sitestr').text().trim(),
                        // Match both "1 point" and "N points".
                        score: Number($(this).next().find('.score').text().replace(/points?/, '').replace(',', '').trim()),
                        author: $(this).next().find('.hnuser').text().trim(),
                        posted: $(this).next().find('.age').text().trim(),
                        // Match both "1 comment" and "N comments".
                        comments: Number($(this).next().find('a:contains("comment")').text().replace(/comments?/, '').replace(',', '').trim()),
                        url: window.location.href,
                    });
                });
                return posts;
            });

            // Save the data.
            await Apify.pushData(data);

            // Enqueue the next page. page.$eval throws when no .morelink element exists,
            // which is how the last page is detected.
            try {
                const nextHref = await page.$eval('.morelink', (el) => el.href);
                await enqueueUrl(nextHref);
            } catch (err) {
                console.log(`Url ${request.url} is the last page!`);
            }
        },

        // This function is executed once a request has failed 4 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed 4 times`);

            await Apify.pushData({
                url: request.url,
                errors: request.errorMessages,
            });
        },
    });

    // Run the crawler.
    await crawler.run();
});
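
The numeric fields in main.js (rank, score, comments) are all cleaned the same way: strip the label and separators from the scraped text, then convert to a number. As a minimal sketch, that step could be pulled into one helper. The name `parseCount` is hypothetical and not part of the original Actor, and the digits-only approach is an assumption that suits whole-number counts like these.

```javascript
// Hypothetical helper (not part of the original Actor): converts scraped text
// such as "1,234 points", "1 comment" or "12." into a number.
function parseCount(text) {
    // Keep digits only; labels, commas, dots and whitespace are dropped.
    const digits = String(text).replace(/[^0-9]/g, '');
    // Text without any digits (for example a "discuss" link) is reported as 0.
    return digits === '' ? 0 : Number(digits);
}

console.log(parseCount('1,234 points')); // prints 1234
console.log(parseCount('1 comment'));    // prints 1
console.log(parseCount('discuss'));      // prints 0
```

Because `page.evaluate` runs its callback in the browser, a helper like this would have to be defined inside that callback, or applied to the returned strings afterwards in Node.js.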

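The pagination in main.js amounts to a simple loop: take a URL from the queue, extract its posts, and enqueue the "More" link until a page has none. The sketch below mimics that control flow without a browser or the Apify SDK. `fakeSite`, `crawl` and the example.com URLs are hypothetical stand-ins, and the `seen` set only approximates the duplicate handling a real request queue would provide.

```javascript
// Hypothetical in-memory stand-in for the site: each page lists its posts
// and the URL of the next page (null on the last page).
const fakeSite = {
    'https://example.com/news': { posts: ['a', 'b'], next: 'https://example.com/news?p=2' },
    'https://example.com/news?p=2': { posts: ['c'], next: 'https://example.com/news?p=3' },
    'https://example.com/news?p=3': { posts: ['d'], next: null },
};

// Processes URLs first-in, first-out and follows "next" links until none is left.
function crawl(site, startUrl) {
    const queue = [startUrl];
    const seen = new Set(queue); // prevents enqueueing the same URL twice
    const results = [];

    while (queue.length > 0) {
        const url = queue.shift();
        const page = site[url];
        if (!page) continue; // unknown URL: nothing to extract

        // Store one record per post, tagged with the page it came from.
        results.push(...page.posts.map((title) => ({ title, url })));

        if (!page.next) {
            console.log(`Url ${url} is the last page!`);
        } else if (!seen.has(page.next)) {
            seen.add(page.next);
            queue.push(page.next);
        }
    }
    return results;
}

const items = crawl(fakeSite, 'https://example.com/news');
console.log(items.length); // prints 4
```
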
package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "1.0.0"
    },
    "scripts": {
        "start": "node main.js"
    }
}
Developer
Maintained by Community