Example Hacker News
This Actor is unavailable because the developer has decided to deprecate it.
mtrunkat/example-hacker-news
Example crawler for news.ycombinator.com built using Apify SDK.
Dockerfile
# This is a template for a Dockerfile used to run acts in the Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Copy just package.json and package-lock.json first, since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have a fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
main.js
const Apify = require('apify');

Apify.main(async () => {
    // Open the request queue and enqueue the first URL.
    const requestQueue = await Apify.openRequestQueue();
    const enqueueUrl = async url => requestQueue.addRequest({ url });
    await enqueueUrl('https://news.ycombinator.com/');

    // Create crawler.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,

        launchPuppeteerOptions: {
            liveView: true,
        },

        // This function is executed for each request.
        // If a request fails, it is retried 3 times.
        // The page parameter is Puppeteer's Page object with the page already loaded.
        handlePageFunction: async ({ page, request }) => {
            console.log(`Request ${request.url} succeeded!`);

            // Inject jQuery to make data extraction easier.
            await Apify.utils.puppeteer.injectJQuery(page);

            // Extract all posts. This function is executed inside the browser context.
            // $ is the jQuery variable, which is actually defined in the browser itself,
            // so an editor warning about an undefined $ can be ignored.
            const data = await page.evaluate(() => {
                const posts = [];
                $('.athing').each(function() {
                    posts.push({
                        rank: Number($(this).find('.rank').text().replace('.', '').trim()),
                        title: $(this).find('.storylink').text().trim(),
                        link: $(this).find('.storylink').attr('href'),
                        domain: $(this).find('.sitestr').text().trim(),
                        score: Number($(this).next().find('.score').text().replace('points', '').replace(',', '').trim()),
                        author: $(this).next().find('.hnuser').text().trim(),
                        posted: $(this).next().find('.age').text().trim(),
                        comments: Number($(this).next().find('a:contains("comments")').text().replace('comments', '').replace(',', '').trim()),
                        url: window.location.href,
                    });
                });
                return posts;
            });

            // Save data.
            await Apify.pushData(data);

            // Enqueue the next page. page.$eval throws when no '.morelink'
            // element exists, which means this was the last page.
            try {
                const nextHref = await page.$eval('.morelink', el => el.href);
                await enqueueUrl(nextHref);
            } catch (err) {
                console.log(`Url ${request.url} is the last page!`);
            }
        },

        // If a request has failed 4 times (the first attempt plus 3 retries),
        // this function is executed.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed 4 times`);

            await Apify.pushData({
                url: request.url,
                errors: request.errorMessages,
            });
        },
    });

    // Run crawler.
    await crawler.run();
});
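The numeric fields extracted in main.js (rank, score, comments) are cleaned with chained `replace` calls. Because `String.prototype.replace` with a string pattern only replaces the first match, a value containing more than one comma would not parse cleanly. The following is a minimal sketch of one alternative, not part of the original Actor: `parseCount` is a hypothetical helper name, and returning `null` for text without digits is an assumption (the original code yields 0 in that case).

```javascript
// Hypothetical helper (not part of the original Actor): turn text such as
// "1,234 points" or "56 comments" into a number.
function parseCount(text) {
    // Drop everything except digits, so every thousands separator,
    // label, and whitespace character is removed in one pass.
    const digits = String(text).replace(/[^0-9]/g, '');
    // Assumption: return null instead of 0 when the text has no digits,
    // so that "no value found" can be told apart from a real zero.
    return digits === '' ? null : Number(digits);
}

console.log(parseCount('1,234 points'));      // 1234
console.log(parseCount('56\u00a0comments'));  // 56
console.log(parseCount('discuss'));           // null
```

To use something like this in the crawler, the helper would have to be defined inside the `page.evaluate` callback, since that callback runs in the browser context and cannot see functions from the Node.js scope.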
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "1.0.0"
    },
    "scripts": {
        "start": "node main.js"
    }
}
Developer
Maintained by Community