
Example Hacker News (Deprecated)

Example crawler for news.ycombinator.com built using Apify SDK.

Rating: 0.0 (0)
Pricing: Pay per usage
Total users: 134
Monthly users: 1
Last modified: 2 years ago
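The crawler below pushes one object per Hacker News post to the default dataset. A single item has roughly this shape (the field names come from the extraction code in main.js; the values here are illustrative placeholders, not real crawler output):

```json
{
    "rank": 1,
    "title": "Example post title",
    "link": "https://example.com/article",
    "domain": "example.com",
    "score": 123,
    "author": "someuser",
    "posted": "2 hours ago",
    "comments": 45,
    "url": "https://news.ycombinator.com/"
}
```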
Dockerfile
# This is a template for a Dockerfile used to run acts in the Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have a fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
main.js
const Apify = require('apify');

Apify.main(async () => {
    // Get the queue and enqueue the first url.
    const requestQueue = await Apify.openRequestQueue();
    const enqueueUrl = async url => requestQueue.addRequest({ url });
    await enqueueUrl('https://news.ycombinator.com/');

    // Create the crawler.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,

        launchPuppeteerOptions: {
            liveView: true,
        },

        // This function is executed for each request.
        // If a request fails, it is retried up to 3 times.
        // The page parameter is Puppeteer's Page object with the loaded page.
        handlePageFunction: async ({ page, request }) => {
            console.log(`Request ${request.url} succeeded!`);

            // Inject jQuery for easier data extraction.
            await Apify.utils.puppeteer.injectJQuery(page);

            // Extract all posts. This function is executed inside the browser context,
            // where $ is the jQuery variable defined on the page itself,
            // so don't worry about the editor warning.
            const data = await page.evaluate(() => {
                const posts = [];
                $('.athing').each(function () {
                    posts.push({
                        rank: Number($(this).find('.rank').text().replace('.', '').trim()),
                        title: $(this).find('.storylink').text().trim(),
                        link: $(this).find('.storylink').attr('href'),
                        domain: $(this).find('.sitestr').text().trim(),
                        score: Number($(this).next().find('.score').text().replace('points', '').replace(',', '').trim()),
                        author: $(this).next().find('.hnuser').text().trim(),
                        posted: $(this).next().find('.age').text().trim(),
                        comments: Number($(this).next().find('a:contains("comments")').text().replace('comments', '').replace(',', '').trim()),
                        url: window.location.href,
                    });
                });
                return posts;
            });

            // Save the data.
            await Apify.pushData(data);

            // Enqueue the next page.
            try {
                const nextHref = await page.$eval('.morelink', el => el.href);
                await enqueueUrl(nextHref);
            } catch (err) {
                console.log(`Url ${request.url} is the last page!`);
            }
        },

        // If a request fails 4 times, this function is executed.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed 4 times`);

            await Apify.pushData({
                url: request.url,
                errors: request.errorMessages,
            });
        },
    });

    // Run the crawler.
    await crawler.run();
});
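The numeric fields (score, comments) are parsed out of display strings like "1,234 points" or "87 comments". That cleanup can be sketched as a small standalone helper (the function name parseCount is ours, introduced only to illustrate the same .replace() chain used inside page.evaluate):

```javascript
// Strip the unit word and the thousands separator from a Hacker News
// count string and convert it to a number, mirroring the chained
// .replace() calls in handlePageFunction above.
function parseCount(text) {
    return Number(
        text
            .replace('points', '')
            .replace('comments', '')
            .replace(',', '')
            .trim()
    );
}

console.log(parseCount('1,234 points'));  // 1234
console.log(parseCount('87 comments'));   // 87
```

Note that, like the original code, this only removes a single comma, so it handles counts up to the millions seen on HN but not arbitrarily large numbers.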
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "1.0.0"
    },
    "scripts": {
        "start": "node main.js"
    }
}