3 days trial then $10.00/month - No credit card required now

Hacker News Data Scraper

epctex/hackernews-scraper

Extract data from Y Combinator's Hacker News based on any search criteria. Crawl the front page, Show HN, Ask HN, news, job listings, and historical data. Get links, titles, comments, ratings, and more!

Actor - Hacker News Scraper

Since Hacker News doesn't provide a good API, this actor should help you to retrieve data from it.

The Hacker News data scraper supports the following features:

Scrape the front page listing - You can scrape the homepage listings, starting from any page you want.
Scrape the newest listing - The latest news can be scraped right away from Hacker News.
Scrape historical data - If you are looking for historical data, you can pick any date you want and scrape the listings for that day.
Scrape listings of Ask HN - If you are specifically looking for an "Ask HN" type of listing, you can target it.
Scrape listings of Show HN - If you are specifically looking for a "Show HN" type of listing, you can target it.
Scrape listing details - You can scrape a single listing.
Scrape Hacker News Algolia - You can retrieve everything from the Hacker News Algolia search website.
Scrape job listings - You can scrape the latest job listings that are posted on Hacker News.

Bugs, fixes, updates, and changelog

This scraper is under active development. If you have a feature request, you can create an issue.

Input Parameters

The input of this scraper should be JSON containing the list of pages on Hacker News that should be visited. Possible fields are:

startUrls: (Optional) (Array) List of Hacker News URLs. Provide only news list, job list, or item detail URLs.
mode: (Optional) (String) Mode of the actor. Can be FRONTPAGE, NEWEST, ASK, SHOW, JOBS, or PAST.
enableCommentHierarchy: (Optional) (Boolean) Enables the comment hierarchy for posts. The actor retrieves all comments and builds a tree from the replies.
endPage: (Optional) (Number) Number of the last page that you want to scrape. The default is no limit. This applies to each search request and each of the startUrls individually.
maxItems: (Optional) (Number) Maximum number of items to scrape. This is useful when you go through big lists or search results.
proxy: (Required) (Proxy Object) Proxy configuration.
extendOutputFunction: (Optional) (String) Function that takes a jQuery handle ($) as an argument and returns an object with data.

This solution requires proxy servers. You can use either your own proxy servers or Apify Proxy.
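The parameter list above can be turned into a small client-side helper that assembles the input and catches obvious mistakes before a run is started. This is a minimal sketch: the function name build_input and its checks are illustrative assumptions, not the actor's own validation, and the "url" wrapper for start URLs follows the input example further down.

```python
import json

# Mode names are taken from the parameter list above.
ALLOWED_MODES = {"FRONTPAGE", "NEWEST", "ASK", "SHOW", "JOBS", "PAST"}


def build_input(proxy, start_urls=None, mode=None, end_page=None,
                max_items=None, enable_comment_hierarchy=False):
    """Assemble an input dict (hypothetical helper, not part of the actor)."""
    # proxy is the only field the parameter list marks as required.
    if not isinstance(proxy, dict) or not proxy:
        raise ValueError("proxy is required, e.g. {'useApifyProxy': True}")
    if mode is not None and mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}")
    for name, value in (("endPage", end_page), ("maxItems", max_items)):
        if value is not None and (not isinstance(value, int) or value < 1):
            raise ValueError(f"{name} must be a positive integer")

    actor_input = {
        "proxy": proxy,
        "enableCommentHierarchy": bool(enable_comment_hierarchy),
    }
    if start_urls:
        # Each URL is wrapped in an object with a "url" key.
        actor_input["startUrls"] = [{"url": url} for url in start_urls]
    if mode is not None:
        actor_input["mode"] = mode
    if end_page is not None:
        actor_input["endPage"] = end_page
    if max_items is not None:
        actor_input["maxItems"] = max_items
    return actor_input


if __name__ == "__main__":
    example = build_input({"useApifyProxy": True}, mode="ASK",
                          end_page=1, max_items=50)
    print(json.dumps(example, indent=4))
```

Optional fields that are left unset are omitted from the result, so the actor's own defaults still apply.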

Tip

When you want to scrape a specific listing URL, just copy and paste the link as one of the startUrls.

If you would like to scrape only the first page of a list, provide the link for that page and set endPage to 1.

The same approach works for any interval of pages. If you provide the 5th page of a list as the start URL and set the endPage parameter to 6, only the 5th and 6th pages are scraped.
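The page-interval tip can be sketched as a helper that produces the matching input fragment. The helper name page_range_input is hypothetical, and the code assumes that list pages on news.ycombinator.com are addressed with a `p` query parameter (for example news?p=5); check the URL in your browser before relying on it.

```python
from urllib.parse import urlencode


def page_range_input(first_page, last_page, list_path="news"):
    """Return an input fragment covering first_page..last_page inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("need 1 <= first_page <= last_page")
    # Assumption: the list URL takes the page number in a `p` query parameter.
    query = urlencode({"p": first_page})
    url = f"https://news.ycombinator.com/{list_path}?{query}"
    # endPage is the number of the last page to scrape, not a page count.
    return {"startUrls": [{"url": url}], "endPage": last_page}


if __name__ == "__main__":
    print(page_range_input(5, 6))
```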

If you would like to scrape historical data (e.g. 2020-03-18), go to Hacker News, click on the "Past" tab, and find the URL that you are looking for. Then use the link as one of the startUrls. Historical URLs have this format: https://news.ycombinator.com/front?day=2020-03-18
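Instead of clicking through the "Past" tab, the historical URL can be built from a date using the format shown above. A minimal sketch, with past_front_page_url as a hypothetical helper name:

```python
from datetime import date


def past_front_page_url(day):
    """Build the historical front-page URL for a calendar day."""
    if isinstance(day, str):
        # Raises ValueError if the string is not a valid YYYY-MM-DD date.
        day = date.fromisoformat(day)
    return f"https://news.ycombinator.com/front?day={day.isoformat()}"


if __name__ == "__main__":
    print(past_front_page_url("2020-03-18"))
```

The returned string can be used directly as the "url" value of a startUrls entry.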

Compute Unit Consumption

The actor is optimized for speed and scrapes as many listings as possible. To do so, it prioritizes listing detail requests. If the actor is not blocked often, it scrapes about 100 listings per minute using ~0.03-0.04 compute units.
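For rough budgeting, the figures above can be scaled to a larger run. The sketch below assumes consumption and duration grow linearly with the number of listings, which the source does not guarantee (blocking and retries would raise both); estimate_run is a hypothetical helper.

```python
def estimate_run(listings, cu_per_100=(0.03, 0.04), listings_per_minute=100):
    """Linearly scale the quoted figures: ~0.03-0.04 CU and 1 minute per 100 listings."""
    if listings < 0:
        raise ValueError("listings must be non-negative")
    low, high = cu_per_100
    return {
        "cu_low": listings / 100 * low,
        "cu_high": listings / 100 * high,
        "minutes": listings / listings_per_minute,
    }


if __name__ == "__main__":
    print(estimate_run(1000))
```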

Hacker News Scraper Input example

{
    "startUrls": [
        {
            "url": "https://news.ycombinator.com/item?id=26501527"
        },
        {
            "url": "https://news.ycombinator.com/front?day=2020-03-18"
        }
    ],
    "mode": "FRONTPAGE",
    "enableCommentHierarchy": false,
    "proxy": {
        "useApifyProxy": true
    },
    "endPage": 1,
    "maxItems": 100
}
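An input like the one above could be submitted over HTTP. The sketch below only prepares the request and does not send it. It rests on assumptions that should be checked against the Apify API reference: that runs are started with a POST to /v2/acts/{actorId}/runs, that the actor ID is written with "~" in place of "/", and that the token may be passed as a Bearer header. build_run_request is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: API paths use "~" between the user name and the actor name.
ACTOR_ID = "epctex~hackernews-scraper"


def build_run_request(actor_input, token):
    """Prepare (but do not send) a request that would start a run."""
    # Assumption: endpoint shape as described in the lead-in.
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request(
        {"mode": "FRONTPAGE", "endPage": 1, "proxy": {"useApifyProxy": True}},
        token="YOUR_API_TOKEN",  # placeholder, not a real token
    )
    print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually send it.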

During the Run

During the run, the actor outputs messages that let you know what is going on. Each message contains a short label specifying which page from the provided list is currently being processed. When items are loaded from a page, you should see a message with the loaded item count and the total item count for that page.

If you provide incorrect input to the actor, it will immediately stop with a failure state and output an explanation of what is wrong.

Hacker News Export

During the run, the actor stores results in a dataset. Each scraped listing is a separate item in the dataset.

You can manage the results in any language (Python, PHP, Node JS/NPM). See the FAQ or our API reference to learn more about getting results from this Hacker News actor.
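As one way of handling the results in Python, the sketch below flattens dataset items that have already been loaded as a list of dicts into CSV. The column names come from the item examples in the "Scraped Hacker News Properties" section; the helper name items_to_csv and the choice of columns are assumptions.

```python
import csv
import io

# Scalar fields that appear in the job and news listing examples.
FIELDS = ["id", "title", "link", "points", "numberOfComments", "age", "scrapedAt"]


def items_to_csv(items):
    """Render listing items as CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        extrasaction="ignore",  # skip nested fields such as "comments"
        restval="",             # job listings have no points or comment count
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


if __name__ == "__main__":
    job = {
        "id": "26437893",
        "title": "Substack (YC W18) is hiring to build a better business model for writing",
        "link": "https://substack.com/jobs",
        "age": "7 days ago",
        "scrapedAt": "2021-03-19T21:56:00.085Z",
    }
    print(items_to_csv([job]))
```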

Scraped Hacker News Properties

The structure of each item in Hacker News listings looks like this:

Job Listings

{
    "id": "26437893",
    "title": "Substack (YC W18) is hiring to build a better business model for writing",
    "link": "https://substack.com/jobs",
    "age": "7 days ago",
    "scrapedAt": "2021-03-19T21:56:00.085Z"
}

Single Comment

{
    "scrapedType": "comment",
    "userName": "tsl54",
    "userLink": "https://news.ycombinator.com/user?id=tsl54",
    "age": "9 months ago",
    "message": "Congratulations on getting here! I’ve been part of a few management changes from the board level.",
    "comments": []
}
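Each comment carries its own "comments" array, which holds the replies when enableCommentHierarchy is on. A minimal sketch of walking such a tree depth-first, assuming replies are nested under that same key at every level (iter_comments and count_comments are hypothetical helpers):

```python
def iter_comments(comments, depth=0):
    """Yield (depth, comment) pairs in depth-first order."""
    for comment in comments or []:
        yield depth, comment
        # Replies are assumed to live under the same "comments" key.
        yield from iter_comments(comment.get("comments"), depth + 1)


def count_comments(item):
    """Count every comment and reply attached to an item."""
    return sum(1 for _ in iter_comments(item.get("comments")))


if __name__ == "__main__":
    # Made-up sample data in the shape of the comment example.
    post = {
        "comments": [
            {"userName": "a", "message": "top", "comments": [
                {"userName": "b", "message": "reply", "comments": []},
            ]},
            {"userName": "c", "message": "second", "comments": []},
        ]
    }
    for depth, comment in iter_comments(post["comments"]):
        print("  " * depth + comment["userName"] + ": " + comment["message"])
    print(count_comments(post))
```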

News Listings

{
    "id": "26501262",
    "title": "Fancy Defines",
    "link": "https://idiomdrottning.org/fancy-defines",
    "points": 15,
    "postedUserName": "todsacerdoti",
    "postedUserLink": "https://news.ycombinator.com/user?id=todsacerdoti",
    "numberOfComments": 1,
    "comments": [
        {
            "userName": "nerdponx",
            "userLink": "https://news.ycombinator.com/user?id=nerdponx",
            "age": "2 hours ago",
            "message": "Interesting, I didn't know about this at all. Is it that common in Scheme to write functions that immediately return other functions? Seems like an oddly \"blessed\" usage of syntax that IMO could be better used for something like pattern matching.Looking at this example from the linked SRFI [0]:    (define ((greet-with-prefix prefix) suffix)\n      (string-append prefix \" \" suffix))\n\n    (define greet (greet-with-prefix \"Hello\"))\n\n    (greet \"there!\") => \"Hello there!\"\n\nI'm not convinced that this is anything but an obfuscation, compared to the standard R5RS version:    (define (greet-with-prefix suffix)\n      (lambda (prefix)\n        (string-append prefix \" \" suffix)))\n\n    (define greet (greet-with-prefix \"Hello\"))\n\n    (greet \"there!\") => \"Hello there!\"\n\nWhat do the experienced Schemers think?[0]: https://srfi.schemers.org/srfi-219/srfi-219.html"
        }
    ],
    "age": "4 hours ago",
    "scrapedAt": "2021-03-19T21:37:45.462Z"
}
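The "age" field is relative ("4 hours ago"), so an approximate posting time has to be derived from it together with "scrapedAt". The sketch below assumes ages always follow the "<number> <unit>(s) ago" pattern seen in the examples and treats a month as 30 days and a year as 365 days; approximate_posted_at is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Approximation: months and years are given fixed lengths.
_UNIT_SECONDS = {
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}


def approximate_posted_at(item):
    """Return scrapedAt minus the relative age, or None if age is not understood."""
    match = re.fullmatch(r"(\d+) (minute|hour|day|month|year)s? ago", item["age"])
    if match is None:
        return None
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    scraped_at = datetime.fromisoformat(item["scrapedAt"].replace("Z", "+00:00"))
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    return scraped_at - timedelta(seconds=seconds)


if __name__ == "__main__":
    item = {"age": "4 hours ago", "scrapedAt": "2021-03-19T21:37:45.462Z"}
    print(approximate_posted_at(item).isoformat())
```

The result is only as precise as the age string: "4 hours ago" could mean anything from four to just under five hours.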

Contact

Visit epctex.com to see all the products that are available to you. If you are looking for a custom integration, reach out to us through the chat box on epctex.com. For support, email devops@epctex.com.

Developer

epctex

Actor metrics

2 monthly users
100.0% runs succeeded
0.0 days response time
Created in Mar 2021
Modified about 23 hours ago

Categories

News

Business

Other
