HackerNews Insights Scraper — Stories, Comments & Velocity

Pricing

from $0.005 / story scraped

Hacker News stories, full comment trees, user karma and contact info, story velocity tracking, history deltas. Search all 3.7M stories with filters for points, karma, domain, dates, keywords. For VCs hunting Show HN, recruiters mining talent, journalists tracking tech, and AI/RAG pipelines.

Developer

Yuliia Kulakova

Maintained by Community

HackerNews Insights Scraper

Stories, comments, user karma, velocity tracking, and contact intelligence — turn Hacker News into a structured intelligence feed.


Why this scraper

Hacker News is the single highest-signal community in tech: where launches break, where engineers vent, where investors hunt. But the site itself gives you a ranked list and a thread view — no filters, no trends, no exports, no way to track a story's momentum or pull a list of senior commenters by domain expertise.

This scraper turns Hacker News into a structured data feed you can pipe straight into your CRM, dashboard, LLM, or spreadsheet. Search the entire 3.7M-story archive, filter by score / karma / domain / keywords, pull full comment trees with depth and reply analytics, enrich author profiles with contact info, and track how stories grow between runs.


What you get

Stories with rich context

  • Top, Best, New, Ask HN, Show HN, and Job listings
  • Title, URL, domain, body text, author, score, comment count, submission time
  • Auto-tagged: story, show_hn, ask_hn, job
  • Permalinks straight back to news.ycombinator.com

Full-text search across all 3.7M Hacker News stories

  • Search by keyword across the entire HN archive (2007 → today)
  • Filter by tag, date range, score threshold, and author karma
  • Sort by popularity or newest

Complete comment trees with analytics

  • Recursive thread fetch with configurable depth (1–10 levels)
  • Hard cap per story so viral threads don't explode your budget
  • Per-comment: author, body text, parent, depth, reply count, timestamps
  • Per-story analytics: max depth, average replies per node, top 5 commenters by reply count
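
The interaction between the depth limit and the per-story cap can be sketched as a breadth-first walk. This is only an illustration under stated assumptions: `collect_comments` and the `children` mapping are hypothetical names, and the mapping stands in for whatever upstream fetches the scraper actually performs. The defaults mirror the documented ones (depth 3, cap 200).

```python
from collections import deque


def collect_comments(children, root_id, max_depth=3, max_comments=200):
    """Breadth-first walk of a comment tree with a depth limit and a hard cap.

    `children` maps an item ID to the list of its direct reply IDs
    (a hypothetical in-memory stand-in for upstream fetches).
    """
    collected = []
    # Direct replies to the story are depth 1.
    queue = deque((kid, 1) for kid in children.get(root_id, []))
    while queue and len(collected) < max_comments:
        comment_id, depth = queue.popleft()
        replies = children.get(comment_id, [])
        collected.append(
            {"id": comment_id, "depth": depth, "replyCount": len(replies)}
        )
        # Only descend while we are still above the depth limit.
        if depth < max_depth:
            queue.extend((kid, depth + 1) for kid in replies)
    return collected
```

Because the walk is breadth-first, hitting the cap on a very large thread keeps the shallow comments and drops the deepest leaves first.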

User intelligence

  • Karma, account age, total submission count
  • Recent submission IDs (last 50)
  • Dominant domains the user posts about (signal for "what they're into")
  • Average score on their last 20 stories
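
As a rough illustration of the last two signals, the sketch below derives dominant domains and an average score from a list of a user's story records. The function name `activity_summary`, the output keys, and the newest-first ordering are assumptions made for this example, not the scraper's documented behavior.

```python
from collections import Counter


def activity_summary(stories, recent_n=20, top_domains=3):
    """Summarize a user's stories; `stories` is assumed to be newest first."""
    # Self-posts have no domain, so they are left out of the domain count.
    domains = Counter(s["domain"] for s in stories if s.get("domain"))
    recent = stories[:recent_n]
    average = sum(s["score"] for s in recent) / len(recent) if recent else None
    return {
        "dominantDomains": [d for d, _ in domains.most_common(top_domains)],
        "averageScore": average,
    }
```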

Contact extraction from user bios

  • Emails, X/Twitter handles, GitHub, LinkedIn, Mastodon
  • Personal websites (auto-classified separately from social profiles)
  • Smart parsing that ignores false positives (an email like thomas@fly.io won't produce a fake @fly Twitter handle)
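
One plausible way to avoid the email false positive is to strip emails out of the bio before looking for handles. The patterns and the `extract_contacts` helper below are assumptions for illustration; the scraper's real parser is not documented here and may work differently.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A handle must not directly follow a character that could end an email local part.
HANDLE_RE = re.compile(r"(?<![A-Za-z0-9._%+-])@([A-Za-z0-9_]{1,15})\b")
TWITTER_URL_RE = re.compile(r"(?<![A-Za-z0-9-])twitter\.com/([A-Za-z0-9_]{1,15})\b")


def extract_contacts(bio):
    """Return emails and Twitter handles found in a free-text bio."""
    emails = EMAIL_RE.findall(bio)
    # Blank out emails first so their "@domain" part can never match as a handle.
    without_emails = EMAIL_RE.sub(" ", bio)
    handles = set(HANDLE_RE.findall(without_emails))
    handles.update(TWITTER_URL_RE.findall(without_emails))
    return {"emails": emails, "twitter": sorted(handles)}
```

With this approach a bio containing only `thomas@fly.io` yields one email and no Twitter handle, while a standalone mention such as the made-up `@example_dev` is picked up.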

Story velocity tracking

  • Points per hour and comments per hour since submission
  • Points-to-comments ratio (viral vs. controversial signal)
  • Story age in hours
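
The velocity metrics above can be approximated with simple arithmetic on a story's score, comment count, and submission time. This is a sketch under assumptions: the exact formulas the scraper uses are not stated, so the helper `velocity` simply divides by the story's age and treats timestamps without a timezone as UTC.

```python
from datetime import datetime, timezone


def velocity(score, comments, created_at, now=None):
    """Compute per-hour rates since submission from an ISO timestamp."""
    # Accept a trailing "Z" and treat naive timestamps as UTC (an assumption).
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    age_hours = (now - created).total_seconds() / 3600

    def per_hour(count):
        # Avoid dividing by zero for a story submitted this instant.
        return count / age_hours if age_hours > 0 else None

    return {
        "ageHours": age_hours,
        "pointsPerHour": per_hour(score),
        "commentsPerHour": per_hour(comments),
        "pointsToCommentsRatio": score / comments if comments else None,
    }
```

For instance, a story with 200 points and 50 comments that is 10 hours old works out to 20 points per hour, 5 comments per hour, and a points-to-comments ratio of 4.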

History delta — track stories across runs

  • Persistent snapshot store keyed by story ID
  • On every subsequent run, each story includes: scoreDelta, commentsDelta, scorePerHour, commentsPerHour, trend (up/down/flat)
  • See exactly how a Show HN gained traction overnight or how a controversial post peaked and stalled
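
A minimal sketch of how a snapshot store keyed by story ID could produce these fields follows. The `track` function, the in-memory dict used as the store, the snapshot field names, and the rule that `trend` follows the sign of the score change are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def track(store, story_id, score, comments, taken_at):
    """Record a snapshot and return the delta against the previous one."""
    previous = store.get(story_id)
    store[story_id] = {"score": score, "comments": comments, "takenAt": taken_at}
    if previous is None:
        return None  # first run: nothing to compare against
    hours = (taken_at - previous["takenAt"]).total_seconds() / 3600
    score_delta = score - previous["score"]
    comments_delta = comments - previous["comments"]
    trend = "up" if score_delta > 0 else "down" if score_delta < 0 else "flat"
    return {
        "scoreDelta": score_delta,
        "commentsDelta": comments_delta,
        "scorePerHour": score_delta / hours if hours > 0 else None,
        "commentsPerHour": comments_delta / hours if hours > 0 else None,
        "trend": trend,
    }


# Two runs six hours apart for the same (made-up) story ID.
store = {}
first_run = datetime(2025, 1, 1, tzinfo=timezone.utc)
track(store, "1", 120, 40, first_run)
delta = track(store, "1", 180, 55, first_run + timedelta(hours=6))
```

Here the second call reports a gain of 60 points and 15 comments over six hours, so the per-hour rates are 10 and 2.5 and the trend is "up".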

Input-time filters (the headline differentiator)

  • Minimum score, minimum comments, minimum author karma
  • Date range (from / to)
  • Keyword include list (case-insensitive title + body match)
  • Domain whitelist (only stories pointing to github.com, arxiv.org, etc.)
  • Filters apply before expensive fetches — you only pay for records that pass
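
Conceptually, these filters behave like a cheap predicate that runs on each story before any comment or profile fetch. The sketch below is a hypothetical `passes_filters` helper operating on the story fields listed in the output section; it covers score, comments, keywords, and domains only, and it assumes the keyword list matches when any one keyword is present.

```python
def passes_filters(story, min_points=0, min_comments=0, keywords=(), domains=()):
    """Return True when a story record survives the configured filters."""
    if story["score"] < min_points or story["descendants"] < min_comments:
        return False
    if domains and story.get("domain") not in domains:
        return False  # self-posts have no domain, so a whitelist drops them
    if keywords:
        # Case-insensitive match against title plus body text.
        haystack = f'{story.get("title") or ""} {story.get("text") or ""}'.lower()
        if not any(k.lower() in haystack for k in keywords):
            return False
    return True
```

Running the predicate first means the more expensive work only happens for stories that will actually end up in the dataset.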

Use cases

| Who | What they pull |
| --- | --- |
| VCs & angel investors | Show HN deal flow: every product launch with > 100 points, plus maker contact and velocity since launch |
| Recruiters | High-karma authors who post about specific domains (Rust, ML, infrastructure), with surfaced contact info |
| Tech journalists | Trending stories from arxiv.org, github.com, or competitor domains; sentiment via comment trees |
| PR & comms teams | Track when your company / product gets mentioned; full comment thread for response strategy |
| AI / RAG engineers | High-signal, opinion-rich training and retrieval data: full comments, not just titles |
| Startup founders | Competitor monitoring; see what users are saying about adjacent products in threads |
| Product managers | Pull all "Ask HN: how do you…" threads in your category for organic user research |
| Open-source maintainers | Find every HN discussion of your project across years; see which features users actually care about |

Quick start

Drop this into the Input panel and run:

{
  "lists": ["top"],
  "maxStoriesPerList": 30
}

You'll get 30 top stories with velocity analytics and tag classification — typically in under 15 seconds.
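
The same input can also be sent programmatically. The sketch below builds a request for Apify's synchronous "run actor and get dataset items" endpoint using only the standard library. The actor ID and token shown are placeholders, not this actor's real identifier, and the helper names `build_run_request` and `run_actor` are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a request that runs an actor and returns its items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, run_input):
    """Send the request and decode the JSON array of dataset records."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values: substitute the real actor ID and your own API token.
quick_start = {"lists": ["top"], "maxStoriesPerList": 30}
prepared = build_run_request("<username>~<actor-name>", "<APIFY_TOKEN>", quick_start)
```

Calling `run_actor` with real credentials would start a billable run, so the example stops at building the request.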


Common input examples

All Show HN with at least 500 points from the last year

{
  "tagSearchOnly": true,
  "tags": ["show_hn"],
  "minPoints": 500,
  "dateFrom": "2025-06-01",
  "maxStoriesPerQuery": 100,
  "sortBy": "popularity"
}

Track a topic and gather author contacts

{
  "queries": ["LLM observability", "rust async"],
  "minPoints": 50,
  "includeAuthorProfiles": true,
  "includeContactInfo": true,
  "maxStoriesPerQuery": 30
}

Pull a single story with the full comment tree

{
  "storyIds": ["48513806"],
  "includeComments": true,
  "commentDepth": 5,
  "maxCommentsPerStory": 500,
  "includeAuthorProfiles": true
}

Daily monitor with growth tracking

{
  "lists": ["best"],
  "maxStoriesPerList": 50,
  "enableHistory": true,
  "minPoints": 100
}

Run on a schedule. Every run after the first includes history.delta showing how each story has grown.

Domain-specific intelligence (arxiv papers on the front page)

{
  "queries": ["AI", "machine learning"],
  "domains": ["arxiv.org"],
  "minPoints": 100,
  "dateFrom": "2025-01-01",
  "maxStoriesPerQuery": 50
}

Look up specific power users

{
  "userIds": ["pg", "tptacek", "patio11", "dang"],
  "includeContactInfo": true
}

Output overview

Three record types in the dataset:

Story

| Field | Description |
| --- | --- |
| type | "story" or "job" |
| id | Hacker News story ID |
| title | Story title |
| url | External link (null for Ask HN / Tell HN self-posts) |
| domain | Apex domain of the URL |
| text | Body text for Ask HN / Show HN / Tell HN (HTML stripped) |
| by | Author username |
| score | Current points |
| descendants | Total comment count |
| tag | story, show_hn, ask_hn, or job |
| createdAt | ISO timestamp |
| permalink | Link to the story on news.ycombinator.com |
| analytics | pointsPerHour, commentsPerHour, pointsToCommentsRatio, ageHours, plus comment-tree shape stats when comments are fetched |
| history | scoreDelta, commentsDelta, trend, snapshot series (present when history tracking is on) |

Comment

| Field | Description |
| --- | --- |
| type | "comment" |
| id, parent, storyId | Comment ID, immediate parent, root story |
| by, text | Author and full comment body (HTML stripped) |
| depth | 1 = direct reply, 2 = reply-to-reply, etc. |
| replyCount | Number of direct child replies |
| createdAt, permalink | Timestamp and link |
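
Given comment records with these fields, the per-story tree statistics mentioned earlier (max depth, average replies per node, top commenters) can be recomputed downstream. The sketch below assumes that "top commenters by reply count" means authors ranked by the total direct replies their comments received; `tree_stats` and its output keys are hypothetical names.

```python
from collections import Counter


def tree_stats(comments, top_n=5):
    """Compute simple shape statistics from a story's comment records."""
    if not comments:
        return {"maxDepth": 0, "avgRepliesPerNode": 0.0, "topCommenters": []}
    replies_by_author = Counter()
    for comment in comments:
        replies_by_author[comment["by"]] += comment["replyCount"]
    return {
        "maxDepth": max(c["depth"] for c in comments),
        "avgRepliesPerNode": sum(c["replyCount"] for c in comments) / len(comments),
        "topCommenters": [a for a, _ in replies_by_author.most_common(top_n)],
    }
```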

User

| Field | Description |
| --- | --- |
| type | "user" |
| username | HN handle |
| karma | Total karma |
| about, aboutHtml | Bio (cleaned and original) |
| createdAt | When the account was created |
| submittedCount | Lifetime submission count |
| recentSubmittedIds | Last 50 submission IDs |
| contactInfo | emails, twitter, github, linkedin, mastodon, websites |
| activity | Recent activity sample, dominant domains, average story score |
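
Because one dataset mixes all three record types, downstream code usually begins by routing records on the `type` field. A minimal sketch, with `split_by_type` as a hypothetical helper name:

```python
from collections import defaultdict


def split_by_type(records):
    """Group dataset records by their `type` field."""
    groups = defaultdict(list)
    for record in records:
        groups[record["type"]].append(record)
    return dict(groups)
```

Job listings arrive with `type` set to "job", so they land in their own group rather than alongside regular stories.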

Pricing

| Charge | Cost |
| --- | --- |
| Actor start | $0.01 per run |
| Story scraped | $0.005 per story (or job listing) |
| Comment scraped | $0.001 per comment |
| User profile scraped | $0.005 per user |

Records are only counted after filters pass: you don't pay for stories that get dropped by minPoints, domains, or the date range. Comment trees and author profiles are opt-in.

Worked examples:

| Scenario | Stories | Comments | Users | Cost |
| --- | --- | --- | --- | --- |
| 50 top stories, no comments | 50 | 0 | 0 | $0.26 |
| 100 Show HN historical search | 100 | 0 | 0 | $0.51 |
| 30 stories + full comment trees (~30 avg) | 30 | ~900 | 0 | $1.06 |
| 1 viral story + 500 comments + author profile | 1 | 500 | 1 | $0.52 |
| Daily best-of-50 monitor with author profiles | 50 | 0 | 50 | $0.51 |
| Deep weekly review: 100 stories + 5000 comments + 100 authors | 100 | 5000 | 100 | $6.01 |
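
The worked examples follow directly from the price list: one start fee per run plus a per-record charge for each type. A small estimator, with `estimate_cost` as a hypothetical helper, reproduces the table. It works in thousandths of a dollar so the sums stay exact.

```python
# Prices from the table above, in thousandths of a dollar.
START_FEE = 10   # $0.01 per run
PER_STORY = 5    # $0.005 per story or job listing
PER_COMMENT = 1  # $0.001 per comment
PER_USER = 5     # $0.005 per user profile


def estimate_cost(stories=0, comments=0, users=0):
    """Estimate the cost of one run in dollars."""
    total = START_FEE + PER_STORY * stories + PER_COMMENT * comments + PER_USER * users
    return total / 1000


deep_weekly_review = estimate_cost(stories=100, comments=5000, users=100)
```

The deep weekly review comes to $0.01 + $0.50 + $5.00 + $0.50 = $6.01, matching the last row.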

Comments are intentionally priced low so that comment-tree analytics and AI/RAG workloads stay affordable.


Proxies

Proxies are included and configured automatically. No setup required.


FAQ

Do I need to work with the Hacker News API directly? No. You don't need an API key or any setup: pass an input, get a dataset. We handle the upstream calls.

How far back does the data go? Comments and stories go back to HN's launch in 2007. Full-text search covers the entire archive.

Will this hit rate limits if I run it often? Hacker News exposes a generous public data surface for scrapers. Per-request throttling is built in. You can safely schedule this every 15 minutes for monitoring use cases.

Can I track sentiment? The scraper returns full comment text. Sentiment is something you'd run downstream (an LLM call, your own classifier, etc.). We don't bundle sentiment to keep pricing flat and the data unopinionated.

Why don't I see Twitter handles for tptacek even though his bio has email addresses? The contact parser is intentionally strict: it won't extract @sockpuppet from the email thomas@sockpuppet.org because that would be a false positive. Real Twitter handles (text like @username written as a standalone mention, or a twitter.com/username URL) are extracted reliably.

How does history tracking work? Turn on enableHistory: true and pick a historyStoreName. On every run, each story's current score and comment count are snapshotted under that name. From the second run onward, every story includes a history.delta block with the change since the previous run, expressed as raw deltas and as per-hour rates.

Why might a story I expected to see not appear in the output? Most often a filter dropped it. Check the log: it prints active filters at the start of every run. Common gotchas: domains set with a self-post (Ask HN has no URL → automatically dropped), dateFrom cutoff too aggressive, or minAuthorKarma filtering out new accounts.

Does this fetch reply chains under deeply nested comments? Yes, up to commentDepth levels (default 3). HN threads sometimes go 8–10 levels deep; raise the limit if you need the full tree, but expect cost to scale with thread size.

Can I export to CSV / XML / RSS? Apify supports all of those formats out of the box — pick your format in the "Export results" panel after a run finishes.

What about private / dead / deleted content? Deleted comments and stories are skipped (you won't see hollow placeholder records). Reply chains beneath a deleted comment are still traversed when present.

Will this work on a free Apify plan? Yes. Typical runs cost cents, well within the free tier's monthly compute budget.


Limits (the honest list)

  • Show HN / Ask HN classification is taken from the title prefix and from HN's own tags. Stories that aren't formally tagged Show HN but include "Show HN" in casual text will be classified as Show HN; this matches HN's own behavior.
  • Comment trees are capped by maxCommentsPerStory (default 200). On the most viral threads (Anthropic-acquires-Bun-tier discussions with 1000+ comments) you'll get the first 200 comments in breadth-first (BFS) order, not every leaf.
  • Comment sentiment / topic extraction is not included. You get the raw text — sentiment is a downstream concern.
  • User contact extraction is best-effort. It scans the bio the user wrote about themselves; if they didn't put their email in there, we can't surface it.
  • Real-time push / streaming is not supported. This is a batch scraper. Schedule it on Apify's cron and pipe to a webhook for "almost-real-time" workflows.
  • No login-required content. Everything we return is public — HN doesn't gate content behind auth in any meaningful way, so this is rarely a problem.

Maintained by brilliant_gum on the Apify platform. Open an issue on the actor page for bugs, feature requests, or pricing questions.