HN Top Stories Scraper

Scrape Hacker News top stories — extract title, URL, score, author, comment count, and submission time. Monitor the HN front page in real time. CSV/JSON export.

  • Pricing: $5.00 / 1,000 story scrapes
  • Rating: 0.0 (0 reviews)
  • Developer: Web Data Labs (Maintained by Community)
  • Actor stats: 0 bookmarks · 4 total users · 3 monthly active users · last modified 5 days ago

Hacker News Top Stories Scraper

Pull the current Hacker News front page (or top / new / best / ask / show / job lists) as structured JSON. Title, URL, score, author, comment count, age, and the discussion thread link — ready for dashboards, digests, alerts, and ML pipelines.

Built by Web Data Labs and hosted on Apify with managed retries and uptime.


Why use a scraper for HN?

Hacker News exposes a public Firebase API — and for many use cases, that API is the right tool. So why does this actor exist?

  • The API gives you IDs, not stories. To assemble a top-30 list with titles, URLs, scores, and comment counts you need 1 list call + 30 item calls + handling for missing/dead/deleted items. This actor does that for you in one call.
  • You don't want to build the plumbing. Caching, retries, rate handling, schema normalization, edge cases (jobs without URLs, polls, "Ask HN" titles, dead items) — all already solved.
  • You want filtering at the source. Minimum score? Story type? Posted in the last N hours? One JSON input field instead of a custom Lambda.
  • You want one consistent output schema across HN, Reddit, Lobsters, Substack, and other community-news sources. This actor's shape matches the rest of the cryptosignals catalogue so you can stitch them together with no glue code.
  • You want it as a job, not a dependency. Runs on a schedule, dumps to a dataset, hits your webhook. No server, no cron, no error pages at 3am.

If you just want to play with HN data interactively, the public API is great. If you want a reliable feed into a product, dashboard, or notification pipeline, this actor is faster.
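For comparison, here is roughly the plumbing the actor replaces: a minimal Python sketch against HN's public Firebase endpoints, with no caching, retries, or rate handling. The `normalize` helper and the output subset shown are illustrative, not the actor's actual code.

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch(path):
    """One call to the public HN Firebase API."""
    with urlopen(f"{BASE}/{path}.json") as resp:
        return json.load(resp)

def normalize(item):
    """Map a raw HN item to a subset of this actor's output shape.
    Returns None for missing/dead/deleted items so callers can drop them."""
    if not item or item.get("dead") or item.get("deleted"):
        return None
    return {
        "id": item["id"],
        "title": item.get("title"),
        "url": item.get("url"),  # absent for Ask HN / text posts
        "score": item.get("score", 0),
        "by": item.get("by"),
        "commentCount": item.get("descendants", 0),
    }

def top_stories(count=30):
    """1 list call + `count` item calls, skipping unusable items."""
    items = (fetch(f"item/{i}") for i in fetch("topstories")[:count])
    return [s for s in map(normalize, items) if s]
```

Add error handling, concurrency, and scheduling on top of this and the build-vs-buy tradeoff becomes clear.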


What you get

Each item in the output represents one Hacker News story:

  • id — HN item ID (canonical, stable).
  • title — Story title as posted.
  • url — Outbound URL (null for Ask HN / Show HN text posts).
  • score — Current upvote score.
  • by — Username of the submitter.
  • commentCount — Number of comments (descendants in HN's terminology).
  • time — Unix timestamp (seconds) of submission.
  • ageHours — Hours since submission (computed at scrape time).
  • hnUrl — Direct link to the HN discussion thread.
  • domain — Hostname of url (e.g. github.com); null for text posts.
  • type — One of story, job, ask, show, poll.
  • source — The list this story came from (top, new, best, etc.).

Sample output

[
  {
    "id": 39842715,
    "title": "Show HN: I built a tool to extract data from any website",
    "url": "https://example.com/launch",
    "score": 412,
    "by": "founder123",
    "commentCount": 138,
    "time": 1709905200,
    "ageHours": 4.2,
    "hnUrl": "https://news.ycombinator.com/item?id=39842715",
    "domain": "example.com",
    "type": "story",
    "source": "top"
  },
  {
    "id": 39842901,
    "title": "Ask HN: How do you keep up with new ML papers?",
    "url": null,
    "score": 187,
    "by": "ml_curious",
    "commentCount": 96,
    "time": 1709908800,
    "ageHours": 3.2,
    "hnUrl": "https://news.ycombinator.com/item?id=39842901",
    "domain": null,
    "type": "ask",
    "source": "top"
  }
]

Stories are returned in HN's native ranking order for each list (i.e. top-of-list first).


Use cases

1. Daily digest emails / Slack bots. Run once a day at 9am, take the top 10 stories with score >= 100, format them into a digest, post to Slack or send via email. Five-line glue script.

2. Trending-topics dashboards. Feed scores and comment counts into a time-series store and chart momentum. Catch stories that are climbing fast before they peak.

3. Competitive monitoring. Filter for stories where domain matches your company's domain — or your competitors'. Get notified the moment something hits the front page.

4. Tech news ingestion for ML. Pull top and best daily, push to a vector store, run topic classification or summarization. Build a personalized "what's interesting today" feed.

5. Ask HN / Show HN watchlist. Filter by type=ask or type=show to track community questions and product launches without scrolling the site.

6. Hiring signal. Watch the monthly "Who is hiring?" thread and Show HN launches to identify hot startups, technologies, and hiring trends.
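Use case 1 really is a few lines of glue. The sketch below formats actor output into a Slack-ready digest; `format_digest` is a hypothetical helper (the field names are this actor's output schema, and the `<url|text>` link markup is Slack's).

```python
def format_digest(stories, min_score=100, limit=10):
    """Turn this actor's output items into a Slack-style digest message."""
    picks = [s for s in stories if s["score"] >= min_score][:limit]
    lines = ["*Today's HN digest*"]
    for s in picks:
        lines.append(
            f"• <{s['hnUrl']}|{s['title']}> — "
            f"{s['score']} points, {s['commentCount']} comments"
        )
    return "\n".join(lines)
```

Feed it the dataset items from any of the run methods below, then POST the resulting string to a Slack incoming webhook.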


Input

The actor accepts a JSON input. The example default is:

{
  "count": 30,
  "type": "top",
  "minScore": 50
}

Typical fields:

  • type — which HN list to pull. One of: top, new, best, ask, show, job. Default top.
  • count — how many stories to fetch from the chosen list. Default 30.
  • minScore — drop stories below this score. Useful for "front page worth reading" filters.
  • maxAgeHours — drop stories older than this many hours.
  • domains — optional allowlist of domains (e.g. ["github.com", "arxiv.org"]).
  • excludeDomains — optional blocklist.

Open the actor in the Apify Console and the form-style editor documents every field with examples. You don't need to memorize anything.
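As an illustration, here is an input combining several of these filters: fetch 100 stories from the top list, then keep only recent, high-scoring ones from an allowlist of domains. The values are arbitrary examples.

```json
{
  "type": "top",
  "count": 100,
  "minScore": 150,
  "maxAgeHours": 12,
  "domains": ["github.com", "arxiv.org"],
  "excludeDomains": ["example.com"]
}
```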


How to run it

1. Apify Console (no code)

Open the actor, edit the input, and click Start. Output lands in the run's dataset and exports as JSON, CSV, Excel, or an RSS feed.

2. Apify API

Synchronous run that returns dataset items in the response — perfect for cron jobs and webhooks:

curl -X POST "https://api.apify.com/v2/acts/cryptosignals~hn-top-stories/run-sync-get-dataset-items?token=YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"count": 30, "type": "top", "minScore": 100}'

Async run:

curl -X POST "https://api.apify.com/v2/acts/cryptosignals~hn-top-stories/runs?token=YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"count": 100, "type": "best", "minScore": 200}'

Then poll GET /v2/acts/cryptosignals~hn-top-stories/runs/{runId} and fetch items from defaultDatasetId.
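A minimal polling loop for the async flow might look like the sketch below. It assumes the statuses SUCCEEDED, FAILED, ABORTED, and TIMED-OUT are terminal (per the Apify API), and that the run ID and token come from the curl call above.

```python
import json
import time
from urllib.request import urlopen

API = "https://api.apify.com/v2"
ACTOR = "cryptosignals~hn-top-stories"
# Assumed terminal run statuses per the Apify platform docs.
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def run_finished(status):
    """True once the run has reached a terminal state."""
    return status in TERMINAL

def wait_for_items(run_id, token, poll_seconds=5):
    """Poll the async run until it finishes, then fetch its dataset items."""
    while True:
        with urlopen(f"{API}/acts/{ACTOR}/runs/{run_id}?token={token}") as resp:
            run = json.load(resp)["data"]
        if run_finished(run["status"]):
            break
        time.sleep(poll_seconds)
    items_url = f"{API}/datasets/{run['defaultDatasetId']}/items?token={token}"
    with urlopen(items_url) as resp:
        return json.load(resp)
```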

3. Apify JavaScript SDK

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('cryptosignals/hn-top-stories').call({
    count: 30,
    type: 'top',
    minScore: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(s => console.log(`${s.score}\t${s.title}\t${s.hnUrl}`));

4. Apify Python SDK

from apify_client import ApifyClient

client = ApifyClient(token="YOUR_TOKEN")

run = client.actor("cryptosignals/hn-top-stories").call(run_input={
    "count": 30,
    "type": "top",
    "minScore": 100,
})

for s in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{s['score']:>4} {s['title']}")
    print(f"     {s['hnUrl']}")

5. Schedule it

In the Apify Console go to Schedules → Create new and pick this actor. Set a cron expression (e.g. 0 9 * * * for daily 9am UTC) and an input. Use a webhook on the schedule to push results into Slack / Discord / your API on every run.


Pricing

Pay Per Event:

  • $0.005 per scraped story.
  • No compute-minute charges, no proxy charges, no per-request fees.
  • A typical front-page run (30 stories) costs $0.15. A count: 100 run costs $0.50.
  • Failed runs that produce no items cost you nothing.
  • Apify free accounts get monthly credit — usually enough to run a daily 30-story digest at no cost.

Output destinations

  • Apify dataset (default) — query later via API, export as JSON/CSV.
  • Webhooks — fire on run completion and POST the dataset URL to your endpoint.
  • Apify integrations — Zapier, Make, Slack, Google Sheets, Airtable, Pipedream all available out of the box from the actor's run page.
  • RSS — the dataset has a built-in RSS view if you'd rather treat HN as a feed.
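If you point a webhook at your own endpoint, a minimal Python receiver could look like this sketch. The payload shape (the run object under "resource", including defaultDatasetId) follows Apify's default webhook payload; the token and port are placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

APIFY_TOKEN = "YOUR_TOKEN"  # placeholder

def dataset_url(payload, token):
    """Build the dataset-items URL from a webhook payload's run resource."""
    dataset_id = payload["resource"]["defaultDatasetId"]
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?token={token}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        # Fetch the finished run's items and hand them to your pipeline.
        with urlopen(dataset_url(payload, APIFY_TOKEN)) as resp:
            items = json.load(resp)
        print(f"run produced {len(items)} items")
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("", 8080), WebhookHandler).serve_forever()
```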

FAQ

How fresh is the data? Each run pulls live from HN at the time of the request. The HN ranking algorithm itself updates roughly every minute or two.

Why is url sometimes null? Ask HN, Show HN (text-only), and poll posts have no outbound URL. Use hnUrl to get to the discussion.

Can I get the comments too? This actor focuses on the story listing — the high-frequency, low-cost feed. Comment-tree extraction is a separate concern (much larger payloads, much higher cost). Reach out via the actor page if you need it.

What if a story is deleted between list-fetch and detail-fetch? The actor silently drops it. Your dataset will never contain null rows or items missing core fields.

Does this comply with HN's terms? HN's API and front-end content are publicly accessible and intended for programmatic use. The actor uses public endpoints only and respects rate limits. Don't use the data for spam, mass-DM campaigns, or to harass posters — that's on you.


Other actors you might like

  • amazon-scraper — Amazon products, prices, ratings, reviews across all major locales.
  • See the full catalogue at apify.com/cryptosignals — Reddit, Lobsters, Product Hunt, GitHub trending, and more community-news / market-data sources, all using the same input/output conventions.

Support

  • Web: web-data-labs.com
  • Issues: open an issue on the actor page on Apify.
  • Updates: actively maintained. If HN changes its layout or API behavior, fixes typically ship within 24 hours.