Hacker News Scraper

Scrape Hacker News stories (top, new, best, ask, show, jobs) plus per-story metadata in one call — title, URL, score, author, comment count, posted-at — export to JSON or CSV. A Hacker News API wrapper that handles pagination, fan-out, retries, and rate-limit pacing.

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

This Actor fetches a Hacker News story list from any of the six available feeds (top, new, best, ask, show, or jobs), fans out to each individual story record, and writes one typed dataset row per story. The underlying Firebase API returns only item IDs at the feed level, so we make one enrichment call per story (the N+1 pattern: one feed request plus N item requests) and assemble the complete record before it hits your dataset.

You pick the feed and the result cap; we deliver clean, schema-validated rows on a schedule in JSON, CSV, or Excel — ready to pipe into Google Sheets, S3, a data warehouse, a webhook, or a RAG pipeline.
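
The N+1 fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the `scrape_feed` function, the injected `fetch_json` callable, and the choice to keep the original feed position as `rank` when dead or deleted items are skipped are all assumptions. The endpoint names are those of the public Hacker News Firebase API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

BASE = "https://hacker-news.firebaseio.com/v0"

# Feed name (as used in the Actor input) -> public HN API endpoint.
FEED_ENDPOINTS = {
    "top": "topstories",
    "new": "newstories",
    "best": "beststories",
    "ask": "askstories",
    "show": "showstories",
    "jobs": "jobstories",
}


def feed_url(feed: str) -> str:
    return f"{BASE}/{FEED_ENDPOINTS[feed]}.json"


def item_url(item_id: int) -> str:
    return f"{BASE}/item/{item_id}.json"


def scrape_feed(
    feed: str,
    fetch_json: Callable[[str], Any],  # hypothetical: takes a URL, returns parsed JSON
    max_results: int = 100,
    concurrency: int = 8,
) -> List[Dict[str, Any]]:
    # 1 call: the feed endpoint returns only a list of item IDs.
    ids = fetch_json(feed_url(feed)) or []
    if max_results > 0:  # 0 means "the full feed"
        ids = ids[:max_results]

    # N calls: one per story, fetched in parallel; map() preserves feed order.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        items = list(pool.map(lambda i: fetch_json(item_url(i)), ids))

    rows = []
    for rank, item in enumerate(items, start=1):
        # Skip missing, dead, and deleted items (assumed behaviour).
        if not item or item.get("dead") or item.get("deleted"):
            continue
        rows.append({**item, "rank": rank})
    return rows
```

Any callable that maps a URL to parsed JSON can be passed as `fetch_json`, which keeps the fan-out logic testable without network access.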

🔥 Features

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome, Firefox, and Safari TLS handshakes so we look like a browser, not a Python script.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block or rate-limit response.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per request, Retry-After header honoured.
  • 🧱 Rate-limit-aware pacing — when the target pushes back we slow down and surface a clear status message instead of silently returning an empty dataset.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, feed rank included.
  • 💰 Pay-Per-Event pricing: you pay per result that lands in your dataset, plus a small one-off start charge per run.
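
As an illustration of the retry behaviour listed above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured), here is a minimal sketch. The function names, the 1-second base, the 60-second cap, the jitter range, and the `send` callable returning `(status, headers, body)` are assumptions made for the example, not the Actor's internals.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status, headers, body)


def is_retryable(status: int) -> bool:
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(
    attempt: int,  # 1-based attempt number that just failed
    retry_after: Optional[str] = None,
    base: float = 1.0,  # assumed base delay in seconds
    cap: float = 60.0,  # assumed upper bound in seconds
) -> float:
    # Prefer the server's hint when it is given as a number of seconds.
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Exponential growth: base, 2*base, 4*base, ... up to the cap,
    # with jitter so parallel workers do not retry in lockstep.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(ceiling / 2, ceiling)


def request_with_retries(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if not is_retryable(status) or attempt == max_attempts:
            return status, body
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the retry loop be exercised in tests without actually waiting.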

💡 Use cases

  • Trend monitoring — diff top stories hourly to see which posts gain traction fastest.
  • Comment-volume alerts — pipe rows into Slack when a story passes 100 comments.
  • Lead gen for dev tools — surface Show HN launches that mention your stack and reach out early.
  • Newsletter curation — feed the top 10 stories from the best feed into a weekly digest.
  • ML training data — historical top-story metadata for score-prediction or topic-classification models.
  • Show HN tracker — schedule a daily run against the show feed to watch new product launches.
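
Two of the use cases above (trend monitoring and comment-volume alerts) reduce to small operations on exported rows. The sketch below assumes rows shaped like the Output section (`id`, `rank`, `descendants`); the helper names `rank_gains` and `busy_threads` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def rank_gains(previous: List[Row], current: List[Row]) -> List[Tuple[int, int]]:
    """Return (story id, positions gained) for stories that moved up
    between two snapshots, biggest climbers first."""
    prev_rank = {row["id"]: row["rank"] for row in previous}
    gains = []
    for row in current:
        old = prev_rank.get(row["id"])
        # A smaller rank number means a higher position in the feed.
        if old is not None and old > row["rank"]:
            gains.append((row["id"], old - row["rank"]))
    return sorted(gains, key=lambda g: g[1], reverse=True)


def busy_threads(rows: List[Row], min_comments: int = 100) -> List[Row]:
    """Rows whose comment count has passed the threshold.
    descendants may be null, which is treated as zero here."""
    return [r for r in rows if (r.get("descendants") or 0) > min_comments]
```

For example, feeding the datasets of two consecutive hourly runs into `rank_gains` yields the stories climbing fastest, and the output of `busy_threads` is what a Slack alert would post.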

⚙️ How to use it

  1. Click Try for free at the top of the Store page.
  2. Fill in the input form — most fields have sensible defaults (feed: top, max results: 100).
  3. Click Start. Output streams into the run's dataset in real time.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or pull via the Apify REST API.

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| feed | string | no | top | Which story feed to pull: top (front page), new (most recent), best (time-decayed best), ask, show, or jobs. |
| maxResults | integer | no | 100 | Total dataset rows to produce. Each feed exposes up to 500 items; set to 0 for the full feed length. |
| includeText | boolean | no | true | Fetch the full self-post body for Ask HN and Show HN entries. Has no effect on regular link stories. |
| concurrency | integer | no | 8 | How many story records to fetch in parallel (1–32). |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Apify Proxy configuration. Enable residential proxies if you need to route traffic through Apify for compliance or high-volume runs. |

Example input

{
  "feed": "top",
  "maxResults": 3,
  "includeText": false,
  "concurrency": 4,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
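
To make the defaults and ranges in the input table concrete, here is one way the input could be normalised before a run. It is a sketch under assumptions: the `ActorInput` dataclass and `parse_input` helper are hypothetical, and whether the real Actor rejects or clamps out-of-range values is not stated in this document (the sketch rejects them).

```python
from dataclasses import dataclass
from typing import Any, Dict

FEEDS = ("top", "new", "best", "ask", "show", "jobs")


@dataclass(frozen=True)
class ActorInput:
    feed: str = "top"
    max_results: int = 100  # 0 means the full feed length
    include_text: bool = True
    concurrency: int = 8  # allowed range: 1 to 32
    use_apify_proxy: bool = False


def parse_input(raw: Dict[str, Any]) -> ActorInput:
    feed = raw.get("feed", "top")
    if feed not in FEEDS:
        raise ValueError(f"feed must be one of {FEEDS}, got {feed!r}")

    max_results = int(raw.get("maxResults", 100))
    if max_results < 0:
        raise ValueError("maxResults must be 0 (full feed) or a positive integer")

    concurrency = int(raw.get("concurrency", 8))
    if not 1 <= concurrency <= 32:
        raise ValueError("concurrency must be between 1 and 32")

    proxy = raw.get("proxyConfiguration") or {}
    return ActorInput(
        feed=feed,
        max_results=max_results,
        include_text=bool(raw.get("includeText", True)),
        concurrency=concurrency,
        use_apify_proxy=bool(proxy.get("useApifyProxy", False)),
    )
```

Calling `parse_input({})` yields the documented defaults, so an empty input form still produces a valid run configuration.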

📤 Output

Every row is one dataset item.

| Field | Type | Notes |
| --- | --- | --- |
| id | integer | Hacker News story ID (stable, monotonically increasing). |
| type | string | HN item type: story, job, ask, show, comment, poll. |
| title | string | Story headline. |
| url | string or null | Outbound link (null for self-posts). |
| permalink | string | Hacker News permalink (news.ycombinator.com/item?id=...). |
| by | string | Author username on Hacker News. |
| score | integer or null | Upvotes; null for jobs and dead items. |
| descendants | integer or null | Total comment count, including replies. |
| text | string or null | Self-post body (Ask HN / Show HN). HTML; only present when includeText is true. |
| time | integer | Unix epoch seconds, when the story was posted. |
| posted_at | string | ISO-8601 UTC timestamp derived from time. |
| scraped_at | string | ISO-8601 UTC timestamp of when this row was recorded. |
| rank | integer | Position of this story in the feed at scrape time (1-indexed). |

Example output

{
  "id": 39000000,
  "type": "story",
  "title": "Show HN: Devil Scrapes — public-data Apify Actors with honest pricing",
  "url": "https://apify.com/DevilScrapes",
  "permalink": "https://news.ycombinator.com/item?id=39000000",
  "by": "devilscrapes",
  "score": 142,
  "descendants": 33,
  "text": null,
  "time": 1747353600,
  "posted_at": "2025-05-16T00:00:00+00:00",
  "scraped_at": "2025-05-16T00:05:00+00:00",
  "rank": 1
}
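
The derived fields in the output (permalink, posted_at, scraped_at) can be computed from the raw API item roughly like this. The `enrich` helper and its signature are assumptions for illustration; only the field names and formats come from the table above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def permalink(item_id: int) -> str:
    return f"https://news.ycombinator.com/item?id={item_id}"


def to_iso_utc(epoch_seconds: int) -> str:
    # Timezone-aware conversion, so the string carries an explicit +00:00 offset.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def enrich(item: Dict[str, Any], rank: int, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Add the derived columns to a raw HN API item (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        **item,
        "permalink": permalink(item["id"]),
        "posted_at": to_iso_utc(item["time"]),
        "scraped_at": now.replace(microsecond=0).isoformat(),
        "rank": rank,
    }


print(to_iso_utc(1700000000))  # 2023-11-14T22:13:20+00:00
```

Using a timezone-aware datetime avoids the common mistake of formatting the epoch value in the machine's local time zone.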

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Per dataset item written |

Example: a single run that writes 1,000 results costs 1,000 × $0.002 + $0.005 = $2.005, or about $2.00. No subscription, no minimum, no card to start: Apify gives every new account $5 of free credit.
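
For budgeting scheduled runs, the two event prices above combine as in the following sketch. The `run_cost` helper is hypothetical; the rates are the ones listed in the pricing table, and `Decimal` is used to avoid floating-point rounding noise in currency arithmetic.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.002")  # charged per dataset item written


def run_cost(results: int, runs: int = 1) -> Decimal:
    """Total USD for `results` dataset items written across `runs` runs."""
    return ACTOR_START * runs + PER_RESULT * results


print(run_cost(1000))  # 2.005
```

As a second data point, an hourly schedule capped at 100 results per run for 30 days is 720 runs and 72,000 results, so `run_cost(72000, runs=720)` comes to $147.60.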

🚧 Limitations

  • Comment threads are not expanded: we return descendants (the count) but not the full tree.
  • Dead and deleted stories are skipped automatically.
  • The text field for self-posts is raw HTML, not Markdown, so run it through your own sanitiser before display.
  • Each feed is capped at 500 items by the upstream; we cannot exceed that without supplementing via search.

❓ FAQ

Is scraping Hacker News legal?

Yes — Y Combinator makes Hacker News data available through a documented, open API at github.com/HackerNews/API. We fetch only what that API surfaces, pace requests responsibly, and surface every call in the run log.

Why use this instead of calling the API myself?

The raw API returns an array of item IDs at the feed level — you need a separate round-trip per story to get title, score, and comment count. At 500 stories that is 501 HTTP calls to coordinate, de-duplicate, and fan out concurrently. We do that work, add ISO timestamps, attach the feed rank column (which the API does not expose), and deliver structured rows you can export or schedule without writing a line of code.

How do I set up a Show HN tracker?

Set feed to show and schedule your run on a cron. Each run captures the Show HN feed at that point in time with title, score, comment count, and author — ready for a Slack alert or spreadsheet diff without any glue code.

Can I export Hacker News data to a spreadsheet?

Yes — finish a run, open Storage → Dataset, and click Export as CSV or Export as Excel. Every field in the output table maps cleanly to a spreadsheet column. You can also connect the dataset URL directly to a Google Sheets IMPORTDATA formula.
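
For the IMPORTDATA route, the dataset URL can be assembled as in this sketch. It uses the Apify API's dataset items endpoint with its `format`, `fields`, and `token` query parameters; the `dataset_items_url` helper and the `abc123` dataset ID are placeholders for illustration.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def dataset_items_url(
    dataset_id: str,
    fmt: str = "csv",
    fields: Optional[Iterable[str]] = None,
    token: Optional[str] = None,
) -> str:
    params = {"format": fmt}
    if fields:
        # Restrict the export to selected columns, in this order.
        params["fields"] = ",".join(fields)
    if token:
        # Only needed when the dataset is not publicly readable.
        params["token"] = token
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


# "abc123" is a placeholder dataset ID.
print(dataset_items_url("abc123", fields=["rank", "title", "score"]))
```

The resulting URL can be pasted into a Google Sheets cell as `=IMPORTDATA("...")`. Keep in mind that a token embedded in a shared spreadsheet is visible to everyone with access to it.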

Can I scrape comments too?

Not in this Actor — comment trees fan out 10-100x per story and would multiply cost significantly. A sibling hacker-news-comments-scraper will follow if there is enough demand.

How fresh is the data?

The upstream API reflects changes in near-real time. Your run captures whatever the feed contained the moment each story record was fetched.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab in Apify Console. We ship fixes weekly and read every report.