# Y Combinator News Scraper (`dadhalfdev/y-combinator-news-scraper`) Actor

Get the latest news from the Y Combinator Hacker News page.  The output fields are: title, score, author, timing, discussion link, and body. Only saves rows when the article text comes through. Pick 20–200 stories (default 100). Export CSV or JSON.

- **URL**: https://apify.com/dadhalfdev/y-combinator-news-scraper.md
- **Developed by:** [Marco Rodrigues](https://apify.com/dadhalfdev) (community)
- **Categories:** News
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $8.00 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
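
As a rough illustration of the pay-per-event model, the sketch below estimates the charge for a run from the number of saved results. It assumes the listed starting price of $8.00 per 1,000 results; the `estimate_cost_usd` helper and its `discount` parameter are hypothetical, and actual billing is determined by Apify.

```python
from decimal import Decimal

# Listed starting price from the Pricing section: $8.00 per 1,000 results.
PRICE_PER_1000_USD = Decimal("8.00")


def estimate_cost_usd(results: int, discount: Decimal = Decimal("0")) -> Decimal:
    """Estimate the charge for `results` saved rows.

    `discount` is a hypothetical fraction (0 to 1) standing in for a
    subscription-plan discount; the real discount tiers are not listed here.
    """
    if results < 0:
        raise ValueError("results must be non-negative")
    if not (Decimal("0") <= discount <= Decimal("1")):
        raise ValueError("discount must be between 0 and 1")
    base = PRICE_PER_1000_USD * results / 1000
    return (base * (1 - discount)).quantize(Decimal("0.01"))


print(estimate_cost_usd(100))  # 0.80
print(estimate_cost_usd(200))  # 1.60
```

Because only stories with extracted text are saved, a run may produce fewer results than the requested number of stories, so this figure is best read as an upper bound.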

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
"Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🚀 Y Combinator News Scraper

Want the **latest submissions** from **[Y Combinator Hacker News](https://news.ycombinator.com/newest)** together with the **actual article text** from each linked source? This Actor does both in one run.

It always starts from Y Combinator’s **HN “newest”** feed at [news.ycombinator.com/newest](https://news.ycombinator.com/newest). There it reads titles, scores, authors, and each story’s **outbound URL** (GitHub, blogs, newspapers, another HN `item` page, or whatever else the submitter linked). It then **opens those destination sites** in a browser and extracts the **main body text**. So the corpus is **not** “only HN HTML”: metadata comes from the YC-run listing, while **content** comes from each submission’s source.

![Y Combinator Hacker News](https://i.ibb.co/0jx22x4R/Screenshot-From-2026-04-19-16-22-56.png)

### 💡 Perfect for...

- **Researchers & analysts:** Track what’s being submitted and pull readable text from the original publisher or repo page.
- **Newsletters & dashboards:** Combine HN metadata (`points`, `author`, `hn_discuss_url`) with article excerpts for digests.
- **📚 RAG systems:** Index `title`, `news_link`, and `content` so answers can cite both the HN context and what the linked page actually says.
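
For the RAG use case, one possible way to turn a dataset item into indexable chunks is sketched below. The chunk size, the `to_chunks` helper, and the layout of the returned dictionaries are assumptions for illustration; only the field names `title`, `news_link`, `hn_discuss_url`, and `content` come from the Actor's output.

```python
from typing import Dict, List


def to_chunks(item: Dict[str, object], max_chars: int = 500) -> List[Dict[str, str]]:
    """Split an item's `content` into paragraph-based chunks with citation metadata.

    Paragraphs are packed greedily; a single paragraph longer than
    `max_chars` is kept whole rather than cut mid-sentence.
    """
    content = str(item.get("content") or "")
    chunks: List[str] = []
    current = ""
    # The sample output separates paragraphs with "\n", so split on that.
    for paragraph in (p.strip() for p in content.split("\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {
            "text": text,
            "title": str(item.get("title", "")),
            "source_url": str(item.get("news_link", "")),
            "hn_url": str(item.get("hn_discuss_url", "")),
        }
        for text in chunks
    ]
```

Keeping both `source_url` and `hn_url` on every chunk lets an answer cite the original page and the HN discussion separately.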

### ✨ Why you'll love this scraper

- 🧹 **Clean saves:** Rows are pushed to the dataset **only when non-empty body text** was extracted—blocked or empty pages are skipped, not stored as hollow rows.
- 👤 **Structured HN fields:** Every saved item includes ids, title, outbound link, site label, score, author, timestamps, discussion URL, plus `content`.
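
The "clean saves" rule can be pictured as a simple filter. This is a minimal sketch of the behavior described above, not the Actor's actual source code; `should_save` and `build_row` are hypothetical names, and treating whitespace-only text as empty is an assumption.

```python
from typing import Dict, Optional


def should_save(extracted_text: Optional[str]) -> bool:
    """Return True only when extraction produced non-whitespace text."""
    return bool(extracted_text and extracted_text.strip())


def build_row(
    listing: Dict[str, object], extracted_text: Optional[str]
) -> Optional[Dict[str, object]]:
    """Merge HN listing metadata with article text, or return None to skip the row."""
    if not should_save(extracted_text):
        return None  # blocked or empty page: nothing is stored
    return {**listing, "content": extracted_text.strip()}


print(build_row({"id": "1", "title": "Example"}, ""))         # None
print(build_row({"id": "1", "title": "Example"}, "Body text"))
```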

### 📦 What's inside the data?

For every story that yields extractable text, you will get:

- **HN listing:** `id`, `title`, `news_link`, `site_domain`, `points`, `author`, `posted_at_iso`, `posted_at_human`, `hn_discuss_url`
- **From the linked source:** `content` (plain article text from the destination page when extraction succeeds)

### 🚀 Quick start

1. **Decide how many stories** you want (`max_news`). The Actor collects that many unique items from **[/newest](https://news.ycombinator.com/newest)** (using **More** if needed).
2. **Start the Actor** on Apify. There is no listing URL to paste; the feed URL is fixed.
3. **Export** the default dataset as CSV, Excel, or JSON when the run finishes.
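
If you would rather build the CSV yourself from downloaded JSON items than use the platform export, a standard-library sketch might look like this. The column order follows the field list in this README; the `items_to_csv` helper is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable

# Column order taken from the field list in "What's inside the data?".
FIELDS = [
    "id", "title", "news_link", "site_domain", "points", "author",
    "posted_at_iso", "posted_at_human", "hn_discuss_url", "content",
]


def items_to_csv(items: Iterable[Dict[str, object]]) -> str:
    """Serialize dataset items to CSV text.

    Keys not listed in FIELDS are ignored; missing keys become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()
```

The `csv` module quotes cells that contain commas or newlines, which matters here because `content` usually spans several paragraphs.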

---

#### Tech details for developers 🧑‍💻

**Input Example:**

```json
{
  "max_news": 100
}
````

**Output Example:**

```json
{
  "id": "47824343",
  "title": "HTTP11Probe – Probe web frameworks for compliance",
  "news_link": "https://www.http-probe.com/",
  "site_domain": "http-probe.com",
  "points": 1,
  "author": "MDA2AV",
  "posted_at_iso": "2026-04-19T13:55:47",
  "posted_at_human": "1 minute ago",
  "hn_discuss_url": "https://news.ycombinator.com/item?id=47824343",
  "content": "An open testing platform that probes HTTP/1.1 servers against RFC 9110/9112 requirements, smuggling vectors, and malformed input handling. Add your framework, get compliance results automatically.\nHttp11Probe sends a suite of crafted HTTP requests to each server and checks whether the response matches the exact expected behavior from the RFCs. Every server is tested identically, producing a side-by-side compliance comparison.\nHttp11Probe is open source and built for contributions. Add your HTTP server to the leaderboard, or write new test cases to expand coverage.\nEvery new framework added makes the comparison more useful for the entire community, and every new test strengthens the compliance bar for all servers on the platform. If you’ve found an edge case that isn’t covered, or you maintain a framework that isn’t listed yet, your contribution directly improves HTTP security and interoperability for everyone."
}
```
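
For typed access in downstream code, the output item can be modeled as a small dataclass. This is a sketch based only on the sample above: the `Story` class is hypothetical, and because the sample `posted_at_iso` value carries no timezone offset, the parsed timestamp is left naive rather than assumed to be UTC.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass(frozen=True)
class Story:
    id: str
    title: str
    news_link: str
    site_domain: str
    points: int
    author: str
    posted_at: datetime  # parsed from `posted_at_iso`, timezone-naive
    posted_at_human: str
    hn_discuss_url: str
    content: str

    @classmethod
    def from_item(cls, item: Dict[str, Any]) -> "Story":
        """Build a Story from one dataset item; raises KeyError on a missing field."""
        return cls(
            id=str(item["id"]),
            title=item["title"],
            news_link=item["news_link"],
            site_domain=item["site_domain"],
            points=int(item["points"]),
            author=item["author"],
            posted_at=datetime.fromisoformat(item["posted_at_iso"]),
            posted_at_human=item["posted_at_human"],
            hn_discuss_url=item["hn_discuss_url"],
            content=item["content"],
        )
```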

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `max_news` | integer | No | Target number of unique stories to collect from `/newest` (paging with **More** as needed); each story's link is then opened to extract content. Default **100**, min **20**, max **200** (see `.actor/input_schema.json`). |
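
Checking the input on the client side before starting a run can surface mistakes early. The sketch below mirrors the bounds from the table above (default 100, minimum 20, maximum 200); the `build_input` helper is hypothetical and is not part of the Actor or the Apify client.

```python
from typing import Dict, Optional

DEFAULT_MAX_NEWS = 100
MIN_MAX_NEWS = 20
MAX_MAX_NEWS = 200


def build_input(max_news: Optional[int] = None) -> Dict[str, int]:
    """Return an Actor input object, validating `max_news` against the documented bounds."""
    if max_news is None:
        return {"max_news": DEFAULT_MAX_NEWS}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_news, bool) or not isinstance(max_news, int):
        raise TypeError("max_news must be an integer")
    if not MIN_MAX_NEWS <= max_news <= MAX_MAX_NEWS:
        raise ValueError(
            f"max_news must be between {MIN_MAX_NEWS} and {MAX_MAX_NEWS}, got {max_news}"
        )
    return {"max_news": max_news}


print(build_input())    # {'max_news': 100}
print(build_input(50))  # {'max_news': 50}
```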

**Stack:** Python, [Apify SDK](https://docs.apify.com/sdk/python/), Crawlee `PlaywrightCrawler`, Playwright, [trafilatura](https://github.com/adbar/trafilatura). For article requests, **401 / 403 / 429** responses are not treated as session-blocking errors, so the handler can still attempt extraction; if the extracted text is empty, **no** dataset row is written.

**Local run:** From this Actor's directory, install dependencies and run `playwright install`, then use **`apify run`** so that `input.json` is applied, or supply the input however your environment expects when running `python -m src`.

# Actor input Schema

## `max_news` (type: `integer`):

How many newest stories to scrape (metadata from Hacker News plus article content from each story link).

## Actor input object example

```json
{
  "max_news": 100
}
```

# Actor output Schema

## `overview` (type: `string`):

Table view using the dataset 'overview' view.

## `results` (type: `string`):

All items from the default dataset without view transformation.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input (max_news: 20 to 200, defaults to 100 when omitted)
const input = { max_news: 100 };

// Run the Actor and wait for it to finish
const run = await client.actor("dadhalfdev/y-combinator-news-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input (max_news: 20 to 200, defaults to 100 when omitted)
run_input = {"max_news": 100}

# Run the Actor and wait for it to finish
run = client.actor("dadhalfdev/y-combinator-news-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
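
In a larger project it can help to wrap the call in a function that checks the run status before reading the dataset. The sketch below assumes the run is returned as a dictionary with `status` and `defaultDatasetId` keys, as in the example above; `fetch_stories` is a hypothetical helper, and `client` is expected to be an `ApifyClient` instance.

```python
from typing import Any, Dict, List

ACTOR_ID = "dadhalfdev/y-combinator-news-scraper"


def fetch_stories(client: Any, max_news: int = 100) -> List[Dict[str, Any]]:
    """Run the Actor with `max_news` and return its dataset items as a list.

    Raises RuntimeError when the run is missing or did not finish
    with the SUCCEEDED status.
    """
    run = client.actor(ACTOR_ID).call(run_input={"max_news": max_news})
    if run is None or run.get("status") != "SUCCEEDED":
        status = None if run is None else run.get("status")
        raise RuntimeError(f"Actor run did not succeed (status: {status})")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and a valid token):
# from apify_client import ApifyClient
# stories = fetch_stories(ApifyClient("<YOUR_API_TOKEN>"), max_news=50)
```

Taking the client as a parameter keeps the function easy to test with a stub and avoids hard-coding the token.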

## CLI example

```bash
echo '{"max_news": 100}' |
apify call dadhalfdev/y-combinator-news-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=dadhalfdev/y-combinator-news-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Y Combinator News Scraper",
        "description": "Get the latest news from the Y Combinator Hacker News page.  The output fields are: title, score, author, timing, discussion link, and body. Only saves rows when the article text comes through. Pick 20–200 stories (default 100). Export CSV or JSON.",
        "version": "0.1",
        "x-build-id": "qGY7MLKCuFf4dCVwg"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/dadhalfdev~y-combinator-news-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-dadhalfdev-y-combinator-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/dadhalfdev~y-combinator-news-scraper/runs": {
            "post": {
                "operationId": "runs-sync-dadhalfdev-y-combinator-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/dadhalfdev~y-combinator-news-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-dadhalfdev-y-combinator-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "max_news": {
                        "title": "Maximum news items",
                        "minimum": 20,
                        "maximum": 200,
                        "type": "integer",
                        "description": "How many newest stories to scrape (metadata from Hacker News plus article content from each story link).",
                        "default": 100
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
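
To show how the specification maps to a concrete call, the sketch below builds (but does not send) a request for the `run-sync-get-dataset-items` endpoint using only the standard library. The URL, query parameter, and JSON body follow the specification above; the `build_sync_request` helper is hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "dadhalfdev~y-combinator-news-scraper"


def build_sync_request(token: str, max_news: int = 100) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps({"max_news": max_news}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (network access and a valid token required):
# with urllib.request.urlopen(build_sync_request("<YOUR_API_TOKEN>", 20)) as resp:
#     items = json.load(resp)
```

Since the Actor opens every linked page in a browser, a run can take a while; for larger `max_news` values the asynchronous `/runs` endpoint from the same specification may be the safer choice.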
