# News & Article Extractor (`automation-lab/news-article-extractor`) Actor

Auto-discover and extract articles from news sites, blogs, and publications. Finds RSS feeds and sitemaps automatically. Outputs title, author, date, full text, images, and metadata. No proxy needed.

- **URL**: https://apify.com/automation-lab/news-article-extractor.md
- **Developed by:** [Stas Persiianenko](https://apify.com/automation-lab) (community)
- **Categories:** News
- **Stats:** 8 total users, 3 monthly users, 88.6% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage, only a fixed price for specific events.
Since this Actor supports Apify Store discounts, the price gets lower the higher your subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## News & Article Extractor

Extract articles from any news website, blog, or publication — automatically. Give it a URL and it discovers articles via RSS feeds, sitemaps, or HTML crawling, then pulls the full text using @mozilla/readability.

No API key needed. No browser overhead. Just pure HTTP extraction.

### 📰 What does News & Article Extractor do?

**News & Article Extractor** auto-discovers and extracts articles from news sites, blogs, and academic publications. Point it at any website — [TechCrunch](https://techcrunch.com), [BBC News](https://bbc.com), your company blog, or any RSS feed — and it returns structured article data: title, author, publish date, full text content, images, and more.

The extractor uses a three-tier discovery strategy:
1. **RSS auto-discovery** — detects RSS/Atom feeds from `<link rel="alternate">` tags or common paths (`/feed`, `/rss.xml`)
2. **sitemap.xml** — parses XML sitemaps including news sitemaps for systematic URL discovery
3. **HTML crawl** — falls back to extracting article links from the homepage

Once articles are found, [@mozilla/readability](https://github.com/mozilla/readability) (the same engine Firefox Reader View uses) strips navigation, ads, and boilerplate to return clean article text.
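
The first discovery tier can be sketched in code. The helper below only illustrates the `<link rel="alternate">` detection and common-path fallback described above; it is our own sketch, not the Actor's actual implementation, which is not public.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collects <link rel="alternate"> hrefs that advertise RSS/Atom feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and a.get("rel", "").lower() == "alternate"
                and a.get("type", "") in ("application/rss+xml", "application/atom+xml")
                and a.get("href")):
            self.feeds.append(a["href"])


def discover_feeds(homepage_html):
    """Tier 1: feeds advertised in <head>; otherwise try common feed paths."""
    parser = FeedLinkFinder()
    parser.feed(homepage_html)
    return parser.feeds or ["/feed", "/rss.xml"]
```

If neither the advertised feeds nor the common paths respond, the extractor moves on to sitemap parsing and finally the HTML crawl.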

### 👤 Who is News & Article Extractor for?

**Researchers and academics** monitoring a beat or topic:
- Track publications across dozens of news sources daily
- Build training datasets from articles across multiple blogs
- Monitor academic preprint servers and publication feeds

**Content marketers and SEO teams**:
- Audit competitor blog content and publishing cadence
- Aggregate industry news for internal newsletters
- Monitor brand mentions across news publications

**Data scientists and ML engineers**:
- Build NLP training corpora from news articles
- Create RAG (Retrieval-Augmented Generation) knowledge bases
- Feed structured article data into analysis pipelines

**Business intelligence teams**:
- Monitor competitor press releases and announcements
- Track industry trends from multiple publications
- Export article data to Google Sheets, Airtable, or databases

### ✅ Why use News & Article Extractor?

- **Automatic discovery** — no need to manually find RSS feeds or sitemaps; the extractor tries all methods automatically
- **Clean text extraction** — @mozilla/readability removes ads, navigation, footers, and cookie banners
- **RSS metadata included** — when articles come from RSS, you get author, date, and description for free (no extra HTTP request)
- **Metadata-only mode** — set `extractFullContent: false` to get just titles, dates, and URLs blazing fast
- **Date filtering** — filter articles by publication date range to get only recent content
- **No proxy needed** — most news sites are publicly accessible; pure HTTP extraction
- **Structured output** — every article outputs the same fields: title, author, publishedDate, content, wordCount, imageUrl, images, sourceDomain
- **Graceful error handling** — failed articles are skipped and logged; the run continues
- **No API key or login required**

### 📊 What data can you extract?

| Field | Description | Type |
|-------|-------------|------|
| `title` | Article headline | string |
| `author` | Byline / author name | string |
| `publishedDate` | ISO 8601 publication date | string |
| `description` | Article summary or meta description | string |
| `content` | Full article body text (plain text) | string |
| `wordCount` | Number of words in article content | number |
| `url` | Canonical article URL | string |
| `imageUrl` | Primary/OG image URL | string |
| `images` | All image URLs found in article | array |
| `sourceDomain` | Domain of the source site | string |
| `sourceUrl` | Root URL of the source site | string |
| `discoveryMethod` | How the article was found: `rss`, `sitemap`, or `html-crawl` | string |
| `extractedAt` | Timestamp of extraction | string |
| `success` | Whether extraction succeeded | boolean |
| `error` | Error message if extraction failed | string |
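
Because every item carries `success`, `wordCount`, and `error`, a downstream script can separate clean extractions from failures before further processing. A minimal sketch (the function name is ours):

```python
def summarize(items):
    """Split dataset items into successes and failures and compute basic stats."""
    ok = [it for it in items if it.get("success")]
    failed = [it for it in items if not it.get("success")]
    total_words = sum(it.get("wordCount", 0) for it in ok)
    return {
        "ok": len(ok),
        "failed": len(failed),
        "avg_words": total_words // len(ok) if ok else 0,
    }
```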

### 💰 How much does it cost to extract news articles?

News & Article Extractor uses **Pay-Per-Event (PPE) pricing** — you pay only for results, not for compute time.

| Event | FREE tier | BRONZE | SILVER | GOLD |
|-------|-----------|--------|--------|------|
| Actor start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per article extracted | $0.0023 | $0.002 | $0.00156 | $0.0012 |

**Real-world cost examples** (using the BRONZE rate of $0.002 per article plus the one-time $0.005 start fee):
- Extract 10 articles: ~$0.025 (pennies)
- Extract 100 articles: ~$0.205
- Extract 1,000 articles: ~$2.005

**On the free plan** ($5 Apify credits), you can extract roughly **2,100+ articles** before spending a cent of your own money.

Metadata-only mode (`extractFullContent: false`) is charged the same — the savings come from faster runs, not lower per-article cost.
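
The table above translates into a simple estimator. The rates are copied from the pricing table; the rounding is ours:

```python
# Per-article PPE rates by subscription tier, from the pricing table above.
RATES = {"FREE": 0.0023, "BRONZE": 0.002, "SILVER": 0.00156, "GOLD": 0.0012}
START_FEE = 0.005  # one-time Actor start event


def estimate_cost(n_articles, tier="FREE"):
    """Estimated run cost in USD: start fee plus per-article events."""
    return round(START_FEE + n_articles * RATES[tier], 4)
```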

### 🚀 How to extract articles from a news website

1. **Go to the [News & Article Extractor page](https://apify.com/automation-lab/news-article-extractor)** on Apify Store
2. Click **Try for free**
3. In the **Website URLs** field, enter one or more website URLs (e.g., `https://techcrunch.com`, `https://bbc.com/news`)
4. Set **Max Articles per Site** — start with 10-20 for a quick test
5. Leave **Extract Full Content** enabled to get full article text, or disable it for metadata-only (faster)
6. Click **Start** and wait for results (typically 30-90 seconds for 10-20 articles)
7. Export results as **JSON**, **CSV**, or **Excel** from the dataset tab
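
Besides the console export in step 7, dataset items can be downloaded directly over HTTP via Apify's dataset-items endpoint (`/v2/datasets/{datasetId}/items`). A small helper to build that URL (the function name is ours; the endpoint and `format` values are part of Apify's documented API):

```python
def dataset_export_url(dataset_id, fmt="csv", token=None):
    """Build a dataset-items export URL (fmt: json, csv, xlsx, ...)."""
    url = f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
    if token:
        url += f"&token={token}"
    return url
```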

**Input JSON examples:**

Extract recent articles from two sources:
```json
{
    "startUrls": ["https://techcrunch.com", "https://theverge.com"],
    "maxArticles": 20,
    "extractFullContent": true,
    "includeImages": true
}
```

Use an RSS feed URL directly:

```json
{
    "startUrls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "maxArticles": 50,
    "extractFullContent": true
}
```

Metadata-only, last 7 days:

```json
{
    "startUrls": ["https://blog.apify.com"],
    "maxArticles": 100,
    "extractFullContent": false,
    "dateFrom": "2026-04-01"
}
```

### ⚙️ Input parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrls` | array | required | Website URLs or RSS/sitemap URLs to process |
| `maxArticles` | integer | 20 | Maximum articles to extract per site |
| `extractFullContent` | boolean | true | Fetch and extract full article body text |
| `includeImages` | boolean | true | Include image URLs in output |
| `dateFrom` | string | — | Only articles on/after this date (YYYY-MM-DD) |
| `dateTo` | string | — | Only articles on/before this date (YYYY-MM-DD) |
| `requestTimeout` | integer | 30 | HTTP request timeout in seconds |
| `maxRetries` | integer | 2 | Retry attempts per failed request |

**Tips for `startUrls`:**

- You can enter full domain URLs (`https://techcrunch.com`) or direct RSS feed URLs (`https://feeds.bbci.co.uk/news/rss.xml`)
- For sites with many sections, enter specific section URLs (e.g., `https://bbc.com/news/technology`)
- Academic sites like arXiv work via HTML crawl: `https://arxiv.org/list/cs.AI/recent`

### 📤 Output examples

Full content extraction:

```json
{
    "url": "https://techcrunch.com/2026/04/07/waymo-opens-robotaxi-service-in-nashville/",
    "title": "Waymo opens robotaxi service in Nashville, partners with Lyft",
    "author": "Kirsten Korosec",
    "publishedDate": "2026-04-07T14:00:00.000Z",
    "description": "Waymo is expanding its robotaxi service beyond its current markets...",
    "content": "Waymo is expanding its autonomous vehicle service to Nashville, Tennessee...",
    "wordCount": 537,
    "imageUrl": "https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg",
    "images": ["https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg"],
    "sourceDomain": "techcrunch.com",
    "sourceUrl": "https://techcrunch.com",
    "discoveryMethod": "rss",
    "extractedAt": "2026-04-07T14:05:22.000Z",
    "success": true,
    "error": null
}
```

Metadata-only mode (`extractFullContent: false`):

```json
{
    "url": "https://www.bbc.com/news/articles/cx23p6j5gxgo",
    "title": "Artemis II crew head for home after historic lunar flyby",
    "author": "Jonathan Amos",
    "publishedDate": "2026-04-07T13:00:04.000Z",
    "description": "The four astronauts flew closer to the Moon than any humans since Apollo 17 in 1972.",
    "content": null,
    "wordCount": 0,
    "imageUrl": "https://ichef.bbci.co.uk/news/1024/branded_news/...jpg",
    "images": [],
    "sourceDomain": "bbc.com",
    "discoveryMethod": "rss",
    "success": true,
    "error": null
}
```

### 💡 Tips for best results

- **Start with RSS feed URLs** — if you know a site's RSS feed (`/feed`, `/rss.xml`), enter it directly. RSS feeds include metadata (author, date, description) without an extra HTTP request per article.
- **Use metadata-only mode for monitoring** — when you just need to know what articles were published and when, disable `extractFullContent`. It's much faster and costs the same per article.
- **Set date filters for recurring runs** — schedule the Actor daily and use `dateFrom` to avoid re-extracting old articles.
- **Lower `maxArticles` for quick tests** — start with 5-10 articles to verify the site works before scaling up.
- **Paywalled sites** — the extractor fetches pages as a regular browser would. Paywalled content that requires login won't be accessible.
- **JavaScript-heavy sites** — some modern sites render content via JavaScript. If the extractor returns empty content for a site that has articles, the site may require a browser-based approach.
- **Multiple sections** — for large news sites, add multiple section URLs to `startUrls` to cover more ground (e.g., both `https://nytimes.com/section/technology` and `https://nytimes.com/section/business`).
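
For the scheduled-runs tip above, the `dateFrom` value can be computed from the schedule interval instead of hard-coded. A sketch (the helper names and `maxArticles` value are illustrative; the input field names match the Actor's schema):

```python
from datetime import date, timedelta


def rolling_date_from(days_back=1):
    """dateFrom for a recurring run: only articles from the last N days."""
    return (date.today() - timedelta(days=days_back)).isoformat()


def build_input(start_urls, days_back=1):
    """Assemble a metadata-only monitoring input for the Actor."""
    return {
        "startUrls": start_urls,
        "maxArticles": 100,
        "extractFullContent": False,
        "dateFrom": rolling_date_from(days_back),
    }
```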

### 🔗 Integrations

**News & Article Extractor → Google Sheets**
Schedule the Actor to run daily and automatically append new articles to a Google Sheet. Use the built-in Apify → Google Sheets integration to build a living content archive. Great for editorial teams tracking industry news.

**News & Article Extractor → Slack/Discord alerts**
Set up a webhook trigger: when the Actor completes and finds new articles matching certain keywords, post a summary to your Slack channel. Perfect for brand monitoring or competitor tracking.

**News & Article Extractor → Make/Zapier content pipeline**
Connect via Apify's Make or Zapier integration to route new articles to your CMS, Notion database, or email newsletter tool. Build a fully automated content curation pipeline.

**Scheduled monitoring runs**
Schedule runs every hour or day using Apify's built-in scheduler. Combine with date filters to only extract articles published since the last run. No duplicates, no manual work.

**News & Article Extractor → RAG / LLM pipeline**
Export article content to your vector database (Pinecone, Weaviate, Chroma) for retrieval-augmented generation. The `content` field gives you clean plain text ready for embedding.
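
Before embedding, the plain-text `content` field usually needs to be split into smaller passages. A minimal word-based chunker (chunk sizes are illustrative; production pipelines often split on tokens or sentences instead):

```python
def chunk_article(item, max_words=200, overlap=20):
    """Split an article's `content` into overlapping word-window chunks."""
    words = (item.get("content") or "").split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap  # slide window, keeping `overlap` words
    return chunks
```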

### 🖥️ Using the Apify API

#### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/news-article-extractor').call({
    startUrls: ['https://techcrunch.com', 'https://theverge.com'],
    maxArticles: 50,
    extractFullContent: true,
    includeImages: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
items.slice(0, 3).forEach(article => {
    console.log(`[${article.publishedDate}] ${article.title} — ${article.wordCount} words`);
});
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_API_TOKEN')

run = client.actor('automation-lab/news-article-extractor').call(run_input={
    'startUrls': ['https://techcrunch.com', 'https://theverge.com'],
    'maxArticles': 50,
    'extractFullContent': True,
    'includeImages': True,
})

dataset = client.dataset(run['defaultDatasetId'])
articles = dataset.list_items().items
print(f'Extracted {len(articles)} articles')
for article in articles[:3]:
    print(f"[{article['publishedDate']}] {article['title']} — {article['wordCount']} words")
```

#### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~news-article-extractor/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://techcrunch.com"],
    "maxArticles": 20,
    "extractFullContent": true
  }'
```

### 🤖 Use with AI agents via MCP

News & Article Extractor is available as a tool for AI assistants that support the [Model Context Protocol (MCP)](https://docs.apify.com/platform/integrations/mcp).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

#### Setup for Claude Code

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
```

#### Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
        }
    }
}
```

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

#### Example prompts

Once connected, try asking your AI assistant:

- "Use automation-lab/news-article-extractor to extract the 20 most recent articles from TechCrunch and summarize the main themes"
- "Extract all articles published today from https://bbc.com/news and identify which topics appear most frequently"
- "Get metadata for the last 50 articles from https://blog.apify.com and tell me the average word count per post"

Learn more in the [Apify MCP documentation](https://docs.apify.com/platform/integrations/mcp).

### ⚖️ Is it legal to scrape news articles?

News & Article Extractor only accesses **publicly available pages** — the same content any web browser would see. It does not bypass authentication, circumvent paywalls, or access restricted content.

**Best practices:**

- Respect `robots.txt` guidelines for the sites you scrape
- Do not scrape personal data beyond what's in public article bylines
- Check each website's terms of service regarding automated access
- Use the data ethically — for research, monitoring, and analysis, not plagiarism or content theft
- The Actor does not store or redistribute copyrighted content — it extracts it to your own Apify dataset

For more information, see [Apify's web scraping guide on legality](https://apify.com/legal).

### ❓ FAQ

**How does article discovery work?**
The extractor tries three methods in order: (1) RSS/Atom feed detection via `<link rel="alternate">` tags or common paths like `/feed` and `/rss.xml`; (2) sitemap.xml parsing; (3) HTML link extraction from the homepage. If one method fails, it falls back to the next automatically.

**Why are some articles returning empty content?**
Some sites use JavaScript to render their content (e.g., heavy React/Next.js sites). Since this extractor uses pure HTTP (no browser), JavaScript-rendered content won't be visible. If you see `success: true` but empty `content`, the site likely requires JavaScript rendering. Try extracting metadata only from the RSS feed instead.

**How fast is extraction?**
Metadata-only mode (RSS) processes 50-100 articles in under 10 seconds — it's just parsing a feed. Full content extraction takes 1-3 seconds per article depending on page size and server speed. 20 articles typically complete in 30-60 seconds.

**Can I use this with paywalled sites?**
No — the extractor does not support authentication or login. It can only access content that's publicly visible without logging in. Some sites offer free articles up to a limit before showing a paywall.

**Why does the Actor find fewer articles than expected?**
The `maxArticles` limit per site applies. Also, RSS feeds typically contain only the 20-50 most recent articles. For older content, use the sitemap method by entering the site URL (not the RSS URL) — sitemaps often contain thousands of URLs.

**Why are images not showing up?**
Some sites serve images with lazy-loading (no `src` attribute until JavaScript runs) or use CSS backgrounds instead of `<img>` tags. The OG image (`og:image` meta tag) is always captured when available. Enable `includeImages: true` and check the `imageUrl` field first.

### 🔗 Other content scrapers

- [Google News Scraper](https://apify.com/automation-lab/google-news-scraper) — scrape Google News results by keyword
- [Bing News Scraper](https://apify.com/automation-lab/bing-news-scraper) — extract news from Bing News search
- [HackerNews Scraper](https://apify.com/automation-lab/hackernews-scraper) — scrape Hacker News posts and comments
- [Webpage to Markdown Converter](https://apify.com/automation-lab/webpage-to-markdown-converter) — convert any webpage to clean Markdown for LLMs
- [ArXiv Scraper](https://apify.com/automation-lab/arxiv-scraper) — extract academic papers from arXiv.org

# Actor input Schema

## `startUrls` (type: `array`):

Enter the website URLs to extract articles from (e.g. https://bbc.com, https://techcrunch.com). Each URL will be scanned for RSS feeds or sitemaps to discover articles.

## `maxArticles` (type: `integer`):

Maximum number of articles to extract per website. Keep low for quick tests (10-20), set higher for full crawls.

## `extractFullContent` (type: `boolean`):

Enable to fetch and extract the full article body text using @mozilla/readability. Disable to return only metadata (title, date, author) from RSS/sitemap — much faster.

## `includeImages` (type: `boolean`):

Include image URLs found in the article content.

## `dateFrom` (type: `string`):

Only extract articles published on or after this date (ISO format: YYYY-MM-DD). Leave empty for no date filter.

## `dateTo` (type: `string`):

Only extract articles published on or before this date (ISO format: YYYY-MM-DD). Leave empty for no date filter.

## `requestTimeout` (type: `integer`):

Timeout for each HTTP request in seconds. Increase for slow sites.

## `maxRetries` (type: `integer`):

Number of times to retry a failed HTTP request before skipping the article.

## Actor input object example

```json
{
  "startUrls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "maxArticles": 10,
  "extractFullContent": true,
  "includeImages": true,
  "requestTimeout": 30,
  "maxRetries": 2
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://feeds.bbci.co.uk/news/rss.xml"
    ],
    "maxArticles": 10,
    "requestTimeout": 30,
    "maxRetries": 2
};

// Run the Actor and wait for it to finish
const run = await client.actor("automation-lab/news-article-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "maxArticles": 10,
    "requestTimeout": 30,
    "maxRetries": 2,
}

# Run the Actor and wait for it to finish
run = client.actor("automation-lab/news-article-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "maxArticles": 10,
  "requestTimeout": 30,
  "maxRetries": 2
}' |
apify call automation-lab/news-article-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=automation-lab/news-article-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "News & Article Extractor",
        "description": "Auto-discover and extract articles from news sites, blogs, and publications. Finds RSS feeds and sitemaps automatically. Outputs title, author, date, full text, images, and metadata. No proxy needed.",
        "version": "0.1",
        "x-build-id": "gyt5kPeqe1syFSmBZ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/automation-lab~news-article-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-automation-lab-news-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/automation-lab~news-article-extractor/runs": {
            "post": {
                "operationId": "runs-sync-automation-lab-news-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/automation-lab~news-article-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-automation-lab-news-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "🌐 Website URLs",
                        "type": "array",
                        "description": "Enter the website URLs to extract articles from (e.g. https://bbc.com, https://techcrunch.com). Each URL will be scanned for RSS feeds or sitemaps to discover articles.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxArticles": {
                        "title": "Max Articles per Site",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of articles to extract per website. Keep low for quick tests (10-20), set higher for full crawls.",
                        "default": 20
                    },
                    "extractFullContent": {
                        "title": "Extract Full Article Content",
                        "type": "boolean",
                        "description": "Enable to fetch and extract the full article body text using @mozilla/readability. Disable to return only metadata (title, date, author) from RSS/sitemap — much faster and cheaper.",
                        "default": true
                    },
                    "includeImages": {
                        "title": "Include Images",
                        "type": "boolean",
                        "description": "Include image URLs found in the article content.",
                        "default": true
                    },
                    "dateFrom": {
                        "title": "Articles From Date",
                        "pattern": "^(\\d{4}-\\d{2}-\\d{2})?$",
                        "type": "string",
                        "description": "Only extract articles published on or after this date (ISO format: YYYY-MM-DD). Leave empty for no date filter."
                    },
                    "dateTo": {
                        "title": "Articles To Date",
                        "pattern": "^(\\d{4}-\\d{2}-\\d{2})?$",
                        "type": "string",
                        "description": "Only extract articles published on or before this date (ISO format: YYYY-MM-DD). Leave empty for no date filter."
                    },
                    "requestTimeout": {
                        "title": "Request Timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Timeout for each HTTP request in seconds. Increase for slow sites.",
                        "default": 30
                    },
                    "maxRetries": {
                        "title": "Max Retries per Request",
                        "minimum": 0,
                        "maximum": 5,
                        "type": "integer",
                        "description": "Number of times to retry a failed HTTP request before skipping the article.",
                        "default": 2
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
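The input-schema fragment above defines the extraction options (`extractFullContent`, `includeImages`, the `dateFrom`/`dateTo` ISO-date filters, and the `requestTimeout`/`maxRetries` bounds). A minimal sketch of building a valid run input in Python, with a local validation helper that mirrors the schema's pattern and range constraints — the field names and limits come from the schema above, but the `validate_input` helper itself is illustrative, not part of the Actor:

```python
import re

# Mirrors the schema's date pattern: empty string or YYYY-MM-DD.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})?$")

def validate_input(run_input: dict) -> list:
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    for key in ("dateFrom", "dateTo"):
        if not DATE_RE.match(run_input.get(key, "")):
            errors.append(f"{key} must be YYYY-MM-DD or empty")
    timeout = run_input.get("requestTimeout", 30)
    if not 5 <= timeout <= 120:
        errors.append("requestTimeout must be between 5 and 120 seconds")
    retries = run_input.get("maxRetries", 2)
    if not 0 <= retries <= 5:
        errors.append("maxRetries must be between 0 and 5")
    return errors

run_input = {
    "extractFullContent": True,   # fetch full article body via Readability
    "includeImages": True,        # keep image URLs found in the content
    "dateFrom": "2025-01-01",     # inclusive lower bound on publish date
    "dateTo": "",                 # empty string disables the upper bound
    "requestTimeout": 30,         # per-request HTTP timeout in seconds
    "maxRetries": 2,              # retries before an article is skipped
}

assert validate_input(run_input) == []
```

Once validated, the same dictionary can be passed as `run_input` to the official client, e.g. `ApifyClient(token).actor("automation-lab/news-article-extractor").call(run_input=run_input)`.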

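The `runsResponseSchema` above describes the run object returned by the Actor-runs API (`status`, `defaultDatasetId`, `usageTotalUsd`, and so on). A small helper for inspecting such a run object is sketched below — the field names follow the schema, while the set of terminal statuses reflects the standard Apify run lifecycle and should be treated as an assumption here:

```python
# Statuses after which a run will not change state again (Apify run
# lifecycle; transient states like READY and RUNNING are excluded).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def run_summary(run: dict) -> dict:
    """Condense a run object (shaped per runsResponseSchema) for polling."""
    data = run["data"]
    return {
        "finished": data["status"] in TERMINAL_STATUSES,
        "status": data["status"],
        "datasetId": data.get("defaultDatasetId"),  # where results land
        "costUsd": data.get("usageTotalUsd", 0.0),
    }

# Example run object with the fields used above.
run = {"data": {"status": "SUCCEEDED",
                "defaultDatasetId": "xyz123",
                "usageTotalUsd": 0.00005}}
summary = run_summary(run)
assert summary["finished"] and summary["datasetId"] == "xyz123"
```

In practice you would rarely poll by hand: `apify-client`'s `.call()` blocks until the run finishes, after which `client.dataset(summary["datasetId"]).list_items()` retrieves the extracted articles.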