# 🧠 Smart Article Extractor (`scrapier/smart-article-extractor`) Actor

- **URL**: https://apify.com/scrapier/smart-article-extractor.md
- **Developed by:** [Scrapier](https://apify.com/scrapier) (community)
- **Categories:** News, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $4.99 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🧠 Smart Article Extractor — News & Blog Scraper

> **One-paragraph summary:** Smart Article Extractor is an Apify Actor that bulk-extracts clean article content — **title, author, publish date, full text, summary, images, videos, in-body links and rich metadata** — from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.

---

### 🚀 Why Choose Us?

| Feature | Smart Article Extractor | Typical 1-URL article scraper |
|---|---|---|
| **Bulk discovery (BFS crawler)** | ✅ Yes | ❌ One URL at a time |
| **Sitemap & robots.txt scanning** | ✅ Built-in | ❌ |
| **Sub-domain / sub-path scoping** | ✅ Per Start URL | ❌ |
| **`onlyNewArticles` cross-run dedup** | ✅ Per-domain & global | ❌ |
| **Date filters** (`dateFrom`, `lastDays`, `mustHaveDate`) | ✅ All three | ⚠️ Limited |
| **Anti-block proxy fallback** (none → DC → RES) | ✅ Automatic | ❌ |
| **Optional Playwright rendering** | ✅ Toggle | ❌ |
| **Extend-output Python hook** | ✅ Inline snippet | ❌ |
| **Live dataset push + state KVS** | ✅ | ⚠️ |

---

### 🔥 Key Features

- 📰 **Clean article extraction** — trafilatura + BeautifulSoup combo for high recall.
- 🌐 **Bulk discovery** — drop a homepage URL and the actor discovers articles via BFS.
- 🗺️ **Sitemap & robots.txt** — automatic `Sitemap:` parsing + common candidates.
- 🛡️ **Smart proxy fallback** — starts direct, then datacenter, then residential.
- 🎭 **Headless browser mode** — Playwright + Chromium for JS-heavy or protected sites.
- 🧠 **Cross-run memory** — `onlyNewArticles` and `onlyNewArticlesPerDomain`.
- 🪜 **Depth / page / article caps** — never over-crawl.
- 📅 **Date filters** — `dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`.
- 🛠️ **`extendOutputFunction`** — inject your own Python `extend(soup, article, html)`.
- 💾 **Save HTML / snapshots** — full HTML in-record or as KVS link, PNG screenshots.

---

### 📥 Input

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Homepages, sections, topic pages — used as crawl seeds. |
| `articleUrls` | array | `[]` | Direct article URLs to extract (no discovery needed). |
| `onlyNewArticles` | boolean | `false` | Skip URLs already seen in any previous run. |
| `onlyNewArticlesPerDomain` | boolean | `false` | Per-domain dedup memory. |
| `onlyInsideArticles` | boolean | `true` | Enqueue only same-domain links from articles. |
| `onlySubdomainArticles` | boolean | `false` | Restrict to URLs sharing the Start URL path prefix. |
| `enqueueFromArticles` | boolean | `false` | Discover further links inside extracted articles. |
| `crawlWholeSubdomain` | boolean | `false` | Treat any same-subdomain link as a category candidate. |
| `scanSitemaps` | boolean | `false` | Discover articles from `robots.txt` and common sitemap paths. |
| `useGoogleBotHeaders` | boolean | `false` | Identify as Googlebot. |
| `useBrowser` | boolean | `false` | Render with headless Chromium. |
| `scrollToBottom` | boolean | `false` | Force lazy-loaded content (browser mode only). |
| `mustHaveDate` | boolean | `true` | Drop articles with no detectable date. |
| `dateFrom` | string (ISO date) | — | Earliest article date. |
| `onlyArticlesForLastDays` | integer | — | Convenience cut-off. |
| `minWords` | integer | `150` | Reject short articles. |
| `maxDepth` | integer | `2` | BFS depth. |
| `maxPagesPerCrawl` | integer | `50` | Hard cap on fetched pages. |
| `maxArticlesPerCrawl` | integer | `25` | Hard cap on saved articles. |
| `maxArticlesPerStartUrl` | integer | `25` | Cap per Start URL. |
| `isUrlArticleDefinition` | object | see schema | URL-shape heuristic. |
| `linkSelector` | string | — | CSS selector restricting where links are collected from. |
| `pseudoUrls` | array | `[]` | Custom URL patterns for category pages. |
| `sitemapUrls` | array | `[]` | Explicit sitemap URLs (skip auto-discovery). |
| `saveHtml` | boolean | `false` | Include raw HTML in the dataset record. |
| `saveHtmlAsLink` | boolean | `false` | Save HTML to KVS and put a link in the record. |
| `saveSnapshots` | boolean | `false` | PNG screenshot (browser mode only). |
| `extendOutputFunction` | string | — | Python snippet — must define `extend(soup, article, html) -> dict`. |
| `proxyConfiguration` | object | `{useApifyProxy: false}` | Default = no proxy; auto-fallback to DC → RES if blocked. |

**Example input:**

```json
{
  "startUrls": ["https://www.theguardian.com"],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": { "useApifyProxy": false }
}
````

***

### 📤 Output

Each pushed record contains:

| Field | Type | Description |
|---|---|---|
| `url`, `loadedUrl` | string | Original / resolved URL. |
| `domain`, `loadedDomain` | string | Bare host. |
| `referrer`, `startUrl` | string | Where the link was discovered. |
| `depth` | integer | BFS depth at time of crawl. |
| `title`, `softTitle` | string | Best-effort headline. |
| `date` | string (ISO) | Publication date if found. |
| `author` | array | Author URL(s) or name(s). |
| `publisher`, `copyright`, `lang`, `favicon`, `canonicalLink` | string | Site metadata. |
| `description`, `keywords` | string | Meta description / keywords. |
| `tags` | array | `article:tag` values. |
| `image` | string | Hero / OG image URL. |
| `videos` | array | `<video> / <iframe> / <source>` URLs. |
| `links` | array of `{text, href}` | Inner-body links. |
| `wordCount` | integer | Word count of the extracted text. |
| `text` | string | Cleaned article body. |
| `html` | string | Full HTML (only if `saveHtml` / `saveHtmlAsLink`). |
| `screenshotUrl` | string | KVS link (only if `saveSnapshots` + `useBrowser`). |

**Example output (truncated):**

```json
{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}
```

***

### 🚀 How to Use (Apify Console)

1. Log in at <https://console.apify.com> → **Actors**.
2. Open **Smart Article Extractor**.
3. Configure inputs (Start URLs, date filters, caps, proxy).
4. Click **Start**.
5. Watch logs in real time; the Actor prints a per-article live feed.
6. Open the **Output** tab once the run completes.
7. Export to JSON / CSV / XLSX or wire to a webhook.

***

### 🤖 Use via API / MCP

```bash
curl -X POST "https://api.apify.com/v2/acts/scrapier~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "startUrls": ["https://www.theguardian.com"],
       "maxArticlesPerCrawl": 5,
       "onlyArticlesForLastDays": 2,
       "proxyConfiguration": {"useApifyProxy": false}
     }'
```

MCP-server tool name: `smart-article-extractor`.

***

### 💡 Best Use Cases

- 📰 News monitoring on a topic / publisher
- 📊 NLP / sentiment / summarisation datasets
- 🏛️ Brand or competitor coverage tracking
- 🔍 SEO / SERP enrichment with full article text
- 📚 Knowledge-base construction for RAG / LLMs
- 🗞️ Press-clipping archives

***

### 💰 Pricing

Pay per event: from $4.99 per 1,000 results, charged on top of Apify platform usage (compute time, proxies, transfer).

***

### ❓ Frequently Asked Questions

**Q: Why are some articles skipped?**\
A: They failed at least one filter — date cut-off, `mustHaveDate`, `minWords`, or `onlyNewArticles` (already seen in a previous run). The log line states which one.

**Q: The site keeps blocking me.**\
A: Leave `proxyConfiguration.useApifyProxy` set to `false`. When a request is blocked, the Actor escalates automatically to **datacenter** and then **residential** proxies, with up to 3 retries on residential. If that still fails, enable `useBrowser`.

**Q: Will it work for paywalled articles?**\
A: It can get past some soft paywalls by sending Googlebot headers (`useGoogleBotHeaders`), but it does not bypass strict authentication.

**Q: How do I keep cross-run memory?**\
A: Toggle `onlyNewArticles` or `onlyNewArticlesPerDomain`. The Actor keeps state in a named key-value store (KVS). If that fails (e.g. a Store run with limited permissions), it falls back to the run's default store.

**Q: Can I customise the output?**\
A: Yes — supply `extendOutputFunction` as a Python snippet defining `extend(soup, article, html) -> dict`. The returned dict is merged into the record.

***

### 🛟 Support & Feedback

Use the **Issues** tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.

***

### ⚖️ Cautions / legal

- Data is collected only from publicly available sources.
- Do not scrape private accounts or content behind authentication unless explicitly authorised.
- The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
- The Actor reads `robots.txt` to discover sitemaps, but it does not enforce robots.txt blocks on crawled URLs; please be a good citizen.

# Actor input Schema

## `startUrls` (type: `array`):

Top-level pages the crawler should start from — homepages, sections, topic pages. Each one is treated as a category page and articles are discovered from it.

## `articleUrls` (type: `array`):

Already-known article URLs to extract directly (no discovery needed). Mix with Website URLs for hybrid runs.

## `onlyNewArticles` (type: `boolean`):

Skip articles that were extracted in any previous run (deduplicated globally via the key-value store). Best for low-volume runs.

## `onlyNewArticlesPerDomain` (type: `boolean`):

Same as above, but the deduplication memory is kept separately per domain — preferable for multi-domain runs.
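
The difference between the two deduplication modes can be pictured with a small sketch. This is not the Actor's implementation: the `SeenStore` class, its in-memory dictionary (standing in for the key-value store) and the bucket names are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


class SeenStore:
    """Hypothetical stand-in for the key-value store that remembers seen URLs."""

    def __init__(self, per_domain: bool):
        self.per_domain = per_domain
        self.buckets = {}  # bucket name -> set of URLs already extracted

    def _bucket_key(self, url: str) -> str:
        # Global mode keeps one shared bucket; per-domain mode keeps one per host.
        if not self.per_domain:
            return "GLOBAL"
        return urlsplit(url).hostname or "UNKNOWN"

    def is_new(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        bucket = self.buckets.setdefault(self._bucket_key(url), set())
        if url in bucket:
            return False
        bucket.add(url)
        return True
```

With per-domain buckets, each domain's memory stays small and can be loaded on its own, which is one plausible reason the schema calls that mode preferable for multi-domain runs.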

## `onlyInsideArticles` (type: `boolean`):

When enqueueing from an article, accept only links that point back to the same registrable domain.

## `enqueueFromArticles` (type: `boolean`):

Discover further article links inside extracted articles and add them to the crawl queue.

## `crawlWholeSubdomain` (type: `boolean`):

Treat every same-subdomain link as a potential category page (depth-limited).

## `onlySubdomainArticles` (type: `boolean`):

Restrict articles to URLs starting with the same path prefix as the Start URL (e.g. example.com/news/\*).
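
A path-prefix check of this kind might look like the sketch below. The function name `within_start_scope` and the trailing-slash normalisation are assumptions; the Actor's own matching rules may differ.

```python
from urllib.parse import urlsplit


def within_start_scope(candidate_url: str, start_url: str) -> bool:
    """Return True if the candidate shares the Start URL's host and path prefix."""
    cand, start = urlsplit(candidate_url), urlsplit(start_url)
    if cand.hostname != start.hostname:
        return False
    # Normalise the prefix so "/news" does not also match "/newsletter".
    prefix = start.path.rstrip("/") + "/"
    return (cand.path.rstrip("/") + "/").startswith(prefix)
```

For a Start URL of `https://example.com/news/`, this accepts `https://example.com/news/some-story` and rejects `https://example.com/newsletter/signup`.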

## `scanSitemaps` (type: `boolean`):

Discover article URLs from robots.txt → Sitemap entries and the usual /sitemap.xml candidates. Disable if it produces too many noisy candidates.
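
As a rough illustration of the discovery step, the sketch below pulls `Sitemap:` entries out of a robots.txt body and appends the conventional `/sitemap.xml` location. The function name and the single fallback path are assumptions; the Actor's actual list of candidates is not documented here.

```python
from urllib.parse import urljoin


def sitemap_candidates(site_root: str, robots_txt: str) -> list:
    """Collect sitemap URLs from robots.txt text plus a conventional fallback."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        # The directive name is case-insensitive; the value is the rest of the line.
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Assumed fallback: the conventional /sitemap.xml location.
    fallback = urljoin(site_root, "/sitemap.xml")
    if fallback not in found:
        found.append(fallback)
    return found
```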

## `sitemapUrls` (type: `array`):

Explicit sitemap URLs — skips auto-discovery and only uses these. Safer than full robots.txt scanning.

## `saveHtml` (type: `boolean`):

Include the full page HTML in the dataset record (produces large records).

## `saveHtmlAsLink` (type: `boolean`):

Save HTML to the run's key-value store and put the link in the record (smaller dataset).

## `saveSnapshots` (type: `boolean`):

Take a PNG screenshot of every article. Only effective when the headless browser is enabled.

## `useGoogleBotHeaders` (type: `boolean`):

Send the Googlebot User-Agent + headers. Many publishers allow Googlebot through paywalls / soft-blocks.

## `minWords` (type: `integer`):

Reject articles whose extracted text has fewer than this many words.

## `dateFrom` (type: `string`):

ISO date (YYYY-MM-DD). Only keep articles published on or after this date.

## `onlyArticlesForLastDays` (type: `integer`):

Drop anything older than X days. Combined with dateFrom, the stricter of the two wins.
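
"The stricter of the two wins" can be read as "the later cut-off date applies". A minimal sketch of that reading follows; the function name, the use of calendar dates rather than timestamps, and the inclusive boundary are assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def effective_cutoff(date_from: Optional[str], last_days: Optional[int], today: date) -> Optional[date]:
    """Return the earliest publication date that still passes both filters."""
    cutoffs = []
    if date_from:
        cutoffs.append(date.fromisoformat(date_from))
    if last_days is not None:
        cutoffs.append(today - timedelta(days=last_days))
    # "Stricter" means the later of the two cut-off dates.
    return max(cutoffs) if cutoffs else None
```

For example, with `dateFrom` set to `2026-05-01` and `onlyArticlesForLastDays` set to `2` on 21 May 2026, the effective cut-off under this reading is 19 May 2026.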

## `mustHaveDate` (type: `boolean`):

Drop articles where no publication-date metadata can be detected.

## `isUrlArticleDefinition` (type: `object`):

Heuristics for classifying a URL as an article. minDashes = minimum dashes in the path, hasDate = path contains a /YYYY/MM/DD/ pattern, linkIncludes = substrings that mark a URL as an article.
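
To make the three signals concrete, here is a sketch of a URL classifier built from them. The schema does not say how the signals combine, so the "any signal is enough" rule below is an assumption, as are the function name and the exact date pattern.

```python
import re
from urllib.parse import urlsplit

# Matches a numeric /YYYY/MM/DD/ segment in the path.
DATE_IN_PATH = re.compile(r"/\d{4}/\d{2}/\d{2}/")


def looks_like_article(url: str, min_dashes: int, has_date: bool, link_includes: list) -> bool:
    """Assumed rule: a URL qualifies if ANY enabled signal matches."""
    path = urlsplit(url).path
    if min_dashes and path.count("-") >= min_dashes:
        return True
    # Append "/" so a date at the very end of the path still matches.
    if has_date and DATE_IN_PATH.search(path + "/"):
        return True
    return any(fragment in url for fragment in link_includes)
```

Under this rule, long slug-style paths pass on the dash count alone, while short paths need either a date segment or one of the `linkIncludes` substrings.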

## `pseudoUrls` (type: `array`):

Additional URL patterns (\[.\*], \[\d+]) that mark a page as a crawlable category. If you want to enqueue direct article URLs this way, you have to add { "label": "article" } to the userData.

## `linkSelector` (type: `string`):

Optional CSS selector restricting which parts of a category page contribute links (e.g. main a, .article-list a).

## `maxDepth` (type: `integer`):

Maximum BFS depth from the Start URL (Start URL = 0). Empty = no extra cap.

## `maxPagesPerCrawl` (type: `integer`):

Hard cap on pages fetched in one run (articles + category pages combined).

## `maxArticlesPerCrawl` (type: `integer`):

Hard cap on extracted articles per run.

## `maxArticlesPerStartUrl` (type: `integer`):

Cap how many articles can be attributed to a single Start URL.
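
The way the depth, page and article caps interact can be sketched as a plain breadth-first walk. This is an illustration only: `get_links` and `is_article` are hypothetical callables supplied by the caller, the per-Start-URL cap is left out, and the Actor's real scheduling is likely more involved.

```python
from collections import deque
from typing import Callable


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], list],
    is_article: Callable[[str], bool],
    max_depth: int = 2,
    max_pages: int = 50,
    max_articles: int = 25,
) -> list:
    """Breadth-first walk that stops at the depth, page, or article cap."""
    queue = deque([(start_url, 0)])  # the Start URL has depth 0
    seen = {start_url}
    pages_fetched = 0
    articles = []
    while queue and pages_fetched < max_pages and len(articles) < max_articles:
        url, depth = queue.popleft()
        pages_fetched += 1  # every dequeued URL counts as one fetched page
        if is_article(url):
            articles.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return articles
```

Because the page cap counts category pages as well as articles, a low `maxPagesPerCrawl` can end a run before `maxArticlesPerCrawl` is reached.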

## `maxConcurrency` (type: `integer`):

How many fetches the crawler may run in parallel. Higher = faster, but more pressure on the target site and proxy quota. Leave empty for safe sequential mode.

## `proxyConfiguration` (type: `object`):

Proxy settings. Default = NO PROXY (direct). If the target blocks the request, the actor automatically falls back to DATACENTER, then RESIDENTIAL (with up to 3 retries on residential). Once a fallback occurs, it sticks.
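
The escalation described above behaves like a small state machine. The sketch below models it under stated assumptions: `try_fetch(url, tier)` is a hypothetical callable that returns HTML or `None` when blocked, the tier names are illustrative, and "up to 3 retries" is read as one attempt plus three retries.

```python
from typing import Callable, Optional

TIERS = ["none", "datacenter", "residential"]
RESIDENTIAL_RETRIES = 3  # taken from the description above


class ProxyEscalator:
    """Remembers the highest proxy tier reached so later requests start there."""

    def __init__(self) -> None:
        self.tier_index = 0  # start with a direct connection

    def fetch(self, url: str, try_fetch: Callable[[str, str], Optional[str]]) -> Optional[str]:
        """Try the current tier, escalating one tier at a time while blocked."""
        while True:
            tier = TIERS[self.tier_index]
            attempts = (1 + RESIDENTIAL_RETRIES) if tier == "residential" else 1
            for _ in range(attempts):
                html = try_fetch(url, tier)
                if html is not None:
                    return html
            if self.tier_index == len(TIERS) - 1:
                return None  # every tier is exhausted
            self.tier_index += 1  # escalate; the new tier sticks for later calls
```

Keeping `tier_index` on the instance is what makes the fallback "stick": once a run has moved to datacenter proxies, later requests skip the direct attempt.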

## `useBrowser` (type: `boolean`):

Render with Chromium when raw HTTP fails or the page is JS-heavy. Slower but bypasses many anti-bot walls.

## `pageWaitMs` (type: `integer`):

Extra time to wait after navigation finishes (milliseconds). Useful for lazily-loaded scripts.

## `waitUntil` (type: `string`):

Which navigation event Playwright waits for before considering the page ready.

## `categoryWaitForSelector` (type: `string`):

Optional CSS selector. The browser will wait for this element to appear before extracting links from category pages.

## `articleWaitForSelector` (type: `string`):

Optional CSS selector. The browser will wait for this element to appear before extracting article content.

## `scrollToBottom` (type: `boolean`):

Auto-scroll to the bottom of category/article pages so lazy-loaded content is rendered.

## `scrollToBottomButtonSelector` (type: `string`):

Optional CSS selector for a 'Load more' button. The crawler will click it repeatedly while scrolling.

## `scrollToBottomMaxSeconds` (type: `integer`):

Maximum time spent scrolling per page (safety cap).

## `extendOutputFunction` (type: `string`):

Only needed if you want more data than is included in the default output. Keep in mind that you should provide a valid Python function: def extend(soup, article, html): return {...}. The returned dict is merged into each article record.
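
A possible `extend` function is sketched below; its source would be passed as the string value of `extendOutputFunction`. It assumes `soup` is a BeautifulSoup document and `article` is a dict-like record with a `wordCount` key, which matches the output table but is not confirmed by the schema. The three returned field names are invented for the example.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the article record."""
    # Assumption: `soup` supports BeautifulSoup's find(); `article` behaves like a dict.
    canonical = soup.find("link", rel="canonical")
    words = article.get("wordCount") or 0
    return {
        "canonicalHref": canonical.get("href") if canonical else None,
        # Rough estimate at 200 words per minute, never below one minute.
        "readingTimeMinutes": max(1, round(words / 200)),
        "htmlLength": len(html or ""),
    }
```

Choosing key names that do not collide with the default output fields avoids any ambiguity about which value wins when the dict is merged.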

## `maxCUs` (type: `integer`):

Soft cap on Apify Compute Units this run may consume. The actor checks usage between requests and exits gracefully when the cap is hit. Leave empty for no cap.

## `notificationEmails` (type: `array`):

Email addresses to notify when the CU thresholds below are crossed.

## `notifyAfterCUs` (type: `integer`):

Send a one-time notification once this many CUs have been consumed.

## `notifyAfterCUsEvery` (type: `integer`):

Send a notification every N CUs after the initial notification threshold.
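
One way to read the two thresholds together is sketched below: a first notification at `notifyAfterCUs`, then one more for every further `notifyAfterCUsEvery` units. The function name and the treatment of a missing or zero interval are assumptions, not documented behaviour.

```python
def notifications_due(consumed_cus: float, notify_after: int, notify_every: int) -> int:
    """Return how many notifications should have been sent at this usage level."""
    if consumed_cus < notify_after:
        return 0
    if notify_every <= 0:
        return 1  # only the one-time notification
    # One initial notification plus one per completed interval after it.
    return 1 + int((consumed_cus - notify_after) // notify_every)
```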

## Actor input object example

```json
{
  "startUrls": [
    "https://www.theguardian.com"
  ],
  "articleUrls": [],
  "onlyNewArticles": false,
  "onlyNewArticlesPerDomain": false,
  "onlyInsideArticles": true,
  "enqueueFromArticles": false,
  "crawlWholeSubdomain": false,
  "onlySubdomainArticles": false,
  "scanSitemaps": false,
  "sitemapUrls": [],
  "saveHtml": false,
  "saveHtmlAsLink": false,
  "saveSnapshots": false,
  "useGoogleBotHeaders": false,
  "minWords": 150,
  "mustHaveDate": true,
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": [
      "article",
      "storyid",
      "?p=",
      "id=",
      "/fpss/track",
      ".html",
      "/content/"
    ]
  },
  "pseudoUrls": [],
  "maxDepth": 2,
  "maxPagesPerCrawl": 50,
  "maxArticlesPerCrawl": 25,
  "maxArticlesPerStartUrl": 25,
  "maxConcurrency": 1,
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "useBrowser": false,
  "pageWaitMs": 0,
  "waitUntil": "load",
  "scrollToBottom": false,
  "scrollToBottomMaxSeconds": 60,
  "extendOutputFunction": "# def extend(soup, article, html):\n#     return {\"pageTitle\": soup.title.string.strip() if soup.title else None}\n",
  "notificationEmails": []
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://www.theguardian.com"
    ],
    "articleUrls": [],
    "proxyConfiguration": {
        "useApifyProxy": false
    },
    "extendOutputFunction": `# def extend(soup, article, html):
#     return {"pageTitle": soup.title.string.strip() if soup.title else None}`
};

// Run the Actor and wait for it to finish
const run = await client.actor("scrapier/smart-article-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": ["https://www.theguardian.com"],
    "articleUrls": [],
    "proxyConfiguration": { "useApifyProxy": False },
    "extendOutputFunction": """# def extend(soup, article, html):
#     return {\"pageTitle\": soup.title.string.strip() if soup.title else None}
""",
}

# Run the Actor and wait for it to finish
run = client.actor("scrapier/smart-article-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
# A quoted heredoc passes the JSON through unchanged, whatever the shell's echo does with backslashes
cat <<'EOF' | apify call scrapier/smart-article-extractor --silent --output-dataset
{
  "startUrls": [
    "https://www.theguardian.com"
  ],
  "articleUrls": [],
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "extendOutputFunction": "# def extend(soup, article, html):\n#     return {\"pageTitle\": soup.title.string.strip() if soup.title else None}\n"
}
EOF

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=scrapier/smart-article-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "🧠 Smart Article Extractor",
        "description": null,
        "version": "0.3",
        "x-build-id": "TvaCmQgjsQalPmyJ2"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/scrapier~smart-article-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-scrapier-smart-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/scrapier~smart-article-extractor/runs": {
            "post": {
                "operationId": "runs-sync-scrapier-smart-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/scrapier~smart-article-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-scrapier-smart-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "🌐 Website / Category URLs",
                        "type": "array",
                        "description": "Top-level pages the crawler should start from — homepages, sections, topic pages. Each one is treated as a category page and articles are discovered from it.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "articleUrls": {
                        "title": "📰 Article URLs",
                        "type": "array",
                        "description": "Already-known article URLs to extract directly (no discovery needed). Mix with Website URLs for hybrid runs.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "onlyNewArticles": {
                        "title": "🆕 Only new articles (only for small runs)",
                        "type": "boolean",
                        "description": "Skip articles that were extracted in any previous run (deduplicated globally via the key-value store). Best for low-volume runs.",
                        "default": false
                    },
                    "onlyNewArticlesPerDomain": {
                        "title": "🌍 Only new articles (saved per domain, preferable)",
                        "type": "boolean",
                        "description": "Same as above, but the deduplication memory is kept separately per domain — preferable for multi-domain runs.",
                        "default": false
                    },
                    "onlyInsideArticles": {
                        "title": "🔗 Only inside domain articles",
                        "type": "boolean",
                        "description": "When enqueueing from an article, accept only links that point back to the same registrable domain.",
                        "default": true
                    },
                    "enqueueFromArticles": {
                        "title": "🧭 Enqueue articles from articles",
                        "type": "boolean",
                        "description": "Discover further article links inside extracted articles and add them to the crawl queue.",
                        "default": false
                    },
                    "crawlWholeSubdomain": {
                        "title": "🕸️ Crawl whole subdomain (same base as Start URL)",
                        "type": "boolean",
                        "description": "Treat every same-subdomain link as a potential category page (depth-limited).",
                        "default": false
                    },
                    "onlySubdomainArticles": {
                        "title": "🏷️ Limit articles to only from subdomain",
                        "type": "boolean",
                        "description": "Restrict articles to URLs starting with the same path prefix as the Start URL (e.g. example.com/news/*).",
                        "default": false
                    },
                    "scanSitemaps": {
                        "title": "🗺️ Find articles in sitemaps (caution)",
                        "type": "boolean",
                        "description": "Discover article URLs from robots.txt → Sitemap entries and the usual /sitemap.xml candidates. Disable if it produces too many noisy candidates.",
                        "default": false
                    },
                    "sitemapUrls": {
                        "title": "🗺️ Sitemap URLs (safer)",
                        "type": "array",
                        "description": "Explicit sitemap URLs — skips auto-discovery and only uses these. Safer than full robots.txt scanning.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "saveHtml": {
                        "title": "💾 Save full HTML",
                        "type": "boolean",
                        "description": "Include the full page HTML in the dataset record (produces large records).",
                        "default": false
                    },
                    "saveHtmlAsLink": {
                        "title": "🔗 Save full HTML (only as link to it)",
                        "type": "boolean",
                        "description": "Save HTML to the run's key-value store and put the link in the record (smaller dataset).",
                        "default": false
                    },
                    "saveSnapshots": {
                        "title": "📸 Save screenshots of article pages (browser only)",
                        "type": "boolean",
                        "description": "Take a PNG screenshot of every article. Only effective when the headless browser is enabled.",
                        "default": false
                    },
                    "useGoogleBotHeaders": {
                        "title": "🤖 Use Googlebot headers",
                        "type": "boolean",
                        "description": "Send the Googlebot User-Agent + headers. Many publishers allow Googlebot through paywalls / soft-blocks.",
                        "default": false
                    },
                    "minWords": {
                        "title": "📏 Minimum words",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Reject articles whose extracted text has fewer than this many words.",
                        "default": 150
                    },
                    "dateFrom": {
                        "title": "📆 Extract articles from [date]",
                        "type": "string",
                        "description": "ISO date (YYYY-MM-DD). Only keep articles published on or after this date."
                    },
                    "onlyArticlesForLastDays": {
                        "title": "🕒 Only articles for last X days",
                        "minimum": 0,
                        "maximum": 3650,
                        "type": "integer",
                        "description": "Drop anything older than X days. Combined with dateFrom, the stricter of the two wins."
                    },
                    "mustHaveDate": {
                        "title": "📅 Must have date",
                        "type": "boolean",
                        "description": "Drop articles where no publication-date metadata can be detected.",
                        "default": true
                    },
                    "isUrlArticleDefinition": {
                        "title": "🧪 Is the URL an article?",
                        "type": "object",
                        "description": "Heuristics for classifying a URL as an article. minDashes = minimum dashes in the path, hasDate = path contains a /YYYY/MM/DD/ pattern, linkIncludes = substrings that mark a URL as an article.",
                        "default": {
                            "minDashes": 4,
                            "hasDate": true,
                            "linkIncludes": [
                                "article",
                                "storyid",
                                "?p=",
                                "id=",
                                "/fpss/track",
                                ".html",
                                "/content/"
                            ]
                        }
                    },
                    "pseudoUrls": {
                        "title": "🧩 Pseudo URLs",
                        "type": "array",
                        "description": "Additional URL patterns ([.*], [\\d+]) that mark a page as a crawlable category. If you want to enqueue direct article URLs this way, you have to add { \"label\": \"article\" } to the userData.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "purl"
                            ],
                            "properties": {
                                "purl": {
                                    "type": "string",
                                    "title": "Pseudo-URL of a web page"
                                }
                            }
                        }
                    },
                    "linkSelector": {
                        "title": "🎯 Link selector",
                        "type": "string",
                        "description": "Optional CSS selector restricting which parts of a category page contribute links (e.g. main a, .article-list a)."
                    },
                    "maxDepth": {
                        "title": "🪜 Max depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum BFS depth from the Start URL (Start URL = 0). Empty = no extra cap.",
                        "default": 2
                    },
                    "maxPagesPerCrawl": {
                        "title": "📃 Max pages per crawl",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Hard cap on pages fetched in one run (articles + category pages combined).",
                        "default": 50
                    },
                    "maxArticlesPerCrawl": {
                        "title": "✨ Max articles per crawl",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Hard cap on extracted articles per run.",
                        "default": 25
                    },
                    "maxArticlesPerStartUrl": {
                        "title": "🎯 Max articles per start URL",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Cap how many articles can be attributed to a single Start URL.",
                        "default": 25
                    },
                    "maxConcurrency": {
                        "title": "⚡ Max concurrency",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "How many fetches the crawler may run in parallel. Higher = faster, but more pressure on the target site and proxy quota. Leave empty for safe sequential mode.",
                        "default": 1
                    },
                    "proxyConfiguration": {
                        "title": "🛡️ Proxy configuration",
                        "type": "object",
                        "description": "Proxy settings. Default = NO PROXY (direct). If the target blocks the request, the Actor automatically falls back to DATACENTER, then RESIDENTIAL (with up to 3 retries on residential). Once a fallback occurs, it sticks."
                    },
                    "useBrowser": {
                        "title": "🎭 Use browser (Playwright)",
                        "type": "boolean",
                        "description": "Render with Chromium when raw HTTP fails or the page is JS-heavy. Slower but bypasses many anti-bot walls.",
                        "default": false
                    },
                    "pageWaitMs": {
                        "title": "⏱️ Wait on each page (ms)",
                        "minimum": 0,
                        "maximum": 60000,
                        "type": "integer",
                        "description": "Extra time to wait after navigation finishes (milliseconds). Useful for lazily-loaded scripts.",
                        "default": 0
                    },
                    "waitUntil": {
                        "title": "🚦 Wait until navigation event is finished",
                        "enum": [
                            "load",
                            "domcontentloaded",
                            "networkidle",
                            "commit"
                        ],
                        "type": "string",
                        "description": "Which navigation event Playwright waits for before considering the page ready.",
                        "default": "load"
                    },
                    "categoryWaitForSelector": {
                        "title": "🗂️ Wait for selector on each category page",
                        "type": "string",
                        "description": "Optional CSS selector. The browser will wait for this element to appear before extracting links from category pages."
                    },
                    "articleWaitForSelector": {
                        "title": "📰 Wait for selector on each article page",
                        "type": "string",
                        "description": "Optional CSS selector. The browser will wait for this element to appear before extracting article content."
                    },
                    "scrollToBottom": {
                        "title": "🖱️ Scroll to bottom of the page (infinite scroll)",
                        "type": "boolean",
                        "description": "Auto-scroll to the bottom of category/article pages so lazy-loaded content is rendered.",
                        "default": false
                    },
                    "scrollToBottomButtonSelector": {
                        "title": "🔘 Scroll to bottom button selector",
                        "type": "string",
                        "description": "Optional CSS selector for a 'Load more' button. The crawler will click it repeatedly while scrolling."
                    },
                    "scrollToBottomMaxSeconds": {
                        "title": "⏲️ Scroll to bottom max seconds",
                        "minimum": 1,
                        "maximum": 600,
                        "type": "integer",
                        "description": "Maximum time spent scrolling per page (safety cap).",
                        "default": 60
                    },
                    "extendOutputFunction": {
                        "title": "🛠️ Extend output function",
                        "type": "string",
                        "description": "Only needed if you want more data than is included in the default output. Keep in mind that you should provide a valid Python function: def extend(soup, article, html): return {...}. The returned dict is merged into each article record."
                    },
                    "maxCUs": {
                        "title": "🧮 Limit CU consumption",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Soft cap on Apify Compute Units this run may consume. The Actor checks usage between requests and exits gracefully when the cap is hit. Leave empty for no cap."
                    },
                    "notificationEmails": {
                        "title": "📧 Email addresses for notifications",
                        "type": "array",
                        "description": "Email addresses to notify when the CU thresholds below are crossed.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "notifyAfterCUs": {
                        "title": "🔔 Notify after [number] CUs",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Send a one-time notification once this many CUs have been consumed."
                    },
                    "notifyAfterCUsEvery": {
                        "title": "🔁 Notify every [number] CUs",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Send a notification every N CUs after the initial notification threshold."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
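
The schema above says that when `dateFrom` and `onlyArticlesForLastDays` are both set, "the stricter of the two wins". A minimal sketch of how that rule can be read, assuming "stricter" means the later cutoff date: the helper `effective_cutoff` and its parameter names are hypothetical and are not part of the Actor. Whether the Actor treats `onlyArticlesForLastDays: 0` as "today only" or as "disabled" is not stated in the schema; this sketch treats it as "today only".

```python
from datetime import date, timedelta
from typing import Optional


def effective_cutoff(
    date_from: Optional[str],
    only_last_days: Optional[int],
    today: date,
) -> Optional[date]:
    """Return the earliest allowed publication date, or None if no filter is set.

    Hypothetical helper: illustrates the documented rule, not the Actor's code.
    """
    candidates = []
    if date_from:
        # dateFrom is documented as an ISO date (YYYY-MM-DD).
        candidates.append(date.fromisoformat(date_from))
    if only_last_days is not None:
        candidates.append(today - timedelta(days=only_last_days))
    # Assumption: the stricter filter is the one with the later cutoff date.
    return max(candidates) if candidates else None


today = date(2025, 1, 8)
print(effective_cutoff("2024-12-01", 7, today))   # 2025-01-01
print(effective_cutoff("2025-01-05", 30, today))  # 2025-01-05
```

In the first call the 7-day window is stricter than `dateFrom`; in the second, `dateFrom` is stricter than the 30-day window.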

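The `extendOutputFunction` input expects the source of a Python function with the signature `def extend(soup, article, html)`, whose returned dict is merged into each article record. The following is a sketch of such a function, not a documented example. It relies only on the raw `html` string and the `article` dict, because the schema does not state the type of `soup` (the name suggests a BeautifulSoup object, but that is an assumption). The `text` key read from `article` and the output fields `section` and `readingTimeMinutes` are hypothetical names chosen for illustration.

```python
import math
import re
from typing import Any, Dict

# Matches <meta property="article:section" content="...">.
# Assumes the property attribute appears before content.
_SECTION_RE = re.compile(
    r'<meta[^>]+property=["\']article:section["\'][^>]+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def extend(soup: Any, article: Dict[str, Any], html: str) -> Dict[str, Any]:
    """Return extra fields to merge into the article record."""
    match = _SECTION_RE.search(html or "")
    # "text" is an assumed key name for the extracted article body.
    words = len((article.get("text") or "").split())
    return {
        "section": match.group(1) if match else None,
        # Rough estimate at 200 words per minute, never below 1.
        "readingTimeMinutes": max(1, math.ceil(words / 200)),
    }


sample_html = (
    '<html><head><meta property="article:section" content="Technology">'
    "</head></html>"
)
sample_article = {"text": "word " * 400}
print(extend(None, sample_article, sample_html))
# {'section': 'Technology', 'readingTimeMinutes': 2}
```

Because the input field has type `string`, the function source would be passed as a string value in the Actor input rather than as a callable.
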