# News Article Extractor for AI & RAG (`wiry_kingdom/news-article-extractor-ai`) Actor

Extract clean, structured JSON from any news article or blog post - title, authors, published date, full content, keywords, images. Perfect for LLM training data, RAG pipelines, content monitoring and news aggregation. Uses JSON-LD, Open Graph and readability heuristics.

- **URL**: https://apify.com/wiry\_kingdom/news-article-extractor-ai.md
- **Developed by:** [Mohieldin Mohamed](https://apify.com/wiry_kingdom) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## News Article Extractor for AI & RAG

**Turn any news article or blog post URL into clean, structured JSON in one API call.** News Article Extractor pulls the title, authors, publish date, full content, keywords, and images from any news site or blog - ready to drop straight into your **LLM training pipeline, RAG system, or content database**.

No more writing custom CSS selectors for every new site. No more stripping ads, nav bars, and cookie banners by hand. Paste a URL, get a perfect JSON payload.

### What does News Article Extractor for AI & RAG do?

This Actor fetches any article URL and runs a layered extraction pipeline to get the cleanest possible text:

1. **JSON-LD schemas** - Most news sites publish `NewsArticle` / `Article` structured data. This is the highest-fidelity source for title, author, and publish date.
2. **Open Graph + Twitter Cards** - Fallback metadata used by virtually every modern site.
3. **`<article>` and `[itemprop="articleBody"]` tags** - Semantic HTML extraction.
4. **Readability heuristics** - Longest `<p>` cluster for sites that don't use any of the above.

Noise (ads, nav bars, share buttons, newsletter forms, related-article widgets, paywall prompts) is stripped before content extraction. The final output is a clean text body plus all the metadata an LLM or analytics pipeline needs.
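
The layered fallback above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Actor's own code: the helper names `extract_jsonld` and `extract_metadata`, the regex-based parsing, and the `opengraph` / `none` method labels are assumptions made for the example, and only the first two layers (JSON-LD, then Open Graph) are shown.

```python
import json
import re

# Hypothetical helpers for illustration; regexes stand in for a real HTML parser.
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
# Groups: (1) property name after "og:", (2) quote character, (3) content value.
OG_RE = re.compile(
    r'<meta[^>]*property=["\']og:([\w:]+)["\'][^>]*content=(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)
ARTICLE_TYPES = {"NewsArticle", "Article"}


def extract_jsonld(html):
    """Layer 1: return the first JSON-LD object typed as an article, or None."""
    for raw in JSONLD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it and keep looking
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict):
                continue
            types = node.get("@type")
            types = types if isinstance(types, list) else [types]
            if ARTICLE_TYPES.intersection(t for t in types if isinstance(t, str)):
                return node
    return None


def extract_metadata(html):
    """Try JSON-LD first, then fall back to Open Graph tags."""
    node = extract_jsonld(html)
    if node is not None:
        return {
            "title": node.get("headline"),
            "publishedAt": node.get("datePublished"),
            "method": "jsonld",
        }
    og = {name: value for name, _quote, value in OG_RE.findall(html)}
    if og:
        # "opengraph" is an assumed label, not a documented extractionMethod value.
        return {"title": og.get("title"), "publishedAt": None, "method": "opengraph"}
    return {"title": None, "publishedAt": None, "method": "none"}
```

The point of the ordering is that each layer only runs when the more reliable one before it found nothing, so structured data always wins over heuristics.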

### Why use News Article Extractor?

- **RAG pipelines** - Ingest articles into vector databases without cleanup work. Every output already has a word count, reading time, and canonical URL.
- **LLM fine-tuning** - Build high-quality training datasets of article bodies stripped of boilerplate.
- **Content monitoring** - Track what a publisher is posting over time and pipe it into your analytics stack.
- **News aggregators** - Build a Feedly clone or topic-tracking dashboard without scraping each site individually.
- **Sentiment analysis** - Get clean text inputs for your NLP models without fighting site-specific HTML.
- **SEO research** - Extract every competitor article on a topic and analyze their structure, word counts, and keywords.

Built on the Apify platform: scheduling, API access, proxy rotation, webhook integrations, and monitoring are included.

### How to use News Article Extractor for AI & RAG

1. Click **Try for free** and sign in to Apify
2. Paste the article URLs you want to extract into the **Article URLs** field
3. (Optional) Set a **Minimum word count** to skip homepages and category listings
4. Click **Start** - the Actor processes URLs in parallel
5. Open the **Output** tab to view or download results

You can also trigger the Actor from your own code via the Apify API - pass a list of URLs in the JSON body and poll for results.

### Input

```json
{
    "startUrls": [
        { "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o" },
        { "url": "https://techcrunch.com/2026/04/12/ai-roundup" }
    ],
    "minWordCount": 200,
    "includeHtml": false,
    "maxRequestsPerCrawl": 100
}
````

| Field | Type | Description |
|-------|------|-------------|
| `startUrls` | array | List of URLs to extract. Each entry is `{ "url": "..." }`. Required. |
| `minWordCount` | integer | Skip articles shorter than this. Default: 0 (accept all). |
| `includeHtml` | boolean | Also return raw HTML. Default: false. |
| `maxRequestsPerCrawl` | integer | Safety cap on requests. Default: 100, max: 5000. |
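
If you build the input in code, it can help to check the documented limits before starting a run. The helper below is a minimal sketch: `build_input` is a hypothetical name, the defaults mirror the table above, the lower bound of 1 for `maxRequestsPerCrawl` comes from the OpenAPI specification further down, and rejecting an empty URL list is an assumption about what "required" means in practice.

```python
def build_input(urls, min_word_count=0, include_html=False, max_requests_per_crawl=100):
    """Build an Actor input dict, checking the documented limits first."""
    if not urls:
        # Assumption: a run without any URL is treated as invalid here.
        raise ValueError("startUrls needs at least one URL")
    if min_word_count < 0:
        raise ValueError("minWordCount must be >= 0")
    if not 1 <= max_requests_per_crawl <= 5000:
        raise ValueError("maxRequestsPerCrawl must be between 1 and 5000")
    return {
        "startUrls": [{"url": url} for url in urls],
        "minWordCount": min_word_count,
        "includeHtml": include_html,
        "maxRequestsPerCrawl": max_requests_per_crawl,
    }
```

For example, `build_input(["https://www.bbc.com/news/articles/cq8v4dqj9y7o"], min_word_count=200)` returns a dict that can be passed as the run input.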

### Output

```json
{
    "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
    "statusCode": 200,
    "title": "Major AI breakthrough announced today",
    "description": "Researchers report new advances...",
    "authors": ["Jane Doe"],
    "publishedAt": "2026-04-13T08:00:00Z",
    "modifiedAt": "2026-04-13T10:15:00Z",
    "image": "https://ichef.bbci.co.uk/news/1024/...",
    "siteName": "BBC News",
    "language": "en",
    "content": "The full cleaned body of the article...",
    "wordCount": 842,
    "readingTimeMinutes": 4,
    "keywords": ["AI", "machine learning", "research"],
    "canonicalUrl": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
    "extractionMethod": "jsonld",
    "extractedAt": "2026-04-13T19:42:17.301Z"
}
```

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel from the Output tab.

#### Output fields

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | The canonical URL of the article |
| `statusCode` | integer | HTTP status code of the response |
| `title` | string | Article headline |
| `description` | string | Summary / subtitle |
| `authors` | array | List of author names |
| `publishedAt` | string | ISO timestamp of publication |
| `modifiedAt` | string | ISO timestamp of last edit |
| `image` | string | Lead image URL |
| `siteName` | string | Publisher site name |
| `language` | string | ISO 639 language code |
| `content` | string | Clean body text with noise removed |
| `wordCount` | integer | Number of words in the content |
| `readingTimeMinutes` | integer | Estimated reading time at 220 wpm |
| `keywords` | array | Article tags and keywords |
| `canonicalUrl` | string | Canonical URL from `<link rel="canonical">` |
| `extractionMethod` | string | Which extraction strategy succeeded (`jsonld`, `article-tag`, `readability`) |
| `extractedAt` | string | ISO timestamp of when the article was extracted |
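
As a rough illustration of how `wordCount` and `readingTimeMinutes` relate, the sketch below splits the text on whitespace and divides by the 220 wpm rate from the table. The function name `reading_stats` and the round-up rule are assumptions; the Actor's exact tokenization and rounding are not documented here, although rounding up does reproduce the sample output (842 words, 4 minutes).

```python
import math

WORDS_PER_MINUTE = 220  # reading rate stated in the output fields table


def reading_stats(content):
    """Return word count and estimated reading time for a text body."""
    word_count = len(content.split())  # whitespace tokenization (assumed)
    # Round up so that any non-empty article reports at least 1 minute (assumed).
    minutes = math.ceil(word_count / WORDS_PER_MINUTE) if word_count else 0
    return {"wordCount": word_count, "readingTimeMinutes": minutes}
```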

### How much does it cost to extract news articles?

The Actor uses a Cheerio crawler (no headless browser) with 8 concurrent requests. Extracting 100 articles typically consumes a few cents of platform credit on Apify. The free tier covers thousands of extractions per month.

### Tips and advanced options

- **Feed a sitemap** - Want every article from a publisher? Pass the sitemap URLs and the extractor will process each one.
- **Filter noise with `minWordCount`** - Set it to 200 or 300 to automatically skip homepages, tag pages, and author pages.
- **Schedule incremental crawls** - Use Apify Schedules to re-run daily against an RSS feed and push new articles to your RAG database.
- **Integrate with LLM APIs** - Chain this Actor with an LLM summarization Actor or a vector database webhook.
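
For scheduled, incremental runs you will usually want to push only articles you have not stored before. The sketch below shows one way to do that on the client side, keyed on `canonicalUrl` with `url` as a fallback. The function name `select_new_articles` and the in-memory `seen_urls` set are assumptions for illustration; in a real pipeline the set would be backed by your own database.

```python
def select_new_articles(items, seen_urls):
    """Return items whose canonical URL is not in seen_urls, and record them."""
    fresh = []
    for item in items:
        key = item.get("canonicalUrl") or item.get("url")
        if key and key not in seen_urls:
            seen_urls.add(key)  # remember it so later runs skip this article
            fresh.append(item)
    return fresh
```

Calling it once per run with the dataset items and the persisted set yields only the articles that still need to be embedded or indexed.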

### FAQ

**Does it handle paywalled content?** No. It only extracts content that is served in the public HTML. Paywalled pages will either return the preview or nothing.

**Which sites are supported?** Anything that serves HTML. The extractor is site-agnostic. It has been tested against BBC, TechCrunch, The Verge, NYT (public pages), Medium, Substack, WordPress blogs, and more.

**Is this legal?** The Actor fetches publicly served HTML, the same way your browser does. It does not bypass paywalls, log in, or circumvent any access controls. You are responsible for respecting the terms of service of the sites you scrape and for complying with copyright when using extracted content.

**Why not use a headless browser?** Headless browsers are 10-20x slower and cost 10-20x more. For news and blog content, HTTP + Cheerio works on the vast majority of sites. If you need JS-heavy sites, consider pairing this Actor with a dedicated browser-based one.

### Support

Found an article that fails to extract cleanly? Open an issue with the URL and we will tune the extractor.

# Actor input Schema

## `startUrls` (type: `array`):

The news articles or blog posts to extract. Each item is a URL - the actor fetches the page and returns clean structured JSON.

## `maxRequestsPerCrawl` (type: `integer`):

Upper bound on the number of URLs processed (safety cap).

## `minWordCount` (type: `integer`):

Skip pages whose extracted article has fewer than this many words. Useful to filter out homepages and category listings.

## `includeHtml` (type: `boolean`):

Also return the raw HTML in the output. Off by default to keep datasets lean.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o"
    },
    {
      "url": "https://techcrunch.com"
    }
  ],
  "maxRequestsPerCrawl": 100,
  "minWordCount": 0,
  "includeHtml": false
}
```

# Actor output Schema

## `dataset` (type: `string`):

One row per extracted article with title, authors, content, and metadata

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o"
        },
        {
            "url": "https://techcrunch.com"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("wiry_kingdom/news-article-extractor-ai").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [
        {"url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o"},
        {"url": "https://techcrunch.com"},
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("wiry_kingdom/news-article-extractor-ai").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o"
    },
    {
      "url": "https://techcrunch.com"
    }
  ]
}' |
apify call wiry_kingdom/news-article-extractor-ai --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=wiry_kingdom/news-article-extractor-ai",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "News Article Extractor for AI & RAG",
        "description": "Extract clean, structured JSON from any news article or blog post - title, authors, published date, full content, keywords, images. Perfect for LLM training data, RAG pipelines, content monitoring and news aggregation. Uses JSON-LD, Open Graph and readability heuristics.",
        "version": "0.1",
        "x-build-id": "qTYyPsPE0GrZZxxlV"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/wiry_kingdom~news-article-extractor-ai/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-wiry_kingdom-news-article-extractor-ai",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/wiry_kingdom~news-article-extractor-ai/runs": {
            "post": {
                "operationId": "runs-sync-wiry_kingdom-news-article-extractor-ai",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/wiry_kingdom~news-article-extractor-ai/run-sync": {
            "post": {
                "operationId": "run-sync-wiry_kingdom-news-article-extractor-ai",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Article URLs",
                        "type": "array",
                        "description": "The news articles or blog posts to extract. Each item is a URL - the actor fetches the page and returns clean structured JSON.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxRequestsPerCrawl": {
                        "title": "Max requests per crawl",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Upper bound on the number of URLs processed (safety cap).",
                        "default": 100
                    },
                    "minWordCount": {
                        "title": "Minimum word count",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip pages whose extracted article has fewer than this many words. Useful to filter out homepages and category listings.",
                        "default": 0
                    },
                    "includeHtml": {
                        "title": "Include raw HTML",
                        "type": "boolean",
                        "description": "Also return the raw HTML in the output. Off by default to keep datasets lean.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
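
If you prefer to call the REST API directly instead of using a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch under a few assumptions: the helper names `build_sync_request` and `run_sync` are hypothetical, the 300-second timeout is an arbitrary choice, and the response body is assumed to be JSON (the specification does not define a response schema for this path).

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"  # server URL from the specification
ACTOR_PATH = "wiry_kingdom~news-article-extractor-ai"


def build_sync_request(token, actor_input):
    """Prepare a POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token, actor_input, timeout=300):
    """Send the request and decode the body (assumed to be JSON dataset items)."""
    request = build_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the network call in one place, which makes the URL and payload easy to inspect before anything is sent.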
