# Generic Articles Main Content Extractor (`nlp_data_lni/generic-articles-content-extractor`) Actor

Extract the main content of articles. Input can be article links or pages from which to identify and extract article links. Articles are scraped and cleaned to extract the main text and a range of useful metadata. Search-term and date post-filters can be applied, and highlighted snippets can be produced.

- **URL**: https://apify.com/nlp\_data\_lni/generic-articles-content-extractor.md
- **Developed by:** [LilaK](https://apify.com/nlp_data_lni) (community)
- **Categories:** Automation, News, AI
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.00001 / result

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Since this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Generic Articles Main Content Extractor

### Description
The tool extracts the main content of articles. The input can be direct article URLs or page URLs from which to extract article links. The tool uses specific algorithms to identify relevant article links and discard navigation links. Each article is scraped and cleaned (unimportant text such as navigation links and menus is removed) to extract the main text and a range of useful metadata.

### Main features
✅ Scrapes article URLs  
✅ Scrapes link pages and identifies relevant article links (customizable feature)  
✅ For each scraped article, extracts the main text (plain text or Markdown format) and various metadata (title, description, author, date, categories, tags)  
✅ Searches for given terms within the content of each article and produces highlighted snippets  
✅ Checks if an article has been published since a given date  
✅ Outputs results in CSV/JSON  

### Usage
☑️ Monitor selected websites for technological or economic intelligence  
☑️ Keep up to date with the latest trends on a particular topic by monitoring specific websites  
☑️ Crawl news or blog websites and build text corpora for various purposes (academic research, machine learning, etc.)  

### Main Input
➡️ A list of article URLs and/or a list of pages with article links (required)  
➡️ A set of search terms to look for in each article content (optional)  

General Input Configuration

<img title="General Input Configuration" src="https://api.apify.com/v2/key-value-stores/h8KTHfWCfdAGtcpId/records/generic.input.png" width="600">

\
Post Filtering Options Configuration

<img title="Post Filtering Options" src="https://api.apify.com/v2/key-value-stores/h8KTHfWCfdAGtcpId/records/generic.adv.input.png" width="600">  

### Output
➡️ A dataset of articles including the main text content and various metadata. The output can be found in the default dataset storage in many formats (JSON, CSV, XML, Excel, RSS, etc).  
➡️ Each article includes the following properties: _url_, _title_, _description_, _author_, _source_ (source name), _domain_ (website domain), _date_ (publication or last updated date), _categories_ (a list of detected categories), _tags_ (a list of detected tags), _search_terms_ (search terms found), _search_highlights_ (highlighted text snippets), _valid_date_ (whether the article has been published since the given input date), _valid_ (whether the article is valid according to the post-filters), _text_ (main content in plain text or Markdown format, according to the input options)  

<!---
➡️ If search terms or _date from_ are provided, the _Articles_ view allows you to display only articles that include at least one term or meet the date criterion.
-->
➡️ If the _compute_stats_ option is set, a dataset is built with the total article count for each occurring category, tag, or search term. The dataset can be displayed by selecting _Stats View_ in the **Output** tab.
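
Comparable counts can also be derived client-side from the article items. The following is a rough sketch, assuming `categories`, `tags` and `search_terms` are lists and `valid` is a boolean. The function name `count_stats` and the sample items are made up for illustration, and the Actor's own _Stats View_ may be shaped differently.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_stats(items: Iterable[Dict[str, Any]], only_valid: bool = False) -> Dict[str, Counter]:
    """Count articles per category, tag and search term (client-side sketch)."""
    stats = {"categories": Counter(), "tags": Counter(), "search_terms": Counter()}
    for item in items:
        # Optionally skip articles that did not pass the post-filters.
        if only_valid and not item.get("valid"):
            continue
        for field, counter in stats.items():
            # Use a set so an article counts once per distinct value.
            counter.update(set(item.get(field) or []))
    return stats


# Made-up items for illustration only; real items come from the run's dataset.
sample = [
    {"categories": ["AI"], "tags": ["llm", "agents"], "search_terms": ["AI"], "valid": True},
    {"categories": ["AI", "Cloud"], "tags": ["llm"], "search_terms": [], "valid": False},
]
```

Calling `count_stats(sample, only_valid=True)` restricts the counts to valid articles, similar in spirit to the `only_valid_stats` input option.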

Here are some output examples:

![Articles table view](https://api.apify.com/v2/key-value-stores/h8KTHfWCfdAGtcpId/records/generic.output.articles.png "Blog articles table view with search terms ('AI' and 'Google')")

![Statistics JSON view](https://api.apify.com/v2/key-value-stores/h8KTHfWCfdAGtcpId/records/generic.output.stats.png "Statistics JSON view")


### Your feedback
If you’ve got any technical feedback, a bug to report, or a suggestion to improve the Actor, please create an issue on the Actor’s Issues tab.

# Actor input Schema

## `article_urls` (type: `array`):

A list of article urls to process.
## `article_links_urls` (type: `array`):

A list of pages with article links. Each entry is an object with the following keys: 'url' (required): page url; 'external' (defaults to 'false'): allow links to other domains; 'auto' (defaults to 'true'): enable automatic link extraction mode; 'include': matching patterns for valid links; 'exclude': matching patterns for invalid links.
Example of valid input:
 [{ 'url': 'https://www.forbes.com/business/', 'auto': false, 'external': false, 'include': '/sites/', 'exclude': '' }]
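
As a minimal sketch, an `article_links_urls` entry can be assembled in Python before it is passed to the client. The helper name `links_page` is hypothetical; the keys and defaults are the ones listed in the description above.

```python
from typing import Any, Dict


def links_page(url: str, auto: bool = True, external: bool = False,
               include: str = "", exclude: str = "") -> Dict[str, Any]:
    """Build one 'article_links_urls' entry (hypothetical helper).

    Defaults mirror the documented ones: 'auto' is true, 'external' is false.
    """
    if not url:
        raise ValueError("'url' is required")
    return {"url": url, "auto": auto, "external": external,
            "include": include, "exclude": exclude}


# Same entry as the documented example: manual mode, keep only '/sites/' links.
run_input = {
    "article_urls": [],
    "article_links_urls": [
        links_page("https://www.forbes.com/business/", auto=False, include="/sites/"),
    ],
    "max_links_by_url": 10,
}
```
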
## `max_links_by_url` (type: `integer`):

The maximum number of links to extract from each article links url provided in 'article_links_urls' input.
## `markdown_text` (type: `boolean`):

Enable Markdown text extraction
## `terms` (type: `array`):

A list of terms to look for in each article content. When a term is found, a specific column is filled in the result table
## `max_hl_snippets` (type: `integer`):

Maximum number of highlighted snippets to output by search term. If set to 0, highlighting is disabled
## `date_from` (type: `string`):

Check if an article has been published since a given date. Use the format YYYY-MM-DD or {number}{unit}, where the unit is day(s), week(s), month(s), or year(s), for example `7 days`.
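
A `date_from` value can be checked locally before starting a run. This sketch reuses the `pattern` given for `date_from` in the OpenAPI specification further down; the function name `is_valid_date_from` is an assumption for illustration.

```python
import re

# Pattern copied from the `date_from` property in the OpenAPI specification.
DATE_FROM_PATTERN = re.compile(
    r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"
    r"|^(\d+)\s*(day|week|month|year)s?$"
    r"|^$"
)


def is_valid_date_from(value: str) -> bool:
    """Return True if `value` matches the documented `date_from` pattern.

    The pattern checks the shape only, so '2024-02-31' is accepted even
    though it is not a real calendar date.
    """
    return DATE_FROM_PATTERN.fullmatch(value) is not None
```

Under this pattern, `"2024-01-31"`, `"7 days"`, `"2weeks"` and the empty string are accepted, while `"2024-13-01"` and `"yesterday"` are rejected.
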
## `or_post_filters` (type: `boolean`):

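Enable an 'OR' combination of post-filters to decide whether an article is valid. The default is an 'AND' combination. When no post-filter is applied, all collected articles are valid.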
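
According to this option's title in the OpenAPI specification, `or_post_filters` switches the post-filters from an 'AND' to an 'OR' combination, and all collected articles are valid when no post-filter is applied. The sketch below is one interpretation of that rule, not the Actor's actual implementation; the function `is_valid` and its parameters are hypothetical.

```python
from typing import Optional


def is_valid(term_match: Optional[bool], date_match: Optional[bool],
             or_post_filters: bool = False) -> bool:
    """Combine post-filter results; None means the filter was not applied."""
    applied = [result for result in (term_match, date_match) if result is not None]
    if not applied:
        # No post-filter applied: every collected article is valid.
        return True
    # 'OR' needs at least one passing filter, 'AND' (default) needs all of them.
    return any(applied) if or_post_filters else all(applied)
```

For example, an article that contains a search term but is older than `date_from` is invalid by default and valid with `or_post_filters` enabled.
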
## `compute_stats` (type: `boolean`):

Enable global statistics computation.
## `only_valid_stats` (type: `boolean`):

When enabled, statistics are only computed based on valid articles.
## `proxy_settings` (type: `object`):

Select proxies to be used for crawling.
## `max_retries` (type: `integer`):

The maximum number of times a request will be retried on network, proxy or server errors.
## `timeout` (type: `integer`):

Timeout in seconds for making a request.

## Actor input object example

```json
{
  "article_urls": [
    "https://blog.apify.com/google-antigravity-for-lead-gen/"
  ],
  "article_links_urls": [],
  "max_links_by_url": 10,
  "markdown_text": false,
  "max_hl_snippets": 3,
  "or_post_filters": false,
  "compute_stats": false,
  "only_valid_stats": false,
  "proxy_settings": {
    "useApifyProxy": false
  },
  "max_retries": 3,
  "timeout": 30
}
````

# Actor output Schema

## `articles` (type: `string`):

No description

## `stats` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "article_urls": [
        "https://blog.apify.com/google-antigravity-for-lead-gen/"
    ],
    "article_links_urls": [],
    "proxy_settings": {
        "useApifyProxy": false
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("nlp_data_lni/generic-articles-content-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "article_urls": ["https://blog.apify.com/google-antigravity-for-lead-gen/"],
    "article_links_urls": [],
    "proxy_settings": { "useApifyProxy": False },
}

# Run the Actor and wait for it to finish
run = client.actor("nlp_data_lni/generic-articles-content-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "article_urls": [
    "https://blog.apify.com/google-antigravity-for-lead-gen/"
  ],
  "article_links_urls": [],
  "proxy_settings": {
    "useApifyProxy": false
  }
}' |
apify call nlp_data_lni/generic-articles-content-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=nlp_data_lni/generic-articles-content-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Generic Articles Main Content Extractor",
        "description": "Extract the main content of articles. Input can be article links or pages from which to identify and extract article links. Articles are scraped and cleaned to extract the main text and many useful metadatas. Search terms and date post filters can be applied and highlighted snippets produced.",
        "version": "0.0",
        "x-build-id": "faS19vdVtXgKAGUrO"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/nlp_data_lni~generic-articles-content-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-nlp_data_lni-generic-articles-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/nlp_data_lni~generic-articles-content-extractor/runs": {
            "post": {
                "operationId": "runs-sync-nlp_data_lni-generic-articles-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/nlp_data_lni~generic-articles-content-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-nlp_data_lni-generic-articles-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "article_urls": {
                        "title": "Article URLs",
                        "type": "array",
                        "description": "A list of article urls to process.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "article_links_urls": {
                        "title": "Article Links URLS",
                        "type": "array",
                        "description": "A list of pages with article links ('url' (required): page url, 'external' (defaults to 'false'): allow extra domain links,'auto' (defaults to 'true'): enable automatic links extraction mode,  'include': matching patterns for valid links,  'exclude': matching patterns for invalid links.\nExample of valid input:\n [{ 'url': 'https://www.forbes.com/business/', 'auto': false, 'external': false, 'include': '/sites/', 'exclude': '' }]"
                    },
                    "max_links_by_url": {
                        "title": "Max links per article links url",
                        "type": "integer",
                        "description": "The maximum number of links to extract from each article links url provided in 'article_links_urls' input.",
                        "default": 10
                    },
                    "markdown_text": {
                        "title": "Get Markdown Text",
                        "type": "boolean",
                        "description": "Enable Markdown text extraction",
                        "default": false
                    },
                    "terms": {
                        "title": "Search Terms Post-Filter",
                        "type": "array",
                        "description": "A list of terms to look for in each article content. When a term is found, a specific column is filled in the result table",
                        "items": {
                            "type": "string"
                        }
                    },
                    "max_hl_snippets": {
                        "title": "Maximum highlighted snippets by search term",
                        "type": "integer",
                        "description": "Maximum number of highlighted snippets to output by search term. If set to 0, highlighting is disabled",
                        "default": 3
                    },
                    "date_from": {
                        "title": "From Date Post-Filter",
                        "pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$|^(\\d+)\\s*(day|week|month|year)s?$|^$",
                        "type": "string",
                        "description": "Check if an article has been published since a given date. Select date in format YYYY-MM-DD or {number}{unit}"
                    },
                    "or_post_filters": {
                        "title": "Enable 'OR' combination of Post-Filters to set if an article is valid. The default is an 'AND' combination. When no post-filter is applied, all collected articles are valid",
                        "type": "boolean",
                        "description": "Enable global statistics computation.",
                        "default": false
                    },
                    "compute_stats": {
                        "title": "Compute Statistics",
                        "type": "boolean",
                        "description": "Enable global statistics computation.",
                        "default": false
                    },
                    "only_valid_stats": {
                        "title": "Compute statistics based only on valid articles",
                        "type": "boolean",
                        "description": "When enabled, statistics are only computed based on valid articles.",
                        "default": false
                    },
                    "proxy_settings": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Select proxies to be used for crawling."
                    },
                    "max_retries": {
                        "title": "Maximum number of retries on request errors",
                        "type": "integer",
                        "description": "The maximum number of times a request will be retried on network, proxy or server errors.",
                        "default": 3
                    },
                    "timeout": {
                        "title": "Request timeout in seconds",
                        "type": "integer",
                        "description": "Timeout in seconds for making a request.",
                        "default": 30
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
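
For projects that do not use the client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. This is a minimal standard-library sketch, assuming the endpoint returns the dataset items as a JSON array (its default format); the names `build_request` and `run_sync` are illustrative, not part of any Apify library.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/nlp_data_lni~generic-articles-content-extractor/run-sync-get-dataset-items"


def build_request(token: str, run_input: Dict[str, Any]) -> urllib.request.Request:
    """Build the POST request described by the specification (no network I/O)."""
    # The specification passes the API token as the `token` query parameter.
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, run_input: Dict[str, Any], timeout: float = 300.0) -> List[Dict[str, Any]]:
    """Send the request and return the dataset items parsed from the JSON response."""
    with urllib.request.urlopen(build_request(token, run_input), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `run_sync("<YOUR_API_TOKEN>", {"article_urls": ["https://blog.apify.com/google-antigravity-for-lead-gen/"]})` blocks until the run finishes, so the `timeout` should be generous for long article lists.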
