# Web Page to Markdown & Text - URL Reader for LLMs & RAG (`entranced_gelato/ai-web-page-reader`) Actor

Read any web page as clean text + Markdown for LLMs and automations. Strips ads, nav, and scripts; returns the main content, metadata (title, author, date, word count), and an optional AI TL;DR + key points. The web-reading primitive for AI agents, RAG pipelines, and no-code flows.

- **URL**: https://apify.com/entranced_gelato/ai-web-page-reader.md
- **Developed by:** [AIDevs](https://apify.com/entranced_gelato) (community)
- **Categories:** AI, Agents, Developer tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $20.00 / 1,000 page reads

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
"Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Web Page Reader

[![Web Page to Markdown & Text](https://apify.com/actor-badge?actor=entranced_gelato/ai-web-page-reader)](https://apify.com/entranced_gelato/ai-web-page-reader)

**Convert any URL into clean, LLM-ready text + Markdown in one call — the web-reading primitive for AI agents, RAG pipelines, and no-code automations.**

Give it a page URL and it strips ads, navigation, and scripts, isolates the main content, and returns clean text, Markdown, and page metadata — plus an optional AI summary. It's the fast single-page alternative to a full-site crawler.

---

### Why AI Web Page Reader

AI agents and automations constantly need to "read this page" and get text an LLM can actually use. Doing that well means removing boilerplate (menus, cookie banners, footers) and converting messy HTML into clean Markdown. This Actor does exactly that, predictably, in a single call.

- **One call, one record** — no crawling, no configuration.
- **LLM-ready** — clean text *and* Markdown, with metadata (title, author, date, word count).
- **Cheap, high-volume** — a tiny per-read price designed for machine-driven, repeat usage.

### When to use it

- **RAG ingestion** of a specific article, doc page, or knowledge-base entry.
- **Research / chat agents** that fetch a URL and need its readable content.
- **No-code flows** (Make, Zapier, n8n) that pass a URL and store clean content.
- Quick **reader-mode + summarize** of any article.

### When NOT to use it

- **Crawling a whole site** (many pages) — use a deep crawler; this reads one URL.
- **Heavily client-rendered apps** that need full JS execution and interaction.
- **Login-gated pages** — it fetches as an anonymous visitor.

### Built for

AI engineers, RAG/LLM developers, automation builders, and anyone who wants a reliable "URL → clean text" tool.

---

### How it works

1. **Fetch** the page at `url` with a realistic, browser-like user agent.
2. **Extract metadata** — title, description, author/byline, site name, published time, language, and OG image.
3. **Clean** — remove scripts, styles, nav, header, footer, ads, cookie/newsletter/share widgets.
4. **Isolate main content** — prefer `<article>`/`<main>`/content containers; otherwise pick the densest text block.
5. **Convert to Markdown** (headings, lists, links, bold/italic, blockquotes, images) and derive well-spaced plain text.
6. **(Optional) Summarize** with your OpenAI key.
7. **Output** one record; usage is billed per event.
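
The clean and isolate steps can be illustrated with a small standard-library sketch. This is not the Actor's implementation: the tag lists, the `MainTextExtractor` class, and `extract_main_text` are hypothetical names chosen for illustration, and the fallback here simply keeps all non-boilerplate text instead of picking the densest block.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the Actor's real list).
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}
# Containers preferred as main content, in the spirit of step 4.
MAIN_TAGS = {"article", "main"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, skipping boilerplate and preferring <article>/<main>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.main_depth = 0  # > 0 while inside <article> or <main>
        self.all_text = []   # text found anywhere outside boilerplate
        self.main_text = []  # text found inside a preferred container

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self.skip_depth:
            return
        self.all_text.append(text)
        if self.main_depth:
            self.main_text.append(text)


def extract_main_text(html: str) -> str:
    """Return text from <article>/<main> if present, else all non-boilerplate text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.main_text or parser.all_text)
```

A production extractor would also need to handle malformed markup, class-based ad and cookie containers, and Markdown conversion, none of which this sketch attempts.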

### How to call it

#### From the Console
Paste a URL into **Page URL**, optionally enable **Generate AI summary** with your OpenAI key, click **Start**, and read the **Output** tab.

#### From the API

```
POST https://api.apify.com/v2/acts/entranced_gelato~ai-web-page-reader/runs?token=<APIFY_TOKEN>
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "includeMarkdown": true,
  "summarize": false
}
```

Also callable over **MCP** as an agent tool.

---

### Input reference

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | string | **Yes** | — | The public web page to read. |
| `includeMarkdown` | boolean | No | `true` | Also return a clean Markdown version. |
| `summarize` | boolean | No | `false` | Generate an AI TL;DR + key points (needs `openaiApiKey`). |
| `openaiApiKey` | string (secret) | No | — | Your OpenAI key; used only for the summary. |
| `model` | string | No | `gpt-4o-mini` | OpenAI model for the summary. |
| `maxChars` | integer | No | `0` | Cap returned text/markdown length (`0` = no limit). |

### Output reference

One dataset record per run:

| Field | Description |
|-------|-------------|
| `url` | The page URL that was read. |
| `title` | Page/article title. |
| `byline` | Author, if detected. |
| `siteName` | Publisher/site name (OG). |
| `publishedTime` | Published date, if available. |
| `lang` | Page language. |
| `description` | Meta description. |
| `image` | OG image URL. |
| `wordCount` | Word count of the extracted text. |
| `content` | Clean plain text. |
| `markdown` | LLM-ready Markdown. |
| `summary` | AI TL;DR (only when summarization is enabled). |
| `keyPoints` | Array of key points (only when summarization is enabled). |
| `fetchedAt` | ISO timestamp of the run. |
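
Because several fields are optional (`byline` may be `null`, `summary` and `keyPoints` appear only when summarization is enabled), downstream code benefits from a tolerant parser. The following is a minimal sketch covering a subset of the fields; `PageRecord`, `from_item`, and `body` are hypothetical names, not part of any Apify library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageRecord:
    """A subset of the dataset record fields listed in the table above."""

    url: str
    title: Optional[str] = None
    byline: Optional[str] = None
    word_count: int = 0
    content: str = ""
    markdown: Optional[str] = None
    summary: Optional[str] = None
    key_points: list = field(default_factory=list)

    @classmethod
    def from_item(cls, item: dict) -> "PageRecord":
        # .get() keeps the parser tolerant of missing or null optional fields.
        return cls(
            url=item["url"],
            title=item.get("title"),
            byline=item.get("byline"),
            word_count=int(item.get("wordCount") or 0),
            content=item.get("content") or "",
            markdown=item.get("markdown"),
            summary=item.get("summary"),
            key_points=list(item.get("keyPoints") or []),
        )

    def body(self) -> str:
        """Prefer Markdown for LLM prompts; fall back to plain text."""
        return self.markdown or self.content
```

For example, `PageRecord.from_item(item).body()` yields the Markdown when it was requested and the plain text otherwise.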

---

### Pricing

**Pay per event** — you only pay for what you run:

- **Page read** — charged once per successful run (one page).
- **AI summary** — a small premium that applies **only when you enable summarization**. You supply your own OpenAI key, so the model's cost is billed by OpenAI separately and is never added to the Actor price.

Apify platform/compute usage is included in the per-event price. See the **Pricing** tab for current rates.
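
As a rough budgeting aid, the page-read charge scales linearly with the number of runs. The sketch below assumes the listed starting rate of $20.00 per 1,000 page reads; the function name is hypothetical, the real rate may differ by plan, and the AI summary premium is left out because its rate is not stated here.

```python
from decimal import Decimal


def estimate_page_read_cost(page_reads: int, usd_per_1000_reads: str = "20.00") -> Decimal:
    """Estimate page-read charges in USD, excluding any AI summary premium."""
    if page_reads < 0:
        raise ValueError("page_reads must be non-negative")
    cost = Decimal(usd_per_1000_reads) * page_reads / 1000
    # Round to whole cents for display.
    return cost.quantize(Decimal("0.01"))


print(estimate_page_read_cost(1))    # 0.02
print(estimate_page_read_cost(250))  # 5.00
```

At the assumed rate, a single read costs two cents and 250 reads cost five dollars.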

### Integrations

- **LangChain / LlamaIndex** — load `content`/`markdown` into vector stores and RAG chains.
- **Make / Zapier / n8n** — URL in, clean content out.
- **MCP** — expose as a tool for autonomous agents.

### 🔌 Integrations & code examples

#### Call it from the API

```bash
curl "https://api.apify.com/v2/acts/entranced_gelato~ai-web-page-reader/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://en.wikipedia.org/wiki/Web_scraping", "includeMarkdown": true }'
````
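
Where installing a client library is not practical, the same synchronous endpoint can be called with the Python standard library alone. This sketch sends the token in an `Authorization: Bearer` header rather than the query string, which keeps it out of URLs and logs; `build_sync_request` and `read_page` are hypothetical helper names, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "entranced_gelato~ai-web-page-reader"


def build_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST to run-sync-get-dataset-items without sending it."""
    return urllib.request.Request(
        f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def read_page(token: str, page_url: str, timeout: float = 120.0):
    """Run the Actor for one URL and return its first dataset item, or None."""
    request = build_sync_request(token, {"url": page_url, "includeMarkdown": True})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        items = json.load(response)  # the endpoint returns a JSON array of items
    return items[0] if items else None
```

Splitting request construction from sending makes the helper easy to unit test without network access.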

#### Python (Apify client)

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")
run = client.actor("entranced_gelato/ai-web-page-reader").call(
    run_input={"url": "https://example.com/article", "includeMarkdown": True}
)
item = next(client.dataset(run["defaultDatasetId"]).iterate_items())
print(item["title"], "->", item["wordCount"], "words")
print(item["markdown"][:500])
```

#### LangChain (load one page into a RAG chain)

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

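# ApifyWrapper reads the token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="entranced_gelato/ai-web-page-reader",
    run_input={"url": "https://example.com/article"},
    dataset_mapping_function=lambda i: Document(
        # Fall back to plain text when Markdown is absent (includeMarkdown: false).
        page_content=i.get("markdown") or i.get("content") or "",
        metadata={"source": i["url"], "title": i.get("title")},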
    ),
)
docs = loader.load()
```

#### MCP — add it to Claude, Cursor, or any agent

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/ai-web-page-reader"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}
```

Also works with **LlamaIndex**, **Make**, **Zapier**, and **n8n** — pass a URL, get clean content back into any workflow.

#### Example output

```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "byline": null,
  "siteName": "Wikipedia",
  "wordCount": 3412,
  "content": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "markdown": "# Web scraping\n\nWeb scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "fetchedAt": "2026-07-02T07:20:00.000Z"
}
```

### FAQ

**Will it run JavaScript-heavy pages?** It fetches server-rendered HTML. Pages that render entirely client-side may return limited content.

**Markdown or plain text?** Both — `markdown` for rich formatting, `content` for plain text. Disable Markdown with `includeMarkdown: false`.

**How is it different from a content crawler?** It reads exactly one URL, fast and cheap — ideal as an agent/automation primitive rather than a bulk crawl.

### Limitations

- Single page per run (no crawling).
- No JS execution / interaction.
- Public pages only (no auth).
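
Since each run reads exactly one page, a short known list of URLs can be handled by starting one run per URL. The sketch below shows one way to do that; `read_many` is a hypothetical helper, and `read_one` stands in for whatever function performs a single run (for example, a wrapper around the Apify client). Each successful run is billed as a separate page read.

```python
from typing import Callable, Iterable


def read_many(urls: Iterable[str], read_one: Callable[[str], dict]) -> dict:
    """Read each distinct URL once and collect results keyed by URL."""
    results: dict = {}
    for url in urls:
        if url in results:
            continue  # skip duplicates so the same page is not read twice
        try:
            results[url] = read_one(url)
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs.
            results[url] = {"url": url, "error": str(exc)}
    return results
```

For more than a handful of pages, or when links must be discovered, a dedicated crawler remains the better fit.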

### See also

- [AI Document Reader](https://apify.com/entranced_gelato/ai-document-reader) - PDF, DOCX, or file URL to clean text + Markdown.
- [AI Competitive Brief Generator](https://apify.com/entranced_gelato/ai-competitive-brief-generator) - any company URL to a competitive, SEO, or sales brief.

# Actor input Schema

## `url` (type: `string`):

The public web page you want to read and clean.

## `includeMarkdown` (type: `boolean`):

Include a clean Markdown version of the page (ideal for LLMs).

## `summarize` (type: `boolean`):

Produce a TL;DR + key points. Requires an OpenAI API key below.

## `openaiApiKey` (type: `string`):

Your own OpenAI key. Only used when 'Generate AI summary' is on.

## `model` (type: `string`):

OpenAI model used for the summary.

## `maxChars` (type: `integer`):

Optionally cap the length of returned content/markdown.

## Actor input object example

```json
{
  "url": "https://example.com/article",
  "includeMarkdown": true,
  "summarize": false,
  "model": "gpt-4o-mini",
  "maxChars": 0
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://en.wikipedia.org/wiki/Web_scraping"
};

// Run the Actor and wait for it to finish
const run = await client.actor("entranced_gelato/ai-web-page-reader").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "url": "https://en.wikipedia.org/wiki/Web_scraping" }

# Run the Actor and wait for it to finish
run = client.actor("entranced_gelato/ai-web-page-reader").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "https://en.wikipedia.org/wiki/Web_scraping"
}' |
apify call entranced_gelato/ai-web-page-reader --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=entranced_gelato/ai-web-page-reader",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Web Page to Markdown & Text - URL Reader for LLMs & RAG",
        "description": "Read any web page as clean text + Markdown for LLMs and automations. Strips ads, nav, and scripts; returns the main content, metadata (title, author, date, word count), and an optional AI TL;DR + key points. The web-reading primitive for AI agents, RAG pipelines, and no-code flows.",
        "version": "0.0",
        "x-build-id": "GdkuVHHAkkMQsMVCK"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/entranced_gelato~ai-web-page-reader/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-entranced_gelato-ai-web-page-reader",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/entranced_gelato~ai-web-page-reader/runs": {
            "post": {
                "operationId": "runs-sync-entranced_gelato-ai-web-page-reader",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/entranced_gelato~ai-web-page-reader/run-sync": {
            "post": {
                "operationId": "run-sync-entranced_gelato-ai-web-page-reader",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "Page URL",
                        "type": "string",
                        "description": "The public web page you want to read and clean."
                    },
                    "includeMarkdown": {
                        "title": "Return Markdown",
                        "type": "boolean",
                        "description": "Include a clean Markdown version of the page (ideal for LLMs).",
                        "default": true
                    },
                    "summarize": {
                        "title": "Generate AI summary",
                        "type": "boolean",
                        "description": "Produce a TL;DR + key points. Requires an OpenAI API key below.",
                        "default": false
                    },
                    "openaiApiKey": {
                        "title": "OpenAI API key (for summary)",
                        "type": "string",
                        "description": "Your own OpenAI key. Only used when 'Generate AI summary' is on."
                    },
                    "model": {
                        "title": "LLM model",
                        "enum": [
                            "gpt-4o-mini",
                            "gpt-4o",
                            "gpt-4.1-mini"
                        ],
                        "type": "string",
                        "description": "OpenAI model used for the summary.",
                        "default": "gpt-4o-mini"
                    },
                    "maxChars": {
                        "title": "Max characters (0 = no limit)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Optionally cap the length of returned content/markdown.",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
