Web Page to Markdown & Text - URL Reader for LLMs & RAG

Pricing

from $20.00 / 1,000 page reads

Read any web page as clean text + Markdown for LLMs and automations. Strips ads, nav, and scripts; returns the main content, metadata (title, author, date, word count), and an optional AI TL;DR + key points. The web-reading primitive for AI agents, RAG pipelines, and no-code flows.

Developer

AIDevs

Maintained by Community

AI Web Page Reader

Web Page to Markdown & Text

Convert any URL into clean, LLM-ready text + Markdown in one call — the web-reading primitive for AI agents, RAG pipelines, and no-code automations.

Give it a page URL and it strips ads, navigation, and scripts, isolates the main content, and returns clean text, Markdown, and page metadata — plus an optional AI summary. It's the fast single-page alternative to a full-site crawler.


Why AI Web Page Reader

AI agents and automations constantly need to "read this page" and get text an LLM can actually use. Doing that well means removing boilerplate (menus, cookie banners, footers) and converting messy HTML into clean Markdown. This Actor does exactly that, predictably, in a single call.

  • One call, one record — no crawling, no configuration.
  • LLM-ready — clean text and Markdown, with metadata (title, author, date, word count).
  • Cheap at high volume: priced per read (from $20.00 per 1,000 page reads, about $0.02 each), which suits machine-driven, repeat usage.

When to use it

  • RAG ingestion of a specific article, doc page, or knowledge-base entry.
  • Research / chat agents that fetch a URL and need its readable content.
  • No-code flows (Make, Zapier, n8n) that pass a URL and store clean content.
  • Quick reader-mode + summarize of any article.

When NOT to use it

  • Crawling a whole site (many pages) — use a deep crawler; this reads one URL.
  • Heavily client-rendered apps that need full JS execution and interaction.
  • Login-gated pages — it fetches as an anonymous visitor.

Built for

AI engineers, RAG/LLM developers, automation builders, and anyone who wants a reliable "URL → clean text" tool.


How it works

  1. Fetch the page at url with a browser-like user agent.
  2. Extract metadata — title, description, author/byline, site name, published time, language, and OG image.
  3. Clean — remove scripts, styles, nav, header, footer, ads, cookie/newsletter/share widgets.
  4. Isolate main content — prefer <article>/<main>/content containers; otherwise pick the densest text block.
  5. Convert to Markdown (headings, lists, links, bold/italic, blockquotes, images) and derive well-spaced plain text.
  6. (Optional) Summarize with your OpenAI key.
  7. Output one record; usage is billed per event.
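
Steps 3 to 5 above can be illustrated with a minimal, standard-library sketch. This is not the Actor's actual implementation: the tag lists, the class name ReaderSketch, and the function html_to_markdown are assumptions made for illustration, and metadata extraction (step 2) and the "densest text block" fallback are left out.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (an assumption for this sketch).
# "head" is skipped too, because metadata extraction is not covered here.
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside", "noscript", "form"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCKS = {"p", "li"} | set(HEADINGS)


class ReaderSketch(HTMLParser):
    """Collects block-level text outside boilerplate tags as Markdown-like lines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a boilerplate subtree
        self.prefix = ""      # Markdown prefix of the block being collected
        self.buffer = []      # text fragments of the block being collected
        self.blocks = []      # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCKS and not self.skip_depth:
            self._flush()
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCKS and not self.skip_depth:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.buffer.append(data)

    def _flush(self):
        # Collapse whitespace so the plain text is "well-spaced".
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buffer, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    """Return the non-boilerplate text of an HTML string as simple Markdown."""
    parser = ReaderSketch()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)
```

Calling html_to_markdown on a page with a nav bar, an article, and a footer keeps only the article's headings, paragraphs, and list items. A production extractor would also handle links, emphasis, blockquotes, and images, as step 5 describes.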

How to call it

From the Console

Paste a URL into Page URL, optionally enable Generate AI summary with your OpenAI key, click Start, and read the Output tab.

From the API

POST https://api.apify.com/v2/acts/entranced_gelato~ai-web-page-reader/runs?token=<APIFY_TOKEN>
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "includeMarkdown": true,
  "summarize": false
}
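
The same request can be sketched in Python with only the standard library. The helper names build_run_request and start_run are hypothetical, and two things are assumptions: that the token may be sent as a Bearer header instead of the token query parameter, and that the response wraps the run object in a data key carrying defaultDatasetId (the field the Python client example on this page reads).

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "entranced_gelato~ai-web-page-reader"


def build_run_request(token: str, page_url: str, include_markdown: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts one run."""
    body = json.dumps(
        {"url": page_url, "includeMarkdown": include_markdown, "summarize": False}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/acts/{ACTOR_ID}/runs",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: header-based auth instead of ?token=<APIFY_TOKEN>.
            "Authorization": f"Bearer {token}",
        },
    )


def start_run(token: str, page_url: str) -> dict:
    """Send the request and return the run object (assumed to sit under "data")."""
    with urllib.request.urlopen(build_run_request(token, page_url), timeout=30) as resp:
        return json.load(resp)["data"]
```

The /runs endpoint starts the run and returns without the page content, so a caller would still need to wait for the run to finish and then read its dataset.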

Also callable over MCP as an agent tool.


Input reference

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The public web page to read. |
| includeMarkdown | boolean | No | true | Also return a clean Markdown version. |
| summarize | boolean | No | false | Generate an AI TL;DR + key points (needs openaiApiKey). |
| openaiApiKey | string (secret) | No | - | Your OpenAI key; used only for the summary. |
| model | string | No | gpt-4o-mini | OpenAI model for the summary. |
| maxChars | integer | No | 0 | Cap returned text/markdown length (0 = no limit). |

Output reference

One dataset record per run:

| Field | Description |
|---|---|
| url | The page URL that was read. |
| title | Page/article title. |
| byline | Author, if detected. |
| siteName | Publisher/site name (OG). |
| publishedTime | Published date, if available. |
| lang | Page language. |
| description | Meta description. |
| image | OG image URL. |
| wordCount | Word count of the extracted text. |
| content | Clean plain text. |
| markdown | LLM-ready Markdown. |
| summary | AI TL;DR (only when summarization is enabled). |
| keyPoints | Array of key points (only when summarization is enabled). |
| fetchedAt | ISO timestamp of the run. |
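
In client code it can help to map a dataset record onto a typed structure. The sketch below covers a subset of the fields; the class name PageRecord and its snake_case attributes are hypothetical, and it assumes that optional fields may be either absent or null, which the reference above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageRecord:
    """Typed view of one output record (subset of the documented fields)."""

    url: str
    content: str = ""
    markdown: Optional[str] = None
    title: Optional[str] = None
    byline: Optional[str] = None
    site_name: Optional[str] = None
    published_time: Optional[str] = None
    word_count: int = 0
    summary: Optional[str] = None
    key_points: list = field(default_factory=list)

    @classmethod
    def from_item(cls, item: dict) -> "PageRecord":
        # .get() tolerates fields that are missing, e.g. summary when
        # summarization is disabled (assumed to be omitted or null).
        return cls(
            url=item["url"],
            content=item.get("content") or "",
            markdown=item.get("markdown"),
            title=item.get("title"),
            byline=item.get("byline"),
            site_name=item.get("siteName"),
            published_time=item.get("publishedTime"),
            word_count=int(item.get("wordCount") or 0),
            summary=item.get("summary"),
            key_points=list(item.get("keyPoints") or []),
        )

    def best_text(self) -> str:
        """Prefer Markdown for LLM prompts; fall back to plain text."""
        return self.markdown or self.content
```

PageRecord.from_item(item).best_text() then gives a single string to hand to a prompt or an embedding step, whether or not includeMarkdown was enabled.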

Pricing

Pay per event — you only pay for what you run:

  • Page read — charged once per successful run (one page).
  • AI summary — a small premium that applies only when you enable summarization. You supply your own OpenAI key, so the model's cost is billed by OpenAI separately and is never added to the Actor price.

Apify platform/compute usage is included in the per-event price. See the Pricing tab for current rates.
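
For budgeting, the per-event model reduces to simple arithmetic. The sketch below uses the listed starting price of $20.00 per 1,000 page reads; the summary premium is not stated on this page, so summary_premium_per_1000 is a hypothetical parameter that defaults to 0 and should be filled in from the Pricing tab. OpenAI's own charges are not included.

```python
def estimate_cost(
    page_reads: int,
    summaries: int = 0,
    read_price_per_1000: float = 20.00,
    summary_premium_per_1000: float = 0.0,  # hypothetical: check the Pricing tab
) -> float:
    """Estimate the Actor charge in USD for a number of successful runs."""
    if page_reads < 0 or summaries < 0:
        raise ValueError("counts must be non-negative")
    if summaries > page_reads:
        raise ValueError("each summary belongs to one page read")
    total = page_reads * read_price_per_1000 + summaries * summary_premium_per_1000
    return round(total / 1000, 2)
```

At the starting price, estimate_cost(250) gives 5.0 and a single read costs 0.02.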

Integrations

  • LangChain / LlamaIndex — load content/markdown into vector stores and RAG chains.
  • Make / Zapier / n8n — URL in, clean content out.
  • MCP — expose as a tool for autonomous agents.

🔌 Integrations & code examples

Call it from the API

curl "https://api.apify.com/v2/acts/entranced_gelato~ai-web-page-reader/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://en.wikipedia.org/wiki/Web_scraping", "includeMarkdown": true }'

Python (Apify client)

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and block until the run finishes.
run = client.actor("entranced_gelato/ai-web-page-reader").call(
    run_input={"url": "https://example.com/article", "includeMarkdown": True}
)

# Each run writes one record; fall back to None if the dataset is empty.
item = next(client.dataset(run["defaultDatasetId"]).iterate_items(), None)
if item is not None:
    print(item["title"], "->", item["wordCount"], "words")
    print(item["markdown"][:500])

LangChain (load one page into a RAG chain)

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="entranced_gelato/ai-web-page-reader",
    run_input={"url": "https://example.com/article"},
    # Prefer Markdown; fall back to plain text if it is missing or empty.
    dataset_mapping_function=lambda i: Document(
        page_content=i.get("markdown") or i.get("content") or "",
        metadata={"source": i["url"], "title": i.get("title")},
    ),
)
docs = loader.load()
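
Before embedding, a long page is usually split into smaller chunks. The Actor does not do this itself, so the following is only a sketch of one common approach: splitting the returned markdown at headings, then at paragraph boundaries when a section is too long. The function name split_markdown_by_heading and the 2,000-character default are assumptions.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown into chunks that start at headings and stay near max_chars."""
    # Zero-width split: each section begins at a line starting with 1-6 '#'.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in section.split("\n\n"):
            # Close the chunk when adding this paragraph would exceed the cap.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        chunks.append(current)
    return chunks
```

A single paragraph longer than max_chars is kept whole in this sketch, so the cap is a target rather than a hard limit.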

MCP — add it to Claude, Cursor, or any agent

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/ai-web-page-reader"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}

Also works with LlamaIndex, Make, Zapier, and n8n — pass a URL, get clean content back into any workflow.

Example output

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "byline": null,
  "siteName": "Wikipedia",
  "wordCount": 3412,
  "content": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "markdown": "# Web scraping\n\nWeb scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "fetchedAt": "2026-07-02T07:20:00.000Z"
}

FAQ

Will it run JavaScript-heavy pages? No. It fetches the server-rendered HTML and does not execute JavaScript, so pages that render entirely client-side may return limited content.

Markdown or plain text? Both: the markdown field carries rich formatting and the content field carries plain text. To skip the Markdown version, set includeMarkdown: false.

How is it different from a content crawler? It reads exactly one URL, fast and cheap — ideal as an agent/automation primitive rather than a bulk crawl.

Limitations

  • Single page per run (no crawling).
  • No JS execution / interaction.
  • Public pages only (no auth).
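
Because client-rendered pages may come back nearly empty, a caller may want to flag suspiciously short results and route those URLs to a browser-based tool. The check below is a rough heuristic, not part of the Actor: the function name looks_client_rendered and the 50-word threshold are assumptions to tune for your own sources.

```python
def looks_client_rendered(item: dict, min_words: int = 50) -> bool:
    """Heuristic: very little extracted text suggests the page needs JS to render."""
    words = item.get("wordCount")
    if words is None:
        # Fall back to counting words in the plain-text content.
        words = len((item.get("content") or "").split())
    return words < min_words
```

A short page that is legitimately short will also trip this check, so treat a True result as a prompt to inspect the page rather than as proof.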
