Pricing

from $5.00 / 1,000 markdown results

Proxy Page to Markdown scraper

Fetches pages through Apify Proxy in your chosen country (residential or datacenter) and returns clean markdown for each URL. Optionally returns unique outbound domains or scored CTA links for brand checks. Pages are fetched with Cheerio first, with Playwright as a fallback. Social URLs are not scraped and are returned with status blocked_social.

Rating: 5.0 (1)

Developer: Morph Coder

Actor stats

Last modified: 2 days ago

Features

Geo-targeted requests via Apify Proxy (country + RESIDENTIAL / DATACENTER) — useful for geo-specific landing and affiliate redirect checks
One dataset row per input URL (including failures and blocked social URLs)
Readability-based main content in markdown, html (article fragment), or text
SEO metadata object: title, meta description, canonical, Open Graph, Readability fields
CTA links: scored outbound call-to-action URLs (excludes internal, social, and technical domains)
Optional external links: one sample URL per external domain (noisier — use for broad scans)
Social profile/post URLs → blocked_social (use dedicated social Actors instead)

Input

| Field | Default | Description |
| --- | --- | --- |
| `country` | `US` | Two-letter ISO proxy country (required) |
| `urls` | `https://example.com` | Page URLs to scrape (max `maxUrls` per run) |
| `contentFormat` | `markdown` | `markdown`, `html`, or `text` |
| `extractSeo` | `true` | Return `seo` object with meta/OG fields |
| `extractCtaLinks` | `false` | Scored outbound CTA links for brand/leak checks |
| `maxCtaLinks` | `20` | Max CTA links per page |
| `ctaScope` | `article` | `article` (Readability) or `full_page` |
| `ctaDomainAllowlist` | `[]` | Only return CTA links to these domains (optional) |
| `ctaDomainBlocklist` | `[]` | Extra domains to exclude from CTA links |
| `batchSize` | `50` | Internal batch size (10–100) |
| `maxUrls` | `1000` | Hard cap on URLs per run |
| `extractExternalLinks` | `false` | Unique external domains (noisy on blogs) |
| `maxExternalLinks` | `100` | Max domains per page when external links are enabled |
| `usePlaywrightFallback` | `true` | Retry thin Cheerio results with Chrome |
| `minContentLength` | `200` | Minimum content length (in the selected `contentFormat`) below which the Playwright fallback is tried |
| `proxyType` | `RESIDENTIAL` | `RESIDENTIAL` or `DATACENTER` |
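
The constraints in the table above can be checked before a run is started. The following is a minimal sketch, assuming the input is held as a Python dict; the helper `check_input` is hypothetical and not part of the Actor, and it covers only the constraints stated in the table.

```python
# Defaults copied from the input table above.
DEFAULTS = {
    "contentFormat": "markdown",
    "batchSize": 50,
    "maxUrls": 1000,
    "proxyType": "RESIDENTIAL",
}


def check_input(actor_input):
    """Return a list of problems; an empty list means no issue was found."""
    cfg = {**DEFAULTS, **actor_input}
    problems = []
    country = cfg.get("country", "")
    if not (isinstance(country, str) and len(country) == 2 and country.isalpha()):
        problems.append("country must be a two-letter ISO code")
    if cfg["contentFormat"] not in {"markdown", "html", "text"}:
        problems.append("contentFormat must be markdown, html, or text")
    if cfg["proxyType"] not in {"RESIDENTIAL", "DATACENTER"}:
        problems.append("proxyType must be RESIDENTIAL or DATACENTER")
    if not 10 <= cfg["batchSize"] <= 100:
        problems.append("batchSize must be between 10 and 100")
    if len(cfg.get("urls", [])) > cfg["maxUrls"]:
        problems.append("more URLs than maxUrls allows")
    return problems


good = check_input({"country": "UA", "urls": ["https://example-landing.com/promo"]})
bad = check_input({"country": "USA", "urls": [], "batchSize": 5})
print(good)
print(bad)
```

The Actor may apply further validation of its own; this check only mirrors what the table documents.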

Brand / leak check preset

For landing pages and affiliate monitoring in a target country:

{
  "country": "UA",
  "urls": ["https://example-landing.com/promo"],
  "contentFormat": "markdown",
  "extractSeo": true,
  "extractCtaLinks": true,
  "extractExternalLinks": false,
  "ctaScope": "article",
  "ctaDomainAllowlist": ["partner-brand.com", "affiliate-network.com"],
  "proxyType": "RESIDENTIAL"
}

See scripts/brand-check-UA.json.

Output

Each input URL produces one dataset item.

| `status` | Meaning |
| --- | --- |
| `success_static` | Content via HTTP/Cheerio |
| `success_rendered` | Content via Playwright |
| `blocked_social` | Social URL, not scraped |
| `failed_fetch` | HTTP or network error |
| `failed_dynamic` | Empty content after Playwright |
| `failed_timeout` | Request timeout |
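
When post-processing a dataset, the status values above can be grouped into successes, failures, and blocked URLs. A minimal sketch, assuming the items are already loaded as Python dicts; the helper `summarize_statuses` is hypothetical, and whether a failed URL is worth retrying is a judgment left to the caller.

```python
from collections import Counter

# Status values taken from the table above.
SUCCESS = {"success_static", "success_rendered"}
FAILED = {"failed_fetch", "failed_dynamic", "failed_timeout"}


def summarize_statuses(items):
    """Count items per status and collect the URLs that failed."""
    counts = Counter(item.get("status", "unknown") for item in items)
    failed_urls = [item["url"] for item in items if item.get("status") in FAILED]
    succeeded = sum(counts[s] for s in SUCCESS)
    return counts, succeeded, failed_urls


items = [
    {"url": "https://example.com", "status": "success_static"},
    {"url": "https://example.org", "status": "failed_timeout"},
    {"url": "https://www.instagram.com/someprofile/", "status": "blocked_social"},
]
counts, succeeded, failed_urls = summarize_statuses(items)
print(dict(counts), succeeded, failed_urls)
```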

Example (success with CTA + SEO)

{
  "url": "https://landing.example.com/promo",
  "country": "UA",
  "status": "success_static",
  "contentFormat": "markdown",
  "title": "Promo Landing",
  "markdown": "# Welcome\n\n...",
  "html": null,
  "text": null,
  "seo": {
    "title": "Promo Landing",
    "metaDescription": "Best offer in UA",
    "h1": "Welcome",
    "canonical": "https://landing.example.com/promo",
    "ogTitle": "Promo Landing",
    "ogDescription": "Best offer in UA",
    "ogImage": null,
    "robots": "index, follow",
    "excerpt": null,
    "siteName": null,
    "byline": null
  },
  "externalLinks": null,
  "ctaLinks": [
    {
      "url": "https://partner-casino.com/register?ref=abc",
      "text": "Get bonus",
      "domain": "partner-casino.com",
      "score": 92,
      "reasons": ["class:cta_pattern", "rel:sponsored"]
    }
  ],
  "method": "cheerio",
  "httpStatus": 200,
  "errorMessage": null,
  "fetchedAt": "2026-05-23T12:00:00.000Z",
  "billable": true
}
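
For a brand or leak check, the `ctaLinks` array in an item like the one above can be compared against a list of approved domains. A minimal sketch, assuming the item shape shown in the example; the helper `find_unapproved_cta` and the approved-domain list are hypothetical.

```python
def find_unapproved_cta(item, approved_domains):
    """Return CTA links whose domain is not in the approved set."""
    approved = {d.lower() for d in approved_domains}
    # ctaLinks can be null, as in the blocked_social example.
    links = item.get("ctaLinks") or []
    return [link for link in links if link["domain"].lower() not in approved]


item = {
    "url": "https://landing.example.com/promo",
    "status": "success_static",
    "ctaLinks": [
        {
            "url": "https://partner-casino.com/register?ref=abc",
            "text": "Get bonus",
            "domain": "partner-casino.com",
            "score": 92,
            "reasons": ["class:cta_pattern", "rel:sponsored"],
        }
    ],
}
leaks = find_unapproved_cta(item, ["partner-brand.com", "affiliate-network.com"])
for link in leaks:
    print(link["domain"], link["score"], link["url"])
```

Setting `ctaDomainAllowlist` in the input performs the filtering inside the Actor instead; a client-side check like this one is useful when the full set of CTA links should stay in the dataset.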

CTA filtering

CTA extraction always excludes:

Internal links (same registrable domain as the page)
Social networks (Facebook, Instagram, Telegram, etc.)
Technical domains (analytics, CDN, cookie consent, app stores)
Share buttons and static asset URLs
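
The exclusion rules above can be illustrated with a rough sketch. It is not the Actor's implementation: the domain sets are hypothetical samples, and `naive_registrable` approximates the registrable domain as the last two host labels, which is wrong for suffixes such as `co.uk` (a real implementation would consult the Public Suffix List). Share buttons and static assets are not covered here.

```python
from urllib.parse import urlparse

# Hypothetical sample sets; the Actor's real lists are not documented here.
SOCIAL = {"facebook.com", "instagram.com", "t.me"}
TECHNICAL = {"google-analytics.com", "cloudflare.com", "apps.apple.com"}
EXCLUDED = SOCIAL | TECHNICAL


def naive_registrable(host):
    """Approximate the registrable domain as the last two labels."""
    return ".".join(host.lower().split(".")[-2:])


def keep_cta(page_url, link_url):
    """Return True if the link passes the internal/social/technical checks."""
    page = naive_registrable(urlparse(page_url).hostname or "")
    host = (urlparse(link_url).hostname or "").lower()
    link = naive_registrable(host)
    if link == page:  # internal link
        return False
    return link not in EXCLUDED and host not in EXCLUDED


page = "https://landing.example.com/promo"
print(keep_cta(page, "https://partner-casino.com/register?ref=abc"))
print(keep_cta(page, "https://www.example.com/about"))
print(keep_cta(page, "https://www.instagram.com/someprofile/"))
```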

Example (`blocked_social`)

{
  "url": "https://www.instagram.com/someprofile/",
  "country": "US",
  "status": "blocked_social",
  "platform": "instagram",
  "blockedReason": "social_network_not_supported",
  "message": "Social network URLs are blocked. Use a dedicated social scraper Actor.",
  "contentFormat": "markdown",
  "markdown": null,
  "html": null,
  "text": null,
  "seo": null,
  "ctaLinks": null,
  "billable": false
}

Pricing

When the Actor uses pay-per-event pricing on Apify Store:

Successful web scrapes (success_static, success_rendered) charge the scraped-url event (once per URL).
blocked_social and failed statuses are not charged for that event.

You also pay standard Apify platform usage (compute, proxy traffic) according to your plan. Residential proxy traffic typically costs more than datacenter traffic.

Check the Pricing tab on this Actor's Store page for current event prices.
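
The event charge for a finished run can be estimated from the dataset. A minimal sketch, assuming the "from $5.00 / 1,000 markdown results" listing price applies to the scraped-url event (the actual event price may differ, so check the Pricing tab); platform usage for compute and proxy traffic is not included, and the helper name is hypothetical.

```python
PRICE_PER_1000 = 5.00  # assumption: the listing price shown on the Store page
BILLABLE = {"success_static", "success_rendered"}


def estimate_event_cost(items, price_per_1000=PRICE_PER_1000):
    """Return (billable item count, estimated event cost in USD)."""
    billable = sum(1 for item in items if item.get("status") in BILLABLE)
    return billable, billable * price_per_1000 / 1000


items = [
    {"url": "https://example.com", "status": "success_static"},
    {"url": "https://example.org", "status": "success_rendered"},
    {"url": "https://www.instagram.com/someprofile/", "status": "blocked_social"},
]
count, cost = estimate_event_cost(items)
print(count, f"${cost:.3f}")
```

The output examples also carry a `billable` flag per item, which could be counted instead of matching on `status`.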

Run from API

POST https://api.apify.com/v2/acts/morph_coder~proxy-page-to-markdown/runs?token=YOUR_TOKEN
Content-Type: application/json

{
  "country": "US",
  "urls": ["https://example.com"],
  "extractCtaLinks": true,
  "extractSeo": true,
  "contentFormat": "markdown"
}

Read results from the run's defaultDatasetId (one item per URL). Use webhooks on ACTOR.RUN.SUCCEEDED for automation.
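
The same request can be built from Python with the standard library. A minimal sketch, assuming the run response wraps the run object in a `data` field and that items are read from the dataset items endpoint; the function names are hypothetical, and `YOUR_TOKEN` is a placeholder. Only the request is built at the top level, so nothing is sent until `start_run` is called.

```python
import json
import urllib.request

API = "https://api.apify.com/v2"
ACTOR = "morph_coder~proxy-page-to-markdown"


def build_run_request(token, actor_input):
    """Build the POST request that starts a run (same call as shown above)."""
    return urllib.request.Request(
        f"{API}/acts/{ACTOR}/runs?token={token}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(token, actor_input):
    """Send the request and return the run object, including defaultDatasetId."""
    with urllib.request.urlopen(build_run_request(token, actor_input)) as resp:
        return json.load(resp)["data"]


def fetch_items(token, dataset_id):
    """Read the dataset items; call this only after the run has finished."""
    url = f"{API}/datasets/{dataset_id}/items?token={token}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


req = build_run_request(
    "YOUR_TOKEN",
    {"country": "US", "urls": ["https://example.com"], "extractCtaLinks": True},
)
print(req.get_method(), req.full_url)
```

Starting a run returns before scraping completes, so a caller would wait for the run to succeed (for example via the webhook mentioned above) and then pass `run["defaultDatasetId"]` to `fetch_items`.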

Limits and tips

Up to 1000 URLs per run; the Actor batches internally (batchSize, default 50).
For brand/leak checks use extractCtaLinks: true and extractExternalLinks: false on landing pages.
Use extractExternalLinks: true only when you need a broad domain inventory (blogs/news are noisy).
Set ctaDomainAllowlist to your approved partner/brand domains for precise leak detection.
Some sites block proxies or return 502; the item for that URL then has status failed_fetch.
For Instagram, Facebook, LinkedIn, TikTok, X, YouTube, and Reddit URLs, expect blocked_social, not markdown.
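
Two of the tips above can be applied before a run is started: setting social URLs aside and splitting a long URL list into runs that fit the 1000-URL cap. A minimal sketch; the host list is a hypothetical approximation of the platforms named above, and the Actor's own detection rules may differ.

```python
from urllib.parse import urlparse

MAX_URLS = 1000  # default maxUrls from the input table

# Hypothetical host list for the platforms named above.
SOCIAL_HOSTS = (
    "instagram.com", "facebook.com", "linkedin.com", "tiktok.com",
    "x.com", "youtube.com", "reddit.com",
)


def is_social(url):
    """Return True if the URL's host is one of the listed platforms."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in SOCIAL_HOSTS)


def plan_runs(urls, max_urls=MAX_URLS):
    """Drop social URLs, then split the rest into per-run chunks."""
    scrapable = [u for u in urls if not is_social(u)]
    return [scrapable[i:i + max_urls] for i in range(0, len(scrapable), max_urls)]


urls = [f"https://example.com/page/{n}" for n in range(2500)]
urls.append("https://www.instagram.com/someprofile/")
chunks = plan_runs(urls)
print([len(chunk) for chunk in chunks])
```

Pre-filtering is optional: social URLs left in the input are not charged for the scraped-url event, but they do count toward `maxUrls` and still produce a blocked_social row.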

Support

Open an issue from the Actor page or contact the publisher. For development and deployment notes, see DEVELOPMENT.md in the repository.
