Proxy Page to Markdown scraper

Pricing

from $5.00 / 1,000 markdown results

Fetches pages through Apify proxy in your chosen country (residential or datacenter). Returns clean markdown per URL; optional unique outbound domains for brand checks. Cheerio first, Playwright fallback. Social URLs → blocked_social.

Rating: 0.0 (0)

Developer: Olek Coder

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 0
  • Last modified: 10 days ago

Fetch web pages through Apify Proxy in a country you choose (residential or datacenter) and get clean markdown per URL. Optional outbound domain extraction for brand or affiliate checks. Uses fast HTTP (Cheerio) first, then Playwright when the page needs rendering.

Actor ID: olek_automate~proxy-page-to-markdown

This Actor is not affiliated with any third-party website you scrape. You are responsible for complying with each site's terms of use and applicable laws.

Features

  • Geo-targeted requests via Apify Proxy (country + RESIDENTIAL / DATACENTER)
  • One dataset row per input URL (including failures and blocked social URLs)
  • Readability-based main content → markdown with links preserved
  • Optional external links: one sample URL per external domain (share buttons, analytics, and assets are filtered out)
  • Social profile/post URLs → blocked_social (use dedicated social Actors instead)
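
Because social URLs only ever come back as blocked_social, it can be convenient to separate them before starting a run. The following is a minimal client-side sketch, not the Actor's own matching logic: the host list is an assumption built from the platforms named under "Limits and tips" (with twitter.com added as an assumed alias for X), and `is_social` and `split_urls` are hypothetical helper names.

```python
from urllib.parse import urlsplit

# Assumed host list based on the platforms this README names.
# The Actor's real matching rules are not documented here.
SOCIAL_HOSTS = {
    "instagram.com", "facebook.com", "linkedin.com", "tiktok.com",
    "x.com", "twitter.com", "youtube.com", "reddit.com",
}


def is_social(url: str) -> bool:
    """Return True if the host is a listed social domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_HOSTS)


def split_urls(urls):
    """Partition URLs into (scrapable, social) lists, preserving input order."""
    scrapable, social = [], []
    for url in urls:
        (social if is_social(url) else scrapable).append(url)
    return scrapable, social
```

The social list can then be routed to a dedicated social Actor while only the scrapable list is sent to this one.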

Input

| Field | Default | Description |
| --- | --- | --- |
| country | US | Two-letter ISO proxy country (required) |
| urls | https://example.com | Page URLs to scrape (max maxUrls per run) |
| batchSize | 50 | Internal batch size (10–100) |
| maxUrls | 1000 | Hard cap on URLs per run |
| extractExternalLinks | false | Unique external domains (brand/leak checks) |
| maxExternalLinks | 100 | Max domains per page when links enabled |
| usePlaywrightFallback | true | Retry thin Cheerio results with Chrome |
| minContentLength | 200 | Min markdown length before Playwright |
| proxyType | RESIDENTIAL | RESIDENTIAL or DATACENTER |
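
The input fields above can be assembled and sanity-checked on the client before a run is started. This is a sketch only: `build_input` is a hypothetical helper, the defaults are copied from the table, and the checks mirror the documented ranges rather than the Actor's actual validation, which may differ.

```python
# Defaults copied from the input table; country and urls are passed explicitly.
DEFAULTS = {
    "batchSize": 50,
    "maxUrls": 1000,
    "extractExternalLinks": False,
    "maxExternalLinks": 100,
    "usePlaywrightFallback": True,
    "minContentLength": 200,
    "proxyType": "RESIDENTIAL",
}


def build_input(country, urls, **overrides):
    """Merge overrides into the documented defaults and check documented limits."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides}
    urls = list(urls)
    if len(country) != 2 or not country.isalpha():
        raise ValueError("country must be a two-letter ISO code")
    if not 10 <= cfg["batchSize"] <= 100:
        raise ValueError("batchSize must be between 10 and 100")
    if cfg["proxyType"] not in ("RESIDENTIAL", "DATACENTER"):
        raise ValueError("proxyType must be RESIDENTIAL or DATACENTER")
    if len(urls) > cfg["maxUrls"]:
        raise ValueError("more URLs than maxUrls allows for one run")
    return {"country": country.upper(), "urls": urls, **cfg}
```

For example, `build_input("DE", ["https://example.com"], proxyType="DATACENTER")` yields a complete input object with every other field left at its documented default.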

Output

Each input URL produces one dataset item.

| status | Meaning |
| --- | --- |
| success_static | Markdown via HTTP/Cheerio |
| success_rendered | Markdown via Playwright |
| blocked_social | Social URL, not scraped |
| failed_fetch | HTTP or network error |
| failed_dynamic | Empty content after Playwright |
| failed_timeout | Request timeout |
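
The split between success_static and success_rendered follows from the usePlaywrightFallback and minContentLength inputs: a thin static result is retried in a browser. The decision can be sketched roughly as below. `needs_render` is a hypothetical name, and measuring length on whitespace-stripped markdown is an assumption; the Actor's exact rule is not documented here.

```python
def needs_render(static_markdown, min_content_length=200, use_playwright_fallback=True):
    """Return True when a static (Cheerio) result looks too thin and should be re-rendered.

    Assumption: length is measured on whitespace-stripped markdown.
    """
    if not use_playwright_fallback:
        return False
    return len((static_markdown or "").strip()) < min_content_length
```

Under this reading, lowering minContentLength keeps more short pages on the cheaper static path, while raising it sends more pages to Playwright.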

Example (success)

{
  "url": "https://example.com",
  "country": "US",
  "status": "success_static",
  "title": "Example Domain",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "externalLinks": null,
  "method": "cheerio",
  "httpStatus": 200,
  "errorMessage": null,
  "fetchedAt": "2026-05-23T12:00:00.000Z",
  "billable": true
}

Example (blocked_social)

{
  "url": "https://www.instagram.com/someprofile/",
  "country": "US",
  "status": "blocked_social",
  "platform": "instagram",
  "blockedReason": "social_network_not_supported",
  "message": "Social network URLs are blocked. Use a dedicated social scraper Actor.",
  "markdown": null,
  "billable": false
}
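
Since every input URL yields exactly one item, a run can be post-processed by grouping items on status. The sketch below assumes items shaped like the two examples above; `summarize` is a hypothetical helper, not part of the Actor.

```python
from collections import Counter

SUCCESS_STATUSES = {"success_static", "success_rendered"}


def summarize(items):
    """Return (status counts, {url: markdown} for successes, URLs with failed_* status)."""
    counts = Counter(item.get("status", "unknown") for item in items)
    pages = {
        item["url"]: item["markdown"]
        for item in items
        if item.get("status") in SUCCESS_STATUSES
    }
    failed = [
        item["url"]
        for item in items
        if str(item.get("status", "")).startswith("failed_")
    ]
    return counts, pages, failed
```

The failed list is a natural candidate for a follow-up run, for instance with a different proxyType or country.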

Pricing

When the Actor uses pay-per-event pricing on Apify Store:

  • Successful web scrapes (success_static, success_rendered) charge the scraped-url event (once per URL).
  • blocked_social and failed statuses are not charged for that event.

You also pay standard Apify platform usage (compute, proxy traffic) according to your plan. Residential proxy traffic typically costs more than datacenter traffic.

Check the Pricing tab on this Actor's Store page for current event prices.
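
As a rough illustration of the event charge, the sketch below counts items whose status is billable under the rules above and multiplies by a per-1,000 price. The $5.00 default is the "from" price shown on this listing and is only an assumption; platform usage (compute, proxy traffic) is not included, and `estimate_event_cost` is a hypothetical helper.

```python
BILLABLE_STATUSES = {"success_static", "success_rendered"}


def estimate_event_cost(items, price_per_1000=5.00):
    """Return (billable item count, estimated event charge in USD)."""
    billable = sum(1 for item in items if item.get("status") in BILLABLE_STATUSES)
    return billable, billable * price_per_1000 / 1000
```

For a run of 1,000 URLs in which 900 succeed and 100 fail or are blocked, this gives 900 billable results, or $4.50 at the assumed price. Each item's own billable field could be used for the count instead of the status.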

Run from API

POST https://api.apify.com/v2/acts/olek_automate~proxy-page-to-markdown/runs?token=YOUR_TOKEN
Content-Type: application/json

{
  "country": "US",
  "urls": ["https://example.com"],
  "extractExternalLinks": false
}

Read results from the run's defaultDatasetId (one item per URL). Use webhooks on ACTOR.RUN.SUCCEEDED for automation.
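
The same request can be issued from a script. Below is a minimal sketch using only the Python standard library, assuming the standard Apify API v2 routes for starting a run and listing dataset items. The helper names are hypothetical, error handling is omitted, and the run must have finished before its dataset is read.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "olek_automate~proxy-page-to-markdown"


def build_run_request(token, actor_input):
    """Build the POST request that starts a run with the given input."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/runs?" + urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(token, actor_input):
    """Start a run and return its run object (includes id and defaultDatasetId)."""
    with urllib.request.urlopen(build_run_request(token, actor_input)) as resp:
        return json.load(resp)["data"]


def fetch_items(token, dataset_id):
    """Return the dataset items (one per input URL) as a list of dicts."""
    query = urllib.parse.urlencode({"token": token, "format": "json"})
    with urllib.request.urlopen(f"{API_BASE}/datasets/{dataset_id}/items?{query}") as resp:
        return json.load(resp)
```

A typical flow is `run = start_run(token, {"country": "US", "urls": ["https://example.com"]})`, then, once the run has succeeded, `fetch_items(token, run["defaultDatasetId"])`.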

Limits and tips

  • Up to 1000 URLs per run; the Actor batches internally (batchSize, default 50).
  • Use extractExternalLinks: true mainly for landing / brand pages, not large blogs or news sites.
  • Some sites block proxies or return 502; the item is then reported as failed_fetch.
  • Instagram, Facebook, LinkedIn, TikTok, X, YouTube, and Reddit URLs return blocked_social, not markdown.
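
When a job has more URLs than one run accepts, the list can be split into several runs on the client. This is a small sketch under the documented cap of 1000 URLs per run; `chunk_urls` is a hypothetical helper and is separate from the Actor's internal batchSize batching.

```python
def chunk_urls(urls, max_per_run=1000):
    """Split URLs into consecutive lists of at most max_per_run items each."""
    if max_per_run < 1:
        raise ValueError("max_per_run must be at least 1")
    urls = list(urls)
    return [urls[i:i + max_per_run] for i in range(0, len(urls), max_per_run)]
```

Each chunk then becomes the urls field of its own run.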

Support

Open an issue from the Actor page or contact the publisher. For development and deployment notes, see DEVELOPMENT.md in the repository.