
Zach's "Webpage Content To Markdown" Scraper
Pricing
$19.00/month + usage

Zach's "Webpage Content To Markdown" Scraper
Scrape a webpage and parse to markdown. Packed with features to ensure high success rate and low cost. Includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
0.0 (0)
Pricing
$19.00/month + usage
1
Total users
21
Monthly users
10
Runs succeeded
>99%
Last modified
2 months ago
This Apify actor scrapes a single webpage and parses it to markdown. It includes browser-based scraping, smart retrying, anti-scrape-block (e.g. Cloudflare) circumvention, and smart proxy support to ensure a high success rate.
It also includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
🤔 When To Use It
Whenever you want to reliably get a webpage's content and parse it into markdown.
(I personally mostly use it for feeding data into ChatGPT for freelance cold outreach personalization & automation tasks, which I cover in our $200k Freelancer course.)
😰 Why We Made It:
If you want to have ChatGPT interpret a webpage, it can be surprisingly difficult with current tooling.
- 😭 ChatGPT's API isn't currently web-connected
- 😿 If you try to get a page's content via a Make automation and parse it to text/markdown, it's unreliable and produces a lot of soft failures and rendering errors
- 🤢 If you try to use standalone tools for webpage scraping to markdown conversion, they're expensive and also have a lot of soft failures & markdown rendering errors
- 😣 If you use the other website-crawling-to-markdown scrapers on Apify, they tend to be expensive and unreliable.
That's why we made this Actor...
💪 Why This Actor is Nifty:
😍 This actor allows you to simply plop in a big ole list of domain names, and get a huge spreadsheet of markdown content back, to do whatever you want with.
(e.g. upload to google sheets and have ChatGPT iterate through via Make automation)
🤘 Features:
- ✅ Anti-Scrape Circumvention — if you use the "Get Data Using Browser" option, we'll be able to circumvent many blocks
- ✅ Soft-Failure Reporting — e.g. if a webpage comes back blank, we'll mark it as a failure (not a lot of other solutions do this)
- ✅ Smart Proxy Support — we'll run on Datacenter proxies by default, and only revert to Residential proxies when actually necessary
- ✅ Smart Retrying — we'll auto retry on failures and rotate proxies and IPs to get you the most successful results possible
💭 Example Use Cases:
If you're a $200k Freelancer course student, be sure to check the course training area for guidance on the below use cases and more.
Website Language Detection:
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Add the formula =DETECTLANGUAGE(E2) (assuming E is the markdown column) to a new column
- Extend that formula to all rows in the column
- Filter results to hide languages you don't want (e.g. filter to only show en for English-language websites); if you'd rather do this step in code, see the sketch below
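If you'd rather do the language filtering outside Google Sheets, here's a minimal Python sketch using the langdetect package. The package choice, the export filename, and the "markdown" column name are all assumptions, not part of this actor:

```python
# pip install langdetect pandas openpyxl
# Hypothetical sketch: filter an exported results sheet down to English rows.
import pandas as pd
from langdetect import detect

df = pd.read_excel("results.xlsx")  # assumed export filename

def detect_language(markdown) -> str:
    """Return an ISO 639-1 code like 'en', or 'unknown' if detection fails."""
    try:
        return detect(markdown)
    except Exception:  # blank or garbled markdown raises inside langdetect
        return "unknown"

# The "markdown" column name is an assumption; match it to your actual export.
df["language"] = df["markdown"].apply(detect_language)
df[df["language"] == "en"].to_excel("results_english.xlsx", index=False)
```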
Cold Outreach Personalization:
(e.g. find out what kinds of products a company sells, who their audience avatar is, etc.)
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Create a Make automation that feeds the markdown into ChatGPT for analysis
- Have ChatGPT give you its analysis back as JSON if you want multiple fields back (e.g. "type_of_products_sold", "random_product_name", etc.)
- Parse the JSON and add each field to a column in the Google Sheet
- You can now feed this data into a line-writer ChatGPT prompt to have it rewrite a template line with your personalization data (a scripted version of the analysis steps is sketched below)
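If you'd rather script the ChatGPT-analysis steps instead of wiring them up in Make, a minimal Python sketch against the OpenAI chat API could look like this. The model name, prompt wording, and JSON field names are illustrative assumptions:

```python
# pip install openai
# Hypothetical sketch: ask ChatGPT to analyze scraped markdown and return JSON.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def analyze_markdown(markdown: str) -> dict:
    prompt = (
        "Analyze this company webpage and reply with a JSON object containing "
        'the keys "type_of_products_sold" and "random_product_name".\n\n'
        + markdown[:10000]  # the actor trims to 10,000 characters anyway
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for valid JSON back
    )
    return json.loads(response.choices[0].message.content)

fields = analyze_markdown(open("page.md").read())
print(fields["type_of_products_sold"], fields["random_product_name"])
```

Parsed fields can then go into new spreadsheet columns, exactly as in the steps above.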
Modes of Operation
Regardless of which mode you use, if you're exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown content often contains commas and line breaks that will mess up a CSV file.)
"Low-Hanging Fruit" Mode
The following settings are efficient and the cheapest path to data, but won't work for a lot of websites:
- "Get Data Using Browser" option disabled
- 1GB of RAM
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for "Low-Hanging Fruit" Mode:
- Est. cost per result in "Low-Hanging Fruit" Mode: $0.00025
- Est. yield on results: 84.12%
"All The Damned Fruit" Mode
The following settings have very high reliability, but are more expensive:
- "Get Data Using Browser" option enabled
- 4GB of RAM (You can often get away with 2GB – or even 1GB – of RAM, which will make it much cheaper.)
- Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)
Estimated Costs for "All The Damned Fruit" Mode:
- Est. cost per result in "All The Damned Fruit" Mode: $0.0069 CPL for residential proxies ($0.0012 CPL for datacenter)
- Est. yield on results: 93.38% for residential (91.64% datacenter)
Pricing Breakdown:
| Results | Valid Results | Cost | Cost Per Result (CPL) | Yield | Time | Memory | Proxy | Using Browser Build |
|---|---|---|---|---|---|---|---|---|
| 2462 | 2071 | $0.612 | $0.0002486 | 84.12% | 36 min | 1 GB | Residential | No |
| 2463 | 2078 | $0.914 | $0.0003711 | 84.37% | 19 min | 4 GB | Residential | No |
| 2463 | 2257 | $2.99 | $0.0012140 | 91.64% | 96 min | 4 GB | Datacenter | Yes |
| 2463 | 2300 | $15-46 | As high as $0.02 | 93.38% | 120 min | 4 GB | Residential | Yes |
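To see how these numbers combine under the two-pass approach described in the Suggested Usage section below, here's a rough back-of-the-envelope cost model in Python. It uses the estimated CPL and yield figures from the table above, and it ignores retries and the wide $15-46 browser-plus-residential cost range, so treat it as a sketch, not a quote:

```python
# Back-of-the-envelope cost model for the two-pass strategy, using the
# published estimates above (estimates, not guarantees).
URLS = 2462

CHEAP_CPL, CHEAP_YIELD = 0.00025, 0.8412    # "Low-Hanging Fruit" mode
PRICEY_CPL, PRICEY_YIELD = 0.0069, 0.9338   # "All The Damned Fruit" mode, residential

# Pass 1: everything through the cheap mode.
pass1_cost = URLS * CHEAP_CPL
failures = round(URLS * (1 - CHEAP_YIELD))

# Pass 2: only the failures through the expensive mode.
pass2_cost = failures * PRICEY_CPL
recovered = round(failures * PRICEY_YIELD)

total_valid = round(URLS * CHEAP_YIELD) + recovered
total_cost = pass1_cost + pass2_cost
print(f"{total_valid} valid results for ${total_cost:.2f} "
      f"(~${total_cost / total_valid:.5f} per valid result)")
print(f"single expensive pass over everything: ${URLS * PRICEY_CPL:.2f}")
```

On the batch above, this works out to roughly 2,436 valid results for about $3.31, versus about $17 for a single expensive pass over everything.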
Suggested Usage
Depending on your priorities, there are a couple ways to use this scraper. What's your priority?
"My Priority is EASE"
("...And I don't care if it costs more.")
👉 Run it with the "All The Damned Fruit" Mode settings from the "Modes of Operation" instructions right from the start.
Just be aware that at 4GB of RAM + residential proxies, you can pay up to ~80x more per result ($0.02 vs. $0.00025) than if you ran "Low-Hanging Fruit" Mode first.
"My Priority is COST"
("...And I don't care if it means there are a couple extra steps for me.")
👉 You'll do two separate runs — first you'll get all the cheap Low-Hanging Fruit results you can, then you'll re-run all the failures in the "All The Damned Fruit" Mode.
Instructions:
- Run for your full set of URLs with the "Low-Hanging Fruit" Mode settings (you can find them in the Modes of Operation section at the top of this page)
- After the run is finished, export the results to Excel format and filter the list to only show the failures
- Re-run these failures with the "All The Damned Fruit" Mode settings (also in the Modes of Operation section)
- Export the results from both runs and merge the data manually into one sheet (or automate the whole flow, as sketched below)
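If you'd rather not shuffle spreadsheets by hand, the same two-pass flow can be scripted against the Apify API. This is a minimal sketch using the apify-client Python package; the actor ID, the input field names ("urls", "useBrowser"), and the per-record success flag are assumptions, so check this actor's input schema for the real names:

```python
# pip install apify-client
# Hypothetical sketch of the two-pass flow via the Apify API.
from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")
actor = client.actor("zach/webpage-content-to-markdown")  # assumed actor ID

def run_pass(urls, use_browser, memory_mb):
    """Run the actor once and return its dataset items as a list."""
    run = actor.call(
        run_input={"urls": urls, "useBrowser": use_browser},  # assumed field names
        memory_mbytes=memory_mb,
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

all_urls = ["https://example.com", "https://example.org"]

# Pass 1: "Low-Hanging Fruit" mode -- no browser, 1GB of RAM.
first = run_pass(all_urls, use_browser=False, memory_mb=1024)
failed = [r["url"] for r in first if not r.get("success")]  # assumed flag

# Pass 2: re-run only the failures in "All The Damned Fruit" mode.
second = run_pass(failed, use_browser=True, memory_mb=4096)

# Merge: successes from pass 1 plus everything recovered in pass 2.
merged = [r for r in first if r.get("success")] + second
```

Exporting to Excel and merging in Sheets still works exactly as described above; this just removes the manual filtering and re-run steps.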
All Config Options
- Maximum Content Length (Characters) — This will trim each record's markdown output before we add it to the result set, which cuts down on spreadsheet file size. (Our hard-set internal trim maximum is 10,000 characters)