Zach's "Webpage Content To Markdown" Scraper
3-day trial, then $19.00/month. No credit card required now.
Scrape a webpage and parse it to markdown. Packed with features to ensure a high success rate and low cost. Includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
This Apify actor scrapes a single webpage and parses it to markdown. It includes browser-based scraping, smart retrying, anti-scrape-block circumvention (e.g. Cloudflare), and smart proxy support to ensure a high success rate.
When To Use It
Whenever you want to reliably get a webpage's content and parse it into markdown.
(I personally mostly use it for feeding data into ChatGPT for freelance cold outreach personalization & automation tasks, which I cover in our $200k Freelancer course.)
Why We Made It:
Getting ChatGPT to interpret a webpage is surprisingly difficult with current tooling:
- ChatGPT's API isn't currently web-connected
- If you try to get a page's content via a Make automation and parse it to text/markdown, it's unreliable and produces a lot of soft failures and rendering errors
- If you try to use standalone webpage-to-markdown tools, they're expensive and also have a lot of soft failures & markdown rendering errors
- The other website-crawling-to-markdown scrapers on Apify tend to be expensive and unreliable
That's why we made this Actor...
Why This Actor Is Nifty:
This actor lets you simply plop in a big ole list of domain names and get a huge spreadsheet of markdown content back, to do whatever you want with.
(e.g. upload to Google Sheets and have ChatGPT iterate through it via a Make automation)
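If you'd rather run it from code than from the Apify console, here's a minimal sketch using the official apify-client Python package. The actor ID and input/output field names below are assumptions for illustration, not this Actor's confirmed schema; check the Input tab for the real names.

```python
# pip install apify-client
# Minimal sketch of running the Actor programmatically and collecting results.
# NOTE: "username/webpage-content-to-markdown" and the input keys "startUrls"
# and "useBrowser" are assumptions; check this Actor's Input tab for the real
# actor ID and schema.
from apify_client import ApifyClient

client = ApifyClient("MY_APIFY_TOKEN")

run_input = {
    "startUrls": [{"url": "https://example.com"}, {"url": "https://example.org"}],
    "useBrowser": False,  # hypothetical toggle for "Get Data Using Browser"
}

run = client.actor("username/webpage-content-to-markdown").call(run_input=run_input)

# Each dataset item holds one page's result (markdown plus success/failure info).
# Output field names are also assumptions; inspect one item to see the real keys.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("markdown", "")))
```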
Features:
- Anti-Scrape Circumvention: if you use the "Get Data Using Browser" option, we'll be able to circumvent many blocks
- Soft-Failure Reporting: e.g. if a webpage comes back blank, we'll mark it as a failure (not a lot of other solutions do this)
- Smart Proxy Support: we'll run on Datacenter proxies by default, and only fall back to Residential proxies when actually necessary
- Smart Retrying: we'll auto-retry on failures and rotate proxies and IPs to get you the most successful results possible
Example Use Cases:
If you're a $200k Freelancer course student, be sure to check the course training area for guidance on the use cases below and more.
Website Language Detection:
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Add the formula =DETECTLANGUAGE(E2) (assuming E is the markdown column) to a new column
- Extend that formula to all rows in the column
- Filter the results to hide languages you don't want (e.g. filter to only show en for English-language websites)
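If you'd rather detect language in code instead of in Sheets, here's a minimal sketch using the langdetect Python library (a swapped-in alternative to DETECTLANGUAGE, not part of this Actor). The row structure is a stand-in for your exported results.

```python
# pip install langdetect
# Minimal sketch: filter scraped markdown rows down to English-language sites.
# "rows" stands in for your exported results; adjust to however you load them.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

rows = [
    {"url": "https://example.com", "markdown": "# Welcome to our store..."},
    {"url": "https://example.de", "markdown": "# Willkommen in unserem Shop..."},
]

english_rows = []
for row in rows:
    try:
        if detect(row["markdown"]) == "en":
            english_rows.append(row)
    except LangDetectException:
        pass  # blank or non-text content; treat it as a fail

print(f"{len(english_rows)} of {len(rows)} sites are in English")
```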
Cold Outreach Personalization:
(e.g. find out what kinds of products a company sells, who their audience avatar is, etc.)
- Run this actor
- Put results into a Google Sheet
- Filter out the fails
- Create a Make automation that feeds the markdown into ChatGPT for analysis
- Have ChatGPT give you its analysis back as JSON if you want multiple fields back (e.g. "type_of_products_sold", "random_product_name", etc.)
- Parse the JSON and add each field to a column in the Google Sheet
- You can now feed this data into a line-writer ChatGPT prompt to have it rewrite a template line with this personalization data
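If you're doing the JSON step in code rather than in a Make module, here's a minimal sketch. The field names are just the hypothetical examples from the list above; use whatever keys you ask ChatGPT for in your prompt.

```python
# Minimal sketch: parse ChatGPT's JSON analysis into flat columns for a sheet.
# The field names below are hypothetical examples, not a fixed schema.
import json

chatgpt_reply = """
{
  "type_of_products_sold": "handmade ceramic kitchenware",
  "random_product_name": "Glazed Stoneware Mug",
  "audience_avatar": "home cooks who buy artisan goods"
}
"""

analysis = json.loads(chatgpt_reply)

# One row per site, ready to append as columns in your Google Sheet.
row = [
    analysis.get("type_of_products_sold", ""),
    analysis.get("random_product_name", ""),
    analysis.get("audience_avatar", ""),
]
print(row)
```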
Modes of Operation
Regardless of which mode you use it in, if you're exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)
"Low-Hanging Fruit" Mode
The following settings are efficient and the cheapest path to data, but won't work for a lot of websites:
- "Get Data Using Browser" option disabled
- 1GB of RAM
- Residential proxies (our code uses datacenter by default and will only switch to residential when actually necessary)
Estimated Costs for "Low-Hanging Fruit" Mode:
- Est. cost per result in "Low-Hanging Fruit" Mode: $0.00025
- Est. yield on results: 84.12%
"All The Damned Fruit" Mode
The following settings have very high reliability, but are more expensive:
- "Get Data Using Browser" option enabled
- 4GB of RAM (you can often get away with 2GB, or even 1GB, which will make it much cheaper)
- Residential proxies (our code uses datacenter by default and will only switch to residential when actually necessary)
Estimated Costs for "All The Damned Fruit" Mode:
- Est. cost per result in "All The Damned Fruit" Mode: $0.0069 CPL for residential proxies ($0.0012 CPL for datacenter)
- Est. yield on results: 93.38% for residential (91.64% datacenter)
Pricing Breakdown:
| Results | Valid Results | Cost | Cost Per Result (CPL) | Yield | Time | Memory | Proxy | Using Browser Build |
|---|---|---|---|---|---|---|---|---|
| 2462 | 2071 | $0.612 | $0.0002486 | 84.12% | 36 min | 1 GB | Residential | No |
| 2463 | 2078 | $0.914 | $0.0003711 | 84.37% | 19 min | 4 GB | Residential | No |
| 2463 | 2257 | $2.99 | $0.0012140 | 91.64% | 96 min | 4 GB | Datacenter | Yes |
| 2463 | 2300 | $17 | $0.0069022 | 93.38% | 120 min | 4 GB | Residential | Yes |
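To see why the two-run strategy described below can pay off, here's back-of-envelope arithmetic using the table's own numbers (rows 1 and 4):

```python
# Back-of-envelope math for the two-run "COST" strategy, using rows 1 and 4
# of the table above.
urls = 2463

# Run 1: "Low-Hanging Fruit" mode on everything.
cheap_cpl, cheap_yield = 0.0002486, 0.8412
run1_cost = urls * cheap_cpl                 # ~$0.61
failures = round(urls * (1 - cheap_yield))   # ~391 URLs still need retrying

# Run 2: "All The Damned Fruit" mode on just the failures.
# (Note: the retry yield will likely be below the table's 93.38%, since
# these are the harder sites by definition.)
pricey_cpl = 0.0069022
run2_cost = failures * pricey_cpl            # ~$2.70

print(f"Two-run total: ~${run1_cost + run2_cost:.2f}")
print(f"One expensive run on everything: ~${urls * pricey_cpl:.2f}")
```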
Suggested Usage
Depending on your priorities, there are a couple ways to use this scraper. What's your priority?
"My Priority is EASE"
("...And I don't care if it costs more.")
Run it with the "All The Damned Fruit" Mode settings (from the "Modes of Operation" section above) right from the start.
As noted in the Modes of Operation section, if you're exporting to a spreadsheet, choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)
"My Priority is COST"
("...And I don't care if it means there are a couple extra steps for me.")
You'll do two separate runs: first you'll get all the cheap "Low-Hanging Fruit" results you can, then you'll re-run all the failures in "All The Damned Fruit" Mode.
Instructions:
- Run for your full set of URLs with the "Low-Hanging Fruit" Mode settings (you can find them in the Modes of Operation section at the top of this page)
- After the run is finished, export the results to Excel format and filter the list to only show the failures
- Re-run these failures with the "All The Damned Fruit" Mode settings (also in the Modes of Operation section)
- Export the results from both runs and merge the data manually into one sheet (or script the merge; see the sketch below)
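If you'd rather script the final merge than do it by hand, here's a minimal pandas sketch. The file names and the success-flag column name are assumptions; match them to your actual exports.

```python
# pip install pandas openpyxl
# Minimal sketch: merge the two run exports into one sheet.
# File names and the "success" column are assumptions; match your exports.
import pandas as pd

cheap_run = pd.read_excel("low_hanging_fruit_run.xlsx")
browser_run = pd.read_excel("all_the_damned_fruit_run.xlsx")

# Keep the successes from the cheap run, then append the retry results.
ok = cheap_run["success"] == True  # hypothetical column name
merged = pd.concat([cheap_run[ok], browser_run], ignore_index=True)

merged.to_excel("merged_results.xlsx", index=False)
print(f"{len(merged)} rows written")
```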
All Config Options
- Maximum Content Length (Characters): this will trim each record's markdown output before we add it to the result set. Cuts down on spreadsheet file size. (Our hard-set internal trim maximum is 10,000 characters)
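Conceptually, the option behaves like the sketch below: your setting trims each record's markdown, capped by the Actor's hard internal maximum of 10,000 characters. (This is an illustration of the described behavior, not the Actor's actual code.)

```python
# Illustration of the Maximum Content Length option's described behavior.
MAX_CONTENT_LENGTH = 2_000  # example user setting

def trim_markdown(markdown: str, limit: int = MAX_CONTENT_LENGTH) -> str:
    # The user's limit applies, but never beyond the 10,000-character hard cap.
    return markdown[: min(limit, 10_000)]
```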