Zach's "Webpage Content To Markdown" Scraper avatar

Zach's "Webpage Content To Markdown" Scraper

Under maintenance
Try for free

3 days trial then $19.00/month - No credit card required now

Go to Store
This Actor is under maintenance.

This Actor may be unreliable while under maintenance. Would you like to try a similar Actor instead?

See alternative Actors
Zach's "Webpage Content To Markdown" Scraper

Zach's "Webpage Content To Markdown" Scraper

dyf/webpage-to-markdown
Try for free

3 days trial then $19.00/month - No credit card required now

This Apify Actor scrapes a single webpage and parses it to markdown. It's packed with features to ensure a high success rate at low cost: browser-based scraping, smart retrying, circumvention of anti-scrape blocks (e.g. Cloudflare), and smart proxy support.

It also includes two modes of operation, so you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
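
If you'd rather trigger runs programmatically than from the Apify Console, here's a minimal sketch using the official apify-client Python package. The input and output field names ("startUrls", "useBrowser", "markdown") are assumptions for illustration; check the Actor's input schema for the real ones.

```python
# A minimal sketch of running the Actor via the Apify API.
# Field names below are assumptions; see the Actor's input schema.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("dyf/webpage-to-markdown").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],  # assumed field name
        "useBrowser": False,  # assumed name for "Get Data Using Browser"
    }
)

# Each dataset item should carry the page's markdown (field name assumed).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("markdown", "")[:200])
```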

🤔 When To Use It

Whenever you want to reliably get a webpage's content and parse it into markdown.

(Personally, I mostly use it to feed data into ChatGPT for freelance cold-outreach personalization and automation tasks, which I cover in our $200k Freelancer course.)

😰 Why We Made It:

If you want to have ChatGPT interpret a webpage, it can be surprisingly difficult with current tooling.

  • 😭 ChatGPT's API isn't currently web-connected
  • 😿 If you try to get a page's content via a Make automation and parse it to text/markdown, it's unreliable and produces a lot of soft failures and rendering errors
  • 🤢 If you try to use standalone webpage-to-markdown scraping tools, they're expensive and also have a lot of soft failures and markdown rendering errors
  • 😣 If you use the other website-crawling-to-markdown scrapers on Apify, they tend to be expensive and unreliable

That's why we made this Actor...

💪 Why This Actor is Nifty:

😍 This Actor lets you simply plop in a big ole list of domain names and get a huge spreadsheet of markdown content back, to do whatever you want with.

(e.g. upload to Google Sheets and have ChatGPT iterate through it via a Make automation)

🤘 Features:

  • ✅ Anti-Scrape Circumvention – if you use the "Get Data Using Browser" option, we'll be able to circumvent many blocks
  • ✅ Soft-Failure Reporting – e.g. if a webpage comes back blank, we'll mark it as a failure (not a lot of other solutions do this)
  • ✅ Smart Proxy Support – we'll run on Datacenter proxies by default, and only switch to Residential proxies when actually necessary
  • ✅ Smart Retrying – we'll auto-retry on failures and rotate proxies and IPs to get you the most successful results possible

💭 Example Use Cases:

If you're a $200k Freelancer course student, be sure to check the course training area for guidance on the use cases below and more.

Website Language Detection:

  1. Run this actor
  2. Put results into a Google Sheet
  3. Filter out the fails
  4. Add the formula =DETECTLANGUAGE(E2) (assuming E is the markdown column) to a new column
  5. Extend that formula to all rows in the column
  6. Filter the results to hide languages you don't want (e.g. show only `en` to keep English-language websites); if you'd rather do this step in code, see the sketch below
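
If you prefer code over spreadsheet formulas, here's a minimal sketch of the same filtering step in Python. It assumes you've exported the run results to an Excel file with a markdown column (column name assumed) and uses the third-party pandas and langdetect packages, which aren't part of this Actor.

```python
# A sketch of language filtering in code instead of Google Sheets.
# Assumes a "markdown" column in the export (name assumed) and uses
# third-party packages: pip install pandas openpyxl langdetect
import pandas as pd
from langdetect import detect

df = pd.read_excel("results.xlsx")  # the exported run results

def detect_lang(text):
    try:
        return detect(str(text))
    except Exception:
        return "unknown"  # blank or undetectable content

df["language"] = df["markdown"].apply(detect_lang)
df[df["language"] == "en"].to_excel("results_en.xlsx", index=False)
```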

Cold Outreach Personalization:

(e.g. find out what kinds of products a company sells, who their audience avatar is, etc.)

  1. Run this actor
  2. Put results into a Google Sheet
  3. Filter out the fails
  4. Create a Make automation that feeds the markdown into ChatGPT for analysis
  5. Have ChatGPT return its analysis as JSON if you want multiple fields back (e.g. "type_of_products_sold", "random_product_name", etc.)
  6. Parse the JSON and add each field to a column in the Google Sheet
  7. You can now feed this data into a line-writer ChatGPT prompt to have it rewrite a template line with this personalization data (steps 4–5 are sketched in code below)
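
For the Make-averse, here's a minimal Python sketch of steps 4–5, calling OpenAI's API directly and asking for a JSON object back. The model name and the JSON field names are illustrative assumptions, not part of this Actor.

```python
# A sketch of feeding scraped markdown to ChatGPT and getting JSON back.
# Uses the official openai package (pip install openai); the model and
# output field names are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_markdown(markdown: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Analyze this webpage markdown and reply with JSON "
                    'containing "type_of_products_sold" and "random_product_name".'
                ),
            },
            {"role": "user", "content": markdown[:10000]},  # matches the 10,000-char trim cap
        ],
    )
    return json.loads(response.choices[0].message.content)

analysis = analyze_markdown("# Acme Co\nWe sell industrial-grade anvils...")
print(analysis.get("type_of_products_sold"))
```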

Modes of Operation

Regardless of which mode you use it in, if you're exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)

"Low-Hanging Fruit" Mode

The following settings are efficient and the cheapest path to data, but won't work for a lot of websites:

  • "Get Data Using Browser" option disabled
  • 1GB of RAM
  • Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)

Estimated Costs for "Low-Hanging Fruit" Mode:

  • Est. cost per result in "Low-Hanging Fruit" Mode: $0.00025
  • Est. yield on results: 84.12%

"All The Damned Fruit" Mode

The following settings have very high reliability, but are more expensive:

  • "Get Data Using Browser" option enabled
  • 4GB of RAM (You can often get away with 2GB – or even 1GB – of RAM, which will make it much cheaper.)
  • Residential proxies (we use datacenter by default in our code and will only use residential if actually necessary)

Estimated Costs for "All The Damned Fruit" Mode:

  • Est. cost per result in "All The Damned Fruit" Mode: $0.0069 CPL for residential proxies ($0.0012 CPL for datacenter)
  • Est. yield on results: 93.38% for residential (91.64% datacenter)
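
In code, the difference between the two modes is just run options. Here's a sketch with apify-client, again with assumed input field names; the memory values map to the RAM settings above.

```python
# A sketch of the two modes as apify-client calls (field names assumed).
from apify_client import ApifyClient

actor = ApifyClient("<YOUR_APIFY_TOKEN>").actor("dyf/webpage-to-markdown")
urls = [{"url": "https://example.com"}]

# "Low-Hanging Fruit" Mode: no browser, 1 GB of RAM.
cheap_run = actor.call(
    run_input={"startUrls": urls, "useBrowser": False},
    memory_mbytes=1024,
)

# "All The Damned Fruit" Mode: browser enabled, 4 GB of RAM.
thorough_run = actor.call(
    run_input={"startUrls": urls, "useBrowser": True},
    memory_mbytes=4096,
)
```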

Pricing Breakdown:

| Results | Valid Results | Cost | Cost Per Result (CPL) | Yield | Time | Memory | Proxy | Using Browser Build |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2462 | 2071 | $0.612 | $0.0002486 | 84.12% | 36 min | 1 GB | Residential | No |
| 2463 | 2078 | $0.914 | $0.0003711 | 84.37% | 19 min | 4 GB | Residential | No |
| 2463 | 2257 | $2.99 | $0.0012140 | 91.64% | 96 min | 4 GB | Datacenter | Yes |
| 2463 | 2300 | $17 | $0.0069022 | 93.38% | 120 min | 4 GB | Residential (if needed) | Yes |

Suggested Usage

Depending on your priorities, there are a couple ways to use this scraper. What's your priority?

"My Priority is EASE"

("...And I don't care if it costs more.")

👉 Run it with the "All The Damned Fruit" Mode settings from the "Modes of Operation" section above, right from the start.

(As noted above: if you're exporting to a spreadsheet, choose MS Excel format, not CSV, since markdown often breaks the CSV file.)

"My Priority is COST"

("...And I don't care if it means there are a couple extra steps for me.")

👉 You'll do two separate runs: first you'll grab all the cheap "Low-Hanging Fruit" results you can, then you'll re-run all the failures in "All The Damned Fruit" Mode.

Instructions:

  1. Run your full set of URLs with the "Low-Hanging Fruit" Mode settings (you can find them in the Modes of Operation section at the top of this page)
  2. After the run is finished, export the results to Excel format and filter the list to only show the failures
  3. Re-run those failures with the "All The Damned Fruit" Mode settings (also in the Modes of Operation section)
  4. Export the results from both runs and merge the data manually into one sheet (the whole flow is sketched in code below)
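
Here's a minimal end-to-end sketch of that two-run flow in Python, merging the datasets in code instead of by hand. The input/output field names ("startUrls", "useBrowser", "url", "markdown", "failed") are assumptions; check the Actor's actual schema before relying on them.

```python
# A sketch of the two-run COST strategy with apify-client.
# Field names are assumptions; see the Actor's input/output schema.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
actor = client.actor("dyf/webpage-to-markdown")
urls = ["https://example.com", "https://example.org"]

# Run 1: "Low-Hanging Fruit" Mode over the full URL list.
run1 = actor.call(
    run_input={"startUrls": [{"url": u} for u in urls], "useBrowser": False},
    memory_mbytes=1024,
)
items = list(client.dataset(run1["defaultDatasetId"]).iterate_items())
successes = [i for i in items if not i.get("failed")]
failed_urls = [i["url"] for i in items if i.get("failed")]

# Run 2: "All The Damned Fruit" Mode, failures only.
if failed_urls:
    run2 = actor.call(
        run_input={"startUrls": [{"url": u} for u in failed_urls], "useBrowser": True},
        memory_mbytes=4096,
    )
    successes += [
        i
        for i in client.dataset(run2["defaultDatasetId"]).iterate_items()
        if not i.get("failed")
    ]

print(f"{len(successes)} successful results merged from both runs")
```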

All Config Options

  • Maximum Content Length (Characters) – this will trim each record's markdown output before we add it to the result set, cutting down on spreadsheet file size. (Our hard-set internal trim maximum is 10,000 characters.)