🚀 Professional Universal HTML & Media Extractor

Pricing

$7.00/month + usage

This script uses Playwright inside an Apify Actor to fetch the complete HTML source of any website. The user provides a URL, the page is loaded with JavaScript execution, and the full HTML is printed in the terminal and saved to an HTML file.

Rating: 0.0 (0)

Developer: Data Pilot

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user

Last modified: 2 days ago

This Apify Actor is a high-performance web scraping solution designed to extract the complete rendered HTML content from any website. Built on top of Python 3.12 and Playwright, it is specifically optimized to handle modern, JavaScript-heavy single-page applications (SPAs) like YouTube, TikTok, and Instagram.


💎 Why Use This Scraper?

Standard scrapers often miss content that is loaded dynamically via JavaScript. This Actor addresses that by:

  1. Executing JavaScript: It waits for the page to fully "hydrate" before capturing data.
  2. Bypassing Restrictions: Integrated with Residential Proxies to minimize CAPTCHAs and IP bans.
  3. Visual Verification: Automatically takes a screenshot so you can see exactly what the bot saw.

🛠️ Detailed Features

1. Advanced Browser Stealth

Launches Chromium with the AutomationControlled Blink feature disabled and a custom User-Agent string, so that the browser looks less like an automated client.
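
As a rough illustration of the stealth settings described above, the sketch below assembles Chromium launch arguments and context options for Playwright. The helper names, the viewport size, and the User-Agent string are placeholders chosen for this example and are not taken from the Actor's source; only the AutomationControlled switch is implied by the description.

```python
# Placeholder User-Agent for illustration; the Actor's real string is not documented.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def stealth_launch_options(headless: bool = True) -> dict:
    """Keyword arguments for chromium.launch() (hypothetical helper)."""
    return {
        "headless": headless,
        # Disables the Blink feature that marks the browser as automation-controlled.
        "args": ["--disable-blink-features=AutomationControlled"],
    }


def stealth_context_options(user_agent: str = EXAMPLE_USER_AGENT) -> dict:
    """Keyword arguments for browser.new_context() (hypothetical helper)."""
    return {"user_agent": user_agent, "viewport": {"width": 1920, "height": 1080}}


async def fetch_html(url: str) -> str:
    """Load a page with JavaScript enabled and return its rendered HTML."""
    # Imported lazily so the helpers above can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(**stealth_launch_options())
        context = await browser.new_context(**stealth_context_options())
        page = await context.new_page()
        # Waiting for network idle is one possible heuristic for "page has loaded".
        await page.goto(url, wait_until="networkidle", timeout=60_000)
        html = await page.content()
        await browser.close()
        return html
```

With Playwright and its Chromium build installed, `asyncio.run(fetch_html("https://example.com"))` would return the rendered HTML as a string.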

2. Residential Proxy Integration

Configured to use Apify's residential proxy pool by default, which improves success rates on sites with strict anti-bot protection.
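
One way this wiring could look, sketched under assumptions: the Apify SDK for Python hands out a proxy URL, and a small helper converts it into the `proxy` dictionary that Playwright's `launch()` accepts. The helper names are hypothetical, and the Actor's actual proxy code is not shown in this listing.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def playwright_proxy_from_url(proxy_url: str) -> dict:
    """Convert 'scheme://user:pass@host:port' into Playwright's proxy dict."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    proxy = {"server": server}
    if parts.username:
        proxy["username"] = unquote(parts.username)
    if parts.password:
        proxy["password"] = unquote(parts.password)
    return proxy


async def residential_proxy_settings() -> Optional[dict]:
    """Ask the Apify SDK for a residential proxy URL (call inside `async with Actor:`)."""
    # Imported lazily; requires the apify package and an account with proxy access.
    from apify import Actor

    config = await Actor.create_proxy_configuration(groups=["RESIDENTIAL"])
    if config is None:
        return None
    return playwright_proxy_from_url(await config.new_url())
```

The resulting dictionary can be passed as `proxy=...` to `chromium.launch()`.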

3. Multiple Output Formats

  • Dataset: Structured JSON/CSV/Excel data including Title, URL, and Timestamp.
  • Key-Value Store (HTML): The full raw HTML source saved as a viewable .html file.
  • Key-Value Store (Image): A high-resolution .png screenshot of the page.
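
The three outputs above could be written roughly as follows. This is a sketch, not the Actor's code: the key-naming scheme and function names are assumptions, while `Actor.set_value` and `Actor.push_data` are the Apify SDK for Python calls for the key-value store and the dataset.

```python
import hashlib


def storage_key(url: str, extension: str) -> str:
    """Derive a stable key-value store key from a URL (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"page-{digest}.{extension}"


async def save_outputs(url: str, title: str, html: str, screenshot_png: bytes) -> None:
    """Store HTML and screenshot in the key-value store and a summary row in the dataset."""
    # Imported lazily; must run inside an `async with Actor:` block.
    from apify import Actor

    await Actor.set_value(storage_key(url, "html"), html, content_type="text/html")
    await Actor.set_value(storage_key(url, "png"), screenshot_png, content_type="image/png")
    await Actor.push_data({"title": title, "url": url})
```

The `screenshot_png` bytes would typically come from Playwright's `await page.screenshot(full_page=True)`.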

Output Fields

| Field | Type | Description |
|---|---|---|
| status | String | "Success" or "Failed" |
| site_info.title | String | Page title |
| site_info.original_url | String | Input URL |
| site_info.final_url | String | Final URL (after redirects) |
| site_info.scraped_at | String | Timestamp of scraping |
| site_info.html_length | Integer | HTML content length in characters |
| full_html | String | Complete HTML source code (if enabled) |
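
To make the field list concrete, here is a small sketch that assembles one record with these fields. It assumes that `site_info` is a nested object and that `scraped_at` is an ISO 8601 UTC timestamp; neither detail is confirmed by the listing, and `build_record` is a hypothetical helper.

```python
from datetime import datetime, timezone


def build_record(
    original_url: str,
    final_url: str,
    title: str,
    html: str,
    include_html: bool = True,
    success: bool = True,
) -> dict:
    """Assemble one dataset item using the documented field names."""
    record = {
        "status": "Success" if success else "Failed",
        "site_info": {
            "title": title,
            "original_url": original_url,
            "final_url": final_url,
            # Assumed format: ISO 8601 in UTC.
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "html_length": len(html),
        },
    }
    if include_html:
        record["full_html"] = html
    return record
```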

🛠️ How It Works

  1. Input Processing - Parses and validates URLs
  2. Browser Launch - Starts Chromium browser via Playwright
  3. Page Navigation - Visits each URL with timeout handling
  4. Content Extraction - Captures page title and HTML
  5. Data Storage - Pushes structured data to Apify dataset
  6. Error Handling - Logs failures and continues with next URL
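
The steps above can be sketched as a loop that applies a per-URL timeout and keeps going after a failure. The `fetch` callable is a stand-in for the Playwright navigation and extraction step, and all names here are illustrative assumptions rather than the Actor's own code.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Iterable

logger = logging.getLogger("extractor")


async def process_urls(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[dict]],
    timeout_s: float = 60.0,
) -> list:
    """Visit each URL in sequence; log failures and continue with the next URL."""
    results = []
    for url in urls:
        try:
            # `fetch` is expected to return a dict such as {"title": ..., "html": ...}.
            page = await asyncio.wait_for(fetch(url), timeout=timeout_s)
            results.append({"status": "Success", "url": url, **page})
        except Exception as exc:  # includes timeouts raised by wait_for
            logger.warning("Failed to scrape %s: %s", url, exc)
            results.append({"status": "Failed", "url": url})
    return results
```

In an Actor run, each result would then be pushed to the dataset, for example with `await Actor.push_data(result)`.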

⚙️ Technical Details

  • Browser: Chromium (via Playwright)
  • Memory: Minimum 512MB recommended
  • Language: Python 3.11+
  • Dependencies:
    • apify - Apify SDK
    • playwright - Browser automation
    • aiohttp - Async HTTP client

📝 Use Cases

  1. Content Analysis - Extract and analyze website content
  2. SEO Auditing - Check page titles and meta information
  3. Website Monitoring - Track changes in website content
  4. Data Migration - Backup website HTML
  5. Research - Collect data from multiple websites
  6. Competitive Analysis - Compare competitor websites

💡 Tips for Best Results

  1. Batch Processing: Process multiple URLs in one run
  2. Wait Time: Adjust based on website loading speed
  3. Proxy Usage: Enable for blocked or geo-restricted sites
  4. HTML Size: Be aware large pages will increase dataset size
  5. Rate Limiting: Add delays between requests for large batches
  • Respect website Terms of Service
  • Check robots.txt before scraping
  • Don't overload servers with requests
  • Comply with data protection laws (GDPR, etc.)
  • Use responsibly and ethically
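
Two of these tips, checking robots.txt and spacing out requests, can be illustrated with the Python standard library alone. This is a minimal sketch; the helper names and the default delay are arbitrary choices for the example, and the robots.txt text is assumed to have been downloaded already.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of an already downloaded robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def iter_with_delay(urls, delay_s: float = 2.0):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_s)
        yield url
```

URLs that fail the robots.txt check can be filtered out before they are added to the Actor input.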

📥 Input Configuration

The Actor accepts the following JSON input:

| Field | Type | Description |
|---|---|---|
| url | String | The specific website link you want to scrape. |
| urls | Array | (Optional) A list of multiple URLs to process in sequence. |

Example Input:

{
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
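
For illustration, the two input fields could be merged into one ordered list of URLs as sketched below. The validation rules (http/https only, duplicates dropped) and the name `collect_urls` are assumptions for this example, not documented behavior of the Actor.

```python
from urllib.parse import urlsplit


def collect_urls(actor_input: dict) -> list:
    """Merge `url` and `urls` into a de-duplicated list of valid http(s) URLs."""
    candidates = []
    single = actor_input.get("url")
    if isinstance(single, str):
        candidates.append(single)
    many = actor_input.get("urls")
    if isinstance(many, list):
        candidates.extend(u for u in many if isinstance(u, str))

    seen = set()
    valid = []
    for raw in candidates:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only absolute http(s) URLs that have not been seen yet.
        if parts.scheme in ("http", "https") and parts.netloc and url not in seen:
            seen.add(url)
            valid.append(url)
    return valid
```

Inside an Actor, the input dictionary would normally come from `await Actor.get_input()`.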