Professional Universal HTML & Media Extractor
Pricing
$7.00/month + usage
This script uses Playwright inside an Apify Actor to fetch the complete HTML source of any website. The user provides a URL, the page is loaded with JavaScript execution, and the full HTML is printed in the terminal and saved to an HTML file.
Developer: Data Pilot
This Apify Actor is a high-performance web scraping solution designed to extract the complete rendered HTML content from any website. Built on top of Python 3.12 and Playwright, it is specifically optimized to handle modern, JavaScript-heavy single-page applications (SPAs) like YouTube, TikTok, and Instagram.
Why Use This Scraper?
Standard scrapers often fail to capture content that is loaded dynamically via JavaScript. This actor solves that by:
- Executing JavaScript: It waits for the page to fully "hydrate" before capturing data.
- Bypassing Restrictions: Integrated with Residential Proxies to minimize CAPTCHAs and IP bans.
- Visual Verification: Automatically takes a screenshot so you can see exactly what the bot saw.
Detailed Features
1. Advanced Browser Stealth
Launches Chromium with the AutomationControlled Blink feature disabled and a custom User-Agent string, so that pages are less likely to flag the session as automated.
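As a rough illustration of the stealth setup described above, the sketch below launches Chromium through Playwright with the `--disable-blink-features=AutomationControlled` switch and a custom User-Agent. The names `USER_AGENT`, `build_launch_options`, and `fetch_html` are hypothetical, and the User-Agent value and the `networkidle` wait strategy are assumptions, not the Actor's confirmed settings.

```python
# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_launch_options(headless: bool = True) -> dict:
    """Return keyword arguments for Playwright's chromium.launch()."""
    return {
        "headless": headless,
        # Chromium switch that turns off the AutomationControlled Blink feature.
        "args": ["--disable-blink-features=AutomationControlled"],
    }


async def fetch_html(url: str) -> str:
    """Load a page with JavaScript enabled and return its rendered HTML."""
    # Imported lazily so build_launch_options() works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(**build_launch_options())
        try:
            context = await browser.new_context(user_agent=USER_AGENT)
            page = await context.new_page()
            # Assumption: waiting for network idle approximates "fully hydrated".
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await browser.close()
```

With Playwright and its Chromium build installed, `asyncio.run(fetch_html("https://example.com"))` would return the rendered HTML as a string.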
2. Residential Proxy Integration
Configured to use Apify's premium residential proxy pool by default, ensuring high success rates on sites with strict anti-bot shields.
3. Multiple Output Formats
- Dataset: Structured JSON/CSV/Excel data including Title, URL, and Timestamp.
- Key-Value Store (HTML): The full raw HTML source saved as a viewable `.html` file.
- Key-Value Store (Image): A high-resolution `.png` screenshot of the page.
Output Fields
| Field | Type | Description |
|---|---|---|
| status | String | "Success" or "Failed" |
| site_info.title | String | Page title |
| site_info.original_url | String | Input URL |
| site_info.final_url | String | Final URL (after redirects) |
| site_info.scraped_at | String | Timestamp of scraping |
| site_info.html_length | Integer | HTML content length in characters |
| full_html | String | Complete HTML source code (if enabled) |
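To make the field list concrete, here is a minimal sketch of how one successful record might be assembled. It assumes the dotted names (`site_info.title` and so on) map to a nested `site_info` object and that `scraped_at` is an ISO 8601 UTC string; neither detail is confirmed by the source, and `build_record` is a hypothetical helper.

```python
from datetime import datetime, timezone


def build_record(original_url: str, final_url: str, title: str,
                 html: str, include_html: bool = True) -> dict:
    """Assemble one dataset item using the field names from the table above."""
    record = {
        "status": "Success",
        "site_info": {
            "title": title,
            "original_url": original_url,
            "final_url": final_url,
            # Assumption: timestamps are stored as ISO 8601 strings in UTC.
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            # Length in characters, not bytes.
            "html_length": len(html),
        },
    }
    if include_html:
        record["full_html"] = html
    return record
```

Leaving `full_html` out when it is not needed keeps dataset items small for large pages.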
How It Works
- Input Processing - Parses and validates URLs
- Browser Launch - Starts Chromium browser via Playwright
- Page Navigation - Visits each URL with timeout handling
- Content Extraction - Captures page title and HTML
- Data Storage - Pushes structured data to Apify dataset
- Error Handling - Logs failures and continues with next URL
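The steps above can be sketched as a loop that visits each URL with a timeout, records the result, and continues after a failure. This is an illustration only: `process_urls` and its `fetch` parameter are hypothetical, the result fields are simplified, and the 30-second default timeout is an assumption. In the real Actor, `fetch` would wrap the Playwright navigation and each item would be pushed to the Apify dataset.

```python
import asyncio
from typing import Awaitable, Callable


async def process_urls(
    urls: list[str],
    fetch: Callable[[str], Awaitable[tuple[str, str]]],
    timeout_seconds: float = 30.0,
) -> list[dict]:
    """Visit each URL in order; record a failure and continue with the next URL."""
    results = []
    for url in urls:
        try:
            # fetch() is expected to return (title, html) for the page.
            title, html = await asyncio.wait_for(fetch(url), timeout=timeout_seconds)
            results.append({"status": "Success", "url": url,
                            "title": title, "html_length": len(html)})
        except Exception as exc:  # includes timeouts raised by wait_for
            results.append({"status": "Failed", "url": url, "error": str(exc)})
    return results
```

Passing the fetcher in as a parameter keeps the control flow testable without launching a browser.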
Technical Details
- Browser: Chromium (via Playwright)
- Memory: Minimum 512MB recommended
- Language: Python 3.11+
- Dependencies:
  - `apify` - Apify SDK
  - `playwright` - Browser automation
  - `aiohttp` - Async HTTP client
Use Cases
- Content Analysis - Extract and analyze website content
- SEO Auditing - Check page titles and meta information
- Website Monitoring - Track changes in website content
- Data Migration - Backup website HTML
- Research - Collect data from multiple websites
- Competitive Analysis - Compare competitor websites
Tips for Best Results
- Batch Processing: Process multiple URLs in one run
- Wait Time: Adjust based on website loading speed
- Proxy Usage: Enable for blocked or geo-restricted sites
- HTML Size: Be aware that large pages increase the dataset size
- Rate Limiting: Add delays between requests for large batches
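For the rate-limiting tip, a simple approach is a fixed pause between consecutive requests. The sketch below is one possible pattern, not the Actor's built-in behavior; `run_with_delay`, `worker`, and the one-second default are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def run_with_delay(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    delay_seconds: float = 1.0,
) -> list:
    """Run worker(item) sequentially, sleeping between consecutive items."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Pause before every request except the first one.
            await asyncio.sleep(delay_seconds)
        results.append(await worker(item))
    return results
```

A larger `delay_seconds` trades throughput for a lighter load on the target server.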
Legal & Ethical Use
- Respect website Terms of Service
- Check robots.txt before scraping
- Don't overload servers with requests
- Comply with data protection laws (GDPR, etc.)
- Use responsibly and ethically
Input Configuration
The Actor accepts the following JSON input:
| Field | Type | Description |
|---|---|---|
| url | String | The specific website link you want to scrape. |
| urls | Array | (Optional) A list of multiple URLs to process in sequence. |
Example Input:
{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
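A minimal sketch of how the `url` and `urls` fields might be merged into one validated list is shown below. The helper `collect_urls` is hypothetical, and the rules it applies (http/https only, duplicates dropped, input order kept) are assumptions rather than the Actor's documented behavior.

```python
from urllib.parse import urlsplit


def collect_urls(actor_input: dict) -> list[str]:
    """Merge `url` and `urls` into one ordered, de-duplicated list of valid URLs."""
    candidates = []
    if actor_input.get("url"):
        candidates.append(actor_input["url"])
    candidates.extend(actor_input.get("urls") or [])

    seen = set()
    valid = []
    for raw in candidates:
        url = str(raw).strip()
        parts = urlsplit(url)
        # Assumption: only absolute http(s) URLs are accepted.
        if parts.scheme in ("http", "https") and parts.netloc and url not in seen:
            seen.add(url)
            valid.append(url)
    return valid
```

Inside an Actor, the input dict would typically come from `await Actor.get_input()` in the Apify SDK for Python.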