Universal Web Extractor V8

Pricing

from $0.01 / 1,000 results

Flexible web extractor using Python + Playwright or HTTP. Supports CSS-based field extraction, HTML snapshots, screenshots, metadata, monitoring mode, and link-following. Ideal for scraping product pages, listings, news articles, tech profiles, or universal structured data from any website.

Rating: 5.0 (1)

Developer: Leoncio Jr Coronado (Maintained by Community)

Actor stats

Bookmarked: 0

Total users: 10

Monthly active users: 3

Last modified: 5 days ago

🟦 Universal Web Extractor V8

Python Edition — HTTPX + BeautifulSoup

A fast, lightweight universal web scraper that fetches webpages over HTTP, parses HTML using BeautifulSoup, and returns clean, structured data — including title, description, and full text — without launching a browser.

This Actor is designed for speed, low cost, and simplicity, making it ideal for APIs, SEO pipelines, metadata extraction, and content analysis.

🚀 When to Use This Actor

Use Universal Web Extractor V8 (HTTP version) when:

Pages are static HTML (no JavaScript rendering required)

You need fast, low-cost scraping

You want clean text content from webpages

You are building SEO, research, or content pipelines

For JavaScript-heavy websites, use the Playwright edition of this Actor instead.

🧠 How It Works

Actor loads start_urls from input

For each URL:

Sends an HTTP request using httpx

Parses HTML with BeautifulSoup

Extracts:

Title

Description

Cleaned full text

Pushes results to a flat JSON dataset

No browser. No JavaScript rendering. Maximum speed.
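
The steps above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual source: the function names `extract_fields` and `scrape` are hypothetical, the body of `clean_html` is a guess at what a helper with that name might do, and the timeout and redirect settings are assumptions. Pushing the records to the Apify dataset is left out.

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup


def clean_html(soup: BeautifulSoup) -> str:
    """Return readable text: drop script/style/noscript, collapse whitespace."""
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


def extract_fields(url: str, html: str) -> dict:
    """Build one flat record (title, description, text) from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "").strip() if meta else ""
    return {
        "url": url,
        "title": title,
        "description": description,
        "text_content": clean_html(soup),
        # UTC timestamp in the same shape as the output example
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def scrape(start_urls: list[str]) -> list[dict]:
    """Fetch each URL over plain HTTP and extract its fields."""
    import httpx  # imported here so the parsing helpers work without it

    results = []
    # Assumed settings: follow redirects, 30-second timeout
    with httpx.Client(follow_redirects=True, timeout=30.0) as client:
        for url in start_urls:
            response = client.get(url)
            response.raise_for_status()
            results.append(extract_fields(url, response.text))
    return results
```

Keeping the parsing step separate from the HTTP call means `extract_fields` can be exercised on saved HTML without any network access.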

📥 Input Example

{
  "start_urls": [
    "https://example.com",
    "https://quotes.toscrape.com/"
  ]
}
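
Since `start_urls` is the only input field, it can help to check the list before starting a run. The sketch below is one possible pre-flight check using only the standard library; `parse_start_urls` is a hypothetical helper, and the rules it enforces (absolute http or https URLs, duplicates dropped) are assumptions rather than documented Actor behavior.

```python
import json
from urllib.parse import urlparse


def parse_start_urls(raw_input: str) -> list[str]:
    """Parse input JSON and return validated, de-duplicated URLs in order."""
    data = json.loads(raw_input)
    urls = data.get("start_urls") if isinstance(data, dict) else None
    if not isinstance(urls, list) or not urls:
        raise ValueError("'start_urls' must be a non-empty list of URL strings")

    seen = set()
    valid = []
    for url in urls:
        if not isinstance(url, str):
            raise ValueError(f"URL must be a string, got {type(url).__name__}")
        url = url.strip()
        parts = urlparse(url)
        # Only absolute http(s) URLs make sense for an HTTP-only fetcher
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
        if url not in seen:
            seen.add(url)
            valid.append(url)
    return valid
```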

📤 Output Example

{
  "url": "https://example.com",
  "title": "Example Domain",
  "description": "This domain is for use in illustrative examples.",
  "text_content": "Example Domain This domain is for use in illustrative examples...",
  "timestamp": "2025-01-01T12:00:00Z"
}
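
Because each record is flat (no nested objects), it maps directly onto a table row. As an illustration, the sketch below writes a list of such records to CSV with the standard library. The `records_to_csv` helper is hypothetical; the column order follows the example above.

```python
import csv
import io

# Field names taken from the output example
FIELDS = ["url", "title", "description", "text_content", "timestamp"]


def records_to_csv(records: list[dict]) -> str:
    """Serialize flat records to CSV text, one row per scraped page."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising
        writer.writerow({name: record.get(name, "") for name in FIELDS})
    return buffer.getvalue()
```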

🧪 Best Practices

Use for static HTML pages

Ideal for:

Articles

Blogs

Documentation

Product descriptions

SEO metadata scraping

Batch URLs for maximum efficiency
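
To illustrate why batching pays off for HTTP-only scraping, here is a sketch of fetching many URLs concurrently with a cap on in-flight requests. It is an assumption about how a batch could be processed, not a description of the Actor's internals. `fetch_all` and `max_concurrency` are hypothetical names, and the `fetch` coroutine is supplied by the caller (for example, a thin wrapper around `httpx.AsyncClient.get`).

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> dict[str, str | Exception]:
    """Fetch all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str | Exception:
        async with semaphore:
            try:
                return await fetch(url)
            except Exception as exc:
                # Keep the error so one bad URL does not abort the batch
                return exc

    results = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(zip(urls, results))
```

Each URL maps either to its HTML or to the exception it raised, so failed pages can be retried or logged separately.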

❗ Limitations

❌ Cannot render JavaScript

❌ Not suitable for SPAs (React, Vue, Angular)

❌ No auto-pagination (HTTP-only version)

❌ No selector-based structured extraction (yet)
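
Whether a page falls under the JavaScript limitation is not always obvious in advance. The sketch below shows one rough heuristic for spotting pages that are probably rendered client-side: very little visible text combined with scripts or an empty app mount point. Everything here is an assumption for illustration: `looks_js_rendered`, the 200-character threshold, and the list of root element ids are not part of the Actor, and short static pages that load any script will be flagged as false positives.

```python
from bs4 import BeautifulSoup

# Assumed mount-point ids often used by client-rendered apps
SPA_ROOT_IDS = {"root", "app", "__next", "__nuxt"}


def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: little visible text plus scripts or an empty app root."""
    soup = BeautifulSoup(html, "html.parser")
    script_count = len(soup.find_all("script"))

    # Remove non-visible content before measuring the text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    visible_text = " ".join(soup.get_text(separator=" ").split())
    if len(visible_text) >= min_text_chars:
        return False

    has_empty_root = any(
        not div.get_text(strip=True)
        for div in soup.find_all("div", id=True)
        if div["id"] in SPA_ROOT_IDS
    )
    return has_empty_root or script_count > 0
```

A page flagged by a check like this is a candidate for the Playwright edition rather than the HTTP version.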

💡 Tips

If a site requires JavaScript → use the Playwright version

Combine with downstream Actors for:

Data cleaning

NLP

Embeddings

Indexing

🔧 Changelog

v0.0.9 — Python HTTP / BeautifulSoup Edition

Added httpx + BeautifulSoup extraction core

Automatic title, description, and text extraction

clean_html() helper for readable output

Simplified input schema (start_urls only)

Flat output schema (URL + timestamp + fields)

Ready for QA, Spotlight, and Challenge evaluation

🏆 Why This Actor Exists

This Actor focuses on speed, reliability, and simplicity — doing one thing extremely well: extract clean content from webpages with minimal cost and maximum performance.