OmniStruct Cost-Optimized Scraper

Pricing: Pay per usage
Need clean, structured data from a website without burning through your Apify compute credits? OmniStruct Cost-Optimized Scraper is built from the ground up to be the most efficient, budget-friendly universal scraper on the platform.

Developer: velurix (Maintained by Community)
Rating: 0.0 (0 reviews) · 2 total users · 1 monthly active user · last modified 13 days ago
Universal Web Scraper

A high-performance, cost-optimized Apify Actor designed to scrape massive lists of URLs (40,000+) efficiently. It employs a two-phase crawling strategy: a fast, cheap HTTP crawl using BeautifulSoup, followed by an automatic browser fallback (Playwright) only for pages that require JavaScript rendering.

Features

  • Cost-Optimized Two-Phase Crawling: 90% of sites are scraped using cheap HTTP requests. Only strictly necessary sites (e.g., SPAs, aggressive anti-bot <noscript> pages) trigger a headless browser.
  • Fail-Fast Logic: Instantly drops dead domains and 4xx/5xx errors to free up concurrency slots and prevent infinite proxy retry loops.
  • Pre-Flight Validation: The included run_and_monitor.py script validates DNS and basic HTTP connectivity before spending Apify compute units.
  • Intelligent Extraction: Uses Mozilla's Readability.js logic to extract the main article content, alongside robust email and phone number regex extraction, and standard meta tags.
  • Batch Processing: Safely splits large URL lists into manageable chunks to avoid Apify run limits.
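The browser-fallback decision described above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the Actor's actual code; the text-length threshold and the SPA mount-point IDs are assumptions:

```python
from html.parser import HTMLParser

class _TextAndMarkers(HTMLParser):
    """Collects visible text length and SPA/anti-bot markers from raw HTML."""
    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.has_noscript = False
        self.root_ids = set()
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.has_noscript = True
        if tag in ("script", "style"):
            self._skip += 1
        for name, value in attrs:
            if name == "id" and value:
                self.root_ids.add(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())

def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: fall back to a headless browser when the cheap HTTP crawl
    yields too little text or the page looks like a JS-rendered SPA."""
    p = _TextAndMarkers()
    p.feed(html)
    spa_roots = {"root", "app", "__next"}  # common SPA mount points (assumption)
    looks_like_spa = bool(spa_roots & p.root_ids)
    return p.has_noscript or looks_like_spa or p.text_chars < min_text_chars
```

A nearly empty `<div id="root">` shell would trigger the fallback, while a static page with real article text would not.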

How to Run

Option 1: Using the Apify UI

  1. Go to the run console of the Actor.
  2. Provide your list of urls in the input JSON format.
  3. Configure settings:
    • For bulk lists, set Max crawl depth (max_crawl_depth) to 0.
    • Set Max request retries (max_request_retries) to 0 or 1.
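Following those recommendations, a bulk-run input might look like this (field names are taken from the Configuration Reference below):

```json
{
  "urls": ["https://example.com", "https://example.org"],
  "max_crawl_depth": 0,
  "max_request_retries": 0,
  "enable_browser_fallback": true
}
```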

Option 2: Using the run_and_monitor.py script

For lists of more than 1,000 URLs, use the provided run_and_monitor.py script locally to orchestrate the API calls.

  1. Ensure your Apify API Token is set in run_and_monitor.py.
  2. Place your URLs in a CSV file (e.g., Untitled spreadsheet - Sheet1 (2).csv or update the script path).
  3. Run the script:
    # Runs with pre-validation (DNS + HEAD check) to save costs
    python run_and_monitor.py
    # Or, to skip validation and send directly to Apify:
    python run_and_monitor.py --no-validate
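The pre-flight validation step can be approximated with a small stdlib-only sketch (hostname parsing plus a DNS lookup; the HEAD check the script also performs is omitted here for brevity, and the function names are illustrative, not from the actual script):

```python
import socket
from urllib.parse import urlparse

def hostname_of(url: str) -> str:
    """Extract the hostname from a URL ('' if missing)."""
    return urlparse(url).hostname or ""

def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves; dead domains fail fast here,
    before any Apify compute units are spent."""
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def prevalidate(urls):
    """Split a URL list into (probably alive, dead) before submission."""
    alive, dead = [], []
    for url in urls:
        (alive if dns_resolves(hostname_of(url)) else dead).append(url)
    return alive, dead
```

Only the `alive` list would then be sent to the Actor, which is where the cost saving comes from.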

Configuration Reference

Field                   | Default | Description
----------------------- | ------- | -----------
urls                    | []      | List of starting URLs.
max_crawl_depth         | 0       | Keep at exactly 0 for flat URL lists to prevent spidering the whole site.
max_request_retries     | 1       | Retries per failed request. 0 recommended for bulk runs.
http_timeout_secs       | 10      | Wait limit for standard (HTTP-crawled) pages.
browser_timeout_secs    | 15      | Wait limit for browser-rendered pages.
enable_browser_fallback | true    | Allows Playwright to step in when BeautifulSoup fails to extract content or detects an SPA.
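The batch-splitting behavior mentioned under Features can be sketched as follows (the chunk size of 500 is an arbitrary placeholder, not the Actor's actual limit):

```python
def split_into_batches(urls, batch_size=500):
    """Split a large URL list into fixed-size chunks so each Apify run
    stays under platform limits. batch_size is a placeholder value."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each chunk would then be submitted as a separate Actor run by an orchestration script such as run_and_monitor.py.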

Output Format

The Actor pushes items to the Apify dataset in the following format:

{
  "url": "https://example.com",
  "domain": "example.com",
  "status": "success",
  "title": "Example Domain",
  "description": "This domain is for use in illustrative examples.",
  "content_text": "Example Domain This domain is for use in illustrative examples...",
  "content_length_chars": 62,
  "emails": ["contact@example.com"],
  "phone_numbers": ["1-800-555-1234"],
  "used_browser_fallback": false,
  "timestamp": "2026-02-21T00:00:00.000000"
}
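Downstream, the dataset items are plain JSON objects, so contact details can be aggregated with a few lines of Python (a hypothetical post-processing step, not part of the Actor):

```python
def collect_contacts(items):
    """Aggregate unique emails and phone numbers from successful items,
    keyed by domain. `items` is a list of dataset records in the
    output format shown above."""
    contacts = {}
    for item in items:
        if item.get("status") != "success":
            continue
        entry = contacts.setdefault(item["domain"], {"emails": set(), "phones": set()})
        entry["emails"].update(item.get("emails", []))
        entry["phones"].update(item.get("phone_numbers", []))
    return contacts
```

Running this over a downloaded dataset merges duplicate contacts found on different pages of the same domain.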