OmniStruct Cost-Optimized Scraper
Need clean, structured data from a website without burning through your Apify compute credits? OmniStruct Cost-Optimized Scraper is built from the ground up to be the most efficient, budget-friendly universal scraper on the platform.
Pricing: Pay per usage
Developer: velurix
Last modified: 13 days ago
# Universal Web Scraper
A high-performance, cost-optimized Apify Actor designed to scrape massive lists of URLs (40,000+) efficiently. It employs a two-phase crawling strategy: a fast, cheap HTTP crawl using BeautifulSoup, followed by an automatic browser fallback (Playwright) only for pages that require JavaScript rendering.
## Features
- Cost-Optimized Two-Phase Crawling: 90% of sites are scraped using cheap HTTP requests. Only sites that strictly require it (e.g., SPAs, aggressive anti-bot pages) trigger a headless browser.
- Fail-Fast Logic: Instantly drops dead domains and 4XX/5XX errors to free up concurrency slots and prevent infinite proxy retry loops.
- Pre-Flight Validation: The included `run_and_monitor.py` script validates DNS and basic HTTP connectivity before spending Apify compute units.
- Intelligent Extraction: Uses Mozilla's Readability.js logic to extract the main article content, alongside robust email and phone number regex extraction, and standard meta tags.
- Batch Processing: Safely splits large URL lists into manageable chunks to avoid Apify run limits.
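The extraction and batching bullets above can be sketched in a few lines. The regex and the 1,000-URL chunk size below are illustrative, not the Actor's exact values:

```python
import re

# Illustrative pattern -- the Actor's real extraction regexes are likely stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return unique email-like strings in first-seen order."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


def chunked(urls: list[str], size: int = 1000):
    """Split a large URL list into Actor-run-sized batches."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]


print(extract_emails("Contact contact@example.com or sales@example.com."))
# ['contact@example.com', 'sales@example.com']
big_list = [f"https://example.com/{i}" for i in range(2500)]
print(len(list(chunked(big_list))))  # 3 batches
```

Deduplicating via `dict.fromkeys` keeps the first-seen ordering, which makes dataset output deterministic across runs.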
## How to Run
### Option 1: Using the Apify UI
- Go to the run console of the Actor.
- Provide your list of `urls` in the input JSON format.
- Configure settings:
  - For bulk lists, set Max crawl depth (`max_crawl_depth`) to `0`.
  - Set Max request retries (`max_request_retries`) to `0` or `1`.
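For bulk runs, the input JSON would look along these lines (field names are those from the Configuration Reference below; the URLs are placeholders):

```json
{
  "urls": ["https://example.com", "https://example.org"],
  "max_crawl_depth": 0,
  "max_request_retries": 0,
  "enable_browser_fallback": true
}
```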
### Option 2: Using the Orchestration Script (Recommended for Bulk)

For lists > 1,000 URLs, use the provided `run_and_monitor.py` script locally to orchestrate the API calls.
- Ensure your Apify API Token is set in `run_and_monitor.py`.
- Place your URLs in a CSV file (e.g., `Untitled spreadsheet - Sheet1 (2).csv`, or update the script path).
- Run the script:
```bash
# Runs with pre-validation (DNS + HEAD check) to save costs
python run_and_monitor.py

# Or, to skip validation and send directly to Apify:
python run_and_monitor.py --no-validate
```
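The DNS half of that pre-flight check can be approximated with the standard library alone. This is a sketch, not the script's actual implementation; the HEAD half would additionally issue a lightweight HTTP request per surviving host:

```python
import socket


def dns_resolves(host: str) -> bool:
    """Cheap pre-flight: skip URLs whose hostname doesn't resolve,
    so no Apify compute units are spent on dead domains."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


hosts = ["localhost", "no-such-host.invalid"]
alive = [h for h in hosts if dns_resolves(h)]
print(alive)  # ['localhost'] on a typical machine
```

Filtering locally like this is why the script is recommended for bulk lists: every dead domain dropped here is a request (and its retries) that never reaches the platform.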
## Configuration Reference
| Field | Default | Description |
|---|---|---|
| `urls` | `[]` | List of starting URLs. |
| `max_crawl_depth` | `0` | Recommended exactly `0` for lists of URLs to prevent spidering the whole site. |
| `max_request_retries` | `1` | Retries per failed request. `0` recommended for bulk runs. |
| `http_timeout_secs` | `10` | Wait limit (seconds) for standard pages. |
| `browser_timeout_secs` | `15` | Wait limit (seconds) for browser-rendered pages. |
| `enable_browser_fallback` | `true` | Allows Playwright to step in when BeautifulSoup fails to extract content or detects an SPA. |
## Output Format
The Actor pushes items to the Apify dataset in the following format:
```json
{
  "url": "https://example.com",
  "domain": "example.com",
  "status": "success",
  "title": "Example Domain",
  "description": "This domain is for use in illustrative examples.",
  "content_text": "Example Domain This domain is for use in illustrative examples...",
  "content_length_chars": 62,
  "emails": ["contact@example.com"],
  "phone_numbers": ["1-800-555-1234"],
  "used_browser_fallback": false,
  "timestamp": "2026-02-21T00:00:00.000000"
}
```
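Once a run finishes, items in this shape can be post-processed locally, e.g. to collect every email found on successfully scraped pages. A sketch over hard-coded sample records — in practice they would come from the dataset export (a downloaded JSON file or the Apify API):

```python
# Sample records in the shape documented above; normally loaded from the
# dataset export, e.g. json.load(open("dataset.json")).
items = [
    {"url": "https://example.com", "status": "success",
     "emails": ["contact@example.com"], "used_browser_fallback": False},
    {"url": "https://dead.example", "status": "failed", "emails": []},
]

# Keep only successful pages and flatten their email lists.
emails = sorted({e for item in items if item["status"] == "success"
                 for e in item.get("emails", [])})
print(emails)  # ['contact@example.com']
```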


