Scrape and Translate
Turn any website into a structured API in any language. Extract clean data, generate leads, and monitor competitors—automatically. Scrape, structure, and translate website data for applications such as competitor monitoring, AI training, API-as-a-service, and treating the internet as a database.
🧠 AI Website Scraper: The "Universal API" for the Web
Turn any website into a structured, programmatically accessible API. No selectors. No breakage. Just data.
⚡️ What is this?
Apify AI Website Scraper is an intelligent extraction engine that reads websites like a human but outputs clean JSON like a machine.
Traditional scrapers break when a website updates its CSS or layout. This Actor does not: it uses vision-capable LLMs to understand the meaning of the page, so extraction keeps working even on the most complex, dynamic, or legacy websites.
If you can see it in a browser, this Actor can turn it into an API.
🔥 Why Use This Actor?
- Zero-Maintenance Reliability: Stop fixing broken selectors. The AI adapts to layout changes instantly.
- Universal "All-Web" Compatibility: Works on SPAs (React/Next.js), old governmental portals, and everything in between.
- Global Translation Power: Extract Japanese content and get it in English (or Spanish, or German) using the integrated Lingo.dev engine.
- Structured JSON: It doesn't just "dump text". It formats data into strict JSON structures (arrays, objects) ready for your DB.
- Bypass Anti-Scraping: Uses a real headless browser with human-like interactions to glide past basic bot defenses.
- Schema Enforcement: Need strict data types? Provide a JSON Schema, and the AI will guarantee the output matches it.
- Cost-Optimized: Bring your own keys to pay $0 for the AI compute, or use ours for a simple all-in-one fee.
🚀 Top Use Cases
Click the link below for detailed user stories.
- 🤖 AI Training Datasets: Create multi-lingual parallel corpora for LLM training from any source.
- 🧠 RAG Pipelines: Feed real-time web data (news, docs) into your AI agents to prevent hallucinations.
- 🛍️ E-Commerce Aggregator: Track prices across 50+ stores without maintaining 50+ scripts.
- 📰 Global News Sentinel: Monitor & translate international news for sentiment analysis.
- 🏥 Specialized Domain Data: Gather niche legal/medical data to fine-tune vertical-specific models.
- 🏘️ Real Estate Discovery: Aggregate listings from niche, unstructured agency sites.
- 🎯 B2B Lead Gen: Extract structured contacts (email, role, LinkedIn) from directories.
👉 ./use-cases.md
⚙️ Configuration & Inputs
Configure your extraction via simple parameters.
⚙️ Configuration Format
| Parameter | Type | Required | Description |
|---|---|---|---|
| `urls` | array | ✅ Yes | The list of webpage URLs you want to extract data from. |
| `prompt` | string | ✅ Yes | A plain-English description of what you want to extract (e.g., "Extract product price, rating, and all reviews"). |
| `enhance_prompt` | boolean | ❌ No | Only valid if `user_schema` is provided. If enabled, the AI rewrites your prompt into a precise instruction based on the schema to ensure higher accuracy. If disabled, it relies solely on the schema structure. Defaults to `false`. |
| `user_schema` | object | ❌ No | A strict JSON Schema object. If provided, the output will faithfully adhere to this structure. If left empty, the AI determines the optimal structure based on your prompt. |
| `translate_to` | array | ❌ No | A list of ISO language codes (e.g., `["es", "de"]`) to translate the results into. You can find all supported languages in the Lingo.dev platform. |
| `proxyConfiguration` | object | ❌ No | Proxy settings. Defaults to Apify Proxy (recommended) to avoid blocks. |
| `gemini_api_key` | string | ❌ No | Your own Gemini API key. Providing it waives the entire AI usage fee ($0.035 per page), reducing costs by over 75% per run. |
| `lingo_api_key` | string | ❌ No | Your Lingo.dev API key. Required only if you use the `translate_to` feature. |
💡 Complete Example
This example shows a fully configured run with all options enabled.
{"urls": ["https://competitor.com/pricing","https://competitor.com/products/enterprise"],"prompt": "Extract all pricing tiers, feature lists, and hidden fees. Ensure currency is standardized.","enhance_prompt": true,"user_schema": {"type": "object","properties": {"tiers": {"type": "array","items": {"type": "object","properties": {"name": { "type": "string" },"price": { "type": "number" },"features": { "type": "array", "items": { "type": "string" } }}}},"last_updated": { "type": "string" }}},"translate_to": ["fr", "de", "es"],"proxyConfiguration": {"useApifyProxy": true},"gemini_api_key": "AIzaSy...","lingo_api_key": "lng_..."}
💎 Pro Tip: Want a specific structure but don't have a complete schema? Skip the `user_schema` field. Instead, include a partial JSON example or a list of key fields directly in your `prompt` and tell the AI: "Extract data following this format, but expand it with any other relevant details you find." This gives you the best of both worlds: structure and discovery.
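If you prefer to launch runs programmatically, the same input maps directly onto a call through the Apify API client. Here is a minimal Python sketch (the token and Actor ID are placeholders for your own values):

```python
from apify_client import ApifyClient

# Placeholder credentials; substitute your own token and this Actor's ID.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "urls": ["https://competitor.com/pricing"],
    "prompt": "Extract all pricing tiers, feature lists, and hidden fees.",
    "translate_to": ["es"],
    "proxyConfiguration": {"useApifyProxy": True},
}

# Start the Actor and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input=run_input)
print("Key-Value Store:", run["defaultKeyValueStoreId"])
print("Dataset:", run["defaultDatasetId"])
```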
📦 Output & Storage
Data is stored in two locations: the Default Dataset (summary) and the Key-Value Store (full detailed artifacts).
1. The Key-Value Store (The "Single Source of Truth")
For robust applications, always use the artifacts (`{hash}.json` files) in the Key-Value Store. This is where the complete, clean data lives.
We generate two types of files here:
A. The Master Index: `data_key.json`
This file maps every processed URL (from the input) to its unique result filename. Use it to look up the result file for any specific URL programmatically.
Structure:
{"https://example.com/pricing": "e16b2ab8d1.json","https://example.com/about": "a9481dcc12.json"}
B. The Result Artifact: `{hash}.json`
This is the most important file. Every processed URL has exactly ONE corresponding JSON file (e.g., `e16b2ab8d1.json`). It contains the timestamp, metadata, and the full extracted data payload.
Sample Output (with Translation Enabled):
{"url": "https://example.com/product/123","extracted_at": "2025-01-01T12:00:00Z","is_translated": true,"data": {"en": {"product_name": "Ultra Widget","price": 99.99,"stock": "In Stock"},"es": {"product_name": "Ultra Artilugio","price": 99.99,"stock": "En Stock"},"metadata": {"translated_content": true,"translated_to_locales": ["es"]}}}
💎 Pro Tip: When building an API or database integration, read `data_key.json` first to get the filename, then fetch that specific `{hash}.json` file for the data. This ensures you always get the complete, verified result.
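A minimal Python sketch of that lookup pattern, assuming the record keys match the file names described above (the token and store ID are placeholders):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
# The store ID comes from the run object, e.g. run["defaultKeyValueStoreId"].
store = client.key_value_store("<KEY_VALUE_STORE_ID>")

# 1. Read the master index mapping each input URL to its result filename.
index = store.get_record("data_key.json")["value"]

# 2. Fetch the full result artifact for one specific URL.
filename = index["https://example.com/pricing"]
result = store.get_record(filename)["value"]
print(result["data"])
```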
2. The Dataset (Summary)
This provides a quick overview of the run. It is useful for checking status and high-level metrics.
| Field | Type | Description |
|---|---|---|
| `url` | string | The source page address that was processed. |
| `status` | string | The final outcome of the operation (e.g., `success`, `error`). |
| `phase_1` | string | Summary: details about the page analysis and schema-generation performance. |
| `phase_2` | string | Summary: details about the extraction performance (time taken, strategy used). |
| `translation_status` | string | Summary: status of the translation process and target languages (if active). |
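For example, a quick status check over the summary rows with the Python client might look like this (the dataset ID is a placeholder taken from the run object):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# One summary item per processed URL; fields as in the table above.
items = client.dataset("<DATASET_ID>").list_items().items
failed = [item["url"] for item in items if item.get("status") != "success"]
print(f"{len(items) - len(failed)} of {len(items)} URLs succeeded")
if failed:
    print("Failed URLs:", failed)
```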
💰 Flexible Pricing
We believe in fair, transparent pricing. You only pay for the resources you consume, with no hidden monthly fees. Bring your own API keys to unlock the lowest possible rates.
| Component | Cost | Notes |
|---|---|---|
| Actor Start | $0.010 | One-time fee per run startup. |
| AI Usage | $0.035 | Per page processed. FREE if you bring your own gemini_api_key. |
| Compute | Usage-based | Standard Apify platform rates for RAM/CPU. |
❓ Frequently Asked Questions
1. "Does it break every week like traditional scrapers which rely on CSS Selectors? How is this reliable?"
Reliability is our #1 feature. Traditional scrapers rely on "CSS Selectors" (e.g., div > class="price"), which break whenever a developer updates the website.
This Actor uses Vision AI. It "looks" at the page like a human. If a price moves from the left to the right, the AI still sees it as a price. It is statistically 99.9% more robust against layout changes.
2. "Can I really turn any website into an API?"
Yes. This is the "Universal API" concept.
- Scenario: You want to build an app that shows local event listings, but the city website has no API.
- Solution: Point this Actor at the events page. It extracts the data into clean JSON. Your app consumes that JSON.
- Result: You just created a stable API for a legacy website in 30 seconds.
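As a sketch of that pattern, here is a hypothetical Flask endpoint that wraps the Actor and serves the extracted JSON as its own stable API (the route, token, Actor ID, and events URL are all illustrative assumptions):

```python
from apify_client import ApifyClient
from flask import Flask, jsonify

app = Flask(__name__)
client = ApifyClient("<YOUR_APIFY_TOKEN>")
EVENTS_URL = "https://city.example.gov/events"  # hypothetical legacy page

@app.route("/events")
def events():
    # Run the Actor against a site that has no API of its own.
    run = client.actor("<ACTOR_ID>").call(run_input={
        "urls": [EVENTS_URL],
        "prompt": "Extract event name, date, venue, and ticket link.",
    })
    # Look up the full result artifact via the master index (see Output & Storage).
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    filename = store.get_record("data_key.json")["value"][EVENTS_URL]
    return jsonify(store.get_record(filename)["value"]["data"])
```

In practice you would cache results or trigger runs on a schedule rather than once per request, but the shape of the wrapper stays the same.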
3. "Does it work on complex apps (React, Next.js, Infinite Scroll)?"
Absolutely. We run a full Headless Chrome browser.
- It executes JavaScript.
- It waits for network activity to settle.
- It renders the final visual state of the page (SPAs) before extracting.
If you can see the data in your standard Chrome browser, this Actor can grab it.
4. "I need data for an LLM training set. Can it handle bulk?"
Yes. This system is designed for scale. You can queue over 10,000 URLs.
- Cost Efficiency: Use your own `gemini_api_key` to waive the markup.
- Cleanliness: The AI output is normalized, with no HTML tags or stray whitespace to clean. It's ready for vectorization immediately.
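A sketch of a bulk ingestion loop that splits a large URL list across multiple runs (the chunk size and Actor ID are assumptions, not platform limits):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
urls = [f"https://example.com/articles/{i}" for i in range(10_000)]

CHUNK = 500  # assumed batch size; tune to your budget and run-time limits
for start in range(0, len(urls), CHUNK):
    client.actor("<ACTOR_ID>").call(run_input={
        "urls": urls[start:start + CHUNK],
        "prompt": "Extract the article title, body text, and publication date.",
        "gemini_api_key": "<YOUR_GEMINI_KEY>",  # waives the AI usage fee
    })
```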
5. "How accurate is the Lingo.dev translation?"
It is context-aware and "Web-Native". Unlike standard Google Translate, Lingo.dev is trained specifically on web content (UI elements, marketing copy, technical specs).
- It understands that "Home" in a navbar is definitely "Inicio" (Spanish), not "Casa" (House).
- It preserves the structure of your JSON, only translating the values.
6. "Can I enforce a specific database schema?"
Yes. This is crucial for Enterprise use cases.
If you provide a user_schema (e.g., forcing price to be a number and date to be ISO 8601), the AI will self-correct its output to match your rules. If it fails to match your strict schema, it will report an error rather than giving you bad data.
7. "What about anti-scraping and bot protection?"
By default, the Actor uses Apify Proxy and browser fingerprinting to appear as a legitimate user, and it simulates human-like browsing behavior. For 95% of the web, this is sufficient to bypass blocks.
🔌 Integrations
- Make.com / Zapier: Trigger workflows when a run finishes.
- Google Drive / Sheets: Export clean JSON directly to spreadsheets.
- LangChain / LLMs: Feed extracted text directly into your RAG pipelines.