Scrape and Translate
Turn any website into a structured API in any language. Extract clean data, generate leads, and monitor competitors—automatically. Scrape, structure, and translate website data for applications such as competitor monitoring, AI training, API-as-a-service, and treating the internet as a database.
🧠 AI Website Scraper: The "Universal API" for the Web
Turn any website into a structured, programmatically accessible API. No selectors. No breakage. Just data.
⚡️ What is this?
Apify AI Website Scraper is an intelligent extraction engine that reads websites like a human but outputs clean JSON like a machine.
Traditional scrapers break when a website updates its CSS or layout. This Actor does not: it uses vision-capable LLMs to understand the meaning of the page, so extraction keeps working even on the most complex, dynamic, or legacy websites.
If you can see it in a browser, this Actor can turn it into an API.
🔥 Why Use This Actor?
- Zero-Maintenance Reliability: Stop fixing broken selectors. The AI adapts to layout changes instantly.
- Universal "All-Web" Compatibility: Works on SPAs (React/Next.js), old governmental portals, and everything in between.
- Global Translation Power: Extract Japanese content and get it in English (or Spanish, or German) using the integrated Lingo.dev engine.
- Structured JSON: It doesn't just "dump text". It formats data into strict JSON structures (arrays, objects) ready for your DB.
- Bypass Anti-Scraping: Uses a real headless browser with human-like interactions to glide past basic bot defenses.
- Schema Enforcement: Need strict data types? Provide a JSON Schema, and the AI will guarantee the output matches it.
- Cost-Optimized: Bring your own keys to pay $0 for the AI compute, or use ours for a simple all-in-one fee.
🚀 Top Use Cases
Click the link below for detailed user stories.
- 🤖 AI Training Datasets: Create multi-lingual parallel corpora for LLM training from any source.
- 🧠 RAG Pipelines: Feed real-time web data (news, docs) into your AI agents to prevent hallucinations.
- 🛍️ E-Commerce Aggregator: Track prices across 50+ stores without maintaining 50+ scripts.
- 📰 Global News Sentinel: Monitor & translate international news for sentiment analysis.
- 🏥 Specialized Domain Data: Gather niche legal/medical data to fine-tune vertical-specific models.
- 🏘️ Real Estate Discovery: Aggregate listings from niche, unstructured agency sites.
- 🎯 B2B Lead Gen: Extract structured contacts (email, role, LinkedIn) from directories.
👉 ./use-cases.md
⚙️ Configuration & Inputs
Configure your extraction via simple parameters.
⚙️ Configuration Format
| Parameter | Type | Required | Description |
|---|---|---|---|
| `urls` | array | ✅ Yes | The list of webpage URLs you want to extract data from. |
| `prompt` | string | ✅ Yes | A plain-English description of what you want to extract (e.g., "Extract product price, rating, and all reviews"). |
| `enhance_prompt` | boolean | ❌ No | Only valid if `user_schema` is provided. If enabled, the AI rewrites your prompt into a precise instruction based on the schema to ensure higher accuracy. If disabled, it relies solely on the schema structure. Defaults to `false`. |
| `user_schema` | object | ❌ No | A strict JSON Schema object. If provided, the output will faithfully adhere to this structure. If left empty, the AI determines the optimal structure based on your prompt. |
| `translate_to` | array | ❌ No | A list of ISO language codes (e.g., `["es", "de"]`) to translate the results into. You can find all supported languages in the Lingo.dev platform. |
| `proxyConfiguration` | object | ❌ No | Proxy settings. Defaults to Apify Proxy (recommended) to avoid blocks. |
| `gemini_api_key` | string | ❌ No | Your own Gemini API key. Providing it waives the entire AI usage fee ($0.035 per page), reducing costs by over 75% per run. |
| `lingo_api_key` | string | ❌ No | Your Lingo.dev API key. Required only if you use the `translate_to` feature. |
💡 Complete Example
This example shows a fully configured run with all options enabled.
{"urls": ["https://competitor.com/pricing","https://competitor.com/products/enterprise"],"prompt": "Extract all pricing tiers, feature lists, and hidden fees. Ensure currency is standardized.","enhance_prompt": true,"user_schema": {"type": "object","properties": {"tiers": {"type": "array","items": {"type": "object","properties": {"name": { "type": "string" },"price": { "type": "number" },"features": { "type": "array", "items": { "type": "string" } }}}},"last_updated": { "type": "string" }}},"translate_to": ["fr", "de", "es"],"proxyConfiguration": {"useApifyProxy": true},"gemini_api_key": "AIzaSy...","lingo_api_key": "lng_..."}
💎 Pro Tip: Want a specific structure but don't have a complete schema? Skip the `user_schema` field. Instead, include a partial JSON example or a list of key fields directly in your `prompt` and tell the AI: "Extract data following this format, but expand it with any other relevant details you find." This gives you the best of both worlds: structure and discovery.
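If you prefer to launch runs programmatically, the same input maps directly onto a call through the Apify API client. Here is a minimal Python sketch (the token and Actor ID are placeholders for your own values):

```python
from apify_client import ApifyClient

# Placeholder credentials; substitute your own token and this Actor's ID.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "urls": ["https://competitor.com/pricing"],
    "prompt": "Extract all pricing tiers, feature lists, and hidden fees.",
    "translate_to": ["es"],
    "proxyConfiguration": {"useApifyProxy": True},
}

# Start the Actor and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input=run_input)
print("Key-Value Store:", run["defaultKeyValueStoreId"])
print("Dataset:", run["defaultDatasetId"])
```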
📦 Output & Storage
Data is stored in two locations: the Default Dataset (summary) and the Key-Value Store (full detailed artifacts).
1. The Key-Value Store (The "Single Source of Truth")
For robust applications, always use the artifacts (`{hash}.json` files) in the Key-Value Store. This is where the complete, clean data lives.
We generate two types of files here:
A. The Master Index: `data_key.json`
This file maps every processed URL (from the input) to its unique result filename. Use it to look up the result file for any specific URL programmatically.
Structure:
{"https://example.com/pricing": "e16b2ab8d1.json","https://example.com/about": "a9481dcc12.json"}
B. The Result Artifact: `{hash}.json`
This is the most important file. Every processed URL has exactly ONE corresponding JSON file (e.g., `e16b2ab8d1.json`). It contains the timestamp, metadata, and the full extracted data payload.
Sample Output (with Translation Enabled):
{"url": "https://example.com/product/123","extracted_at": "2025-01-01T12:00:00Z","is_translated": true,"data": {"en": {"product_name": "Ultra Widget","price": 99.99,"stock": "In Stock"},"es": {"product_name": "Ultra Artilugio","price": 99.99,"stock": "En Stock"},"metadata": {"translated_content": true,"translated_to_locales": ["es"]}}}
💎 Pro Tip: When building an API or database integration, read `data_key.json` first to get the filename, then fetch that specific `{hash}.json` file for the data. This ensures you always get the complete, verified result.
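A minimal Python sketch of that lookup pattern, assuming the record keys match the file names described above (the token and store ID are placeholders):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
# The store ID comes from the run object, e.g. run["defaultKeyValueStoreId"].
store = client.key_value_store("<KEY_VALUE_STORE_ID>")

# 1. Read the master index mapping each input URL to its result filename.
index = store.get_record("data_key.json")["value"]

# 2. Fetch the full result artifact for one specific URL.
filename = index["https://example.com/pricing"]
result = store.get_record(filename)["value"]
print(result["data"])
```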
2. The Dataset (Summary)
This provides a quick overview of the run. It is useful for checking status and high-level metrics.
| Field | Type | Description |
|---|---|---|
| `url` | string | The source page address that was processed. |
| `status` | string | The final outcome of the operation (e.g., `success`, `error`). |
| `phase_1` | string | Summary: details about the page analysis and schema-generation performance. |
| `phase_2` | string | Summary: details about the extraction performance (time taken, strategy used). |
| `translation_status` | string | Summary: status of the translation process and target languages (if active). |
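For example, a quick status check over the summary rows with the Python client might look like this (the dataset ID is a placeholder taken from the run object):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# One summary item per processed URL; fields as in the table above.
items = client.dataset("<DATASET_ID>").list_items().items
failed = [item["url"] for item in items if item.get("status") != "success"]
print(f"{len(items) - len(failed)} of {len(items)} URLs succeeded")
if failed:
    print("Failed URLs:", failed)
```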
💰 Flexible Pricing
We believe in fair, transparent pricing. You only pay for the resources you consume, with no hidden monthly fees. Bring your own API keys to unlock the lowest possible rates.
| Component | Cost | Notes |
|---|---|---|
| Actor Start | $0.010 | One-time fee per run startup. |
| AI Usage | $0.035 | Per page processed. FREE if you bring your own gemini_api_key. |
| Compute | Usage-based | Standard Apify platform rates for RAM/CPU. |
❓ Frequently Asked Questions
1. "Does it break every week like traditional scrapers which rely on CSS Selectors? How is this reliable?"
Reliability is our #1 feature. Traditional scrapers rely on "CSS Selectors" (e.g., div > class="price"), which break whenever a developer updates the website.
This Actor uses Vision AI. It "looks" at the page like a human. If a price moves from the left to the right, the AI still sees it as a price. It is statistically 99.9% more robust against layout changes.
2. "Can I really turn any website into an API?"
Yes. This is the "Universal API" concept.
- Scenario: You want to build an app that shows local event listings, but the city website has no API.
- Solution: Point this Actor at the events page. It extracts the data into clean JSON. Your app consumes that JSON.
- Result: You just created a stable API for a legacy website in 30 seconds.
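As a sketch of that pattern, here is a hypothetical Flask endpoint that wraps the Actor and serves the extracted JSON as its own stable API (the route, token, Actor ID, and events URL are all illustrative assumptions):

```python
from apify_client import ApifyClient
from flask import Flask, jsonify

app = Flask(__name__)
client = ApifyClient("<YOUR_APIFY_TOKEN>")
EVENTS_URL = "https://city.example.gov/events"  # hypothetical legacy page

@app.route("/events")
def events():
    # Run the Actor against a site that has no API of its own.
    run = client.actor("<ACTOR_ID>").call(run_input={
        "urls": [EVENTS_URL],
        "prompt": "Extract event name, date, venue, and ticket link.",
    })
    # Look up the full result artifact via the master index (see Output & Storage).
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    filename = store.get_record("data_key.json")["value"][EVENTS_URL]
    return jsonify(store.get_record(filename)["value"]["data"])
```

In practice you would cache results or trigger runs on a schedule rather than once per request, but the shape of the wrapper stays the same.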
3. "Does it work on complex apps (React, Next.js, Infinite Scroll)?"
Absolutely. We run a full Headless Chrome browser.
- It executes JavaScript.
- It waits for network activity to settle.
- It renders the final visual state of the page (SPAs) before extracting.
If you can see the data in your standard Chrome browser, this Actor can grab it.
4. "I need data for an LLM training set. Can it handle bulk?"
Yes. This system is designed for scale. You can queue over 10,000 URLs.
- Cost Efficiency: Use your own `gemini_api_key` to waive the markup.
- Cleanliness: The AI output is normalized, with no HTML tags or stray whitespace to clean. It's ready for vectorization immediately.
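A sketch of a bulk ingestion loop that splits a large URL list across multiple runs (the chunk size and Actor ID are assumptions, not platform limits):

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
urls = [f"https://example.com/articles/{i}" for i in range(10_000)]

CHUNK = 500  # assumed batch size; tune to your budget and run-time limits
for start in range(0, len(urls), CHUNK):
    client.actor("<ACTOR_ID>").call(run_input={
        "urls": urls[start:start + CHUNK],
        "prompt": "Extract the article title, body text, and publication date.",
        "gemini_api_key": "<YOUR_GEMINI_KEY>",  # waives the AI usage fee
    })
```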
5. "How accurate is the Lingo.dev translation?"
It is context-aware and "Web-Native". Unlike standard Google Translate, Lingo.dev is trained specifically on web content (UI elements, marketing copy, technical specs).
- It understands that "Home" in a navbar is definitely "Inicio" (Spanish), not "Casa" (House).
- It preserves the structure of your JSON, only translating the values.
6. "Can I enforce a specific database schema?"
Yes. This is crucial for Enterprise use cases.
If you provide a user_schema (e.g., forcing price to be a number and date to be ISO 8601), the AI will self-correct its output to match your rules. If it fails to match your strict schema, it will report an error rather than giving you bad data.
7. "What about anti-scraping and bot protection?"
By default, the Actor uses Apify Proxy and browser fingerprinting to appear as a legitimate user, and it simulates human-like browsing behavior. For 95% of the web, this is sufficient to bypass blocks.
🔌 Integrations
- Make.com / Zapier: Trigger workflows when a run finishes.
- Google Drive / Sheets: Export clean JSON directly to spreadsheets.
- LangChain / LLMs: Feed extracted text directly into your RAG pipelines.