AI Website Content Crawler

A super fast website crawler for AI training

Pricing

from $0.50 / 1,000 results

Rating

0.0 (0 reviews)

Developer

Fabio Borsotti (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 days ago


This Apify actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset.

The actor is optimized for plain HTTP fetching with concurrent requests, automatic URL normalization, and language preference headers based on the langs input field.

What the actor does

For each input URL, the actor normalizes the address, adding https:// automatically when the scheme is missing, so values like example.com are still processed correctly.
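A minimal sketch of this normalization and deduplication step, assuming the actor does something along these lines (the function names normalize_url and dedupe are illustrative, not taken from the actor's source):

```python
def normalize_url(raw: str) -> str:
    """Prepend https:// when the URL has no scheme (illustrative sketch)."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    return raw

def dedupe(urls):
    """Deduplicate normalized URLs while preserving input order."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

With this, both example.com and https://example.com in the input collapse into a single request.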

It then performs HTTP requests in parallel using asynchronous execution controlled by maxConcurrency, which helps speed up large batches of URLs.
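The concurrency cap can be pictured as a semaphore around each request, as in this sketch with asyncio; the real actor's HTTP client is not shown here, so fetch_one below is a placeholder rather than the actual implementation:

```python
import asyncio

async def fetch_all(urls, max_concurrency=20):
    """Run one task per URL, allowing at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with sem:
            # Placeholder for the real HTTP request (e.g. via httpx or aiohttp).
            await asyncio.sleep(0)
            return {"url": url, "success": True}

    return await asyncio.gather(*(fetch_one(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com", "https://news.ycombinator.com"]))
```

Results come back in input order because asyncio.gather preserves the order of its arguments, even though the requests themselves overlap.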

After downloading the page, the actor parses the HTML with BeautifulSoup, removes non-content elements such as script, style, noscript, header, footer, svg, img, meta, and link, and converts the remaining content into clean plain text.

The actor also reads the lang attribute from the HTML document when available and includes it in the output as htmlLang.
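The parsing and cleaning steps can be sketched with BeautifulSoup as follows; the tag list matches the one described above, while the helper name extract_text and the exact get_text arguments are assumptions:

```python
from bs4 import BeautifulSoup

NON_CONTENT = ["script", "style", "noscript", "header", "footer", "svg", "img", "meta", "link"]

def extract_text(html: str):
    """Return (html_lang, clean_text) from a raw HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    html_tag = soup.find("html")
    html_lang = html_tag.get("lang") if html_tag else None
    # Remove non-content elements before extracting text.
    for tag in soup(NON_CONTENT):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return html_lang, text

lang, text = extract_text('<html lang="en"><head><script>x()</script></head><body><p>Hello</p></body></html>')
```

Here lang comes back as "en" and text contains only the visible paragraph, with the script stripped out.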

Features

  • Accepts a list of page URLs.
  • Automatically adds https:// when missing from input URLs.
  • Deduplicates normalized URLs before processing.
  • Sends preferred language headers using the langs array.
  • Extracts cleaned plain text from HTML pages.
  • Detects the page language from the HTML lang attribute when present.
  • Processes requests concurrently using maxConcurrency.
  • Logs progress for each processed URL and prints a final success/error summary.

Input

```json
{
  "url": [
    "example.com",
    "https://news.ycombinator.com"
  ],
  "maxConcurrency": 20,
  "langs": ["it", "en"]
}
```

Input fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | array of strings | Yes | List of page URLs to download and convert into clean plain text. |
| maxConcurrency | integer | No | Maximum number of parallel HTTP requests. |
| langs | array of strings | No | Preferred languages used to build the Accept-Language header for HTTP requests. |

How language handling works

The actor converts the langs array into a standard Accept-Language header ordered by priority, so langs: ["it", "en", "fr"] becomes a header similar to it,en;q=0.9,fr;q=0.8.
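A sketch of how such a header can be built, assuming a quality value that starts at 1.0 and drops by 0.1 per position (this matches the example above, but the actor's exact weighting scheme is an assumption):

```python
def build_accept_language(langs):
    """First language gets full weight; each later one drops by 0.1 (floor 0.1)."""
    parts = []
    for i, lang in enumerate(langs):
        if i == 0:
            parts.append(lang)
        else:
            q = round(max(1.0 - 0.1 * i, 0.1), 1)
            parts.append(f"{lang};q={q:g}")
    return ",".join(parts)
```

For example, build_accept_language(["it", "en", "fr"]) yields "it,en;q=0.9,fr;q=0.8".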

This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.

Output

Each dataset item contains the processing result for a single URL.

Successful result

```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain\nThis domain is for use in illustrative examples in documents...",
  "error": null
}
```

Error result

```json
{
  "url": "https://example.com/missing-page",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "404 Client Error or another request exception"
}
```

Notes and limitations

  • The actor uses plain HTTP requests and does not render JavaScript in a browser, so text that appears only after client-side rendering may be missing.
  • The extracted text depends on the HTML returned by the target website and on the site's response to the Accept-Language header.
  • The actor is intended for HTML pages; non-HTML resources may fail or return unusable output depending on server behavior.