AI Website Content Crawler

Pricing

from $0.50 / 1,000 results

A super fast website crawler for AI training

Rating

5.0 (1)

Developer

Fabio Borsotti

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 4
Monthly active users: 2
Last modified: 17 days ago

This Apify actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset. The actor uses Playwright to open and analyze pages in a browser, with automatic HTTP fallback if browser rendering fails.

What it does

  • Accepts a list of page URLs.
  • Automatically adds https:// when missing from input URLs.
  • Deduplicates normalized URLs before processing.
  • Uses Playwright to open pages in a browser and extract the main text content.
  • Falls back to HTTP requests if Playwright fails.
  • Sends an Accept-Language header built from the langs array.
  • Extracts cleaned plain text from HTML pages.
  • Detects the page language from the html lang attribute when available.
  • Processes multiple URLs in parallel using maxConcurrency.
  • Logs progress for each processed URL and prints a final summary.
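
The text-extraction and language-detection steps in the list above can be illustrated with a small sketch. The actor's own extraction logic is not documented here, so this example swaps in Python's standard html.parser module; the class name TextAndLangParser and the helper extract are hypothetical, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class TextAndLangParser(HTMLParser):
    """Collects visible text and the <html lang="..."> value (illustrative only)."""

    # Tags whose content is not visible text; this set is an assumption.
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.html_lang = dict(attrs).get("lang")
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore text inside skipped tags.
        words = data.split()
        if self._skip_depth == 0 and words:
            self._chunks.append(" ".join(words))

    @property
    def text(self):
        return " ".join(self._chunks)


def extract(html):
    """Return (html_lang, plain_text) for an HTML string."""
    parser = TextAndLangParser()
    parser.feed(html)
    parser.close()
    return parser.html_lang, parser.text


sample = (
    '<html lang="en"><head><style>p{color:red}</style></head>'
    "<body><h1>Example Domain</h1><p>Hello   world.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract(sample))  # ('en', 'Example Domain Hello world.')
```

When the html element has no lang attribute, html_lang stays None, which matches the "when available" wording above.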

Input

Input fields

Field | Type | Required | Description
url | array of strings | Yes | List of page URLs to open and convert into clean plain text.
maxConcurrency | integer | No | Maximum number of pages (tabs) open simultaneously in the Playwright browser. This counts browser pages processed in parallel, not system threads. The value is capped at 50.
langs | array of strings | No | Preferred languages used to build the Accept-Language header.

Example input

{
    "url": ["example.com", "https://news.ycombinator.com"],
    "maxConcurrency": 5,
    "langs": ["it", "en"]
}
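
The example input mixes a bare host with a full URL. The sketch below shows one way the "add https:// when missing" and "deduplicate normalized URLs" steps could work. The function names normalize_url and dedupe_urls are hypothetical, and the actor's exact normalization rules may differ (for instance in how it treats trailing slashes or letter case).

```python
def normalize_url(raw):
    """Trim whitespace and prepend https:// when no scheme is present."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate
    return candidate


def dedupe_urls(raw_urls):
    """Normalize each URL and keep only the first occurrence, preserving order."""
    seen = set()
    result = []
    for raw in raw_urls:
        url = normalize_url(raw)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


print(dedupe_urls(["example.com", "https://example.com", "https://news.ycombinator.com"]))
# ['https://example.com', 'https://news.ycombinator.com']
```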

How concurrency works

The actor uses a shared Playwright browser and opens pages in parallel, up to the limit set by maxConcurrency. The limit counts browser pages open at the same time, not operating-system threads.
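
A common way to cap the number of pages in flight is an asyncio semaphore. The sketch below is an illustration of that pattern, not the actor's actual code: process_all and fake_page_visit are hypothetical names, and the page visit is simulated with a short sleep instead of a real Playwright page.

```python
import asyncio


async def process_all(urls, max_concurrency, worker):
    """Run worker(url) for every URL with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await worker(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def demo():
    active = 0
    peak = 0

    async def fake_page_visit(url):
        # Stand-in for opening a browser page; tracks how many run concurrently.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return {"url": url, "success": True}

    urls = [f"https://example.com/page-{i}" for i in range(20)]
    results = await process_all(urls, max_concurrency=5, worker=fake_page_visit)
    return len(results), peak


if __name__ == "__main__":
    # Prints the number of results and the highest observed concurrency.
    print(asyncio.run(demo()))
```

All tasks run on a single event loop, which is why the limit describes pages in flight rather than threads.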

How language handling works

The actor converts the langs array into a standard Accept-Language header ordered by priority. For example, ["it", "en", "fr"] becomes a header similar to it,en;q=0.9,fr;q=0.8.

This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.
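
The header construction described above can be sketched as follows. The step of 0.1 per position is inferred from the example in this section; the function name build_accept_language is hypothetical, and the floor of 0.1 for lists longer than ten languages is an assumption of this sketch.

```python
def build_accept_language(langs):
    """Turn an ordered list of language codes into an Accept-Language value."""
    parts = []
    for index, lang in enumerate(langs):
        if index == 0:
            # The first language carries the implicit default weight q=1.
            parts.append(lang)
        else:
            # Decrease the weight by 0.1 per position, never below 0.1 (assumed floor).
            tenths = max(1, 10 - index)
            parts.append(f"{lang};q=0.{tenths}")
    return ",".join(parts)


print(build_accept_language(["it", "en", "fr"]))  # it,en;q=0.9,fr;q=0.8
```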

Output

Each dataset item contains the processing result for a single URL.

Successful result example

{
    "url": "https://example.com",
    "success": true,
    "statusCode": 200,
    "langs": ["it", "en"],
    "acceptLanguage": "it,en;q=0.9",
    "htmlLang": "en",
    "text": "Example Domain ...",
    "error": null,
    "extractor": "playwright"
}

HTTP fallback result example

{
    "url": "https://example.com",
    "success": true,
    "statusCode": 200,
    "langs": ["it", "en"],
    "acceptLanguage": "it,en;q=0.9",
    "htmlLang": "en",
    "text": "Example Domain ...",
    "error": "Playwright failed: ...",
    "extractor": "httpx-fallback"
}
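
The two examples differ only in the error and extractor fields. The sketch below shows a try-then-fall-back pattern that would produce results of this shape. It is an illustration under assumptions, not the actor's code: extract_with_fallback, broken_browser, and working_http are hypothetical, the fetchers are plain callables rather than real Playwright or HTTP clients, and the shape of a result where both paths fail is not documented in this README.

```python
def extract_with_fallback(url, browser_fetch, http_fetch):
    """Try the browser path first, then plain HTTP. Both callables return (status, text)."""
    base = {"url": url}
    try:
        status, text = browser_fetch(url)
        return {**base, "success": True, "statusCode": status, "text": text,
                "error": None, "extractor": "playwright"}
    except Exception as exc:
        browser_error = f"Playwright failed: {exc}"
    try:
        status, text = http_fetch(url)
        # The browser error is kept so the consumer can see why the fallback ran.
        return {**base, "success": True, "statusCode": status, "text": text,
                "error": browser_error, "extractor": "httpx-fallback"}
    except Exception as exc:
        # Assumed shape for a total failure; not confirmed by the source.
        return {**base, "success": False, "statusCode": None, "text": None,
                "error": f"{browser_error}; HTTP fallback failed: {exc}",
                "extractor": None}


def broken_browser(url):
    raise TimeoutError("navigation timed out")


def working_http(url):
    return 200, "Example Domain ..."


result = extract_with_fallback("https://example.com", broken_browser, working_http)
print(result["extractor"], "|", result["error"])
# httpx-fallback | Playwright failed: navigation timed out
```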

Notes and limitations

  • The main extraction is performed with Playwright, so client-side rendered content can be captured more reliably than with a simple HTTP request.
  • maxConcurrency does not refer to real system threads, but to the maximum number of pages opened at the same time in the Playwright browser.
  • The extracted text depends on the HTML returned by the target website and on how the site responds to the Accept-Language header.
  • Some websites may still block browser automation or limit accessible content.
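
Because some sites block automation or can only be read through the HTTP fallback, a consumer of the dataset may want a quick breakdown of how each URL was handled. The sketch below groups items by outcome using the success and extractor fields shown in the output examples. The function name summarize and the sample items are hypothetical, and treating success as false for failed URLs is an assumption.

```python
from collections import Counter


def summarize(items):
    """Count dataset items by outcome: browser, fallback, or failed."""
    counts = Counter()
    for item in items:
        if not item.get("success"):
            counts["failed"] += 1
        elif item.get("extractor") == "httpx-fallback":
            counts["fallback"] += 1
        else:
            counts["browser"] += 1
    return dict(counts)


sample_items = [
    {"url": "https://example.com", "success": True, "extractor": "playwright"},
    {"url": "https://example.org", "success": True, "extractor": "httpx-fallback"},
    {"url": "https://example.net", "success": False, "extractor": None},
]
print(summarize(sample_items))  # {'browser': 1, 'fallback': 1, 'failed': 1}
```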

Client code example

NodeJS

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": [
        "https://www.ttalbuzzano.it/",
        "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
        "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026"
    ],
    "maxConcurrency": 20,
    "langs": [
        "it",
        "en"
    ]
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("FBgOh24p4TEhFodKx").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();

Python

from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": [
        "https://www.ttalbuzzano.it/",
        "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
        "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026",
    ],
    "maxConcurrency": 20,
    "langs": [
        "it",
        "en",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("FBgOh24p4TEhFodKx").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

# Set API token (quoted so the shell does not treat < and > as redirections)
API_TOKEN='<YOUR_API_TOKEN>'

# Prepare Actor input
cat > input.json <<'EOF'
{
    "url": [
        "https://www.ttalbuzzano.it/",
        "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
        "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026"
    ],
    "maxConcurrency": 20,
    "langs": [
        "it",
        "en"
    ]
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/FBgOh24p4TEhFodKx/runs?token=$API_TOKEN" \
    -X POST \
    -d @input.json \
    -H 'Content-Type: application/json'