AI Website Content Crawler
Pricing
from $0.50 / 1,000 results
Rating
5.0
(1)
Developer
Fabio Borsotti
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 4
Monthly active users: 2
Last modified: 17 days ago
This Apify actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset. The actor uses Playwright to open and analyze pages in a browser, with automatic HTTP fallback if browser rendering fails.
What it does
- Accepts a list of page URLs.
- Automatically adds `https://` when missing from input URLs.
- Deduplicates normalized URLs before processing.
- Uses Playwright to open pages in a browser and extract the main text content.
- Falls back to HTTP requests if Playwright fails.
- Sends preferred language headers using the `langs` array.
- Extracts cleaned plain text from HTML pages.
- Detects the page language from the `html lang` attribute when available.
- Processes multiple URLs in parallel using `maxConcurrency`.
- Logs progress for each processed URL and prints a final summary.
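The scheme-completion and deduplication steps listed above can be sketched as follows. This is a minimal illustration, not the actor's actual code: the function names `normalize_url` and `dedupe_urls` are hypothetical, and the exact normalization rules (whitespace trimming, skipping empty entries, case-sensitive comparison) are assumptions.

```python
def normalize_url(raw: str) -> str:
    """Trim whitespace and prepend https:// when no scheme is present."""
    url = raw.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "https://" + url
    return url


def dedupe_urls(raw_urls: list[str]) -> list[str]:
    """Normalize each URL and drop repeats, keeping first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for raw in raw_urls:
        if not raw or not raw.strip():
            continue  # assumption: empty entries are ignored
        url = normalize_url(raw)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


urls = dedupe_urls(["example.com", "https://example.com", "news.ycombinator.com"])
print(urls)
```

Deduplicating after normalization means that `example.com` and `https://example.com` count as the same page and produce a single dataset item.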
Input
Input fields
| Field | Type | Required | Description |
|---|---|---|---|
| `url` | array of strings | Yes | List of page URLs to open and convert into clean plain text. |
| `maxConcurrency` | integer | No | Maximum number of browser pages (tabs) processed in parallel in the Playwright browser. This is not the number of system threads. The value is capped at 50. |
| `langs` | array of strings | No | Preferred languages used to build the `Accept-Language` header. |
Example input
{
  "url": ["example.com", "https://news.ycombinator.com"],
  "maxConcurrency": 5,
  "langs": ["it", "en"]
}
How concurrency works
The actor uses a shared Playwright browser and opens multiple pages in parallel up to the limit defined by maxConcurrency. In practice, the maximum number of active "threads" corresponds to the maximum number of pages opened simultaneously in the Playwright browser.
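The bounded-parallelism idea described above can be illustrated with an `asyncio.Semaphore`. This is only a sketch under assumptions: the actor's real scheduling code is not shown in this document, `process_all` and `handler` are hypothetical names, and `handler` stands in for whatever opens and reads a page. The cap of 50 comes from the input table; the lower bound of 1 is an assumption.

```python
import asyncio

MAX_CONCURRENCY_CAP = 50  # cap stated in the input table


async def process_all(urls, handler, max_concurrency):
    """Run handler(url) for every URL with a bounded number of calls in flight."""
    limit = max(1, min(max_concurrency, MAX_CONCURRENCY_CAP))
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await handler(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


async def demo():
    """Track how many fake page handlers run at the same time."""
    active = 0
    peak = 0

    async def fake_handler(url):  # placeholder for real page processing
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return url

    urls = [f"https://example.com/{i}" for i in range(10)]
    results = await process_all(urls, fake_handler, max_concurrency=3)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), "results, peak concurrency", peak)
```

All work runs on a single event loop here, which matches the point that the limit counts simultaneously open pages rather than operating-system threads.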
How language handling works
The actor converts the langs array into a standard Accept-Language header ordered by priority. For example, ["it", "en", "fr"] becomes a header similar to it,en;q=0.9,fr;q=0.8.
This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.
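The conversion from `langs` to a header value can be sketched as below. The function name `build_accept_language` is hypothetical, and the weighting scheme (first language unweighted, each following one 0.1 lower, never below 0.1) is an assumption chosen to reproduce the example above; the actor may compute the weights differently.

```python
def build_accept_language(langs):
    """Build an Accept-Language value with descending q-weights."""
    parts = []
    for index, lang in enumerate(langs):
        if index == 0:
            parts.append(lang)  # the first language carries the implicit q=1.0
        else:
            tenths = max(1, 10 - index)  # 9, 8, 7, ... with an assumed floor of 1
            parts.append(f"{lang};q=0.{tenths}")
    return ",".join(parts)


header = build_accept_language(["it", "en", "fr"])
print(header)
```

The resulting string can be sent as the `Accept-Language` request header; whether a site honours it is up to that site.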
Output
Each dataset item contains the processing result for a single URL.
Successful result example
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": null,
  "extractor": "playwright"
}
HTTP fallback result example
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": "Playwright failed: ...",
  "extractor": "httpx-fallback"
}
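The two examples above differ only in the `error` and `extractor` fields. A primary-then-fallback flow that yields results of that shape might look like the sketch below. It is an illustration, not the actor's implementation: `extract_with_fallback`, `primary` and `fallback` are hypothetical names, only a subset of the output fields is produced, and the shape of the result when both attempts fail is an assumption because this document does not show a failed item.

```python
import asyncio


async def extract_with_fallback(url, primary, fallback):
    """Try primary(url); if it raises, retry with fallback(url)."""
    try:
        text = await primary(url)
        return {"url": url, "success": True, "text": text,
                "error": None, "extractor": "playwright"}
    except Exception as exc:
        primary_error = f"Playwright failed: {exc}"

    try:
        text = await fallback(url)
        # The browser error is kept so the caller can see why the fallback ran.
        return {"url": url, "success": True, "text": text,
                "error": primary_error, "extractor": "httpx-fallback"}
    except Exception as exc:
        # Assumed failure shape, not documented by the actor.
        return {"url": url, "success": False, "text": None,
                "error": f"{primary_error}; fallback failed: {exc}",
                "extractor": None}


async def failing_browser(url):  # stand-in for browser rendering
    raise RuntimeError("timeout")


async def plain_http(url):  # stand-in for a plain HTTP fetch
    return "Example Domain ..."


item = asyncio.run(
    extract_with_fallback("https://example.com", failing_browser, plain_http)
)
print(item["extractor"], item["error"])
```

When consuming the dataset, checking `extractor` alongside `success` distinguishes pages that were rendered in a browser from pages read over plain HTTP, which may lack client-side rendered content.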
Notes and limitations
- The main extraction is performed with Playwright, so client-side rendered content can be captured more reliably than with a simple HTTP request.
- `maxConcurrency` does not refer to real system threads, but to the maximum number of pages opened at the same time in the Playwright browser.
- The extracted text depends on the HTML returned by the target website and on how the site responds to the `Accept-Language` header.
- Some websites may still block browser automation or limit accessible content.
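To make the dependence on the returned HTML concrete, the sketch below turns an HTML string into plain text and reads the `lang` attribute of the `<html>` element using only the Python standard library. It is not the actor's extractor, which is not shown in this document; the class and function names are hypothetical, and the choice to skip `script`, `style` and `noscript` content is an assumption.

```python
from html.parser import HTMLParser


class TextAndLangParser(HTMLParser):
    """Collect visible text and the <html lang> value from a document."""

    SKIPPED_TAGS = {"script", "style", "noscript"}  # assumed non-content tags

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.html_lang is None:
            self.html_lang = dict(attrs).get("lang")
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    @property
    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text_and_lang(html):
    parser = TextAndLangParser()
    parser.feed(html)
    parser.close()
    return parser.text, parser.html_lang


sample = (
    '<html lang="en"><head><style>p{color:red}</style></head>'
    "<body><h1>Example Domain</h1><script>var x = 1;</script>"
    "<p>Hello   world</p></body></html>"
)
text, lang = extract_text_and_lang(sample)
print(lang, "|", text)
```

If the site serves different markup for a different `Accept-Language` value, or serves a blocked or reduced page, both the text and the detected language change accordingly.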
Client code example
NodeJS
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": [
        "https://www.ttalbuzzano.it/",
        "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
        "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026"
    ],
    "maxConcurrency": 20,
    "langs": [
        "it",
        "en"
    ]
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("FBgOh24p4TEhFodKx").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
Python
from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": [
        "https://www.ttalbuzzano.it/",
        "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
        "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026",
    ],
    "maxConcurrency": 20,
    "langs": [
        "it",
        "en",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("FBgOh24p4TEhFodKx").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
cURL
# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json <<'EOF'
{
  "url": [
    "https://www.ttalbuzzano.it/",
    "https://www.mondomobileweb.it/317712-jannik-sinner-in-arrivo-concorso-fastwebup-e-vodafone-happy-per-incontrarlo",
    "https://www.nittoatpfinals.com/it/news/alcaraz-sinner-pif-atp-live-race-to-turin-post-miami-2026"
  ],
  "maxConcurrency": 20,
  "langs": [
    "it",
    "en"
  ]
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/FBgOh24p4TEhFodKx/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'