AI Website Content Crawler
Developer: Fabio Borsotti
Pricing: from $0.50 / 1,000 results
Last modified: 4 days ago
This Apify actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset.
The actor is optimized for plain HTTP fetching with concurrent requests, automatic URL normalization, and language preference headers based on the langs input field.
What the actor does
For each input URL, the actor normalizes the address, adding https:// automatically when the scheme is missing, so values like example.com are still processed correctly.
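The normalization and deduplication steps can be sketched as follows. This is an illustrative approximation, not the actor's actual source: the scheme check and the order-preserving dedupe strategy are assumptions.

```python
def normalize_url(raw: str) -> str:
    """Prepend https:// when the URL has no scheme (assumed behavior)."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    return raw


def dedupe_urls(urls):
    """Drop duplicates of the normalized form while preserving input order."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

With this sketch, `dedupe_urls(["example.com", "https://example.com"])` collapses both entries into a single `https://example.com`.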
It then performs HTTP requests in parallel using asynchronous execution controlled by maxConcurrency, which helps speed up large batches of URLs.
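A common way to cap parallel requests in asynchronous Python is a semaphore; the minimal sketch below illustrates the pattern, with a placeholder in place of the real HTTP call (the actor's actual implementation is not shown in this page).

```python
import asyncio


async def fetch_all(urls, max_concurrency=20):
    # Limit the number of in-flight requests with a semaphore.
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with sem:
            # Placeholder for the real HTTP request; the sketch just
            # yields control once and returns the URL unchanged.
            await asyncio.sleep(0)
            return url

    # gather() preserves input order in its results.
    return await asyncio.gather(*(fetch_one(u) for u in urls))
```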
After downloading the page, the actor parses the HTML with BeautifulSoup, removes non-content elements such as script, style, noscript, header, footer, svg, img, meta, and link, and converts the remaining content into clean plain text.
The actor also reads the lang attribute from the HTML document when available and includes it in the output as htmlLang.
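The actor itself uses BeautifulSoup for this step; as a dependency-free illustration of the same idea, a stdlib `html.parser` sketch that strips non-content tags and reads the document's `lang` attribute might look like:

```python
from html.parser import HTMLParser

# Container tags whose text content is skipped. Void tags such as
# img, meta, and link carry no text, so they need no special handling here.
SKIP_TAGS = {"script", "style", "noscript", "header", "footer", "svg"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []
        self.html_lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.html_lang = dict(attrs).get("lang")
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks), parser.html_lang
```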
Features
- Accepts a list of page URLs.
- Automatically adds `https://` when missing from input URLs.
- Deduplicates normalized URLs before processing.
- Sends preferred language headers using the `langs` array.
- Extracts cleaned plain text from HTML pages.
- Detects the page language from the HTML `lang` attribute when present.
- Processes requests concurrently using `maxConcurrency`.
- Logs progress for each processed URL and prints a final success/error summary.
Input
```json
{
  "url": ["example.com", "https://news.ycombinator.com"],
  "maxConcurrency": 20,
  "langs": ["it", "en"]
}
```
Input fields
| Field | Type | Required | Description |
|---|---|---|---|
| url | array of strings | Yes | List of page URLs to download and convert into clean plain text. |
| maxConcurrency | integer | No | Maximum number of parallel HTTP requests. |
| langs | array of strings | No | Preferred languages used to build the Accept-Language header for HTTP requests. |
How language handling works
The actor converts the langs array into a standard Accept-Language header ordered by priority, so langs: ["it", "en", "fr"] becomes a header similar to it,en;q=0.9,fr;q=0.8.
This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.
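Consistent with the example above, the header can be built by giving the first language full priority and lowering the quality value by 0.1 for each subsequent entry. The exact step size is an assumption inferred from the documented example:

```python
def build_accept_language(langs):
    """Build an Accept-Language header value from an ordered language list."""
    parts = []
    for i, lang in enumerate(langs):
        if i == 0:
            parts.append(lang)  # first language gets implicit q=1.0
        else:
            q = max(0.1, 1.0 - 0.1 * i)  # decreasing priority per position
            parts.append(f"{lang};q={q:.1f}")
    return ",".join(parts)
```

For `["it", "en", "fr"]` this yields `it,en;q=0.9,fr;q=0.8`, matching the example in the text.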
Output
Each dataset item contains the processing result for a single URL.
Successful result
```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain\nThis domain is for use in illustrative examples in documents...",
  "error": null
}
```
Error result
```json
{
  "url": "https://example.com/missing-page",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "404 Client Error or another request exception"
}
```
Notes and limitations
- The actor uses plain HTTP requests and does not render JavaScript in a browser, so text that appears only after client-side rendering may be missing.
- The extracted text depends on the HTML returned by the target website and on the site's response to the `Accept-Language` header.
- The actor is intended for HTML pages; non-HTML resources may fail or return unusable output depending on server behavior.