AI Website Content Crawler

A super fast website crawler for AI training

Pricing

from $0.50 / 1,000 results

Rating

0.0 (0 reviews)

Developer

Fabio Borsotti (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 days ago


This Apify actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset.

The actor is optimized for plain HTTP fetching with concurrent requests, automatic URL normalization, and language preference headers based on the langs input field.

What the actor does

For each input URL, the actor normalizes the address, adding https:// automatically when the scheme is missing, so values like example.com are still processed correctly.
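A minimal sketch of this normalization and deduplication step, assuming the actor does something along these lines (the function names normalize_url and dedupe are illustrative, not taken from the actor's source):

```python
def normalize_url(raw: str) -> str:
    """Prepend https:// when the URL has no scheme (illustrative sketch)."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    return raw

def dedupe(urls):
    """Deduplicate normalized URLs while preserving input order."""
    return list(dict.fromkeys(normalize_url(u) for u in urls))
```

With this, both example.com and https://example.com in the input collapse into a single request.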

It then performs HTTP requests in parallel using asynchronous execution controlled by maxConcurrency, which helps speed up large batches of URLs.
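The concurrency cap can be pictured as a semaphore around each request, as in this sketch with asyncio; the real actor's HTTP client is not shown here, so fetch_one below is a placeholder rather than the actual implementation:

```python
import asyncio

async def fetch_all(urls, max_concurrency=20):
    """Run one task per URL, allowing at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with sem:
            # Placeholder for the real HTTP request (e.g. via httpx or aiohttp).
            await asyncio.sleep(0)
            return {"url": url, "success": True}

    return await asyncio.gather(*(fetch_one(u) for u in urls))

results = asyncio.run(fetch_all(["https://example.com", "https://news.ycombinator.com"]))
```

Results come back in input order because asyncio.gather preserves the order of its arguments, even though the requests themselves overlap.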

After downloading the page, the actor parses the HTML with BeautifulSoup, removes non-content elements such as script, style, noscript, header, footer, svg, img, meta, and link, and converts the remaining content into clean plain text.

The actor also reads the lang attribute from the HTML document when available and includes it in the output as htmlLang.
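The parsing and cleaning steps can be sketched with BeautifulSoup as follows; the tag list matches the one described above, while the helper name extract_text and the exact get_text arguments are assumptions:

```python
from bs4 import BeautifulSoup

NON_CONTENT = ["script", "style", "noscript", "header", "footer", "svg", "img", "meta", "link"]

def extract_text(html: str):
    """Return (html_lang, clean_text) from a raw HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    html_tag = soup.find("html")
    html_lang = html_tag.get("lang") if html_tag else None
    # Remove non-content elements before extracting text.
    for tag in soup(NON_CONTENT):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return html_lang, text

lang, text = extract_text('<html lang="en"><head><script>x()</script></head><body><p>Hello</p></body></html>')
```

Here lang comes back as "en" and text contains only the visible paragraph, with the script stripped out.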

Features

  • Accepts a list of page URLs.
  • Automatically adds https:// when missing from input URLs.
  • Deduplicates normalized URLs before processing.
  • Sends preferred language headers using the langs array.
  • Extracts cleaned plain text from HTML pages.
  • Detects the page language from the HTML lang attribute when present.
  • Processes requests concurrently using maxConcurrency.
  • Logs progress for each processed URL and prints a final success/error summary.

Input

```json
{
  "url": [
    "example.com",
    "https://news.ycombinator.com"
  ],
  "maxConcurrency": 20,
  "langs": ["it", "en"]
}
```

Input fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | array of strings | Yes | List of page URLs to download and convert into clean plain text. |
| maxConcurrency | integer | No | Maximum number of parallel HTTP requests. |
| langs | array of strings | No | Preferred languages used to build the Accept-Language header for HTTP requests. |

How language handling works

The actor converts the langs array into a standard Accept-Language header ordered by priority, so langs: ["it", "en", "fr"] becomes a header similar to it,en;q=0.9,fr;q=0.8.
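A sketch of how such a header can be built, assuming a quality value that starts at 1.0 and drops by 0.1 per position (this matches the example above, but the actor's exact weighting scheme is an assumption):

```python
def build_accept_language(langs):
    """First language gets full weight; each later one drops by 0.1 (floor 0.1)."""
    parts = []
    for i, lang in enumerate(langs):
        if i == 0:
            parts.append(lang)
        else:
            q = round(max(1.0 - 0.1 * i, 0.1), 1)
            parts.append(f"{lang};q={q:g}")
    return ",".join(parts)
```

For example, build_accept_language(["it", "en", "fr"]) yields "it,en;q=0.9,fr;q=0.8".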

This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.

Output

Each dataset item contains the processing result for a single URL.

Successful result

```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain\nThis domain is for use in illustrative examples in documents...",
  "error": null
}
```

Error result

```json
{
  "url": "https://example.com/missing-page",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "404 Client Error or another request exception"
}
```

Notes and limitations

  • The actor uses plain HTTP requests and does not render JavaScript in a browser, so text that appears only after client-side rendering may be missing.
  • The extracted text depends on the HTML returned by the target website and on the site's response to the Accept-Language header.
  • The actor is intended for HTML pages; non-HTML resources may fail or return unusable output depending on server behavior.