Website Content Crawler

Under maintenance

Crawls websites and extracts clean text, markdown, or HTML content. Ideal for LLM training data, RAG pipelines, and knowledge base building.

Pricing: from $1.00 / 1,000 results
Rating: 5.0 (10 reviews)
Developer: Crawler Bros (Maintained by Community)
Actor stats: 11 bookmarks · 2 total users · 1 monthly active user · last modified a day ago

Crawl any website and extract clean text, markdown, or HTML content. Built for feeding data into LLMs, building RAG pipelines, creating knowledge bases, and powering AI-driven search.

What does Website Content Crawler do?

This actor takes one or more URLs and crawls entire websites by following links. It extracts clean, readable content from every page — stripping navigation, scripts, footers, and other non-content elements. The output is optimized for use with large language models (LLMs) and retrieval-augmented generation (RAG) systems.

Features

  • Deep website crawling — Follows links across pages up to a configurable depth
  • Multiple output formats — Get content as Markdown, plain text, or cleaned HTML
  • Smart content extraction — Automatically removes navigation, scripts, footers, cookie banners, and other boilerplate
  • URL filtering — Include or exclude pages using glob patterns (e.g., https://example.com/blog/**)
  • Configurable limits — Control max pages, crawl depth, and concurrency
  • JavaScript rendering — Uses a headless browser (Chromium or Firefox) to handle dynamic websites
  • Fast HTTP mode — Optional raw HTTP mode for static sites that don't need JavaScript
  • No login required — Works with publicly accessible pages

Input

  • startUrls (array of URLs, required) — One or more URLs to begin crawling from
  • crawlerType (string, default: playwright:chromium) — Crawler engine: playwright:chromium, playwright:firefox, or http
  • maxCrawlDepth (integer, default: 10) — Maximum link-following depth (0 = start URLs only)
  • maxCrawlPages (integer, default: 100) — Total page limit
  • maxConcurrency (integer, default: 5) — Number of pages loaded in parallel
  • includeUrlGlobs (array of strings, default: []) — Only crawl URLs matching these glob patterns
  • excludeUrlGlobs (array of strings, default: []) — Skip URLs matching these glob patterns
  • outputFormat (string, default: markdown) — Output format: markdown, text, or html
  • proxyConfiguration (object, optional) — Optional proxy settings

Example Input

{
  "startUrls": [
    { "url": "https://docs.apify.com" }
  ],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "outputFormat": "markdown"
}
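For programmatic runs, the input above can be passed to the actor through Apify's Python client. A minimal sketch follows; the actor ID "crawler-bros/website-content-crawler" is an assumption based on the developer name shown on this page, so check the real ID in the Apify Console, and note the call only fires when an APIFY_TOKEN environment variable is set.

```python
# Sketch: run the actor from Python with the apify-client package.
# The actor ID is assumed from the developer name on this page -- verify
# it in the Apify Console before relying on it.
import os

run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 50,
    "maxCrawlDepth": 3,
    "outputFormat": "markdown",
}

if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # Starts the run and waits for it to finish.
    run = client.actor("crawler-bros/website-content-crawler").call(run_input=run_input)
    print(run["defaultDatasetId"])  # ID of the dataset holding the results
```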

URL Filtering Example

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "includeUrlGlobs": ["https://example.com/blog/**"],
  "excludeUrlGlobs": ["**login**", "**signup**"]
}

Output

Each crawled page produces a result with the following fields.

  • url (string) — Original URL that was requested
  • loadedUrl (string) — Final URL after any redirects
  • title (string) — Page title
  • description (string) — Meta description of the page
  • languageCode (string) — Language code from the HTML lang attribute
  • text (string) — Clean plain text extracted from the page
  • markdown (string) — Page content converted to Markdown
  • html (string) — Cleaned HTML content (when the output format is html)
  • depth (integer) — Crawl depth (0 = start URL)
  • httpStatusCode (integer) — HTTP response status code
  • loadedTime (string) — ISO 8601 timestamp when the page was loaded
  • referrerUrl (string) — URL of the page that linked to this one

Example Output

{
"url": "https://docs.apify.com/academy/web-scraping-for-beginners",
"loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
"title": "Web scraping for beginners | Apify Documentation",
"description": "Learn how to build web scrapers from scratch.",
"languageCode": "en",
"text": "Web scraping for beginners\n\nThis course teaches you the basics of web scraping...",
"markdown": "# Web scraping for beginners\n\nThis course teaches you the basics of web scraping...",
"html": "",
"depth": 1,
"httpStatusCode": 200,
"loadedTime": "2025-01-15T10:30:00.000Z",
"referrerUrl": "https://docs.apify.com"
}
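Records with this shape are easy to post-process before indexing. The sketch below, using hypothetical sample data modeled on the example above, keeps only pages that loaded with HTTP 200 and collects their Markdown:

```python
# Sketch: filter crawler results down to successfully loaded pages and
# gather their markdown for downstream indexing (e.g., a RAG pipeline).

def successful_markdown(items):
    """Return the markdown of pages that returned HTTP 200 and have content."""
    return [
        item["markdown"]
        for item in items
        if item.get("httpStatusCode") == 200 and item.get("markdown")
    ]

# Hypothetical sample records mirroring the example output fields.
sample = [
    {"url": "https://docs.apify.com", "httpStatusCode": 200, "markdown": "# Home"},
    {"url": "https://docs.apify.com/missing", "httpStatusCode": 404, "markdown": ""},
]
print(successful_markdown(sample))  # ['# Home']
```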

Use Cases

  • LLM Training Data — Crawl documentation sites, blogs, or knowledge bases to build training datasets
  • RAG Pipelines — Extract and index website content for retrieval-augmented generation
  • Knowledge Base Building — Convert entire websites into structured, searchable content
  • Content Migration — Export website content as Markdown for migration to a new platform
  • Competitive Analysis — Extract and compare content across competitor websites
  • SEO Auditing — Crawl your site to analyze content, titles, and meta descriptions
  • Documentation Archival — Create offline copies of documentation in clean text format

FAQ

What output format should I use for LLMs?

Markdown is the recommended format for LLM use cases. It preserves document structure (headings, lists, links) while remaining clean and readable. Use text if you need the simplest possible format with no formatting.

How many pages can I crawl?

You can crawl up to 100,000 pages per run. The default limit is 100 pages. Adjust the maxCrawlPages setting based on your needs and available compute.

What's the difference between Chromium and HTTP mode?

Chromium (default) uses a headless browser that renders JavaScript, making it work with modern dynamic websites. HTTP mode fetches raw HTML without running JavaScript — it's much faster but only works with static (server-rendered) pages.

Does the crawler follow links to other domains?

No. The crawler only follows links within the same domain as the start URL. This prevents accidentally crawling the entire internet.

How does URL filtering work?

Use includeUrlGlobs to restrict crawling to specific sections of a site (e.g., https://example.com/docs/**). Use excludeUrlGlobs to skip certain pages (e.g., **login**, **.pdf). Glob patterns use standard wildcard matching where * matches anything within a path segment and ** matches across segments.
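The semantics described above can be illustrated with a small Python sketch. This is a toy reimplementation of the matching rules, not the crawler's actual matcher:

```python
# Sketch of the glob semantics described above: "*" matches within one
# path segment, "**" matches across segments. Illustrative only.
import re

def glob_to_regex(pattern: str) -> "re.Pattern":
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "*":
            if pattern[i:i + 2] == "**":
                out.append(".*")      # ** crosses "/" boundaries
                i += 2
                continue
            out.append("[^/]*")       # * stays within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

blog = glob_to_regex("https://example.com/blog/**")
print(bool(blog.match("https://example.com/blog/2024/post")))  # True
print(bool(blog.match("https://example.com/about")))           # False
```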

Does it handle JavaScript-rendered content?

Yes. The default Chromium mode renders JavaScript before extracting content. This means Single Page Applications (SPAs), React sites, and other dynamic pages are fully supported.

How clean is the extracted content?

The crawler automatically removes navigation menus, headers, footers, scripts, styles, cookie banners, and other non-content elements. The resulting text or markdown contains only the main page content.

Can I crawl password-protected pages?

No. This crawler works with publicly accessible pages only. It does not support login or authentication.

What happens if a page fails to load?

Failed pages are logged and skipped. The crawler continues with the remaining URLs. Check the run log for details on any failures.

Does the crawler respect robots.txt?

The headless browser mode does not explicitly check robots.txt. If you need to respect robots.txt restrictions, review the site's rules before crawling.

What output formats are available for export?

Results can be exported as JSON, CSV, Excel (XLSX), HTML, RSS, or XML directly from the Apify platform.
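The same exports are reachable over HTTP. A minimal sketch, assuming the dataset-items endpoint shape from the Apify API docs (GET /v2/datasets/{datasetId}/items with a format query parameter); "abc123" is a placeholder dataset ID, and the download only runs when an APIFY_TOKEN environment variable is set:

```python
# Sketch: build the export URL for a run's dataset. The endpoint shape
# is assumed from the Apify API documentation; "abc123" is a placeholder.
import os
import urllib.request

def export_url(dataset_id: str, fmt: str = "json") -> str:
    if fmt not in {"json", "csv", "xlsx", "html", "rss", "xml"}:
        raise ValueError(f"unsupported format: {fmt}")
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

print(export_url("abc123", "csv"))
# https://api.apify.com/v2/datasets/abc123/items?format=csv

if os.environ.get("APIFY_TOKEN"):  # fetch only when a token is configured
    req = urllib.request.Request(
        export_url("abc123", "csv"),
        headers={"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"},
    )
    print(urllib.request.urlopen(req).read()[:200])
```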