Website Content Crawler

Pricing

from $1.50 / 1,000 results

Rating: 0.0 (0)

Developer: Syed Rupom (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 2
  • Last modified: 20 days ago

Crawl any website and extract clean, structured content. Output is available as plain text, Markdown, or raw HTML, and is optimized for AI/LLM applications, RAG pipelines, documentation indexing, chatbot training, and content analysis.

Features

  • Full site crawling: Follows internal links up to a configurable depth
  • Smart content extraction: Auto-detects main content, strips nav/header/footer/ads
  • Multiple output formats: Markdown (AI-ready), plain text, or raw HTML
  • JavaScript rendering: Puppeteer-based crawling handles React, Vue, and other dynamic sites
  • JSON-LD extraction: Extracts structured data schemas embedded in pages
  • Configurable depth & page limits: Control exactly how much to crawl
  • Custom selectors: Target specific content areas or remove specific elements
  • Subdomain support: Optionally follow links to subdomains
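
The depth, page-limit, and subdomain options above can be pictured as a bounded breadth-first walk over a site's link graph. The following is a minimal sketch, not the Actor's actual implementation: the function names in_scope and crawl_order are hypothetical, the link graph is an in-memory dictionary standing in for real page fetches, and treating "subdomain" as "any host ending in the start host" is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, start_url, include_subdomains=False):
    """Return True if `url` may be followed when crawling from `start_url`."""
    host = (urlparse(url).hostname or "").lower()
    start_host = (urlparse(start_url).hostname or "").lower()
    if not host or not start_host:
        return False
    if host == start_host:
        return True
    # Assumed rule: "blog.example.com" counts as a subdomain of "example.com".
    return include_subdomains and host.endswith("." + start_host)


def crawl_order(links, start_url, max_depth=3, max_pages=100,
                include_subdomains=False):
    """Breadth-first walk returning (url, depth) pairs in visit order.

    `links` maps a URL to the URLs found on that page (hypothetical data).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        for nxt in links.get(url, []):
            if nxt not in seen and in_scope(nxt, start_url, include_subdomains):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical link graph used only for illustration.
graph = {
    "https://example.com/": [
        "https://example.com/a",
        "https://blog.example.com/",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/b"],
}

shallow = crawl_order(graph, "https://example.com/", max_depth=1)
with_subs = crawl_order(graph, "https://example.com/", max_depth=1,
                        include_subdomains=True)
```

In this sketch a depth limit of 0 visits only the start URL, and the external host other.org is never followed regardless of the subdomain setting.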

Output Fields Per Page

Field         Description
url           Original URL
loaded_url    Final URL after redirects
title         Page <title>
description   Meta description
author        Author meta tag
keywords      Meta keywords
og_image      Open Graph image URL
canonical     Canonical URL
lang          Page language code
h1            Main heading
h2s           Top subheadings (up to 10)
text          Clean plain text (format=text)
markdown      Markdown-formatted content (format=markdown)
html          Content HTML (format=html)
json_ld       JSON-LD structured data objects
depth         Crawl depth from start URL
referrer      Page that linked here
load_time_ms  Page load time in ms
status_code   HTTP status code
links_found   Number of links on the page
crawled_at    ISO timestamp
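
As an illustration of how these fields might be consumed downstream, the sketch below filters a list of result records before further processing. It assumes each record is a dictionary keyed by the field names above, that status_code is an integer, and that only the content field matching the chosen output format is populated; the helper name usable_pages is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def usable_pages(items: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep pages that loaded with HTTP 200, have content, and are not duplicates."""
    seen = set()
    pages = []
    for item in items:
        if item.get("status_code") != 200:
            continue  # skip redirects that failed, 404s, server errors
        # Assumption: exactly one of these is filled, depending on the format.
        content = item.get("markdown") or item.get("text") or item.get("html") or ""
        if not content.strip():
            continue
        # Prefer the canonical URL so that URL variants collapse into one page.
        key = item.get("canonical") or item.get("loaded_url") or item.get("url")
        if key in seen:
            continue
        seen.add(key)
        pages.append({
            "url": key,
            "title": item.get("title", ""),
            "depth": item.get("depth", 0),
            "content": content,
        })
    return pages


# Hypothetical records shaped like the output fields listed above.
sample = [
    {"url": "https://docs.example.com/", "canonical": "https://docs.example.com/",
     "title": "Docs", "status_code": 200, "depth": 0, "markdown": "# Docs\nWelcome."},
    {"url": "https://docs.example.com/?ref=nav", "canonical": "https://docs.example.com/",
     "title": "Docs", "status_code": 200, "depth": 1, "markdown": "# Docs\nWelcome."},
    {"url": "https://docs.example.com/missing", "status_code": 404, "depth": 1,
     "markdown": "Not found"},
]
pages = usable_pages(sample)
```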

Input

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "maxDepth": 3,
  "includeSubdomains": false,
  "outputFormat": "markdown",
  "extractSelector": "article",
  "removeSelectors": [".sidebar", ".related-posts"],
  "proxyConfiguration": {"useApifyProxy": false}
}
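
When the input is generated from code, a small builder can catch obvious mistakes before a run is started. This is a sketch under assumptions: the helper build_input is hypothetical, the validation rules are illustrative rather than the Actor's own, and the accepted outputFormat values "text" and "html" are inferred from the output field descriptions, not confirmed by an input schema.

```python
import json

# Assumed set of formats; only "markdown" appears in the example input.
ALLOWED_FORMATS = {"markdown", "text", "html"}


def build_input(start_urls, max_pages=100, max_depth=3, output_format="markdown",
                extract_selector=None, remove_selectors=(),
                include_subdomains=False, use_proxy=False):
    """Build an input object with the same keys as the example above."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    if max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported outputFormat: {output_format!r}")
    run_input = {
        "startUrls": [{"url": u} for u in start_urls],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "includeSubdomains": include_subdomains,
        "outputFormat": output_format,
        "proxyConfiguration": {"useApifyProxy": use_proxy},
    }
    # Optional selectors are included only when provided.
    if extract_selector:
        run_input["extractSelector"] = extract_selector
    if remove_selectors:
        run_input["removeSelectors"] = list(remove_selectors)
    return run_input


run_input = build_input(
    ["https://docs.example.com"],
    extract_selector="article",
    remove_selectors=[".sidebar", ".related-posts"],
)
payload = json.dumps(run_input, indent=2)
```

The resulting payload string can be pasted into the Actor's JSON input editor or passed to whichever client is used to start the run.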

Use Cases

  • AI Training Data: Extract clean, structured web content at scale
  • RAG Pipelines: Feed documentation sites into vector databases (Pinecone, Qdrant, Weaviate)
  • Custom ChatGPT: Build knowledge bases from product documentation
  • Content Auditing: Extract and analyze all text across a website
  • Competitive Research: Extract competitor content for analysis
  • Documentation Indexing: Index technical docs for search
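
For the RAG and indexing use cases, crawled Markdown usually needs to be split into chunks before it is embedded. The sketch below shows one simple approach, splitting on Markdown headings and then capping chunk length. The function chunk_markdown and the 1000-character default are assumptions for illustration; the embedding step and the upload to a specific vector database are left out, and heading-like lines inside code blocks are not handled specially.

```python
import re
from typing import Dict, List


def chunk_markdown(markdown: str, url: str, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split Markdown into heading-scoped chunks of at most `max_chars` characters."""
    # Zero-width split before every ATX heading ("# ", "## ", ...) at line start.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Use the heading text as metadata; text before the first heading has none.
        heading = ""
        if section.startswith("#"):
            heading = section.splitlines()[0].lstrip("#").strip()
        for start in range(0, len(section), max_chars):
            chunks.append({
                "url": url,
                "heading": heading,
                "text": section[start:start + max_chars],
            })
    return chunks


# Hypothetical page content, shaped like the `markdown` output field.
doc = "Intro\n# Install\nRun the installer.\n## Options\nUse --quiet."
chunks = chunk_markdown(doc, "https://docs.example.com/install")
```

Keeping the source URL and heading with each chunk makes it possible to cite the original page when a chunk is retrieved.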

Tips

  • Set maxDepth: 0 to scrape only the start URLs without following links
  • Use extractSelector: "main" to target only the main content area
  • Set outputFormat: "markdown" for best results with AI/LLM ingestion
  • Most public sites work without proxies; enable proxies for rate-limited sites