Website Content Crawler

Under maintenance

Pricing: Pay per usage

Rating: 0.0 (0)

Developer: Syed Rupom (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 1
  • Monthly active users: 1
  • Last modified: 3 hours ago

Crawl a website and extract clean, structured content from each page. Output is plain text, Markdown, or raw HTML, suited to AI/LLM applications, RAG pipelines, documentation indexing, chatbot training, and content analysis.

Features

  • Full site crawling: Follows internal links up to a configurable depth
  • Smart content extraction: Auto-detects the main content and strips navigation, headers, footers, and ads
  • Multiple output formats: Markdown (AI-ready), plain text, or raw HTML
  • JavaScript rendering: Puppeteer-based crawling handles React, Vue, and other dynamically rendered sites
  • JSON-LD extraction: Collects the structured data schemas embedded in pages
  • Configurable depth & page limits: Control exactly how much to crawl
  • Custom selectors: Target specific content areas or remove specific elements
  • Subdomain support: Optionally follow links to subdomains
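The depth limit, page limit, and subdomain option listed above can be pictured as a breadth-first traversal with a scope check. The following is a minimal offline sketch, not the Actor's actual implementation: `in_scope`, `crawl_order`, and the `links` dictionary (which stands in for real page fetches) are hypothetical names, and treating "subdomain" as "any host ending in the start host" is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url: str, start_url: str, include_subdomains: bool = False) -> bool:
    """Return True if `url` is on the start URL's host (or a subdomain, if allowed)."""
    host = (urlparse(url).hostname or "").lower()
    start_host = (urlparse(start_url).hostname or "").lower()
    if host == start_host:
        return True
    # Assumption: a subdomain is any host that ends with ".<start host>".
    return include_subdomains and host.endswith("." + start_host)


def crawl_order(start_url, links, max_depth=3, max_pages=100, include_subdomains=False):
    """Breadth-first traversal over `links`, a dict mapping each URL to its outgoing URLs.

    Returns a list of (url, depth) pairs in visit order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen and in_scope(target, start_url, include_subdomains):
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With `max_depth=0` the traversal returns only the start URL, which matches the behavior described for `maxDepth: 0` under Tips.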

Output Fields Per Page

Field          Description
url            Original URL
loaded_url     Final URL after redirects
title          Contents of the page's <title> element
description    Meta description
author         Author meta tag
keywords       Meta keywords
og_image       Open Graph image URL
canonical      Canonical URL
lang           Page language code
h1             Main heading
h2s            Top subheadings (up to 10)
text           Clean plain text (outputFormat: "text")
markdown       Markdown-formatted content (outputFormat: "markdown")
html           Content HTML (outputFormat: "html")
json_ld        JSON-LD structured data objects
depth          Crawl depth from the start URL
referrer       Page that linked here
load_time_ms   Page load time in milliseconds
status_code    HTTP status code
links_found    Number of links on the page
crawled_at     ISO timestamp of the crawl
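As a rough illustration of consuming one output record, the sketch below picks whichever content field is populated and builds a one-line summary. It assumes that only the content field matching the chosen output format is filled in, which is inferred from the table rather than confirmed; the helper names and the sample values are made up for the example.

```python
from typing import Any, Optional

# Content fields from the table above, in order of preference.
CONTENT_FIELDS = ("markdown", "text", "html")


def page_content(record: dict[str, Any]) -> Optional[str]:
    """Return the first populated content field, preferring Markdown."""
    for field in CONTENT_FIELDS:
        value = record.get(field)
        if value:
            return value
    return None


def summarize(record: dict[str, Any]) -> str:
    """Build a short status line from the metadata fields."""
    url = record.get("loaded_url") or record.get("url")
    return f"[{record.get('status_code')}] {record.get('title')} ({url}, depth {record.get('depth')})"


# Illustrative record; the values are invented, only the field names come from the table.
sample = {
    "url": "https://docs.example.com/start",
    "loaded_url": "https://docs.example.com/start/",
    "title": "Getting Started",
    "markdown": "# Getting Started\n\nInstall the package.",
    "depth": 1,
    "status_code": 200,
}

print(summarize(sample))
print(page_content(sample))
```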

Input

{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 100,
  "maxDepth": 3,
  "includeSubdomains": false,
  "outputFormat": "markdown",
  "extractSelector": "article",
  "removeSelectors": [".sidebar", ".related-posts"],
  "proxyConfiguration": {"useApifyProxy": false}
}
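One way to assemble this input programmatically and start a run is sketched below. `build_run_input` and `run_crawler` are hypothetical helpers; the Actor ID is not given on this page, so it must be supplied by the caller. The sketch assumes the `apify-client` Python package is installed and that the optional selector fields may be omitted, which this page does not confirm.

```python
def build_run_input(start_urls, max_pages=100, max_depth=3, include_subdomains=False,
                    output_format="markdown", extract_selector=None,
                    remove_selectors=None, use_apify_proxy=False):
    """Build the Actor input dict using the field names from the example above."""
    if output_format not in {"text", "markdown", "html"}:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "includeSubdomains": include_subdomains,
        "outputFormat": output_format,
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }
    # Assumption: the selector fields are optional and can be left out.
    if extract_selector:
        run_input["extractSelector"] = extract_selector
    if remove_selectors:
        run_input["removeSelectors"] = list(remove_selectors)
    return run_input


def run_crawler(token, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `build_run_input(["https://docs.example.com"], extract_selector="article", remove_selectors=[".sidebar", ".related-posts"])` reproduces the JSON shown above.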

Use Cases

  • AI Training Data: Extract clean, structured web content at scale
  • RAG Pipelines: Feed documentation sites into vector databases (Pinecone, Qdrant, Weaviate)
  • Custom ChatGPT: Build knowledge bases from product documentation
  • Content Auditing: Extract and analyze all text across a website
  • Competitive Research: Extract competitor content for analysis
  • Documentation Indexing: Index technical docs for search
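For the RAG use case, crawled Markdown usually needs to be split into chunks before it is embedded and stored. The sketch below shows one simple approach, packing paragraphs into chunks of bounded size; the chunking strategy, the helper names, and the document shape (`id`, `text`, `metadata`) are assumptions for illustration, not part of the Actor or of any particular vector database's API.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most `max_chars`.

    A single paragraph longer than `max_chars` is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def to_documents(records, max_chars: int = 1000) -> list[dict]:
    """Turn crawler records into chunk-level documents ready for embedding."""
    docs = []
    for record in records:
        chunks = chunk_markdown(record.get("markdown") or "", max_chars)
        for index, chunk in enumerate(chunks):
            docs.append({
                "id": f"{record['url']}#{index}",
                "text": chunk,
                "metadata": {"url": record["url"], "title": record.get("title")},
            })
    return docs
```

Each resulting document keeps the source `url` and `title`, so retrieved chunks can be traced back to the page they came from.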

Tips

  • Set maxDepth: 0 to scrape only the start URLs, without following links
  • Use extractSelector: "main" to target only the main content area
  • Set outputFormat: "markdown" for best results with AI/LLM ingestion
  • Most public sites work without proxies; enable proxies for rate-limited sites