Website Content Crawler

Crawl any website and extract clean text content, headings, links, and metadata. Configurable depth, domain restriction, and output formats. Ideal for AI/LLM training data preparation and content analysis.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Stephan Corbeil (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 0
Last modified: 6 hours ago


Crawl websites and extract clean, structured content for AI training, knowledge bases, and content analysis. Extracts text, headings, links, and metadata while respecting site structure and domain boundaries.

Features

  • Full-Site Crawling: Discover and extract content from all accessible pages
  • Clean Text Extraction: Strips boilerplate and navigation, keeping only the main content
  • Structured Data: Captures headings hierarchy, links, metadata, and descriptions
  • Configurable Depth: Control crawl depth from single page to unlimited recursion
  • Domain Boundaries: Stay within domain or crawl subdomains as needed
  • Pattern Exclusion: Skip URLs matching exclude patterns (PDFs, archives, etc.)
  • Multiple Formats: Output as JSON, CSV, or HTML for downstream processing
  • AI/LLM Ready: Perfect for training data collection and fine-tuning datasets

Input Parameters

Parameter         Type     Required  Description                                                      Default
start_urls        Array    Yes       URLs to start crawling from                                      []
max_pages         Integer  No        Maximum pages to crawl (1-10000)                                 100
max_depth         Integer  No        Maximum crawl depth (0 = single page only)                       3
same_domain       Boolean  No        Only crawl links within the same domain                          true
exclude_patterns  Array    No        Regex patterns to exclude (e.g., [".*\.pdf$", ".*downloads.*"])  []
output_format     String   No        Output format: json, csv, or html                                json
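
For orientation, here is a hedged sketch of starting a run from Python with the official apify-client package. The actor ID is a placeholder (take the real one from this page), and the input simply exercises the parameters in the table above:

from apify_client import ApifyClient

# Placeholder token and actor ID; substitute your own values.
client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "start_urls": ["https://example.com/docs"],
    "max_pages": 200,                   # stop after 200 pages
    "max_depth": 2,                     # start page, its links, their links
    "same_domain": True,                # do not follow off-site links
    "exclude_patterns": [".*\\.pdf$"],  # skip links to PDF files
    "output_format": "json",
}

# Start the actor and wait for the run to finish.
run = client.actor("stephancorbeil/website-content-crawler").call(run_input=run_input)
print("Dataset ID:", run["defaultDatasetId"])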

Output

Field           Type     Description
url             String   Page URL
title           String   Page title from the <title> tag
description     String   Meta description
h1              Array    All H1 headings on the page
h2              Array    All H2 headings on the page
headings        Array    All headings (h1-h6) with hierarchy
text_content    String   Clean extracted text content
links           Array    All links found on the page
links_internal  Array    Links to pages within the same domain
links_external  Array    Links to external domains
word_count      Integer  Total words in the text content
crawl_depth     Integer  Depth at which the page was discovered
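
The links_internal/links_external split follows the domain boundary described above. A minimal sketch of how such a classification can be computed, shown only to pin down the semantics (not the actor's actual implementation):

from urllib.parse import urlparse

def split_links(page_url, links):
    """Classify links as internal or external relative to page_url's host."""
    page_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for link in links:
        host = urlparse(link).netloc.lower()
        (internal if host == page_host else external).append(link)
    return internal, external

internal, external = split_links(
    "https://example.com/about",
    ["https://example.com/contact", "https://linkedin.com/company/example"],
)
# internal == ["https://example.com/contact"]
# external == ["https://linkedin.com/company/example"]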

Use Cases

  • AI Training Data: Prepare domain-specific training datasets for LLM fine-tuning (a JSONL export is sketched after this list)
  • Knowledge Base Creation: Build searchable knowledge bases from public documentation
  • Content Analysis: Extract and analyze website content for SEO, structure, and quality
  • Competitive Intelligence: Monitor competitor website content and changes
  • Documentation Archival: Create offline backups of documentation sites
  • Data Enrichment: Combine with other data sources for comprehensive analysis
  • Research: Collect structured data from multiple related websites
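
For the AI-training use case, dataset items convert naturally to JSONL, a common input format for fine-tuning pipelines. A hedged sketch, assuming the results are already loaded as a list of dicts with the fields from the Output table (min_words is an illustrative threshold, not an actor parameter):

import json

def to_jsonl(pages, path, min_words=50):
    """Write one JSON object per line, skipping near-empty pages."""
    with open(path, "w", encoding="utf-8") as f:
        for page in pages:
            if page.get("word_count", 0) < min_words:
                continue  # thin pages add noise to a training set
            record = {
                "url": page["url"],
                "title": page.get("title", ""),
                "text": page["text_content"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")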

Example Output

{
  "url": "https://example.com/about",
  "title": "About Example Corporation",
  "description": "Learn about Example Corp's mission, values, and team.",
  "h1": ["About Example Corporation"],
  "h2": ["Our Mission", "Our Values", "Our Team"],
  "headings": [
    {"level": 1, "text": "About Example Corporation"},
    {"level": 2, "text": "Our Mission"},
    {"level": 2, "text": "Our Values"},
    {"level": 2, "text": "Our Team"}
  ],
  "text_content": "Example Corporation is a leading provider of innovative solutions...",
  "links": [
    {"text": "Home", "url": "https://example.com/"},
    {"text": "Contact", "url": "https://example.com/contact"}
  ],
  "links_internal": ["https://example.com/", "https://example.com/contact"],
  "links_external": ["https://linkedin.com/company/example"],
  "word_count": 2847,
  "crawl_depth": 0
}
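
Items in this shape land in the run's default dataset, which can be read back with the same apify-client package used above (token and dataset ID are placeholders):

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
dataset_id = "YOUR_DATASET_ID"  # run["defaultDatasetId"] from the call above

# Stream all crawled pages and compute a quick corpus summary.
pages = list(client.dataset(dataset_id).iterate_items())
total_words = sum(p.get("word_count", 0) for p in pages)
print(f"{len(pages)} pages, {total_words} words total")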

Limitations

  • JavaScript-rendered content may not be captured (static HTML only)
  • Very large sites (10000+ pages) may take extended time
  • Password-protected pages cannot be accessed
  • Some sites disallow crawler user-agents via robots.txt
  • PDF and binary files are extracted as links only, not content
  • Rate limiting on target domain may slow crawl speed
  • Exclude patterns require regex knowledge (a quick local dry run is sketched below)
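
Exclude patterns can be sanity-checked locally before committing to a long crawl. A small sketch with Python's re module, assuming patterns are matched against the full URL (the patterns and URLs here are illustrative):

import re

exclude_patterns = [r".*\.pdf$", r".*/downloads/.*"]
urls = [
    "https://example.com/whitepaper.pdf",
    "https://example.com/downloads/archive.zip",
    "https://example.com/about",
]

compiled = [re.compile(p) for p in exclude_patterns]
for url in urls:
    excluded = any(p.match(url) for p in compiled)
    print("SKIP " if excluded else "CRAWL", url)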

Cost & Performance

Typical runs cost $0.20-$2.00 in platform credits, depending on site size. Processing time is roughly 1-5 minutes per 100 pages and scales approximately linearly with page count, so a 1,000-page crawl typically finishes in about 10-50 minutes.


Built by nexgendata. Questions or issues? Check the documentation or open an issue.