Website Content Crawler
Crawl websites and extract clean, structured content for AI training, knowledge bases, and content analysis. Extracts text, headings, links, and metadata while respecting site structure and domain boundaries.
Features
- Full-Site Crawling: Discover and extract content from all accessible pages
- Clean Text Extraction: Strips boilerplate and navigation to return only the main page content
- Structured Data: Captures headings hierarchy, links, metadata, and descriptions
- Configurable Depth: Control crawl depth, from a single page to unlimited recursion
- Domain Boundaries: Stay within domain or crawl subdomains as needed
- Pattern Exclusion: Skip URLs matching exclude patterns (PDFs, archives, etc.)
- Multiple Formats: Output as JSON, CSV, or HTML for downstream processing
- AI/LLM Ready: Perfect for training data collection and fine-tuning datasets
Input Parameters
| Parameter | Type | Required | Description | Default |
|---|---|---|---|---|
| start_urls | Array | Yes | URLs to start crawling from | [] |
| max_pages | Integer | No | Maximum pages to crawl (1-10000) | 100 |
| max_depth | Integer | No | Maximum crawl depth (0 = single page only) | 3 |
| same_domain | Boolean | No | Only crawl links within same domain | true |
| exclude_patterns | Array | No | Regex patterns to exclude (e.g., [".*\.pdf$", ".*downloads.*"]) | [] |
| output_format | String | No | Output format: json, csv, or html | json |
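A run input that sets every parameter in the table might look like the following (the URLs and patterns are illustrative, not defaults):

```json
{
  "start_urls": ["https://example.com"],
  "max_pages": 500,
  "max_depth": 3,
  "same_domain": true,
  "exclude_patterns": [".*\\.pdf$", ".*downloads.*"],
  "output_format": "json"
}
```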
Output
| Field | Type | Description |
|---|---|---|
| url | String | Page URL |
| title | String | Page title from the `<title>` tag |
| description | String | Meta description |
| h1 | Array | All H1 headings on page |
| h2 | Array | All H2 headings on page |
| headings | Array | All headings (h1-h6) with hierarchy |
| text_content | String | Clean extracted text content |
| links | Array | All links found on page |
| links_internal | Array | Links to pages within same domain |
| links_external | Array | Links to external domains |
| word_count | Integer | Total words in text content |
| crawl_depth | Integer | Depth at which page was discovered |
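Results land in the run's default dataset, one item per crawled page with the fields above. A minimal Python sketch using the official apify-client package (the actor ID `nexgendata/website-content-crawler` is a placeholder; substitute the ID shown on this page):

```python
from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start a crawl and wait for it to finish.
# NOTE: the actor ID below is a placeholder, not confirmed by this page.
run = client.actor("nexgendata/website-content-crawler").call(
    run_input={
        "start_urls": ["https://example.com"],
        "max_pages": 100,
        "max_depth": 2,
    }
)

# Each dataset item carries the fields from the Output table above
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["word_count"], len(item["links_internal"]))
```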
Use Cases
- AI Training Data: Prepare domain-specific training datasets for LLM fine-tuning
- Knowledge Base Creation: Build searchable knowledge bases from public documentation
- Content Analysis: Extract and analyze website content for SEO, structure, and quality
- Competitive Intelligence: Monitor competitor website content and changes
- Documentation Archival: Create offline backups of documentation sites
- Data Enrichment: Combine with other data sources for comprehensive analysis
- Research: Collect structured data from multiple related websites
Example Output
{"url": "https://example.com/about","title": "About Example Corporation","description": "Learn about Example Corp's mission, values, and team.","h1": ["About Example Corporation"],"h2": ["Our Mission", "Our Values", "Our Team"],"headings": [{"level": 1, "text": "About Example Corporation"},{"level": 2, "text": "Our Mission"},{"level": 2, "text": "Our Values"}],"text_content": "Example Corporation is a leading provider of innovative solutions...","links": [{"text": "Home", "url": "https://example.com/"},{"text": "Contact", "url": "https://example.com/contact"}],"links_internal": ["https://example.com/", "https://example.com/contact"],"links_external": ["https://linkedin.com/company/example"],"word_count": 2847,"crawl_depth": 0}
Limitations
- JavaScript-rendered content may not be captured (static HTML only)
- Very large sites (10,000+ pages) can take a long time to crawl
- Password-protected pages cannot be accessed
- Some sites disallow crawler user-agents via robots.txt
- PDF and binary files are extracted as links only, not content
- Rate limiting on target domain may slow crawl speed
- Exclude patterns require regex knowledge
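On that last point, here is a short sketch of how exclude patterns typically behave, assuming the actor applies each pattern as a standard regex search against the full URL (the exact matching semantics are not documented here):

```python
import re

exclude_patterns = [r".*\.pdf$", r".*downloads.*"]

def is_excluded(url: str) -> bool:
    """Return True if any exclude pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in exclude_patterns)

print(is_excluded("https://example.com/files/report.pdf"))    # True
print(is_excluded("https://example.com/downloads/tool.zip"))  # True
print(is_excluded("https://example.com/about"))               # False
```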
Cost & Performance
Typical runs cost $0.20-$2.00 in platform credits, depending on site size. Processing time is roughly 1-5 minutes per 100 pages and scales linearly with page count, so a 1,000-page crawl takes on the order of 10-50 minutes.
Built by nexgendata. Questions or issues? Check the documentation or open an issue.