Website Content Crawler for LLM's
Pricing
Pay per usage
Extract contact information and turn any website into clean, structured content ready for LLMs (e.g. AI lead magnets, RAG pipelines, and outbound personalization). Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for LLMs and optimized for lead generation.
Developer

SalesBlaster AI
LLM-Optimized Website Content Crawler
Extract contact information and turn any website into clean, structured content ready for LLMs (AI lead magnets, RAG pipelines, and outbound personalization).
Why This Actor?
Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for AI workflows — it extracts only the meaningful content, splits it into semantically coherent chunks with heading context, and scores each chunk for quality. The result: content your LLM can actually use without drowning in nav menus, cookie banners, and boilerplate.
Built for agency owners and outbound teams who use AI lead magnets to start conversations with prospects.
Use Cases
AI Lead Magnets
Crawl a prospect's website before generating a personalized audit, report, or strategy doc. Feed the chunks directly into your LLM to produce a lead magnet that references real details from their site — not generic filler.
- AI Automation Agency: Crawl their site and generate a custom n8n workflow or automation map personalized to their business processes
- Paid Ads Agency: Crawl their brand and product pages to generate AI video and image ad creatives for Meta, tailored to their offer
- Web Design Agency: Crawl their existing site and generate a fully custom landing page based on their real content and messaging
- SEO Agency: Crawl their site to produce a personalized SEO audit and competitor analysis with page-level recommendations
- Lead Gen Agency: Crawl their offer and ICP pages to generate sample cold email scripts and LinkedIn outbound sequences
- Sales Agency: Crawl their sales pages to build a free AI voice mock call agent or custom sales scripts for their offer
- Content Agency: Crawl their brand voice and existing content to generate a custom content calendar with sample carousel posts
RAG Knowledge Bases
Build a searchable knowledge base from any website. Chunks come pre-tagged with heading paths and content types, so you can filter by topic before stuffing your context window.
Outbound Personalization
Extract key details from a prospect's website to personalize cold outreach at scale. The contact extraction feature pulls emails, phone numbers, and social profiles automatically.
How It Works
Website URL → Sitemap Discovery → Page Crawling → Content Extraction → Semantic Chunking → Quality Scoring → Contact Extraction (optional)
- Discover pages — Finds pages via sitemap.xml or by following links (configurable strategy)
- Extract content — Uses Mozilla Readability to strip nav, footer, ads, and boilerplate from each page
- Chunk by headings — Splits content along the heading hierarchy so each chunk has semantic context (e.g., "About > Team > Leadership")
- Score quality — Assigns a quality score, content type, and link density metric to each chunk
- Extract contacts — Deduplicates emails, phone numbers, and social links across all crawled pages
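The heading-based chunking step can be sketched roughly as follows. This is a minimal illustration, not the actor's actual implementation: it assumes the extracted content is already markdown with ATX-style `#` headings, and `chunk_by_headings` is a hypothetical helper name.

```python
import re

# Matches markdown ATX headings such as "## Team" (levels 1 to 6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading section, tagged with its heading path.

    Simplification: lines starting with '#' inside fenced code blocks
    would be misread as headings.
    """
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # open headings as (level, title)
    body: list[str] = []               # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "headingPath": " > ".join(title for _, title in stack),
                "markdown": text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Tracking open headings on a stack is what lets a chunk under "Leadership" carry the full path "About > Team > Leadership" rather than only its nearest heading.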
Input
| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Website URL to crawl |
| maxPages | number | 20 | Maximum pages to crawl |
| maxConcurrency | number | 5 | Concurrent page requests |
| sitemapStrategy | enum | "AUTO" | One of "AUTO", "SITEMAP_FIRST", "CRAWL_LINKS" |
| includePaths | string[] | [] | Only crawl URLs matching these path prefixes (e.g., ["/blog"]) |
| excludePaths | string[] | common defaults | Skip URLs matching these path prefixes |
| excludeUrlRegex | string | media/binary files | Regex pattern to exclude URLs |
| chunkingOptions.maxChars | number | 2000 | Max characters per chunk |
| chunkingOptions.overlapChars | number | 200 | Overlap between consecutive chunks |
| extractContacts | boolean | true | Extract emails, phones, and social links |
| datasetName | string | "default" | Name for the output dataset |
Output
Content Chunks (Dataset)
Each crawled page produces one or more chunk records:
{
  "site": "example.com",
  "url": "https://example.com/about",
  "title": "About Us",
  "chunkIndex": 0,
  "chunkCount": 3,
  "headingPath": "About > Team > Leadership",
  "markdown": "# Team\n\nOur leadership team...",
  "contentType": "marketing",
  "quality": {
    "score": 85,
    "textLength": 1500,
    "linkDensity": 0.03,
    "hasStructure": true
  },
  "crawledAt": "2026-01-09T12:00:00Z",
  "datasetName": "my-crawl"
}
Content types: blog, docs, legal, product, marketing, other
The headingPath field gives your LLM the section context without needing to process the entire page — useful for filtering chunks by topic or building hierarchical summaries.
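As an illustration of filtering by topic, the sketch below selects chunk records whose `headingPath` starts with a given prefix and whose quality metrics pass a threshold. `select_chunks` is a hypothetical helper; the default `min_score` of 70 follows the tip later in this README, while the `max_link_density` cutoff of 0.3 is purely an assumption.

```python
def select_chunks(
    chunks: list[dict],
    topic_prefix: str = "",
    min_score: int = 70,
    max_link_density: float = 0.3,
) -> list[dict]:
    """Keep chunks under a heading path prefix that pass the quality thresholds.

    Expects records shaped like the chunk output shown above.
    """
    selected = [
        chunk
        for chunk in chunks
        if chunk["headingPath"].startswith(topic_prefix)
        and chunk["quality"]["score"] >= min_score
        and chunk["quality"]["linkDensity"] <= max_link_density
    ]
    # Restore reading order: group by page, then by position within the page.
    return sorted(selected, key=lambda c: (c["url"], c["chunkIndex"]))
```

Sorting by `url` and `chunkIndex` keeps the selected chunks in page order, which tends to matter when they are concatenated into a single prompt.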
Contact Summary (Key-Value Store)
Aggregated contact info across all crawled pages, stored under the OUTPUT key:
{
  "summary": {
    "totalEmails": 5,
    "totalPhones": 3,
    "totalSocialLinks": 8,
    "socialBreakdown": {
      "linkedin": 3,
      "twitter": 2,
      "facebook": 3
    }
  },
  "contacts": {
    "emails": ["contact@example.com", "support@example.com"],
    "phones": ["+14155552671", "+14155552672"],
    "social": [
      {
        "platform": "linkedin",
        "url": "https://linkedin.com/company/example"
      }
    ]
  },
  "crawlStats": {
    "pagesVisited": 20,
    "pagesSkipped": 0,
    "errors": 0
  }
}
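To make the cross-page deduplication concrete, here is a rough sketch of how per-page contacts might be merged into a summary of this shape. The per-page input format, the helper names, and the normalization rules (lowercased emails, phone numbers reduced to digits with an optional leading `+`) are all assumptions for illustration; the actor's real logic may differ.

```python
import re
from typing import Callable, Iterable


def normalize_phone(raw: str) -> str:
    """Strip separators, keeping a leading '+' if present (assumed normalization)."""
    prefix = "+" if raw.strip().startswith("+") else ""
    return prefix + re.sub(r"\D", "", raw)


def dedupe(values: Iterable[str], normalize: Callable[[str], str]) -> list[str]:
    """Return normalized values in first-seen order, without duplicates."""
    seen: set[str] = set()
    unique: list[str] = []
    for value in values:
        key = normalize(value)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def summarize_contacts(pages: list[dict]) -> dict:
    """Merge hypothetical per-page records with 'emails' and 'phones' lists."""
    emails = dedupe(
        (e for page in pages for e in page.get("emails", [])),
        lambda e: e.strip().lower(),
    )
    phones = dedupe(
        (p for page in pages for p in page.get("phones", [])),
        normalize_phone,
    )
    return {
        "summary": {"totalEmails": len(emails), "totalPhones": len(phones)},
        "contacts": {"emails": emails, "phones": phones},
    }
```

Normalizing before comparing is the important part: without it, "Contact@Example.com" and "contact@example.com" would be counted as two contacts.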
Examples
Lead Magnet: Crawl a Prospect's Blog
Crawl their blog content to generate a personalized content audit.
{
  "startUrl": "https://prospect-company.com/blog",
  "maxPages": 50,
  "includePaths": ["/blog"],
  "chunkingOptions": {
    "maxChars": 3000,
    "overlapChars": 300
  }
}
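For intuition about what `includePaths` does in this example, path-prefix filtering can be sketched as below. `should_crawl` is a hypothetical helper, and letting exclusions take precedence over inclusions is an assumption rather than documented behavior.

```python
from urllib.parse import urlparse


def should_crawl(url: str, include_paths=(), exclude_paths=()) -> bool:
    """Decide whether a URL passes simple path-prefix filters.

    Note: plain prefix matching means "/blog" also matches "/blogging".
    """
    path = urlparse(url).path or "/"
    # Assumed precedence: an exclude match always wins.
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    # An empty include list means "no restriction".
    if include_paths:
        return any(path.startswith(prefix) for prefix in include_paths)
    return True
```

With `includePaths` set to `["/blog"]`, a URL such as `https://prospect-company.com/blog/post-1` would pass this check, while `https://prospect-company.com/pricing` would not.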
Lead Magnet: Full Site Audit
Crawl their entire site for a comprehensive UX or SEO review.
{
  "startUrl": "https://prospect-company.com",
  "maxPages": 100,
  "sitemapStrategy": "SITEMAP_FIRST",
  "chunkingOptions": {
    "maxChars": 1500,
    "overlapChars": 150
  }
}
Outbound: Extract Contact Info
Quick crawl focused on finding emails and social profiles.
{
  "startUrl": "https://prospect-company.com",
  "maxPages": 20,
  "extractContacts": true,
  "chunkingOptions": {
    "maxChars": 500,
    "overlapChars": 0
  }
}
Tips
- Start small: Set maxPages to 10-20 for your first run, then increase once you see the output quality
- Use includePaths to focus on the most valuable sections (e.g., /blog, /services, /case-studies)
- Larger chunks (3000+ chars) work better for lead magnet generation; smaller chunks (1000-1500) work better for RAG retrieval
- SITEMAP_FIRST is faster and more complete for well-structured sites; CRAWL_LINKS is better for sites with missing or incomplete sitemaps
- Quality scores above 70 generally indicate high-value content worth including in your LLM prompts
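To see how chunk size and overlap interact when tuning these settings, here is a simplified character-level sketch of a sliding window. It is only an approximation for intuition: `split_with_overlap` is a hypothetical helper, and the actor itself chunks along headings and may cut at different boundaries.

```python
def split_with_overlap(text: str, max_chars: int = 2000, overlap_chars: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, each sharing overlap_chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than max_chars")

    step = max_chars - overlap_chars  # how far each window advances
    pieces: list[str] = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return pieces
```

Each window advances by `max_chars - overlap_chars`, so a larger overlap preserves more context across chunk boundaries at the cost of producing more chunks for the same page.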
Contact
For more information or help, feel free to reach out to the creator: