Website Content Crawler for LLM's

Pricing

Pay per usage

Extract contact information and turn any website into clean, structured content ready for LLMs (e.g. AI lead magnets, RAG pipelines, and outbound personalization). Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for LLMs and optimized for lead generation.

Developer

SalesBlaster AI

Maintained by Community

LLM-Optimized Website Content Crawler

Extract contact information and turn any website into clean, structured content ready for LLMs (AI lead magnets, RAG pipelines, and outbound personalization).

Why This Actor?

Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for AI workflows — it extracts only the meaningful content, splits it into semantically coherent chunks with heading context, and scores each chunk for quality. The result: content your LLM can actually use without drowning in nav menus, cookie banners, and boilerplate.

Built for agency owners and outbound teams who use AI lead magnets to start conversations with prospects.

Use Cases

AI Lead Magnets

Crawl a prospect's website before generating a personalized audit, report, or strategy doc. Feed the chunks directly into your LLM to produce a lead magnet that references real details from their site — not generic filler.

  • AI Automation Agency: Crawl their site and generate a custom n8n workflow or automation map personalized to their business processes
  • Paid Ads Agency: Crawl their brand and product pages to generate AI video or image ad creatives for Meta, tailored to their offer
  • Web Design Agency: Crawl their existing site and generate a fully custom landing page based on their real content and messaging
  • SEO Agency: Crawl their site to produce a personalized SEO audit and competitor analysis with page-level recommendations
  • Lead Gen Agency: Crawl their offer and ICP pages to generate sample cold email scripts and LinkedIn outbound sequences
  • Sales Agency: Crawl their sales pages to build a free AI voice agent for mock sales calls, or custom sales scripts for their offer
  • Content Agency: Crawl their brand voice and existing content to generate a custom content calendar with sample carousel posts

RAG Knowledge Bases

Build a searchable knowledge base from any website. Chunks come pre-tagged with heading paths and content types, so you can filter by topic before stuffing your context window.

Outbound Personalization

Extract key details from a prospect's website to personalize cold outreach at scale. The contact extraction feature pulls emails, phone numbers, and social profiles automatically.

How It Works

Website URL → Sitemap Discovery → Page Crawling → Content Extraction → Semantic Chunking → Quality Scoring → Contact Extraction (optional)

  1. Discover pages — Finds pages via sitemap.xml or by following links (configurable strategy)
  2. Extract content — Uses Mozilla Readability to strip nav, footer, ads, and boilerplate from each page
  3. Chunk by headings — Splits content along the heading hierarchy so each chunk has semantic context (e.g., "About > Team > Leadership")
  4. Score quality — Assigns a quality score, content type, and link density metric to each chunk
  5. Extract contacts — Deduplicates emails, phone numbers, and social links across all crawled pages
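
Step 3 can be illustrated with a minimal sketch. This is not the actor's actual code: the names `Chunk`, `split_with_overlap`, and `chunk_markdown` are hypothetical, and the sketch assumes the extracted content is Markdown with `#`-style headings. It tracks the open headings to build a path such as "About > Team > Leadership", then cuts long sections into overlapping windows, mirroring the `maxChars` and `overlapChars` options.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: str
    text: str


HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_with_overlap(text, max_chars, overlap_chars):
    """Cut text into windows of at most max_chars that share overlap_chars."""
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap_chars
    if step <= 0:
        raise ValueError("overlap_chars must be smaller than max_chars")
    # Stop once the previous window has already reached the end of the text.
    return [text[i:i + max_chars] for i in range(0, len(text) - overlap_chars, step)]


def chunk_markdown(markdown, max_chars=2000, overlap_chars=200):
    """Hypothetical heading-aware chunker; heading lines become the path."""
    stack = []               # (level, title) of the headings currently open
    sections = [("", [])]    # (heading path, body lines)
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # Close headings at the same or a deeper level first.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            sections.append((" > ".join(t for _, t in stack), []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for path, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip sections that contain only sub-headings
            for part in split_with_overlap(body, max_chars, overlap_chars):
                chunks.append(Chunk(path, part))
    return chunks
```

Splitting on character windows is the simplest way to honor a size limit. A production chunker would more likely break on paragraph or sentence boundaries.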

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | required | Website URL to crawl |
| maxPages | number | 20 | Maximum pages to crawl |
| maxConcurrency | number | 5 | Concurrent page requests |
| sitemapStrategy | enum | "AUTO" | "AUTO" / "SITEMAP_FIRST" / "CRAWL_LINKS" |
| includePaths | string[] | [] | Only crawl URLs matching these path prefixes (e.g., ["/blog"]) |
| excludePaths | string[] | common defaults | Skip URLs matching these path prefixes |
| excludeUrlRegex | string | media/binary files | Regex pattern to exclude URLs |
| chunkingOptions.maxChars | number | 2000 | Max characters per chunk |
| chunkingOptions.overlapChars | number | 200 | Overlap between consecutive chunks |
| extractContacts | boolean | true | Extract emails, phones, and social links |
| datasetName | string | "default" | Name for the output dataset |
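
The way the three URL filters might combine can be sketched as follows. This is an illustration under assumptions, not the actor's implementation: the function name `should_crawl` is hypothetical, the order of the checks is a guess, and the example values for `excludePaths` and `excludeUrlRegex` are made up because the page does not list the real defaults.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, include_paths, exclude_paths, exclude_url_regex=""):
    """Hypothetical filter mirroring includePaths, excludePaths, excludeUrlRegex."""
    path = urlparse(url).path or "/"
    # Assumption: the regex is matched against the full URL.
    if exclude_url_regex and re.search(exclude_url_regex, url):
        return False
    # Assumption: exclusions win over inclusions.
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    # An empty include list means "no restriction", matching the [] default.
    if include_paths and not any(path.startswith(p) for p in include_paths):
        return False
    return True


# Made-up filter values for illustration only.
media_regex = r"\.(png|jpe?g|gif|pdf|zip)$"
print(should_crawl("https://example.com/blog/post-1", ["/blog"], ["/cart"], media_regex))
print(should_crawl("https://example.com/blog/logo.png", ["/blog"], ["/cart"], media_regex))
```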

Output

Content Chunks (Dataset)

Each crawled page produces one or more chunk records:

{
  "site": "example.com",
  "url": "https://example.com/about",
  "title": "About Us",
  "chunkIndex": 0,
  "chunkCount": 3,
  "headingPath": "About > Team > Leadership",
  "markdown": "# Team\n\nOur leadership team...",
  "contentType": "marketing",
  "quality": {
    "score": 85,
    "textLength": 1500,
    "linkDensity": 0.03,
    "hasStructure": true
  },
  "crawledAt": "2026-01-09T12:00:00Z",
  "datasetName": "my-crawl"
}

Content types: blog, docs, legal, product, marketing, other

The headingPath field gives your LLM the section context without needing to process the entire page — useful for filtering chunks by topic or building hierarchical summaries.
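
As a rough illustration of topic filtering, the sketch below selects chunks whose `headingPath` sits at or below a given section. The helper name `chunks_under` is hypothetical, and the sketch assumes path segments are always separated by ">" as in the sample record above.

```python
def chunks_under(chunks, heading_prefix):
    """Return chunks whose headingPath equals the prefix or sits below it."""
    prefix_parts = [part.strip() for part in heading_prefix.split(">")]
    selected = []
    for chunk in chunks:
        parts = [part.strip() for part in chunk.get("headingPath", "").split(">")]
        # Compare whole segments so "About" does not match "About Us".
        if parts[:len(prefix_parts)] == prefix_parts:
            selected.append(chunk)
    return selected


# Sample records shaped like the dataset items (fields trimmed for brevity).
chunks = [
    {"headingPath": "About > Team > Leadership", "markdown": "Our leadership team..."},
    {"headingPath": "About > History", "markdown": "Founded..."},
    {"headingPath": "Services > SEO", "markdown": "We offer..."},
]
team_chunks = chunks_under(chunks, "About > Team")
```

Comparing whole segments rather than raw string prefixes avoids accidental matches between sections with similar names.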

Contact Summary (Key-Value Store)

Aggregated contact info across all crawled pages, stored under the OUTPUT key:

{
  "summary": {
    "totalEmails": 5,
    "totalPhones": 3,
    "totalSocialLinks": 8,
    "socialBreakdown": {
      "linkedin": 3,
      "twitter": 2,
      "facebook": 3
    }
  },
  "contacts": {
    "emails": ["contact@example.com", "support@example.com"],
    "phones": ["+14155552671", "+14155552672"],
    "social": [
      {
        "platform": "linkedin",
        "url": "https://linkedin.com/company/example"
      }
    ]
  },
  "crawlStats": {
    "pagesVisited": 20,
    "pagesSkipped": 0,
    "errors": 0
  }
}
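
One possible way to consume this record is to flatten it into a single row for a CRM or outreach sheet. The sketch below is an assumption-laden example: `to_lead_row` and its output column names are hypothetical, and treating the first email or phone as the "primary" one is a simplification, since the record does not rank contacts.

```python
def to_lead_row(site, record):
    """Flatten a contact summary record into one row (hypothetical columns)."""
    contacts = record.get("contacts", {})
    emails = contacts.get("emails", [])
    phones = contacts.get("phones", [])
    social = {}
    for link in contacts.get("social", []):
        # Keep the first URL seen for each platform.
        social.setdefault(link["platform"], link["url"])
    return {
        "site": site,
        # Assumption: the first entry is good enough as the primary contact.
        "primary_email": emails[0] if emails else None,
        "primary_phone": phones[0] if phones else None,
        "linkedin": social.get("linkedin"),
        "pages_visited": record.get("crawlStats", {}).get("pagesVisited", 0),
    }


record = {
    "contacts": {
        "emails": ["contact@example.com", "support@example.com"],
        "phones": ["+14155552671", "+14155552672"],
        "social": [{"platform": "linkedin", "url": "https://linkedin.com/company/example"}],
    },
    "crawlStats": {"pagesVisited": 20, "pagesSkipped": 0, "errors": 0},
}
row = to_lead_row("example.com", record)
```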

Examples

Lead Magnet: Crawl a Prospect's Blog

Crawl their blog content to generate a personalized content audit.

{
  "startUrl": "https://prospect-company.com/blog",
  "maxPages": 50,
  "includePaths": ["/blog"],
  "chunkingOptions": {
    "maxChars": 3000,
    "overlapChars": 300
  }
}

Lead Magnet: Full Site Audit

Crawl their entire site for a comprehensive UX or SEO review.

{
  "startUrl": "https://prospect-company.com",
  "maxPages": 100,
  "sitemapStrategy": "SITEMAP_FIRST",
  "chunkingOptions": {
    "maxChars": 1500,
    "overlapChars": 150
  }
}

Outbound: Extract Contact Info

Quick crawl focused on finding emails and social profiles.

{
  "startUrl": "https://prospect-company.com",
  "maxPages": 20,
  "extractContacts": true,
  "chunkingOptions": {
    "maxChars": 500,
    "overlapChars": 0
  }
}

Tips

  • Start small: Set maxPages to 10-20 for your first run, then increase once you see the output quality
  • Use includePaths to focus on the most valuable sections (e.g., /blog, /services, /case-studies)
  • Larger chunks (3000+ chars) work better for lead magnet generation; smaller chunks (1000-1500) work better for RAG retrieval
  • SITEMAP_FIRST is faster and more complete for well-structured sites; CRAWL_LINKS is better for sites with missing or incomplete sitemaps
  • Quality scores above 70 generally indicate high-value content worth including in your LLM prompts
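
The quality-score tip can be applied with a small selection step before building a prompt. The following is a sketch, not part of the actor: `select_for_prompt` and the `max_total_chars` budget are hypothetical, and the field names follow the sample chunk record shown under Output.

```python
def select_for_prompt(chunks, min_score=70, max_total_chars=12000):
    """Pick chunks scoring above min_score that fit a character budget."""
    good = [c for c in chunks if c["quality"]["score"] > min_score]
    # Take the best-scoring chunks first so the budget goes to them.
    good.sort(key=lambda c: c["quality"]["score"], reverse=True)
    picked, used = [], 0
    for chunk in good:
        size = len(chunk["markdown"])
        if used + size > max_total_chars:
            continue  # too large for the remaining budget, try smaller ones
        picked.append(chunk)
        used += size
    # Restore reading order so the prompt follows the site structure.
    picked.sort(key=lambda c: (c["url"], c["chunkIndex"]))
    return picked


sample = [
    {"url": "https://example.com/about", "chunkIndex": 0,
     "markdown": "x" * 100, "quality": {"score": 85}},
    {"url": "https://example.com/about", "chunkIndex": 1,
     "markdown": "y" * 100, "quality": {"score": 60}},
    {"url": "https://example.com/blog", "chunkIndex": 0,
     "markdown": "z" * 100, "quality": {"score": 75}},
]
context = "\n\n".join(c["markdown"] for c in select_for_prompt(sample))
```

The budget is counted in characters to stay consistent with `maxChars`. A token-based budget would be closer to what an LLM context window actually measures.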

Contact

For more information or help, feel free to reach out to the creator: