Under maintenance

Pricing

Pay per usage

Website Content Crawler for LLM's

Extract contact information and turn any website into clean, structured content ready for LLMs (e.g., AI lead magnets, RAG pipelines, and outbound personalization). Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for LLMs and optimized for lead generation.

Rating

0.0

(0)

Developer

SalesBlaster AI

Last modified: 5 months ago

LLM-Optimized Website Content Crawler

Extract contact information and turn any website into clean, structured content ready for LLMs (AI lead magnets, RAG pipelines, and outbound personalization).

Why This Actor?

Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for AI workflows — it extracts only the meaningful content, splits it into semantically coherent chunks with heading context, and scores each chunk for quality. The result: content your LLM can actually use without drowning in nav menus, cookie banners, and boilerplate.

Built for agency owners and outbound teams who use AI lead magnets to start conversations with prospects.

Use Cases

AI Lead Magnets

Crawl a prospect's website before generating a personalized audit, report, or strategy doc. Feed the chunks directly into your LLM to produce a lead magnet that references real details from their site — not generic filler.

AI Automation Agency: Crawl their site and generate a custom n8n workflow or automation map personalized to their business processes
Paid Ads Agency: Crawl their brand and product pages to generate AI video/picture Meta ad creatives tailored to their offer
Web Design Agency: Crawl their existing site and generate a fully custom landing page based on their real content and messaging
SEO Agency: Crawl their site to produce a personalized SEO audit and competitor analysis with page-level recommendations
Lead Gen Agency: Crawl their offer and ICP pages to generate sample cold email scripts and LinkedIn outbound sequences
Sales Agency: Crawl their sales pages to build a free AI voice mock call agent or custom sales scripts for their offer
Content Agency: Crawl their brand voice and existing content to generate a custom content calendar with sample carousel posts

RAG Knowledge Bases

Build a searchable knowledge base from any website. Chunks come pre-tagged with heading paths and content types, so you can filter by topic before stuffing your context window.

Outbound Personalization

Extract key details from a prospect's website to personalize cold outreach at scale. The contact extraction feature pulls emails, phone numbers, and social profiles automatically.

How It Works

Website URL → Sitemap Discovery → Page Crawling → Content Extraction → Semantic Chunking → Quality Scoring
                                                  → Contact Extraction (optional)

Discover pages — Finds pages via sitemap.xml or by following links (configurable strategy)
Extract content — Uses Mozilla Readability to strip nav, footer, ads, and boilerplate from each page
Chunk by headings — Splits content along the heading hierarchy so each chunk has semantic context (e.g., "About > Team > Leadership")
Score quality — Assigns a quality score, content type, and link density metric to each chunk
Extract contacts — Deduplicates emails, phone numbers, and social links across all crawled pages
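
The heading-based chunking step can be illustrated with a minimal Python sketch. This is not the actor's actual implementation: the function name `chunk_by_headings` and the regex are assumptions made for illustration, and the sketch only handles ATX-style Markdown headings (`#`, `##`, ...). It shows how a heading stack can yield a path such as "About > Team > Leadership" for each section.

```python
import re

# Matches ATX headings such as "## Team" (assumption: content is already Markdown).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section, tagged with its heading path."""
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # (level, title) ancestry of the current section
    body: list[str] = []               # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "headingPath": " > ".join(title for _, title in stack),
                "markdown": text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level before entering the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


page = "# About\nIntro text.\n## Team\nWe are small.\n### Leadership\nJane leads.\n## Careers\nWe are hiring."
paths = [c["headingPath"] for c in chunk_by_headings(page)]
```

Each chunk carries its own ancestry, so a downstream consumer can tell where the text came from without seeing the rest of the page.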

Input

Field	Type	Default	Description
`startUrl`	string	required	Website URL to crawl
`maxPages`	number	20	Maximum pages to crawl
`maxConcurrency`	number	5	Concurrent page requests
`sitemapStrategy`	enum	`"AUTO"`	`"AUTO"` / `"SITEMAP_FIRST"` / `"CRAWL_LINKS"`
`includePaths`	string[]	`[]`	Only crawl URLs matching these path prefixes (e.g., `["/blog"]`)
`excludePaths`	string[]	common defaults	Skip URLs matching these path prefixes
`excludeUrlRegex`	string	media/binary files	Regex pattern to exclude URLs
`chunkingOptions.maxChars`	number	2000	Max characters per chunk
`chunkingOptions.overlapChars`	number	200	Overlap between consecutive chunks
`extractContacts`	boolean	`true`	Extract emails, phones, and social links
`datasetName`	string	`"default"`	Name for the output dataset
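
To make the relationship between `chunkingOptions.maxChars` and `chunkingOptions.overlapChars` concrete, here is a small Python sketch of a fixed-size window with overlap. It is an assumption about how such options typically behave, not a description of the actor's internals (which may also respect heading or sentence boundaries); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, max_chars: int = 2000, overlap_chars: int = 200) -> list[str]:
    """Cut text into windows of at most max_chars, each sharing overlap_chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in the range [0, max_chars)")

    step = max_chars - overlap_chars  # how far the window advances each time
    pieces: list[str] = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return pieces


pieces = split_with_overlap("abcdefghij", max_chars=4, overlap_chars=1)
```

With the defaults from the table (2000 and 200), each window would advance by 1800 characters, so the last 200 characters of one chunk reappear at the start of the next.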

Output

Content Chunks (Dataset)

Each crawled page produces one or more chunk records:

{
  "site": "example.com",
  "url": "https://example.com/about",
  "title": "About Us",
  "chunkIndex": 0,
  "chunkCount": 3,
  "headingPath": "About > Team > Leadership",
  "markdown": "# Team\n\nOur leadership team...",
  "contentType": "marketing",
  "quality": {
    "score": 85,
    "textLength": 1500,
    "linkDensity": 0.03,
    "hasStructure": true
  },
  "crawledAt": "2026-01-09T12:00:00Z",
  "datasetName": "my-crawl"
}

Content types: blog, docs, legal, product, marketing, other

The headingPath field gives your LLM the section context without needing to process the entire page — useful for filtering chunks by topic or building hierarchical summaries.
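
As a sketch of topic filtering, the Python snippet below selects chunk records whose `headingPath` starts with a given prefix. It assumes the records have the shape shown above and that path segments are separated by `>`; the helper name `filter_by_heading` is hypothetical.

```python
def filter_by_heading(chunks: list[dict], prefix: str) -> list[dict]:
    """Keep chunks whose headingPath begins with the given path prefix (case-insensitive)."""
    wanted = [part.strip().lower() for part in prefix.split(">")]
    selected = []
    for chunk in chunks:
        parts = [part.strip().lower() for part in chunk.get("headingPath", "").split(">")]
        # Compare whole segments so "About > Team" does not match "About > Teammates".
        if parts[:len(wanted)] == wanted:
            selected.append(chunk)
    return selected


sample = [
    {"headingPath": "About > Team > Leadership", "markdown": "Our leadership team..."},
    {"headingPath": "About > Careers", "markdown": "Open roles..."},
    {"headingPath": "Pricing", "markdown": "Plans..."},
]
team_chunks = filter_by_heading(sample, "About > Team")
about_chunks = filter_by_heading(sample, "about")
```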

Contact Summary (Key-Value Store)

Aggregated contact info across all crawled pages, stored under the OUTPUT key:

{
  "summary": {
    "totalEmails": 5,
    "totalPhones": 3,
    "totalSocialLinks": 8,
    "socialBreakdown": {
      "linkedin": 3,
      "twitter": 2,
      "facebook": 3
    }
  },
  "contacts": {
    "emails": ["contact@example.com", "support@example.com"],
    "phones": ["+14155552671", "+14155552672"],
    "social": [
      {
        "platform": "linkedin",
        "url": "https://linkedin.com/company/example"
      }
    ]
  },
  "crawlStats": {
    "pagesVisited": 20,
    "pagesSkipped": 0,
    "errors": 0
  }
}
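
The following Python sketch shows one way per-page contacts could be deduplicated into a summary shaped like the `summary` and `contacts` objects above. The normalization rules (lowercasing emails, stripping phone punctuation, ignoring trailing slashes in URLs), the per-page input shape, and the function name `summarize_contacts` are all assumptions for illustration; the actor's actual rules are not documented here.

```python
from collections import Counter


def summarize_contacts(pages: list[dict]) -> dict:
    """Merge per-page contact lists into one deduplicated summary."""
    emails: dict[str, str] = {}   # dicts keep first-seen order while deduplicating
    phones: dict[str, str] = {}
    social: dict[str, dict] = {}

    for page in pages:
        for email in page.get("emails", []):
            cleaned = email.strip().lower()
            emails.setdefault(cleaned, cleaned)
        for phone in page.get("phones", []):
            digits = "".join(ch for ch in phone if ch.isdigit())
            normalized = ("+" if phone.strip().startswith("+") else "") + digits
            phones.setdefault(normalized, normalized)
        for link in page.get("social", []):
            url = link["url"].rstrip("/")
            # Deduplicate on the lowercased URL but keep the first-seen spelling.
            social.setdefault(url.lower(), {"platform": link["platform"], "url": url})

    breakdown = Counter(item["platform"] for item in social.values())
    return {
        "summary": {
            "totalEmails": len(emails),
            "totalPhones": len(phones),
            "totalSocialLinks": len(social),
            "socialBreakdown": dict(breakdown),
        },
        "contacts": {
            "emails": list(emails.values()),
            "phones": list(phones.values()),
            "social": list(social.values()),
        },
    }


pages = [
    {"emails": ["Contact@example.com"], "phones": ["+1 (415) 555-2671"],
     "social": [{"platform": "linkedin", "url": "https://linkedin.com/company/example/"}]},
    {"emails": ["contact@example.com", "support@example.com"], "phones": ["+14155552671"],
     "social": [{"platform": "linkedin", "url": "https://linkedin.com/company/example"}]},
]
result = summarize_contacts(pages)
```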

Examples

Lead Magnet: Crawl a Prospect's Blog

Crawl their blog content to generate a personalized content audit.

{
  "startUrl": "https://prospect-company.com/blog",
  "maxPages": 50,
  "includePaths": ["/blog"],
  "chunkingOptions": {
    "maxChars": 3000,
    "overlapChars": 300
  }
}

Lead Magnet: Full Site Audit

Crawl their entire site for a comprehensive UX or SEO review.

{
  "startUrl": "https://prospect-company.com",
  "maxPages": 100,
  "sitemapStrategy": "SITEMAP_FIRST",
  "chunkingOptions": {
    "maxChars": 1500,
    "overlapChars": 150
  }
}

Outbound: Extract Contact Info

Quick crawl focused on finding emails and social profiles.

{
  "startUrl": "https://prospect-company.com",
  "maxPages": 20,
  "extractContacts": true,
  "chunkingOptions": {
    "maxChars": 500,
    "overlapChars": 0
  }
}

Tips

Start small: Set maxPages to 10-20 for your first run, then increase once you see the output quality
Use includePaths to focus on the most valuable sections (e.g., /blog, /services, /case-studies)
Larger chunks (3000+ chars) work better for lead magnet generation; smaller chunks (1000-1500) work better for RAG retrieval
SITEMAP_FIRST is faster and more complete for well-structured sites; CRAWL_LINKS is better for sites with missing or incomplete sitemaps
Quality scores above 70 generally indicate high-value content worth including in your LLM prompts
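
The quality-score tip can be turned into a simple selection step. The Python sketch below keeps chunks scoring above a threshold and packs the best ones into a character budget before they go into a prompt. The `char_budget` parameter and the function name `select_for_prompt` are hypothetical; the threshold of 70 comes from the tip above, and the record shape follows the chunk example in the Output section.

```python
def select_for_prompt(chunks: list[dict], min_score: int = 70, char_budget: int = 8000) -> list[dict]:
    """Pick the highest-scoring chunks above min_score that fit within char_budget characters."""
    eligible = [c for c in chunks if c["quality"]["score"] > min_score]
    eligible.sort(key=lambda c: c["quality"]["score"], reverse=True)

    selected: list[dict] = []
    used = 0
    for chunk in eligible:
        size = len(chunk["markdown"])
        if used + size > char_budget:
            continue  # skip chunks that would overflow; smaller ones may still fit
        selected.append(chunk)
        used += size
    return selected


sample = [
    {"markdown": "a" * 10, "quality": {"score": 85}},
    {"markdown": "b" * 5, "quality": {"score": 60}},
    {"markdown": "c" * 10, "quality": {"score": 90}},
    {"markdown": "d" * 10, "quality": {"score": 75}},
]
picked = select_for_prompt(sample, min_score=70, char_budget=20)
```

Sorting by score before packing means the budget is spent on the strongest content first; a different ordering (for example, original page order) may suit summarization tasks better.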

Contact

For more information or help, feel free to reach out to the creator:
