AI-Powered Smart Web Scraper

Intelligent content extraction from any website using Crawlee + AI. Auto-detects structure, adapts to layout changes, and handles JavaScript rendering. No custom code needed. Extract articles, products, and listings from thousands of pages.

AI Web Scraper

Extract AI-ready content from any website. Clean Markdown output, smart chunking for RAG/embeddings, and structured metadata — optimized for LLM data pipelines.

Features

  • Clean Markdown Output — Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
  • Smart Chunking — Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models (see the sketch after this list).
  • Token Estimation — Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
  • Structured Metadata — Extracts title, description, language, author, publish date, OG images, headings, links, and images.
  • Multi-page Crawling — Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
  • Multiple Output Formats — Markdown (default), plain text, or raw HTML.
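
The chunking approach can be pictured with the minimal sketch below. It is an illustration only, not the actor's actual implementation; it assumes roughly four characters per token and mirrors the chunk size, overlap, and per-chunk fields described elsewhere in this README.

def chunk_paragraphs(text, chunk_size=1000, overlap=100):
    # Illustration only -- not the actor's implementation.
    # Token counts are approximated as ~4 characters per token.
    est_tokens = lambda s: len(s) // 4
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current = [], []

    def flush():
        chunk_text = "\n\n".join(current)
        chunks.append({
            "index": len(chunks),
            "text": chunk_text,
            "tokenEstimate": est_tokens(chunk_text),
            "charCount": len(chunk_text),
        })
        return chunk_text

    for para in paragraphs:
        if current and est_tokens("\n\n".join(current + [para])) > chunk_size:
            previous = flush()
            # Carry the tail of the finished chunk forward as overlap.
            current = [previous[-overlap * 4:], para]
        else:
            current.append(para)
    if current:
        flush()
    return chunks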

Use Cases

  • RAG Pipelines — Feed clean, chunked content into retrieval-augmented generation systems
  • Vector Database Ingestion — Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
  • LLM Fine-tuning Data — Extract structured training data from web sources
  • Knowledge Base Building — Crawl documentation sites and create searchable knowledge bases
  • Content Analysis — Extract and analyze web content at scale

Input

Parameter        | Type     | Default    | Description
urls             | string[] | (required) | URLs to scrape
maxPages         | integer  | 10         | Maximum pages to crawl
outputFormat     | string   | "markdown" | Output format: "markdown", "text", or "html"
chunkSize        | integer  | 1000       | Target chunk size in tokens
chunkOverlap     | integer  | 100        | Overlap between chunks in tokens
excludeSelectors | string[] | []         | Additional CSS selectors to exclude
includeLinks     | boolean  | true       | Include extracted links in metadata
includeImages    | boolean  | true       | Include extracted images in metadata
maxDepth         | integer  | 0          | Crawl depth (0 = provided URLs only)
respectRobotsTxt | boolean  | true       | Respect robots.txt rules
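
For example, a run input that crawls a documentation site two levels deep and produces smaller chunks could look like this (values are illustrative):

{
  "urls": ["https://docs.example.com"],
  "maxDepth": 2,
  "maxPages": 50,
  "chunkSize": 512,
  "chunkOverlap": 50,
  "outputFormat": "markdown"
}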

Output

Each page produces a dataset item with:

{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{ "level": 1, "text": "Main Heading" }],
    "links": [{ "text": "Link Text", "href": "https://..." }],
    "images": [{ "alt": "Image description", "src": "https://..." }]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}

Integration Examples

Pinecone / Vector DB

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # Embed and upsert to your vector database
        # (embed() and index are placeholders for your own embedding
        # function and vector index client)
        embedding = embed(chunk["text"])
        index.upsert([(f"{item['url']}_{chunk['index']}", embedding, {
            "text": chunk["text"],
            "url": item["url"],
            "title": item["metadata"]["title"],
        })])
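
Here, embed() and index stand in for your own embedding function and an already-initialized vector index (for example, a Pinecone index); the actor itself only delivers the chunks and metadata.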

LangChain

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
docs = loader.load()
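
The resulting docs can then be handed to any LangChain vector store, for example FAISS.from_documents(docs, OpenAIEmbeddings()), or plugged into a retrieval chain.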

Chunk Size Recommendations

Embedding Model               | Recommended Chunk Size (tokens)
OpenAI text-embedding-3-small | 500–1000
OpenAI text-embedding-3-large | 1000–2000
Cohere embed-v3               | 256–512
Sentence Transformers         | 256–512
Google Gecko                  | 500–1000
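
For instance, when embedding with Cohere embed-v3 you might set "chunkSize": 512 in the run input, along with a smaller "chunkOverlap" such as 50.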

Pricing

This actor uses pay-per-event pricing at approximately $0.005 per page processed.
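At that rate, processing a 1,000-page site costs roughly $5.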

License

MIT