AI-Powered Smart Web Scraper

Intelligent content extraction from any website using Crawlee + AI. Auto-detects structure, adapts to layout changes, and handles JavaScript rendering. No custom code needed. Extract articles, products, and listings from thousands of pages.

AI Web Scraper

Extract AI-ready content from any website. Clean Markdown output, smart chunking for RAG/embeddings, and structured metadata — optimized for LLM data pipelines.

Features

  • Clean Markdown Output — Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
  • Smart Chunking — Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models (see the sketch after this list).
  • Token Estimation — Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
  • Structured Metadata — Extracts title, description, language, author, publish date, OG images, headings, links, and images.
  • Multi-page Crawling — Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
  • Multiple Output Formats — Markdown (default), plain text, or raw HTML.
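
The chunking approach can be pictured with the minimal sketch below. It is an illustration only, not the actor's actual implementation; it assumes roughly four characters per token and mirrors the chunk size, overlap, and per-chunk fields described elsewhere in this README.

def chunk_paragraphs(text, chunk_size=1000, overlap=100):
    # Illustration only -- not the actor's implementation.
    # Token counts are approximated as ~4 characters per token.
    est_tokens = lambda s: len(s) // 4
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current = [], []

    def flush():
        chunk_text = "\n\n".join(current)
        chunks.append({
            "index": len(chunks),
            "text": chunk_text,
            "tokenEstimate": est_tokens(chunk_text),
            "charCount": len(chunk_text),
        })
        return chunk_text

    for para in paragraphs:
        if current and est_tokens("\n\n".join(current + [para])) > chunk_size:
            previous = flush()
            # Carry the tail of the finished chunk forward as overlap.
            current = [previous[-overlap * 4:], para]
        else:
            current.append(para)
    if current:
        flush()
    return chunks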

Use Cases

  • RAG Pipelines — Feed clean, chunked content into retrieval-augmented generation systems
  • Vector Database Ingestion — Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
  • LLM Fine-tuning Data — Extract structured training data from web sources
  • Knowledge Base Building — Crawl documentation sites and create searchable knowledge bases
  • Content Analysis — Extract and analyze web content at scale

Input

Parameter        | Type     | Default    | Description
urls             | string[] | (required) | URLs to scrape
maxPages         | integer  | 10         | Maximum pages to crawl
outputFormat     | string   | "markdown" | Output format: "markdown", "text", or "html"
chunkSize        | integer  | 1000       | Target chunk size in tokens
chunkOverlap     | integer  | 100        | Overlap between chunks in tokens
excludeSelectors | string[] | []         | Additional CSS selectors to exclude
includeLinks     | boolean  | true       | Include extracted links in metadata
includeImages    | boolean  | true       | Include extracted images in metadata
maxDepth         | integer  | 0          | Crawl depth (0 = provided URLs only)
respectRobotsTxt | boolean  | true       | Respect robots.txt rules
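
For example, a run input that crawls a documentation site two levels deep and produces smaller chunks could look like this (values are illustrative):

{
  "urls": ["https://docs.example.com"],
  "maxDepth": 2,
  "maxPages": 50,
  "chunkSize": 512,
  "chunkOverlap": 50,
  "outputFormat": "markdown"
}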

Output

Each page produces a dataset item with:

{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{ "level": 1, "text": "Main Heading" }],
    "links": [{ "text": "Link Text", "href": "https://..." }],
    "images": [{ "alt": "Image description", "src": "https://..." }]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}

Integration Examples

Pinecone / Vector DB

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # Embed and upsert to your vector database
        # (embed() and index are placeholders for your own embedding
        # function and vector index client)
        embedding = embed(chunk["text"])
        index.upsert([(f"{item['url']}_{chunk['index']}", embedding, {
            "text": chunk["text"],
            "url": item["url"],
            "title": item["metadata"]["title"],
        })])
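
Here, embed() and index stand in for your own embedding function and an already-initialized vector index (for example, a Pinecone index); the actor itself only delivers the chunks and metadata.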

LangChain

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
docs = loader.load()
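
The resulting docs can then be handed to any LangChain vector store, for example FAISS.from_documents(docs, OpenAIEmbeddings()), or plugged into a retrieval chain.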

Chunk Size Recommendations

Embedding Model               | Recommended Chunk Size (tokens)
OpenAI text-embedding-3-small | 500–1000
OpenAI text-embedding-3-large | 1000–2000
Cohere embed-v3               | 256–512
Sentence Transformers         | 256–512
Google Gecko                  | 500–1000
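
For instance, when embedding with Cohere embed-v3 you might set "chunkSize": 512 in the run input, along with a smaller "chunkOverlap" such as 50.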

Pricing

This actor uses pay-per-event pricing at approximately $0.005 per page processed.
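At that rate, processing a 1,000-page site costs roughly $5.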

License

MIT