AI-Powered Smart Web Scraper

Intelligent content extraction from any website using Crawlee + AI. Auto-detects structure, adapts to layout changes, handles JavaScript rendering. No custom code needed. Extract articles, products, and listings from thousands of pages.

Pricing: from $5.00 / 1,000 results
Developer: cloud9 (Maintained by Community)
Last modified: a month ago

AI Web Scraper

Extract AI-ready content from any website. Clean Markdown output, smart chunking for RAG/embeddings, and structured metadata — optimized for LLM data pipelines.

Features

  • Clean Markdown Output — Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
  • Smart Chunking — Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models.
  • Token Estimation — Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
  • Structured Metadata — Extracts title, description, language, author, publish date, OG images, headings, links, and images.
  • Multi-page Crawling — Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
  • Multiple Output Formats — Markdown (default), plain text, or raw HTML.
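The paragraph-aware chunking described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actor's actual implementation: it greedily packs paragraphs up to the target size, carries a trailing overlap into the next chunk, and uses the rough ~4 characters per token heuristic in place of a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (assumption, not a real tokenizer)
    return max(1, len(text) // 4)

def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Greedily pack paragraphs into chunks of ~chunk_size tokens,
    carrying ~overlap tokens of trailing text into the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}".strip()
        if current and estimate_tokens(candidate) > chunk_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one for overlap
            tail = current[-overlap * 4:]
            current = f"{tail}\n\n{para}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"index": i, "text": c, "tokenEstimate": estimate_tokens(c), "charCount": len(c)}
        for i, c in enumerate(chunks)
    ]
```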

Use Cases

  • RAG Pipelines — Feed clean, chunked content into retrieval-augmented generation systems
  • Vector Database Ingestion — Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
  • LLM Fine-tuning Data — Extract structured training data from web sources
  • Knowledge Base Building — Crawl documentation sites and create searchable knowledge bases
  • Content Analysis — Extract and analyze web content at scale

Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | (required) | URLs to scrape |
| `maxPages` | integer | 10 | Maximum pages to crawl |
| `outputFormat` | string | `"markdown"` | Output format: `"markdown"`, `"text"`, or `"html"` |
| `chunkSize` | integer | 1000 | Target chunk size in tokens |
| `chunkOverlap` | integer | 100 | Overlap between chunks in tokens |
| `excludeSelectors` | string[] | `[]` | Additional CSS selectors to exclude |
| `includeLinks` | boolean | true | Include extracted links in metadata |
| `includeImages` | boolean | true | Include extracted images in metadata |
| `maxDepth` | integer | 0 | Crawl depth (0 = provided URLs only) |
| `respectRobotsTxt` | boolean | true | Respect robots.txt rules |
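A complete input object combining these parameters might look like this (the URL and selector values are illustrative):

```json
{
  "urls": ["https://docs.example.com"],
  "maxPages": 50,
  "outputFormat": "markdown",
  "chunkSize": 512,
  "chunkOverlap": 50,
  "excludeSelectors": [".promo-banner"],
  "includeLinks": true,
  "includeImages": false,
  "maxDepth": 2,
  "respectRobotsTxt": true
}
```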

Output

Each page produces a dataset item with:

```json
{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{ "level": 1, "text": "Main Heading" }],
    "links": [{ "text": "Link Text", "href": "https://..." }],
    "images": [{ "alt": "Image description", "src": "https://..." }]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
```
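Note the relationship between `charCount` and `tokenEstimate` in the sample chunk above: the numbers are consistent with the common ~4 characters per token rule of thumb (an observation from the sample, not documented behavior). A quick check:

```python
def estimate_tokens(char_count: int) -> int:
    # ~4 characters per token, a common rule of thumb for English text
    return char_count // 4

# Matches the sample chunk above: 980 characters -> 245 tokens
print(estimate_tokens(980))  # → 245
```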

Integration Examples

Pinecone / Vector DB

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("your-username/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # `embed` and `index` are placeholders for your embedding
        # function and vector database index client
        embedding = embed(chunk["text"])
        index.upsert([(f"{item['url']}_{chunk['index']}", embedding, {
            "text": chunk["text"],
            "url": item["url"],
            "title": item["metadata"]["title"],
        })])
```

LangChain

```python
# In recent LangChain versions this loader lives in langchain_community:
# from langchain_community.document_loaders import ApifyDatasetLoader
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

# `run` is the actor run from the previous example
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
docs = loader.load()
```

Chunk Size Recommendations

| Embedding Model | Recommended Chunk Size (tokens) |
|---|---|
| OpenAI text-embedding-3-small | 500–1000 |
| OpenAI text-embedding-3-large | 1000–2000 |
| Cohere embed-v3 | 256–512 |
| Sentence Transformers | 256–512 |
| Google Gecko | 500–1000 |
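If you select `chunkSize` programmatically, a small lookup keyed on the model name keeps the run input in sync with the table. This is a hypothetical helper (the names and the choice of each range's upper bound are illustrative, not part of the actor):

```python
# Hypothetical mapping from model name to the upper end of each
# recommended range in the table above (illustrative values)
RECOMMENDED_CHUNK_SIZE = {
    "text-embedding-3-small": 1000,
    "text-embedding-3-large": 2000,
    "embed-v3": 512,
    "sentence-transformers": 512,
    "gecko": 1000,
}

def chunk_size_for(model: str, default: int = 1000) -> int:
    # Fall back to a middle-of-the-road default for unknown models
    return RECOMMENDED_CHUNK_SIZE.get(model, default)
```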

Pricing

This actor uses pay-per-event pricing at approximately $0.005 per page processed ($5.00 per 1,000 results).
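Cost therefore scales linearly with the number of pages processed, so you can estimate a run's cost up front. The page count here is illustrative:

```python
PRICE_PER_PAGE = 0.005  # $5.00 / 1,000 results

def estimated_cost(pages: int) -> float:
    # Approximate run cost in USD at the pay-per-event rate
    return pages * PRICE_PER_PAGE

# e.g. crawling a 400-page documentation site
print(f"${estimated_cost(400):.2f}")  # → $2.00
```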

License

MIT