AI-Powered Smart Web Scraper
Pricing
from $5.00 / 1,000 results
Intelligent content extraction from any website using Crawlee + AI. Auto-detects structure, adapts to layout changes, and handles JavaScript rendering. No custom code needed. Extracts articles, products, and listings from thousands of pages.
Developer: cloud9 (Maintained by Community)
Last modified: a month ago
AI Web Scraper
Extract AI-ready content from any website. Clean Markdown output, smart chunking for RAG/embeddings, and structured metadata — optimized for LLM data pipelines.
Features
- Clean Markdown Output — Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
- Smart Chunking — Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models.
- Token Estimation — Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
- Structured Metadata — Extracts title, description, language, author, publish date, OG images, headings, links, and images.
- Multi-page Crawling — Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
- Multiple Output Formats — Markdown (default), plain text, or raw HTML.
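The chunking and token-estimation behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actor's actual implementation; the ~4 characters per token heuristic is an assumption, and a single oversized paragraph may still exceed the target size:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Paragraph-aware splitting with overlap.

    chunk_size and chunk_overlap are in estimated tokens,
    using a rough heuristic of ~4 characters per token.
    """
    max_chars = chunk_size * 4
    overlap_chars = chunk_overlap * 4
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # carry the tail of the previous chunk forward as overlap
            current = current[-overlap_chars:] if overlap_chars else ""
        current = (current + "\n\n" + para).strip() if current else para
    if current:
        chunks.append(current)

    return [
        {"index": i, "text": c, "tokenEstimate": len(c) // 4, "charCount": len(c)}
        for i, c in enumerate(chunks)
    ]
```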
Use Cases
- RAG Pipelines — Feed clean, chunked content into retrieval-augmented generation systems
- Vector Database Ingestion — Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
- LLM Fine-tuning Data — Extract structured training data from web sources
- Knowledge Base Building — Crawl documentation sites and create searchable knowledge bases
- Content Analysis — Extract and analyze web content at scale
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl |
| outputFormat | string | "markdown" | Output format: "markdown", "text", or "html" |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap between chunks in tokens |
| excludeSelectors | string[] | [] | Additional CSS selectors to exclude |
| includeLinks | boolean | true | Include extracted links in metadata |
| includeImages | boolean | true | Include extracted images in metadata |
| maxDepth | integer | 0 | Crawl depth (0 = provided URLs only) |
| respectRobotsTxt | boolean | true | Respect robots.txt rules |
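A typical input combining these parameters might look like the following (the URL and selector are placeholders):

```json
{
  "urls": ["https://docs.example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "outputFormat": "markdown",
  "chunkSize": 512,
  "chunkOverlap": 50,
  "excludeSelectors": [".newsletter-signup"],
  "includeLinks": true,
  "includeImages": false,
  "respectRobotsTxt": true
}
```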
Output
Each page produces a dataset item with:
```json
{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{ "level": 1, "text": "Main Heading" }],
    "links": [{ "text": "Link Text", "href": "https://..." }],
    "images": [{ "alt": "Image description", "src": "https://..." }]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
```
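Given items of this shape, turning each page into embedding-ready records is a small transformation. A minimal sketch (the sample item below is abbreviated illustrative data, and the `#chunk-N` id scheme is an assumption, not something the actor emits):

```python
def chunks_to_records(item):
    """Flatten one dataset item into (id, text, metadata) tuples
    suitable for a vector-database upsert."""
    return [
        (
            f"{item['url']}#chunk-{chunk['index']}",
            chunk["text"],
            {"url": item["url"], "title": item["metadata"].get("title", "")},
        )
        for chunk in item["chunks"]
    ]

# Abbreviated sample item in the actor's output shape
item = {
    "url": "https://example.com/page",
    "metadata": {"title": "Page Title"},
    "chunks": [
        {"index": 0, "text": "First chunk...", "tokenEstimate": 245},
        {"index": 1, "text": "Second chunk...", "tokenEstimate": 180},
    ],
}
records = chunks_to_records(item)
```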
Integration Examples
Pinecone / Vector DB
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("your-username/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # Embed and upsert to your vector database
        embedding = embed(chunk["text"])
        index.upsert([
            (
                f"{item['url']}_{chunk['index']}",
                embedding,
                {
                    "text": chunk["text"],
                    "url": item["url"],
                    "title": item["metadata"]["title"],
                },
            )
        ])
```
LangChain
```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
docs = loader.load()
```
Chunk Size Recommendations
| Embedding Model | Recommended Chunk Size |
|---|---|
| OpenAI text-embedding-3-small | 500–1000 |
| OpenAI text-embedding-3-large | 1000–2000 |
| Cohere embed-v3 | 256–512 |
| Sentence Transformers | 256–512 |
| Google Gecko | 500–1000 |
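If you drive the actor from code, the table above can be applied programmatically. A sketch using the midpoints of the ranges above; the model-name keys are illustrative, not canonical identifiers:

```python
# Suggested chunkSize per embedding model (midpoints of the recommended ranges)
CHUNK_SIZE_BY_MODEL = {
    "text-embedding-3-small": 750,
    "text-embedding-3-large": 1500,
    "embed-v3": 384,
    "sentence-transformers": 384,
    "gecko": 750,
}

def chunk_size_for(model: str, default: int = 1000) -> int:
    """Pick a chunkSize value for the actor input based on the embedding model."""
    return CHUNK_SIZE_BY_MODEL.get(model, default)
```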
Pricing
This actor uses pay-per-event pricing at approximately $0.005 per page processed.
License
MIT

