# AI Web Scraper

Extract AI-ready content from any website. Built on Crawlee, the actor handles JavaScript rendering and adapts to layout changes, producing clean Markdown output, smart chunking for RAG/embeddings, and structured metadata optimized for LLM data pipelines.
## Features
- Clean Markdown Output — Automatically removes navigation, ads, footers, sidebars, and cookie banners. Extracts only the main content.
- Smart Chunking — Paragraph-aware text splitting with configurable chunk size and overlap. Perfect for vector databases and embedding models (see the sketch after this list).
- Token Estimation — Each chunk includes an estimated token count, compatible with OpenAI, Cohere, and other tokenizers.
- Structured Metadata — Extracts title, description, language, author, publish date, OG images, headings, links, and images.
- Multi-page Crawling — Follow links within the same domain with configurable depth. Process entire documentation sites or blogs.
- Multiple Output Formats — Markdown (default), plain text, or raw HTML.
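The actor's internal splitter isn't published, but a minimal sketch of paragraph-aware chunking with overlap, including the rough chars/4 token estimate used in the output fields, could look like this (function names and the heuristic are illustrative, not the actor's actual code):

```python
# Illustrative sketch only -- not the actor's internal implementation.
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Split text on paragraph boundaries into ~chunk_size-token chunks,
    carrying ~overlap tokens of trailing context into the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        para_tokens = estimate_tokens(para)
        if current and current_tokens + para_tokens > chunk_size:
            chunk_text = "\n\n".join(current)
            chunks.append({
                "index": len(chunks),
                "text": chunk_text,
                "tokenEstimate": estimate_tokens(chunk_text),
                "charCount": len(chunk_text),
            })
            # Seed the next chunk with trailing paragraphs as overlap.
            tail, tail_tokens = [], 0
            for p in reversed(current):
                if tail_tokens >= overlap:
                    break
                tail.insert(0, p)
                tail_tokens += estimate_tokens(p)
            current, current_tokens = tail, tail_tokens
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunk_text = "\n\n".join(current)
        chunks.append({
            "index": len(chunks),
            "text": chunk_text,
            "tokenEstimate": estimate_tokens(chunk_text),
            "charCount": len(chunk_text),
        })
    return chunks
```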
## Use Cases
- RAG Pipelines — Feed clean, chunked content into retrieval-augmented generation systems
- Vector Database Ingestion — Ready-to-embed chunks for Pinecone, Weaviate, Qdrant, ChromaDB, Milvus
- LLM Fine-tuning Data — Extract structured training data from web sources
- Knowledge Base Building — Crawl documentation sites and create searchable knowledge bases
- Content Analysis — Extract and analyze web content at scale
## Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl |
| outputFormat | string | "markdown" | Output format: "markdown", "text", or "html" |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap between chunks in tokens |
| excludeSelectors | string[] | [] | Additional CSS selectors to exclude |
| includeLinks | boolean | true | Include extracted links in metadata |
| includeImages | boolean | true | Include extracted images in metadata |
| maxDepth | integer | 0 | Crawl depth (0 = provided URLs only) |
| respectRobotsTxt | boolean | true | Respect robots.txt rules |
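For example, an input that crawls a documentation site two levels deep with smaller chunks might look like this (all keys are the parameters above; the values are illustrative):

```json
{
  "urls": ["https://docs.example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "outputFormat": "markdown",
  "chunkSize": 512,
  "chunkOverlap": 50,
  "respectRobotsTxt": true
}
```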
## Output

Each page produces a dataset item with:

```json
{
  "url": "https://example.com/page",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "language": "en",
    "author": "Author Name",
    "publishedDate": "2025-01-15",
    "ogImage": "https://example.com/image.jpg",
    "headings": [{ "level": 1, "text": "Main Heading" }],
    "links": [{ "text": "Link Text", "href": "https://..." }],
    "images": [{ "alt": "Image description", "src": "https://..." }]
  },
  "content": "# Main Heading\n\nClean markdown content...",
  "chunks": [
    {
      "index": 0,
      "text": "First chunk of content...",
      "tokenEstimate": 245,
      "charCount": 980
    }
  ],
  "totalTokenEstimate": 1520,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
```
## Integration Examples

### Pinecone / Vector DB

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/ai-web-scraper").call(
    run_input={"urls": ["https://docs.example.com"], "maxDepth": 2, "chunkSize": 512}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    for chunk in item["chunks"]:
        # Embed and upsert to your vector database
        # (`embed` and `index` are placeholders for your embedding
        # function and vector index).
        embedding = embed(chunk["text"])
        index.upsert([
            (
                f"{item['url']}_{chunk['index']}",
                embedding,
                {
                    "text": chunk["text"],
                    "url": item["url"],
                    "title": item["metadata"]["title"],
                },
            )
        ])
```
### LangChain

```python
from itertools import chain

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: [
        Document(
            page_content=chunk["text"],
            metadata={"source": item["url"], "chunk_index": chunk["index"]},
        )
        for chunk in item["chunks"]
    ],
)
# The mapping function returns a list of Documents per dataset item,
# so flatten the loader's output into a single list of Documents.
docs = list(chain.from_iterable(loader.load()))
```
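Because the actor already returns pre-chunked text, these per-chunk Documents can be embedded or stored directly; no additional LangChain text-splitter pass is needed.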
## Chunk Size Recommendations
| Embedding Model | Recommended Chunk Size |
|---|---|
| OpenAI text-embedding-3-small | 500–1000 |
| OpenAI text-embedding-3-large | 1000–2000 |
| Cohere embed-v3 | 256–512 |
| Sentence Transformers | 256–512 |
| Google Gecko | 500–1000 |
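To apply a recommendation, pass the matching chunkSize when starting a run. A quick sketch, assuming the actor ID from the examples above (the mapping keys and midpoint values are illustrative, derived from the table):

```python
from apify_client import ApifyClient

# Midpoints of the recommended ranges in the table above (illustrative).
CHUNK_SIZE_BY_MODEL = {
    "text-embedding-3-small": 750,
    "text-embedding-3-large": 1500,
    "cohere-embed-v3": 384,
    "sentence-transformers": 384,
    "google-gecko": 750,
}

client = ApifyClient("YOUR_API_TOKEN")
model = "text-embedding-3-small"
run = client.actor("your-username/ai-web-scraper").call(
    run_input={
        "urls": ["https://docs.example.com"],
        "chunkSize": CHUNK_SIZE_BY_MODEL[model],
    }
)
```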
## Pricing

This actor uses pay-per-event pricing at approximately $0.005 per page processed, so a 1,000-page crawl costs roughly $5.
## License
MIT