Website Markdown Crawler
Pricing
from $2.00 / 1,000 websites analyzed
Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.
Developer
Ziad Tarik
Maintained by Community
Crawls a website starting from a seed URL and converts every page to clean Markdown optimized for LLM ingestion (LlamaIndex, LangChain, OpenAI, Pinecone). The output for each page includes structured metadata: title, detected language, publication date, headings outline, word count, and chunked content ready for vector store upsert.
Features
- Clean Markdown Extraction: Strips noise (navigation, footers) to extract just the main content.
- Smart Chunking: Splits content into token-bounded chunks that respect paragraph boundaries.
- Language Filtering: Can automatically detect and filter pages by language (e.g., only `en` or `fr`).
- Domain Control: Keeps the crawler scoped to the seed URL's domain.
- Regex Exclusions: Skips low-value URLs such as tag or author pages.
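To make the Smart Chunking feature concrete, here is a minimal sketch of paragraph-aware chunking. It is not the Actor's actual implementation: the function names, the `max_tokens` default, and the rough "4 characters per token" estimate are all assumptions chosen for illustration.

```python
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. The Actor's real
    # estimator is not documented here and may differ.
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of about max_tokens."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))

    # Mirror the field names used in the Actor's output records.
    return [
        {"index": i, "content": c, "tokenEstimate": estimate_tokens(c)}
        for i, c in enumerate(chunks)
    ]
```

A paragraph is never split in this sketch, so a single paragraph longer than `max_tokens` ends up as one oversized chunk. A production chunker would likely fall back to sentence-level splitting in that case.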
Output Example
Each crawled page yields a structured JSON record:
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    { "index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498 }
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
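As a sketch of how such a record might be prepared for a vector store upsert, the helper below flattens one page record into one entry per chunk. The `to_vector_records` name and the `id` / `text` / `metadata` entry shape are assumptions for illustration, not part of the Actor's output. Computing embeddings and calling a specific vector store client are left out.

```python
import hashlib


def to_vector_records(item: dict) -> list[dict]:
    """Turn one crawled-page record into one upsert-ready entry per chunk."""
    records = []
    for chunk in item["chunks"]:
        # Deterministic ID derived from URL + chunk index, so re-crawling the
        # same page overwrites existing entries instead of duplicating them.
        raw_id = f"{item['url']}#{chunk['index']}"
        records.append(
            {
                "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
                "text": chunk["content"],
                "metadata": {
                    "url": item["url"],
                    "title": item.get("title"),
                    "language": item.get("language"),
                    "chunkIndex": chunk["index"],
                },
            }
        )
    return records
```

Deriving the ID from the URL and chunk index keeps upserts idempotent across repeated crawls, as long as the chunk boundaries stay stable.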
Integrations
Connect the crawler directly into your RAG stack.
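The integration snippets in this section iterate over `dataset_items` without defining it. One way to obtain it, assuming the Actor's dataset has been exported as a JSON array to a local file (the `dataset.json` path and the `load_dataset_items` helper are hypothetical), is sketched here:

```python
import json
from pathlib import Path


def load_dataset_items(path: str) -> list[dict]:
    """Load crawled-page records from a JSON export of the Actor's dataset."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("expected the export to be a JSON array of page records")
    # Skip records without chunks (e.g. pages with no extractable content).
    return [item for item in items if item.get("chunks")]


# Usage (hypothetical file name):
# dataset_items = load_dataset_items("dataset.json")
```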
LlamaIndex
from llama_index.core import Document

# After running the Actor, download dataset as JSON
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]
LangChain
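# Document lives in langchain_core in current releases; older code
# imported it from langchain.docstore.document.
from langchain_core.documents import Document as LCDoc

lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]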