Jina Reader Cloud Wrapper
Pricing
$0.50 / 1,000 results
Under maintenance
Convert web pages to markdown for RAG/LLM. Batch URL processor extracts clean content from websites, PDFs, documentation. Web scraping for AI training data, knowledge bases, research. Jina AI Reader wrapper: auto-retry, ReaderLM-v2, cost tracking, image alt-text. $0.50/1K URLs
Last modified
2 days ago
Convert any URL to clean, LLM-ready markdown instantly. Perfect for RAG pipelines, AI agents, research automation, and content extraction workflows.
What This Does
This Apify actor wraps the powerful Jina AI Reader API into a cloud-hosted, batch-processing service with automatic retries, rate limit management, and cost tracking. Instead of managing API calls yourself, just provide URLs and get clean markdown optimized for language models.
Powered by Jina AI Reader:
- Automatic content extraction - Removes nav, ads, footers, sidebars
- Clean markdown output - LLM-ready with preserved structure
- PDF support - Extracts text from PDFs automatically
- Image captioning - Optional AI-generated alt text for images
- ReaderLM-v2 - Advanced 1.5B parameter model for complex pages
- Blazing fast - Cached responses in milliseconds
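For context, a direct call to Jina Reader (what this actor wraps) can be sketched as below. This is a minimal sketch, assuming the public `https://r.jina.ai/` URL-prefix endpoint and Bearer-token authentication; `reader_url` and `fetch_markdown` are illustrative names and are not part of this actor.

```python
import urllib.request
from typing import Optional

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL by prefixing the target URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, api_key: Optional[str] = None,
                   timeout: float = 30.0) -> str:
    """Fetch one page as markdown. Performs a real network request."""
    request = urllib.request.Request(reader_url(target_url))
    if api_key:
        # The key is optional; without it the keyless rate limit applies.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The actor adds batching, retries, and cost tracking on top of calls like this one.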
Perfect For
RAG Pipeline Data Collection
Build knowledge bases for retrieval-augmented generation systems. Clean markdown with semantic structure enables better chunking and embedding quality.
AI Agent Research
Enable autonomous agents to gather information from the web. Process multiple URLs in parallel, handle retries automatically, track token usage.
Content Aggregation
Collect articles, documentation, blog posts from diverse sources. Normalize into consistent markdown format for downstream processing.
Documentation Extraction
Extract clean content from technical docs, API references, tutorials. Preserve code blocks, tables, headings for AI-powered code assistants.
Web Monitoring
Track content changes on specific pages. Disable caching for real-time monitoring, compare snapshots over time.
Input Configuration
Required
URLs (array of strings)
- List of URLs to process
- Supports web pages and PDFs
- Example:
["https://en.wikipedia.org/wiki/AI", "https://arxiv.org/pdf/2310.19923"]
Optional
| Parameter | Default | Description |
|---|---|---|
| returnFormat | markdown | Output format: markdown, json, html, text, screenshot |
| useReaderLM | false | Use ReaderLM-v2 for higher quality (3x token cost) |
| generateImageAlt | false | Generate AI descriptions for images |
| timeout | 30000 | Page load timeout in milliseconds |
| noCache | false | Force fresh content fetching |
| cacheTolerance | 3600 | Max age of cached content (seconds) |
| targetSelector | - | CSS selector to limit extraction |
| waitForSelector | - | Wait for element before extracting |
| jinaApiKey | - | Your Jina API key (500 RPM vs 20 RPM) |
| batchSize | 5 | Concurrent URLs to process |
| maxRetries | 3 | Retry attempts for failed URLs |
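The defaults in the table can be captured in a small helper that assembles an input object before a run. This is a sketch only: `build_input` is a hypothetical helper, and the validation rules are assumptions rather than the actor's own input schema.

```python
from typing import Any, Dict, List

# Defaults copied from the Optional parameters table.
DEFAULTS: Dict[str, Any] = {
    "returnFormat": "markdown",
    "useReaderLM": False,
    "generateImageAlt": False,
    "timeout": 30000,
    "noCache": False,
    "cacheTolerance": 3600,
    "batchSize": 5,
    "maxRetries": 3,
}
# Parameters that have no default value in the table.
NO_DEFAULT = {"targetSelector", "waitForSelector", "jinaApiKey"}
RETURN_FORMATS = {"markdown", "json", "html", "text", "screenshot"}


def build_input(urls: List[str], **overrides: Any) -> Dict[str, Any]:
    """Merge user overrides into the documented defaults and validate them."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    for url in urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    unknown = set(overrides) - set(DEFAULTS) - NO_DEFAULT
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = {**DEFAULTS, **overrides, "urls": list(urls)}
    if merged["returnFormat"] not in RETURN_FORMATS:
        raise ValueError(f"unsupported returnFormat: {merged['returnFormat']!r}")
    return merged
```

For example, `build_input(["https://example.com"], noCache=True)` yields the defaults with caching disabled.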
Example Inputs
Basic Usage - Convert URLs to Markdown
    {
      "urls": [
        "https://docs.python.org/3/tutorial/index.html",
        "https://en.wikipedia.org/wiki/Machine_learning",
        "https://github.com/jina-ai/reader"
      ],
      "returnFormat": "markdown"
    }
RAG Pipeline - High Quality Extraction
    {
      "urls": [
        "https://platform.openai.com/docs/introduction",
        "https://docs.anthropic.com/claude/docs"
      ],
      "useReaderLM": true,
      "generateImageAlt": true,
      "timeout": 45000
    }
Real-Time Monitoring - No Cache
    {
      "urls": ["https://news.ycombinator.com"],
      "noCache": true,
      "cacheTolerance": 0
    }
PDF Extraction
    {
      "urls": [
        "https://arxiv.org/pdf/2310.19923",
        "https://example.com/whitepaper.pdf"
      ],
      "returnFormat": "markdown"
    }
Specific Content Selection
    {
      "urls": ["https://blog.example.com/article"],
      "targetSelector": "article.post-content",
      "waitForSelector": ".article-loaded"
    }
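Inputs like the ones above can also be submitted programmatically. The sketch below assumes the official `apify-client` Python package, whose client exposes `actor(...).call(run_input=...)` and `dataset(...).iterate_items()`; `run_reader` is a hypothetical helper, and the actor ID is a placeholder because it is not stated on this page.

```python
from typing import Any, Dict, List


def run_reader(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[dict]:
    """Start the actor, wait for it to finish, and return its dataset items.

    `client` is expected to be an apify_client.ApifyClient instance.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return run details")
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())


# Usage (requires `pip install apify-client` and a real token and actor ID):
# from apify_client import ApifyClient
# client = ApifyClient("YOUR_APIFY_TOKEN")
# items = run_reader(client, "<username>/<actor-name>",
#                    {"urls": ["https://news.ycombinator.com"], "noCache": True})
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no token is available.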
Output Format
Dataset Results
Each processed URL returns:
    {
      "url": "https://example.com/article",
      "title": "Article Title",
      "content": "# Article Title\n\nClean markdown content here...",
      "metadata": {
        "processingTime": 2341,
        "contentLength": 15234,
        "estimatedTokens": 3809,
        "tokenCost": 3809,
        "processedAt": "2025-11-06T10:30:00.000Z",
        "returnFormat": "markdown",
        "usedReaderLM": false
      },
      "status": "success"
    }
Failed URLs
    {
      "url": "https://blocked-site.com",
      "title": null,
      "content": null,
      "error": "Request failed with status code 403",
      "metadata": {
        "processedAt": "2025-11-06T10:30:00.000Z",
        "returnFormat": "markdown"
      },
      "status": "error"
    }
Summary Statistics
Available in Key-Value Store as OUTPUT:
    {
      "stats": {
        "totalUrls": 10,
        "successful": 9,
        "failed": 1,
        "successRate": "90.0%",
        "totalTokens": 45234,
        "effectiveTokens": 45234,
        "totalTimeSeconds": "12.3",
        "avgTimePerUrlSeconds": "1.23"
      }
    }
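The summary can be recomputed from the dataset items, which helps when merging several runs. This is a sketch under assumptions: `summarize` is a hypothetical helper, and treating `effectiveTokens` as the token count multiplied by 3 for ReaderLM-v2 items is inferred from the pricing notes, not confirmed by the actor.

```python
from typing import Dict, List


def summarize(items: List[dict], total_seconds: float) -> Dict[str, dict]:
    """Rebuild an OUTPUT-style stats record from dataset items."""
    total = len(items)
    succeeded = [item for item in items if item.get("status") == "success"]
    total_tokens = 0
    effective_tokens = 0
    for item in succeeded:
        metadata = item.get("metadata", {})
        tokens = metadata.get("estimatedTokens", 0)
        total_tokens += tokens
        # Assumption: ReaderLM-v2 results are billed at 3x tokens.
        effective_tokens += tokens * (3 if metadata.get("usedReaderLM") else 1)
    success_rate = 100 * len(succeeded) / total if total else 0.0
    avg_seconds = total_seconds / total if total else 0.0
    return {
        "stats": {
            "totalUrls": total,
            "successful": len(succeeded),
            "failed": total - len(succeeded),
            "successRate": f"{success_rate:.1f}%",
            "totalTokens": total_tokens,
            "effectiveTokens": effective_tokens,
            "totalTimeSeconds": f"{total_seconds:.1f}",
            "avgTimePerUrlSeconds": f"{avg_seconds:.2f}",
        }
    }
```

Items with `"status": "error"` are counted as failed and contribute no tokens.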
Pricing & Cost Control
Jina API Costs
Free Tier:
- 20 requests/minute (no API key)
- 200 requests/minute (with free API key)
- 10M free tokens for new users
Token Consumption:
- Standard mode: 1x tokens (response size)
- ReaderLM-v2 mode: 3x tokens (higher quality)
Rate Limits:
- Free: 200 RPM
- Premium: 500 RPM (read), 1000 RPM (search)
This Actor's Pricing
$0.50 per 1,000 URL conversions
Cost Examples:
- 10 URLs: $0.005 (half a cent)
- 100 URLs: $0.05 (5 cents)
- 1,000 URLs: $0.50
- 10,000 URLs: $5.00
Combined Cost Example
Processing 100 documentation pages:
- Jina API cost: ~$0.00 (within free tier)
- This actor cost: $0.05
- Total cost: $0.05
Plus Apify compute: ~$0.02 (varies by runtime)
Grand total: ~$0.07 for 100 clean markdown pages
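The arithmetic behind these figures can be sketched as follows. The per-URL price comes from this page; the Jina and Apify compute amounts are inputs you supply, and the function names are illustrative.

```python
ACTOR_PRICE_PER_1000_URLS = 0.50  # USD, from "This Actor's Pricing"


def actor_cost(url_count: int) -> float:
    """Actor fee in USD for a given number of URL conversions."""
    return round(url_count * ACTOR_PRICE_PER_1000_URLS / 1000, 4)


def estimate_total(url_count: int, jina_cost: float = 0.0,
                   apify_compute: float = 0.0) -> float:
    """Actor fee plus caller-supplied Jina API and Apify compute costs."""
    return round(actor_cost(url_count) + jina_cost + apify_compute, 4)
```

With 100 URLs, no Jina charge, and roughly $0.02 of compute, `estimate_total(100, 0.0, 0.02)` reproduces the grand total above.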
Cost Control Features
- Batch processing - Process multiple URLs efficiently
- Automatic retries - Don't waste runs on temporary failures
- Token tracking - Real-time cost estimates in logs
- Cache support - Reuse previous results (default: 1 hour)
- Rate limit management - Automatic delays between batches
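To illustrate how batching and delays interact with a requests-per-minute limit, here is a minimal sketch. It is not the actor's implementation: `batches` and `min_delay_between_batches` are hypothetical helpers, and the delay formula assumes each URL costs exactly one request.

```python
from typing import Iterator, List, Sequence


def batches(urls: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield list(urls[start:start + batch_size])


def min_delay_between_batches(batch_size: int, requests_per_minute: int) -> float:
    """Seconds to wait between batch starts to stay under the limit."""
    return 60.0 * batch_size / requests_per_minute
```

At the keyless limit of 20 requests/minute, the default `batchSize` of 5 would need about 15 seconds between batches; at 200 requests/minute the same batch needs 1.5 seconds.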
Advanced Features
ReaderLM-v2: Higher Quality Extraction
Enable useReaderLM: true for complex pages with:
- Code blocks with syntax highlighting
- Complex HTML tables
- Deeply nested lists
- Mathematical equations (LaTeX)
- Sophisticated document structures
Trade-off: 3x token cost for superior quality
Image Captioning
Enable generateImageAlt: true to:
- Generate descriptive alt text using vision models
- Enable LLMs to reason about visual content
- Improve accessibility and SEO
CSS Selector Targeting
Use targetSelector for precise extraction:
- "article.main-content" - Specific article
- "#post-body" - Element by ID
- ".documentation-content" - Class-based selection
Dynamic Content Handling
Use waitForSelector for JavaScript-heavy sites:
- Wait for specific elements to load
- Handle single-page applications
- Capture dynamically rendered content
Caching Strategies
Default: 1-hour cache
- Fast responses for repeated URLs
- Good for static content
No cache: Real-time monitoring
- noCache: true - Always fetch fresh content
- Best for news, dashboards, live data
Custom tolerance: Balance freshness and speed
- cacheTolerance: 600 (10 minutes)
- Configure per use case
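One plausible way these options reach Jina Reader is through request headers. The sketch below is a guess at the mapping, not the actor's confirmed internals: the header names follow the Jina Reader README as I understand it and should be verified against the current documentation, and `reader_headers` is a hypothetical helper.

```python
import math
from typing import Any, Dict


def reader_headers(options: Dict[str, Any]) -> Dict[str, str]:
    """Translate actor-style options into Jina Reader request headers."""
    headers: Dict[str, str] = {}
    if options.get("jinaApiKey"):
        headers["Authorization"] = f"Bearer {options['jinaApiKey']}"
    if options.get("noCache"):
        # Bypassing the cache makes any tolerance setting irrelevant.
        headers["X-No-Cache"] = "true"
    elif "cacheTolerance" in options:
        headers["X-Cache-Tolerance"] = str(options["cacheTolerance"])
    if options.get("targetSelector"):
        headers["X-Target-Selector"] = options["targetSelector"]
    if options.get("waitForSelector"):
        headers["X-Wait-For-Selector"] = options["waitForSelector"]
    if options.get("generateImageAlt"):
        headers["X-With-Generated-Alt"] = "true"
    if "timeout" in options:
        # Assumption: the actor takes milliseconds, the header takes seconds.
        headers["X-Timeout"] = str(math.ceil(options["timeout"] / 1000))
    return headers
```

For example, `reader_headers({"noCache": True, "cacheTolerance": 0})` yields only the no-cache header, which matches the real-time monitoring input shown earlier.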
Use Case Examples
1. Build RAG Knowledge Base
    {
      "urls": [
        "https://docs.company.com/api/overview",
        "https://docs.company.com/api/authentication",
        "https://docs.company.com/api/endpoints"
      ],
      "useReaderLM": true,
      "generateImageAlt": true,
      "jinaApiKey": "your_key_here"
    }
Process documentation into clean markdown, then:
- Chunk into paragraphs/sections
- Generate embeddings (use Jina Embeddings v2)
- Store in vector database (Pinecone, Weaviate, ChromaDB)
- Query with LLM for Q&A
2. Research Agent Workflow
    {
      "urls": [
        "https://arxiv.org/abs/2310.19923",
        "https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)",
        "https://blog.research.google/2017/08/transformer-novel-neural-network.html"
      ],
      "useReaderLM": true,
      "timeout": 60000
    }
AI agent gathers information, then:
- Extracts key concepts from each source
- Synthesizes findings across papers
- Generates comprehensive summary
- Cites specific passages from sources
3. Content Aggregation Pipeline
    {
      "urls": [
        "https://news.ycombinator.com",
        "https://techcrunch.com/ai",
        "https://www.theverge.com/artificial-intelligence"
      ],
      "batchSize": 3,
      "cacheTolerance": 300
    }
Daily aggregation workflow:
- Fetch latest articles
- Extract clean content
- Classify by topic (using Jina Classifier)
- Generate daily digest email
4. Competitive Intelligence
    {
      "urls": [
        "https://competitor.com/pricing",
        "https://competitor.com/features",
        "https://competitor.com/blog/latest"
      ],
      "noCache": true
    }
Weekly monitoring:
- Capture current state
- Compare with previous snapshots
- Detect pricing changes
- Alert on new features
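Comparing a new snapshot with a stored one can be sketched with `hashlib` and `difflib`. Storage and alerting are left out, and `fingerprint` and `diff_snapshots` are hypothetical helpers operating on the `content` field of two runs.

```python
import difflib
import hashlib
from typing import List


def fingerprint(content: str) -> str:
    """Stable hash of a page's markdown, cheap to store and compare."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(previous: str, current: str) -> List[str]:
    """Return added ('+') and removed ('-') lines, or [] if nothing changed."""
    if fingerprint(previous) == fingerprint(current):
        return []
    diff = list(difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
    # Skip the two file-header lines, keep only real additions and removals.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

A pricing page whose Pro plan moves from $10 to $12 would produce one removed line and one added line, which is enough to trigger an alert.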
Integration Examples
LangChain
    # Import paths as in older LangChain releases; newer releases expose these
    # from langchain_community, langchain_core and langchain_text_splitters.
    from langchain.document_loaders import ApifyDatasetLoader
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    # Load data from this actor
    loader = ApifyDatasetLoader(
        dataset_id="your_dataset_id",
        dataset_mapping_function=lambda item: Document(
            # Failed URLs have null content, so fall back to an empty string
            page_content=item["content"] or "",
            metadata={
                "source": item["url"],
                "title": item["title"],
                "tokens": item["metadata"].get("estimatedTokens"),
            },
        ),
    )
    docs = loader.load()
    # Split into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)
    # Use in RAG pipeline
LlamaIndex
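    # Legacy (pre-0.10) llama_index import style, as in the original example
    from llama_index import Document, VectorStoreIndex, download_loader
    ApifyDataset = download_loader("ApifyDataset")
    # The reader takes an Apify API token; the dataset ID goes to load_data()
    reader = ApifyDataset("your_apify_token")
    documents = reader.load_data(
        dataset_id="your_dataset_id",
        dataset_mapping_function=lambda item: Document(
            text=item["content"] or "",
            metadata={"source": item["url"]},
        ),
    )
    # Build vector index
    index = VectorStoreIndex.from_documents(documents)
    # Query through a query engine (the index itself has no query() method)
    response = index.as_query_engine().query("What are the key features?")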
Jina Ecosystem
    // Use with Jina Embeddings v2
    const response = await fetch('https://api.jina.ai/v1/embeddings', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_KEY',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'jina-embeddings-v2-base-en',
        input: [data.content] // From this actor's output
      })
    });
    // Use with Jina Reranker
    const reranked = await fetch('https://api.jina.ai/v1/rerank', {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_KEY',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'jina-reranker-v2-base-multilingual',
        query: 'How do I authenticate?',
        documents: results.map(r => r.content) // From this actor
      })
    });
Troubleshooting
No Content Extracted
Problem: Empty or very short content returned
Solutions:
- Increase timeout (try 45000-60000 ms)
- Check if site blocks automated access
- Try targetSelector to specify content area
- Use waitForSelector for dynamic content
Rate Limit Errors
Problem: 429 Too Many Requests
Solutions:
- Reduce batchSize (try 2-3)
- Provide jinaApiKey for higher limits (500 RPM)
- Add delays between large batches
- Upgrade to Jina premium ($40/month for 500 RPM)
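If you call rate-limited services yourself downstream of this actor, exponential backoff is a common way to handle 429 responses. The sketch below shows the general technique only, not the actor's retry logic; `RateLimited` and `with_retries` are hypothetical names, and the `max_retries` default mirrors the actor's documented default of 3.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by the caller's operation when the server answers 429."""


def with_retries(operation: Callable[[], T], max_retries: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, doubling the wait after each rate-limit failure."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.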
High Costs
Problem: Token usage higher than expected
Solutions:
- Disable useReaderLM (uses 3x tokens)
- Disable generateImageAlt (adds tokens)
- Use targetSelector to extract only needed content
- Enable caching (noCache: false)
- Monitor token counts in logs
PDF Extraction Failed
Problem: PDF URLs return empty content
Solutions:
- Verify PDF URL is publicly accessible
- Increase timeout for large PDFs
- Check if PDF requires authentication
- Try downloading and hosting elsewhere if needed
Blocked by Website
Problem: 403 Forbidden or similar errors
Solutions:
- Some sites block Jina's user agent
- Try adding a specific targetSelector
- Consider using Apify Web Scraper for complex sites
- Check site's robots.txt for restrictions
Performance Tips
For Speed:
- Use default engine (not ReaderLM-v2)
- Enable caching
- Increase batchSize to 10-20
- Use low timeout values (15000 ms)
For Quality:
- Enable useReaderLM: true
- Enable generateImageAlt: true
- Increase timeout to 45000-60000 ms
- Use targetSelector for precision
For Cost:
- Disable ReaderLM-v2
- Disable image alt generation
- Use aggressive caching
- Filter URLs before processing
Privacy & Security
- No data stored by Jina beyond cache period (default: 1 hour)
- Open-source Jina Reader (self-host if needed)
- API keys encrypted in Apify
- No tracking or analytics by this actor
- Publicly accessible URLs only (no authenticated content)
Resources
Jina AI
- Jina Reader: https://jina.ai/reader/
- API Docs: https://github.com/jina-ai/reader
- ReaderLM-v2: https://jina.ai/models/ReaderLM-v2/
- Get API Key: https://jina.ai/api-dashboard/
Related Jina Services
- Embeddings: https://jina.ai/embeddings/ (for RAG pipelines)
- Reranker: https://jina.ai/reranker/ (improve search quality)
- Classifier: https://jina.ai/classifier/ (content categorization)
Complementary Actors
- AI Training Data Collector: Full-site crawling with auto-categorization
- Apify Web Scraper: Complex scraping with custom logic
- Cheerio Scraper: Fast, lightweight HTML parsing
License
This actor: Apache-2.0
Jina Reader API: Apache-2.0 (open-source)
Built by DarkzOGx | GitHub | More Actors
Convert URLs to clean markdown. Build better AI systems.