Arxiv Semantic Search
Pricing
from $5.00 / 1,000 relevant papers found
Scrape arXiv papers by category and find relevant research using AI-powered semantic search. Get papers from any field (AI, physics, biology, economics, etc.) with embeddings for RAG systems. Find your categories at: https://arxiv.org/category_taxonomy
Developer

Mohamed Aouad
ArXiv Scraper & Semantic Search
Scrape academic papers from arXiv.org by category and perform semantic search on abstracts using AI-powered embeddings. Perfect for researchers, literature reviews, and building AI agents that need access to scientific papers.
Features
- Scrape arXiv papers by category (quantum physics, AI, condensed matter, etc.)
- Semantic search using Sentence-BERT embeddings (384-dimensional vectors)
- Results ranked by cosine similarity to your query
- Date filtering to get papers from specific time periods
- Fast and efficient: processes 50 papers in ~30 seconds
- Automatic retries with exponential backoff for API failures
Use Cases
- Literature reviews - Find papers similar to your research topic
- Research discovery - Explore related work in your field
- AI agents - Build RAG systems with scientific knowledge
- Citation analysis - Track papers in specific domains
- Trend monitoring - Monitor new papers in your categories
How It Works
- Fetch papers from the arXiv API using category filters
- Generate embeddings for paper abstracts using Sentence-BERT (`all-MiniLM-L6-v2`)
- Search semantically by comparing the query embedding to the paper embeddings
- Return ranked results sorted by similarity score
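The comparison and ranking steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Actor's actual code: the three-dimensional vectors are made-up stand-ins for the 384-dimensional Sentence-BERT embeddings, and `rank_by_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, papers, top_k=10):
    """Return the top_k papers sorted by similarity to the query (hypothetical helper)."""
    scored = [
        {**paper, "similarity_score": cosine_similarity(query_embedding, paper["embedding"])}
        for paper in papers
    ]
    scored.sort(key=lambda p: p["similarity_score"], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
papers = [
    {"id": "A", "embedding": [1.0, 0.0, 0.0]},
    {"id": "B", "embedding": [0.0, 1.0, 0.0]},
    {"id": "C", "embedding": [0.7, 0.7, 0.0]},
]
top = rank_by_similarity([1.0, 0.1, 0.0], papers, top_k=2)
print([p["id"] for p in top])  # -> ['A', 'C']
```

In practice the query and abstract vectors would come from the embedding model, and `top_k` plays the role of the `topK` input parameter.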
Quick Start for Your Research Domain
New to arXiv categories? We've got you covered:
- CATEGORY_GUIDE.md - Find categories for your field
- QUICK_START_EXAMPLES.md - Copy-paste ready configurations
- USER_GUIDE.md - Complete usage guide with workflows
Popular Research Domains
| Your Field | Categories to Use | Example |
|---|---|---|
| AI/Machine Learning | cs.AI, cs.LG, cs.CL | LLMs, transformers, neural networks |
| Physics | quant-ph, cond-mat | Quantum computing, semiconductors |
| Biology | q-bio.NC, q-bio.BM | Neuroscience, protein folding |
| Economics | econ.EM, stat.ME | Econometrics, causal inference |
| Math/Statistics | stat.ML, math.ST | Statistical methods, probability |
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `categories` | array | ["cs.AI"] | arXiv categories (see CATEGORY_GUIDE.md) |
| `maxPapers` | integer | 100 | Papers per category (1-1000) |
| `startDate` | string | null | Start date (YYYY-MM-DD), e.g., "2024-01-01" |
| `endDate` | string | null | End date (YYYY-MM-DD), e.g., "2024-12-31" |
| `enableSemanticSearch` | boolean | false | Rank papers by relevance to your query |
| `searchQuery` | string | null | Plain English query, e.g., "quantum computing" |
| `topK` | integer | 10 | Number of top results (1-100) |
Full category list: arXiv category taxonomy (https://arxiv.org/category_taxonomy) or CATEGORY_GUIDE.md
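The defaults and ranges in the table can be checked before a run with a small validator. The sketch below is an assumption about how such checks might look, not the Actor's own validation: `validate_input` and `_parse_day` are hypothetical names, and the rule that `startDate` must not follow `endDate` is not documented above.

```python
from datetime import datetime


def _parse_day(value):
    """Parse a YYYY-MM-DD string into a date, or pass None through unchanged."""
    return None if value is None else datetime.strptime(value, "%Y-%m-%d").date()


def validate_input(raw):
    """Apply the documented defaults and range checks to an input dict (hypothetical helper)."""
    cfg = {
        "categories": raw.get("categories", ["cs.AI"]),
        "maxPapers": raw.get("maxPapers", 100),
        "startDate": raw.get("startDate"),
        "endDate": raw.get("endDate"),
        "enableSemanticSearch": raw.get("enableSemanticSearch", False),
        "searchQuery": raw.get("searchQuery"),
        "topK": raw.get("topK", 10),
    }
    if not cfg["categories"]:
        raise ValueError("categories must contain at least one arXiv category")
    if not 1 <= cfg["maxPapers"] <= 1000:
        raise ValueError("maxPapers must be between 1 and 1000")
    if not 1 <= cfg["topK"] <= 100:
        raise ValueError("topK must be between 1 and 100")
    start = _parse_day(cfg["startDate"])  # raises ValueError on a malformed date
    end = _parse_day(cfg["endDate"])
    if start and end and start > end:
        # Assumed rule: the documentation does not say how a reversed range is handled.
        raise ValueError("startDate must not be after endDate")
    return cfg


print(validate_input({"categories": ["quant-ph"], "maxPapers": 50})["topK"])  # -> 10
```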
Output Format
Each paper in the dataset contains:

    {
      "id": "2512.05101v1",
      "title": "Decoy-state quantum key distribution over 227 km...",
      "abstract": "We demonstrate quantum key distribution using...",
      "authors": ["John Doe", "Jane Smith"],
      "published": "2024-12-05T10:30:00Z",
      "updated": "2024-12-05T10:30:00Z",
      "categories": ["quant-ph", "physics.optics"],
      "pdf_url": "https://arxiv.org/pdf/2512.05101v1",
      "arxiv_url": "http://arxiv.org/abs/2512.05101v1",
      "embedding": [0.123, -0.456, ...],  // 384-dim vector (if enableSemanticSearch=true)
      "similarity_score": 0.87  // Only present if searchQuery was provided
    }

Usage Examples
Example 1: Scrape Recent Quantum Physics Papers

    {
      "categories": ["quant-ph"],
      "maxPapers": 50,
      "startDate": "2024-12-01",
      "endDate": "2024-12-05"
    }

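A date window like the one in Example 1 can also be applied to dataset items after a run. The sketch below assumes inclusive bounds and filtering on the `published` field in the timestamp format shown under Output Format. Whether the Actor itself filters in the API query or after fetching is not stated here, and `in_date_range` is a hypothetical helper.

```python
from datetime import date, datetime


def in_date_range(paper, start_date=None, end_date=None):
    """True if the paper's `published` day lies within [start_date, end_date] (inclusive)."""
    # `published` is expected in the form "2024-12-05T10:30:00Z".
    published = datetime.strptime(paper["published"], "%Y-%m-%dT%H:%M:%SZ").date()
    if start_date and published < date.fromisoformat(start_date):
        return False
    if end_date and published > date.fromisoformat(end_date):
        return False
    return True


# Made-up records carrying only the fields this filter needs.
sample_papers = [
    {"id": "a", "published": "2024-11-30T23:59:59Z"},
    {"id": "b", "published": "2024-12-01T00:00:00Z"},
    {"id": "c", "published": "2024-12-05T10:30:00Z"},
    {"id": "d", "published": "2024-12-06T08:00:00Z"},
]
kept = [p["id"] for p in sample_papers if in_date_range(p, "2024-12-01", "2024-12-05")]
print(kept)  # -> ['b', 'c']
```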
Example 2: Semantic Search for Topological Quantum Computing

    {
      "categories": ["quant-ph", "cond-mat.mes-hall"],
      "maxPapers": 100,
      "enableSemanticSearch": true,
      "searchQuery": "topological quantum computing and anyons",
      "topK": 10
    }

Example 3: Find AI Papers on Transformers

    {
      "categories": ["cs.AI", "cs.LG", "cs.CL"],
      "maxPapers": 200,
      "enableSemanticSearch": true,
      "searchQuery": "transformer architecture attention mechanisms",
      "topK": 20
    }

Getting Started
Run Locally

    # Install Apify CLI
    npm install -g apify-cli
    # Clone or create the Actor
    apify run

Deploy to Apify

    # Login to Apify
    apify login
    # Deploy the Actor
    apify push

Performance
- Scraping: ~100 papers/minute from arXiv API
- Embeddings: ~30 seconds for 50 papers (first run downloads model)
- Search: <1 second for 100 papers (in-memory cosine similarity)
- Memory: ~500MB (includes PyTorch + Sentence-BERT model)
Technical Details
- Embedding Model: `sentence-transformers/all-MiniLM-L6-v2`
  - Dimensions: 384
  - Speed: ~1000 sentences/second
  - Quality: High semantic similarity accuracy
- Search Algorithm: Cosine similarity (scikit-learn)
- API: arXiv Atom API (no authentication required)
- Rate Limiting: Automatic retry with exponential backoff
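For readers unfamiliar with the arXiv Atom API, the sketch below builds a category query URL and parses an Atom feed with the standard library. It is an illustration under assumptions rather than the Actor's code: the helper names, the sort order, and the mapping from Atom elements to output fields are guesses, and the sample feed is made up so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API URL listing the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",  # assumed ordering: newest submissions first
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def _text(element, tag):
    """Text of a child element with runs of whitespace collapsed."""
    return " ".join(element.findtext(f"atom:{tag}", default="", namespaces=ATOM).split())


def parse_feed(xml_text):
    """Convert an Atom feed into dicts shaped like the Actor's output (assumed mapping)."""
    papers = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", ATOM):
        papers.append({
            "arxiv_url": _text(entry, "id"),
            "title": _text(entry, "title"),
            "abstract": _text(entry, "summary"),
            "authors": [_text(a, "name") for a in entry.findall("atom:author", ATOM)],
            "published": _text(entry, "published"),
            "categories": [c.get("term") for c in entry.findall("atom:category", ATOM)],
        })
    return papers


# Made-up feed with a placeholder identifier, used only to exercise the parser.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An example
      title</title>
    <summary>An example abstract.</summary>
    <author><name>Jane Smith</name></author>
    <published>2024-12-05T10:30:00Z</published>
    <category term="quant-ph"/>
  </entry>
</feed>"""

print(build_query_url("quant-ph", max_results=50))
print(parse_feed(SAMPLE_FEED)[0]["title"])  # -> An example title
```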
Error Handling
The Actor gracefully handles:
- Network failures (3 retries with exponential backoff)
- Invalid categories (warns but continues)
- Empty results (returns empty dataset)
- Missing embeddings (falls back to scraping only)
- Invalid input parameters (clear error messages)
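The retry behaviour for network failures can be pictured with a short sketch. The retry count of 3 comes from the list above. The one-second base delay, the doubling factor, and the `with_retries` name are assumptions made for illustration.

```python
import time


def with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a network error retry up to max_retries times, doubling the wait."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except OSError:  # URLError, timeouts and connection errors all derive from OSError
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s with the defaults


# Demo with a fake sleep so the example finishes instantly.
delays = []
calls = {"n": 0}


def flaky_fetch():
    """Fail twice, then succeed (stands in for a real HTTP request)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(with_retries(flaky_fetch, sleep=delays.append), delays)  # -> ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.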
Limitations
- arXiv API rate limit: ~1 request/3 seconds (handled automatically)
- Maximum papers per request: 1000
- Embedding generation requires ~500MB RAM
- Search is in-memory (not suitable for >10,000 papers)
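The request spacing implied by the arXiv rate limit can be sketched as a small throttle. This is one possible approach, not a description of the Actor's internals: the `Throttle` class is hypothetical, and the 3-second interval is taken from the approximate limit listed above.

```python
import time


class Throttle:
    """Space out calls so that at least `interval` seconds separate them."""

    def __init__(self, interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until the interval since the previous call has elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` on a shared `Throttle` instance immediately before each API request would keep requests at least three seconds apart. The first call never sleeps.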
Support
- Apify Discord Community
- Report issues on GitHub
- Apify Academy
License
Apache 2.0
Built with ❤️ for researchers and AI developers