Pricing

from $5.00 / 1,000 relevant papers found

Arxiv Semantic Search

Scrape arXiv papers by category and find relevant research using AI-powered semantic search. Get papers from any field (AI, physics, biology, economics, etc.) with embeddings for RAG systems. Find your categories at: https://arxiv.org/category_taxonomy

Developer

Mohamed Aouad

Last modified: 2 months ago

ArXiv Scraper & Semantic Search

Scrape academic papers from arXiv.org by category and perform semantic search on abstracts using AI-powered embeddings. Perfect for researchers, literature reviews, and building AI agents that need access to scientific papers.

Features

📚 Scrape arXiv papers by category (quantum physics, AI, condensed matter, etc.)
🔍 Semantic search using Sentence-BERT embeddings (384-dimensional vectors)
🎯 Ranked results by cosine similarity to your query
📅 Date filtering to get papers from specific time periods
⚡ Fast & efficient - processes 50 papers in ~30 seconds
🔄 Automatic retries with exponential backoff for API failures

Use Cases

Literature reviews - Find papers similar to your research topic
Research discovery - Explore related work in your field
AI agents - Build RAG systems with scientific knowledge
Citation analysis - Track papers in specific domains
Trend monitoring - Monitor new papers in your categories

How It Works

Fetch papers from arXiv API using category filters
Generate embeddings for paper abstracts using Sentence-BERT (all-MiniLM-L6-v2)
Search semantically by comparing query embedding to paper embeddings
Return ranked results sorted by similarity score
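
The ranking step above can be sketched in a few lines. This is a minimal illustration, not the Actor's own code: it uses tiny hand-made 3-dimensional vectors in place of the 384-dimensional all-MiniLM-L6-v2 embeddings, and the function names `cosine_similarity` and `rank_papers` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_papers(query_embedding, papers, top_k=10):
    """Return the top_k papers sorted by similarity to the query, best first."""
    scored = [
        {**paper, "similarity_score": cosine_similarity(query_embedding, paper["embedding"])}
        for paper in papers
    ]
    scored.sort(key=lambda p: p["similarity_score"], reverse=True)
    return scored[:top_k]


# Toy data: real embeddings would come from the sentence-embedding model.
papers = [
    {"id": "a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "embedding": [0.7, 0.7, 0.0]},
]
ranked = rank_papers([1.0, 0.1, 0.0], papers, top_k=2)
print([p["id"] for p in ranked])  # paper "a" ranks first, then "c"
```

Because cosine similarity ignores vector length, only the direction of each embedding matters, which is why paper "c" still outranks "b" despite its smaller components along the query direction.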

🚀 Quick Start for Your Research Domain

New to arXiv categories? We've got you covered:

📖 CATEGORY_GUIDE.md - Find categories for your field
📋 QUICK_START_EXAMPLES.md - Copy-paste ready configurations
📚 USER_GUIDE.md - Complete usage guide with workflows

Popular Research Domains

Your Field	Categories to Use	Example
AI/Machine Learning	`cs.AI`, `cs.LG`, `cs.CL`	LLMs, transformers, neural networks
Physics	`quant-ph`, `cond-mat`	Quantum computing, semiconductors
Biology	`q-bio.NC`, `q-bio.BM`	Neuroscience, protein folding
Economics	`econ.EM`, `stat.ME`	Econometrics, causal inference
Math/Statistics	`stat.ML`, `math.ST`	Statistical methods, probability

Input Parameters

Parameter	Type	Default	Description
`categories`	array	`["cs.AI"]`	arXiv categories to scrape (see CATEGORY_GUIDE.md)
`maxPapers`	integer	`100`	Papers per category (1-1000)
`startDate`	string	`null`	Start date (YYYY-MM-DD), e.g., "2024-01-01"
`endDate`	string	`null`	End date (YYYY-MM-DD), e.g., "2024-12-31"
`enableSemanticSearch`	boolean	`false`	Rank papers by relevance to your query
`searchQuery`	string	`null`	Plain English query, e.g., "quantum computing"
`topK`	integer	`10`	Number of top results (1-100)

Full category list: arXiv category taxonomy (https://arxiv.org/category_taxonomy) or CATEGORY_GUIDE.md
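
The defaults and ranges in the table can be checked before a run is started. The sketch below is one possible client-side check, not the Actor's own validation: `validate_input` is a hypothetical helper, and the rule that `searchQuery` is required when `enableSemanticSearch` is true is an assumption.

```python
from datetime import date

# Defaults copied from the parameter table above.
DEFAULTS = {
    "categories": ["cs.AI"],
    "maxPapers": 100,
    "startDate": None,
    "endDate": None,
    "enableSemanticSearch": False,
    "searchQuery": None,
    "topK": 10,
}


def validate_input(raw):
    """Merge raw input with the documented defaults and check the documented ranges."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {**DEFAULTS, **raw}

    if not cfg["categories"]:
        raise ValueError("categories must contain at least one arXiv category")
    if not 1 <= cfg["maxPapers"] <= 1000:
        raise ValueError("maxPapers must be between 1 and 1000")
    if not 1 <= cfg["topK"] <= 100:
        raise ValueError("topK must be between 1 and 100")

    # date.fromisoformat raises ValueError for strings that are not valid ISO dates.
    start = date.fromisoformat(cfg["startDate"]) if cfg["startDate"] else None
    end = date.fromisoformat(cfg["endDate"]) if cfg["endDate"] else None
    if start and end and start > end:
        raise ValueError("startDate must not be after endDate")

    # Assumption: a query is needed once semantic search is switched on.
    if cfg["enableSemanticSearch"] and not cfg["searchQuery"]:
        raise ValueError("searchQuery is required when enableSemanticSearch is true")
    return cfg


config = validate_input({"categories": ["quant-ph"], "maxPapers": 50})
print(config["topK"])  # falls back to the default of 10
```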

Output Format

Each paper in the dataset contains:

{
  "id": "2512.05101v1",
  "title": "Decoy-state quantum key distribution over 227 km...",
  "abstract": "We demonstrate quantum key distribution using...",
  "authors": ["John Doe", "Jane Smith"],
  "published": "2024-12-05T10:30:00Z",
  "updated": "2024-12-05T10:30:00Z",
  "categories": ["quant-ph", "physics.optics"],
  "pdf_url": "https://arxiv.org/pdf/2512.05101v1",
  "arxiv_url": "http://arxiv.org/abs/2512.05101v1",
  "embedding": [0.123, -0.456, ...],  // 384-dim vector (if enableSemanticSearch=true)
  "similarity_score": 0.87  // Only present if searchQuery was provided
}
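
Since `embedding` and `similarity_score` are optional, downstream code should not assume they exist. As a rough sketch of feeding these records into a RAG index, the hypothetical helper `to_rag_documents` below keeps only records with a full 384-dimensional embedding and, optionally, a minimum score. The output shape (`text`, `vector`, `metadata`) is an assumption, not something the Actor defines.

```python
EMBEDDING_DIM = 384  # dimension stated for all-MiniLM-L6-v2


def to_rag_documents(records, min_score=None):
    """Convert dataset records into documents suitable for a vector store."""
    docs = []
    for rec in records:
        embedding = rec.get("embedding")
        if embedding is None or len(embedding) != EMBEDDING_DIM:
            continue  # scraped without semantic search, or malformed
        score = rec.get("similarity_score")
        if min_score is not None and (score is None or score < min_score):
            continue  # below the relevance threshold
        docs.append({
            "id": rec["id"],
            "text": f'{rec["title"]}\n\n{rec["abstract"]}',
            "vector": embedding,
            "metadata": {
                "authors": rec["authors"],
                "url": rec["arxiv_url"],
                "score": score,
            },
        })
    return docs
```

Calling `to_rag_documents(items, min_score=0.5)` on the dataset items would drop records that were scraped without embeddings as well as weak matches.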

Usage Examples

Example 1: Scrape Recent Quantum Physics Papers

{
  "categories": ["quant-ph"],
  "maxPapers": 50,
  "startDate": "2024-12-01",
  "endDate": "2024-12-05"
}

Example 2: Semantic Search for Topological Quantum Computing

{
  "categories": ["quant-ph", "cond-mat.mes-hall"],
  "maxPapers": 100,
  "enableSemanticSearch": true,
  "searchQuery": "topological quantum computing and anyons",
  "topK": 10
}

Example 3: Find AI Papers on Transformers

{
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxPapers": 200,
  "enableSemanticSearch": true,
  "searchQuery": "transformer architecture attention mechanisms",
  "topK": 20
}
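
The same inputs can be sent from code. The sketch below assumes the `apify-client` Python package (1.x, where `call` returns a dictionary) and an API token in the `APIFY_TOKEN` environment variable. `build_run_input` and `run_search` are hypothetical helpers, and the Actor ID is not given in this README, so it must be supplied by the caller.

```python
import os


def build_run_input(query, categories, max_papers=100, top_k=10):
    """Assemble an input object using the parameter names documented above."""
    return {
        "categories": list(categories),
        "maxPapers": max_papers,
        "enableSemanticSearch": True,
        "searchQuery": query,
        "topK": top_k,
    }


def run_search(actor_id, run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported here so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(
    "transformer architecture attention mechanisms",
    ["cs.AI", "cs.LG", "cs.CL"],
    max_papers=200,
    top_k=20,
)
print(run_input)
```

Passing this `run_input` to `run_search` together with the Actor's "username/actor-name" ID from its store page would reproduce Example 3.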

Getting Started

Run Locally

# Install Apify CLI
npm install -g apify-cli

# Run the Actor locally (from the Actor's project directory)
apify run

Deploy to Apify

# Login to Apify
apify login

# Deploy the Actor
apify push

Performance

Scraping: ~100 papers/minute from arXiv API
Embeddings: ~30 seconds for 50 papers (first run downloads model)
Search: <1 second for 100 papers (in-memory cosine similarity)
Memory: ~500MB (includes PyTorch + Sentence-BERT model)

Technical Details

Embedding Model: sentence-transformers/all-MiniLM-L6-v2
- Dimensions: 384
- Speed: ~1000 sentences/second
- Quality: High semantic similarity accuracy
Search Algorithm: Cosine similarity (scikit-learn)
API: arXiv Atom API (no authentication required)
Rate Limiting: Automatic retry with exponential backoff
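
The Actor's request code is not shown in this README. As a rough sketch of what a category query against the public arXiv API looks like, the hypothetical helper `build_query_url` below assembles a request URL. The `cat:` prefix, the `submittedDate` range syntax, and the `sortBy`/`sortOrder` parameters come from the arXiv API user manual. Whether the Actor uses exactly these is an assumption.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category, start=0, max_results=100, start_date=None, end_date=None):
    """Build an arXiv API URL for one category, newest submissions first."""
    query = f"cat:{category}"
    # Assumption: the date filter is applied only when both dates are given.
    if start_date and end_date:
        lo = start_date.replace("-", "") + "0000"  # YYYYMMDD + HHMM (GMT)
        hi = end_date.replace("-", "") + "2359"
        query += f" AND submittedDate:[{lo} TO {hi}]"
    params = {
        "search_query": query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("quant-ph", max_results=50,
                      start_date="2024-12-01", end_date="2024-12-05")
print(url)
```

The response is an Atom feed, so each `entry` element would then be parsed into the fields listed under Output Format.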

Error Handling

The Actor gracefully handles:

✅ Network failures (3 retries with exponential backoff)
✅ Invalid categories (warns but continues)
✅ Empty results (returns empty dataset)
✅ Missing embeddings (falls back to scraping only)
✅ Invalid input parameters (clear error messages)
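
The retry behavior for network failures can be illustrated as follows. This is a generic sketch of exponential backoff, not the Actor's implementation: `retry_with_backoff` is a hypothetical helper, and the 1-second base delay and the choice to retry on `OSError` (which covers connection and timeout errors) are assumptions.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a network error retry up to max_retries times, doubling the delay."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except OSError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demo with a simulated failure and a fake sleep so it runs instantly.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)  # prints: ok [1.0, 2.0]
```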

Limitations

arXiv API rate limit: ~1 request/3 seconds (handled automatically)
Maximum papers per request: 1000
Embedding generation requires ~500MB RAM
Search is in-memory (not suitable for >10,000 papers)
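
For anyone calling the arXiv API directly, the ~3-second spacing between requests can be enforced with a small throttle. The class below is a hypothetical sketch, not part of the Actor. The clock and sleep functions are injectable so the demo runs instantly with a simulated clock.

```python
import time


class Throttle:
    """Ensure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a simulated clock: the second request arrives 1 second after the first.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(3.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()      # first request goes through immediately
fake_now[0] += 1.0   # one second passes
throttle.wait()      # must wait out the remaining two seconds
print(slept)
```

In real use, `Throttle()` with its defaults would be created once and `wait()` called before every request.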

Support

💬 Apify Discord Community
📧 Report issues on GitHub
📚 Apify Academy

License

Apache 2.0

Built with ❤️ for researchers and AI developers
