Arxiv Semantic Search

Pricing

from $5.00 / 1,000 relevant papers found

Scrape arXiv papers by category and find relevant research using AI-powered semantic search. Get papers from any field (AI, physics, biology, economics, etc.) with embeddings for RAG systems. Find your categories at: https://arxiv.org/category_taxonomy

Developer

Mohamed Aouad

Maintained by Community

ArXiv Scraper & Semantic Search

Scrape academic papers from arXiv.org by category and perform semantic search on abstracts using AI-powered embeddings. Perfect for researchers, literature reviews, and building AI agents that need access to scientific papers.

Features

  • 📚 Scrape arXiv papers by category (quantum physics, AI, condensed matter, etc.)
  • 🔍 Semantic search using Sentence-BERT embeddings (384-dimensional vectors)
  • 🎯 Ranked results by cosine similarity to your query
  • 📅 Date filtering to get papers from specific time periods
  • ⚡ Fast & efficient - processes 50 papers in ~30 seconds
  • 🔄 Automatic retries with exponential backoff for API failures

Use Cases

  • Literature reviews - Find papers similar to your research topic
  • Research discovery - Explore related work in your field
  • AI agents - Build RAG systems with scientific knowledge
  • Citation analysis - Track papers in specific domains
  • Trend monitoring - Monitor new papers in your categories

How It Works

  1. Fetch papers from arXiv API using category filters
  2. Generate embeddings for paper abstracts using Sentence-BERT (all-MiniLM-L6-v2)
  3. Search semantically by comparing query embedding to paper embeddings
  4. Return ranked results sorted by similarity score
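
As a rough illustration of steps 3 and 4, the sketch below ranks papers by cosine similarity in plain Python. It assumes the embeddings already exist: toy 3-dimensional vectors and a hand-written `cosine_similarity` stand in for the 384-dimensional Sentence-BERT vectors and the scikit-learn call the Actor uses. `rank_papers` is a hypothetical helper name, not the Actor's own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_papers(query_embedding, papers, top_k=10):
    """Attach a similarity_score to each paper and return the top_k best matches."""
    scored = [
        {**paper, "similarity_score": cosine_similarity(query_embedding, paper["embedding"])}
        for paper in papers
    ]
    scored.sort(key=lambda paper: paper["similarity_score"], reverse=True)
    return scored[:top_k]


# Toy data: real embeddings would have 384 dimensions.
papers = [
    {"id": "paper-a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "paper-b", "embedding": [0.0, 1.0, 0.0]},
    {"id": "paper-c", "embedding": [0.7, 0.7, 0.0]},
]
query = [1.0, 0.1, 0.0]

top = rank_papers(query, papers, top_k=2)
for paper in top:
    print(paper["id"], round(paper["similarity_score"], 3))
```

Because every paper is scored against the query in memory, the cost grows linearly with the number of papers.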

🚀 Quick Start for Your Research Domain

New to arXiv categories? We've got you covered:

  • 📖 CATEGORY_GUIDE.md - Find categories for your field
  • 📋 QUICK_START_EXAMPLES.md - Copy-paste ready configurations
  • 📚 USER_GUIDE.md - Complete usage guide with workflows

| Your Field | Categories to Use | Example |
|---|---|---|
| AI/Machine Learning | cs.AI, cs.LG, cs.CL | LLMs, transformers, neural networks |
| Physics | quant-ph, cond-mat | Quantum computing, semiconductors |
| Biology | q-bio.NC, q-bio.BM | Neuroscience, protein folding |
| Economics | econ.EM, stat.ME | Econometrics, causal inference |
| Math/Statistics | stat.ML, math.ST | Statistical methods, probability |

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | array | ["cs.AI"] | arXiv categories (see CATEGORY_GUIDE.md) |
| maxPapers | integer | 100 | Papers per category (1-1000) |
| startDate | string | null | Start date (YYYY-MM-DD), e.g., "2024-01-01" |
| endDate | string | null | End date (YYYY-MM-DD), e.g., "2024-12-31" |
| enableSemanticSearch | boolean | false | Rank papers by relevance to your query |
| searchQuery | string | null | Plain English query, e.g., "quantum computing" |
| topK | integer | 10 | Number of top results (1-100) |

Full category list: arXiv category taxonomy (https://arxiv.org/category_taxonomy) or CATEGORY_GUIDE.md
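
To check a configuration before starting a run, the documented defaults and ranges can be expressed as a small validator. This is a minimal sketch under stated assumptions: `validate_input` and `DEFAULTS` are hypothetical names, the checks cover only what the parameter table states, and the Actor's own validation may behave differently.

```python
from datetime import datetime

# Defaults copied from the parameter table.
DEFAULTS = {
    "categories": ["cs.AI"],
    "maxPapers": 100,
    "startDate": None,
    "endDate": None,
    "enableSemanticSearch": False,
    "searchQuery": None,
    "topK": 10,
}


def _int_in_range(value, low, high):
    """True if value is an int (not a bool) within [low, high]."""
    return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high


def validate_input(raw):
    """Merge user input with the defaults and check the documented ranges."""
    cfg = {**DEFAULTS, **raw}
    errors = []
    if not isinstance(cfg["categories"], list) or not cfg["categories"]:
        errors.append("categories must be a non-empty array")
    if not _int_in_range(cfg["maxPapers"], 1, 1000):
        errors.append("maxPapers must be an integer from 1 to 1000")
    if not _int_in_range(cfg["topK"], 1, 100):
        errors.append("topK must be an integer from 1 to 100")
    for key in ("startDate", "endDate"):
        if cfg[key] is not None:
            try:
                datetime.strptime(cfg[key], "%Y-%m-%d")
            except (TypeError, ValueError):
                errors.append(f"{key} must use the YYYY-MM-DD format")
    if errors:
        raise ValueError("; ".join(errors))
    return cfg


config = validate_input({"categories": ["quant-ph"], "maxPapers": 50})
print(config["topK"])
```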

Output Format

Each paper in the dataset contains:

{
  "id": "2512.05101v1",
  "title": "Decoy-state quantum key distribution over 227 km...",
  "abstract": "We demonstrate quantum key distribution using...",
  "authors": ["John Doe", "Jane Smith"],
  "published": "2024-12-05T10:30:00Z",
  "updated": "2024-12-05T10:30:00Z",
  "categories": ["quant-ph", "physics.optics"],
  "pdf_url": "https://arxiv.org/pdf/2512.05101v1",
  "arxiv_url": "http://arxiv.org/abs/2512.05101v1",
  "embedding": [0.123, -0.456, ...],  // 384-dim vector (if enableSemanticSearch=true)
  "similarity_score": 0.87            // Only present if searchQuery was provided
}
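
When consuming dataset items in Python, it can help to map each record onto a typed structure that treats `embedding` and `similarity_score` as optional, since they appear only under the conditions noted above. The `Paper` dataclass and `parse_paper` helper below are hypothetical names used for illustration. The field names come from the sample record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paper:
    id: str
    title: str
    abstract: str
    authors: List[str]
    published: str
    updated: str
    categories: List[str]
    pdf_url: str
    arxiv_url: str
    embedding: Optional[List[float]] = None      # absent when semantic search is off
    similarity_score: Optional[float] = None     # absent when no searchQuery was given


def parse_paper(item):
    """Build a Paper from one dataset item, tolerating the two optional fields."""
    return Paper(
        id=item["id"],
        title=item["title"],
        abstract=item["abstract"],
        authors=list(item["authors"]),
        published=item["published"],
        updated=item["updated"],
        categories=list(item["categories"]),
        pdf_url=item["pdf_url"],
        arxiv_url=item["arxiv_url"],
        embedding=item.get("embedding"),
        similarity_score=item.get("similarity_score"),
    )


# A scrape-only record: no embedding and no similarity score.
item = {
    "id": "2512.05101v1",
    "title": "Decoy-state quantum key distribution over 227 km...",
    "abstract": "We demonstrate quantum key distribution using...",
    "authors": ["John Doe", "Jane Smith"],
    "published": "2024-12-05T10:30:00Z",
    "updated": "2024-12-05T10:30:00Z",
    "categories": ["quant-ph", "physics.optics"],
    "pdf_url": "https://arxiv.org/pdf/2512.05101v1",
    "arxiv_url": "http://arxiv.org/abs/2512.05101v1",
}
paper = parse_paper(item)
print(paper.id, paper.categories[0], paper.similarity_score)
```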

Usage Examples

Example 1: Scrape Recent Quantum Physics Papers

{
  "categories": ["quant-ph"],
  "maxPapers": 50,
  "startDate": "2024-12-01",
  "endDate": "2024-12-05"
}

Example 2: Semantic Search for Topological Quantum Computing

{
  "categories": ["quant-ph", "cond-mat.mes-hall"],
  "maxPapers": 100,
  "enableSemanticSearch": true,
  "searchQuery": "topological quantum computing and anyons",
  "topK": 10
}

Example 3: Find AI Papers on Transformers

{
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxPapers": 200,
  "enableSemanticSearch": true,
  "searchQuery": "transformer architecture attention mechanisms",
  "topK": 20
}

Getting Started

Run Locally

# Install Apify CLI
npm install -g apify-cli
# Clone or create the Actor, then run it locally from the project directory
apify run

Deploy to Apify

# Login to Apify
apify login
# Deploy the Actor
apify push

Performance

  • Scraping: ~100 papers/minute from arXiv API
  • Embeddings: ~30 seconds for 50 papers (first run downloads model)
  • Search: <1 second for 100 papers (in-memory cosine similarity)
  • Memory: ~500MB (includes PyTorch + Sentence-BERT model)

Technical Details

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
    • Dimensions: 384
    • Speed: ~1000 sentences/second
    • Quality: High semantic similarity accuracy
  • Search Algorithm: Cosine similarity (scikit-learn)
  • API: arXiv Atom API (no authentication required)
  • Rate Limiting: Automatic retry with exponential backoff
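
The arXiv Atom API is queried over plain HTTP with a `search_query` parameter. The following is a minimal sketch of how a category and date-range request could be built. It assumes the public endpoint `http://export.arxiv.org/api/query` and the `cat:` and `submittedDate:` query fields described in the arXiv API manual. `build_query_url` is a hypothetical helper, and the Actor's own request construction may differ.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category, max_results=100, start=0, start_date=None, end_date=None):
    """Return an arXiv API URL for one category, newest submissions first.

    Dates are YYYY-MM-DD strings. In this sketch the date filter is applied
    only when both start_date and end_date are given.
    """
    search = f"cat:{category}"
    if start_date and end_date:
        # submittedDate ranges use the YYYYMMDDHHMM format.
        low = start_date.replace("-", "") + "0000"
        high = end_date.replace("-", "") + "2359"
        search += f" AND submittedDate:[{low} TO {high}]"
    params = {
        "search_query": search,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("quant-ph", max_results=50,
                      start_date="2024-12-01", end_date="2024-12-05")
print(url)
```

The response is an Atom feed, so it can be parsed with any XML or feed parser.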

Error Handling

The Actor gracefully handles:

  • ✅ Network failures (3 retries with exponential backoff)
  • ✅ Invalid categories (warns but continues)
  • ✅ Empty results (returns empty dataset)
  • ✅ Missing embeddings (falls back to scraping only)
  • ✅ Invalid input parameters (clear error messages)
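
For readers unfamiliar with the pattern, retrying with exponential backoff can be sketched as below. The retry count of 3 comes from the list above. The 1-second base delay, the doubling factor, and the `retry_with_backoff` name are assumptions for illustration, not the Actor's actual settings.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a network-style error wait 1x, 2x, 4x... base_delay and retry.

    The operation is attempted once plus up to `retries` more times. The last
    error is re-raised if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: an operation that fails twice and then succeeds.
# Delays are recorded instead of actually sleeping.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.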

Limitations

  • arXiv API rate limit: ~1 request/3 seconds (handled automatically)
  • Maximum papers per request: 1000
  • Embedding generation requires ~500MB RAM
  • Search is in-memory (not suitable for >10,000 papers)
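
To make the first two limits concrete, the sketch below splits a large request into pages and spaces calls at least 3 seconds apart. `Throttle` and `page_ranges` are hypothetical helpers, and the page size of 100 is an assumption. The Actor handles rate limiting itself, so this only illustrates the idea.

```python
import time


class Throttle:
    """Ensure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def page_ranges(total, page_size=100):
    """Return (start, max_results) pairs that cover `total` papers."""
    return [(start, min(page_size, total - start)) for start in range(0, total, page_size)]


pages = page_ranges(250, page_size=100)
print(pages)

# Demo with a fake clock so nothing actually sleeps.
fake_time = {"now": 0.0}
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time["now"] += seconds


throttle = Throttle(min_interval=3.0, clock=lambda: fake_time["now"], sleep=fake_sleep)
for start, max_results in pages:
    throttle.wait()  # a real client would issue the request for this page here
print(slept)
```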

License

Apache 2.0


Built with ❤️ for researchers and AI developers