arXiv Papers Scraper with AI Topic Tags
Pricing
Pay per usage
Developer: Andrei (maintained by Community)
arXiv Papers Scraper with AI Tags
Search arXiv.org for academic papers by keyword, author, or category. Get clean structured data with optional AI-powered topic tagging via Claude. Perfect for literature reviews, research monitoring, and building academic datasets.
What this actor does
arXiv hosts more than 2 million papers, but its search interface is clunky and offers no direct way to export results. This actor addresses both problems:
- Full arXiv search syntax — search by keyword, title, abstract, authors, or category
- Category filter — restrict to specific fields (cs.AI, math.PR, physics.bio-ph, etc.)
- AI topic tagging — Claude reads each abstract and assigns 3-5 relevant tags (optional, BYOK)
- Citation extraction — pulls cited references from paper metadata when available
- Retry logic — handles arXiv API rate limits and transient errors gracefully
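For readers who want to see what the search syntax looks like on the wire, here is a minimal sketch of building an arXiv API query URL in Python. The endpoint, the field prefixes (ti:, au:, abs:, cat:, all:) and the sortBy/sortOrder parameters belong to arXiv's public API. The helper name build_query_url and the way the category filter is combined with the search terms are assumptions for illustration, not necessarily how this actor builds its requests.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(terms, category=None, max_results=20, sort_by="relevance"):
    """Build an arXiv API URL (hypothetical helper, not the actor's own code).

    `terms` may use arXiv field prefixes such as ti:, au:, abs: or all:.
    """
    search = terms
    if category:
        # cat: restricts results to a single arXiv category, e.g. cs.AI
        search = f"({terms}) AND cat:{category}"
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": "descending",
    }
    # urlencode percent-encodes quotes, colons and parentheses
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url('ti:"attention mechanism"', category="cs.AI", max_results=5))
```

Keeping the category as a separate argument mirrors the actor's input, where searchQuery and category are distinct fields.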
Quick start
Just search for something:
{
  "searchQuery": "transformer attention mechanism",
  "maxResults": 20
}
That's it. The actor will return up to 20 papers matching your query with full metadata.
Input fields
- searchQuery (required) — Search terms (keyword, author, title, or arXiv ID)
- category — Filter by arXiv category code (cs.AI, math.ST, etc., leave empty for all)
- maxResults — Number of papers to fetch (default 50, max 1000)
- sortBy — Sort by relevance, lastUpdatedDate, or submittedDate (default relevance)
- enableAiTags — Generate AI topic tags for each paper (default false)
- anthropicApiKey — Your Anthropic API key (BYOK, required if AI tags enabled)
- extractCitations — Pull cited references metadata when available (default true)
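If you generate inputs programmatically, it can help to check them before starting a run. The sketch below applies the defaults and limits listed above. The function name normalize_input, the lower bound of 1 for maxResults, and the exact error messages are assumptions; the actor's own validation may differ.

```python
ALLOWED_SORT = {"relevance", "lastUpdatedDate", "submittedDate"}


def normalize_input(raw):
    """Apply the documented defaults and checks to an input dict.

    Hypothetical helper for illustration; not part of the actor.
    """
    query = (raw.get("searchQuery") or "").strip()
    if not query:
        raise ValueError("searchQuery is required")

    max_results = int(raw.get("maxResults", 50))
    # The upper bound of 1000 is documented; the lower bound of 1 is assumed
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")

    sort_by = raw.get("sortBy", "relevance")
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")

    enable_ai = bool(raw.get("enableAiTags", False))
    api_key = raw.get("anthropicApiKey")
    if enable_ai and not api_key:
        raise ValueError("anthropicApiKey is required when enableAiTags is true")

    result = {
        "searchQuery": query,
        "category": raw.get("category") or "",
        "maxResults": max_results,
        "sortBy": sort_by,
        "enableAiTags": enable_ai,
        "extractCitations": bool(raw.get("extractCitations", True)),
    }
    if enable_ai:
        result["anthropicApiKey"] = api_key
    return result
```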
Output format
Each item in the dataset:
{
  "id": "2412.01234",
  "title": "Attention Is All You Need: A Survey",
  "authors": ["Vaswani A.", "Shazeer N."],
  "abstract": "The dominant sequence transduction models...",
  "publishedDate": "2024-12-01",
  "updatedDate": "2024-12-15",
  "pdfUrl": "https://arxiv.org/pdf/2412.01234.pdf",
  "absUrl": "https://arxiv.org/abs/2412.01234",
  "categories": ["cs.LG", "cs.AI"],
  "primaryCategory": "cs.LG",
  "doi": "10.xxxx/yyyy",
  "comment": "Accepted at NeurIPS 2024",
  "journalRef": null,
  "aiTags": ["transformer architecture", "attention mechanism", "survey paper"],
  "citationCount": 12
}
The aiTags field is present only when enableAiTags is true.
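Most of these fields correspond closely to elements in the Atom feed that arXiv's API returns. The sketch below shows one way such a mapping could look using only the Python standard library. The two XML namespaces and the element names are arXiv's. The function names, the version-suffix stripping, and the date truncation are assumptions about how the output shown here might be produced. The URL fields, aiTags and citationCount are left out.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def entry_to_item(entry):
    """Map one Atom <entry> element to a dict shaped like the item above."""

    def text(path):
        node = entry.find(path, NS)
        if node is None or not node.text:
            return None
        # Collapse the line breaks arXiv inserts into titles and abstracts
        return " ".join(node.text.split())

    def term(path):
        node = entry.find(path, NS)
        return node.get("term") if node is not None else None

    abs_url = text("atom:id") or ""
    versioned_id = abs_url.rsplit("/abs/", 1)[-1]
    published = text("atom:published")
    updated = text("atom:updated")
    return {
        # Drop the trailing version suffix, e.g. "v2"
        "id": re.sub(r"v\d+$", "", versioned_id) or None,
        "title": text("atom:title"),
        "authors": [
            " ".join(name.text.split())
            for name in entry.findall("atom:author/atom:name", NS)
            if name.text
        ],
        "abstract": text("atom:summary"),
        # Keep only the YYYY-MM-DD part of the ISO timestamp
        "publishedDate": published[:10] if published else None,
        "updatedDate": updated[:10] if updated else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "primaryCategory": term("arxiv:primary_category"),
        "doi": text("arxiv:doi"),
        "comment": text("arxiv:comment"),
        "journalRef": text("arxiv:journal_ref"),
    }


def parse_feed(xml_text):
    """Parse a full Atom feed string into a list of item dicts."""
    root = ET.fromstring(xml_text)
    return [entry_to_item(e) for e in root.findall("atom:entry", NS)]
```

Optional elements such as doi, comment and journal_ref are simply absent from the feed when a paper has none, which is why the helper returns None for them.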
Use cases
Literature review — Pull all papers on your research topic from the last 6 months in one query.
Research monitoring — Schedule daily runs to track new arXiv submissions in your field.
Dataset building — Collect abstracts and metadata for training NLP models on academic text.
Trend analysis — Aggregate AI tags across thousands of papers to spot emerging research topics.
Citation tracking — Build citation graphs from extracted references for bibliometric studies.
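As an illustration of the trend analysis use case, the following sketch counts how many papers carry each AI tag in a list of dataset items. It assumes the items have already been downloaded and loaded as Python dicts in the output format shown above. The function name top_tags and the choice to lowercase tags before counting are assumptions.

```python
from collections import Counter


def top_tags(items, n=10):
    """Return the n most common AI tags as (tag, paper_count) pairs."""
    counts = Counter()
    for item in items:
        # aiTags is missing when AI tagging was disabled for the run.
        # A set counts each tag at most once per paper.
        tags = {tag.strip().lower() for tag in item.get("aiTags", [])}
        counts.update(tags)
    return counts.most_common(n)
```

Counting papers rather than raw tag occurrences keeps a single paper with repeated or differently cased tags from skewing the totals.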
Technical notes
- Uses arXiv's official Atom API — fully ToS-compliant, no scraping
- Automatic retry with exponential backoff for rate limits (arXiv's API terms ask for no more than one request every 3 seconds)
- AI tagging uses Claude Haiku 4.5 (fast and cheap, ~$0.001 per paper)
- arXiv releases article metadata, including abstracts, under CC0 1.0 (public domain dedication); the papers themselves carry their own licenses
- Citation extraction works only for papers with structured reference metadata
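The retry behaviour mentioned above can be sketched roughly as follows. This is a generic exponential backoff loop, not the actor's actual code. The name fetch_with_backoff, the 3-second base delay, the attempt limit, and the decision to retry on any exception are all assumptions chosen to keep the example short.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=3.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    `fetch` is any zero-argument callable that raises on failure.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            # Assumption: every error is treated as transient.
            # Real code would likely retry only on timeouts and 429/5xx responses.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 3, 6, 12 and 24 seconds before the final error is re-raised.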
Pricing
Currently free during early access. Pay-per-paper pricing will be enabled later.
Support
Found a bug or have a feature request? Contact the developer through the actor's page on Apify.