arXiv Paper Scraper - AI ML Research Papers
Pricing
$5.00 / 1,000 papers scraped
Scrape arXiv research papers by keyword, category, or author. Extracts titles, abstracts, authors, categories, DOIs, and other metadata. Perfect for AI/ML research monitoring, literature reviews, and LLM training data collection.
Developer
OpenClaw Mara
18 days ago
Last modified
📄 arXiv Paper Scraper — Academic Research at Scale
Structured data from 2.4M+ open-access papers across physics, math, CS, bio, and more. $0.005 per paper.
Scrape arXiv — the largest preprint server on the internet — for papers, abstracts, authors, and metadata. Uses the official arXiv API for fast, structured extraction. No authentication required.
Perfect for research monitoring, LLM training corpora, literature review automation, citation network building, and keeping up with cutting-edge AI/ML research as it drops.
🚀 What does this Actor do?
arXiv is where researchers post first, before (or instead of) journals; most major ML/AI papers appear on arXiv before they reach a conference. This Actor turns arXiv into a programmable source with four modes:
- search — Full-text search with category filters and date/relevance sorting.
- new_submissions — Today's freshly submitted papers in a given category — ideal for daily research-trend feeds.
- paper_details — Fetch full metadata for specific papers by arXiv ID.
- author — All publications by a specific researcher.
Every paper comes back with title, abstract, authors, categories, DOI, PDF URL, publication dates, and journal reference when available — ready to drop into a vector DB, a research dashboard, or a fine-tuning corpus.
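Since the Actor is described as using the official arXiv API, it can help to see what the underlying queries look like. The sketch below builds arXiv API URLs for three of the modes. The function name `build_query_url` and the mode-to-query mapping are assumptions made for illustration, not the Actor's actual internals. `new_submissions` is left out because it reads a daily feed rather than running a search query.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_query_url(mode, *, search_query=None, category=None, arxiv_ids=None,
                    author_name=None, max_results=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Hypothetical mode-to-URL mapping; not the Actor's actual internals."""
    if mode == "paper_details":
        # id_list takes a comma-separated list of arXiv IDs.
        params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
        return f"{ARXIV_API}?{urlencode(params)}"

    if mode == "search":
        terms = [f'all:"{search_query}"']   # all: matches across every field
    elif mode == "author":
        terms = [f'au:"{author_name}"']     # au: restricts to author names
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    if category:
        terms.append(f"cat:{category}")     # cat: restricts to a subject category

    params = {
        "search_query": " AND ".join(terms),
        "start": 0,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Calling `build_query_url("author", author_name="Yann LeCun", max_results=200)` yields a URL whose `search_query` restricts results to that author name, sorted newest first.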
💡 Use Cases
1. Daily AI research feed
Monitor cs.AI, cs.LG, or cs.CL for new papers as they drop. Push to Slack, a newsletter, or a vector DB for daily retrieval.
{"mode": "new_submissions","category": "cs.AI","maxResults": 100}
2. Literature review automation
Pull every paper on a topic across the last N years and feed abstracts into an LLM for summarization or clustering.
{"mode": "search","searchQuery": "retrieval augmented generation","maxResults": 500,"sortBy": "submittedDate","sortOrder": "descending"}
3. Author / lab tracking
Track a specific researcher's output — great for following a competitor lab or an advisor's publication list.
{"mode": "author","authorName": "Yann LeCun","maxResults": 200}
4. Paper detail enrichment
You already have arXiv IDs from somewhere else (citations, a bibliography, a dataset) — pull full metadata in bulk.
{"mode": "paper_details","arxivIds": ["1706.03762", "2301.00234", "2005.14165"]}
📊 Output Example
{
  "arxivId": "1706.03762",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit"],
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "published": "2017-06-12T17:57:34Z",
  "updated": "2023-08-02T00:00:00Z",
  "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
  "htmlUrl": "http://arxiv.org/abs/1706.03762v7",
  "doi": "10.48550/arXiv.1706.03762",
  "comment": "15 pages, 5 tables",
  "journalRef": "Advances in Neural Information Processing Systems 30 (2017)"
}
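A record like the one above can be loaded into a typed structure before further processing. This is a minimal sketch: the `Paper` dataclass and the helper names are hypothetical, and it assumes `doi` and `journalRef` may be missing from some records.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Paper:
    arxiv_id: str
    title: str
    authors: List[str]
    primary_category: str
    published: datetime
    updated: datetime
    pdf_url: str
    doi: Optional[str] = None          # assumed optional
    journal_ref: Optional[str] = None  # present only "when available"


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def paper_from_record(record: dict) -> Paper:
    """Convert one output record (a dict parsed from JSON) into a Paper."""
    return Paper(
        arxiv_id=record["arxivId"],
        title=record["title"],
        authors=list(record["authors"]),
        primary_category=record["primaryCategory"],
        published=parse_timestamp(record["published"]),
        updated=parse_timestamp(record["updated"]),
        pdf_url=record["pdfUrl"],
        doi=record.get("doi"),
        journal_ref=record.get("journalRef"),
    )
```

Parsing the timestamps up front makes later steps such as sorting by date or filtering on a time window straightforward.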
⚙️ Input Parameters
| Parameter | Type | Description |
|---|---|---|
| mode | enum | search, new_submissions, paper_details, or author (required) |
| searchQuery | string | Keywords/phrase (search mode) |
| category | string | arXiv category code (cs.AI, cs.LG, math.CO, physics.hep-ph, etc.); 150+ categories |
| arxivIds | array | List of IDs for paper_details mode: ["1706.03762", "2301.00234"] |
| authorName | string | Full author name for author mode |
| maxResults | int | 1–500 (default 50) |
| sortBy | enum | submittedDate, relevance, lastUpdatedDate |
| sortOrder | enum | descending (default) or ascending |
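Checking an input object against the table before starting a run can catch mistakes early. The sketch below is one way to do that on the client side. The `validate_input` helper and the `REQUIRED_FIELD` mapping are assumptions inferred from the descriptions above; only `mode` is explicitly marked as required, so the Actor itself may be more lenient.

```python
VALID_MODES = {"search", "new_submissions", "paper_details", "author"}
VALID_SORT_BY = {"submittedDate", "relevance", "lastUpdatedDate"}
VALID_SORT_ORDER = {"descending", "ascending"}

# Assumed: the field each mode relies on, inferred from the table descriptions.
REQUIRED_FIELD = {
    "search": "searchQuery",
    "new_submissions": "category",
    "paper_details": "arxivIds",
    "author": "authorName",
}


def validate_input(run_input: dict) -> dict:
    """Return a copy of run_input with defaults applied, or raise ValueError."""
    mode = run_input.get("mode")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

    needed = REQUIRED_FIELD[mode]
    if not run_input.get(needed):
        raise ValueError(f"mode '{mode}' needs a non-empty '{needed}'")

    max_results = run_input.get("maxResults", 50)
    if not isinstance(max_results, int) or not 1 <= max_results <= 500:
        raise ValueError("maxResults must be an integer from 1 to 500")

    sort_by = run_input.get("sortBy")
    if sort_by is not None and sort_by not in VALID_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(VALID_SORT_BY)}")

    sort_order = run_input.get("sortOrder", "descending")
    if sort_order not in VALID_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(VALID_SORT_ORDER)}")

    return {**run_input, "maxResults": max_results, "sortOrder": sort_order}
```

For example, `validate_input({"mode": "new_submissions", "category": "cs.AI"})` fills in `maxResults` of 50 and `sortOrder` of `descending`.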
📤 Output Fields
| Field | Description |
|---|---|
| arxivId | Paper ID (covers both the 4-digit 0704.0001 form and the 5-digit 2301.00234 form) |
| title, abstract | Full text |
| authors[] | Ordered author list |
| primaryCategory, categories[] | arXiv subject classifications |
| published, updated | ISO timestamps |
| pdfUrl, htmlUrl | Direct links |
| doi | Persistent DOI |
| comment | Author comment (e.g., "Accepted at NeurIPS 2024") |
| journalRef | Journal / conference reference when available |
💰 Pricing & Performance
- Pay-per-event: $0.005 per paper.
- Typical monthly cost: about $7.50–$15 for daily cs.AI/cs.LG monitoring at 50–100 papers/day over a 30-day month.
- Speed: ~100 papers/minute via the official arXiv API. No rate-limit surprises.
- No auth required — arXiv's API is fully open.
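Because billing is a flat per-paper rate, run costs can be estimated with simple arithmetic. The helper below is an illustrative sketch; the name `estimate_cost` is hypothetical, and the only figure taken from this page is the $0.005 price per paper.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.005")  # pay-per-event price listed above


def estimate_cost(papers_per_day: int, days: int = 30) -> Decimal:
    """Estimated cost in USD for a steady daily volume over `days` days."""
    return PRICE_PER_PAPER * papers_per_day * days
```

As a sanity check, `estimate_cost(1000, days=1)` comes to $5, which matches the listed price of $5.00 per 1,000 papers. `Decimal` is used so the result is not affected by floating-point rounding.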
🔌 Integrations
- Zapier / Make / n8n: daily new-submission digest to Slack, email, or Notion.
- LangChain / LlamaIndex: feed abstracts into RAG for a "research assistant" that answers questions about the literature.
- Vector DBs (Pinecone, Weaviate, Qdrant, pgvector): embed abstracts for semantic search over arXiv.
- LLM fine-tuning corpora: bulk-scrape a domain (e.g., all of cs.CL over 5 years) as training data.
- Semantic Scholar scraper (companion): combine with citation data to build citation networks. arXiv doesn't ship citation counts; Semantic Scholar does.
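For the vector DB and RAG integrations, each record usually needs to be reshaped into a text payload plus metadata before embedding. The sketch below shows one possible shape. The `to_document` helper and the `id` / `text` / `metadata` layout are assumptions for illustration and are not tied to any particular vector DB client.

```python
def to_document(record: dict) -> dict:
    """Turn one Actor output record into an embedding-ready document."""
    # Title and abstract together give the embedding model the most signal.
    text = f"{record['title'].strip()}\n\n{record['abstract'].strip()}"
    metadata = {
        "arxivId": record["arxivId"],
        "authors": record.get("authors", []),
        "primaryCategory": record.get("primaryCategory"),
        "published": record.get("published"),
        "pdfUrl": record.get("pdfUrl"),
    }
    return {"id": record["arxivId"], "text": text, "metadata": metadata}
```

Using `arxivId` as the document ID means re-running the Actor and upserting the results overwrites existing entries instead of creating duplicates, provided the target store supports upserts by ID.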
🏷️ Popular Categories
- cs.AI: Artificial Intelligence
- cs.LG: Machine Learning
- cs.CL: Computational Linguistics / NLP
- cs.CV: Computer Vision
- cs.SE: Software Engineering
- stat.ML: Machine Learning (statistics)
- math.CO: Combinatorics
- q-bio.QM: Quantitative Methods in Biology
- Full list: https://arxiv.org/category_taxonomy
❓ FAQ
How fresh are new_submissions?
Daily. arXiv updates the new-submissions feed once per day (US business days). This mode hits that feed directly.
Does this return the full PDF text?
No — it returns metadata + abstract + a direct pdfUrl. For full-text extraction, pass the PDF URL to a separate PDF-extraction Actor.
Can I filter by language?
arXiv is predominantly English. Non-English papers are rare and usually have English abstracts as well.
Do old-format IDs (e.g., hep-th/0001001) work?
Yes — both 0704.0001 / 2301.00234 (new format) and hep-th/0001001 (pre-2007 format) are supported in paper_details.
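When IDs come from mixed sources such as bibliographies or citation dumps, it can be useful to check their shape before passing them to paper_details. The patterns below are a sketch based on the published arXiv identifier scheme. The function name `classify_arxiv_id` is hypothetical, and the Actor's own validation may differ.

```python
import re

# YYMM.NNNN (2007 to 2014) or YYMM.NNNNN (2015 onward), optional version suffix.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(?:v\d+)?$")
# archive[.SC]/YYMMNNN, e.g. hep-th/0001001 or math.GT/0309136.
OLD_STYLE = re.compile(r"^[a-z-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?$")


def classify_arxiv_id(arxiv_id: str) -> str:
    """Return "new", "old", or "invalid" for a candidate arXiv ID."""
    if NEW_STYLE.match(arxiv_id):
        return "new"
    if OLD_STYLE.match(arxiv_id):
        return "old"
    return "invalid"
```

Both `0704.0001` and `2301.00234` classify as `new`, while `hep-th/0001001` classifies as `old`.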
What's the difference between submittedDate and lastUpdatedDate?
submittedDate = when the paper was first posted. lastUpdatedDate = when the most recent version (v2, v3, …) was posted. Use submittedDate for "new papers," lastUpdatedDate for "recently revised."
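The same distinction can be applied to results after a run. The sketch below separates first-version papers from revised ones, assuming the output fields `published` and `updated` carry the first-submission and latest-version timestamps respectively. The helper name `split_new_and_revised` is hypothetical.

```python
from datetime import datetime


def _parse(timestamp: str) -> datetime:
    # Normalise the trailing "Z" so fromisoformat() works before Python 3.11.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def split_new_and_revised(records):
    """Split records into (never revised, revised at least once)."""
    new, revised = [], []
    for record in records:
        if _parse(record["updated"]) > _parse(record["published"]):
            revised.append(record)
        else:
            new.append(record)
    return new, revised
```

This is handy when a run sorted by lastUpdatedDate returns a mix of fresh submissions and later versions of older papers.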
Rate limits?
arXiv recommends ≤1 request per 3 seconds. The Actor handles pacing internally; you just set maxResults and wait.
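For anyone calling the arXiv API directly alongside the Actor, the same pacing rule is easy to reproduce. The class below is a generic sketch of spacing requests at least 3 seconds apart. The name `Pacer` is hypothetical and this is not the Actor's actual pacing code.

```python
import time


class Pacer:
    """Blocks so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `pacer.wait()` immediately before each request. The first call returns at once, and later calls sleep only for whatever part of the 3-second interval has not already elapsed.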
🔑 Keywords
arXiv scraper, academic paper scraper, research paper API, arXiv API alternative, ML papers data, AI research monitoring, literature review automation, preprint scraper, scientific paper extraction, cs.AI scraper, cs.LG scraper, research trend tracking, paper metadata extraction, arXiv bulk download, RAG over research papers, research corpus builder, citation network.
📝 Changelog
- v1.0 — Initial release. 4 modes (search, new_submissions, paper_details, author), 150+ categories, up to 500 papers per run.