arXiv Paper Scraper - AI ML Research Papers

Pricing

$5.00 / 1,000 papers scraped

Scrape arXiv research papers by keyword, category, or author. Extracts titles, abstracts, authors, categories, and other metadata. Perfect for AI/ML research monitoring, literature reviews, and LLM training data collection.

Developer

OpenClaw Mara

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 3
Monthly active users: 1
Last modified: 19 days ago

📄 arXiv Paper Scraper — Academic Research at Scale

Structured data from 2.4M+ open-access papers across physics, math, CS, bio, and more. $0.005 per paper.

Scrape arXiv — the largest preprint server on the internet — for papers, abstracts, authors, and metadata. Uses the official arXiv API for fast, structured extraction. No authentication required.

Perfect for research monitoring, LLM training corpora, literature review automation, citation network building, and keeping up with cutting-edge AI/ML research as it drops.

🚀 What does this Actor do?

arXiv is where researchers post first, before (or instead of) journals; most major ML/AI papers appear on arXiv before they appear at a conference. This Actor turns arXiv into a programmable source with four modes:

  • search: Keyword search over paper metadata (title, abstract, authors) with category filters and date/relevance sorting.
  • new_submissions: Today's freshly submitted papers in a given category. Ideal for daily research-trend feeds.
  • paper_details: Fetch full metadata for specific papers by arXiv ID.
  • author: All publications by a specific researcher.

Every paper comes back with title, abstract, authors, categories, DOI, PDF URL, publication dates, and journal reference when available — ready to drop into a vector DB, a research dashboard, or a fine-tuning corpus.
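As a rough illustration of the "ready to drop into a vector DB" step, the sketch below turns one output record into an id/text/metadata triple. The function name `to_document` and the choice of which fields go into the metadata are assumptions made for this example, not part of the Actor; the field names follow the output example in this README.

```python
from typing import Any


def to_document(paper: dict[str, Any]) -> dict[str, Any]:
    """Turn one Actor output record into an id/text/metadata triple.

    Hypothetical helper: the shape of the returned dict is an assumption,
    chosen because most vector stores accept something similar.
    """
    # Title and abstract carry the semantic content, so they form the text.
    text = f"{paper['title']}\n\n{paper['abstract']}"
    metadata = {
        "authors": paper.get("authors", []),
        "primaryCategory": paper.get("primaryCategory"),
        "published": paper.get("published"),
        "pdfUrl": paper.get("pdfUrl"),
    }
    return {"id": paper["arxivId"], "text": text, "metadata": metadata}


sample_paper = {
    "arxivId": "1706.03762",
    "title": "Attention Is All You Need",
    "abstract": "The dominant sequence transduction models are based on...",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "primaryCategory": "cs.CL",
    "published": "2017-06-12T17:57:34Z",
    "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
}
document = to_document(sample_paper)
print(document["id"], len(document["text"]))
```

The `text` value is what you would pass to an embedding model; the `id` keeps re-runs idempotent if the store supports upserts.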

💡 Use Cases

1. Daily AI research feed

Monitor cs.AI, cs.LG, or cs.CL for new papers as they drop. Push to Slack, a newsletter, or a vector DB for daily retrieval.

{
  "mode": "new_submissions",
  "category": "cs.AI",
  "maxResults": 100
}

2. Literature review automation

Pull every paper on a topic across the last N years and feed abstracts into an LLM for summarization or clustering.

{
  "mode": "search",
  "searchQuery": "retrieval augmented generation",
  "maxResults": 500,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
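For orientation, a search input like the one above could translate into a request against the official arXiv API roughly as follows. This is a sketch under assumptions: the helper `build_search_url` is hypothetical, and how the Actor actually composes its queries is not documented here. The endpoint and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters are those of the public arXiv API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ARXIV_API = "http://export.arxiv.org/api/query"


def build_search_url(query, category=None, max_results=50,
                     sort_by="submittedDate", sort_order="descending",
                     start=0):
    """Build an arXiv API query URL (hypothetical helper, not the Actor's code)."""
    # "all:" searches every metadata field; quotes keep the phrase together.
    terms = [f'all:"{query}"']
    if category:
        # "cat:" restricts results to one subject category, e.g. cs.CL.
        terms.append(f"cat:{category}")
    params = {
        "search_query": " AND ".join(terms),
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


search_url = build_search_url("retrieval augmented generation",
                              category="cs.CL", max_results=500)
# Decode the query string again to check what was encoded.
decoded = parse_qs(urlsplit(search_url).query)
print(decoded["search_query"][0])
```

An author lookup would use the same endpoint with an `au:` term in place of `all:`.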

3. Author / lab tracking

Track a specific researcher's output — great for following a competitor lab or an advisor's publication list.

{
  "mode": "author",
  "authorName": "Yann LeCun",
  "maxResults": 200
}

4. Paper detail enrichment

You already have arXiv IDs from somewhere else (citations, a bibliography, a dataset) — pull full metadata in bulk.

{
  "mode": "paper_details",
  "arxivIds": ["1706.03762", "2301.00234", "2005.14165"]
}
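Bulk lookups by ID map naturally onto the arXiv API's `id_list` parameter. The sketch below splits a list of IDs into batches and builds one URL per batch; the helper name `build_id_list_urls` and the batch size of 100 are assumptions for illustration, not documented Actor behavior.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_id_list_urls(arxiv_ids, batch_size=100):
    """Return one arXiv API URL per batch of IDs (hypothetical helper)."""
    urls = []
    for i in range(0, len(arxiv_ids), batch_size):
        batch = arxiv_ids[i:i + batch_size]
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        # Keep "," and "/" unescaped so old-style IDs stay readable.
        urls.append(f"{ARXIV_API}?{urlencode(params, safe=',/')}")
    return urls


detail_urls = build_id_list_urls(
    ["1706.03762", "2301.00234", "2005.14165"], batch_size=2
)
for url in detail_urls:
    print(url)
```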

📊 Output Example

{
  "arxivId": "1706.03762",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit"],
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "published": "2017-06-12T17:57:34Z",
  "updated": "2023-08-02T00:00:00Z",
  "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
  "htmlUrl": "http://arxiv.org/abs/1706.03762v7",
  "doi": "10.48550/arXiv.1706.03762",
  "comment": "15 pages, 5 tables",
  "journalRef": "Advances in Neural Information Processing Systems 30 (2017)"
}

⚙️ Input Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| mode | enum | search, new_submissions, paper_details, or author (required) |
| searchQuery | string | Keywords/phrase (search mode) |
| category | string | arXiv category code (cs.AI, cs.LG, math.CO, physics.hep-ph, etc.); 150+ categories |
| arxivIds | array | List of IDs for paper_details mode: ["1706.03762", "2301.00234"] |
| authorName | string | Full author name for author mode |
| maxResults | int | 1–500 (default 50) |
| sortBy | enum | submittedDate, relevance, lastUpdatedDate |
| sortOrder | enum | descending (default) or ascending |
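The constraints in the parameter table can be checked before starting a run. The following is a minimal sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and the rule that each mode requires its matching field (for example `searchQuery` for `search`) is an assumption inferred from the table.

```python
# Which field each mode needs. Assumption inferred from the parameter table.
REQUIRED_BY_MODE = {
    "search": "searchQuery",
    "new_submissions": "category",
    "paper_details": "arxivIds",
    "author": "authorName",
}
SORT_BY = {"submittedDate", "relevance", "lastUpdatedDate"}
SORT_ORDER = {"descending", "ascending"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    mode = actor_input.get("mode")
    if mode not in REQUIRED_BY_MODE:
        errors.append(f"mode must be one of {sorted(REQUIRED_BY_MODE)}")
    elif not actor_input.get(REQUIRED_BY_MODE[mode]):
        errors.append(f"{REQUIRED_BY_MODE[mode]} is required for mode '{mode}'")

    # maxResults defaults to 50 and must stay within 1..500.
    max_results = actor_input.get("maxResults", 50)
    if type(max_results) is not int or not 1 <= max_results <= 500:
        errors.append("maxResults must be an integer from 1 to 500")

    # Sort options are only checked when the caller supplies them.
    if "sortBy" in actor_input and actor_input["sortBy"] not in SORT_BY:
        errors.append(f"sortBy must be one of {sorted(SORT_BY)}")
    if "sortOrder" in actor_input and actor_input["sortOrder"] not in SORT_ORDER:
        errors.append(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    return errors


print(validate_input({"mode": "search", "searchQuery": "rag", "maxResults": 500}))
print(validate_input({"mode": "author", "maxResults": 0}))
```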

📤 Output Fields

| Field | Description |
| --- | --- |
| arxivId | Paper ID (both the 4-digit 0704.0001 form and the 5-digit 2301.00234 form work) |
| title, abstract | Full text |
| authors[] | Ordered author list |
| primaryCategory, categories[] | arXiv subject classifications |
| published, updated | ISO timestamps |
| pdfUrl, htmlUrl | Direct links |
| doi | Persistent DOI |
| comment | Author comment (e.g., "Accepted at NeurIPS 2024") |
| journalRef | Journal / conference reference when available |
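The arXiv API returns Atom XML, so each output field has to be read from an `<entry>` element. The sketch below shows one plausible mapping using only the standard library. It is an assumption about how such a mapping could look, not the Actor's implementation, and the embedded entry is a shortened sample rather than a real API response.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Shortened sample entry, written by hand for this example.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/1706.03762v7</id>
  <updated>2023-08-02T00:00:00Z</updated>
  <published>2017-06-12T17:57:34Z</published>
  <title>Attention Is All
  You Need</title>
  <summary>The dominant sequence transduction models ...</summary>
  <author><name>Ashish Vaswani</name></author>
  <author><name>Noam Shazeer</name></author>
  <arxiv:comment>15 pages, 5 tables</arxiv:comment>
  <link title="pdf" href="http://arxiv.org/pdf/1706.03762v7"
        rel="related" type="application/pdf"/>
  <arxiv:primary_category term="cs.CL"/>
  <category term="cs.CL"/>
  <category term="cs.LG"/>
</entry>"""


def text_of(entry, path):
    """Return whitespace-normalized text of a child element, or None."""
    node = entry.find(path, NS)
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split())


def parse_entry(entry):
    """Map one Atom <entry> to the output fields listed in the table."""
    html_url = text_of(entry, "atom:id")
    # The ID is the part after /abs/, without the version suffix (v7).
    arxiv_id = re.sub(r"v\d+$", "", html_url.split("/abs/")[-1])
    pdf_link = entry.find("atom:link[@title='pdf']", NS)
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxivId": arxiv_id,
        "title": text_of(entry, "atom:title"),
        "abstract": text_of(entry, "atom:summary"),
        "authors": [" ".join(n.text.split())
                    for n in entry.findall("atom:author/atom:name", NS)],
        "primaryCategory": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "published": text_of(entry, "atom:published"),
        "updated": text_of(entry, "atom:updated"),
        "pdfUrl": pdf_link.get("href") if pdf_link is not None else None,
        "htmlUrl": html_url,
        "doi": text_of(entry, "arxiv:doi"),
        "comment": text_of(entry, "arxiv:comment"),
        "journalRef": text_of(entry, "arxiv:journal_ref"),
    }


record = parse_entry(ET.fromstring(SAMPLE_ENTRY))
print(record["arxivId"], record["title"])
```

Fields that are absent from an entry (here `doi` and `journalRef`) come back as `None`, which matches the "when available" wording above.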

💰 Pricing & Performance

  • Pay-per-event: $0.005 per paper.
  • Typical monthly cost: $7.50–$15 for daily cs.AI / cs.LG monitoring at 50–100 papers/day (30 days at $0.005 per paper).
  • Speed: ~100 papers/minute via the official arXiv API. No rate-limit surprises.
  • No auth required — arXiv's API is fully open.
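The pay-per-event price makes cost estimates a one-line calculation. The sketch below works it through for a daily monitoring job; the 30-day month and the helper name `monthly_cost` are assumptions for illustration, and platform usage costs beyond the per-paper event are not modeled.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.005")  # $5.00 per 1,000 papers


def monthly_cost(papers_per_day: int, days: int = 30) -> Decimal:
    """Estimated monthly charge in USD for a fixed daily volume."""
    return PRICE_PER_PAPER * papers_per_day * days


for volume in (50, 100):
    # 50 papers/day -> $7.50, 100 papers/day -> $15.00
    print(f"{volume} papers/day -> ${monthly_cost(volume):.2f}")
```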

🔌 Integrations

  • Zapier / Make / n8n — daily new-submission digest to Slack, email, or Notion.
  • LangChain / LlamaIndex — feed abstracts into RAG for a "research assistant" that answers questions about the literature.
  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed abstracts for semantic search over arXiv.
  • LLM fine-tuning corpora — bulk-scrape a domain (e.g., all of cs.CL over 5 years) as training data.
  • Semantic Scholar scraper (companion) — combine with citation data to build citation networks. arXiv doesn't ship citation counts; Semantic Scholar does.

🏷️ Popular Categories

  • cs.AI — Artificial Intelligence
  • cs.LG — Machine Learning
  • cs.CL — Computational Linguistics / NLP
  • cs.CV — Computer Vision
  • cs.SE — Software Engineering
  • stat.ML — Machine Learning (statistics)
  • math.CO — Combinatorics
  • q-bio.QM — Quantitative Methods in Biology
  • Full list: https://arxiv.org/category_taxonomy

❓ FAQ

How fresh are new_submissions? Daily. arXiv updates the new-submissions feed once per day (US business days). This mode hits that feed directly.

Does this return the full PDF text? No — it returns metadata + abstract + a direct pdfUrl. For full-text extraction, pass the PDF URL to a separate PDF-extraction Actor.

Can I filter by language? No, there is no language filter. arXiv is predominantly English; non-English papers are rare and usually include an English abstract as well.

Do old-format IDs (e.g., hep-th/0001001) work? Yes — both 0704.0001 / 2301.00234 (new format) and hep-th/0001001 (pre-2007 format) are supported in paper_details.
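If your IDs come from mixed sources, a quick format check before a paper_details run can catch typos. The patterns below are a simplified sketch of the two ID schemes, written for this example; they are not the Actor's validation logic and are more permissive than arXiv's own rules (they do not check that the archive name exists, for instance).

```python
import re

# New scheme (since April 2007): YYMM.NNNN or YYMM.NNNNN, optional version.
NEW_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Old scheme (before April 2007): archive[.SC]/YYMMNNN, optional version.
OLD_ID = re.compile(r"[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?")


def classify_arxiv_id(value: str):
    """Return "new", "old", or None for strings that match neither scheme."""
    if NEW_ID.fullmatch(value):
        return "new"
    if OLD_ID.fullmatch(value):
        return "old"
    return None


for candidate in ("0704.0001", "2301.00234", "hep-th/0001001", "not-an-id"):
    print(candidate, classify_arxiv_id(candidate))
```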

What's the difference between submittedDate and lastUpdatedDate? submittedDate = when the paper was first posted. lastUpdatedDate = when the most recent version (v2, v3, …) was posted. Use submittedDate for "new papers," lastUpdatedDate for "recently revised."
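The same distinction shows up in the output as the `published` and `updated` fields, so revised papers can be separated after the fact. A minimal sketch, assuming both fields are ISO 8601 strings with a trailing `Z` as in the output example; `is_revised` is a hypothetical helper name.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2017-06-12T17:57:34Z."""
    # Older Python versions do not accept "Z", so spell out the UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_revised(paper: dict) -> bool:
    """True when a later version was posted after the first submission."""
    return parse_timestamp(paper["updated"]) > parse_timestamp(paper["published"])


revised_paper = {"published": "2017-06-12T17:57:34Z",
                 "updated": "2023-08-02T00:00:00Z"}
fresh_paper = {"published": "2024-01-05T10:00:00Z",
               "updated": "2024-01-05T10:00:00Z"}
print(is_revised(revised_paper), is_revised(fresh_paper))
```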

Rate limits? arXiv recommends ≤1 request per 3 seconds. The Actor handles pacing internally; you just set maxResults and wait.
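To make the pacing concrete, the sketch below splits a `maxResults` budget into pages and waits at least three seconds between page requests. It is an illustration under assumptions, not the Actor's code: the page size of 100, the helper names, and the injected `sleep` and `clock` functions (used here so the demo runs without real delays) are all hypothetical.

```python
import time


def plan_pages(max_results: int, page_size: int = 100):
    """Split a result budget into (start, size) pages."""
    return [(start, min(page_size, max_results - start))
            for start in range(0, max_results, page_size)]


def fetch_paced(pages, fetch_page, min_interval=3.0,
                sleep=time.sleep, clock=time.monotonic):
    """Call fetch_page for each page, at most once per min_interval seconds."""
    results = []
    last_call = None
    for start, size in pages:
        if last_call is not None:
            # Wait out whatever remains of the interval since the last call.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.extend(fetch_page(start, size))
    return results


# Demo with a fake fetcher, a frozen clock, and recorded (not real) sleeps.
recorded_delays = []
papers = fetch_paced(
    plan_pages(250),
    fetch_page=lambda start, size: list(range(start, start + size)),
    sleep=recorded_delays.append,
    clock=lambda: 0.0,
)
print(len(papers), recorded_delays)
```

With three pages the loop pauses twice, which is why larger `maxResults` values take proportionally longer to finish.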

🔑 Keywords

arXiv scraper, academic paper scraper, research paper API, arXiv API alternative, ML papers data, AI research monitoring, literature review automation, preprint scraper, scientific paper extraction, cs.AI scraper, cs.LG scraper, research trend tracking, paper metadata extraction, arXiv bulk download, RAG over research papers, research corpus builder, citation network.

📝 Changelog

  • v1.0 — Initial release. 4 modes (search, new_submissions, paper_details, author), 150+ categories, up to 500 papers per run.