Pricing

$5.00 / 1,000 papers scraped

arXiv Paper Scraper - AI ML Research Papers

Scrape arXiv research papers by keyword, category, or author. Extracts titles, abstracts, authors, categories, and other metadata. Perfect for AI/ML research monitoring, literature reviews, and LLM training data collection.

Developer

OpenClaw Mara

Last modified

2 months ago

📄 arXiv Paper Scraper — Academic Research at Scale

Structured data from 2.4M+ open-access papers across physics, math, CS, bio, and more. $0.005 per paper.

Scrape arXiv — the largest preprint server on the internet — for papers, abstracts, authors, and metadata. Uses the official arXiv API for fast, structured extraction. No authentication required.

Perfect for research monitoring, LLM training corpora, literature review automation, citation network building, and keeping up with cutting-edge AI/ML research as it drops.

🚀 What does this Actor do?

arXiv is where researchers post first, before (or instead of) journals, and most major ML/AI papers appear there before they reach a conference. This Actor turns arXiv into a programmable source with four modes:

search: Keyword search over paper metadata (titles, abstracts, authors) with category filters and date/relevance sorting.
new_submissions: Today's freshly submitted papers in a given category, ideal for daily research-trend feeds.
paper_details: Fetch full metadata for specific papers by arXiv ID.
author: All publications by a specific researcher.

Every paper comes back with title, abstract, authors, categories, DOI, PDF URL, publication dates, and journal reference when available — ready to drop into a vector DB, a research dashboard, or a fine-tuning corpus.
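Since the Actor is described as using the official arXiv API, the sketch below shows how three of the modes could map onto an arXiv API query URL. The mapping and the `build_query_url` helper are assumptions for illustration, not the Actor's actual implementation; only the endpoint and the `search_query`, `id_list`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters come from the public arXiv API. The `new_submissions` mode is left out because the arXiv API query interface has no direct equivalent.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_query_url(actor_input: dict) -> str:
    """Translate an Actor-style input into an arXiv API URL (assumed mapping)."""
    mode = actor_input["mode"]
    params = {"start": 0, "max_results": actor_input.get("maxResults", 50)}

    if mode == "search":
        # all: searches every metadata field; quotes keep the phrase together
        terms = ['all:"{}"'.format(actor_input["searchQuery"])]
        if actor_input.get("category"):
            terms.append("cat:" + actor_input["category"])
        params["search_query"] = " AND ".join(terms)
    elif mode == "author":
        params["search_query"] = 'au:"{}"'.format(actor_input["authorName"])
    elif mode == "paper_details":
        params["id_list"] = ",".join(actor_input["arxivIds"])
    else:
        raise ValueError(f"mode {mode!r} is not covered by this sketch")

    if mode != "paper_details":
        params["sortBy"] = actor_input.get("sortBy", "relevance")
        params["sortOrder"] = actor_input.get("sortOrder", "descending")
    return ARXIV_API + "?" + urlencode(params)
```

Calling `build_query_url({"mode": "author", "authorName": "Yann LeCun"})` yields a URL whose `search_query` is the URL-encoded form of `au:"Yann LeCun"`.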

💡 Use Cases

1. Daily AI research feed

Monitor cs.AI, cs.LG, or cs.CL for new papers as they drop. Push to Slack, a newsletter, or a vector DB for daily retrieval.

{
  "mode": "new_submissions",
  "category": "cs.AI",
  "maxResults": 100
}

2. Literature review automation

Pull the most recent papers on a topic (up to 500 per run) and feed abstracts into an LLM for summarization or clustering.

{
  "mode": "search",
  "searchQuery": "retrieval augmented generation",
  "maxResults": 500,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}

3. Author / lab tracking

Track a specific researcher's output — great for following a competitor lab or an advisor's publication list.

{
  "mode": "author",
  "authorName": "Yann LeCun",
  "maxResults": 200
}

4. Paper detail enrichment

You already have arXiv IDs from somewhere else (citations, a bibliography, a dataset) — pull full metadata in bulk.

{
  "mode": "paper_details",
  "arxivIds": ["1706.03762", "2301.00234", "2005.14165"]
}
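Any of the inputs above can be sent from Python with the `apify-client` package. The following is a minimal sketch: `fetch_papers` is a hypothetical helper, and the Actor ID is a placeholder because this page does not state it. The `client` argument is expected to be an `apify_client.ApifyClient` instance.

```python
def fetch_papers(client, actor_id: str, run_input: dict) -> list:
    """Run the Actor with the given input and return its dataset items."""
    # call() starts the run and waits for it to finish
    run = client.actor(actor_id).call(run_input=run_input)
    # results are stored in the run's default dataset
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   papers = fetch_papers(
#       client,
#       "<username>/<actor-name>",  # placeholder: copy the real ID from the Actor page
#       {"mode": "paper_details", "arxivIds": ["1706.03762"]},
#   )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub while developing a pipeline.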

📊 Output Example

{
  "arxivId": "1706.03762",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit"],
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "published": "2017-06-12T17:57:34Z",
  "updated": "2023-08-02T00:00:00Z",
  "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
  "htmlUrl": "http://arxiv.org/abs/1706.03762v7",
  "doi": "10.48550/arXiv.1706.03762",
  "comment": "15 pages, 5 tables",
  "journalRef": "Advances in Neural Information Processing Systems 30 (2017)"
}
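A record shaped like the example above can be loaded into a typed object before further processing. This is a sketch under the assumption that every record carries the fields shown and that `doi` and `journalRef` may be absent; the `Paper` class and both helpers are illustrative names, not part of the Actor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Paper:
    arxiv_id: str
    title: str
    abstract: str
    authors: List[str]
    primary_category: str
    published: datetime
    updated: datetime
    pdf_url: str
    doi: Optional[str] = None
    journal_ref: Optional[str] = None

    @property
    def was_revised(self) -> bool:
        """True when a later version was posted after the first submission."""
        return self.updated > self.published


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_paper(record: dict) -> Paper:
    return Paper(
        arxiv_id=record["arxivId"],
        title=record["title"],
        abstract=record["abstract"],
        authors=list(record["authors"]),
        primary_category=record["primaryCategory"],
        published=parse_timestamp(record["published"]),
        updated=parse_timestamp(record["updated"]),
        pdf_url=record["pdfUrl"],
        doi=record.get("doi"),
        journal_ref=record.get("journalRef"),
    )
```

For the example record, `parse_paper(record).was_revised` is `True`, because `updated` (2023) is later than `published` (2017).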

⚙️ Input Parameters

Parameter	Type	Description
`mode`	enum	`search`, `new_submissions`, `paper_details`, or `author` (required)
`searchQuery`	string	Keywords/phrase (search mode)
`category`	string	arXiv category code (`cs.AI`, `cs.LG`, `math.CO`, `physics.hep-ph`, etc.) — 150+ categories
`arxivIds`	array	List of IDs for `paper_details` mode: `["1706.03762", "2301.00234"]`
`authorName`	string	Full author name for `author` mode
`maxResults`	int	1–500 (default 50)
`sortBy`	enum	`submittedDate`, `relevance`, `lastUpdatedDate`
`sortOrder`	enum	`descending` (default) or `ascending`
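The table above can be turned into a small client-side check that runs before a job is submitted. In this sketch the allowed values, the 1–500 range, and the defaults come from the table; which field each mode requires is an assumption inferred from the descriptions, and `validate_input` is a hypothetical helper.

```python
# Field assumed to be required for each mode (inferred, not documented)
REQUIRED_BY_MODE = {
    "search": "searchQuery",
    "new_submissions": "category",
    "paper_details": "arxivIds",
    "author": "authorName",
}
SORT_BY = {"submittedDate", "relevance", "lastUpdatedDate"}
SORT_ORDER = {"descending", "ascending"}


def validate_input(actor_input: dict) -> dict:
    """Return a copy of the input with defaults filled in, or raise ValueError."""
    mode = actor_input.get("mode")
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"mode must be one of {sorted(REQUIRED_BY_MODE)}")

    required = REQUIRED_BY_MODE[mode]
    if not actor_input.get(required):
        raise ValueError(f"{required} is required for mode {mode!r}")

    max_results = actor_input.get("maxResults", 50)
    if not isinstance(max_results, int) or not 1 <= max_results <= 500:
        raise ValueError("maxResults must be an integer between 1 and 500")

    if "sortBy" in actor_input and actor_input["sortBy"] not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")

    sort_order = actor_input.get("sortOrder", "descending")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")

    return {**actor_input, "maxResults": max_results, "sortOrder": sort_order}
```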

📤 Output Fields

Field	Description
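`arxivId`	Paper ID (supports current-format IDs such as `0704.0001` and `2301.00234` as well as pre-2007 IDs such as `hep-th/0001001`)
`title`, `abstract`	Full title and abstract text (not the full paper text)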
`authors[]`	Ordered author list
`primaryCategory`, `categories[]`	arXiv subject classifications
`published`, `updated`	ISO timestamps
`pdfUrl`, `htmlUrl`	Direct links
`doi`	Persistent DOI
`comment`	Author comment (e.g., "Accepted at NeurIPS 2024")
`journalRef`	Journal / conference reference when available

💰 Pricing & Performance

Pay-per-event: $0.005 per paper.
Typical monthly cost: $7.50–$15 for daily cs.AI / cs.LG monitoring at 50–100 papers/day over a 30-day month.
Speed: ~100 papers/minute via the official arXiv API. No rate-limit surprises.
No auth required — arXiv's API is fully open.
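The per-paper price makes run costs easy to estimate up front. A minimal sketch, assuming cost is simply papers multiplied by $0.005 with no other charges; `monthly_cost` is an illustrative helper.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.005")  # pay-per-event price stated above


def monthly_cost(papers_per_day: int, days: int = 30) -> Decimal:
    """Estimated cost in USD for a daily job over the given number of days."""
    return PRICE_PER_PAPER * papers_per_day * days


print(f"${monthly_cost(50):.2f}")   # 50 papers/day for 30 days  -> $7.50
print(f"${monthly_cost(100):.2f}")  # 100 papers/day for 30 days -> $15.00
```

`Decimal` is used instead of `float` so that amounts stay exact when summed across many runs.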

🔌 Integrations

Zapier / Make / n8n — daily new-submission digest to Slack, email, or Notion.
LangChain / LlamaIndex — feed abstracts into RAG for a "research assistant" that answers questions about the literature.
Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed abstracts for semantic search over arXiv.
LLM fine-tuning corpora — bulk-scrape a domain (e.g., all of cs.CL over 5 years) as training data.
Semantic Scholar scraper (companion) — combine with citation data to build citation networks. arXiv doesn't ship citation counts; Semantic Scholar does.
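For the vector DB and RAG integrations, each record first needs to become a text-plus-metadata document. The sketch below is library-agnostic: the `id` / `text` / `metadata` layout is an assumed convention rather than any specific database's schema, and whether the Actor already normalizes whitespace is not stated, so the helper collapses it defensively.

```python
def to_embedding_document(paper: dict) -> dict:
    """Build a generic document (id, text, metadata) from one Actor record."""

    def squash(value: str) -> str:
        # collapse line breaks and repeated spaces into single spaces
        return " ".join(value.split())

    title = squash(paper["title"])
    abstract = squash(paper["abstract"])
    return {
        "id": paper["arxivId"],
        "text": f"{title}\n\n{abstract}",  # the string to embed
        "metadata": {
            "title": title,
            "authors": paper.get("authors", []),
            "primaryCategory": paper.get("primaryCategory"),
            "published": paper.get("published"),
            "pdfUrl": paper.get("pdfUrl"),
        },
    }
```

The resulting `text` can be passed to an embedding model of your choice, with `id` and `metadata` stored alongside the vector.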

❓ FAQ

How fresh are new_submissions? Daily. arXiv updates the new-submissions feed once per day (US business days). This mode hits that feed directly.

Does this return the full PDF text? No — it returns metadata + abstract + a direct pdfUrl. For full-text extraction, pass the PDF URL to a separate PDF-extraction Actor.

Can I filter by language? No, there is no language parameter. arXiv is predominantly English, and the rare non-English papers usually include an English abstract as well.

Do old-format IDs (e.g., hep-th/0001001) work? Yes — both 0704.0001 / 2301.00234 (new format) and hep-th/0001001 (pre-2007 format) are supported in paper_details.
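If IDs come from mixed sources, it can help to check their shape before sending them to paper_details. A minimal sketch based on arXiv's two published identifier schemes; the regular expressions and the `classify_arxiv_id` helper are illustrative and may not match the Actor's own validation.

```python
import re
from typing import Optional

# Current scheme: YYMM.NNNN (2007 to 2014) or YYMM.NNNNN (2015 onward)
NEW_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Pre-2007 scheme: archive[.SUBJECT]/YYMMNNN, e.g. hep-th/0001001 or math.GT/0309136
OLD_ID = re.compile(r"[a-z][a-z-]*(\.[A-Z]{2})?/\d{7}(v\d+)?")


def classify_arxiv_id(arxiv_id: str) -> Optional[str]:
    """Return "new", "old", or None if the string matches neither scheme."""
    if NEW_ID.fullmatch(arxiv_id):
        return "new"
    if OLD_ID.fullmatch(arxiv_id):
        return "old"
    return None
```

Both patterns accept an optional version suffix, so `1706.03762v7` is classified as `"new"`.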

What's the difference between submittedDate and lastUpdatedDate? submittedDate = when the paper was first posted. lastUpdatedDate = when the most recent version (v2, v3, …) was posted. Use submittedDate for "new papers," lastUpdatedDate for "recently revised."

Rate limits? arXiv recommends ≤1 request per 3 seconds. The Actor handles pacing internally; you just set maxResults and wait.
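The Actor paces its own requests, so nothing below is needed to use it. For readers who call the arXiv API directly, this sketch shows one way to honor the one-request-per-3-seconds guidance; `paced` is a hypothetical helper, and the injectable `sleep` and `clock` arguments exist only to make it testable.

```python
import time

MIN_INTERVAL = 3.0  # seconds between request starts, per arXiv's guidance


def paced(calls, min_interval=MIN_INTERVAL, sleep=time.sleep, clock=time.monotonic):
    """Run zero-argument callables in order, starting them at least min_interval apart."""
    results = []
    last_start = None
    for call in calls:
        if last_start is not None:
            # wait only for whatever part of the interval has not elapsed yet
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(call())
    return results
```

Each callable would typically wrap one HTTP request, for example `lambda: urllib.request.urlopen(url).read()`.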

🔑 Keywords

arXiv scraper, academic paper scraper, research paper API, arXiv API alternative, ML papers data, AI research monitoring, literature review automation, preprint scraper, scientific paper extraction, cs.AI scraper, cs.LG scraper, research trend tracking, paper metadata extraction, arXiv bulk download, RAG over research papers, research corpus builder, citation network.

📝 Changelog

v1.0 — Initial release. 4 modes (search, new_submissions, paper_details, author), 150+ categories, up to 500 papers per run.

arXiv Papers Scraper

resounding_diplomacy/arxiv-papers-scraper

Scrape academic papers from arXiv by category, keyword, or author. Extract titles, authors, abstracts, PDF URLs, DOIs, categories, and more. Perfect for AI/ML research datasets.

alars num

arXiv Research Papers Tracker

wsgcjj/arxiv-papers-scraper

Search and extract academic papers from arXiv by category, keyword, date range. Returns paper title, authors, abstract, categories, published date, PDF URL. Ideal for AI/ML research monitoring and training data collection.

陈俊杰

arXiv Papers Scraper - AI/ML Research at Scale

wetyr_corporation/arxiv-papers-scraper

Search and bulk extract arXiv research papers with abstracts, authors, categories, and PDF links. Built for AI/ML researchers, RAG knowledge bases, and citation tracking.

WETYR

ArXiv Papers Scraper — Research Paper API

fast_api/arxiv-papers-scraper

Search and extract ArXiv research papers as structured JSON: titles, authors, abstracts, categories, dates, PDFs, and metadata. Built for AI research monitoring, literature review, RAG datasets, and academic intelligence.

Fast API

arXiv Papers Monitor for Research Alerts

skootle/arxiv-papers

Monitor arXiv papers by query, category, author, or date. Export titles, abstracts, authors, links, PDFs, categories, and agent-friendly summaries for research monitoring, literature review, and AI paper workflows.

Skootle

arXiv Paper Scraper

skystone_labs/arxiv-scraper

Extract research papers from arXiv using the official API. Get titles, authors, abstracts, PDF URLs, categories, and more. Perfect for research datasets and literature reviews.

Skystone

arXiv Research Paper Scraper

seeb/arxiv-research-paper-scraper

Scrape arXiv papers by keyword or category and return research titles, abstracts, authors, dates, links, and topic signals.

Techionik

Ai-ML-scraper

labrat011/ai-ml-scraper

Search AI/ML models, research papers, and trending papers from HuggingFace Hub and arXiv. No API key required.

mick_

arXiv Paper Scraper

plantane/arxiv-scraper

Scrape research papers from arXiv by search query or category. Get titles, abstracts, authors, categories, and PDF links via the public arXiv API.

Daniel

arXiv Scraper

artificially/arxiv-scraper

Search and extract academic papers from arXiv.org. Get paper titles, authors, abstracts, categories, and PDF links for AI/ML, physics, math, and more.