ArXiv Paper Scraper — Search by Category, Bulk JSON, DOI
Pricing
Pay per usage
arXiv corpus as JSON — arxivId, title, authors, abstract, categories, dates, DOI, PDF URL. By search OR category. Built for ML/AI training data + lit reviews. 19 runs. Backed by 951-run Trustpilot flagship + 31-actor portfolio. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai
Developer
Alex
4 days ago
Last modified
arXiv Paper Scraper — Bulk Research-Paper Metadata via arXiv API
Pull research-paper metadata from the official arXiv API by category or free-text search query — titles, abstracts, authors, primary + secondary categories, submission/update dates, DOI, journal reference, PDF URL — as a flat JSON dataset. No API key, no auth.
Verified against src/main.js — every output field below is what the actor actually pushes.
What you get per paper (14 fields)
| Field | Type | Example | Source in main.js |
|---|---|---|---|
| `arxivId` | string | `2403.14812v1` | parsed from `<id>` URL tail |
| `title` | string | `Attention Is All You Need` | `<title>` text, whitespace-collapsed |
| `authors` | string[] | `["Ashish Vaswani", "Noam Shazeer"]` | each `<author><name>` text; no affiliations, no ORCIDs |
| `abstract` | string | full abstract, whitespace-collapsed | `<summary>` text |
| `categories` | string[] | `["cs.CL", "cs.LG"]` | every `<category term="...">` |
| `primaryCategory` | string \| null | `cs.CL` | first `<category>` |
| `published` | ISO string | `2017-06-12T17:57:34Z` | `<published>` |
| `updated` | ISO string | `2023-08-02T00:41:18Z` | `<updated>` |
| `doi` | string \| null | `10.5555/3294996.3295349` | `<arxiv:doi>` or null |
| `journalRef` | string \| null | `NeurIPS 2017` | `<arxiv:journal_ref>` or null |
| `pdfUrl` | string | `http://arxiv.org/pdf/2403.14812v1` | `<id>` with `/abs/` → `/pdf/` |
| `abstractUrl` | string | `http://arxiv.org/abs/2403.14812v1` | `<id>` text |
| `source` | string | `category:cs.AI` or `search:RAG` | tagged at scrape time |
| `scrapedAt` | ISO string | `2026-04-29T10:30:00.000Z` | actor capture time |
The actor parses arXiv's Atom-XML response with Cheerio (xmlMode: true). It does not extract author affiliations, institutions, page counts, figure counts, reference counts, citations, or comment-field metadata — those would require parsing PDFs or hitting Semantic Scholar / OpenAlex (separate actors).
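To make the field mapping concrete, here is a minimal Python sketch of the same mapping. It is not the actor's code (the actor is JavaScript with Cheerio); it uses the standard library's `xml.etree.ElementTree` and a hand-written sample entry assembled from the example values in the table, not real API output. The helper names `collapse` and `entry_to_record` are hypothetical, and taking everything after `/abs/` as the ID is an assumption about what "URL tail" means.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespaces used by arXiv's Atom feed.
NS = {"atom": "http://www.w3.org/2005/Atom",
      "arxiv": "http://arxiv.org/schemas/atom"}

def collapse(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text or "").strip()

def entry_to_record(entry, source):
    """Map one Atom <entry> to the 14-field record (illustrative only)."""
    abs_url = collapse(entry.findtext("atom:id", default="", namespaces=NS))
    categories = [c.get("term") for c in entry.findall("atom:category", NS)]
    doi = entry.findtext("arxiv:doi", namespaces=NS)
    journal_ref = entry.findtext("arxiv:journal_ref", namespaces=NS)
    scraped = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "arxivId": abs_url.split("/abs/", 1)[-1],  # assumption: tail after /abs/
        "title": collapse(entry.findtext("atom:title", namespaces=NS)),
        "authors": [collapse(n.text)
                    for n in entry.findall("atom:author/atom:name", NS)],
        "abstract": collapse(entry.findtext("atom:summary", namespaces=NS)),
        "categories": categories,
        "primaryCategory": categories[0] if categories else None,
        "published": entry.findtext("atom:published", namespaces=NS),
        "updated": entry.findtext("atom:updated", namespaces=NS),
        "doi": collapse(doi) if doi else None,
        "journalRef": collapse(journal_ref) if journal_ref else None,
        "pdfUrl": abs_url.replace("/abs/", "/pdf/"),
        "abstractUrl": abs_url,
        "source": source,
        "scrapedAt": scraped.replace("+00:00", "Z"),
    }

# Hand-written sample entry (not real API output).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2403.14812v1</id>
    <updated>2023-08-02T00:41:18Z</updated>
    <published>2017-06-12T17:57:34Z</published>
    <title>Attention Is All
      You Need</title>
    <summary>  Example abstract
      text.  </summary>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
  </entry>
</feed>"""

records = [entry_to_record(e, "category:cs.CL")
           for e in ET.fromstring(SAMPLE).findall("atom:entry", NS)]
```

Because the sample has no `<arxiv:doi>` or `<arxiv:journal_ref>` element, `doi` and `journalRef` come back as `None`, which mirrors the "or null" behavior in the table.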
Inputs (from .actor/input_schema.json)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | string[] | `[]` | free-text queries, sent to arXiv as `all:<query>` (matches title + abstract + author + comment) |
| `categories` | string[] | `[]` | arXiv category codes (e.g. `cs.AI`, `cs.LG`, `stat.ML`) |
| `maxPapersPerSource` | integer | `50` | cap per query / per category. Caveat: `searchQueries` paginate up to this cap (max 500 = 5 batches of 100). Categories only fetch ONE batch (≤100); values >100 for categories are silently capped at 100. To pull >100 papers per category, use a searchQuery `cat:cs.AI` instead, or request a paginated-categories custom build. |
| `sortBy` | string | `submittedDate` | one of `submittedDate`, `lastUpdatedDate`, `relevance` |
| `sortOrder` | string | `descending` | one of `ascending`, `descending` |
If both searchQueries and categories are empty, the actor fetches the latest 25 cs.AI papers as a default sample.
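The category cap is easy to trip over, so a small client-side helper can apply the documented workaround before the run starts. The sketch below is an assumption-laden illustration: `build_run_input` is a hypothetical name, clamping the cap to 500 is inferred from the "max 500" note in the table, and rerouting categories as `cat:<code>` search queries relies on the workaround described in the caveat above.

```python
CATEGORY_BATCH_LIMIT = 100  # categories fetch a single batch (per the table)
SEARCH_LIMIT = 500          # searchQueries paginate up to 5 batches of 100

SORT_BY = {"submittedDate", "lastUpdatedDate", "relevance"}
SORT_ORDER = {"ascending", "descending"}

def build_run_input(search_queries, categories, max_papers=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Build an actor input dict, rerouting categories when the cap exceeds 100."""
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")

    max_papers = min(max_papers, SEARCH_LIMIT)  # assumption: 500 is a hard ceiling
    queries = list(search_queries)
    cats = list(categories)

    # Above 100, the category path would silently return only 100 papers,
    # so send each category through the paginated search path instead.
    if max_papers > CATEGORY_BATCH_LIMIT:
        queries += [f"cat:{c}" for c in cats]
        cats = []

    return {
        "searchQueries": queries,
        "categories": cats,
        "maxPapersPerSource": max_papers,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }

big = build_run_input(["rag"], ["cs.AI"], max_papers=300)
small = build_run_input([], ["cs.AI"], max_papers=50)
```

One side effect worth knowing: rerouted records would presumably carry a `search:` source tag rather than `category:`, since the tag is assigned by input type at scrape time.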
Use cases
- Weekly digest: pull new `cs.AI`/`cs.LG`/`cs.CL` submissions every Monday for a research-team Slack feed
- RAG corpus ingestion: feed `arxivId` + `pdfUrl` into a downstream PDF-extraction pipeline (use `pypdf`/`PyMuPDF`)
- Citation graph stub: collect `doi`/`journalRef` for joining against OpenAlex or Semantic Scholar later
- Topic monitoring: search "retrieval augmented generation" weekly, diff against last week
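For the topic-monitoring case, the weekly diff can be done on `arxivId` with the version suffix stripped, so that a `v2` of a known paper is reported as a revision rather than a new paper. This is a minimal sketch of one way to do it; `base_id` and `diff_runs` are hypothetical helper names, and the IDs in the example are made up.

```python
import re

_VERSION_SUFFIX = re.compile(r"v\d+$")

def base_id(arxiv_id):
    """Strip the trailing version suffix: '2401.00001v2' -> '2401.00001'."""
    return _VERSION_SUFFIX.sub("", arxiv_id)

def diff_runs(previous, current):
    """Return (new_papers, revised_papers) between two lists of records."""
    seen = {base_id(p["arxivId"]): p["arxivId"] for p in previous}
    new_papers, revised_papers = [], []
    for paper in current:
        key = base_id(paper["arxivId"])
        if key not in seen:
            new_papers.append(paper)
        elif seen[key] != paper["arxivId"]:
            revised_papers.append(paper)  # same paper, different version
    return new_papers, revised_papers

# Made-up records standing in for two weekly dataset exports.
last_week = [{"arxivId": "2401.00001v1"}, {"arxivId": "2401.00002v1"}]
this_week = [{"arxivId": "2401.00001v2"}, {"arxivId": "2401.00002v1"},
             {"arxivId": "2401.00003v1"}]

new_papers, revised_papers = diff_runs(last_week, this_week)
```

In practice `last_week` and `this_week` would be the items of two dataset exports loaded from disk or from the Apify dataset API.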
Quick start
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("knotless_cadence/arxiv-paper-scraper").call(run_input={
    "categories": ["cs.AI", "cs.LG"],
    "searchQueries": ["retrieval augmented generation"],
    "maxPapersPerSource": 200,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})

for p in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(p["arxivId"], p["primaryCategory"], "|", p["title"])
    print(" Authors:", ", ".join(p["authors"][:3]))
    print(" Published:", p["published"])
    print(" PDF:", p["pdfUrl"])
```
Bash one-liner
```bash
curl -s "https://api.apify.com/v2/acts/knotless_cadence~arxiv-paper-scraper/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"categories":["cs.AI","cs.LG"],"maxPapersPerSource":50,"sortBy":"submittedDate","sortOrder":"descending"}' \
  | jq -r '.[] | "\(.arxivId)\t\(.primaryCategory)\t\(.title)"'
```
How it works
- For each `searchQueries` entry → arXiv Atom API: `?search_query=all:<query>&sortBy=...&max_results=...&start=...` (paginated up to `maxPapersPerSource`, max 500)
- For each `categories` entry → arXiv Atom API: `?search_query=cat:<cat>&sortBy=...&max_results=<min(maxPapersPerSource,100)>` (single-batch only, no pagination; max 100)
- Cheerio parses XML (`xmlMode: true`), walks each `<entry>`, builds the 14-field record above
- 3-second delay between paginated batches within a single query: `delay(3000)` after each batch. There is no delay between separate queries or between separate categories, so multi-input runs may issue back-to-back requests faster than arXiv's published 3s/request etiquette. Throttle yourself by spacing queries across runs, or request a custom build that adds inter-source delay.
- `Actor.pushData()` per paper, capped at `maxPapersPerSource`
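The request pattern in the steps above can be sketched in Python as a list of batch URLs. This is an illustration of the pagination arithmetic, not the actor's code: `batch_urls` is a hypothetical helper, the HTTPS endpoint is this sketch's choice (the actor currently calls the HTTP one), and requesting only the remaining count in the final batch is an assumption about how the cap is applied.

```python
from urllib.parse import urlencode

API = "https://export.arxiv.org/api/query"  # sketch uses HTTPS; the actor uses HTTP
BATCH = 100                                 # papers per request
SEARCH_MAX = 500                            # ceiling for the paginated search path

def batch_urls(search_query, cap, sort_by="submittedDate",
               sort_order="descending", paginate=True):
    """List the request URLs needed to fetch up to `cap` papers."""
    # Search queries paginate (up to 500); categories get one batch (up to 100).
    cap = min(cap, SEARCH_MAX if paginate else BATCH)
    urls = []
    for start in range(0, cap, BATCH):
        params = {
            "search_query": search_query,
            "sortBy": sort_by,
            "sortOrder": sort_order,
            "start": start,
            "max_results": min(BATCH, cap - start),
        }
        urls.append(f"{API}?{urlencode(params)}")
    return urls

search_urls = batch_urls("all:retrieval augmented generation", 250)
category_urls = batch_urls("cat:cs.AI", 500, paginate=False)
```

A caller fetching `search_urls` in order would sleep 3 seconds between requests to stay within arXiv's etiquette; `category_urls` holds a single URL no matter how large the requested cap is.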
Honest limitations (read before bulk runs)
- HTTP not HTTPS. Code calls `http://export.arxiv.org/api/query`, so paper metadata transits in cleartext and is MITM-tamperable on hostile networks. arXiv supports HTTPS at the same hostname; this would be a one-line code change in a custom build.
- Categories don't paginate. A category fetch issues exactly one request with `max_results=min(maxPapersPerSource,100)`. Asking for 500 papers in `cs.AI` returns 100, NOT 500. Workaround: use a searchQuery formatted as `cat:cs.AI`; that path DOES paginate up to 500.
- 3s delay is intra-query only. A run with 5 queries + 5 categories issues up to 10 back-to-back requests with no inter-source delay, which is over arXiv's etiquette threshold for batched callers. Heavy users should expect occasional rate-limiting. Custom build can add `delay(3000)` between sources.
- Outer try/catch wraps the entire run. A 5xx on query #3 of 10 throws, the catch fires, and queries #4-#10 + ALL categories are SKIPPED. Workaround: split inputs across runs.
- Single-attempt fetch: no retry on transient errors.
- `searchQueries` use the `all:` operator, which matches title + abstract + author + comment. For more precise control, write the operator yourself in the query (e.g. `ti:transformer AND au:vaswani`).
- `sortBy=relevance` is best-effort against arXiv's relevance ranker; results can vary.
- No author affiliations / ORCIDs / institution / page-count / figure-count / reference-count / citation-count. arXiv API returns paper-level metadata only; deeper enrichment requires Semantic Scholar / OpenAlex / direct PDF parsing.
- No version chain: `arxivId` includes the version suffix (`v1`, `v2`...) but the actor does NOT walk the replacement history.
- `doi` and `journalRef` are often null: many arXiv papers never get a journal-of-record DOI assigned.
- Cheerio XML namespace selector quirk: `arxiv\\:doi, doi` matches both namespaced and bare `doi` elements; if arXiv changes their feed structure, the bare-name fallback covers it.
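Two of the limitations above (one failure skipping every remaining source, and no delay between sources) can be worked around on the caller's side by starting one run per source. The sketch below shows the control flow only, under stated assumptions: `run_sources_isolated` and `fake_run` are hypothetical names, and `run_one` stands in for whatever starts a single-source run, for example a wrapper around `client.actor(...).call(run_input=...)` from the Quick start.

```python
import time

def run_sources_isolated(sources, run_one, pause_s=3.0, sleep=time.sleep):
    """Run each (kind, value) source separately so one failure cannot skip the rest.

    Returns (results, failures), both keyed by (kind, value).
    """
    results, failures = {}, {}
    for index, (kind, value) in enumerate(sources):
        if index:
            sleep(pause_s)  # space out sources; 3 s mirrors arXiv's etiquette
        try:
            results[(kind, value)] = run_one(kind, value)
        except Exception as exc:  # record the failure and keep going
            failures[(kind, value)] = repr(exc)
    return results, failures

# Stand-in for a real single-source actor run; fails for one source on purpose.
def fake_run(kind, value):
    if value == "cs.LG":
        raise RuntimeError("simulated 503")
    return f"dataset-for-{kind}:{value}"

results, failures = run_sources_isolated(
    [("category", "cs.AI"), ("category", "cs.LG"), ("search", "RAG")],
    fake_run,
    pause_s=0.0,  # no waiting in this demo
)
```

Failed sources end up in `failures` and can be retried in a later pass, which also compensates for the single-attempt fetch.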
Related actors
- OpenLibrary Book Scraper → `apify.com/knotless_cadence/openlibrary-book-scraper`
- Hacker News Scraper: discussion threads + scoring → `apify.com/knotless_cadence/hacker-news-scraper`
- Reddit Discussion Scraper: subreddit threads → `apify.com/knotless_cadence/reddit-discussion-scraper`
- GitHub Trending Scraper: open-source signal → `apify.com/knotless_cadence/github-trending-scraper`
Need a custom build?
Apify-as-a-Service tiers:
- Pilot — $97: 1 actor, basic config, 7-day support
- Standard — $297: custom actor + Slack/email alerts, 30-day support
- Premium — $797: custom actor + dashboard + 90-day support + 1 modification round
Common arXiv-related custom builds: institution-affiliation extraction (PDF-parse), citation graph join with OpenAlex, weekly delta alerts to Slack, RAG-ready chunking + embedding pipeline.
- Email: spinov001@gmail.com
- Blog (case studies): https://blog.spinov.online
- Telegram channel (scraping & data engineering): https://t.me/scraping_ai
Honest disclosure
- Public arXiv API only (`http://export.arxiv.org/api/query`): no key, no auth, no scraping behind paywalls. HTTP not HTTPS in current code; a custom build can switch to HTTPS.
- arXiv's published etiquette is one request every 3 seconds. The actor enforces this between paginated batches within a single query, NOT between separate queries or categories. Multi-input runs can exceed etiquette.
- `maxPapersPerSource` for categories is silently capped at 100 (single-batch fetch, no pagination). For pagination use a `searchQuery` formatted as `cat:cs.AI`.
- The actor returns arXiv's native metadata only. It does not extract: author affiliations, ORCIDs, page counts, figure counts, reference counts, citation counts, paper comments, or replacement history. For those, use Semantic Scholar / OpenAlex / direct PDF parsing.
- Independent project, not affiliated with arXiv or Cornell University. arXiv data is free for research and educational use under arXiv's license terms.