ArXiv Paper Scraper — Search by Category, Bulk JSON, DOI

Pricing

Pay per usage

arXiv corpus as JSON — arxivId, title, authors, abstract, categories, dates, DOI, PDF URL. By search OR category. Built for ML/AI training data + lit reviews. 19 runs. Backed by 951-run Trustpilot flagship + 31-actor portfolio. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai

Developer: Alex (maintained by Community)

arXiv Paper Scraper — Bulk Research-Paper Metadata via arXiv API

Pull research-paper metadata from the official arXiv API by category or free-text search query — titles, abstracts, authors, primary + secondary categories, submission/update dates, DOI, journal reference, PDF URL — as a flat JSON dataset. No API key, no auth.

Verified against src/main.js — every output field below is what the actor actually pushes.


What you get per paper (14 fields)

| Field | Type | Example | Source in main.js |
|---|---|---|---|
| arxivId | string | 2403.14812v1 | parsed from <id> URL tail |
| title | string | Attention Is All You Need | <title> text, whitespace-collapsed |
| authors | string[] | ["Ashish Vaswani", "Noam Shazeer"] | each <author><name> text; no affiliations, no ORCIDs |
| abstract | string | full abstract, whitespace-collapsed | <summary> text |
| categories | string[] | ["cs.CL", "cs.LG"] | every <category term="..."> |
| primaryCategory | string or null | cs.CL | first <category> |
| published | ISO string | 2017-06-12T17:57:34Z | <published> |
| updated | ISO string | 2023-08-02T00:41:18Z | <updated> |
| doi | string or null | 10.5555/3294996.3295349 | <arxiv:doi> or null |
| journalRef | string or null | NeurIPS 2017 | <arxiv:journal_ref> or null |
| pdfUrl | string | http://arxiv.org/pdf/2403.14812v1 | <id> with /abs/ replaced by /pdf/ |
| abstractUrl | string | http://arxiv.org/abs/2403.14812v1 | <id> text |
| source | string | category:cs.AI or search:RAG | tagged at scrape time |
| scrapedAt | ISO string | 2026-04-29T10:30:00.000Z | actor capture time |

The actor parses arXiv's Atom-XML response with Cheerio (xmlMode: true). It does not extract author affiliations, institutions, page counts, figure counts, reference counts, citations, or comment-field metadata — those would require parsing PDFs or hitting Semantic Scholar / OpenAlex (separate actors).
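
For downstream code, it can help to pin the record shape in a type. The following is a minimal sketch, not part of the actor: `ArxivPaper`, `NULLABLE_FIELDS`, and `missing_fields` are hypothetical names, and the nullability of each field is taken from the table above.

```python
from typing import List, Optional, TypedDict


class ArxivPaper(TypedDict):
    """Hypothetical type mirroring the 14 output fields listed above."""
    arxivId: str
    title: str
    authors: List[str]
    abstract: str
    categories: List[str]
    primaryCategory: Optional[str]
    published: str
    updated: str
    doi: Optional[str]
    journalRef: Optional[str]
    pdfUrl: str
    abstractUrl: str
    source: str
    scrapedAt: str


# Fields the table documents as "string or null".
NULLABLE_FIELDS = {"primaryCategory", "doi", "journalRef"}


def missing_fields(item: dict) -> List[str]:
    """Return expected field names that are absent, or null where null is not documented."""
    problems = []
    for name in ArxivPaper.__annotations__:
        if name not in item:
            problems.append(name)
        elif item[name] is None and name not in NULLABLE_FIELDS:
            problems.append(name)
    return problems
```

Running `missing_fields` over a few dataset items before a bulk ingest is a cheap way to notice if the output shape ever drifts from this table.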


Inputs (from .actor/input_schema.json)

| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQueries | string[] | [] | free-text queries, sent to arXiv as all:<query> (matches title + abstract + author + comment) |
| categories | string[] | [] | arXiv category codes (e.g. cs.AI, cs.LG, stat.ML) |
| maxPapersPerSource | integer | 50 | cap per query / per category. Caveat: searchQueries paginate up to this cap (max 500 = 5 batches of 100). Categories only fetch ONE batch (≤100); values >100 for categories are silently capped at 100. To pull >100 papers per category, use a searchQuery cat:cs.AI instead, or request a paginated-categories custom build. |
| sortBy | string | submittedDate | submittedDate, lastUpdatedDate, or relevance |
| sortOrder | string | descending | ascending or descending |

If both searchQueries and categories are empty, the actor fetches the latest 25 cs.AI papers as a default sample.
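
The capping rules above are easy to misread, so here is a small sketch of how many papers each source can yield at most. It restates the documented behaviour and is not the actor's code; `effective_limits` and the two constants are hypothetical names, and tagging the default sample as `category:cs.AI` is an assumption.

```python
from typing import Dict, List

SEARCH_CAP = 500    # documented: 5 paginated batches of 100
CATEGORY_CAP = 100  # documented: one batch, no pagination


def effective_limits(search_queries: List[str], categories: List[str],
                     max_papers_per_source: int = 50) -> Dict[str, int]:
    """Map each source tag to the largest number of papers it can return."""
    if not search_queries and not categories:
        # Documented fallback: latest 25 cs.AI papers (tag name assumed).
        return {"category:cs.AI": 25}
    limits = {}
    for query in search_queries:
        limits[f"search:{query}"] = min(max_papers_per_source, SEARCH_CAP)
    for category in categories:
        limits[f"category:{category}"] = min(max_papers_per_source, CATEGORY_CAP)
    return limits


print(effective_limits(["retrieval augmented generation"], ["cs.AI"], 200))
```

With `maxPapersPerSource` set to 200, the search source can reach 200 papers while the category source stops at 100.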


Use cases

  • Weekly digest — pull new cs.AI / cs.LG / cs.CL submissions every Monday for a research-team Slack feed
  • RAG corpus ingestion — feed arxivId + pdfUrl into a downstream PDF-extraction pipeline (use pypdf / PyMuPDF)
  • Citation graph stub — collect doi / journalRef for joining against OpenAlex or Semantic Scholar later
  • Topic monitoring — search "retrieval augmented generation" weekly, diff against last week
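
The "diff against last week" step in topic monitoring could look roughly like this. It is a sketch under two assumptions: both weeks' dataset items are already loaded as lists of dicts, and a new version of an already-seen paper (v2 after v1) should not count as new. `base_id` and `new_papers` are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, List

_VERSION_SUFFIX = re.compile(r"v\d+$")


def base_id(arxiv_id: str) -> str:
    """Strip the version suffix: '2403.14812v1' becomes '2403.14812'."""
    return _VERSION_SUFFIX.sub("", arxiv_id)


def new_papers(last_week: Iterable[dict], this_week: Iterable[dict]) -> List[dict]:
    """Return this week's items whose base arXiv ID did not appear last week."""
    seen = {base_id(p["arxivId"]) for p in last_week}
    fresh: Dict[str, dict] = {}
    for paper in this_week:
        key = base_id(paper["arxivId"])
        # Skip papers seen last week and duplicates within this week's run.
        if key not in seen and key not in fresh:
            fresh[key] = paper
    return list(fresh.values())
```

Deduplicating within the same week matters when a paper is returned by both a category and a search query in one run.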

Quick start

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("knotless_cadence/arxiv-paper-scraper").call(
    run_input={
        "categories": ["cs.AI", "cs.LG"],
        "searchQueries": ["retrieval augmented generation"],
        # Applies per source; category fetches are still capped at 100 (see Inputs).
        "maxPapersPerSource": 200,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
)

# Print a short summary of every paper in the run's default dataset.
for p in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(p["arxivId"], p["primaryCategory"], "|", p["title"])
    print("  Authors:", ", ".join(p["authors"][:3]))
    print("  Published:", p["published"])
    print("  PDF:", p["pdfUrl"])

Bash one-liner

curl -s "https://api.apify.com/v2/acts/knotless_cadence~arxiv-paper-scraper/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
-H "Content-Type: application/json" \
-d '{"categories":["cs.AI","cs.LG"],"maxPapersPerSource":50,"sortBy":"submittedDate","sortOrder":"descending"}' \
| jq -r '.[] | "\(.arxivId)\t\(.primaryCategory)\t\(.title)"'

How it works

  1. For each searchQueries entry → arXiv Atom API: ?search_query=all:<query>&sortBy=...&max_results=...&start=... (paginated up to maxPapersPerSource, max 500)
  2. For each categories entry → arXiv Atom API: ?search_query=cat:<cat>&sortBy=...&max_results=<min(maxPapersPerSource,100)> (single-batch only — no pagination; max 100)
  3. Cheerio parses XML (xmlMode: true), walks each <entry>, builds the 14-field record above
  4. 3-second delay between paginated batches within a single query: delay(3000) runs after each batch. There is no delay between separate queries or between separate categories, so multi-input runs may issue back-to-back requests faster than arXiv's published 3s/request etiquette. Throttle yourself by spacing queries across runs, or request a custom build that adds inter-source delay.
  5. Actor.pushData() per paper, capped at maxPapersPerSource
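
To make the steps above concrete, here is a Python stand-in for the request-building and parsing stages. The actor itself is JavaScript with Cheerio, so this is an illustration of the same mapping rather than its implementation: `build_url`, `parse_entries`, and the helpers are hypothetical names, and splitting on `/abs/` to obtain `arxivId` is an assumption about what "URL tail" means.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlencode

API = "http://export.arxiv.org/api/query"  # endpoint named in this README
ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"


def build_url(search_query: str, start: int = 0, max_results: int = 100,
              sort_by: str = "submittedDate", sort_order: str = "descending") -> str:
    """Build one request URL, e.g. search_query='cat:cs.AI' or 'all:RAG'."""
    params = {"search_query": search_query, "sortBy": sort_by,
              "sortOrder": sort_order, "start": start, "max_results": max_results}
    return f"{API}?{urlencode(params)}"


def _collapse(text: Optional[str]) -> str:
    """Collapse runs of whitespace, as the field table describes."""
    return re.sub(r"\s+", " ", text or "").strip()


def _optional(text: Optional[str]) -> Optional[str]:
    """Like _collapse, but returns None for missing or empty elements."""
    return _collapse(text) or None


def parse_entries(atom_xml: str, source: str) -> List[dict]:
    """Turn an Atom feed string into a list of 14-field records."""
    root = ET.fromstring(atom_xml)
    scraped_at = (datetime.now(timezone.utc)
                  .isoformat(timespec="milliseconds").replace("+00:00", "Z"))
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = _collapse(entry.findtext(f"{ATOM}id"))
        categories = [c.get("term") for c in entry.findall(f"{ATOM}category")]
        papers.append({
            "arxivId": abs_url.split("/abs/")[-1],  # assumed meaning of "URL tail"
            "title": _collapse(entry.findtext(f"{ATOM}title")),
            "authors": [_collapse(a.findtext(f"{ATOM}name"))
                        for a in entry.findall(f"{ATOM}author")],
            "abstract": _collapse(entry.findtext(f"{ATOM}summary")),
            "categories": categories,
            "primaryCategory": categories[0] if categories else None,
            "published": _collapse(entry.findtext(f"{ATOM}published")),
            "updated": _collapse(entry.findtext(f"{ATOM}updated")),
            "doi": _optional(entry.findtext(f"{ARXIV}doi")),
            "journalRef": _optional(entry.findtext(f"{ARXIV}journal_ref")),
            "pdfUrl": abs_url.replace("/abs/", "/pdf/"),
            "abstractUrl": abs_url,
            "source": source,
            "scrapedAt": scraped_at,
        })
    return papers
```

Unlike the Cheerio selector described under the limitations, this sketch resolves `doi` and `journal_ref` through the arXiv XML namespace only and has no bare-name fallback.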

Honest limitations (read before bulk runs)

  • HTTP not HTTPS. Code calls http://export.arxiv.org/api/query — paper metadata transits cleartext, MITM-tamperable on hostile networks. arXiv supports HTTPS at the same hostname; this would be a one-line code change in a custom build.
  • Categories don't paginate. A category fetch issues exactly one request with max_results=min(maxPapersPerSource,100). Asking for 500 papers in cs.AI returns 100, NOT 500. Workaround: use a searchQuery formatted as cat:cs.AI — that path DOES paginate up to 500.
  • 3s delay is intra-query only. A run with 5 queries + 5 categories issues up to 10 back-to-back requests with no inter-source delay — over arXiv's etiquette threshold for batched callers. Heavy users should expect occasional rate-limiting. Custom build can add delay(3000) between sources.
  • Outer try/catch wraps the entire run. A 5xx on query #3 of 10 throws, the catch fires, and queries #4-#10 + ALL categories are SKIPPED. Workaround: split inputs across runs.
  • Single-attempt fetch — no retry on transient errors.
  • searchQueries use all: operator — matches title + abstract + author + comment. For more precise control, write the operator yourself in the query (e.g. ti:transformer AND au:vaswani).
  • sortBy=relevance is best-effort against arXiv's relevance ranker; results can vary.
  • No author affiliations / ORCIDs / institution / page-count / figure-count / reference-count / citation-count. arXiv API returns paper-level metadata only; deeper enrichment requires Semantic Scholar / OpenAlex / direct PDF parsing.
  • No version chain. arxivId includes the version suffix (v1, v2...) but the actor does NOT walk the replacement history.
  • doi and journalRef are often null — many arXiv papers never get a journal-of-record DOI assigned.
  • Cheerio XML namespace selector quirk. The selector arxiv\\:doi, doi matches both namespaced and bare doi elements; if arXiv changes their feed structure, the bare-name fallback covers it.
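
The "split inputs across runs" workaround for the outer try/catch and the missing inter-source delay can be scripted on the caller's side. This is a sketch only: `single_source_inputs` and `run_each` are hypothetical helpers, the 3-second pause mirrors the etiquette figure quoted above, and `client` is assumed to be an `ApifyClient` as in the Quick start.

```python
import time
from typing import Iterator, List


def single_source_inputs(search_queries: List[str], categories: List[str],
                         **common) -> Iterator[dict]:
    """Yield one run input per source so a failure only loses that source."""
    for query in search_queries:
        yield {"searchQueries": [query], "categories": [], **common}
    for category in categories:
        yield {"searchQueries": [], "categories": [category], **common}


def run_each(client, actor_id: str, inputs, pause_s: float = 3.0) -> List[str]:
    """Start one run per input, pausing between runs; return the dataset IDs."""
    dataset_ids = []
    for run_input in inputs:
        try:
            run = client.actor(actor_id).call(run_input=run_input)
            dataset_ids.append(run["defaultDatasetId"])
        except Exception as exc:  # keep going: one bad source should not stop the rest
            print("run failed:", run_input, exc)
        time.sleep(pause_s)
    return dataset_ids
```

A call such as `run_each(client, "knotless_cadence/arxiv-paper-scraper", single_source_inputs(["retrieval augmented generation"], ["cs.AI", "cs.LG"], maxPapersPerSource=100))` would start three separate runs. The trade-off is more runs, each billed per usage, in exchange for isolating failures.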

  • OpenLibrary Book Scraper → apify.com/knotless_cadence/openlibrary-book-scraper
  • Hacker News Scraper — discussion threads + scoring → apify.com/knotless_cadence/hacker-news-scraper
  • Reddit Discussion Scraper — subreddit threads → apify.com/knotless_cadence/reddit-discussion-scraper
  • GitHub Trending Scraper — open-source signal → apify.com/knotless_cadence/github-trending-scraper

Need a custom build?

Apify-as-a-Service tiers:

  • Pilot — $97: 1 actor, basic config, 7-day support
  • Standard — $297: custom actor + Slack/email alerts, 30-day support
  • Premium — $797: custom actor + dashboard + 90-day support + 1 modification round

Common arXiv-related custom builds: institution-affiliation extraction (PDF-parse), citation graph join with OpenAlex, weekly delta alerts to Slack, RAG-ready chunking + embedding pipeline.

Email: spinov001@gmail.com Blog (case studies): https://blog.spinov.online Telegram channel (scraping & data engineering): https://t.me/scraping_ai


Honest disclosure

  • Public arXiv API only (http://export.arxiv.org/api/query) — no key, no auth, no scraping behind paywalls. HTTP not HTTPS in current code — custom build can switch to HTTPS.
  • arXiv's published etiquette is one request every 3 seconds — actor enforces this between paginated batches within a single query, NOT between separate queries or categories. Multi-input runs can exceed etiquette.
  • maxPapersPerSource for categories is silently capped at 100 (single-batch fetch, no pagination). For pagination use a searchQuery formatted as cat:cs.AI.
  • The actor returns arXiv's native metadata only. It does not extract: author affiliations, ORCIDs, page counts, figure counts, reference counts, citation counts, paper comments, or replacement history. For those, use Semantic Scholar / OpenAlex / direct PDF parsing.
  • Independent project — not affiliated with arXiv or Cornell University. arXiv data is free for research and educational use under arXiv's license terms.