arXiv Paper Scraper — AI Research, Abstracts & PDF Links

Pricing

from $1.84 / 1,000 arxiv papers

Search arXiv papers by keyword, ID list, or category. Returns title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Official Atom XML API — no proxy, no auth. Pay per result.

Developer: Vitalii Bondarev (maintained by community)

arXiv Paper Scraper — AI Research, Abstracts & PDF Links | from $1.50/1K

Used by AI/ML researchers building training datasets, literature review automation tools, and academic institutions monitoring emerging research.

Pricing: $1.50 per 1,000 papers (metadata + PDF links). Include full abstracts for an additional $0.50/1k (includeAbstract=true).

Search arXiv research papers by keyword, arXiv ID, or category. Returns structured metadata including title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Uses the official arXiv Atom XML API — no proxy, no auth required. Pay per result.


What is this actor for?

arXiv (arxiv.org) is the world's largest open-access repository for scientific preprints, with over 2.3 million papers in physics, mathematics, computer science, quantitative biology, and more. Researchers, analysts, and ML practitioners use arXiv daily for literature discovery.

This actor provides a clean, structured interface to the official arXiv API — no scraping, no fragility, no proxy costs for buyers.

What data does it return?

Each result row has 17 fields:

| Field | Description |
| --- | --- |
| arxiv_id | arXiv paper ID (e.g. 2501.05032v2) |
| title | Paper title |
| summary | Full abstract text (toggle off with includeAbstract=false) |
| authors | List of author name strings |
| primary_category | Primary arXiv category (e.g. cs.CL, cs.LG) |
| categories | All categories the paper is cross-listed in |
| published_at | Original submission date (ISO 8601 UTC) |
| updated_at | Last updated date (ISO 8601 UTC) |
| pdf_url | Direct PDF link |
| abstract_url | arXiv abstract page URL |
| doi | DOI if available (often present for journal-published papers) |
| comment | Author comment (conference/workshop note, code link, etc.) |
| journal_ref | Journal reference if published |
| query | Search query that returned this paper (null for ID lookups) |
| scraped_at | Timestamp of the scrape run (ISO 8601 UTC) |
| parse_confidence | Quality score 0.0–1.0; 1.0 = all fields parsed correctly |
| warnings | List of warning codes when confidence < 1.0 |
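
To illustrate how parse_confidence and warnings might be used downstream, here is a minimal Python sketch that splits rows into clean and flagged sets. The sample rows, their titles, and the EXAMPLE_WARNING code are invented placeholders (the actual warning codes are not listed here); only the field names come from the table above.

```python
from typing import Any

# Placeholder rows using the documented field names.
# The values are made up for illustration; they are not real actor output.
sample_rows: list[dict[str, Any]] = [
    {"arxiv_id": "2501.05032v2", "title": "Example A",
     "parse_confidence": 1.0, "warnings": []},
    {"arxiv_id": "1706.03762v7", "title": "Example B",
     "parse_confidence": 0.8, "warnings": ["EXAMPLE_WARNING"]},
]


def split_by_confidence(rows, threshold=1.0):
    """Return (clean, flagged): rows at or above the threshold, and the rest."""
    clean = [r for r in rows if r.get("parse_confidence", 0.0) >= threshold]
    flagged = [r for r in rows if r.get("parse_confidence", 0.0) < threshold]
    return clean, flagged


clean, flagged = split_by_confidence(sample_rows)
for row in flagged:
    # Review flagged rows manually or route them to a retry queue.
    print(row["arxiv_id"], row["warnings"])
```

Rows missing parse_confidence are treated as flagged here, which is a conservative choice rather than documented actor behavior.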

Input

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchQueries | string[] | | Keyword queries. Supports arXiv field prefixes: ti: (title), au: (author), abs: (abstract), cat: (category), all: (all fields). |
| idList | string[] | | Specific arXiv IDs to fetch (e.g. 2501.05032). |
| category | string | | Filter to category (e.g. cs.LG, cs.CL). Applied as cat: prefix when no query given. |
| sortBy | enum | relevance | relevance / lastUpdatedDate / submittedDate |
| maxItems | integer | 100 | Total results cap. 0 = unlimited. |
| includeAbstract | boolean | true | Include full abstract in output. |
| pageSize | integer | 50 | Results per API page (1–200). |
| delaySeconds | integer | 3 | Delay between API calls. arXiv requires ≥3 seconds. |
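
A client-side check of these parameters before starting a run could look like the sketch below. The function names build_run_input and min_delay_seconds are hypothetical helpers, not part of the actor; the ranges and defaults are taken from the table above, and the delay estimate assumes one pause between consecutive pages of a single query.

```python
import math

ALLOWED_SORT = {"relevance", "lastUpdatedDate", "submittedDate"}


def build_run_input(search_queries=None, id_list=None, category=None,
                    sort_by="relevance", max_items=100,
                    include_abstract=True, page_size=50, delay_seconds=3):
    """Validate values against the documented ranges and return the input dict."""
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")
    if not 1 <= page_size <= 200:
        raise ValueError("pageSize must be between 1 and 200")
    if delay_seconds < 3:
        raise ValueError("delaySeconds must be at least 3")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or positive")

    run_input = {
        "sortBy": sort_by,
        "maxItems": max_items,
        "includeAbstract": include_abstract,
        "pageSize": page_size,
        "delaySeconds": delay_seconds,
    }
    # Only send the optional selectors that were actually provided.
    if search_queries:
        run_input["searchQueries"] = list(search_queries)
    if id_list:
        run_input["idList"] = list(id_list)
    if category:
        run_input["category"] = category
    return run_input


def min_delay_seconds(max_items, page_size=50, delay_seconds=3):
    """Lower bound on time spent waiting between pages for a single query."""
    if max_items == 0:
        return None  # unlimited: depends on how many results arXiv holds
    pages = math.ceil(max_items / page_size)
    return (pages - 1) * delay_seconds


print(build_run_input(category="cs.LG", sort_by="submittedDate"))
print(min_delay_seconds(1000))  # 20 pages -> 19 pauses of 3 s = 57
```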

Usage examples

Search for recent LLM papers:

{
  "searchQueries": ["ti:\"large language models\""],
  "sortBy": "submittedDate",
  "maxItems": 50
}

Look up specific papers by ID:

{
  "idList": ["2501.05032", "2402.14679", "1706.03762"]
}

Browse a category:

{
  "category": "cs.LG",
  "sortBy": "submittedDate",
  "maxItems": 100
}

Combined query + category:

{
  "searchQueries": ["abs:\"retrieval augmented generation\""],
  "category": "cs.CL",
  "sortBy": "submittedDate",
  "maxItems": 200
}

arXiv search query syntax

arXiv supports prefix operators for targeted searches:

  • ti:transformer matches papers with "transformer" in the title
  • au:"Yann LeCun" matches papers by Yann LeCun
  • abs:diffusion matches papers with "diffusion" in the abstract
  • cat:cs.CV matches papers in Computer Vision
  • all:"attention mechanism" matches papers where any field contains the phrase
  • Combine terms with the Boolean operators AND, OR, and ANDNOT
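
For reference, the sketch below shows how such a query string maps onto a request URL for the public arXiv API (endpoint export.arxiv.org/api/query with the search_query, start, max_results, sortBy, and sortOrder parameters). The helper names are hypothetical, and joining a keyword query with a category via AND is an assumption about how the two could be combined, not a description of the actor's internals.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def combine_query(query, category=None):
    """Restrict a query to one category (assumed combination rule: AND)."""
    if category is None:
        return query
    return f"({query}) AND cat:{category}"


def build_api_url(search_query, start=0, max_results=50,
                  sort_by="submittedDate", sort_order="descending"):
    """Build the GET URL; urlencode handles quotes, spaces, and colons."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


query = combine_query('abs:"retrieval augmented generation"', "cs.CL")
print(query)
print(build_api_url(query))
```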

Pricing examples

| Run | Items | Cost |
| --- | --- | --- |
| 100 papers (metadata, no abstract) | 100 | ~$0.15 |
| 100 papers with full abstracts | 100 | ~$0.20 |
| 1,000 papers (metadata) | 1,000 | ~$1.50 |
| Weekly category monitor (cs.LG, 200 papers) | 800/mo | ~$1.20/mo |

You only pay for papers successfully pushed to the dataset.
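
The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes the rates stated above ($1.50 per 1,000 papers, plus $0.50 per 1,000 when abstracts are included); estimate_cost is a hypothetical helper, and actual billing is determined by the platform.

```python
from decimal import Decimal

BASE_PER_1K = Decimal("1.50")      # metadata + PDF links
ABSTRACT_PER_1K = Decimal("0.50")  # surcharge when includeAbstract=true


def estimate_cost(papers: int, include_abstract: bool = True) -> Decimal:
    """Estimated cost in USD, rounded to cents."""
    rate = BASE_PER_1K
    if include_abstract:
        rate += ABSTRACT_PER_1K
    return (rate * papers / 1000).quantize(Decimal("0.01"))


print(estimate_cost(100, include_abstract=False))  # 0.15
print(estimate_cost(100, include_abstract=True))   # 0.20
print(estimate_cost(800, include_abstract=False))  # 1.20
```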

FAQ

Do I need a proxy or API key? No. The arXiv API is fully public and requires no authentication. Zero proxy cost for buyers.

What formats can I export results in? JSON, CSV, JSONL, Excel — all via Apify dataset export. The PDF URL field lets you download source PDFs programmatically.

Can I monitor a category for new papers on a schedule? Yes. Use Apify Schedules + sortBy: submittedDate to fetch the latest papers daily or weekly. Combine with a webhook to push new papers into Slack, Notion, or a literature database.

What if the actor returns empty results? Check your searchQueries syntax — arXiv requires the correct field prefix format (e.g. ti:"attention" not title:attention). For ID lookups, verify the ID exists on arxiv.org. The OUTPUT key-value store reports any failed queries with reason codes.
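
A quick way to catch the prefix mistake mentioned above before spending a run is a local check like the following sketch. The helper unknown_prefixes is hypothetical, and KNOWN_PREFIXES contains only the five prefixes listed in this README; arXiv itself accepts a few more, so treat a hit as a hint rather than a definite error.

```python
import re

# Prefixes documented in this README; arXiv supports additional ones.
KNOWN_PREFIXES = {"ti", "au", "abs", "cat", "all"}

_QUOTED = re.compile(r'"[^"]*"')
_PREFIX = re.compile(r"\b([A-Za-z_]+):")


def unknown_prefixes(query):
    """Return field prefixes in `query` that are not in KNOWN_PREFIXES."""
    # Blank out quoted phrases so colons inside them are not misread.
    stripped = _QUOTED.sub('""', query)
    found = _PREFIX.findall(stripped)
    return [p for p in found if p not in KNOWN_PREFIXES]


print(unknown_prefixes("title:attention"))  # ['title']
print(unknown_prefixes('ti:"attention: a survey" AND cat:cs.LG'))  # []
```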

Edge cases & known limits

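  • Rate limit: arXiv asks clients to wait at least 3 seconds between API calls. The actor respects this automatically, so run time grows with the number of pages: at the default pageSize of 50, fetching 1,000 papers takes 20 API calls and about a minute of enforced delay.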
  • max_results cap: arXiv API caps single-page results at 2000. The actor paginates automatically for large queries.
  • No auth required: The API is publicly accessible with no token or proxy.
  • Versioned IDs: arXiv returns versioned IDs (e.g. 2501.05032v2). The arxiv_id field contains the version-qualified ID from the response.
  • LaTeX in abstracts: arXiv abstracts may contain LaTeX markup (e.g. $\textbf{...}$). The summary field returns raw text — no LaTeX rendering.
  • Not affiliated with arXiv: This actor uses the official public API. arXiv and Cornell University are not affiliated with this actor.
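
Because arxiv_id is version-qualified, scheduled runs may return the same paper under different versions. A minimal sketch for splitting off the version and keeping only the newest row per paper is shown below; split_arxiv_id and latest_only are hypothetical helpers, and the rows are assumed to carry the arxiv_id field described earlier.

```python
import re

# Base ID followed by an optional "v<digits>" suffix at the very end.
_VERSIONED = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_arxiv_id(arxiv_id):
    """Split '2501.05032v2' into ('2501.05032', 2); version is None if absent."""
    match = _VERSIONED.match(arxiv_id)
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)


def latest_only(rows):
    """Keep one row per base ID, preferring the highest version number."""
    best = {}
    for row in rows:
        base, version = split_arxiv_id(row["arxiv_id"])
        rank = version or 0
        if base not in best or rank > best[base][0]:
            best[base] = (rank, row)
    return [row for _, row in best.values()]


print(split_arxiv_id("2501.05032v2"))  # ('2501.05032', 2)
print(split_arxiv_id("2501.05032"))    # ('2501.05032', None)
```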

Why choose this actor?

  • parse_confidence on every record: immediately see if parsing succeeded. No silent failures.
  • Zero proxy cost: official API, no residential proxies needed — buyers pay nothing for egress.
  • All 3 modes: search + ID lookup + category browse in a single actor.
  • Full metadata: doi, comment, journal_ref, versioned PDF URLs — fields competitors skip.
  • Batch-friendly: handles multiple queries in one run with automatic pagination.

Competitor comparison

| Feature | This actor | scrapestorm/arxiv | Any REST client |
| --- | --- | --- | --- |
| Official Atom XML API | Yes | Unknown | Yes (DIY) |
| parse_confidence | Yes | No | No |
| Abstract toggle (save cost) | Yes | Always-on | No |
| Category + keyword + ID in one actor | Yes | Partial | Separate calls |
| Proxy needed | No | Unknown | No |
| Price | $1.50/1k | $9.99/1k | Free (DIY infra) |

At $1.50 versus $9.99 per 1,000 papers, this actor is about 6.7× cheaper than scrapestorm for academic paper data.

AI training datasets

Pull all papers in cs.LG since 2020: title, abstract, and category together make a ready-made AI research corpus. Use includeAbstract=true to include the full abstract text, or includeAbstract=false for fast, lower-cost metadata-only runs.
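
As a rough illustration, dataset rows could be turned into JSONL corpus lines as in the sketch below. The output keys (id, category, text) and the helper to_corpus_line are arbitrary choices for the example, the sample row is a placeholder, and treating a missing or null summary as empty is an assumption about metadata-only runs.

```python
import json


def to_corpus_line(row):
    """Serialize one dataset row as a JSONL line with title + abstract as text."""
    title = row["title"].strip()
    abstract = (row.get("summary") or "").strip()  # may be absent without abstracts
    text = f"{title}\n\n{abstract}".strip()
    record = {
        "id": row["arxiv_id"],
        "category": row["primary_category"],
        "text": text,
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder row; the values are invented for the example.
example_row = {
    "arxiv_id": "0000.00000v1",
    "title": "Example title",
    "summary": "Example abstract.",
    "primary_category": "cs.LG",
}
print(to_corpus_line(example_row))
```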

Use with AI agents (MCP)

Ask your AI agent "find me the 10 most recent papers on RAG" and this actor returns structured paper metadata with abstract and PDF link, ready for downstream summarization.

Point your MCP client at this tool:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp.apify.com/?tools=bovi/arxiv-scraper",
        "--header",
        "Authorization: Bearer <YOUR_APIFY_TOKEN>"
      ]
    }
  }
}

Minimal call:

{ "searchQueries": ["large language models"], "maxItems": 20 }

Enrich with crossref-scraper

Need journal metadata, citation counts, or DOIs for published versions? Use our crossref-scraper alongside this actor to enrich arXiv papers with citation data.

Integrations

Built for ML researchers and R&D teams building literature datasets and monitoring new papers by keyword or category — the JSON/dataset output drops into the tools you already run, no glue code:

  • n8n / Make / Zapier: trigger a run or pipe every new dataset item into 500+ apps (Google Sheets, Airtable, Slack, HubSpot, your database) with no code.
  • Webhooks: call your own endpoint the moment a run finishes to push results straight into your pipeline.
  • MCP server: expose this actor as a tool to Claude, Cursor, or any MCP client so an AI agent can pull this data mid-conversation.
  • API & SDKs: fetch the dataset as JSON, CSV, or Excel through the Apify REST API or the Python / JS SDKs.
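
For the REST route, a standard-library sketch of a synchronous run request is shown below. It assumes the actor ID is bovi/arxiv-scraper (taken from the MCP URL earlier on this page, written with "~" in the URL path) and uses Apify's run-sync-get-dataset-items endpoint, which suits short runs; build_run_request is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed actor ID, derived from the MCP URL; "/" becomes "~" in API paths.
ACTOR_ID = "bovi~arxiv-scraper"
API_URL = (
    f"https://api.apify.com/v2/acts/{ACTOR_ID}"
    "/run-sync-get-dataset-items?format=json"
)


def build_run_request(token, run_input):
    """Build a POST request that starts a run and returns its dataset items."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "<YOUR_APIFY_TOKEN>",
    {"searchQueries": ["large language models"], "maxItems": 20},
)
# To execute (needs a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

For long runs, starting the actor asynchronously and reading the dataset afterwards is likely a better fit than the synchronous endpoint.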
