Pricing

from $1.84 / 1,000 arXiv papers

arXiv Paper Scraper — AI Research, Abstracts & PDF Links

Search arXiv papers by keyword, ID list, or category. Returns title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Official Atom XML API — no proxy, no auth. Pay per result.

Developer

Vitalii Bondarev

Actor stats

Last modified: 15 days ago

arXiv Paper Scraper — AI Research, Abstracts & PDF Links | from $1.50/1K

Used by AI/ML researchers building training datasets, literature review automation tools, and academic institutions monitoring emerging research.

Pricing: $1.50 per 1,000 papers (metadata + PDF links). Include full abstracts for an additional $0.50/1k (includeAbstract=true).

Search arXiv research papers by keyword, arXiv ID, or category. Returns structured metadata including title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Uses the official arXiv Atom XML API — no proxy, no auth required. Pay per result.

What is this actor for?

arXiv (arxiv.org) is the world's largest open-access repository for scientific preprints, with over 2.3 million papers in physics, mathematics, computer science, quantitative biology, and more. Researchers, analysts, and ML practitioners use arXiv daily for literature discovery.

This actor provides a clean, structured interface to the official arXiv API — no scraping, no fragility, no proxy costs for buyers.

What data does it return?

Each result row has 17 fields:

Field	Description
`arxiv_id`	arXiv paper ID (e.g. `2501.05032v2`)
`title`	Paper title
`summary`	Full abstract text (toggle off with `includeAbstract=false`)
`authors`	List of author name strings
`primary_category`	Primary arXiv category (e.g. `cs.CL`, `cs.LG`)
`categories`	All categories the paper is cross-listed in
`published_at`	Original submission date (ISO 8601 UTC)
`updated_at`	Last updated date (ISO 8601 UTC)
`pdf_url`	Direct PDF link
`abstract_url`	arXiv abstract page URL
`doi`	DOI if available (often present for journal-published papers)
`comment`	Author comment (conference/workshop note, code link, etc.)
`journal_ref`	Journal reference if published
`query`	Search query that returned this paper (null for ID lookups)
`scraped_at`	Timestamp of the scrape run (ISO 8601 UTC)
`parse_confidence`	Quality score 0.0–1.0; 1.0 = all fields parsed correctly
`warnings`	List of warning codes when confidence < 1.0
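
As a sketch of how `parse_confidence` and `warnings` might be used downstream, the snippet below splits rows into fully parsed and flagged groups. The sample rows and the `EXAMPLE_WARNING` code are made up for illustration; the actual warning codes are not listed in this README.

```python
from typing import Iterable


def split_by_confidence(rows: Iterable[dict], threshold: float = 1.0):
    """Separate rows into (clean, flagged) using parse_confidence."""
    clean, flagged = [], []
    for row in rows:
        # A missing or null score is treated as 0.0 so the row gets reviewed
        # instead of being trusted silently.
        score = row.get("parse_confidence") or 0.0
        (clean if score >= threshold else flagged).append(row)
    return clean, flagged


# Made-up rows for illustration only; real rows carry all 17 fields.
rows = [
    {"arxiv_id": "2501.05032v2", "parse_confidence": 1.0, "warnings": []},
    {"arxiv_id": "1706.03762v7", "parse_confidence": 0.8,
     "warnings": ["EXAMPLE_WARNING"]},  # hypothetical warning code
]

clean, flagged = split_by_confidence(rows)
for row in flagged:
    print(row["arxiv_id"], row["warnings"])
```

Lowering `threshold` keeps partially parsed rows in the clean group, which may be acceptable when only a few fields matter.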

Input

Parameter	Type	Default	Description
`searchQueries`	string[]	—	Keyword queries. Supports arXiv field prefixes: `ti:` (title), `au:` (author), `abs:` (abstract), `cat:` (category), `all:` (all fields).
`idList`	string[]	—	Specific arXiv IDs to fetch (e.g. `2501.05032`).
`category`	string	—	Filter to category (e.g. `cs.LG`, `cs.CL`). Applied as `cat:` prefix when no query given.
`sortBy`	enum	`relevance`	`relevance` / `lastUpdatedDate` / `submittedDate`
`maxItems`	integer	`100`	Total results cap. 0 = unlimited.
`includeAbstract`	boolean	`true`	Include full abstract in output.
`pageSize`	integer	`50`	Results per API page (1–200).
`delaySeconds`	integer	`3`	Delay between API calls. arXiv requires ≥3 seconds.
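
The constraints in the table can be checked before a run is started. The helper below is a client-side sketch, not part of the actor: `build_input` is a hypothetical name, and the rule that at least one of `searchQueries`, `idList`, or `category` must be set is an assumption about what makes a useful input.

```python
ALLOWED_SORT = {"relevance", "lastUpdatedDate", "submittedDate"}


def build_input(search_queries=None, id_list=None, category=None,
                sort_by="relevance", max_items=100, include_abstract=True,
                page_size=50, delay_seconds=3):
    """Return an actor input dict, raising ValueError on out-of-range values."""
    # Assumption: an input with none of the three sources has nothing to fetch.
    if not (search_queries or id_list or category):
        raise ValueError("provide searchQueries, idList, or category")
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")
    if not 1 <= page_size <= 200:
        raise ValueError("pageSize must be between 1 and 200")
    if delay_seconds < 3:
        raise ValueError("delaySeconds must be at least 3")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or positive")

    run_input = {
        "sortBy": sort_by,
        "maxItems": max_items,
        "includeAbstract": include_abstract,
        "pageSize": page_size,
        "delaySeconds": delay_seconds,
    }
    # Optional fields are only included when they were actually supplied.
    if search_queries:
        run_input["searchQueries"] = list(search_queries)
    if id_list:
        run_input["idList"] = list(id_list)
    if category:
        run_input["category"] = category
    return run_input


print(build_input(category="cs.LG", sort_by="submittedDate"))
```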

Usage examples

Search for recent LLM papers:

{
  "searchQueries": ["ti:\"large language models\""],
  "sortBy": "submittedDate",
  "maxItems": 50
}

Look up specific papers by ID:

{
  "idList": ["2501.05032", "2402.14679", "1706.03762"]
}

Browse a category:

{
  "category": "cs.LG",
  "sortBy": "submittedDate",
  "maxItems": 100
}

Combined query + category:

{
  "searchQueries": ["abs:\"retrieval augmented generation\""],
  "category": "cs.CL",
  "sortBy": "submittedDate",
  "maxItems": 200
}
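
The same inputs can be sent from code. The sketch below uses the `apify-client` Python package and assumes the actor ID `bovi/arxiv-scraper` (taken from the MCP URL further down this page) and an `APIFY_TOKEN` environment variable; both should be checked against your own account before use.

```python
import os

# Assumption: actor ID copied from the MCP URL in this README.
ACTOR_ID = "bovi/arxiv-scraper"


def combined_input() -> dict:
    """The 'combined query + category' example as a Python dict."""
    return {
        "searchQueries": ['abs:"retrieval augmented generation"'],
        "category": "cs.CL",
        "sortBy": "submittedDate",
        "maxItems": 200,
    }


def run_actor(token: str, run_input: dict) -> list:
    """Start a run, wait for it to finish, and return the dataset rows."""
    # Imported lazily so the rest of the module works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


token = os.environ.get("APIFY_TOKEN")  # assumed environment variable name
if token:
    for paper in run_actor(token, combined_input()):
        print(paper["arxiv_id"], paper["title"])
```

Without a token set, the script only defines the helpers and exits, so it is safe to import elsewhere.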

arXiv search query syntax

arXiv supports prefix operators for targeted searches:

ti:transformer — papers with "transformer" in the title
au:"Yann LeCun" — papers by Yann LeCun
abs:diffusion — papers with "diffusion" in the abstract
cat:cs.CV — papers in Computer Vision
all:"attention mechanism" — any field contains phrase
Combine with AND, OR, ANDNOT
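
Query strings like these can be assembled programmatically. The helpers `term` and `combine` below are hypothetical names for a minimal sketch; they only cover the prefixes and operators listed above and quote values that contain whitespace.

```python
PREFIXES = {"ti", "au", "abs", "cat", "all"}
OPERATORS = {"AND", "OR", "ANDNOT"}


def term(prefix: str, value: str) -> str:
    """Format one prefixed term, quoting values that contain whitespace."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix}")
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"{prefix}:{value}"


def combine(operator: str, *terms: str) -> str:
    """Join terms with one of the boolean operators."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f" {operator} ".join(terms)


print(combine("AND", term("ti", "transformer"), term("cat", "cs.CV")))
# ti:transformer AND cat:cs.CV
print(combine("ANDNOT", term("abs", "diffusion"), term("au", "Yann LeCun")))
# abs:diffusion ANDNOT au:"Yann LeCun"
```

The resulting strings go into `searchQueries` unchanged.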

Pricing examples

Run	Items	Cost
100 papers (metadata, no abstract)	100	~$0.15
100 papers with full abstracts	100	~$0.20
1,000 papers (metadata)	1,000	~$1.50
Weekly category monitor (cs.LG, 200 papers)	800/mo	~$1.20/mo

You only pay for papers successfully pushed to the dataset.
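
The table rows follow from the two per-1,000 rates in the pricing line of this README. The sketch below reproduces them; note that the store listing header quotes a different "from" price, so treat the constants as assumptions and confirm current pricing before budgeting.

```python
BASE_PER_1K = 1.50      # USD per 1,000 papers (metadata + PDF links)
ABSTRACT_PER_1K = 0.50  # extra USD per 1,000 papers with includeAbstract=true


def estimate_cost(items: int, include_abstract: bool = False) -> float:
    """Estimated run cost in USD for a given number of pushed papers."""
    rate = BASE_PER_1K + (ABSTRACT_PER_1K if include_abstract else 0.0)
    return round(items * rate / 1000, 4)


print(estimate_cost(100))        # 0.15
print(estimate_cost(100, True))  # 0.2
print(estimate_cost(1000))       # 1.5
print(estimate_cost(4 * 200))    # 1.2 (four weekly runs of 200 papers)
```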

FAQ

Do I need a proxy or API key? No. The arXiv API is fully public and requires no authentication. Zero proxy cost for buyers.

What formats can I export results in? JSON, CSV, JSONL, Excel — all via Apify dataset export. The PDF URL field lets you download source PDFs programmatically.

Can I monitor a category for new papers on a schedule? Yes. Use Apify Schedules + sortBy: submittedDate to fetch the latest papers daily or weekly. Combine with a webhook to push new papers into Slack, Notion, or a literature database.

What if the actor returns empty results? Check your searchQueries syntax: arXiv requires the correct field prefix format (e.g. ti:"attention", not title:attention). For ID lookups, verify the ID exists on arxiv.org. The OUTPUT record in the run's key-value store reports any failed queries with reason codes.

Edge cases & known limits

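Rate limit: arXiv asks for at least 3 seconds between API calls, and the actor respects this automatically. Large runs take time: at the default pageSize of 50, 1,000 papers means 20 API pages, so roughly a minute is spent on the delay alone.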
max_results cap: arXiv API caps single-page results at 2000. The actor paginates automatically for large queries.
No auth required: The API is publicly accessible with no token or proxy.
Versioned IDs: arXiv returns versioned IDs (e.g. 2501.05032v2). The arxiv_id field contains the version-qualified ID from the response.
LaTeX in abstracts: arXiv abstracts may contain LaTeX markup (e.g. $\textbf{...}$ ). The summary field returns raw text — no LaTeX rendering.
Not affiliated with arXiv: This actor uses the official public API. arXiv and Cornell University are not affiliated with this actor.
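
Because `arxiv_id` carries the version suffix, scheduled runs can return the same paper under several versions. The sketch below splits the suffix off and keeps the highest version per paper; `split_arxiv_id` and `latest_versions` are hypothetical helper names, and the rows are reduced to the one field used here.

```python
import re

# Base ID followed by an optional "v<number>" suffix at the end of the string.
_VERSION_RE = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_arxiv_id(arxiv_id: str):
    """Return (base_id, version); version is None when there is no suffix."""
    match = _VERSION_RE.match(arxiv_id)
    version = match.group("version")
    return match.group("base"), int(version) if version else None


def latest_versions(rows):
    """Keep one row per paper, preferring the highest version seen."""
    best = {}
    for row in rows:
        base, version = split_arxiv_id(row["arxiv_id"])
        rank = version or 0  # unversioned IDs rank below any explicit version
        if base not in best or rank > best[base][0]:
            best[base] = (rank, row)
    return [row for _, row in best.values()]


print(split_arxiv_id("2501.05032v2"))      # ('2501.05032', 2)
print(split_arxiv_id("hep-th/9901001v1"))  # ('hep-th/9901001', 1)
print(split_arxiv_id("1706.03762"))        # ('1706.03762', None)
```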

Why choose this actor?

parse_confidence on every record: immediately see if parsing succeeded. No silent failures.
Zero proxy cost: official API, no residential proxies needed — buyers pay nothing for egress.
All 3 modes: search + ID lookup + category browse in a single actor.
Full metadata: doi, comment, journal_ref, versioned PDF URLs — fields competitors skip.
Batch-friendly: handles multiple queries in one run with automatic pagination.

Competitor comparison

	This actor	scrapestorm/arxiv	Any REST client
Official Atom XML API	Yes	Unknown	Yes (DIY)
`parse_confidence`	Yes	No	No
Abstract toggle (save cost)	Yes	Always-on	No
Category + keyword + ID in one actor	Yes	Partial	Separate calls
Proxy needed	No	Unknown	No
Price	$1.50/1k	$9.99/1k	Free (DIY infra)

At $1.50/1k versus $9.99/1k, this actor costs about 6.7× less than scrapestorm for academic paper data.

AI training datasets

Pull all papers in cs.LG since 2020: abstract + title + category = a ready-made AI research corpus. Use includeAbstract=true for full abstract text, includeAbstract=false for fast metadata-only runs.
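
One way to turn dataset rows into corpus records is sketched below. The output shape (`id`, `category`, `text`) is an arbitrary choice for illustration, not something the actor produces, and the code assumes `summary` may be missing or null when abstracts are switched off.

```python
import json


def to_corpus_record(row: dict) -> dict:
    """Flatten one dataset row into an id/category/text record."""
    # Titles and abstracts can contain line breaks; collapse them to spaces.
    title = " ".join(row["title"].split())
    abstract = " ".join((row.get("summary") or "").split())
    return {
        "id": row["arxiv_id"],
        "category": row.get("primary_category"),
        "text": f"{title}\n\n{abstract}".strip(),
    }


def write_jsonl(rows, path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII characters as-is."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(to_corpus_record(row), ensure_ascii=False) + "\n")


# Made-up row for illustration only.
sample = {
    "arxiv_id": "2501.05032v2",
    "title": "A  sample\n title",
    "summary": "Some abstract.",
    "primary_category": "cs.CL",
}
print(to_corpus_record(sample))
```

LaTeX markup in abstracts is passed through untouched here; strip or render it separately if the training pipeline needs plain text.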

Use with AI agents (MCP)

Ask your AI agent something like "find 10 recent papers on RAG from 2024" and this actor returns structured paper metadata with abstract and PDF link, ready for downstream summarization. The output carries no citation counts, so ranking by citations needs a separate source.

Point your MCP client at this tool:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp.apify.com/?tools=bovi/arxiv-scraper",
        "--header",
        "Authorization: Bearer <YOUR_APIFY_TOKEN>"
      ]
    }
  }
}

Minimal call:

{ "searchQueries": ["large language models"], "maxItems": 20 }

Related actor: crossref-scraper

Need journal metadata, citation counts, or DOIs for published versions? Use our crossref-scraper alongside this actor to enrich arXiv papers with citation data.

Integrations

Built for ML researchers and R&D teams building literature datasets and monitoring new papers by keyword or category — the JSON/dataset output drops into the tools you already run, no glue code:

n8n / Make / Zapier: trigger a run or pipe every new dataset item into 500+ apps (Google Sheets, Airtable, Slack, HubSpot, your database) with no code.
Webhooks: fire your own endpoint the moment a run finishes, to push results straight into your pipeline.
MCP server: expose this actor as a tool to Claude, Cursor, or any MCP client so an AI agent can pull this data mid-conversation.
API & SDKs: fetch the dataset as JSON, CSV, or Excel through the Apify REST API or the Python / JS SDKs.

See all Apify integrations.
