Pricing

from $3.50 / 1,000 results

arXiv Paper Scraper — Abstracts, Authors & Metadata

Scrape research paper metadata from arXiv.org, the world's largest open-access repository. Search by keyword across computer science, physics, mathematics and biology. Returns titles, abstracts, authors, categories, PDF links and DOIs. No API key required.

Developer

Logiover

Last modified

6 days ago

arXiv Paper Scraper — Research Metadata, Abstracts & Author Data (No API Key)

Scrape research paper metadata from arXiv.org, the world's largest open-access preprint repository with 2.5M+ scholarly articles across computer science, physics, mathematics, quantitative biology, economics, statistics and more. Search by keyword, topic, author or category and get back structured records with titles, full abstracts, complete author lists, categories, submission dates, PDF links, DOIs and journal references — clean JSON, straight from the official arXiv API. Fast, no browser, no API key, no login.

🏆 Why this arXiv scraper?

15 fields per paper · up to 1,000 papers per query, unlimited queries · direct HTTP against the official arXiv Atom API (no browser, no key) · datacenter-proxy friendly · export to JSON / CSV / Excel. The turnkey way to build a research-paper dataset for literature reviews, AI/tech intelligence, talent sourcing and NLP training data.

✨ What this Actor does / Key features

📚 Full-text abstracts — the complete abstract for every paper, not just a snippet — ready for embeddings, summarization or topic modeling.
👤 Complete author lists — every co-author name, perfect for talent sourcing and citation-graph work.
🏷️ Category filtering — narrow to arXiv taxonomy codes like cs.AI, cs.CL, cs.CV, stat.ML, physics.optics or q-bio.GN, or search across all fields.
🔎 Multi-query search — pass many search terms in one run; each paper is tagged with the searchQuery that found it.
🗓️ Date filtering & sorting — keep only papers submitted after a dateFrom cutoff; sort by relevance, last-updated or submission date.
🔗 Direct links — abstract-page URL and direct PDF download URL for every result.
🧾 Publication metadata — doi, journalRef and author comment (accepted-venue notes, code links, page counts) when the authors provide them.
⚡ Official API, no browser — queries arXiv's public Atom/OAI-compatible API at export.arxiv.org; no headless browser, no HTML parsing, no scraping tricks.
🛡️ Proxy support — Apify Proxy integration; arXiv is proxy-friendly and datacenter proxies work fine.
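
For orientation, a request against arXiv's public query API can be sketched as follows. This is not the Actor's source code: the helper name `build_arxiv_query_url` and the way keywords and categories are combined are assumptions made for illustration. The endpoint and the `search_query`, `start`, `max_results`, `sortBy` and `sortOrder` parameters are the ones described in arXiv's API user manual.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(keywords, categories=(), start=0, max_results=100,
                          sort_by="relevance"):
    """Build a query URL for the arXiv Atom API (illustrative helper)."""
    # The `all:` prefix searches every metadata field; quotes keep the phrase together.
    query = f'all:"{keywords}"'
    if categories:
        # Assumption: a paper may match any one of the requested categories.
        cats = " OR ".join(f"cat:{code}" for code in categories)
        query = f"{query} AND ({cats})"
    params = {
        "search_query": query,
        "start": start,              # zero-based offset of the first result
        "max_results": max_results,  # page size for this request
        "sortBy": sort_by,           # relevance, lastUpdatedDate or submittedDate
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_query_url("diffusion models", ["cs.CV", "cs.LG"],
                            max_results=50, sort_by="submittedDate")
```

The response to such a request is an Atom XML feed, which is what the Actor maps to the JSON fields listed under Output.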

🚀 Quick start (3 steps)

Configure — add one or more searchQueries (keywords, a topic, or an author name). Optionally add categories, a dateFrom cutoff and a sortBy order.
Run — click Start. The Actor pages through the arXiv API for every query and streams papers into your dataset.
Get your data — open the Output tab and export to JSON, CSV, Excel or XML, or pull it via the Apify API.
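
The same three steps can also be scripted. Below is a minimal sketch, assuming the `apify-client` Python package; `fetch_papers` is a hypothetical helper, and the Actor ID and token are placeholders to copy from your own Apify console.

```python
def fetch_papers(client, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    # `call` blocks until the run ends and returns the run details.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return any run details")
    # Every run writes its results to a default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   papers = fetch_papers(client, "<username>/<actor-name>",
#                         {"searchQueries": ["quantum computing"], "maxResults": 50})
```

Passing the client in as an argument keeps the helper easy to test with a stub and leaves token handling to the caller.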

📥 Input

Give the Actor at least one searchQueries value. Everything else is optional.

Example — recent AI & NLP preprints (dataset building)

{
  "searchQueries": ["large language models", "retrieval augmented generation"],
  "categories": ["cs.AI", "cs.CL", "cs.LG"],
  "sortBy": "submittedDate",
  "dateFrom": "2026-01-01",
  "maxResults": 500
}

Example — track one author's output (talent sourcing)

{
  "searchQueries": ["Yann LeCun"],
  "sortBy": "submittedDate",
  "maxResults": 200
}

Example — broad topic sweep across fields

{
  "searchQueries": ["quantum computing", "diffusion models", "protein folding"],
  "sortBy": "relevance",
  "maxResults": 300,
  "proxyConfiguration": { "useApifyProxy": true }
}

| Field | Type | Description |
| --- | --- | --- |
| `searchQueries` | array | Search terms: keywords, topics, paper titles or author names. Each term is queried independently. Required. |
| `categories` | array | arXiv category codes to filter by (e.g. `cs.AI`, `cs.CL`, `cs.CV`, `stat.ML`, `physics.optics`, `q-bio.GN`). Leave empty to search all categories. |
| `maxResults` | integer | Maximum total papers to return across all queries (1–1000). Default 200. |
| `sortBy` | string | `relevance`, `lastUpdatedDate` or `submittedDate`. Default `relevance`. |
| `dateFrom` | string | Keep only papers submitted on/after this date (`YYYY-MM-DD`). Empty = no date filter. |
| `proxyConfiguration` | object | Apify Proxy settings. arXiv is proxy-friendly; datacenter proxies work fine. |

Tip — category codes: browse arxiv.org/category_taxonomy to find the exact code for your field. Combining a focused searchQueries term with one or two categories gives the cleanest, most on-topic dataset.
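
If you generate inputs from code, it can help to check them against the constraints in the table before starting a run. The sketch below is a hypothetical client-side helper (`build_input` is not part of the Actor); it only encodes the rules stated above.

```python
from datetime import date

SORT_OPTIONS = {"relevance", "lastUpdatedDate", "submittedDate"}


def build_input(search_queries, categories=None, max_results=200,
                sort_by="relevance", date_from=None):
    """Return an Actor input dict, raising ValueError on out-of-range values."""
    if not search_queries:
        raise ValueError("searchQueries needs at least one term")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    if sort_by not in SORT_OPTIONS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_OPTIONS)}")
    run_input = {
        "searchQueries": list(search_queries),
        "maxResults": max_results,
        "sortBy": sort_by,
    }
    if categories:
        run_input["categories"] = list(categories)
    if date_from:
        # Round-trip through date to accept only a real calendar date in YYYY-MM-DD form.
        if date.fromisoformat(date_from).isoformat() != date_from:
            raise ValueError("dateFrom must use the YYYY-MM-DD format")
        run_input["dateFrom"] = date_from
    return run_input


config = build_input(["large language models"], categories=["cs.CL"],
                     max_results=500, sort_by="submittedDate",
                     date_from="2026-01-01")
```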

📤 Output

One row per paper — 15 fields, exportable to JSON, CSV, Excel or XML. Here is a trimmed sample record:

{
  "arxivId": "2401.12345",
  "title": "Scaling Laws for Retrieval-Augmented Language Models",
  "authors": "Jane Doe, John Smith, Alice Zhang",
  "abstract": "We study how retrieval augmentation changes the scaling behavior of large language models across compute, data and parameter budgets. We find that…",
  "categories": "cs.CL, cs.AI, cs.LG",
  "primaryCategory": "cs.CL",
  "publishedDate": "2024-01-22T18:00:04Z",
  "updatedDate": "2024-02-15T09:31:12Z",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345",
  "arxivUrl": "https://arxiv.org/abs/2401.12345",
  "comment": "18 pages, 7 figures. Accepted at ICML 2024. Code: github.com/example/repo",
  "journalRef": "Proceedings of ICML 2024",
  "doi": "10.48550/arXiv.2401.12345",
  "searchQuery": "large language models",
  "scrapedAt": "2026-07-06T12:00:00.000Z"
}
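
In the sample above, `authors` and `categories` are comma-separated strings and the dates are ISO 8601 timestamps. A small post-processing step, sketched here with a hypothetical `normalize` helper, turns them into lists and `datetime` objects. It assumes every record follows the sample's layout, so an author name that itself contains a comma would be split incorrectly.

```python
from datetime import datetime


def normalize(record):
    """Return a copy of a paper record with list and datetime fields."""
    def split(value):
        # "a, b, c" -> ["a", "b", "c"]; empty or missing values become [].
        return [part.strip() for part in (value or "").split(",") if part.strip()]

    def parse_ts(value):
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))

    return {
        **record,
        "authors": split(record.get("authors")),
        "categories": split(record.get("categories")),
        "publishedDate": parse_ts(record["publishedDate"]),
        "updatedDate": parse_ts(record["updatedDate"]),
    }


paper = normalize({
    "arxivId": "2401.12345",
    "authors": "Jane Doe, John Smith, Alice Zhang",
    "categories": "cs.CL, cs.AI, cs.LG",
    "publishedDate": "2024-01-22T18:00:04Z",
    "updatedDate": "2024-02-15T09:31:12Z",
})
```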

💡 Use cases

Literature reviews — pull every paper matching your topic and categories, filter by date, and export structured metadata instead of copy-pasting from the website.
AI & tech intelligence — track emerging methods and competitor research output by monitoring the categories that matter to your team.
Talent sourcing — search by domain, then mine the authors field to find researchers publishing in your area — every author is a potential hire.
NLP / ML dataset building — assemble titles and abstracts for text classification, topic modeling, retrieval benchmarks or citation-graph construction.
Trend & recency monitoring — sort by submittedDate with a rolling dateFrom to keep only fresh preprints and detect new work as it lands.
Bibliometrics — join doi / journalRef with other sources to enrich a scholarly-publications database.
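
As an illustration of the talent-sourcing case, the sketch below counts how many papers in a dataset each author name appears on. `top_authors` is a hypothetical helper that assumes the comma-separated `authors` format from the sample record. It matches on the name string only, so different people who share a name are counted together.

```python
from collections import Counter


def top_authors(records, n=10):
    """Return the n most frequent author names as (name, paper_count) pairs."""
    counts = Counter()
    for record in records:
        # A set ensures each name is counted once per paper.
        names = {name.strip()
                 for name in record.get("authors", "").split(",")
                 if name.strip()}
        counts.update(names)
    return counts.most_common(n)


ranking = top_authors([
    {"authors": "Jane Doe, John Smith"},
    {"authors": "Jane Doe, Alice Zhang"},
    {"authors": "Jane Doe"},
], n=1)
```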

👥 Who uses it

Researchers & PhD students running systematic literature reviews · VC and corporate-strategy teams tracking emerging tech · recruiters and engineering leaders sourcing research talent · ML/NLP teams building training and evaluation datasets · data journalists and bibliometrics analysts mapping the research landscape.

💰 Pricing

This Actor runs on a simple pay-per-result model — you pay for the papers you extract, with no separate Apify platform fees to calculate. The arXiv API itself is free. Try it on the free tier first, then scale up. See the Pricing tab on this page for the current rate.

❓ Frequently Asked Questions

Is it legal to scrape arXiv? This Actor uses arXiv's own public, officially supported API and collects only openly available metadata. arXiv explicitly provides this API for programmatic access. You are responsible for respecting arXiv's API terms of use (polite request rates) and using the data lawfully. The Actor includes built-in delays to stay within arXiv's polite-use guidance.

Does arXiv have a public API? Is this an API alternative? arXiv does offer a free public API, and this Actor uses it directly (the Atom/OAI-PMH-compatible endpoint at export.arxiv.org). Think of this Actor as a ready-to-run wrapper around that API: it handles pagination, multi-query batching, date filtering, retries and clean field mapping, then hands you export-ready JSON/CSV — so you don't have to write and maintain the client yourself.

Do I need an API key or a login? No. arXiv's API requires no API key, no authentication and no login — only an Apify account to run the Actor.

Can I scrape arXiv without an API key or login? Yes. There is no key or account on the arXiv side. You supply your search queries, the Actor calls the public API over direct HTTP, and returns structured papers.

How much data can I get? Up to 1,000 papers per query, with no limit on the number of queries — so you can build datasets of tens of thousands of papers in a single run by combining many search terms. arXiv paginates large result sets; the Actor fetches pages sequentially with polite delays.
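
Sequential paging with a pause between requests can be sketched as below. The page size of 100 and the `fetch_page` callback are assumptions for illustration, not the Actor's actual settings. The 3-second default follows the delay that arXiv's API user manual suggests between consecutive calls.

```python
import time


def page_windows(total, page_size=100):
    """Yield (start, max_results) pairs that cover `total` results in order."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


def fetch_all(fetch_page, total, page_size=100, delay_seconds=3.0):
    """Call fetch_page(start, max_results) for each window, pausing between calls."""
    results = []
    for index, (start, size) in enumerate(page_windows(total, page_size)):
        if index:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.extend(fetch_page(start, size))
    return results


windows = list(page_windows(250, page_size=100))
```

Here `fetch_page` stands in for whatever performs the HTTP request and parses one page of results.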

Can I download the actual PDFs? The Actor returns the direct pdfUrl for every paper. You can download the PDFs separately from those links. Full-text extraction from the PDFs is not included in this Actor.

Can I search arXiv by author name? Yes. Put the author's name in searchQueries and the Actor returns every matching paper with its full co-author list, abstract, categories and PDF link.

How do I build a dataset of recent AI papers from arXiv?

Set searchQueries to your AI topics, add cs.AI / cs.LG / cs.CL to categories, set sortBy to submittedDate, and use dateFrom to keep only recent preprints. Then export the dataset as CSV or JSON.
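
For a recurring run, the `dateFrom` cutoff can be computed relative to the current date. A minimal sketch with a hypothetical `recent_ai_input` helper; the search terms and the seven-day window are example values only.

```python
from datetime import date, timedelta


def recent_ai_input(days_back=7, today=None, max_results=500):
    """Build an input that keeps only papers from the last `days_back` days."""
    today = today or date.today()
    return {
        "searchQueries": ["large language models", "retrieval augmented generation"],
        "categories": ["cs.AI", "cs.LG", "cs.CL"],
        "sortBy": "submittedDate",
        # dateFrom is inclusive and uses the YYYY-MM-DD format.
        "dateFrom": (today - timedelta(days=days_back)).isoformat(),
        "maxResults": max_results,
    }


weekly = recent_ai_input(days_back=7, today=date(2026, 7, 8))
```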

How do I export arXiv data to CSV or JSON?

Run the Actor, then export the resulting dataset as CSV, JSON, Excel or XML from the Apify console, or pull it programmatically via the Apify API.
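
For the programmatic route, the Apify API serves dataset items in several formats from a single endpoint. The sketch below only builds the download URL; `dataset_export_url` is a hypothetical helper and the dataset ID is a placeholder taken from your run. Non-public datasets also need your API token, for example in an `Authorization: Bearer` header.

```python
from urllib.parse import urlencode


def dataset_export_url(dataset_id, fmt="csv", fields=None):
    """Return the Apify API URL that exports a dataset in the given format."""
    params = {"format": fmt}  # e.g. json, csv, xlsx or xml
    if fields:
        # Restrict the export to the listed fields.
        params["fields"] = ",".join(fields)
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


csv_url = dataset_export_url("<DATASET_ID>", fmt="csv",
                             fields=["arxivId", "title", "authors"])
```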

What's the difference between `publishedDate` and `updatedDate`?

publishedDate is the original v1 submission date; updatedDate reflects the most recent revision. Sort by lastUpdatedDate to catch papers that were recently revised with new results.
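
To separate revised papers from first versions in an exported dataset, the two timestamps can be compared directly. A minimal sketch, assuming both fields use the ISO 8601 format shown in the sample record; `was_revised` is a hypothetical helper.

```python
from datetime import datetime


def was_revised(record):
    """Return True when a paper was updated after its original submission."""
    def parse_ts(value):
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))

    return parse_ts(record["updatedDate"]) > parse_ts(record["publishedDate"])


revised = was_revised({
    "publishedDate": "2024-01-22T18:00:04Z",
    "updatedDate": "2024-02-15T09:31:12Z",
})
```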

Why is `doi` or `journalRef` sometimes empty?

Preprints often haven't been formally published yet, so they carry no DOI or journal reference. These fields populate once (and if) the authors add that metadata on arXiv.

🔗 More research & AI-intelligence scrapers by logiover

Building a research or competitive-intelligence pipeline? Pair arXiv with the rest of the AI-research suite:

| Actor | What it does |
| --- | --- |
| Semantic Scholar Research Scraper | Peer-reviewed papers, citations & influence metrics |
| Hugging Face Hub Intelligence | Models, datasets & spaces metadata |
| GitHub Repository Scraper | Repo metadata, stars, topics & activity |
| GitHub Activity Stream | Real-time commits, releases & events |
| npm Package Intelligence | Package metadata, downloads & dependencies |
| Company Deep Research Scraper | Full company dossier: tech stack, socials, contacts |
| AI Deep Research | Autonomous multi-source research agent |
| AI Web Search | Structured web search results for agents |
| News Intelligence Scraper | Multi-source, deduplicated, sentiment-scored news |
| Discussion Intelligence Scraper | Reddit + HN + Product Hunt + Stack Exchange opinion |
| CVE Security Advisory Monitor | Fresh CVEs & security advisories |
| AI Citation Source Finder | Find citable sources for AI-generated claims |

👉 Browse all logiover scrapers on Apify Store — 180+ actors across real estate, jobs, crypto, social media & B2B data.

⏰ Scheduling & integration

Schedule this Actor on Apify to track new preprints in your field daily or weekly. Export results to JSON, CSV or Excel, sync to Google Sheets, or push to your database, BI tools and webhooks through the Apify API. Connect it to Make, n8n or Zapier to build automated research-monitoring and alerting pipelines — or wrap it in an MCP server so AI agents can pull fresh papers into their context on demand.

⭐ Support & feedback

Found a bug or need an extra field? Open an issue on the Issues tab — response is usually fast. If this Actor saves you time, a ★★★★★ review on the Store page genuinely helps and is hugely appreciated. 🙏

⚖️ Legal

This Actor extracts only publicly available metadata via arXiv's officially supported API, and is intended for legitimate research, analytics and dataset-building use. You are responsible for complying with arXiv's API terms of use, respecting polite request rates, and following any applicable local laws.

📝 Changelog

2026-07-06

✨ README overhaul: keyword-rich hero, full 15-field output reference with a realistic sample, three ready-to-run example scenarios, high-intent FAQ, and cross-links to the wider AI-research scraper suite.

2026-07-01

Maintenance pass: re-verified end-to-end on live data and confirmed successful runs within the 5-minute quality window on the default input.
Sharpened Store metadata (SEO title & description) and expanded the FAQ with high-intent, long-tail questions for easier discovery in Google and Apify Store search.
Added ready-to-run example tasks that cover common real-world use cases.
