arXiv Paper Scraper: Abstracts, Authors & Metadata

Pricing: from $3.50 / 1,000 results

Scrape research paper metadata from arXiv.org, the world's largest open-access repository. Search by keyword across computer science, physics, mathematics, and biology. Returns titles, abstracts, authors, categories, PDF links, and DOIs. No API key required.

Rating: 0.0 (0)

Developer: Logiover (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 3 days ago

📄 arXiv Paper Scraper: Research Metadata, Abstracts & Author Data

Scrape research paper metadata from arXiv.org, the world's largest open-access research repository with over 2.5 million scholarly articles across physics, computer science, mathematics, biology, economics, and more. This actor queries the public arXiv API (no API key required) and returns structured paper data including titles, abstracts, authors, categories, publication dates, PDF links and DOIs.

🎓 Why arXiv Data Is Valuable

arXiv is the primary preprint server for cutting-edge research. Many of the major AI breakthroughs, from transformers to diffusion models, appeared on arXiv first. Researchers, companies, universities, VCs, and journalists track arXiv to stay ahead of scientific developments.

What you can extract per paper:

  • arXiv ID and direct links to abstract page and PDF
  • Title and abstract (full text)
  • Author list with all co-authors
  • Categories (e.g., cs.AI, cs.CL, stat.ML) and primary category
  • Publication date (original submission) and last updated date
  • DOI (Digital Object Identifier) and journal reference if published
  • Author comments (implementation notes, accepted venues, code links)

📊 Output Fields

| Field | Description |
| --- | --- |
| arxivId | Unique arXiv paper identifier (e.g., 2401.12345) |
| title | Paper title |
| authors | Comma-separated author names |
| abstract | Full paper abstract text |
| categories | All arXiv category codes |
| primaryCategory | Primary category |
| publishedDate | Original submission date |
| updatedDate | Last update date |
| pdfUrl | Direct PDF download link |
| arxivUrl | Abstract page URL |
| comment | Author comments |
| journalRef | Journal reference |
| doi | Digital Object Identifier |
| searchQuery | The query that found this paper |
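
As a rough illustration of consuming these fields, the sketch below turns one dataset item into a small typed record. The field names come from the table above; the value types, the `Paper` class, the helper names, and the sample item are all assumptions made for the example, not part of the actor.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Paper:
    """A trimmed-down record; attribute names are hypothetical."""
    arxiv_id: str
    title: str
    authors: list[str]
    primary_category: str
    pdf_url: str


def split_authors(authors: str) -> list[str]:
    """Split the comma-separated `authors` string into clean names."""
    return [name.strip() for name in authors.split(",") if name.strip()]


def to_paper(item: dict[str, Any]) -> Paper:
    """Convert a raw item (dict) into a Paper; missing keys become empty values."""
    return Paper(
        arxiv_id=item.get("arxivId", ""),
        title=" ".join(item.get("title", "").split()),  # collapse line breaks
        authors=split_authors(item.get("authors", "")),
        primary_category=item.get("primaryCategory", ""),
        pdf_url=item.get("pdfUrl", ""),
    )


# Invented sample item, for illustration only.
sample = {
    "arxivId": "2401.12345",
    "title": "An Example\n  Paper Title",
    "authors": "Ada Lovelace, Alan Turing",
    "primaryCategory": "cs.AI",
    "pdfUrl": "https://arxiv.org/pdf/2401.12345",
}
paper = to_paper(sample)
print(paper.authors)
```

Splitting on commas is only safe if names themselves contain no commas (a suffix such as ", Jr." would be split off), so treat the result as a best effort.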

⚙️ Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchQueries | array | ["machine learning"] | Search terms: paper titles, author names, keywords |
| categories | array | [] | arXiv category filters (cs.AI, stat.ML, etc.) |
| maxResults | integer | 200 | Max papers to return per query (up to 1000) |
| sortBy | enum | relevance | relevance / lastUpdatedDate / submittedDate |
| dateFrom | string | (none) | Only return papers after this date (YYYY-MM-DD) |
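
One way to assemble and sanity-check an input object before starting a run is sketched below. The parameter names, defaults, and the 1000 cap are taken from the table; `build_input` itself is a hypothetical helper, and the lower bound of 1 for `maxResults` is an assumption.

```python
import json
import re
from datetime import date

ALLOWED_SORT = ("relevance", "lastUpdatedDate", "submittedDate")


def build_input(search_queries=("machine learning",), categories=(),
                max_results=200, sort_by="relevance", date_from=None):
    """Return an input dict, raising ValueError for out-of-range values."""
    if not 1 <= max_results <= 1000:  # lower bound of 1 is an assumption
        raise ValueError("maxResults must be between 1 and 1000")
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {ALLOWED_SORT}")
    run_input = {
        "searchQueries": list(search_queries),
        "categories": list(categories),
        "maxResults": max_results,
        "sortBy": sort_by,
    }
    if date_from is not None:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_from):
            raise ValueError("dateFrom must look like YYYY-MM-DD")
        date.fromisoformat(date_from)  # also rejects impossible dates
        run_input["dateFrom"] = date_from
    return run_input


print(json.dumps(
    build_input(["graph neural networks"], categories=["cs.LG"],
                sort_by="submittedDate", date_from="2024-01-01"),
    indent=2,
))
```

The printed JSON can be pasted into the actor's input editor; `dateFrom` is omitted entirely when no date is given, matching the table's empty default.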

🎯 Use Cases

AI & Tech Industry Intelligence

Venture capital firms and corporate strategy teams scrape arXiv daily to identify emerging technologies, track competitor research output, and discover promising startups before they raise funding. The people publishing breakthrough papers today are founding the unicorns of tomorrow.

Academic Literature Reviews

PhD students and researchers search across thousands of papers, filter by date and category, and export structured metadata for systematic literature reviews. No more manual copy-pasting from the arXiv website.

Recruitment & Talent Sourcing

Recruiters and engineering leaders search arXiv for authors publishing in specific domains. Every paper author is a potential candidate: arXiv gives you their name, research area, and publication track record.

Dataset Building for NLP/ML

Machine learning teams build training datasets from arXiv abstracts and titles. The clean XML API response makes it ideal for text classification, topic modeling, and citation graph construction.

Competitive Research Monitoring

Companies monitor arXiv categories relevant to their industry and get alerts when competitors or key researchers publish new work. Stay ahead of your competitors' R&D pipeline.

💰 Pricing

Pay per event: you are charged per API request to arXiv. Each API request returns up to 100 papers, so a query with 200 results takes two requests. A run with 5 search queries and 200 results each costs approximately $0.05 to $0.15 in compute units. The arXiv API itself is free; it asks only for a polite delay between requests.
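
The request count behind that example can be worked out with a few lines of arithmetic. This is a sketch that assumes every query actually returns its full `maxResults`, that requests are paged 100 papers at a time, and that the pause between requests is 3 seconds (the actor's real delay is not stated here); the function names are hypothetical.

```python
import math

PAGE_SIZE = 100  # papers per API request, per the pricing note above


def requests_needed(num_queries, max_results, page_size=PAGE_SIZE):
    """Total API requests if every query returns max_results papers."""
    return num_queries * math.ceil(max_results / page_size)


def min_delay_seconds(total_requests, delay=3.0):
    """Time spent waiting: n requests need n - 1 pauses between them."""
    return max(total_requests - 1, 0) * delay


total = requests_needed(num_queries=5, max_results=200)
print(total)                     # 10
print(min_delay_seconds(total))  # 27.0
```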

🚀 Tips

  • Use specific categories (e.g., cs.CL for NLP, cs.CV for computer vision) for more targeted results
  • Combine queries: run 5 to 10 related queries to build a comprehensive dataset
  • Filter by date: use dateFrom to only get papers from the last month or year
  • Sort by lastUpdatedDate to find recently revised papers with new results
  • arXiv rate limit: the API asks for polite delays (this actor includes built-in delays)

❓ FAQ

Q: Is this the official arXiv API? A: Yes. This actor uses arXiv's public API at export.arxiv.org. No API key and no authentication are required, and there is no rate limiting beyond polite use.
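
For readers curious what a direct call to that API looks like, here is a minimal sketch using only the Python standard library. It is not the actor's code: the function names are hypothetical, the mapping onto the actor's field names is a guess, and the sample feed is invented. It parses an embedded sample so it runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"
API_URL = "http://export.arxiv.org/api/query"


def build_url(query, start=0, max_results=100, sort_by="relevance"):
    """Build a query URL that searches all fields for `query`."""
    params = {
        "search_query": f"all:{query}",
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": "descending",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Turn an Atom response (str or bytes) into a list of dicts."""
    papers = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        abs_url = entry.findtext(ATOM + "id", default="")
        primary = entry.find(ARXIV + "primary_category")
        papers.append({
            "arxivUrl": abs_url,
            # Keeps the version suffix (e.g. "v1") if the feed includes it.
            "arxivId": abs_url.rsplit("/abs/", 1)[-1],
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "authors": ", ".join(
                a.findtext(ATOM + "name", default="")
                for a in entry.iter(ATOM + "author")
            ),
            "categories": [c.get("term") for c in entry.iter(ATOM + "category")],
            "primaryCategory": primary.get("term") if primary is not None else None,
            "publishedDate": entry.findtext(ATOM + "published"),
            "doi": entry.findtext(ARXIV + "doi"),  # None when absent
        })
    return papers


# Invented sample response, shaped like an arXiv Atom feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v1</id>
    <published>2024-01-22T18:00:00Z</published>
    <title>An Invented
      Example Title</title>
    <author><name>Ada Lovelace</name></author>
    <author><name>Alan Turing</name></author>
    <arxiv:primary_category term="cs.AI"/>
    <category term="cs.AI"/>
    <category term="stat.ML"/>
  </entry>
</feed>"""

print(build_url("machine learning"))
print(parse_feed(SAMPLE_FEED))
```

To query the live API, fetch `build_url(...)` with `urllib.request.urlopen` and pass the response bytes to `parse_feed`, pausing between requests.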

Q: Can I download the actual PDFs? A: The actor provides PDF URLs. You can download PDFs separately. Full-text extraction from PDFs is not included.
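
Downloading the PDFs afterwards could look roughly like the sketch below. It assumes each item carries the `arxivId` and `pdfUrl` fields from the output table; the helper names, the output directory, and the 3-second pause are illustrative choices, not something the actor provides.

```python
import time
import urllib.request
from pathlib import Path


def pdf_filename(arxiv_id):
    """Map an arXiv ID to a safe file name (old-style IDs contain a slash)."""
    return arxiv_id.replace("/", "_") + ".pdf"


def download_pdfs(items, out_dir="pdfs", delay=3.0):
    """Fetch each item's pdfUrl one at a time, pausing between downloads."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, item in enumerate(items):
        if index:  # no pause before the first download
            time.sleep(delay)
        target = out / pdf_filename(item["arxivId"])
        with urllib.request.urlopen(item["pdfUrl"], timeout=60) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved


print(pdf_filename("hep-th/9901001"))
```

Calling `download_pdfs(items)` with the exported dataset items performs the actual network requests; nothing is downloaded until then.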

Q: How many papers can I get in one run? A: Up to 1000 papers per query, with no limit on the number of queries.

Keywords: arxiv scraper, research paper api, academic paper data, arxiv metadata extractor, scientific paper scraper, arxiv abstract api, machine learning papers dataset, arxiv search tool, research literature mining, preprint server data, arxiv paper downloader, scholarly article scraper, cs papers data, arxiv api without key, academic research tool

How do I build a dataset of recent AI papers from arXiv?

Set searchQueries to your AI topics, add cs.AI or cs.LG categories, sort by submittedDate and use dateFrom to keep only recent preprints in your export.
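
That recipe can be expressed as a small input builder. This is a sketch: `recent_ai_input` is a hypothetical helper, and the 30-day window is just one reading of "recent".

```python
from datetime import date, timedelta


def recent_ai_input(topics, days=30, today=None):
    """Input for AI papers submitted within the last `days` days."""
    today = today or date.today()
    return {
        "searchQueries": list(topics),
        "categories": ["cs.AI", "cs.LG"],
        "sortBy": "submittedDate",
        "dateFrom": (today - timedelta(days=days)).isoformat(),
    }


# A fixed "today" keeps the example reproducible.
example = recent_ai_input(["diffusion models"], today=date(2024, 3, 31))
print(example["dateFrom"])  # 2024-03-01
```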

Can I search arXiv by author name?

Yes. Put the author name in searchQueries and the scraper returns every matching paper with the full co-author list, abstract and PDF link.
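
If a name query also pulls in papers that mention the name elsewhere (an assumption about how matching works, not something the listing states), the results can be narrowed after the run. The sketch below keeps only items whose comma-separated `authors` field contains the exact name; `papers_by_author` and the sample items are invented for the example.

```python
def papers_by_author(items, author):
    """Keep items whose `authors` field lists `author` (case-insensitive)."""
    wanted = author.strip().casefold()
    return [
        item for item in items
        if wanted in (
            name.strip().casefold()
            for name in item.get("authors", "").split(",")
        )
    ]


# Invented items, for illustration only.
items = [
    {"arxivId": "2401.00001", "authors": "Ada Lovelace, Alan Turing"},
    {"arxivId": "2401.00002", "authors": "Grace Hopper"},
]
print([item["arxivId"] for item in papers_by_author(items, "alan turing")])
```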

📝 Changelog

2026-07-01

  • Maintenance pass: re-verified end-to-end on live data and confirmed successful runs within the 5-minute quality window on the default input.
  • Sharpened Store metadata (SEO title & description) and expanded the FAQ with high-intent, long-tail questions for easier discovery in Google and Apify Store search.
  • Added ready-to-run example tasks that cover common real-world use cases.