arXiv Paper & Author Scraper

Under maintenance

Developer: Automly (community-maintained)

Extract academic papers, abstracts, and author details from arXiv using the official API. This actor is perfect for research monitoring, systematic literature reviews, building academic datasets, and feeding RAG pipelines with the latest scientific publications.

Why use this actor?

  • Official API reliability — Uses the arXiv export API for stable, structured data without scraping complexity.
  • Research monitoring — Track new papers in specific fields or by keyword.
  • Literature reviews — Collect abstracts, authors, and categories for systematic analysis.
  • Academic lead generation — Build lists of researchers and their affiliations by topic.
  • RAG & AI pipelines — Feed paper abstracts and metadata into vector databases for semantic search.

Features

  • Search papers by free-text query or arXiv category codes
  • Filter by date range (last week, last month, last year, or custom range)
  • Sort by relevance, submission date, or last updated date
  • Extract full abstracts and author lists, with affiliations where the submitter provided them
  • Output authors as separate records for easy analysis
  • Respects arXiv's polite usage policy with built-in rate limiting
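
As a rough illustration of how the search, category, and sort options could translate into a request to the arXiv export API, the sketch below builds a query URL. The endpoint and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters belong to the public arXiv API. The function name `build_query_url` and the way free text is combined with categories are assumptions for illustration, not the actor's actual implementation.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(search_query="", categories=(), sort_by="relevance",
                    sort_order="descending", start=0, max_results=100):
    """Build an arXiv API URL (hypothetical mapping of the actor's inputs)."""
    clauses = []
    if search_query:
        if ":" in search_query:
            # Already uses a field prefix such as cat:cs.AI, so pass it through.
            clauses.append(f"({search_query})")
        else:
            # Plain text is searched as a phrase across all fields.
            clauses.append(f'all:"{search_query}"')
    if categories:
        cats = " OR ".join(f"cat:{code}" for code in categories)
        clauses.append(f"({cats})")
    params = {
        "search_query": " AND ".join(clauses),
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("large language models", ["cs.CL", "cs.AI"],
                      sort_by="submittedDate", max_results=50)
```

Matching a paper in any of the listed categories (OR) while also requiring the free-text query (AND) is one plausible reading of the options; the actor may combine them differently.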

Input

| Field | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | | arXiv search query, e.g. machine learning or cat:cs.AI |
| categories | array | | List of arXiv category codes, e.g. ["cs.AI", "cs.LG"] |
| dateRange | string | | lastWeek, lastMonth, lastYear, or YYYY-MM-DD TO YYYY-MM-DD |
| maxResults | integer | 100 | Maximum papers to return (1–500) |
| extractAuthors | boolean | true | Include author records as separate rows |
| extractAbstract | boolean | true | Include paper abstracts |
| sortBy | string | relevance | relevance, lastUpdatedDate, or submittedDate |
| sortOrder | string | descending | ascending or descending |
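
The dateRange field accepts either a preset or a custom range, so a consumer of this input has to normalise both forms. The sketch below shows one way to do that and to express the result as an arXiv `submittedDate:[... TO ...]` filter, which the arXiv API accepts with timestamps in `YYYYMMDDHHMM` form (GMT). The preset lengths of 7, 30, and 365 days and the helper names are assumptions; the actor's own definitions are not documented here.

```python
from datetime import date, timedelta

# Assumed lengths of the preset windows, in days.
PRESET_DAYS = {"lastWeek": 7, "lastMonth": 30, "lastYear": 365}


def resolve_date_range(value, today=None):
    """Return (start, end) dates for a preset or 'YYYY-MM-DD TO YYYY-MM-DD'."""
    today = today or date.today()
    if value in PRESET_DAYS:
        return today - timedelta(days=PRESET_DAYS[value]), today
    start_text, sep, end_text = value.partition(" TO ")
    if not sep:
        raise ValueError(f"Unrecognised dateRange: {value!r}")
    start = date.fromisoformat(start_text.strip())
    end = date.fromisoformat(end_text.strip())
    if start > end:
        raise ValueError("dateRange start must not be after its end")
    return start, end


def to_arxiv_filter(start, end):
    """Format the range as an arXiv submittedDate clause covering whole days."""
    return f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
```

For example, `to_arxiv_filter(*resolve_date_range("2024-01-01 TO 2024-01-31"))` yields a clause that could be joined to a search query with `AND`.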

Example input

{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.AI"],
  "dateRange": "lastMonth",
  "maxResults": 50,
  "extractAuthors": true,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
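
An input like this can also be submitted from code. The sketch below assumes the `apify-client` Python package and its `actor(...).call(run_input=...)` and `dataset(...).iterate_items()` methods. The helper `run_scraper` is hypothetical, and the actor ID is left as a placeholder because it is not stated on this page.

```python
EXAMPLE_INPUT = {
    "searchQuery": "large language models",
    "categories": ["cs.CL", "cs.AI"],
    "dateRange": "lastMonth",
    "maxResults": 50,
    "extractAuthors": True,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
}


def run_scraper(client, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run could not be started")
    # Results are stored in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (needs `pip install apify-client` and an Apify API token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   items = run_scraper(client, "<ACTOR_ID>", EXAMPLE_INPUT)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before spending credits on a real run.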

Output

Each record includes a type field to distinguish entities.

Paper

| Field | Type | Description |
|---|---|---|
| type | string | paper |
| arxivId | string | arXiv identifier |
| url | string | arXiv abstract page URL |
| pdfUrl | string | Direct PDF URL |
| title | string | Paper title |
| abstract | string | Paper abstract |
| publishedAt | string | ISO 8601 submission date |
| updatedAt | string | ISO 8601 last update date |
| authors | array | List of {name, affiliation} objects |
| categories | array | arXiv category codes |
| primaryCategory | string | Primary arXiv category |

Author

| Field | Type | Description |
|---|---|---|
| type | string | author |
| arxivId | string | Associated paper identifier |
| paperTitle | string | Associated paper title |
| name | string | Author name |
| affiliation | string | Author affiliation |
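
Because paper and author records share one dataset, downstream code usually starts by separating them on the type field. The sketch below groups records by type and links author names back to papers through arxivId. The sample records are made up for illustration, and representing a missing affiliation as `None` is an assumption.

```python
from collections import defaultdict


def split_by_type(items):
    """Group dataset records by their `type` field."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("type", "unknown")].append(item)
    return dict(groups)


def authors_by_paper(items):
    """Map each arxivId to the author names found in the author records."""
    names = defaultdict(list)
    for item in items:
        if item.get("type") == "author":
            names[item["arxivId"]].append(item["name"])
    return dict(names)


# Made-up sample records that follow the field tables above.
SAMPLE = [
    {"type": "paper", "arxivId": "0000.00001", "title": "Example paper"},
    {"type": "author", "arxivId": "0000.00001", "paperTitle": "Example paper",
     "name": "A. Author", "affiliation": "Example University"},
    {"type": "author", "arxivId": "0000.00001", "paperTitle": "Example paper",
     "name": "B. Author", "affiliation": None},
]

groups = split_by_type(SAMPLE)
```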

Limits and caveats

  • The actor fetches results from the arXiv API in pages of up to 100 per request and paginates automatically.
  • A 3-second delay is enforced between requests to respect arXiv's polite usage policy.
  • Only publicly available papers are returned.
  • Author affiliations are only available when provided by the submitter.
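
The paging and delay behaviour described in the list above can be sketched as follows. The page size of 100 and the 3-second pause come from that list; the function names and the injected `fetch_page` callable are hypothetical, and this is not the actor's source code.

```python
import time

PAGE_SIZE = 100      # results per request, as stated above
DELAY_SECONDS = 3.0  # pause between requests, as stated above


def page_windows(max_results, page_size=PAGE_SIZE):
    """Yield (start, count) pairs that together cover max_results items."""
    for start in range(0, max_results, page_size):
        yield start, min(page_size, max_results - start)


def fetch_all(fetch_page, max_results, sleep=time.sleep):
    """Call fetch_page(start, count) per window, pausing between requests."""
    results = []
    for index, (start, count) in enumerate(page_windows(max_results)):
        if index:
            sleep(DELAY_SECONDS)  # no pause is needed before the first request
        page = fetch_page(start, count)
        results.extend(page)
        if len(page) < count:
            break  # fewer results than requested: nothing more to fetch
    return results
```

With these numbers, a run at the maximum of 500 papers needs five requests and therefore four pauses, so at least 12 seconds are spent waiting.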

Pricing

This actor uses Pay Per Event pricing. You are charged only for successfully extracted data.

| Event | Price | Description |
|---|---|---|
| Paper scraped | $0.003 | Each paper successfully extracted |
| Author scraped | $0.001 | Each author record successfully extracted |

Tiered discounts apply based on your Apify subscription level. A small actor-start fee may also apply.
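
For budgeting, the per-event prices above can be turned into a quick estimate. The sketch below ignores tiered discounts and any actor-start fee, since their amounts are not given here, and the author count in the example is a made-up figure.

```python
from decimal import Decimal

PAPER_PRICE = Decimal("0.003")   # USD per paper record
AUTHOR_PRICE = Decimal("0.001")  # USD per author record


def estimate_cost(papers, authors=0):
    """Estimate the event charges in USD, before discounts and start fees."""
    return papers * PAPER_PRICE + authors * AUTHOR_PRICE


# 50 papers with a hypothetical total of 150 author records:
print(estimate_cost(50, 150))  # 0.300
```

`Decimal` is used instead of floats so that sums of thousandths of a dollar stay exact.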

FAQ

Do I need an arXiv account? No. The arXiv API is completely open and requires no authentication.

Can I download the full PDF? The actor returns direct PDF URLs in the pdfUrl field. You can download them separately.
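
A minimal sketch of that separate download step, using only the Python standard library, is shown below. It assumes each record carries the arxivId and pdfUrl fields described in the output section; the helper names are hypothetical.

```python
import os
import urllib.request


def pdf_filename(arxiv_id):
    """Turn an arXiv ID into a safe file name (old-style IDs contain '/')."""
    return arxiv_id.replace("/", "_") + ".pdf"


def download_pdf(record, dest_dir=".", opener=urllib.request.urlopen):
    """Save the PDF behind record['pdfUrl'] and return the local path."""
    path = os.path.join(dest_dir, pdf_filename(record["arxivId"]))
    with opener(record["pdfUrl"]) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

The `opener` argument defaults to `urllib.request.urlopen` and can be swapped for a stub when testing without network access.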

What categories are available? arXiv uses codes like cs.AI (Artificial Intelligence), cs.LG (Machine Learning), cs.CL (Computation and Language), physics.gen-ph, math.ST, etc. See the full list at arxiv.org.

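How recent is the data? Data reflects the arXiv index at the time of the run. arXiv announces new submissions in daily batches after moderation, so a paper becomes available through the API once it has been announced rather than at the moment of submission.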