arXiv Research Paper Scraper

Pricing

from $4.99 / 1,000 results

Extract comprehensive research paper data from arXiv search results including titles, authors, abstracts, categories, and more.

Developer

Coding Frontned

Maintained by Community

Extract comprehensive research paper metadata from arXiv, the premier open-access preprint server for physics, mathematics, computer science, and more. 🎓📚

Features

  • Full paper metadata: arXiv ID, title, authors, abstract, categories, dates
  • PDF & abstract links: direct links to papers
  • Pagination: automatically iterates through pages to reach maxItems
  • Deduplication: no duplicate papers across pages
  • Flexible search: search by all fields, title, author, abstract, category, etc.
  • Sorting: sort by relevance, submission date, or last updated date
  • No anti-bot issues: arXiv is an open academic resource

Input Parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | (required) | Search query (e.g. "large language models", "quantum computing") |
| searchType | string | "all" | Search field: all, ti (title), au (author), abs (abstract), cat (category) |
| sortBy | string | "relevance" | Sort by: relevance, lastUpdatedDate, submittedDate |
| sortOrder | string | "descending" | Sort order: descending, ascending |
| maxItems | integer | 50 | Maximum number of papers to extract (1–1000) |
| proxyConfiguration | object | (none) | Apify proxy config |

Example INPUT.json

{
  "query": "large language models",
  "searchType": "all",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxItems": 50
}
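
To catch mistakes before starting a run, the constraints in the parameter table can be checked locally. The following is a minimal sketch, not part of the Actor itself: `normalize_input` is a hypothetical helper, and the allowed values and defaults are copied from the table above.

```python
from typing import Any

# Allowed values and defaults, copied from the Input Parameters table.
SEARCH_TYPES = {"all", "ti", "au", "abs", "cat"}
SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
SORT_ORDER = {"descending", "ascending"}
DEFAULTS = {
    "searchType": "all",
    "sortBy": "relevance",
    "sortOrder": "descending",
    "maxItems": 50,
}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: fill in defaults and reject invalid values."""
    query = str(raw.get("query", "")).strip()
    if not query:
        raise ValueError("query is required")

    # Defaults first, then user values, then the trimmed query.
    cfg = {**DEFAULTS, **raw, "query": query}

    if cfg["searchType"] not in SEARCH_TYPES:
        raise ValueError(f"unknown searchType: {cfg['searchType']!r}")
    if cfg["sortBy"] not in SORT_BY:
        raise ValueError(f"unknown sortBy: {cfg['sortBy']!r}")
    if cfg["sortOrder"] not in SORT_ORDER:
        raise ValueError(f"unknown sortOrder: {cfg['sortOrder']!r}")

    max_items = cfg["maxItems"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        raise ValueError("maxItems must be an integer")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    return cfg
```

Calling `normalize_input({"query": "quantum computing"})` returns a dictionary with every default filled in, which can then be saved as INPUT.json.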

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| position | integer | Rank in results (1-based) |
| arxivId | string | arXiv paper ID (e.g. 2401.12345) |
| title | string | Full paper title |
| authors | array | List of author names |
| abstract | string | Full paper abstract |
| primaryCategory | string | Primary subject category (e.g. cs.AI) |
| categories | array | All subject categories |
| submittedDate | string | Original submission date |
| updatedDate | string | Last updated date |
| abstractUrl | string | URL to the abstract page |
| pdfUrl | string | Direct link to the PDF |
| comments | string | Author comments (e.g. "20 pages, 5 figures") |
| journalRef | string | Journal reference if published |
| doi | string | DOI if available |
| reportNumber | string | Report number if available |
| searchQuery | string | Query used for this result |
| scrapedAt | string | ISO 8601 timestamp |

Example Output

{
  "position": 1,
  "arxivId": "2501.12345",
  "title": "Scaling Laws for Neural Language Models",
  "authors": ["Jared Kaplan", "Sam McCandlish"],
  "abstract": "We study empirical scaling laws for language model performance...",
  "primaryCategory": "cs.LG",
  "categories": ["cs.LG", "cs.CL", "stat.ML"],
  "submittedDate": "15 January, 2025",
  "updatedDate": null,
  "abstractUrl": "https://arxiv.org/abs/2501.12345",
  "pdfUrl": "https://arxiv.org/pdf/2501.12345",
  "comments": "35 pages, 14 figures",
  "journalRef": null,
  "doi": null,
  "searchQuery": "large language models",
  "scrapedAt": "2025-05-01T12:00:00.000Z"
}
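
Downstream code often wants typed records rather than raw dictionaries. The sketch below shows one way to load an output item; the `Paper` class and the `parse_date` and `to_paper` helpers are hypothetical names. The date format `"%d %B, %Y"` is an assumption inferred from the single example above ("15 January, 2025") and may not hold for every record.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class Paper:
    """Hypothetical typed view over a subset of the output fields."""
    arxiv_id: str
    title: str
    authors: list[str]
    primary_category: str
    submitted: Optional[date]
    updated: Optional[date]


def parse_date(value: Optional[str]) -> Optional[date]:
    """Parse dates shaped like "15 January, 2025"; pass null values through.

    Assumption: the format is inferred from the example output only.
    """
    if value is None:
        return None
    return datetime.strptime(value, "%d %B, %Y").date()


def to_paper(item: dict) -> Paper:
    """Convert one output item (a dict) into a Paper."""
    return Paper(
        arxiv_id=item["arxivId"],
        title=item["title"],
        authors=list(item["authors"]),
        primary_category=item["primaryCategory"],
        submitted=parse_date(item.get("submittedDate")),
        updated=parse_date(item.get("updatedDate")),
    )
```

Nullable fields such as `updatedDate` stay `None`, so callers can tell "never updated" apart from a real date.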

Pagination

arXiv returns 25 results per page. The scraper automatically navigates through pages using the start offset parameter until maxItems is reached or no more results are available.
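
The loop described above, combined with the deduplication listed under Features, can be sketched as follows. This is an illustration under stated assumptions, not the Actor's source: `paginate` and its `fetch_page` callback are hypothetical, and stopping on a short page is an added assumption.

```python
from typing import Callable, Iterator

PAGE_SIZE = 25  # results per page, as stated in the Pagination section


def paginate(
    fetch_page: Callable[[int], list[dict]],
    max_items: int,
) -> Iterator[dict]:
    """Yield up to max_items unique papers.

    fetch_page is a hypothetical callback that takes a start offset and
    returns the parsed papers for that page (an empty list when exhausted).
    """
    seen: set[str] = set()
    start = 0
    emitted = 0
    while emitted < max_items:
        page = fetch_page(start)
        if not page:
            return  # no more results available
        for paper in page:
            if paper["arxivId"] in seen:
                continue  # skip papers already seen on an earlier page
            seen.add(paper["arxivId"])
            emitted += 1
            yield paper
            if emitted >= max_items:
                return
        if len(page) < PAGE_SIZE:
            return  # assumption: a short page is the last page
        start += PAGE_SIZE
```

For a dry run, `fetch_page` can be a lambda that slices an in-memory list, for example `lambda start: data[start:start + PAGE_SIZE]`.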

Use Cases

  • Academic research monitoring: track new papers in your field
  • Trend analysis: identify emerging topics and research directions
  • Author profiling: collect all papers by specific authors
  • Citation database: build reference datasets for research tools
  • Competitive intelligence: monitor publications from research groups
  • AI/ML dataset creation: collect paper abstracts for NLP training

Notes

  • arXiv is a free, open-access resource; no authentication is needed
  • Results may vary slightly based on arXiv's real-time indexing
  • The abstract field contains the full abstract text
  • Use searchType: "au" to search by author name (e.g. "Hinton, Geoffrey")
  • Use searchType: "cat" with category codes like "cs.AI", "math.CO", "hep-th"