arXiv Pro Scraper - API & Full Text

Pricing

$10.00 / 1,000 papers

Developed by

Lukas

Maintained by Community

A professional, low-cost arXiv scraper that uses the official API to find papers, then downloads, cleans, and chunks the full PDF text, creating AI-ready datasets in one click.

Last modified

5 days ago

arXiv API & Full-Text Scraper (AI-Ready)

This Apify Actor provides a complete, AI-ready dataset from arXiv.org.

It uses the official arXiv API for fast and reliable metadata scraping, then downloads, cleans, and chunks the full text from each paper's PDF.

This tool is designed for AI/LLM developers, researchers, and data scientists who need high-quality text corpora for model training and Retrieval-Augmented Generation (RAG) pipelines.

Key Competitive Features

  • ⚡️ Blazing Fast & Low Cost: Uses the official arXiv API rather than a browser to find papers. This avoids page-rendering overhead, so it is typically much faster and cheaper than browser-based scrapers.
  • 🤖 AI-Ready Chunking: Automatically splits the clean text into overlapping chunks, perfect for ingestion into vector databases (RAG).
  • 🧹 Automatic Text Cleaning: Cleans the raw PDF text to remove headers, footers, page numbers, and bibliographies.
  • 🎯 Powerful Search: Scrape by keyword, category code (e.g., cs.AI), or both.
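
To make the keyword, category, or combined search concrete, here is a minimal Python sketch of how such a query could be expressed against the public arXiv API endpoint. It is not the Actor's actual code: the function name `build_query_url`, the `PAGE_SIZE` constant, and the mapping of a keyword to the `all:` prefix and a category to the `cat:` prefix are assumptions made for illustration.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
PAGE_SIZE = 50  # assumption: one "page" is 50 results


def build_query_url(search_query=None, category=None, page=0):
    """Build an arXiv API URL for a keyword, a category, or both."""
    terms = []
    if search_query:
        # all: searches every metadata field; quotes keep the phrase together
        terms.append(f'all:"{search_query}"')
    if category:
        # cat: restricts results to one arXiv category code
        terms.append(f"cat:{category}")
    if not terms:
        raise ValueError("Provide search_query, category, or both")
    params = {
        "search_query": " AND ".join(terms),
        "start": page * PAGE_SIZE,   # zero-based offset of the first result
        "max_results": PAGE_SIZE,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_query_url(search_query="quantum computing", category="cs.AI", page=2))
```

The API responds with an Atom feed, so no browser rendering is involved in finding papers.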

Input

The Actor accepts the following JSON input. At least one of searchQuery or category is required.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| searchQuery | String | (Optional) The keyword search query (e.g., "black hole physics"). | |
| category | String | (Optional) The arXiv category code (e.g., cs.AI for AI, gr-qc for General Relativity). | |
| maxPages | Number | The maximum number of result pages to scrape (50 results per page). | 1 |
| chunkSize | Number | (Optional) The target size (in characters) for each text chunk. | 1000 |
| chunkOverlap | Number | (Optional) The number of characters to overlap between chunks. | 200 |
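
To show how chunkSize and chunkOverlap interact, here is a minimal sketch of character-based chunking with overlap. It assumes plain fixed-size windows; the Actor's real splitter may behave differently (for example, by respecting sentence boundaries), and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 900]
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.
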

### Example Input (Keyword Search)
```json
{
  "searchQuery": "quantum computing",
  "maxPages": 3
}
```
### Example Input (AI-Ready Category Search)
```json
{
  "category": "cs.CL",
  "maxPages": 10,
  "chunkSize": 1500,
  "chunkOverlap": 300
}
```
### Example Output
```json
{
  "title": "Quantum black holes: inside and outside",
  "authors": "Bernard S. Kay",
  "abstract": "We review and add to a conjectured 'principle' according to which...",
  "url": "http://arxiv.org/abs/2510.20799v1",
  "cleanedFullText": "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which...\n\n[... rest of the cleaned paper text ...]\n\n",
  "textChunks": [
    "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which a necessary condition for a ... [900 more characters]",
    "necessary condition for a ... [800 more characters] ... The principle is intended to apply only to 'civilized' spacetimes. We shall recall a ... [and 200 more characters]",
    "[... etc ...]"
  ]
}
```
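
For a RAG pipeline, each output item usually needs to be flattened into one record per chunk before it is embedded. The sketch below shows one possible way to do that in Python using the title, authors, url, and textChunks fields from the example output. The function name `to_records` and the `#chunk-N` ID scheme are hypothetical, not part of the Actor.

```python
def to_records(item):
    """Turn one dataset item into one record per text chunk."""
    records = []
    for index, chunk in enumerate(item.get("textChunks", [])):
        records.append({
            # Hypothetical ID scheme: paper URL plus the chunk position
            "id": f'{item["url"]}#chunk-{index}',
            "text": chunk,
            # Metadata carried along so a retrieved chunk can cite its paper
            "metadata": {
                "title": item["title"],
                "authors": item["authors"],
                "url": item["url"],
            },
        })
    return records


item = {
    "title": "Quantum black holes: inside and outside",
    "authors": "Bernard S. Kay",
    "url": "http://arxiv.org/abs/2510.20799v1",
    "textChunks": ["first chunk ...", "second chunk ..."],  # placeholder text
}
for record in to_records(item):
    print(record["id"], len(record["text"]))
```

Keeping the paper metadata on every record means a vector database query can return the source paper alongside the matching chunk.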