Under maintenance

Pricing

$10.00 / 1,000 papers

arXiv Pro Scraper - API & Full Text

A professional, low-cost arXiv scraper that uses the official API to find papers, then downloads, cleans, and chunks the full PDF text—creating AI-ready datasets in one click.

Rating

0.0

(0)

Developer

Lukas

Last modified

8 hours ago

arXiv API & Full-Text Scraper (AI-Ready)

This Apify Actor provides a complete, AI-ready dataset from arXiv.org.

It uses the official arXiv API for fast and reliable metadata scraping, then downloads, cleans, and chunks the full text from each paper's PDF.

This tool is designed for AI/LLM developers, researchers, and data scientists who need high-quality text corpora for model training and Retrieval-Augmented Generation (RAG) pipelines.

Key Competitive Features

⚡️ Fast & Low Cost: Finds papers through the official arXiv API instead of driving a browser, which avoids page-rendering overhead and keeps runs faster and cheaper than browser-based scrapers.
🤖 AI-Ready Chunking: Automatically splits the clean text into overlapping chunks, perfect for ingestion into vector databases (RAG).
🧹 Automatic Text Cleaning: Cleans the raw PDF text to remove headers, footers, page numbers, and bibliographies.
🎯 Powerful Search: Scrape by keyword, category code (e.g., cs.AI), or both.
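
The cleaning step is not documented in detail here. As a rough illustration of what removing page numbers and bibliographies can involve, here is a minimal Python sketch. The function name `clean_pdf_text` and its heuristics (dropping digit-only lines, cutting at a "References" or "Bibliography" heading) are assumptions for illustration, not the Actor's actual implementation, and running headers and footers are not handled.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Heuristic cleanup of text extracted from a PDF (illustrative only)."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Stop at the bibliography heading, if one is present.
        if re.fullmatch(r"references|bibliography", stripped, flags=re.IGNORECASE):
            break
        # Drop lines that contain nothing but a page number.
        if re.fullmatch(r"\d{1,4}", stripped):
            continue
        kept.append(line.rstrip())
    text = "\n".join(kept)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Real PDFs vary widely, so a production cleaner would likely need per-layout rules on top of simple heuristics like these.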

Input

The Actor accepts the following JSON input. At least one of `searchQuery` or `category` is required.

| Field | Type | Description | Default |
|---|---|---|---|
| `searchQuery` | String | (Optional) The keyword search query (e.g., "black hole physics"). | |
| `category` | String | (Optional) The arXiv category code (e.g., `cs.AI` for AI, `gr-qc` for General Relativity). | |
| `maxPages` | Number | The maximum number of result pages to scrape (50 results per page). | `1` |
| `chunkSize` | Number | (Optional) The target size (in characters) for each text chunk. | `1000` |
| `chunkOverlap` | Number | (Optional) The number of characters to overlap between chunks. | `200` |
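
To make the interaction between `chunkSize` and `chunkOverlap` concrete, here is a minimal sketch of overlapping chunking using fixed character windows. The function `chunk_text` is hypothetical. The Actor describes `chunkSize` as a target size, so it may split at sentence or paragraph boundaries rather than at exact character offsets as this sketch does.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so consecutive chunks share 200 characters.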

### Example Input (Simple Keyword Search)
```json
{
  "searchQuery": "quantum computing",
  "maxPages": 3
}
```
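
With 50 results per page, the input above would correspond to three requests against the arXiv API. The sketch below shows one plausible way to build those request URLs. The endpoint and the `search_query`, `start`, and `max_results` parameters come from the public arXiv API; the function name `build_page_urls`, the phrase-quoting of the keyword, and the `AND` combination of keyword and category are assumptions, since the Actor's exact query construction is not documented here.

```python
from typing import List, Optional
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"
RESULTS_PER_PAGE = 50  # page size stated in the input table


def build_page_urls(search_query: Optional[str] = None,
                    category: Optional[str] = None,
                    max_pages: int = 1) -> List[str]:
    """Build one arXiv API URL per result page (illustrative only)."""
    if not search_query and not category:
        raise ValueError("Provide searchQuery, category, or both")

    terms = []
    if search_query:
        # Assumption: treat the keyword as a quoted phrase across all fields.
        terms.append(f'all:"{search_query}"')
    if category:
        terms.append(f"cat:{category}")
    query = " AND ".join(terms)

    urls = []
    for page in range(max_pages):
        params = {
            "search_query": query,
            "start": page * RESULTS_PER_PAGE,
            "max_results": RESULTS_PER_PAGE,
        }
        urls.append(f"{ARXIV_API_URL}?{urlencode(params)}")
    return urls
```

Calling `build_page_urls(search_query="quantum computing", max_pages=3)` yields three URLs with `start` set to 0, 50, and 100.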

### Example Input (AI-Ready Category Search)
```json
{
  "category": "cs.CL",
  "maxPages": 10,
  "chunkSize": 1500,
  "chunkOverlap": 300
}
```

### Example Output
```json
{
  "title": "Quantum black holes: inside and outside",
  "authors": "Bernard S. Kay",
  "abstract": "We review and add to a conjectured 'principle' according to which...",
  "url": "http://arxiv.org/abs/2510.20799v1",
  "cleanedFullText": "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which...\n\n[... rest of the cleaned paper text ...]\n\n",
  "textChunks": [
    "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which a necessary condition for a ... [900 more characters]",
    "necessary condition for a ... [800 more characters] ... The principle is intended to apply only to 'civilized' spacetimes. We shall recall a ... [and 200 more characters]",
    "[... etc ...]"
  ]
}
```
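
For a RAG pipeline, each output item typically needs to be flattened into one record per chunk before it is embedded and stored. The sketch below shows one way to do that with the fields from the example output. The function `to_vector_records` and the record layout (`id`, `text`, `metadata`) are hypothetical and should be adapted to whatever vector database is in use.

```python
from typing import Dict, List


def to_vector_records(item: Dict) -> List[Dict]:
    """Turn one output item into one record per text chunk (illustrative only)."""
    records = []
    for index, chunk in enumerate(item.get("textChunks", [])):
        records.append({
            # Assumption: the paper URL plus the chunk index is unique enough for an ID.
            "id": f"{item['url']}#chunk-{index}",
            "text": chunk,
            "metadata": {
                "title": item["title"],
                "authors": item["authors"],
                "chunk_index": index,
            },
        })
    return records
```

Keeping the title and authors in the metadata lets retrieved chunks be traced back to their source paper.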

Arxiv Category Scraper

brave_paradise/arxiv-category-scraper

Scrapes recent papers from specific arXiv categories like cs.AI, math.CO via the arXiv API.

Donny

Arxiv Recent Papers Scraper

urban_quidnunc/arxiv-recent-papers-scraper

Donny

arXiv Paper Scraper

cloud9_ai/arxiv-paper-scraper

Scrape academic papers from arXiv.org. Search by keyword, browse categories, or get latest papers. Extract titles, abstracts, authors, PDF links, and citation data via arXiv API.

cloud9

Arxiv Semantic Search

draouadmohamed/arxiv-semantic-search

Scrape arXiv papers by category and find relevant research using AI-powered semantic search. Get papers from any field (AI, physics, biology, economics, etc.) with embeddings for RAG systems. Find your categories at: https://arxiv.org/category_taxonomy

Mohamed Aouad

ArXiv MCP server

jakub.kopecky/arxiv-mcp-server

The ArXiv MCP server provides a bridge between AI assistants and arXiv's research repository through the Model Context Protocol (MCP). It allows AI models to search for papers and access their content in a programmatic way.

Jakub Kopecký

Arxiv Paper Scraper

technicaldost/arxiv-paper-scraper

Technical Dost Solutions

arXiv Daily Digest Scraper

tropical_quince/arxiv-daily-digest

Scrape arXiv papers by search query or category. Extract titles, authors, abstracts, and PDF links from recent submissions.

Donny Nguyen

arXiv Scraper

artificially/arxiv-scraper

Search and extract academic papers from arXiv.org. Get paper titles, authors, abstracts, categories, and PDF links for AI/ML, physics, math, and more.

Artificially

ArXiv Academic Paper Scraper

fortuitous_pirate/arxiv-scraper

Scrape academic papers from ArXiv. Extract titles, authors, abstracts, categories, and PDF links. Essential for research and literature reviews.

Fortuitous Pirate

ArXiv Paper Scraper

nexgendata/arxiv-scraper

Extract research papers, abstracts, authors, and citations from arXiv.org. Perfect for academic research monitoring, literature reviews, and scientific trend analysis.