Pricing

from $0.67 / 1,000 dossier records

arXiv Scraper for RAG: Papers as Chunked JSON

Scrape arXiv papers by date and category. Strips LaTeX and returns RAG-ready JSON with tokenizer-aware chunks (cl100k_base, 512/50). Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, Chroma. Skip GROBID / Nougat / pandoc. $0.015 per paper.

Developer

GetAScraper

📄 arXiv scraper for RAG: papers as chunked JSON

Scrape arXiv research papers into RAG-ready JSON in one call. The Actor pulls papers by date range and category, strips the LaTeX source to clean text, and returns fixed-token chunks (512 tokens, 50 overlap) with full metadata, ready to drop into LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, or Chroma. Built for AI training data teams and anyone who has fought LaTeX compilation while trying to get arXiv papers into an embedding pipeline.

🔍 What does arXiv scraper for RAG do?

This Apify Actor scrapes arXiv papers within a date range and category filter, fetches the LaTeX source for each paper, strips the markup, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.
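The fixed-token windowing described above can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the function name `chunk_tokens` is hypothetical, the input is a plain list of token ids standing in for tokenizer output, and the handling of the final short window is an assumption.

```python
from typing import Dict, List


def chunk_tokens(token_ids: List[int], size: int = 512, overlap: int = 50) -> List[Dict]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with 512/50, each window starts 462 tokens after the last
    chunks: List[Dict] = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + size]
        chunks.append({"idx": len(chunks), "token_ids": window, "tokens": len(window)})
        if start + size >= len(token_ids):
            break  # this window already reached the end of the document
        start += step
    return chunks


# Stand-in for tokenizer output: 1,200 fake token ids.
chunks = chunk_tokens(list(range(1200)))
print([c["tokens"] for c in chunks])
```

In a real pipeline the ids would come from `tiktoken.get_encoding("cl100k_base").encode(text)`, and each window would be turned back into text with the encoding's `decode` method.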

Each output record contains clean metadata (title, authors, categories, DOI, PDF URL, published/updated dates) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.

Try it in the Apify Console. Fill in a category (e.g. cs.LG), a date range, a paper cap, and hit Start. Download the results as JSON, CSV, or Excel.

Because the Actor runs on the Apify platform, you also get scheduled runs, HTTP API access, integrations with Zapier and Make, proxy rotation when needed, monitoring, and alerts. There is no infrastructure to run yourself.

💡 Why use arXiv scraper for RAG?

Skip the LaTeX parsing hell: Papers are delivered as plain text, not mathematical syntax soup. No complex extraction servers or custom macro handlers needed.
Pre-chunked for RAG: Chunks are sized with tiktoken cl100k_base, the tokenizer used by OpenAI text-embedding-3. The text also works with Cohere and most BGE/E5/nomic embedding models, although token counts under their own tokenizers will differ.
Vector-DB-neutral: Drops straight into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, and Milvus without reformatting.
Framework-ready: Works out of the box with LangChain, LlamaIndex, Haystack, and LangGraph pipelines.
Dates + categories: We query the repository metadata directly, no need to scrape arXiv listing pages yourself.
Respects arXiv's rate limits: 1 request per 3 seconds, handled server-side, so your runs stay stable and unblocked.
Cheap: $0.015 per paper. A week of machine learning submissions (~500 papers) costs under $8.

🚀 How to use arXiv RAG extractor

Open the Actor in Apify Console.
Set dateFrom and dateTo in YYYY-MM-DD format (these are inclusive submission-date bounds).
Set categoriesFilter to an array of arXiv category tags like ["cs.LG", "cs.AI", "stat.ML"]. Leave empty to include all categories.
Set maxPapers to cap the run. Default: 10.
Click Start. Expect roughly 20 preprints per minute under the OAI rate limits.
Download results from the Storage tab.

⚙️ Input

Field	Type	Required	Description
`dateFrom`	string	Yes	Inclusive lower bound of paper submission date in YYYY-MM-DD format. Default: `"2024-01-01"`.
`dateTo`	string	Yes	Inclusive upper bound of paper submission date in YYYY-MM-DD format. Default: `"2024-01-02"`.
`categoriesFilter`	array of strings	No	arXiv category tags (e.g. `["cs.LG", "cs.AI"]`). Empty matches all categories.
`maxPapers`	integer	No	Hard cap on papers returned (1 to 100000). Default: `10`.

Example input:

{
    "categoriesFilter": ["cs.LG", "cs.AI"],
    "dateFrom": "2024-01-01",
    "dateTo": "2024-01-02",
    "maxPapers": 10
}
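If you start runs from code rather than the Console, a small helper can check the input against the table above before anything is sent. The sketch below is an illustration only: `build_input` and `run_actor` are hypothetical helper names, and `run_actor` assumes the third-party `apify-client` package. The Actor ID is not stated on this page, so it is left as a parameter.

```python
from datetime import date
from typing import Dict, List, Optional


def build_input(date_from: str, date_to: str,
                categories: Optional[List[str]] = None,
                max_papers: int = 10) -> Dict:
    """Validate values against the input table and return the Actor input dict."""
    start = date.fromisoformat(date_from)  # raises ValueError for malformed dates
    end = date.fromisoformat(date_to)
    if start > end:
        raise ValueError("dateFrom must not be later than dateTo")
    if not 1 <= max_papers <= 100000:
        raise ValueError("maxPapers must be between 1 and 100000")
    return {
        "dateFrom": date_from,
        "dateTo": date_to,
        "categoriesFilter": list(categories or []),  # empty list matches all categories
        "maxPapers": max_papers,
    }


def run_actor(token: str, actor_id: str, run_input: Dict) -> List[Dict]:
    """Start a run through apify-client and return the dataset items."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_input("2024-01-01", "2024-01-02", ["cs.LG", "cs.AI"], 10)
print(run_input)
```

A call would then look like `run_actor("<APIFY_TOKEN>", "<username>/<actor-name>", run_input)`, with both placeholders replaced by your own token and the ID shown on the Actor's page.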

📦 Output

Each paper becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.

{
    "arxiv_id": "2401.12345",
    "title": "Attention Is All You Need",
    "abstract": "The dominant sequence transduction models...",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "..."],
    "categories": ["cs.LG", "cs.CL"],
    "published": "2024-01-15T00:00:00Z",
    "updated": "2024-01-18T00:00:00Z",
    "doi": "10.48550/arXiv.2401.12345",
    "pdf_url": "https://arxiv.org/pdf/2401.12345v1",
    "source": "latex",
    "chunks": [
        { "idx": 0, "text": "...", "tokens": 512 },
        { "idx": 1, "text": "...", "tokens": 512 },
        { "idx": 2, "text": "...", "tokens": 487 }
    ]
}

Data table

Field	Type	Description
`arxiv_id`	string	arXiv identifier (e.g. `2401.12345`)
`title`	string	Paper title (whitespace-normalized)
`abstract`	string	Abstract as returned by arXiv
`authors`	string[]	Author display names, order preserved
`categories`	string[]	arXiv category tags
`published`	ISO date	First submission datetime
`updated`	ISO date	Latest update datetime
`doi`	string?	DOI (when provided by arXiv)
`pdf_url`	string	Direct PDF link
`source`	`"latex"` | `"abstract"`	Text origin: `latex` if source was extractable, `abstract` as fallback
`chunks`	Chunk[]	Fixed-token chunks ready for embedding
`chunks[].idx`	number	0-indexed position
`chunks[].text`	string	Chunk text
`chunks[].tokens`	number	Token count under cl100k_base (≤ 512)
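Most vector databases expect one record per chunk rather than one per paper. The sketch below flattens a dataset item into per-chunk records using only the fields listed in the table. The function name `iter_chunk_records` and the `arxiv_id:idx` ID scheme are assumptions made for illustration, not something the Actor produces.

```python
from typing import Dict, Iterator


def iter_chunk_records(paper: Dict) -> Iterator[Dict]:
    """Yield one flat record per chunk, carrying paper-level metadata along."""
    for chunk in paper.get("chunks", []):
        yield {
            "id": f'{paper["arxiv_id"]}:{chunk["idx"]}',  # hypothetical ID scheme
            "text": chunk["text"],
            "metadata": {
                "arxiv_id": paper["arxiv_id"],
                "title": paper["title"],
                "authors": paper["authors"],
                "categories": paper["categories"],
                "published": paper["published"],
                "source": paper["source"],  # "latex" or "abstract"
                "chunk_idx": chunk["idx"],
                "tokens": chunk["tokens"],
            },
        }


# Trimmed-down item shaped like the example output above.
paper = {
    "arxiv_id": "2401.12345",
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "categories": ["cs.LG", "cs.CL"],
    "published": "2024-01-15T00:00:00Z",
    "source": "latex",
    "chunks": [
        {"idx": 0, "text": "first chunk", "tokens": 512},
        {"idx": 1, "text": "second chunk", "tokens": 487},
    ],
}
records = list(iter_chunk_records(paper))
print(records[1]["id"])
```

Each record's `text` is what you would embed, and `metadata` can be stored as the payload so that search hits can be filtered by category or by `source`.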

💰 Pricing

$0.015 per paper (PPR, pay per result).

How much does it cost to scrape arXiv?

Volume	Estimated cost
10 papers	~$0.15
100 papers	~$1.50
1,000 papers	~$15.00
10,000 papers	~$150.00
100,000 papers	~$1,500.00

No subscription. No minimum. You pay only for successful records.
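For volumes not listed in the table, the estimate is a single multiplication. This short sketch uses `Decimal` to avoid floating-point rounding on currency. The function name `estimate_cost` is hypothetical, and the figure covers only the per-paper charge stated above.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.015")  # USD per paper, from the pricing section


def estimate_cost(papers: int) -> Decimal:
    """Return the estimated charge in USD for a run that yields `papers` records."""
    if papers < 0:
        raise ValueError("papers must be non-negative")
    return PRICE_PER_PAPER * papers


for n in (10, 100, 1000, 10000, 100000):
    print(f"{n:>7,} papers: ${estimate_cost(n):,.2f}")
```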

⭐ Enjoying arXiv scraper for RAG?

⭐ ⭐ ⭐ ⭐ ⭐
Pre-chunked, tokenizer-aware arXiv text ready for your vector database, without wrestling LaTeX yourself.
A 5-star rating takes 10 seconds and helps other AI training data teams and RAG pipeline builders find it. Your feedback also tells us what to build next.

★ Rate this Actor on Apify

⚠️ Limits you should know before you run

Only the latest version of each preprint is returned. Version history is a v2 feature.
No figure or table extraction: Captions stay inline as text inside body chunks. Figure and table content is dropped during the LaTeX strip pass.
No citation graph: Reference lists are stripped from body text to keep chunks dense. Reference extraction is a v2 feature.
No section-aware chunking: Chunks are fixed-token (512 with 50 overlap). Section-level splitting (Abstract / Introduction / Methods / Results / Discussion) is deferred.

📌 Tips

Narrow the category filter. Running on all of arXiv will hit your maxPapers cap fast. Use specific categories like cs.LG, stat.ML, q-bio.QM.
Short date windows for testing. 1-day windows are a good smoke test.
source: "abstract" means no LaTeX source was available. Either arXiv only hosts a PDF, or the tarball was unreadable. About 5 to 10% of papers fall into this category depending on era.
Schedule weekly runs to keep an embedding index fresh on newly-published research.
For large backfills, split into month-sized windows and run in parallel (separate Actor runs) to stay under the OAI rate limit cleanly.
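The month-sized windows suggested for backfills can be generated with a short helper. This is a sketch under the assumption that windows follow calendar months and that both bounds are inclusive, matching `dateFrom` and `dateTo`. The function name `month_windows` is hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def month_windows(date_from: date, date_to: date) -> List[Tuple[str, str]]:
    """Split an inclusive date range into inclusive calendar-month windows."""
    windows: List[Tuple[str, str]] = []
    start = date_from
    while start <= date_to:
        # First day of the following month, rolling over the year in December.
        if start.month == 12:
            next_month = date(start.year + 1, 1, 1)
        else:
            next_month = date(start.year, start.month + 1, 1)
        end = min(next_month - timedelta(days=1), date_to)
        windows.append((start.isoformat(), end.isoformat()))
        start = next_month
    return windows


windows = month_windows(date(2023, 11, 15), date(2024, 2, 10))
for window_from, window_to in windows:
    print({"dateFrom": window_from, "dateTo": window_to})
```

Each pair maps directly onto the `dateFrom` and `dateTo` input fields of a separate run.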

❓ FAQ and limitations

Is scraping arXiv legal?

arXiv offers an open API. This Actor respects arXiv's stated rate limit of 1 request every 3 seconds and fetches only publicly-available content.

What's not in the current release?

Multi-source support (PubMed, bioRxiv, Semantic Scholar).
Entity extraction (datasets, tasks, models).
Section-aware splitting (we chunk by fixed tokens across the whole paper).
Equation, figure, and table preservation. These are dropped by the LaTeX stripper.
Semantic chunking (fixed-token only).
Custom chunk sizes (hardcoded at 512/50).

Many of these are deferred. Open an issue (Issues tab) if one is blocking for you.

Rate limits

arXiv API: 1 request per 3 seconds (enforced). At that rate, a 1,000-paper run takes at least about 50 minutes.
arXiv e-print: Same rate limit, handled independently.
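The run-time floor follows directly from the rate limit. The sketch below assumes that the slowest endpoint needs one rate-limited request per paper, which is what the 50-minute figure implies. `min_runtime_minutes` and its `requests_per_paper` parameter are hypothetical names for illustration.

```python
SECONDS_PER_REQUEST = 3  # arXiv's stated limit: 1 request every 3 seconds


def min_runtime_minutes(papers: int, requests_per_paper: int = 1) -> float:
    """Lower bound on run time when every request waits out the rate limit."""
    if papers < 0 or requests_per_paper < 1:
        raise ValueError("papers must be >= 0 and requests_per_paper >= 1")
    return papers * requests_per_paper * SECONDS_PER_REQUEST / 60


print(min_runtime_minutes(1000))  # 1000 * 3 s = 3000 s = 50.0 minutes
```

Actual runs will take longer than this bound, since download and processing time come on top of the enforced waiting.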

Support

Found a bug or want a feature? Use the Issues tab on the Actor's page. Custom requirements (non-arXiv sources, different chunking, section-aware splitting)? Reach out via the Actor's Support link. Custom solutions available.

Disclaimer

Output metadata comes from arXiv's public API. Full text, when available, comes from arXiv's public e-print archive. All content remains under the license specified by the paper's authors on arXiv. Check arxiv.org/abs/<arxiv_id> for each paper's license before downstream use (CC-BY, CC-0, arXiv non-exclusive, etc.). Some papers do not permit commercial redistribution.

🔗 Other actors

bioRxiv and medRxiv scraper for RAG ↗ - extracts preprint papers as chunked JSON for RAG pipelines.
PubMed Scraper for RAG ↗ - pulls biomedical literature as chunked JSON for embeddings.
CourtListener RAG Extractor ↗ - extracts legal opinions and case law as chunked JSON.
SEC EDGAR Scraper for RAG ↗ - extracts 10-K, 10-Q and 8-K filings as chunked JSON.
