arXiv Scraper for RAG: Papers as Chunked JSON

Pricing

from $15.00 / 1,000 papers

Scrape arXiv papers by date and category. Strips LaTeX and returns RAG-ready JSON with tokenizer-aware chunks (cl100k_base, 512/50). Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, Chroma. Skip GROBID / Nougat / pandoc. $0.015 per paper.

Developer

Devansh Tiwari

Maintained by Community

Scrape arXiv research papers into RAG-ready JSON in one call. Pulls papers by date range and category, strips the LaTeX source to clean text, and returns fixed-token chunks (cl100k_base, 512 tokens / 50 overlap) with full metadata, ready to drop into LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, or Chroma. Built for AI training data teams and anyone who's fought GROBID, Nougat, or pandoc trying to get arXiv into an embedding pipeline.

What does arXiv Scraper for RAG do?

This Apify Actor scrapes arXiv papers within a date range and category filter, fetches the LaTeX source for each paper, strips the markup, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.

Each output record contains clean metadata (title, authors, categories, DOI, PDF URL, published/updated dates) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.
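The fixed-token splitting described above can be sketched as a sliding window over token ids. This is only an illustration of the 512/50 scheme, not the Actor's actual implementation: `chunk_tokens` is a hypothetical helper, and it works on a plain list of integers so that it runs without dependencies. In practice the ids would come from a cl100k_base encoder (for example tiktoken's) and each window would be decoded back to text.

```python
from typing import Dict, List


def chunk_tokens(token_ids: List[int], size: int = 512, overlap: int = 50) -> List[Dict]:
    """Split token ids into windows of at most `size` tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # 462 new tokens per chunk with the defaults
    chunks: List[Dict] = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + size]
        chunks.append({"idx": len(chunks), "token_ids": window, "tokens": len(window)})
        if start + size >= len(token_ids):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Hypothetical 1,000-token document, represented by dummy ids.
example = chunk_tokens(list(range(1000)))
print([c["tokens"] for c in example])  # [512, 512, 76]
```

Only the final chunk can be shorter than 512 tokens, which matches the shape of the chunks array in the sample output.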

Try it in the Apify Console. Fill in a category (e.g. cs.LG), a date range, a paper cap, and hit Start. Download the results as JSON, CSV, or Excel.

Because the Actor runs on the Apify platform, you also get scheduled runs, HTTP API access, integrations with Zapier / n8n / Make, proxy rotation when needed, monitoring, and alerts, with no infrastructure to run yourself.

Why use arXiv Scraper for RAG?

  • Skip the LaTeX parsing hell. Papers are delivered as plain text, not \begin{equation}...\end{equation} soup. No GROBID, no Nougat, no custom macro handlers.
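  • Pre-chunked for RAG. Chunks are sized with tiktoken's cl100k_base tokenizer, so the token counts are exact for OpenAI models that use that encoding, such as text-embedding-3. The chunk text also works with Claude, Cohere, and most BGE/E5/nomic embedding models, but those use their own tokenizers, so treat the token counts as approximate there.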
  • Vector-DB-neutral. Drops straight into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, and Milvus without reformatting.
  • Framework-ready. Works out of the box with LangChain, LlamaIndex, Haystack, and LangGraph pipelines.
  • Dates + categories, not URLs. No need to scrape arXiv listing pages yourself.
  • Respects arXiv's rate limits. 1 request per 3 seconds, handled server-side, so you won't get IP-banned.
  • Cheap. $0.015 per paper. A week of cs.LG submissions (~500 papers) costs under $8.

How to use arXiv Scraper for RAG

  1. Open the Actor in Apify Console.
  2. Set dateFrom and dateTo in YYYY-MM-DD format (these are inclusive submission-date bounds).
  3. (Optional) Set categoriesFilter: an array of arXiv category tags like ["cs.LG", "cs.AI", "stat.ML"]. Leave empty to include all categories.
  4. (Optional) Set maxPapers: default 1000. A hard upper bound per run.
  5. Click Start. Expect ~3 seconds per paper due to arXiv's rate limit (1000 papers ≈ 50 min minimum).
  6. Download results from the Storage tab as JSON, CSV, or Excel, or hit the API endpoint programmatically.
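For the programmatic route in step 6, a run might look like the sketch below. It assumes the official `apify-client` Python package; the token and Actor ID are placeholders, not real values, and `run_and_collect` is a hypothetical helper. The client is passed in as an argument so the helper can be tried with a stub before spending credits.

```python
from typing import Any, Dict, List


def run_and_collect(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return any run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    """Example entry point; call it yourself once the placeholders are filled in."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token
    papers = run_and_collect(
        client,
        "<username>/<actor-name>",  # placeholder: copy the ID from the Actor's page
        {
            "dateFrom": "2024-01-01",
            "dateTo": "2024-01-07",
            "categoriesFilter": ["cs.LG"],
            "maxPapers": 50,
        },
    )
    print(len(papers), "papers fetched")
```

The call blocks until the run finishes, so a 50-paper run should take a few minutes at roughly 3 seconds per paper.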

Input

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| dateFrom | string (YYYY-MM-DD) | Yes | | Inclusive lower bound of paper submission date |
| dateTo | string (YYYY-MM-DD) | Yes | | Inclusive upper bound of paper submission date |
| categoriesFilter | string[] | No | [] | arXiv category tags. Empty = all categories |
| maxPapers | integer | No | 1000 | Hard cap on papers returned (1 to 100000) |

Example input:

{
  "categoriesFilter": ["cs.LG", "cs.AI"],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-07",
  "maxPapers": 500
}
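Before starting a run, an input like this can be checked locally against the constraints listed in the Input section. The sketch below reflects assumed client-side checks, not the Actor's own validation, and `validate_input` is a hypothetical helper.

```python
from datetime import date
from typing import Any, Dict


def validate_input(run_input: Dict[str, Any]) -> Dict[str, Any]:
    """Check a run input against the documented constraints and fill in defaults."""
    # date.fromisoformat raises ValueError if the string is not a valid ISO date;
    # a missing dateFrom or dateTo raises KeyError, since both are needed.
    date_from = date.fromisoformat(run_input["dateFrom"])
    date_to = date.fromisoformat(run_input["dateTo"])
    if date_from > date_to:
        raise ValueError("dateFrom must not be later than dateTo")

    categories = run_input.get("categoriesFilter", [])
    if not all(isinstance(c, str) and c for c in categories):
        raise ValueError("categoriesFilter must be a list of non-empty strings")

    max_papers = run_input.get("maxPapers", 1000)
    if not 1 <= max_papers <= 100_000:
        raise ValueError("maxPapers must be between 1 and 100000")

    return {
        "dateFrom": date_from.isoformat(),
        "dateTo": date_to.isoformat(),
        "categoriesFilter": list(categories),
        "maxPapers": max_papers,
    }


print(validate_input({"dateFrom": "2024-01-01", "dateTo": "2024-01-07"}))
```

Catching a reversed date range or an out-of-range cap locally avoids starting a run that returns nothing.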

Output

Each paper becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.

{
  "arxiv_id": "2401.12345",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "..."],
  "categories": ["cs.LG", "cs.CL"],
  "published": "2024-01-15T00:00:00Z",
  "updated": "2024-01-18T00:00:00Z",
  "doi": "10.48550/arXiv.2401.12345",
  "pdf_url": "https://arxiv.org/pdf/2401.12345v1",
  "source": "latex",
  "chunks": [
    { "idx": 0, "text": "...", "tokens": 512 },
    { "idx": 1, "text": "...", "tokens": 512 },
    { "idx": 2, "text": "...", "tokens": 487 }
  ]
}

Data table

| Field | Type | Description |
|---|---|---|
| arxiv_id | string | arXiv identifier (e.g. 2401.12345) |
| title | string | Paper title (whitespace-normalized) |
| abstract | string | Abstract as returned by arXiv |
| authors | string[] | Author display names, order preserved |
| categories | string[] | arXiv category tags |
| published | ISO date | First submission datetime |
| updated | ISO date | Latest update datetime |
| doi | string? | DOI (when provided by arXiv) |
| pdf_url | string | Direct PDF link |
| source | "latex" or "abstract" | Text origin: latex if source was extractable, abstract as fallback |
| chunks | Chunk[] | Fixed-token chunks ready for embedding |
| chunks[].idx | number | 0-indexed position |
| chunks[].text | string | Chunk text |
| chunks[].tokens | number | Token count under cl100k_base (≤ 512) |
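Most vector databases expect one record per chunk rather than one per paper. The sketch below flattens a dataset item into per-chunk records with the paper metadata attached. The record layout (`id`, `text`, `metadata`) and the ID scheme are assumptions chosen for illustration; adapt them to whatever your store expects.

```python
from typing import Any, Dict, Iterator


def to_chunk_records(paper: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one record per chunk, carrying the paper-level metadata along."""
    metadata = {
        "arxiv_id": paper["arxiv_id"],
        "title": paper["title"],
        "categories": paper["categories"],
        "published": paper["published"],
        "source": paper["source"],
    }
    for chunk in paper["chunks"]:
        yield {
            # Hypothetical ID scheme: arXiv ID plus chunk position
            "id": f'{paper["arxiv_id"]}:{chunk["idx"]}',
            "text": chunk["text"],
            "metadata": {**metadata, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        }


# Trimmed-down item in the shape of the sample output.
paper = {
    "arxiv_id": "2401.12345",
    "title": "Attention Is All You Need",
    "categories": ["cs.LG", "cs.CL"],
    "published": "2024-01-15T00:00:00Z",
    "source": "latex",
    "chunks": [
        {"idx": 0, "text": "first chunk", "tokens": 512},
        {"idx": 1, "text": "second chunk", "tokens": 487},
    ],
}
records = list(to_chunk_records(paper))
print(records[1]["id"])  # 2401.12345:1
```

Keeping `source` in the metadata makes it easy to filter out abstract-only papers at query time.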

Pricing

$0.015 per paper (PPR, pay per result).

How much does it cost to scrape arXiv?

| Volume | Estimated cost |
|---|---|
| 100 papers | ~$1.50 |
| 1,000 papers | ~$15 |
| 10,000 papers | ~$150 |
| 100,000 papers | ~$1,500 |

No subscription. No minimum. You pay only for successful records.
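The estimates above are a straight multiplication by the per-paper price. As a small worked sketch (the helper name is hypothetical, and the price is the $15.00 per 1,000 papers quoted on this page):

```python
PRICE_PER_1000_PAPERS_USD = 15  # $0.015 per paper


def estimate_cost_usd(papers: int) -> float:
    """Estimated charge in USD for a run that returns `papers` successful records."""
    return papers * PRICE_PER_1000_PAPERS_USD / 1000


for n in (100, 500, 1_000, 10_000, 100_000):
    print(f"{n:>7} papers: ${estimate_cost_usd(n):,.2f}")
```

Because only successful records are billed, setting maxPapers also caps the spend for a run.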

Tips

  • Narrow the category filter. Running on all of arXiv will hit your maxPapers cap fast. Use specific categories such as cs.LG, stat.ML, or q-bio.QM.
  • Short date windows for testing. 1-week windows are a good smoke test (200-800 papers for cs.LG).
  • source: "abstract" means no LaTeX source was available. Either arXiv only hosts a PDF, or the tarball was unreadable. About 5 to 10% of papers fall into this category depending on era.
  • Schedule weekly runs to keep an embedding index fresh on newly-published research.
  • For large backfills, split the date range into month-sized windows and launch each window as a separate Actor run in parallel, so that each run stays within the 3-second rate limit.
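The month-sized windows from the last tip can be generated rather than typed by hand. This is a minimal sketch with a hypothetical helper, `month_windows`; it relies on dateFrom and dateTo being inclusive, so consecutive windows do not overlap.

```python
from datetime import date, timedelta
from typing import List, Tuple


def month_windows(start: date, end: date) -> List[Tuple[str, str]]:
    """Split the inclusive range [start, end] into per-calendar-month
    (dateFrom, dateTo) pairs in YYYY-MM-DD format."""
    if start > end:
        raise ValueError("start must not be later than end")
    windows = []
    cursor = start
    while cursor <= end:
        # First day of the month after the cursor's month
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        window_end = min(next_month - timedelta(days=1), end)
        windows.append((cursor.isoformat(), window_end.isoformat()))
        cursor = next_month
    return windows


print(month_windows(date(2023, 11, 15), date(2024, 1, 10)))
# [('2023-11-15', '2023-11-30'), ('2023-12-01', '2023-12-31'), ('2024-01-01', '2024-01-10')]
```

Each pair can then be used as the dateFrom and dateTo of its own run input.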

FAQ and limitations

arXiv offers an open API expressly for programmatic access (arXiv API Access). This Actor respects arXiv's stated rate limit of 1 request every 3 seconds on export.arxiv.org and fetches only publicly-available content.

What's not in v1?

  • Multi-source (PubMed, bioRxiv, Semantic Scholar). arXiv only for v1.
  • Entity extraction (datasets, tasks, models).
  • Section-aware splitting (we chunk by fixed tokens across the whole paper).
  • Equation / figure / table preservation. These are dropped by the LaTeX stripper.
  • Semantic chunking. Fixed-token only.
  • Custom chunk sizes. Hardcoded at 512/50 for v1.

Many of these are planned for v2. Open an issue (Issues tab) if one of them is blocking you.

Rate limits

  • arXiv API: 1 request per 3 seconds (enforced), so a 1,000-paper run takes at least about 50 minutes.
  • arXiv e-print: Same rate limit, handled independently.
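The 50-minute figure follows directly from the rate limit. The sketch below assumes one rate-limited request per paper on the slower path, which is a simplification; real runs will take somewhat longer. `min_runtime_minutes` is a hypothetical helper.

```python
SECONDS_PER_REQUEST = 3  # arXiv's stated limit: 1 request every 3 seconds


def min_runtime_minutes(papers: int) -> float:
    """Lower bound on run time in minutes, assuming one rate-limited request per paper."""
    return papers * SECONDS_PER_REQUEST / 60


print(min_runtime_minutes(1000))  # 50.0
```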

Support

Found a bug or want a feature? Use the Issues tab on the Actor's page. Custom requirements (non-arXiv sources, different chunking, section-aware splitting)? Reach out via the Actor's Support link. Custom solutions available.

Disclaimer

Output metadata is from arXiv's public API. Full text, when available, is from arXiv's public e-print archive. All content remains under the license specified by the paper's authors on arXiv. Check arxiv.org/abs/<arxiv_id> for each paper's license (CC BY, CC0, arXiv non-exclusive, etc.) before downstream use. Some papers do not permit commercial redistribution.


Built with Apify + Crawlee + TypeScript. Part of the actorstack portfolio.