arXiv Paper Scraper - Research Papers & Abstracts

Search and extract ArXiv papers, abstracts, authors, and citations. Track research trends across any scientific field. AI-powered analysis.

Pricing

from $15.75 / 1,000 papers scraped

Rating

5.0 (4)

Developer

viralanalyzer

Last modified

13 days ago
📄 ArXiv Paper Intelligence — Academic Paper Scraper & Research Monitor

Scrape academic papers from ArXiv using the public Atom API. Search by keyword, browse by category (cs.AI, cs.LG, stat.ML, etc.), or fetch specific papers by ArXiv ID — with titles, abstracts, authors, categories, and PDF links.

✨ Features

🔍 Keyword search — Use ArXiv query syntax (ti:transformer+AND+ti:attention, all:machine+learning)
📂 Browse by category — cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, physics.comp-ph, and more
🆔 Fetch by ArXiv ID — Get specific papers by their ArXiv identifier (e.g., 2301.12345)
📑 Rich metadata — Title, abstract (trimmed to 500 chars), authors, categories, dates
📥 Direct PDF links — Each paper includes its PDF download URL
🔄 Sorting options — Sort by relevance, last updated, or submission date
⚡ Rate-limit compliant — Respects ArXiv's 1 request per 3 seconds policy
🛡️ Anti-placeholder guardrails — Only real data, never fake results

📥 Input

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `mode` | String | Yes | `search` | Scraping mode: `search`, `by_category`, or `by_ids` |
| `searchQueries` | Array | Only in `search` mode | none | Keywords to search for (max 10 queries) |
| `categories` | Array | Only in `by_category` mode | none | ArXiv category codes, e.g. `cs.AI`, `cs.LG` (max 5) |
| `arxivIds` | Array | Only in `by_ids` mode | none | Specific ArXiv paper IDs (max 50) |
| `sortBy` | String | No | `relevance` | Sort order: `relevance`, `lastUpdatedDate`, or `submittedDate` |
| `maxItems` | Integer | No | `50` | Maximum number of papers to scrape (1–200) |
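
The constraints in the table can be checked locally before a run is started. The following is a minimal sketch in Python that assumes only the limits listed above. `validate_input` and `MODE_REQUIREMENTS` are hypothetical names used for illustration and are not part of the actor's code.

```python
VALID_MODES = {"search", "by_category", "by_ids"}
VALID_SORTS = {"relevance", "lastUpdatedDate", "submittedDate"}

# Hypothetical lookup: the list parameter each mode needs and its documented maximum.
MODE_REQUIREMENTS = {
    "search": ("searchQueries", 10),
    "by_category": ("categories", 5),
    "by_ids": ("arxivIds", 50),
}


def validate_input(run_input: dict) -> dict:
    """Return a normalized copy of the input, or raise ValueError."""
    mode = run_input.get("mode", "search")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    key, limit = MODE_REQUIREMENTS[mode]
    values = run_input.get(key) or []
    if not values:
        raise ValueError(f"{key} is required when mode is {mode!r}")
    if len(values) > limit:
        raise ValueError(f"{key} accepts at most {limit} entries, got {len(values)}")

    sort_by = run_input.get("sortBy", "relevance")
    if sort_by not in VALID_SORTS:
        raise ValueError(f"sortBy must be one of {sorted(VALID_SORTS)}, got {sort_by!r}")

    max_items = run_input.get("maxItems", 50)
    is_int = isinstance(max_items, int) and not isinstance(max_items, bool)
    if not is_int or not 1 <= max_items <= 200:
        raise ValueError("maxItems must be an integer between 1 and 200")

    return {"mode": mode, key: list(values), "sortBy": sort_by, "maxItems": max_items}


if __name__ == "__main__":
    print(validate_input({"mode": "by_category", "categories": ["cs.AI", "cs.LG"]}))
```

Failing early on the client side avoids starting a run that the actor would reject anyway.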

Input Example

{
    "mode": "search",
    "searchQueries": ["all:large+language+model", "ti:transformer+AND+ti:attention"],
    "sortBy": "submittedDate",
    "maxItems": 20
}
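
Queries like these follow the search syntax of the public arXiv API. As a rough illustration of how such an input could map onto an API request, the sketch below builds a query URL. The endpoint and the `search_query`, `id_list`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters belong to the arXiv API. The function name `build_query_url`, the `cat:` mapping for category mode, and the descending sort order are assumptions about how the actor might do this, not its actual code.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(mode, value, sort_by="relevance", max_results=50):
    """Build an arXiv API URL for one query, one category, or a list of IDs."""
    params = {}
    if mode == "search":
        # The input uses "+" as a pre-encoded space; decode it so urlencode
        # does not escape it a second time as %2B.
        params["search_query"] = value.replace("+", " ")
    elif mode == "by_category":
        params["search_query"] = f"cat:{value}"  # assumed mapping
    elif mode == "by_ids":
        params["id_list"] = ",".join(value)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    params["start"] = 0
    params["max_results"] = max_results
    if mode != "by_ids":
        params["sortBy"] = sort_by
        params["sortOrder"] = "descending"  # assumption: newest or best match first

    # Keep ":" and "," readable; spaces become "+" again.
    return f"{ARXIV_API}?{urlencode(params, safe=':,')}"


if __name__ == "__main__":
    print(build_query_url("search", "ti:transformer+AND+ti:attention",
                          sort_by="submittedDate", max_results=20))
```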

📤 Output

| Field | Type | Description |
| --- | --- | --- |
| `arxivId` | String | ArXiv paper identifier (e.g., `2301.12345`) |
| `title` | String | Paper title |
| `abstract` | String | Paper abstract (trimmed to 500 characters) |
| `authors` | Array | List of author names |
| `primaryCategory` | String | Primary ArXiv category (e.g., `cs.CL`) |
| `categories` | Array | All ArXiv categories for the paper |
| `publishedDate` | String | Original publication date (ISO 8601) |
| `updatedDate` | String | Last update date (ISO 8601) |
| `pdfUrl` | String | Direct link to PDF download |
| `arxivUrl` | String | ArXiv abstract page URL |
| `platform` | String | Always `arxiv` |
| `scrapedAt` | String | Timestamp of data extraction (ISO 8601) |

Output Example

{
    "arxivId": "2303.08774",
    "title": "GPT-4 Technical Report",
    "abstract": "We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers...",
    "authors": ["OpenAI", "Josh Achiam", "Steven Adler", "Sandhini Agarwal"],
    "primaryCategory": "cs.CL",
    "categories": ["cs.CL", "cs.AI"],
    "publishedDate": "2023-03-15T17:15:04Z",
    "updatedDate": "2024-03-04T03:44:33Z",
    "pdfUrl": "http://arxiv.org/pdf/2303.08774v6",
    "arxivUrl": "http://arxiv.org/abs/2303.08774",
    "platform": "arxiv",
    "scrapedAt": "2026-03-06T14:30:00.000Z"
}
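
To show where these fields could come from, here is a hedged sketch that maps one entry of the arXiv Atom feed onto the record shape above, using only the Python standard library. The function names, the whitespace normalization, and the exact trimming rule (hard cut at 500 characters plus `...`) are assumptions for illustration. The sample feed is a shortened, hand-written fragment, not real API output.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def trim_abstract(text, limit=500):
    """Collapse whitespace and cut to `limit` characters (assumed trimming rule)."""
    text = " ".join(text.split())
    return text if len(text) <= limit else text[:limit].rstrip() + "..."


def parse_entry(entry):
    """Map one Atom <entry> element onto the documented output fields."""
    def text(path):
        return " ".join(entry.findtext(path, default="", namespaces=NS).split())

    # "http://arxiv.org/abs/2303.08774v6" -> "2303.08774"
    arxiv_id = re.sub(r"v\d+$", "", text("atom:id").rsplit("/abs/", 1)[-1])
    pdf_link = entry.find("atom:link[@title='pdf']", NS)
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxivId": arxiv_id,
        "title": text("atom:title"),
        "abstract": trim_abstract(text("atom:summary")),
        "authors": [a.findtext("atom:name", default="", namespaces=NS)
                    for a in entry.findall("atom:author", NS)],
        "primaryCategory": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "publishedDate": text("atom:published"),
        "updatedDate": text("atom:updated"),
        "pdfUrl": pdf_link.get("href") if pdf_link is not None else None,
        "arxivUrl": f"http://arxiv.org/abs/{arxiv_id}",
        "platform": "arxiv",
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }


SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2303.08774v6</id>
    <updated>2024-03-04T03:44:33Z</updated>
    <published>2023-03-15T17:15:04Z</published>
    <title>GPT-4 Technical Report</title>
    <summary>We report the development of GPT-4,
      a large-scale, multimodal model.</summary>
    <author><name>OpenAI</name></author>
    <author><name>Josh Achiam</name></author>
    <link href="http://arxiv.org/abs/2303.08774v6" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2303.08774v6" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.AI"/>
  </entry>
</feed>
"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_FEED)
    for entry in root.findall("atom:entry", NS):
        print(parse_entry(entry))
```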

📋 Use Cases

📊 Research monitoring — Track new papers in your field (AI, ML, NLP, physics, math)
🏢 Competitive intelligence — Monitor publications from specific research labs or companies
📈 Trend analysis — Identify hot topics by analyzing paper volumes across categories
🎓 Literature reviews — Bulk-collect papers for systematic reviews or meta-analyses
🤖 AI dataset building — Feed paper metadata into recommendation engines or knowledge graphs
📰 Newsletter curation — Automatically find the latest papers for research digests
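
As one illustration of the trend-analysis use case, the sketch below counts papers per category from records shaped like the output example. `papers_per_category` is a hypothetical helper, and the optional month filter assumes `publishedDate` is an ISO 8601 string as documented. A paper is counted once for every category it lists.

```python
from collections import Counter


def papers_per_category(records, month=None):
    """Count records per ArXiv category, optionally limited to a 'YYYY-MM' month."""
    counts = Counter()
    for record in records:
        if month and not record["publishedDate"].startswith(month):
            continue
        counts.update(record["categories"])
    return counts


if __name__ == "__main__":
    # Hand-written sample records, not real scrape results.
    sample = [
        {"categories": ["cs.CL", "cs.AI"], "publishedDate": "2023-03-15T17:15:04Z"},
        {"categories": ["cs.AI"], "publishedDate": "2023-03-20T09:00:00Z"},
        {"categories": ["cs.LG"], "publishedDate": "2023-04-01T12:00:00Z"},
    ]
    for category, count in papers_per_category(sample, month="2023-03").most_common():
        print(category, count)
```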

❓ FAQ

Q: What ArXiv query syntax is supported? A: You can use ArXiv's standard query syntax — ti: for title, au: for author, abs: for abstract, all: for all fields. Combine with +AND+, +OR+, +ANDNOT+. Example: ti:transformer+AND+au:vaswani.

Q: How many papers can I scrape per run? A: Up to 200 papers per run (maxItems parameter). For larger datasets, run multiple times with different queries or categories.
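
When several runs are combined this way, the same paper can appear in more than one result set. A minimal sketch of merging runs by `arxivId` follows. `merge_runs` is a hypothetical helper, and keeping the record with the latest `updatedDate` is an assumed policy. The string comparison only works because the timestamps share one ISO 8601 format.

```python
def merge_runs(*runs):
    """Merge result lists from several runs, keeping one record per arxivId."""
    merged = {}
    for run in runs:
        for record in run:
            kept = merged.get(record["arxivId"])
            # Same-format ISO 8601 UTC strings sort chronologically.
            if kept is None or record["updatedDate"] > kept["updatedDate"]:
                merged[record["arxivId"]] = record
    return list(merged.values())


if __name__ == "__main__":
    # Hand-written sample records, not real scrape results.
    run_a = [
        {"arxivId": "2301.00001", "updatedDate": "2024-01-01T00:00:00Z"},
        {"arxivId": "2303.08774", "updatedDate": "2023-03-15T17:15:04Z"},
    ]
    run_b = [
        {"arxivId": "2303.08774", "updatedDate": "2024-03-04T03:44:33Z"},
        {"arxivId": "2301.00002", "updatedDate": "2024-02-01T00:00:00Z"},
    ]
    for record in merge_runs(run_a, run_b):
        print(record["arxivId"], record["updatedDate"])
```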

Q: Does this actor respect ArXiv rate limits? A: Yes. The actor enforces a minimum 3.1-second delay between API requests, which complies with ArXiv's policy of 1 request per 3 seconds, and it retries with exponential backoff on 429/503 errors.
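
The behavior described in this answer can be sketched as a small wrapper around any HTTP call. This is an illustration under stated assumptions, not the actor's implementation: `PoliteFetcher` is a hypothetical class, `fetch` is any callable returning a `(status, body)` tuple, and the retry count and 5-second base backoff are made-up defaults. Only the 3.1-second spacing and the 429/503 retry statuses come from the text above.

```python
import time

MIN_INTERVAL = 3.1           # seconds between requests, as stated in the FAQ
RETRY_STATUSES = {429, 503}  # statuses that trigger a retry


class PoliteFetcher:
    """Space out requests and retry throttled ones with exponential backoff."""

    def __init__(self, fetch, sleep=time.sleep, clock=time.monotonic,
                 max_retries=4, base_backoff=5.0):
        self._fetch = fetch
        self._sleep = sleep
        self._clock = clock
        self._max_retries = max_retries      # assumed value
        self._base_backoff = base_backoff    # assumed value
        self._last_request = None

    def _wait_for_slot(self):
        # Sleep only for whatever part of the minimum interval is still left.
        if self._last_request is not None:
            remaining = MIN_INTERVAL - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def get(self, url):
        for attempt in range(self._max_retries + 1):
            self._wait_for_slot()
            status, body = self._fetch(url)
            if status not in RETRY_STATUSES:
                return status, body
            if attempt < self._max_retries:
                # Backoff doubles each time: base, 2x base, 4x base, ...
                self._sleep(self._base_backoff * 2 ** attempt)
        raise RuntimeError(f"giving up on {url} after {self._max_retries + 1} attempts")
```

Passing `sleep` and `clock` in as parameters keeps the timing logic testable without real delays.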

Q: What happens if a search returns zero results? A: The actor throws an explicit error instead of returning silently — you will always know if your query produced no matches. Check your query syntax or try broader terms.

Q: Are abstracts complete or truncated? A: Abstracts are trimmed to 500 characters to keep the dataset compact. The full abstract is available at the arxivUrl link.

💰 Pricing

This actor uses Pay Per Event (PPE) pricing:

| Metric | Cost |
| --- | --- |
| Per paper scraped | $0.03 |

🔗 Related Actors

Wikipedia Trending Pages — Scrape most-viewed Wikipedia articles and pageview history
NPM Package Intelligence — NPM package metadata, scores, and download stats
Open Library Book Scraper — Search and scrape book data from Open Library
Google Trends Scraper — Trending search topics and interest over time

📝 Changelog

v1.0 (Current)

✅ Search by keyword with ArXiv query syntax
✅ Browse by ArXiv category (cs.AI, cs.LG, stat.ML, etc.)
✅ Fetch specific papers by ArXiv ID (batch of up to 50)
✅ Sorting by relevance, last updated, or submission date
✅ Rate-limit compliant (1 req / 3s with exponential backoff)
✅ Anti-placeholder guardrails — real data only
✅ PPE billing via Actor.charge()
