Pricing

from $10.00 / 1,000 results

Research Paper Scraper

Collect paper titles, authors, abstracts, categories, PDF links, DOIs, and other metadata. No API key is required.

Developer

Jamshaid Arif

📚 arXiv Research Paper Scraper

Scrape academic papers from arXiv.org, one of the largest open-access preprint repositories, with 2M+ papers across Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, and related fields.

Extract paper titles, authors, abstracts, categories, PDF links, DOIs, and more. No API key needed.

What does this actor do?

This actor searches arXiv.org and returns structured data for academic papers. It uses the official arXiv API, which is free and open — no authentication required.

Use it to:

Monitor new papers in your research field daily
Build literature review databases for any topic
Track a specific author's publication history
Retrieve full metadata for known papers by their arXiv IDs
Filter papers by date range, category, or keywords in the abstract
Analyze research trends by category and year distribution

How to use

1. Choose a scrape mode

The actor supports 4 scraping modes, selected via the scrape_mode input:

Mode	When to use	Required input
Keyword Search	Find papers on a topic	`search_query`
Author Search	Find all papers by a researcher	`author_name`
Category Browse	Browse latest papers in a field	`category`
ID Lookup	Fetch specific known papers	`paper_ids`
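
The mode-to-input mapping in the table can be checked before a run is started. The following is a minimal sketch, not part of the actor: the helper name `missing_required_field` is hypothetical, and the mode identifiers are the ones that appear in the input examples on this page. It is stricter than the actor itself, since `search_query` has a documented default.

```python
from typing import Optional

# Required input field for each scrape_mode value, taken from the table above.
REQUIRED_FIELD = {
    "keyword_search": "search_query",
    "author_search": "author_name",
    "category_browse": "category",
    "id_lookup": "paper_ids",
}


def missing_required_field(run_input: dict) -> Optional[str]:
    """Return the name of the missing required field, or None if the input is complete."""
    mode = run_input.get("scrape_mode", "keyword_search")  # documented default mode
    if mode not in REQUIRED_FIELD:
        raise ValueError(f"unknown scrape_mode: {mode!r}")
    field = REQUIRED_FIELD[mode]
    value = run_input.get(field)
    # Treat an absent, non-string, or whitespace-only value as missing.
    if isinstance(value, str) and value.strip():
        return None
    return field
```

Calling it on an `id_lookup` input without `paper_ids` returns the string `"paper_ids"`, which can be surfaced to the user before any platform credits are spent.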

2. Configure your search

Fill in the input fields in Apify Console. Only the fields relevant to your chosen mode are required — everything else is optional.

3. Run and export

Click Start and wait for the results. Export your dataset as JSON, CSV, Excel, or connect it to Google Sheets, Slack, or any other integration.

Input parameters

Core settings

Field	Type	Default	Description
`scrape_mode`	Select	`keyword_search`	Which search mode to use
`search_query`	Text	`machine learning`	Keywords to search for (Keyword Search mode)
`search_field`	Select	`all`	Which field to search: `all`, `ti` (title), `abs` (abstract), `au` (author)
`author_name`	Text	—	Researcher name (Author Search mode). Example: `Yoshua Bengio`
`category`	Text	—	arXiv category code (Category Browse mode, or as a filter in other modes)
`paper_ids`	Text	—	Comma-separated arXiv IDs (ID Lookup mode). Example: `1706.03762,2005.14165`
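
Since `paper_ids` is a single comma-separated string, it can help to normalize and validate it before submitting. The sketch below assumes the two public arXiv identifier formats (new-style such as `1706.03762`, old-style such as `hep-th/9901001`, each with an optional version suffix). Whether the actor accepts old-style IDs or version suffixes is not stated here, and `parse_paper_ids` is a hypothetical helper.

```python
import re

# New-style IDs (e.g. 1706.03762 or 1406.2661) and old-style IDs
# (e.g. hep-th/9901001, math.GT/0309136), each with an optional vN suffix.
ARXIV_ID = re.compile(
    r"^(\d{4}\.\d{4,5}|[a-z-]+(\.[A-Z]{2})?/\d{7})(v\d+)?$"
)


def parse_paper_ids(raw: str) -> list:
    """Split a comma-separated ID string, trim whitespace, and reject malformed IDs."""
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    bad = [i for i in ids if not ARXIV_ID.match(i)]
    if bad:
        raise ValueError(f"not valid arXiv IDs: {bad}")
    return ids
```

Joining the result with `",".join(ids)` gives a clean value for the `paper_ids` field.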

Filters

Field	Type	Default	Description
`date_from`	Text	—	Only papers from this date onward (`YYYY-MM-DD`)
`date_to`	Text	—	Only papers up to this date (`YYYY-MM-DD`)
`abstract_contains`	Text	—	Abstract must contain ALL of these comma-separated keywords
`exclude_keywords`	Text	—	Exclude papers containing ANY of these keywords in title or abstract
`min_authors`	Number	`0`	Minimum number of authors (0 = no filter)
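
To make the ALL/ANY semantics of the filters concrete, here is a rough sketch of how they could be applied to paper records such as those in the Output format section. It is an illustration, not the actor's code: case-insensitive substring matching, comparing dates against `published`, and treating `date_to` as inclusive are all assumptions.

```python
from datetime import date


def _keywords(raw: str) -> list:
    """Split a comma-separated keyword string into lowercase, trimmed terms."""
    return [k.strip().lower() for k in raw.split(",") if k.strip()]


def passes_filters(paper: dict, filters: dict) -> bool:
    """Return True if the paper record satisfies every filter that is set."""
    title = paper.get("title", "").lower()
    abstract = paper.get("abstract", "").lower()
    published = date.fromisoformat(paper["published"])

    if filters.get("date_from") and published < date.fromisoformat(filters["date_from"]):
        return False
    if filters.get("date_to") and published > date.fromisoformat(filters["date_to"]):
        return False
    # abstract_contains: ALL keywords must appear in the abstract.
    if not all(k in abstract for k in _keywords(filters.get("abstract_contains", ""))):
        return False
    # exclude_keywords: ANY keyword in the title or abstract rejects the paper.
    if any(k in title or k in abstract
           for k in _keywords(filters.get("exclude_keywords", ""))):
        return False
    # min_authors: 0 means no filter.
    return paper.get("num_authors", 0) >= filters.get("min_authors", 0)
```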

Output settings

Field	Type	Default	Description
`max_results`	Number	`50`	Number of papers to scrape (1–2000)
`sort_by`	Select	`relevance`	Sort order: `relevance`, `submitted_date`, or `last_updated`
`include_abstract`	Boolean	`true`	Include full abstract text
`include_pdf_links`	Boolean	`true`	Include direct PDF download URLs

Input examples

Example 1: Find recent NLP papers about transformers

{
    "scrape_mode": "keyword_search",
    "search_query": "transformer attention mechanism",
    "search_field": "all",
    "category": "cs.CL",
    "max_results": 50,
    "sort_by": "submitted_date",
    "date_from": "2025-01-01"
}

Example 2: Get all papers by a specific author

{
    "scrape_mode": "author_search",
    "author_name": "Geoffrey Hinton",
    "max_results": 100,
    "sort_by": "submitted_date",
    "include_abstract": true
}

Example 3: Browse latest Machine Learning papers

{
    "scrape_mode": "category_browse",
    "category": "cs.LG",
    "max_results": 30,
    "sort_by": "submitted_date"
}

Example 4: Fetch 5 famous AI papers by ID

{
    "scrape_mode": "id_lookup",
    "paper_ids": "1706.03762, 2005.14165, 1512.03385, 1406.2661, 2302.13971"
}

Example 5 (advanced): diffusion models in ML, 2024+, excluding surveys and reviews

{
    "scrape_mode": "keyword_search",
    "search_query": "diffusion model",
    "search_field": "ti",
    "category": "cs.LG",
    "max_results": 40,
    "sort_by": "submitted_date",
    "date_from": "2024-01-01",
    "abstract_contains": "generative",
    "exclude_keywords": "survey, review"
}

Output format

The actor produces two types of records in the dataset:

Summary record (first item)

The first record is a summary of the entire scrape run:

{
    "type": "summary",
    "total_papers": 50,
    "unique_authors": 187,
    "top_categories": [
        { "category": "cs.CL", "count": 32 },
        { "category": "cs.LG", "count": 18 },
        { "category": "cs.AI", "count": 9 }
    ],
    "year_distribution": {
        "2024": 12,
        "2025": 38
    },
    "query_used": "all:transformer AND all:attention AND cat:cs.CL",
    "scrape_mode": "keyword_search"
}

Paper records

Each paper is a flat JSON object:

{
    "rank": 1,
    "arxiv_id": "2302.13971v1",
    "title": "LLaMA: Open and Efficient Foundation Language Models",
    "authors": ["Hugo Touvron", "Thibaut Lavril", "Gautier Izacard", "..."],
    "authors_short": "Hugo Touvron, Thibaut Lavril, Gautier Izacard et al.",
    "num_authors": 14,
    "published": "2023-02-27",
    "updated": "2023-02-27",
    "primary_category": "cs.CL",
    "categories": ["cs.CL"],
    "category_names": ["Computer Science"],
    "abstract": "We introduce LLaMA, a collection of foundation language models...",
    "abstract_length": 1247,
    "arxiv_url": "http://arxiv.org/abs/2302.13971v1",
    "pdf_url": "http://arxiv.org/pdf/2302.13971v1",
    "source_url": "https://arxiv.org/e-print/2302.13971v1",
    "doi": "",
    "journal_ref": "",
    "comment": "Submitted to NeurIPS 2023",
    "links": [
        { "title": "pdf", "href": "http://arxiv.org/pdf/2302.13971v1", "type": "application/pdf" }
    ],
    "scraped_at": "2026-04-04T12:00:00+00:00"
}
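
Because the dataset mixes one summary record with many paper records, downstream code usually needs to separate them first. A minimal sketch, assuming the items have already been downloaded as a list of dictionaries and that only the summary carries `"type": "summary"` (as in the examples above); both function names are hypothetical.

```python
from collections import Counter


def split_records(items: list) -> tuple:
    """Separate the summary record (or None) from the paper records."""
    summary = next((i for i in items if i.get("type") == "summary"), None)
    papers = [i for i in items if i.get("type") != "summary"]
    return summary, papers


def year_distribution(papers: list) -> dict:
    """Count papers per publication year, keyed by the 4-digit year string."""
    return dict(Counter(p["published"][:4] for p in papers))
```

Recomputing `year_distribution` from the paper records is a cheap way to cross-check the summary after applying your own post-filters.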

arXiv categories quick reference

Use these codes in the category field:

Computer Science

Code	Field
`cs.AI`	Artificial Intelligence
`cs.CL`	Computation & Language (NLP)
`cs.CV`	Computer Vision
`cs.LG`	Machine Learning
`cs.NE`	Neural & Evolutionary Computing
`cs.RO`	Robotics
`cs.SE`	Software Engineering
`cs.CR`	Cryptography & Security
`cs.DB`	Databases
`cs.IR`	Information Retrieval

Other popular categories

Code	Field
`stat.ML`	Statistics — Machine Learning
`math.OC`	Mathematics — Optimization & Control
`quant-ph`	Quantum Physics
`econ.GN`	Economics — General
`q-bio.NC`	Quantitative Biology — Neurons & Cognition
`q-fin.ST`	Quantitative Finance — Statistical Finance

The full category list is at arxiv.org/category_taxonomy.

Advanced query syntax

In search_query, you can use arXiv's native query syntax for precise searches:

Syntax	Meaning	Example
`ti:keyword`	Search in title only	`ti:transformer`
`abs:keyword`	Search in abstract only	`abs:reinforcement`
`au:name`	Search by author	`au:bengio`
`cat:code`	Search by category	`cat:cs.AI`
`AND`	Both terms must match	`ti:neural AND ti:network`
`OR`	Either term matches	`cat:cs.CL OR cat:cs.LG`
`ANDNOT`	Exclude a term	`ti:transformer ANDNOT ti:survey`
`"phrase"`	Exact phrase match	`abs:"language model"`

Example: Find papers by Bengio about attention in NLP:

au:bengio AND abs:attention AND cat:cs.CL
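
For readers who want to see what such a query looks like at the arXiv API level, the sketch below builds a request URL for the public `export.arxiv.org/api/query` endpoint. How this actor constructs its requests internally is not documented here; the `SORT_BY` mapping from the actor's `sort_by` values to arXiv's `sortBy` values and the fixed descending order are assumptions.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Assumed mapping from this actor's sort_by values to arXiv's sortBy values.
SORT_BY = {
    "relevance": "relevance",
    "submitted_date": "submittedDate",
    "last_updated": "lastUpdatedDate",
}


def build_query_url(search_query: str, max_results: int = 50,
                    sort_by: str = "relevance", start: int = 0) -> str:
    """Return an arXiv API URL for the given query string and paging options."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": SORT_BY[sort_by],
        "sortOrder": "descending",  # assumption: newest or most relevant first
    }
    # urlencode percent-encodes colons and turns spaces into '+'.
    return f"{ARXIV_API}?{urlencode(params)}"
```

Opening the resulting URL in a browser returns an Atom feed, which is a quick way to test a query before spending a run on it.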

How much does it cost to run?

This actor is very lightweight because arXiv's API is free and doesn't require browser rendering.

Papers	Estimated time	Apify platform credits
50	~30 seconds	< $0.01
200	~2 minutes	~$0.01
500	~5 minutes	~$0.02
2000	~20 minutes	~$0.05

arXiv asks API clients to wait 3 seconds between consecutive requests; the actor respects this delay automatically.
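
The effect of the request delay on paging can be sketched as follows. The page size of 100 is a hypothetical value, since the actor's actual batch size is not documented here, and the computed wait is only a lower bound from the delay itself, not an estimate of total run time.

```python
DELAY_SECONDS = 3  # delay between consecutive arXiv API requests


def plan_requests(max_results: int, page_size: int = 100) -> list:
    """Return (start, count) pairs covering max_results in pages of page_size."""
    return [(start, min(page_size, max_results - start))
            for start in range(0, max_results, page_size)]


def minimum_wait(max_results: int, page_size: int = 100) -> int:
    """Seconds spent waiting between requests (no wait before the first one)."""
    pages = len(plan_requests(max_results, page_size))
    return max(pages - 1, 0) * DELAY_SECONDS
```

For example, 250 results at 100 per page take three requests and at least 6 seconds of waiting between them.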

Integrations and scheduling

Schedule daily paper monitoring

Set up a scheduled run in Apify Console to scrape new papers in your field every day. Combine with the Google Sheets or Slack integration to get automatic notifications.

Example: Daily cs.AI papers to Slack

Create a scheduled task with category_browse mode and cs.AI category
Sort by submitted_date, limit to max_results: 20
Connect the Slack integration to post results to your #research channel
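
A saved task input is static, so if runs are started from your own script instead, the input can be generated per day. The sketch below builds such an input; using `date_from` together with `category_browse` is an assumption based on the Filters table, and `daily_run_input` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_run_input(category: str, today: date, max_results: int = 20) -> dict:
    """Build a category_browse input limited to papers from the previous day onward."""
    return {
        "scrape_mode": "category_browse",
        "category": category,
        "sort_by": "submitted_date",
        "max_results": max_results,
        "date_from": (today - timedelta(days=1)).isoformat(),
    }
```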

Export to Google Sheets

After each run, the dataset can be automatically synced to a Google Sheet for easy sharing with your research team.

Tips and best practices

Start with a small max_results (10–20) to preview your query, then scale up.
Use category as a filter in Keyword Search mode to narrow results to your field.
Combine abstract_contains with broad searches to find niche papers — for example, search cat:cs.LG but require "graph neural" in the abstract.
Use exclude_keywords to filter out survey/review papers if you only want original research.
A single run returns at most 2000 papers (the upper bound of max_results); for larger datasets, run multiple searches with different date ranges.
Sort by submitted_date for monitoring new papers; sort by relevance for topic exploration.
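
The date-range tip above can be sketched as a small generator that tiles a period into non-overlapping windows, one per run. It assumes `date_from` and `date_to` are both inclusive; if the actor treats the boundaries differently, deduplicating the combined results by `arxiv_id` is a sensible safeguard. The name `date_windows` and the 30-day default are illustrative.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, days: int = 30):
    """Yield inclusive (date_from, date_to) ISO strings that tile [start, end]."""
    cursor = start
    while cursor <= end:
        # Each window spans `days` calendar days, clipped to the overall end date.
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor.isoformat(), window_end.isoformat()
        cursor = window_end + timedelta(days=1)
```

Each yielded pair can be merged into an otherwise identical run input as `date_from` and `date_to`.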

Changelog

v1.0.0 (2026-04-04)

Initial release
4 scraping modes: keyword search, author search, category browse, ID lookup
Advanced filtering: date range, abstract keywords, exclude keywords, min authors
Summary statistics with top categories and year distribution
Batched dataset output for large result sets
