Research Paper Scraper

Pricing

from $10.00 / 1,000 results

Gather paper titles, authors, abstracts, categories, PDF links, DOIs, and other metadata from arXiv. No API key is required.

Rating: 0.0 (0)

Developer: Jamshaid Arif (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 5 days ago

📚 arXiv Research Paper Scraper

Scrape academic papers from arXiv.org, the world's largest open-access preprint repository with 2M+ papers in Physics, Mathematics, Computer Science, Biology, Finance, and Statistics.

Extract paper titles, authors, abstracts, categories, PDF links, DOIs, and more. No API key needed.


What does this actor do?

This actor searches arXiv.org and returns structured data for academic papers. It uses the official arXiv API, which is free and open, so no authentication is required.

Use it to:

  • Monitor new papers in your research field daily
  • Build literature review databases for any topic
  • Track a specific author's publication history
  • Retrieve full metadata for known papers by their arXiv IDs
  • Filter papers by date range, category, or keywords in the abstract
  • Analyze research trends by category and year distribution

How to use

1. Choose a scrape mode

The actor supports 4 scraping modes, selected via the scrape_mode input:

| Mode | When to use | Required input |
| --- | --- | --- |
| Keyword Search | Find papers on a topic | search_query |
| Author Search | Find all papers by a researcher | author_name |
| Category Browse | Browse latest papers in a field | category |
| ID Lookup | Fetch specific known papers | paper_ids |

2. Fill in the input

Fill in the input fields in Apify Console. Only the fields relevant to your chosen mode are required; everything else is optional.
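If you build the input programmatically, a small check can catch a missing required field before a run starts. The sketch below is only an illustration: `build_input` and `REQUIRED_FIELD` are hypothetical helper names, not part of the actor, and the mode values are taken from the input examples in this README.

```python
# Required input field per scrape mode, as listed in the mode table.
REQUIRED_FIELD = {
    "keyword_search": "search_query",
    "author_search": "author_name",
    "category_browse": "category",
    "id_lookup": "paper_ids",
}


def build_input(scrape_mode, **fields):
    """Return an actor input dict, or raise ValueError if it is incomplete."""
    if scrape_mode not in REQUIRED_FIELD:
        raise ValueError(f"unknown scrape_mode: {scrape_mode!r}")
    required = REQUIRED_FIELD[scrape_mode]
    value = fields.get(required)
    # The required field must be a non-empty string.
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"{scrape_mode} requires a non-empty {required!r}")
    return {"scrape_mode": scrape_mode, **fields}
```

For example, `build_input("category_browse", category="cs.LG", max_results=30)` returns a dict that can be pasted into the JSON input editor, while `build_input("author_search")` raises a `ValueError`.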

3. Run and export

Click Start and wait for the results. Export your dataset as JSON, CSV, Excel, or connect it to Google Sheets, Slack, or any other integration.


Input parameters

Core settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| scrape_mode | Select | keyword_search | Which search mode to use |
| search_query | Text | machine learning | Keywords to search for (Keyword Search mode) |
| search_field | Select | all | Which field to search: all, ti (title), abs (abstract), au (author) |
| author_name | Text | (none) | Researcher name (Author Search mode). Example: Yoshua Bengio |
| category | Text | (none) | arXiv category code (Category Browse mode, or as a filter in other modes) |
| paper_ids | Text | (none) | Comma-separated arXiv IDs (ID Lookup mode). Example: 1706.03762,2005.14165 |

Filters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| date_from | Text | (none) | Only papers from this date onward (YYYY-MM-DD) |
| date_to | Text | (none) | Only papers up to this date (YYYY-MM-DD) |
| abstract_contains | Text | (none) | Abstract must contain ALL of these comma-separated keywords |
| exclude_keywords | Text | (none) | Exclude papers containing ANY of these keywords in title or abstract |
| min_authors | Number | 0 | Minimum number of authors (0 = no filter) |
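To make the filter semantics concrete, here is a minimal sketch of how the same rules could be applied to paper records on the client side. It is one plausible reading of the table, not the actor's actual code: the function name `passes_filters` is hypothetical, matching is assumed to be case-insensitive substring matching, and dates are assumed to be compared against the `published` field.

```python
from datetime import date


def _keywords(text):
    """Split a comma-separated keyword string into lowercase terms."""
    return [k.strip().lower() for k in text.split(",") if k.strip()]


def passes_filters(paper, date_from="", date_to="", abstract_contains="",
                   exclude_keywords="", min_authors=0):
    """Return True if a paper record satisfies every filter that is set."""
    published = date.fromisoformat(paper["published"])
    if date_from and published < date.fromisoformat(date_from):
        return False
    if date_to and published > date.fromisoformat(date_to):
        return False
    title = paper.get("title", "").lower()
    abstract = paper.get("abstract", "").lower()
    # abstract_contains: ALL keywords must appear in the abstract.
    if not all(k in abstract for k in _keywords(abstract_contains)):
        return False
    # exclude_keywords: ANY keyword in title or abstract rejects the paper.
    if any(k in title or k in abstract for k in _keywords(exclude_keywords)):
        return False
    return len(paper.get("authors", [])) >= min_authors
```

An empty string (or 0 for `min_authors`) leaves the corresponding filter switched off, mirroring the defaults in the table.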

Output settings

FieldTypeDefaultDescription
max_resultsNumber50Number of papers to scrape (1–2000)
sort_bySelectrelevanceSort order: relevance, submitted_date, or last_updated
include_abstractBooleantrueInclude full abstract text
include_pdf_linksBooleantrueInclude direct PDF download URLs

Input examples

Example 1: Find recent NLP papers about transformers

{
  "scrape_mode": "keyword_search",
  "search_query": "transformer attention mechanism",
  "search_field": "all",
  "category": "cs.CL",
  "max_results": 50,
  "sort_by": "submitted_date",
  "date_from": "2025-01-01"
}

Example 2: Get all papers by a specific author

{
  "scrape_mode": "author_search",
  "author_name": "Geoffrey Hinton",
  "max_results": 100,
  "sort_by": "submitted_date",
  "include_abstract": true
}

Example 3: Browse latest Machine Learning papers

{
  "scrape_mode": "category_browse",
  "category": "cs.LG",
  "max_results": 30,
  "sort_by": "submitted_date"
}

Example 4: Fetch 5 famous AI papers by ID

{
  "scrape_mode": "id_lookup",
  "paper_ids": "1706.03762, 2005.14165, 1512.03385, 1406.2661, 2302.13971"
}
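Because paper_ids is a single comma-separated string, it can help to validate it before starting a run. The following is a rough sketch under stated assumptions: `parse_paper_ids` is a hypothetical helper, and the pattern covers only new-style arXiv identifiers (YYMM.NNNN or YYMM.NNNNN with an optional version suffix), not old-style IDs such as hep-th/9901001.

```python
import re

# New-style arXiv identifier, optionally followed by a version such as "v2".
NEW_STYLE_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def parse_paper_ids(raw):
    """Split a comma-separated ID string into a de-duplicated list of IDs."""
    ids = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue  # tolerate trailing commas and empty entries
        if not NEW_STYLE_ID.fullmatch(candidate):
            raise ValueError(f"not a new-style arXiv ID: {candidate!r}")
        if candidate not in ids:
            ids.append(candidate)
    return ids
```

Joining the result with `", ".join(ids)` gives a clean value for the paper_ids field.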

Example 5 (advanced): diffusion models in ML, 2024+, excluding "survey"

{
  "scrape_mode": "keyword_search",
  "search_query": "diffusion model",
  "search_field": "ti",
  "category": "cs.LG",
  "max_results": 40,
  "sort_by": "submitted_date",
  "date_from": "2024-01-01",
  "abstract_contains": "generative",
  "exclude_keywords": "survey, review"
}

Output format

The actor produces two types of records in the dataset:

Summary record (first item)

The first record is a summary of the entire scrape run:

{
  "type": "summary",
  "total_papers": 50,
  "unique_authors": 187,
  "top_categories": [
    { "category": "cs.CL", "count": 32 },
    { "category": "cs.LG", "count": 18 },
    { "category": "cs.AI", "count": 9 }
  ],
  "year_distribution": {
    "2024": 12,
    "2025": 38
  },
  "query_used": "all:transformer AND all:attention AND cat:cs.CL",
  "scrape_mode": "keyword_search"
}

Paper records

Each paper is a single JSON object:

{
  "rank": 1,
  "arxiv_id": "2302.13971v1",
  "title": "LLaMA: Open and Efficient Foundation Language Models",
  "authors": ["Hugo Touvron", "Thibaut Lavril", "Gautier Izacard", "..."],
  "authors_short": "Hugo Touvron, Thibaut Lavril, Gautier Izacard et al.",
  "num_authors": 14,
  "published": "2023-02-27",
  "updated": "2023-02-27",
  "primary_category": "cs.CL",
  "categories": ["cs.CL"],
  "category_names": ["Computer Science"],
  "abstract": "We introduce LLaMA, a collection of foundation language models...",
  "abstract_length": 1247,
  "arxiv_url": "http://arxiv.org/abs/2302.13971v1",
  "pdf_url": "http://arxiv.org/pdf/2302.13971v1",
  "source_url": "https://arxiv.org/e-print/2302.13971v1",
  "doi": "",
  "journal_ref": "",
  "comment": "Submitted to NeurIPS 2023",
  "links": [
    { "title": "pdf", "href": "http://arxiv.org/pdf/2302.13971v1", "type": "application/pdf" }
  ],
  "scraped_at": "2026-04-04T12:00:00+00:00"
}
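When you post-process an exported dataset, the summary record usually needs to be separated from the paper records first. A minimal sketch, assuming the summary is the only item carrying "type": "summary" (paper records in the example above have no type field); `split_dataset` and `year_distribution` are hypothetical helper names.

```python
from collections import Counter


def split_dataset(items):
    """Separate the summary record from the paper records."""
    summary = None
    papers = []
    for item in items:
        if item.get("type") == "summary":
            summary = item
        else:
            papers.append(item)
    return summary, papers


def year_distribution(papers):
    """Count papers per year using the YYYY prefix of the 'published' date."""
    return dict(Counter(p["published"][:4] for p in papers))
```

Recomputing the year distribution from the paper records is a cheap way to cross-check the exported data against the year_distribution field of the summary.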

arXiv categories quick reference

Use these codes in the category field:

Computer Science

| Code | Field |
| --- | --- |
| cs.AI | Artificial Intelligence |
| cs.CL | Computation & Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.NE | Neural & Evolutionary Computing |
| cs.RO | Robotics |
| cs.SE | Software Engineering |
| cs.CR | Cryptography & Security |
| cs.DB | Databases |
| cs.IR | Information Retrieval |

Other fields

| Code | Field |
| --- | --- |
| stat.ML | Statistics: Machine Learning |
| math.OC | Mathematics: Optimization & Control |
| quant-ph | Quantum Physics |
| econ.GN | Economics: General |
| q-bio.NC | Quantitative Biology: Neurons & Cognition |
| q-fin.ST | Quantitative Finance: Statistical Finance |

The full category list is at arxiv.org/category_taxonomy.


Advanced query syntax

In search_query, you can use arXiv's native query syntax for precise searches:

| Syntax | Meaning | Example |
| --- | --- | --- |
| ti:keyword | Search in title only | ti:deep learning |
| abs:keyword | Search in abstract only | abs:reinforcement |
| au:name | Search by author | au:bengio |
| cat:code | Search by category | cat:cs.AI |
| AND | Both terms must match | ti:neural AND ti:network |
| OR | Either term matches | cat:cs.CL OR cat:cs.LG |
| ANDNOT | Exclude a term | ti:transformer ANDNOT ti:survey |
| "phrase" | Exact phrase match | abs:"language model" |

Example: Find papers by Bengio about attention in NLP:

au:bengio AND abs:attention AND cat:cs.CL
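For reference, a query like this maps onto a plain arXiv API request. The sketch below builds such a URL with the standard library. The endpoint and the `search_query`, `start`, `max_results`, `sortBy` and `sortOrder` parameters come from the public arXiv API. The mapping from this actor's sort_by values to `sortBy` is an assumption, and `arxiv_query_url` is a hypothetical helper, not something the actor exposes.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Assumed mapping from the actor's sort_by values to arXiv's sortBy values.
SORT_BY = {
    "relevance": "relevance",
    "submitted_date": "submittedDate",
    "last_updated": "lastUpdatedDate",
}


def arxiv_query_url(search_query, max_results=50, sort_by="relevance", start=0):
    """Build an arXiv API URL for a query written in arXiv's native syntax."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": SORT_BY[sort_by],
        "sortOrder": "descending",
    }
    # urlencode percent-encodes colons and turns spaces into "+".
    return f"{ARXIV_API}?{urlencode(params)}"
```

Opening the resulting URL in a browser returns an Atom feed, which is a quick way to sanity-check a query before using it in search_query.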

How much does it cost to run?

This actor is very lightweight because arXiv's API is free and doesn't require browser rendering.

| Papers | Estimated time | Apify platform credits |
| --- | --- | --- |
| 50 | ~30 seconds | < $0.01 |
| 200 | ~2 minutes | ~$0.01 |
| 500 | ~5 minutes | ~$0.02 |
| 2000 | ~20 minutes | ~$0.05 |

arXiv asks API clients to wait 3 seconds between requests, and the actor respects this delay automatically.
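To illustrate what respecting the delay looks like, here is a rough sketch of a paged fetch loop that pauses between requests. It is not the actor's implementation: the page size of 100 is an assumption, and `page_plan`, `fetch_all` and the injected `fetch_page` callable are hypothetical names.

```python
import time


def page_plan(max_results, page_size=100):
    """Return (start, size) pairs that together cover max_results items."""
    return [(start, min(page_size, max_results - start))
            for start in range(0, max_results, page_size)]


def fetch_all(fetch_page, max_results, page_size=100, delay=3.0, sleep=time.sleep):
    """Call fetch_page(start, size) for each page, pausing between requests."""
    results = []
    for i, (start, size) in enumerate(page_plan(max_results, page_size)):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.extend(fetch_page(start, size))
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.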


Integrations and scheduling

Schedule daily paper monitoring

Set up a scheduled run in Apify Console to scrape new papers in your field every day. Combine with the Google Sheets or Slack integration to get automatic notifications.

Example: Daily cs.AI papers to Slack

  1. Create a scheduled task with category_browse mode and cs.AI category
  2. Sort by submitted_date, limit to max_results: 20
  3. Connect the Slack integration to post results to your #research channel
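The task input for such a schedule can also be generated in code. A small sketch, assuming you additionally want date_from set to the previous day so each run only picks up recent submissions; that extra filter is not part of the three steps above, and `daily_monitor_input` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_monitor_input(category, today=None, max_results=20):
    """Build a category_browse input covering papers since yesterday."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    return {
        "scrape_mode": "category_browse",
        "category": category,
        "sort_by": "submitted_date",
        "max_results": max_results,
        "date_from": yesterday.isoformat(),  # assumed extra filter, YYYY-MM-DD
    }
```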

Export to Google Sheets

After each run, the dataset can be automatically synced to a Google Sheet for easy sharing with your research team.


Tips and best practices

  • Start with a small max_results (10–20) to preview your query, then scale up.
  • Use category as a filter in Keyword Search mode to narrow results to your field.
  • Combine abstract_contains with broad searches to find niche papers. For example, search cat:cs.LG but require "graph neural" in the abstract.
  • Use exclude_keywords to filter out survey/review papers if you only want original research.
  • arXiv limits results to 2000 per query. For larger datasets, run multiple searches with different date ranges.
  • Sort by submitted_date for monitoring new papers; sort by relevance for topic exploration.
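The date-range tip can be automated by splitting one long period into smaller windows and running one search per window. The sketch below uses calendar months, which is an arbitrary choice; `monthly_windows` is a hypothetical helper, and both ends of each window are treated as inclusive, matching the date_from and date_to descriptions.

```python
from datetime import date, timedelta


def monthly_windows(date_from, date_to):
    """Split [date_from, date_to] into non-overlapping calendar-month windows."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    windows = []
    while start <= end:
        # Jump to the first day of the next month, then step back one day.
        next_month = (start.replace(day=1) + timedelta(days=32)).replace(day=1)
        window_end = min(end, next_month - timedelta(days=1))
        windows.append({"date_from": start.isoformat(),
                        "date_to": window_end.isoformat()})
        start = window_end + timedelta(days=1)
    return windows
```

Each returned dict can be merged into an otherwise identical input, for example `{**base_input, **window}`. If a single month still returns close to 2000 papers, a shorter window would be needed.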

Changelog

v1.0.0 (2026-04-04)

  • Initial release
  • 4 scraping modes: keyword search, author search, category browse, ID lookup
  • Advanced filtering: date range, abstract keywords, exclude keywords, min authors
  • Summary statistics with top categories and year distribution
  • Batched dataset output for large result sets