Pricing

from $2.00 / 1,000 papers scraped

arXiv Metadata Collector— Metadata, PDF, Authors & Abstract

Scrape arXiv research papers with metadata including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search and structured dataset output for AI, ML, and academic research.

Developer

Scrape Pilot

Last modified

2 months ago

📄 arXiv Metadata Collector — Research Paper Scraper (Official API)

Extract structured academic paper metadata from arXiv — titles, authors, abstracts, PDF links, categories, DOIs, and more.
The arXiv Metadata Collector uses the official arXiv API (no auth, no API key) to search by keyword, author, category, or date range. Get clean JSON output perfect for academic research, literature reviews, and dataset building.

Pricing: Only $0.60 per 1,000 results – pay only for what you use.

💡 What is the arXiv Metadata Collector?

The arXiv Metadata Collector is an Apify actor that retrieves paper metadata from arXiv.org using its official, public API. arXiv is an open-access repository of more than 2 million scientific papers covering physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, and related fields.

With this actor you can:

Search by keywords (optional: restrict to title only)
Search by author (full or partial name)
Filter by category (e.g., cs.LG, cs.AI, stat.ML, physics, etc.)
Apply date ranges (date_from, date_to)
Sort by relevance, submission date, or update date
Control batch size with max_results

The output includes every field you need for literature analysis:

Title, authors, abstract
PDF URL, arXiv abstract URL, arXiv ID
Publication date, last update date
Primary category and all categories
DOI, journal reference, comment/notes

The arXiv Metadata Collector respects arXiv’s rate limits (3 seconds between requests) and never uses proxies (arXiv blocks proxy IPs). It also includes a demo mode with 10 real sample papers so you can test the output instantly.

📦 What Data Can You Extract?

🧩 Data Type	📋 Description
📄 Title	Full paper title.
👥 Authors	List of author names.
📝 Abstract	Paper summary (plain text, cleaned).
📥 PDF URL	Direct link to the PDF file.
🔗 arXiv URL	Abstract page URL.
🆔 arXiv ID	Unique identifier (e.g., `1706.03762`).
📅 Published Date	Original submission date (YYYY-MM-DD).
📅 Updated Date	Last revision date (if any).
🏷️ Primary Category	Main classification (e.g., `cs.CL`).
🏷️ Categories	Array of all categories.
🔗 DOI	Digital Object Identifier (if available).
📖 Journal Ref	Journal or conference name (e.g., `NeurIPS 2017`).
💬 Comment	Additional notes (e.g., “15 pages, 5 figures”).
🏛️ Source	Always `arXiv`.

⚙️ Key Features

Official arXiv API – 100% compliant, no API key, no login.
All-field search – Search all metadata fields or restrict to title only.
Author & category filters – Narrow down to specific researchers or research areas.
Date range – Get papers published in a specific window.
Sort options – Relevance, date submitted, or last updated.
Rate‑limit safe – Built‑in 3‑second delay between requests to avoid 429 errors.
No proxy – arXiv blocks datacenter proxies; the actor uses direct connections.
Demo mode – Instantly see 10 real landmark papers (no API calls, no cost).
Clean JSON – Ready for analysis, dashboards, or ingestion into databases.
Pay‑per‑use – $0.60 per 1,000 results – no monthly commitment.

📥 Input Parameters

The actor accepts a JSON object with the following fields:

Parameter	Type	Required	Default	Description
`demo_mode`	boolean	No	`false`	Return 10 sample papers (real, from arXiv).
`search_query`	string	No*	–	Keywords matched against all metadata fields (or the title only if `title_only=true`).
`author`	string	No*	–	Author name (e.g., `Yann LeCun`).
`category`	string	No*	–	arXiv category (e.g., `cs.LG`, `stat.ML`). Aliases: `machine learning`, `ai`, `nlp`, `computer vision`.
`title_only`	boolean	No	`false`	If `true`, `search_query` is matched only in the title.
`sort_by`	string	No	`relevance`	`relevance`, `date` (submitted), or `updated`.
`date_from`	string	No	–	Earliest publication date (YYYY-MM-DD).
`date_to`	string	No	–	Latest publication date (YYYY-MM-DD).
`max_results`	integer	No	`20`	Maximum number of papers to return.
`proxyConfiguration`	object	No	–	Ignored – arXiv requires direct connection. Do not use proxy.

* Note: Provide at least one of search_query, author, or category. If none is provided, the actor defaults to search_query: "machine learning".
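The defaults and the "at least one search field" rule from the table above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the actor's source; the function name `normalize_input` and the decision to reject unknown `sort_by` values are assumptions.

```python
# Hypothetical helper: applies the defaults listed in the parameter table.
DEFAULTS = {
    "demo_mode": False,
    "title_only": False,
    "sort_by": "relevance",
    "max_results": 20,
}
ALLOWED_SORTS = {"relevance", "date", "updated"}
SEARCH_FIELDS = ("search_query", "author", "category")


def normalize_input(raw: dict) -> dict:
    """Return a copy of the input with documented defaults filled in."""
    params = {**DEFAULTS, **raw}
    # proxyConfiguration is documented as ignored, so it is simply dropped.
    params.pop("proxyConfiguration", None)
    if params["sort_by"] not in ALLOWED_SORTS:
        # Assumption: the real actor may handle bad values differently.
        raise ValueError(f"unsupported sort_by: {params['sort_by']!r}")
    # Documented fallback when no search field is provided.
    if not any(params.get(field) for field in SEARCH_FIELDS):
        params["search_query"] = "machine learning"
    return params


print(normalize_input({"author": "Yann LeCun"}))
```

Running the same function on an empty object shows the fallback query being filled in.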

Example Input (Keyword Search)

{
  "search_query": "transformer attention",
  "max_results": 10,
  "sort_by": "date"
}

Example Input (Author + Category)

{
  "author": "Yoshua Bengio",
  "category": "cs.LG",
  "max_results": 15
}

Example Input (Date Range)

{
  "category": "physics",
  "date_from": "2023-01-01",
  "date_to": "2023-12-31",
  "max_results": 50
}
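For readers curious how inputs like these could translate into an arXiv API request, here is a rough Python sketch. The field prefixes (`all:`, `ti:`, `au:`, `cat:`), the `submittedDate:[... TO ...]` range filter, and the `sortBy` values come from the public arXiv API; how the actor itself combines them is not documented here, so `build_search_query`, `build_url`, the AND-joining of keywords, and the requirement that both dates be present are all assumptions.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
# Actor sort names mapped to arXiv API sortBy values (mapping is an assumption).
SORT_MAP = {
    "relevance": "relevance",
    "date": "submittedDate",
    "updated": "lastUpdatedDate",
}


def build_search_query(params: dict) -> str:
    """Combine the input fields into one arXiv search_query expression."""
    clauses = []
    keywords = params.get("search_query")
    if keywords:
        prefix = "ti" if params.get("title_only") else "all"
        # Assumption: every keyword must match, so terms are AND-joined.
        clauses.append(" AND ".join(f"{prefix}:{term}" for term in keywords.split()))
    if params.get("author"):
        clauses.append(f'au:"{params["author"]}"')
    if params.get("category"):
        clauses.append(f'cat:{params["category"]}')
    # Simplification: the date filter is added only when both bounds are given.
    if params.get("date_from") and params.get("date_to"):
        start = params["date_from"].replace("-", "") + "0000"
        end = params["date_to"].replace("-", "") + "2359"
        clauses.append(f"submittedDate:[{start} TO {end}]")
    return " AND ".join(clauses)


def build_url(params: dict, start: int = 0) -> str:
    """Build a full request URL for one page of results."""
    query = {
        "search_query": build_search_query(params),
        "start": start,
        "max_results": params.get("max_results", 20),
        "sortBy": SORT_MAP[params.get("sort_by", "relevance")],
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(query)}"


print(build_search_query({"category": "physics", "date_from": "2023-01-01", "date_to": "2023-12-31"}))
```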

📤 Output Fields

Each paper is returned as an object with the following possible fields:

Field	Type	Description
`title`	string	Paper title.
`authors`	array	List of author names.
`abstract`	string	Paper summary (cleaned).
`pdf_url`	string	Direct PDF link.
`arxiv_url`	string	Abstract page URL.
`arxiv_id`	string	e.g., `1706.03762`.
`published`	string	YYYY-MM-DD.
`updated`	string	YYYY-MM-DD (if revised).
`primary_category`	string	Main category (e.g., `cs.CL`).
`categories`	array	All categories (e.g., `["cs.CL","cs.LG"]`).
`doi`	string	Digital Object Identifier (or `null`).
`journal_ref`	string	Journal/conference name (or `null`).
`comment`	string	Additional notes (or `null`).
`source`	string	Always `"arXiv"`.

Example Output

[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Lukasz Kaiser", "Illia Polosukhin"],
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...",
    "pdf_url": "https://arxiv.org/pdf/1706.03762",
    "arxiv_url": "https://arxiv.org/abs/1706.03762",
    "arxiv_id": "1706.03762",
    "published": "2017-06-12",
    "updated": "2023-08-02",
    "primary_category": "cs.CL",
    "categories": ["cs.CL", "cs.LG"],
    "doi": "10.48550/arXiv.1706.03762",
    "journal_ref": "Advances in Neural Information Processing Systems 30, 2017",
    "comment": "15 pages, 5 figures",
    "source": "arXiv"
  }
]
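As a small illustration of consuming this output, the sketch below turns one record into a short citation string. The record is a trimmed copy of the example above; `format_citation` and its citation layout are hypothetical and only show how the documented fields fit together.

```python
import json

# Trimmed copy of the example output record shown above.
RAW = """
[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "arxiv_id": "1706.03762",
    "published": "2017-06-12",
    "doi": "10.48550/arXiv.1706.03762"
  }
]
"""


def format_citation(paper: dict) -> str:
    """Build 'First Author et al. (year). Title. arXiv:id' from one record."""
    authors = paper.get("authors") or []
    if not authors:
        lead = "Unknown"
    elif len(authors) == 1:
        lead = authors[0]
    else:
        lead = f"{authors[0]} et al."
    year = paper["published"][:4]  # published is documented as YYYY-MM-DD
    return f'{lead} ({year}). {paper["title"]}. arXiv:{paper["arxiv_id"]}'


papers = json.loads(RAW)
for paper in papers:
    print(format_citation(paper))
```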

💰 Pricing

Metric	Price
Per 1,000 results	$0.60
Minimum charge per run	$0.01
Typical run (20 results)	$0.012

You pay only for the number of papers returned (deduplicated).
Free demo mode – set demo_mode: true to test the output without any cost.
No monthly subscription – just pay as you go.

Example:

100,000 papers = 100 × $0.60 = $60 – enough for a large literature review or training dataset.
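The arithmetic in the pricing table can be reproduced with a short Python helper. This is a sketch only: `estimate_cost` is a hypothetical name, and treating the minimum charge as a simple floor on the per-run cost is an assumption about how billing works.

```python
from decimal import Decimal

PRICE_PER_1000 = Decimal("0.60")  # from the pricing table above
MIN_CHARGE = Decimal("0.01")      # minimum charge per run


def estimate_cost(results: int) -> Decimal:
    """Estimate the cost of one run that returns `results` papers."""
    cost = PRICE_PER_1000 * results / 1000
    # Assumption: the minimum charge acts as a floor on the run cost.
    return max(cost, MIN_CHARGE)


print(estimate_cost(20))       # typical run from the table
print(estimate_cost(100_000))  # large literature review
```

Using `Decimal` instead of floats keeps the small per-result amounts exact.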

🛠 How to Use on Apify

Create a task with this actor.
Set demo_mode: true to see sample output instantly.
Adjust search parameters – keywords, author, category, date range.
Choose sorting – relevance, date, or updated.
Run – the actor calls the official arXiv API (direct, no proxy).
Export – download results as JSON, CSV, or Excel.

Important: Do not enable proxy for this actor. arXiv blocks datacenter proxy IPs. The actor automatically ignores proxy settings.

Running via API

curl -X POST "https://api.apify.com/v2/acts/your-username~arxiv-metadata-collector/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "search_query": "generative adversarial networks",
    "max_results": 10
  }'
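The same request can be prepared from Python with only the standard library. The sketch below mirrors the curl call above (same endpoint, headers, and placeholder actor ID and token); `build_run_request` is a hypothetical helper, and the request is only built here, not sent.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST request that starts an actor run with the given input."""
    return urllib.request.Request(
        url=f"{APIFY_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "your-username~arxiv-metadata-collector",  # placeholder, as in the curl example
    "YOUR_API_TOKEN",                          # placeholder token
    {"search_query": "generative adversarial networks", "max_results": 10},
)
print(request.get_method(), request.full_url)

# To actually start the run, send the request:
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)
```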

🎯 Use Cases

Academic Research – Collect papers for systematic literature reviews.
Machine Learning Dataset Building – Gather abstracts and metadata to train a research recommendation system.
Competitive Intelligence – Track publications by specific authors or labs.
Meta‑Analysis – Study publication trends over time by category.
Content Aggregation – Build a custom search engine or daily arXiv digest.
University Libraries – Provide an API endpoint for researchers to fetch paper metadata.

The arXiv Metadata Collector delivers high‑quality, structured data that powers everything from simple bibliography generation to complex scholarly analytics.

❓ Frequently Asked Questions

Q1. Do I need an API key?

No. arXiv’s API is completely open and free. No registration, no token.

Q2. Why can’t I use a proxy?

arXiv aggressively rate‑limits proxy IP addresses (even residential ones). The actor uses direct connections, which are faster and more reliable.

Q3. How fresh is the data?

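The arXiv API follows arXiv's daily processing cycle rather than updating in real time. New papers become searchable once arXiv has processed and announced them, which usually takes about a day and can take longer over weekends.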

Q4. What is the difference between `published` and `updated`?

published: original submission date (or first version date for revised papers).
updated: date of the most recent revision (v2, v3, etc.). For papers with one version, updated equals published.
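One practical use of the two dates is spotting revised papers. The Python sketch below assumes both fields are YYYY-MM-DD strings as documented, and treats a missing `updated` value the same as an unrevised paper; `is_revised` is a hypothetical helper.

```python
from datetime import date


def is_revised(paper: dict) -> bool:
    """True when the paper has a revision later than its first version."""
    published = date.fromisoformat(paper["published"])
    updated_raw = paper.get("updated")
    if not updated_raw:
        # Assumption: a missing value means there is only one version.
        return False
    return date.fromisoformat(updated_raw) > published


print(is_revised({"published": "2017-06-12", "updated": "2023-08-02"}))
```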

Q5. Can I search by exact author name?

Yes. The API supports au:author_name. The actor encodes it automatically. You can enter partial names as well.

Q6. Are there any rate limits?

The actor follows arXiv's recommendation of 3 seconds between requests and uses exponential backoff on 429 responses. For more than 1,000 papers, a run will take several minutes but should still complete successfully.
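A generic version of this retry pattern looks roughly like the Python sketch below. It is not the actor's code: the doubling schedule, the 60-second cap, the attempt limit, and the names `backoff_delay` and `fetch_with_retry` are assumptions chosen for illustration. The `fetch` argument is any callable returning a `(status, body)` pair.

```python
import time

BASE_DELAY = 3.0  # seconds, matching arXiv's recommended spacing


def backoff_delay(attempt: int, base: float = BASE_DELAY, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * 2 ** attempt)


def fetch_with_retry(fetch, max_attempts: int = 5, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429 or 503."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in (429, 503):
            return status, body
        sleep(backoff_delay(attempt))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Demonstration with a fake fetcher and a recording sleep function.
responses = iter([(429, ""), (503, ""), (200, "ok")])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.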

Q7. What categories are supported?

All arXiv categories are supported, including cs.*, math.*, physics.*, stat.*, q-bio.*, q-fin.*, and econ.*. You can also use friendly aliases: machine learning → cs.LG, ai → cs.AI, nlp → cs.CL, computer vision → cs.CV, etc.
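Alias resolution of this kind amounts to a dictionary lookup with a pass-through fallback. The Python sketch below contains only the four aliases named above; the case-insensitive matching and the name `resolve_category` are assumptions, and the actor's full alias list is not shown in this document.

```python
# Only the aliases listed in this document; the real list may be longer.
CATEGORY_ALIASES = {
    "machine learning": "cs.LG",
    "ai": "cs.AI",
    "nlp": "cs.CL",
    "computer vision": "cs.CV",
}


def resolve_category(value: str) -> str:
    """Map a friendly alias to an arXiv category; unknown values pass through."""
    cleaned = value.strip()
    # Assumption: aliases are matched case-insensitively.
    return CATEGORY_ALIASES.get(cleaned.lower(), cleaned)


print(resolve_category("Machine Learning"))
print(resolve_category("stat.ML"))
```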

Q8. Does it download the PDF file?

No. The actor returns only the PDF URL. You can use that URL to download the file separately. This keeps the actor fast and cost‑efficient.
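If you do want the files, a separate download step can work from the `pdf_url` and `arxiv_id` fields. The Python sketch below is one possible approach, not part of the actor: `pdf_filename` and `download_pdfs` are hypothetical helpers, and the 3-second pause between downloads is borrowed from the API guidance as a cautious assumption.

```python
import time
import urllib.request
from pathlib import Path


def pdf_filename(paper: dict) -> str:
    """Derive a safe file name from the arXiv ID."""
    # Older IDs such as "hep-th/9901001" contain a slash.
    return paper["arxiv_id"].replace("/", "_") + ".pdf"


def download_pdfs(papers, dest_dir: str = "pdfs", delay_seconds: float = 3.0):
    """Download each paper's PDF once, pausing between requests."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for paper in papers:
        target = dest / pdf_filename(paper)
        if not target.exists():  # skip files fetched on a previous run
            with urllib.request.urlopen(paper["pdf_url"], timeout=60) as response:
                target.write_bytes(response.read())
            time.sleep(delay_seconds)
        saved.append(target)
    return saved


print(pdf_filename({"arxiv_id": "1706.03762"}))
```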

Q9. How accurate is the demo mode?

The demo mode shows real arXiv papers (the landmark ones that everyone cites). They are static and serve as an example of the output schema.

Q10. Can I get historical papers (e.g., from the 1990s)?

Yes, by setting date_from and date_to. arXiv’s API goes back to the very first paper (1991).

📝 Technical Notes

Dependencies: curl_cffi (for connection handling) – bundled in the actor.
Rate limiting: Automatic 3‑second delay between batches, plus exponential backoff on 429/503.
No proxy: The actor ignores proxyConfiguration and connects to arXiv directly. Leave it unset; proxied requests to arXiv tend to be blocked.
Date filtering: Uses the submittedDate field in the API, which corresponds to the original submission date.
Category aliases: The actor converts friendly names (e.g., machine learning) to official arXiv categories (e.g., cs.LG). Unknown aliases are passed as‑is.
XML parsing: Uses Python’s built‑in xml.etree.ElementTree with namespaces.
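To make the XML parsing note concrete, here is a minimal Python sketch of namespace-aware parsing with `xml.etree.ElementTree`. The sample feed is hand-written and trimmed to the shape of an arXiv Atom response, so its values are illustrative; `parse_entry` and `parse_feed` are hypothetical names, and the actor's real parser may differ.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written, trimmed sample in the shape of an arXiv Atom feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <updated>2023-08-02T00:00:00Z</updated>
    <published>2017-06-12T00:00:00Z</published>
    <title>Attention Is All
  You Need</title>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
  </entry>
</feed>"""


def parse_entry(entry: ET.Element) -> dict:
    """Convert one Atom <entry> into a flat record."""
    def text(path: str) -> str:
        # Collapse line breaks and repeated spaces inside text nodes.
        return " ".join(entry.findtext(path, default="", namespaces=NS).split())

    raw_id = text("atom:id").rsplit("/abs/", 1)[-1]
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "title": text("atom:title"),
        "authors": [a.findtext("atom:name", default="", namespaces=NS)
                    for a in entry.findall("atom:author", NS)],
        "arxiv_id": re.sub(r"v\d+$", "", raw_id),  # drop the version suffix
        "published": text("atom:published")[:10],
        "updated": text("atom:updated")[:10],
        "primary_category": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "doi": entry.findtext("arxiv:doi", default=None, namespaces=NS),
    }


def parse_feed(xml_text: str) -> list:
    root = ET.fromstring(xml_text)
    return [parse_entry(entry) for entry in root.findall("atom:entry", NS)]


print(parse_feed(SAMPLE)[0])
```

Note the explicit `is not None` check: an `Element` without children is falsy, so a plain truthiness test would wrongly treat `primary_category` as missing.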

Start collecting arXiv metadata today – $0.60 per 1,000 papers. No subscription.
