arXiv Metadata Collector— Metadata, PDF, Authors & Abstract
Pricing
from $0.60 / 1,000 results
Scrape arXiv research paper metadata, including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search and structured dataset output for AI, ML, and academic research use.
Developer
Scrape Pilot
📄 arXiv Metadata Collector — Research Paper Scraper (Official API)
Extract structured academic paper metadata from arXiv — titles, authors, abstracts, PDF links, categories, DOIs, and more.
The arXiv Metadata Collector uses the official arXiv API (no auth, no API key) to search by keyword, author, category, or date range. Get clean JSON output perfect for academic research, literature reviews, and dataset building.
Pricing: Only $0.60 per 1,000 results – pay only for what you use.
💡 What is the arXiv Metadata Collector?
The arXiv Metadata Collector is a professional Apify actor that retrieves paper metadata from arXiv.org using its official, public API. arXiv is the world’s largest open‑access repository of scientific papers (2+ million) covering physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
With this actor you can:
- Search by keywords (optional: restrict to title only)
- Search by author (full or partial name)
- Filter by category (e.g., `cs.LG`, `cs.AI`, `stat.ML`, `physics`, etc.)
- Apply date ranges (`date_from`, `date_to`)
- Sort by relevance, submission date, or update date
- Control batch size with `max_results`
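As a rough illustration of how these options could map onto an arXiv API request, the sketch below builds a query URL from actor-style inputs. The field prefixes (`all:`, `ti:`, `au:`, `cat:`) and the `sortBy`/`sortOrder` parameters come from the public arXiv API; the function name, the `SORT_MAP` table, and the choice to quote keywords as a phrase are assumptions made for this example, not the actor's actual code.

```python
from urllib.parse import urlencode

# Public arXiv API query endpoint.
ARXIV_API = "https://export.arxiv.org/api/query"

# Assumed mapping from the actor's sort_by values to arXiv API sortBy values.
SORT_MAP = {
    "relevance": "relevance",
    "date": "submittedDate",
    "updated": "lastUpdatedDate",
}


def build_query_url(search_query=None, author=None, category=None,
                    title_only=False, sort_by="relevance",
                    max_results=20, start=0):
    """Combine actor-style inputs into one arXiv API URL (illustrative only)."""
    clauses = []
    if search_query:
        # ti: restricts matching to the title, all: searches every metadata field.
        prefix = "ti" if title_only else "all"
        # Quoting treats the keywords as a phrase (a simplification).
        clauses.append(f'{prefix}:"{search_query}"')
    if author:
        clauses.append(f'au:"{author}"')
    if category:
        clauses.append(f"cat:{category}")
    params = {
        "search_query": " AND ".join(clauses),
        "start": start,
        "max_results": max_results,
        "sortBy": SORT_MAP[sort_by],
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(author="Yoshua Bengio", category="cs.LG", max_results=15)
print(url)
```

Joining the clauses with `AND` narrows the result set as each filter is added, which matches the "filter" wording above; a real implementation might offer `OR` as well.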
The output includes every field you need for literature analysis:
- Title, authors, abstract
- PDF URL, arXiv abstract URL, arXiv ID
- Publication date, last update date
- Primary category and all categories
- DOI, journal reference, comment/notes
The arXiv Metadata Collector respects arXiv’s rate limits (3 seconds between requests) and never uses proxies (arXiv blocks proxy IPs). It also includes a demo mode with 10 real sample papers so you can test the output instantly.
📦 What Data Can You Extract?
| 🧩 Data Type | 📋 Description |
|---|---|
| 📄 Title | Full paper title. |
| 👥 Authors | List of author names. |
| 📝 Abstract | Paper summary (plain text, cleaned). |
| 📥 PDF URL | Direct link to the PDF file. |
| 🔗 arXiv URL | Abstract page URL. |
| 🆔 arXiv ID | Unique identifier (e.g., 1706.03762). |
| 📅 Published Date | Original submission date (YYYY-MM-DD). |
| 📅 Updated Date | Last revision date (if any). |
| 🏷️ Primary Category | Main classification (e.g., cs.CL). |
| 🏷️ Categories | Array of all categories. |
| 🔗 DOI | Digital Object Identifier (if available). |
| 📖 Journal Ref | Journal or conference name (e.g., NeurIPS 2017). |
| 💬 Comment | Additional notes (e.g., “15 pages, 5 figures”). |
| 🏛️ Source | Always arXiv. |
⚙️ Key Features
- Official arXiv API – 100% compliant, no API key, no login.
- All‑fields search – Search across all metadata fields (title, abstract, authors, and so on) or restrict to title only.
- Author & category filters – Narrow down to specific researchers or research areas.
- Date range – Get papers published in a specific window.
- Sort options – Relevance, date submitted, or last updated.
- Rate‑limit safe – Built‑in 3‑second delay between requests to avoid `429` errors.
- No proxy – arXiv blocks datacenter proxies; the actor uses direct connections.
- Demo mode – Instantly see 10 real landmark papers (no API calls, no cost).
- Clean JSON – Ready for analysis, dashboards, or ingestion into databases.
- Pay‑per‑use – $0.60 per 1,000 results – no monthly commitment.
📥 Input Parameters
The actor accepts a JSON object with the following fields:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `demo_mode` | boolean | No | false | Return 10 sample papers (real, from arXiv). |
| `search_query` | string | No* | – | Keywords searched across all metadata fields (or the title only if `title_only` is true). |
| `author` | string | No* | – | Author name (e.g., Yann LeCun). |
| `category` | string | No* | – | arXiv category (e.g., `cs.LG`, `stat.ML`). Aliases: machine learning, ai, nlp, computer vision. |
| `title_only` | boolean | No | false | If true, `search_query` is matched only in the title. |
| `sort_by` | string | No | relevance | `relevance`, `date` (submitted), or `updated`. |
| `date_from` | string | No | – | Earliest publication date (YYYY-MM-DD). |
| `date_to` | string | No | – | Latest publication date (YYYY-MM-DD). |
| `max_results` | integer | No | 20 | Maximum number of papers to return. |
| `proxyConfiguration` | object | No | – | Ignored – arXiv requires direct connection. Do not use proxy. |
Note: Provide at least one of `search_query`, `author`, or `category`. If none is given, the actor defaults to `search_query: "machine learning"`.
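The defaulting rule in the note can be sketched as a small normalization step. This is a minimal illustration of the documented behaviour, assuming the defaults listed in the parameter table; `normalize_input` is a hypothetical helper, not a function exposed by the actor.

```python
def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults to a raw input dict (sketch only)."""
    params = {
        "demo_mode": bool(raw.get("demo_mode", False)),
        "title_only": bool(raw.get("title_only", False)),
        "sort_by": raw.get("sort_by", "relevance"),
        "max_results": int(raw.get("max_results", 20)),
    }
    # Keep optional string fields only when they carry a non-empty value.
    for key in ("search_query", "author", "category", "date_from", "date_to"):
        value = (raw.get(key) or "").strip()
        if value:
            params[key] = value
    # Documented fallback: with no query, author, or category,
    # the search defaults to "machine learning".
    if not any(k in params for k in ("search_query", "author", "category")):
        params["search_query"] = "machine learning"
    # Rejecting unknown sort values is an assumption, not documented behaviour.
    if params["sort_by"] not in ("relevance", "date", "updated"):
        raise ValueError(f"unsupported sort_by: {params['sort_by']!r}")
    return params


print(normalize_input({}))
print(normalize_input({"author": "Yann LeCun", "max_results": 5}))
```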
Example Input (Keyword Search)
{"search_query": "transformer attention","max_results": 10,"sort_by": "date"}
Example Input (Author + Category)
{"author": "Yoshua Bengio","category": "cs.LG","max_results": 15}
Example Input (Date Range)
{"category": "physics","date_from": "2023-01-01","date_to": "2023-12-31","max_results": 50}
📤 Output Fields
Each paper is returned as an object with the following possible fields:
| Field | Type | Description |
|---|---|---|
title | string | Paper title. |
authors | array | List of author names. |
abstract | string | Paper summary (cleaned). |
pdf_url | string | Direct PDF link. |
arxiv_url | string | Abstract page URL. |
arxiv_id | string | e.g., 1706.03762. |
published | string | YYYY-MM-DD. |
updated | string | YYYY-MM-DD (if revised). |
primary_category | string | Main category (e.g., cs.CL). |
categories | array | All categories (e.g., ["cs.CL","cs.LG"]). |
doi | string | Digital Object Identifier (or null). |
journal_ref | string | Journal/conference name (or null). |
comment | string | Additional notes (or null). |
source | string | Always "arXiv". |
Example Output
[{"title": "Attention Is All You Need","authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Lukasz Kaiser", "Illia Polosukhin"],"abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...","pdf_url": "https://arxiv.org/pdf/1706.03762","arxiv_url": "https://arxiv.org/abs/1706.03762","arxiv_id": "1706.03762","published": "2017-06-12","updated": "2023-08-02","primary_category": "cs.CL","categories": ["cs.CL", "cs.LG"],"doi": "10.48550/arXiv.1706.03762","journal_ref": "Advances in Neural Information Processing Systems 30, 2017","comment": "15 pages, 5 figures","source": "arXiv"}]
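Once exported, records in this shape can be processed with a few lines of Python. The sketch below assumes a JSON array like the example output; the second record is a made-up placeholder included only to show how `null` fields behave, and `summarize` is a hypothetical helper.

```python
import json
from collections import Counter

# Two records in the documented output shape. The second one is a
# hypothetical placeholder, not a real paper.
EXPORT = """
[
  {"title": "Attention Is All You Need",
   "primary_category": "cs.CL",
   "journal_ref": "Advances in Neural Information Processing Systems 30, 2017"},
  {"title": "Hypothetical preprint",
   "primary_category": "cs.LG",
   "journal_ref": null}
]
"""


def summarize(records):
    """Count papers per primary category and list titles with a journal reference."""
    by_category = Counter(r["primary_category"] for r in records)
    # journal_ref is null (None in Python) when arXiv has no journal reference.
    with_journal = [r["title"] for r in records if r.get("journal_ref")]
    return by_category, with_journal


records = json.loads(EXPORT)
by_category, with_journal = summarize(records)
print(by_category)
print(with_journal)
```

Using `r.get("journal_ref")` treats both a missing key and a `null` value as "no journal reference", which is the safer reading of "possible fields" above.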
💰 Pricing
| Metric | Price |
|---|---|
| Per 1,000 results | $0.60 |
| Minimum charge per run | $0.01 |
| Typical run (20 results) | $0.012 |
- You pay only for the number of papers returned (deduplicated).
- Free demo mode – set `demo_mode: true` to test the output without any cost.
- No monthly subscription – just pay as you go.
Example:
- 100,000 papers = 100 × $0.60 = $60 – enough for a large literature review or training dataset.
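The pricing arithmetic can be checked with a short estimate function. This sketch assumes the minimum charge acts as a simple floor on the per-result price; actual billing is determined by Apify and may differ.

```python
PRICE_PER_1000 = 0.60  # USD per 1,000 results, from the pricing table
MIN_CHARGE = 0.01      # USD per run, from the pricing table


def estimate_cost(num_results: int) -> float:
    """Estimated charge in USD, assuming the minimum is a floor (an assumption)."""
    return round(max(MIN_CHARGE, num_results * PRICE_PER_1000 / 1000), 4)


for n in (5, 20, 100_000):
    print(n, estimate_cost(n))
```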
🛠 How to Use on Apify
- Create a task with this actor.
- Set `demo_mode: true` to see sample output instantly.
- Adjust search parameters – keywords, author, category, date range.
- Choose sorting – relevance, date, or updated.
- Run – the actor calls the official arXiv API (direct, no proxy).
- Export – download results as JSON, CSV, or Excel.
Important: Do not enable proxy for this actor. arXiv blocks datacenter proxy IPs. The actor automatically ignores proxy settings.
Running via API
curl -X POST "https://api.apify.com/v2/acts/your-username~arxiv-metadata-collector/runs" -H "Content-Type: application/json" -H "Authorization: Bearer YOUR_API_TOKEN" -d '{"search_query": "generative adversarial networks","max_results": 10}'
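The same request can be prepared from Python with only the standard library. This is a minimal sketch mirroring the curl call; the actor ID and token are the same placeholders, `build_run_request` is a hypothetical helper, and the request is built but not sent.

```python
import json
import urllib.request


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST request that starts an actor run with the given input."""
    return urllib.request.Request(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_run_request(
    "your-username~arxiv-metadata-collector",  # placeholder actor ID
    "YOUR_API_TOKEN",                          # placeholder token
    {"search_query": "generative adversarial networks", "max_results": 10},
)
print(req.get_method(), req.full_url)
# To actually start the run, pass the request to urllib.request.urlopen(req).
```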
🎯 Use Cases
- Academic Research – Collect papers for systematic literature reviews.
- Machine Learning Dataset Building – Gather abstracts and metadata to train a research recommendation system.
- Competitive Intelligence – Track publications by specific authors or labs.
- Meta‑Analysis – Study publication trends over time by category.
- Content Aggregation – Build a custom search engine or daily arXiv digest.
- University Libraries – Provide an API endpoint for researchers to fetch paper metadata.
The arXiv Metadata Collector delivers high‑quality, structured data that powers everything from simple bibliography generation to complex scholarly analytics.
❓ Frequently Asked Questions
Q1. Do I need an API key?
No. arXiv’s API is completely open and free. No registration, no token.
Q2. Why can’t I use a proxy?
arXiv aggressively rate‑limits proxy IP addresses (even residential ones). The actor uses direct connections, which are faster and more reliable.
Q3. How fresh is the data?
The arXiv API index is refreshed once a day rather than in real time, so new papers typically become searchable the day after arXiv processes them.
Q4. What is the difference between published and updated?
- `published`: original submission date (the first version's date for revised papers).
- `updated`: date of the most recent revision (v2, v3, etc.). For papers with one version, `updated` equals `published`.
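Given the two fields, a revised paper can be detected by comparing dates. The sketch below assumes both fields are present in `YYYY-MM-DD` form, as in the output table; `was_revised` is a hypothetical helper.

```python
from datetime import date


def was_revised(paper: dict) -> bool:
    """True when the last update is later than the original submission date."""
    # Caveat: a revision posted on the submission day itself is not detected,
    # because the output carries dates without times.
    return date.fromisoformat(paper["updated"]) > date.fromisoformat(paper["published"])


paper = {"arxiv_id": "1706.03762", "published": "2017-06-12", "updated": "2023-08-02"}
print(was_revised(paper))
```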
Q5. Can I search by exact author name?
Yes. The API supports au:author_name. The actor encodes it automatically. You can enter partial names as well.
Q6. Are there any rate limits?
The actor respects arXiv’s recommendation of 3 seconds between requests. It also uses exponential backoff on 429 responses. For more than 1,000 papers, the actor will run for several minutes but will complete successfully.
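The delay-and-backoff pattern described here can be sketched as follows. The retry count, the doubling schedule, and the `fetch` callable returning a `(status, body)` pair are all assumptions for illustration; the actor's real retry logic is not documented in detail.

```python
import time


def fetch_with_backoff(fetch, max_retries=4, base_delay=3.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429/503.

    fetch is any callable returning (status_code, body). The sleep function
    is injectable so the schedule can be tested without waiting.
    """
    status = None
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status not in (429, 503):
            return status, body
        if attempt < max_retries:
            # Exponential backoff: 3s, 6s, 12s, ... (assumed schedule).
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {max_retries + 1} attempts (last status {status})")


# Simulated server: two throttled responses, then success.
responses = iter([(429, ""), (503, ""), (200, "ok")])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` applies the actual waits.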
Q7. What categories are supported?
All arXiv categories, including cs.*, math.*, physics.*, stat.*, eess.*, q-bio.*, q-fin.*, econ.*, and standalone archives such as astro-ph, cond-mat, and hep-th. You can also use friendly aliases: machine learning → cs.LG, ai → cs.AI, nlp → cs.CL, computer vision → cs.CV, etc.
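Alias resolution of this kind amounts to a dictionary lookup with a pass-through for unknown values. The table below contains only the four aliases named above; the case-insensitive matching is an assumption, and `resolve_category` is a hypothetical helper.

```python
# Only the aliases named in this document; the actor may support more.
CATEGORY_ALIASES = {
    "machine learning": "cs.LG",
    "ai": "cs.AI",
    "nlp": "cs.CL",
    "computer vision": "cs.CV",
}


def resolve_category(value: str) -> str:
    """Map a friendly alias to an arXiv category; pass unknown values through."""
    cleaned = value.strip()
    # Case-insensitive lookup is an assumption made for this sketch.
    return CATEGORY_ALIASES.get(cleaned.lower(), cleaned)


print(resolve_category("NLP"))
print(resolve_category("stat.ML"))
```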
Q8. Does it download the PDF file?
No. The actor returns only the PDF URL. You can use that URL to download the file separately. This keeps the actor fast and cost‑efficient.
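Downloading the files separately could look like the sketch below, which uses only the standard library. Both helpers are hypothetical, the `papers` directory is an arbitrary choice, and the download function is defined but not called here; when fetching many files, pause between requests to stay polite to arXiv.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def pdf_filename(pdf_url: str) -> str:
    """Derive a local file name such as '1706.03762.pdf' from a pdf_url."""
    # Note: for old-style IDs (e.g. hep-th/9901001) only the last path
    # segment is kept, so the archive prefix is dropped.
    name = urlsplit(pdf_url).path.rstrip("/").split("/")[-1]
    return name if name.endswith(".pdf") else f"{name}.pdf"


def download_pdf(pdf_url: str, target_dir: str = "papers") -> Path:
    """Download one PDF into target_dir and return the saved path."""
    target = Path(target_dir) / pdf_filename(pdf_url)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(pdf_url, timeout=30) as response:
        target.write_bytes(response.read())
    return target


print(pdf_filename("https://arxiv.org/pdf/1706.03762"))
```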
Q9. How accurate is the demo mode?
The demo mode shows real arXiv papers (the landmark ones that everyone cites). They are static and serve as an example of the output schema.
Q10. Can I get historical papers (e.g., from the 1990s)?
Yes, by setting date_from and date_to. arXiv’s API goes back to the very first paper (1991).
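In the arXiv API, a date window is expressed as a `submittedDate` range with `YYYYMMDDHHMM` bounds in GMT. The sketch below shows one way `date_from` and `date_to` could be turned into such a clause; padding the bounds to the start and end of the day is an assumption, and `submitted_date_clause` is a hypothetical helper.

```python
from datetime import date


def submitted_date_clause(date_from: str, date_to: str) -> str:
    """Build an arXiv API submittedDate range clause from two YYYY-MM-DD dates."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    if start > end:
        raise ValueError("date_from must not be later than date_to")
    # Assumed padding: 00:00 on the first day through 23:59 on the last day.
    return f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"


print(submitted_date_clause("1991-01-01", "1999-12-31"))
```

The resulting clause can be combined with other filters using `AND`, for example alongside a `cat:` clause.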
📝 Technical Notes
- Dependencies: `curl_cffi` (for connection handling) – bundled in the actor.
- Rate limiting: Automatic 3‑second delay between batches, plus exponential backoff on 429/503.
- No proxy: The actor ignores `proxyConfiguration`. Do not enable it – it will cause failures.
- Date filtering: Uses the `submittedDate` field in the API, which corresponds to the original submission date.
- Category aliases: The actor converts friendly names (e.g., `machine learning`) to official arXiv categories (e.g., `cs.LG`). Unknown aliases are passed as‑is.
- XML parsing: Uses Python’s built‑in `xml.etree.ElementTree` with namespaces.
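Namespace-aware parsing of an arXiv Atom response with `xml.etree.ElementTree` can be sketched as below. The two namespace URIs are the ones used by the arXiv API feed; the sample feed is a hand-trimmed fragment in that shape, and `parse_feed` with its reduced field set is an illustration, not the actor's parser.

```python
import xml.etree.ElementTree as ET

# Namespaces used in arXiv API Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-trimmed sample in the shape of an arXiv Atom feed (illustrative).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <title>Attention Is
      All You Need</title>
    <published>2017-06-12T17:57:34Z</published>
    <author><name>Ashish Vaswani</name></author>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""


def parse_feed(xml_text: str) -> list:
    """Extract a few fields from each <entry> of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", NS):
        primary = entry.find("arxiv:primary_category", NS)
        title = entry.findtext("atom:title", default="", namespaces=NS)
        papers.append({
            # Collapse line breaks and repeated spaces inside the title.
            "title": " ".join(title.split()),
            "authors": [a.findtext("atom:name", default="", namespaces=NS)
                        for a in entry.findall("atom:author", NS)],
            # Keep only the YYYY-MM-DD part of the timestamp.
            "published": entry.findtext("atom:published", default="", namespaces=NS)[:10],
            "primary_category": primary.get("term") if primary is not None else None,
            # Optional element: None when the entry has no DOI.
            "doi": entry.findtext("arxiv:doi", default=None, namespaces=NS),
        })
    return papers


papers = parse_feed(SAMPLE)
print(papers)
```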
Start collecting arXiv metadata today – $0.60 per 1,000 papers. No subscription.