arXiv Metadata Collector— Metadata, PDF, Authors & Abstract

Scrape arXiv research papers with metadata including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search, proxy integration, and structured dataset output for AI, ML, and academic research use.

Developer: Scrape Pilot (maintained by community)

📄 arXiv Metadata Collector — Research Paper Scraper (Official API)

Extract structured academic paper metadata from arXiv — titles, authors, abstracts, PDF links, categories, DOIs, and more.
The arXiv Metadata Collector uses the official arXiv API (no auth, no API key) to search by keyword, author, category, or date range. Get clean JSON output perfect for academic research, literature reviews, and dataset building.

Pricing: Only $0.60 per 1,000 results – pay only for what you use.


💡 What is the arXiv Metadata Collector?

The arXiv Metadata Collector is a professional Apify actor that retrieves paper metadata from arXiv.org using its official, public API. arXiv is one of the world’s largest open‑access repositories of scientific papers (2+ million), covering physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

With this actor you can:

  • Search by keywords (optional: restrict to title only)
  • Search by author (full or partial name)
  • Filter by category (e.g., cs.LG, cs.AI, stat.ML, physics, etc.)
  • Apply date ranges (date_from, date_to)
  • Sort by relevance, submission date, or update date
  • Control batch size with max_results

The output includes every field you need for literature analysis:

  • Title, authors, abstract
  • PDF URL, arXiv abstract URL, arXiv ID
  • Publication date, last update date
  • Primary category and all categories
  • DOI, journal reference, comment/notes

The arXiv Metadata Collector respects arXiv’s rate limits (3 seconds between requests) and never uses proxies (arXiv blocks proxy IPs). It also includes a demo mode with 10 real sample papers so you can test the output instantly.


📦 What Data Can You Extract?

| 🧩 Data Type | 📋 Description |
| --- | --- |
| 📄 Title | Full paper title. |
| 👥 Authors | List of author names. |
| 📝 Abstract | Paper summary (plain text, cleaned). |
| 📥 PDF URL | Direct link to the PDF file. |
| 🔗 arXiv URL | Abstract page URL. |
| 🆔 arXiv ID | Unique identifier (e.g., 1706.03762). |
| 📅 Published Date | Original submission date (YYYY-MM-DD). |
| 📅 Updated Date | Last revision date (if any). |
| 🏷️ Primary Category | Main classification (e.g., cs.CL). |
| 🏷️ Categories | Array of all categories. |
| 🔗 DOI | Digital Object Identifier (if available). |
| 📖 Journal Ref | Journal or conference name (e.g., NeurIPS 2017). |
| 💬 Comment | Additional notes (e.g., “15 pages, 5 figures”). |
| 🏛️ Source | Always arXiv. |

⚙️ Key Features

  • Official arXiv API – 100% compliant, no API key, no login.
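  • All-fields search – Search across all metadata fields (title, authors, abstract, and so on) or restrict to title only. The arXiv API does not search the full text of the PDFs.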
  • Author & category filters – Narrow down to specific researchers or research areas.
  • Date range – Get papers published in a specific window.
  • Sort options – Relevance, date submitted, or last updated.
  • Rate‑limit safe – Built‑in 3‑second delay between requests to avoid 429 errors.
  • No proxy – arXiv blocks datacenter proxies; the actor uses direct connections.
  • Demo mode – Instantly see 10 real landmark papers (no API calls, no cost).
  • Clean JSON – Ready for analysis, dashboards, or ingestion into databases.
  • Pay‑per‑use – $0.60 per 1,000 results – no monthly commitment.

📥 Input Parameters

The actor accepts a JSON object with the following fields:

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| demo_mode | boolean | No | false | Return 10 sample papers (real, from arXiv). |
| search_query | string | No* | | Keywords searched across all metadata fields (or the title only if title_only=true). |
| author | string | No* | | Author name (e.g., Yann LeCun). |
| category | string | No* | | arXiv category (e.g., cs.LG, stat.ML). Aliases: machine learning, ai, nlp, computer vision. |
| title_only | boolean | No | false | If true, search_query is matched only in the title. |
| sort_by | string | No | relevance | relevance, date (submitted), or updated. |
| date_from | string | No | | Earliest publication date (YYYY-MM-DD). |
| date_to | string | No | | Latest publication date (YYYY-MM-DD). |
| max_results | integer | No | 20 | Maximum number of papers to return. |
| proxyConfiguration | object | No | | Ignored – arXiv requires a direct connection. Do not use a proxy. |

Note: Provide at least one of: search_query, author, category. If none, the actor defaults to search_query: "machine learning".
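
The defaulting rules described above can be sketched in a few lines of Python. This is only an illustration: `normalize_input`, `DEFAULTS`, and `ALLOWED_SORTS` are hypothetical names, not part of the actor, and the sketch assumes the defaults from the table and the "machine learning" fallback from the note.

```python
from typing import Any, Dict

# Defaults copied from the parameter table (hypothetical constant name).
DEFAULTS: Dict[str, Any] = {
    "demo_mode": False,
    "title_only": False,
    "sort_by": "relevance",
    "max_results": 20,
}
ALLOWED_SORTS = {"relevance", "date", "updated"}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the documented defaults and the 'machine learning' fallback."""
    params = {**DEFAULTS, **raw}
    # proxyConfiguration is documented as ignored, so it is dropped here.
    params.pop("proxyConfiguration", None)
    if params["sort_by"] not in ALLOWED_SORTS:
        raise ValueError(f"sort_by must be one of {sorted(ALLOWED_SORTS)}")
    # At least one of these three is expected; otherwise fall back.
    if not any(params.get(key) for key in ("search_query", "author", "category")):
        params["search_query"] = "machine learning"
    return params
```

Calling `normalize_input({})` yields the fallback query with `max_results` set to 20, which mirrors the behaviour the note describes.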

Example Input (Keyword Search)

{
  "search_query": "transformer attention",
  "max_results": 10,
  "sort_by": "date"
}

Example Input (Author + Category)

{
  "author": "Yoshua Bengio",
  "category": "cs.LG",
  "max_results": 15
}

Example Input (Date Range)

{
  "category": "physics",
  "date_from": "2023-01-01",
  "date_to": "2023-12-31",
  "max_results": 50
}
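
To make the mapping from these inputs to the underlying API more concrete, here is a rough sketch of how such an input could be translated into an arXiv API query URL. The field prefixes (`all:`, `ti:`, `au:`, `cat:`), the `submittedDate:[… TO …]` range syntax, and the `sortBy`/`sortOrder` parameters come from the public arXiv API. Everything else is an assumption for illustration: the function name `build_query_url`, the `SORT_MAP` translation, the choice to quote keywords as a phrase, and the sentinel dates for open-ended ranges. The actor's real query construction may differ.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Assumed mapping from the actor's sort_by values to arXiv's sortBy values.
SORT_MAP = {
    "relevance": "relevance",
    "date": "submittedDate",
    "updated": "lastUpdatedDate",
}


def build_query_url(params: dict, start: int = 0) -> str:
    """Translate actor-style input into an arXiv API query URL (illustrative)."""
    clauses = []
    if params.get("search_query"):
        field = "ti" if params.get("title_only") else "all"
        # Quoting makes this a phrase search; that is a choice of this sketch.
        clauses.append(f'{field}:"{params["search_query"]}"')
    if params.get("author"):
        clauses.append(f'au:"{params["author"]}"')
    if params.get("category"):
        clauses.append(f'cat:{params["category"]}')
    if params.get("date_from") or params.get("date_to"):
        # arXiv expects YYYYMMDDHHMM in GMT; open ends get wide sentinel dates.
        lo = (params.get("date_from") or "1991-01-01").replace("-", "") + "0000"
        hi = (params.get("date_to") or "2099-12-31").replace("-", "") + "2359"
        clauses.append(f"submittedDate:[{lo} TO {hi}]")
    query = {
        "search_query": " AND ".join(clauses),
        "start": start,
        "max_results": params.get("max_results", 20),
        "sortBy": SORT_MAP[params.get("sort_by", "relevance")],
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(query)}"
```

The `start` argument is there because the arXiv API pages through results with `start` and `max_results`; a larger request would call this repeatedly with an increasing offset.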

📤 Output Fields

Each paper is returned as an object with the following possible fields:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Paper title. |
| authors | array | List of author names. |
| abstract | string | Paper summary (cleaned). |
| pdf_url | string | Direct PDF link. |
| arxiv_url | string | Abstract page URL. |
| arxiv_id | string | e.g., 1706.03762. |
| published | string | YYYY-MM-DD. |
| updated | string | YYYY-MM-DD (if revised). |
| primary_category | string | Main category (e.g., cs.CL). |
| categories | array | All categories (e.g., ["cs.CL","cs.LG"]). |
| doi | string | Digital Object Identifier (or null). |
| journal_ref | string | Journal/conference name (or null). |
| comment | string | Additional notes (or null). |
| source | string | Always "arXiv". |

Example Output

[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Lukasz Kaiser", "Illia Polosukhin"],
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...",
    "pdf_url": "https://arxiv.org/pdf/1706.03762",
    "arxiv_url": "https://arxiv.org/abs/1706.03762",
    "arxiv_id": "1706.03762",
    "published": "2017-06-12",
    "updated": "2023-08-02",
    "primary_category": "cs.CL",
    "categories": ["cs.CL", "cs.LG"],
    "doi": "10.48550/arXiv.1706.03762",
    "journal_ref": "Advances in Neural Information Processing Systems 30, 2017",
    "comment": "15 pages, 5 figures",
    "source": "arXiv"
  }
]
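
For downstream code it can help to load each record into a typed structure. The sketch below mirrors the output fields listed above; the `Paper` class, its `from_dict` constructor, and the `is_revised` property are hypothetical helpers, and the fallback of `updated` to `published` when `updated` is missing is an assumption about unrevised papers.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Paper:
    """Typed view of one output record (field names follow the table above)."""
    title: str
    authors: List[str]
    abstract: str
    pdf_url: str
    arxiv_url: str
    arxiv_id: str
    published: date
    updated: date
    primary_category: str
    categories: List[str]
    doi: Optional[str] = None
    journal_ref: Optional[str] = None
    comment: Optional[str] = None
    source: str = "arXiv"

    @classmethod
    def from_dict(cls, record: dict) -> "Paper":
        data = dict(record)
        # Dates arrive as YYYY-MM-DD strings; convert them to date objects.
        data["published"] = date.fromisoformat(record["published"])
        # Assumption: a missing or null `updated` means the paper was never revised.
        data["updated"] = date.fromisoformat(record.get("updated") or record["published"])
        return cls(**data)

    @property
    def is_revised(self) -> bool:
        """True when the latest version is newer than the first submission."""
        return self.updated > self.published
```

With the example record above, `Paper.from_dict(record).is_revised` is true because the `updated` date (2023-08-02) is later than `published` (2017-06-12).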

💰 Pricing

| Metric | Price |
| --- | --- |
| Per 1,000 results | $0.60 |
| Minimum charge per run | $0.01 |
| Typical run (20 results) | $0.012 |

  • You pay only for the number of papers returned (deduplicated).
  • Free demo mode – set demo_mode: true to test the output without any cost.
  • No monthly subscription – just pay as you go.

Example:

  • 100,000 papers = 100 × $0.60 = $60 – enough for a large literature review or training dataset.
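
The arithmetic behind these figures can be checked with a small estimator. This is a sketch under one stated assumption: that the minimum charge acts as a floor on the per-run cost. The function name `estimate_cost` and the constants are hypothetical; the prices are taken from the table above.

```python
PRICE_PER_1000 = 0.60  # USD per 1,000 results, from the pricing table
MIN_CHARGE = 0.01      # USD minimum per run, from the pricing table


def estimate_cost(num_results: int) -> float:
    """Estimate the charge for one run, treating the minimum as a floor (assumed)."""
    if num_results < 0:
        raise ValueError("num_results must be non-negative")
    return round(max(MIN_CHARGE, num_results / 1000 * PRICE_PER_1000), 4)


# 20 results      -> 0.012 (the "typical run" row)
# 100,000 results -> 60.0  (the example above)
```

Demo mode is documented as free, so it is outside the scope of this estimate.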

🛠 How to Use on Apify

  1. Create a task with this actor.
  2. Set demo_mode: true to see sample output instantly.
  3. Adjust search parameters – keywords, author, category, date range.
  4. Choose sorting – relevance, date, or updated.
  5. Run – the actor calls the official arXiv API (direct, no proxy).
  6. Export – download results as JSON, CSV, or Excel.

Important: Do not enable proxy for this actor. arXiv blocks datacenter proxy IPs. The actor automatically ignores proxy settings.

Running via API

curl -X POST "https://api.apify.com/v2/acts/your-username~arxiv-metadata-collector/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "search_query": "generative adversarial networks",
    "max_results": 10
  }'
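
The same request can be prepared from Python using only the standard library. The sketch below builds the request object without sending it; `build_run_request` is a hypothetical helper, and the actor ID and token are the same placeholders used in the curl command.

```python
import json
import urllib.request

APIFY_RUNS_URL = "https://api.apify.com/v2/acts/{actor_id}/runs"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare a POST request that starts an actor run with the given input."""
    return urllib.request.Request(
        APIFY_RUNS_URL.format(actor_id=actor_id),
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "your-username~arxiv-metadata-collector",  # placeholder actor ID
    "YOUR_API_TOKEN",                          # placeholder token
    {"search_query": "generative adversarial networks", "max_results": 10},
)
# Passing `request` to urllib.request.urlopen() would actually start the run.
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before any credits are spent.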

🎯 Use Cases

  • Academic Research – Collect papers for systematic literature reviews.
  • Machine Learning Dataset Building – Gather abstracts and metadata to train a research recommendation system.
  • Competitive Intelligence – Track publications by specific authors or labs.
  • Meta‑Analysis – Study publication trends over time by category.
  • Content Aggregation – Build a custom search engine or daily arXiv digest.
  • University Libraries – Provide an API endpoint for researchers to fetch paper metadata.

The arXiv Metadata Collector delivers high‑quality, structured data that powers everything from simple bibliography generation to complex scholarly analytics.
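
As a small illustration of the meta-analysis use case, the sketch below counts papers per publication year from a list of output records, optionally restricted to one category. The function name `papers_per_year` is hypothetical; it relies only on the `published` and `categories` fields from the output schema.

```python
from collections import Counter
from typing import Iterable, Optional


def papers_per_year(records: Iterable[dict], category: Optional[str] = None) -> Counter:
    """Count records by publication year, optionally filtered by category."""
    counts: Counter = Counter()
    for record in records:
        if category is not None and category not in record.get("categories", []):
            continue
        # `published` is a YYYY-MM-DD string, so the first four characters are the year.
        counts[int(record["published"][:4])] += 1
    return counts
```

Sorting the resulting counter by key gives a year-by-year series that can be plotted or exported directly.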


❓ Frequently Asked Questions

Q1. Do I need an API key?

No. arXiv’s API is completely open and free. No registration, no token.

Q2. Why can’t I use a proxy?

arXiv aggressively rate‑limits proxy IP addresses (even residential ones). The actor uses direct connections, which are faster and more reliable.

Q3. How fresh is the data?

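The arXiv API reflects the public arXiv listing, which is refreshed once a day with each announcement cycle rather than in real time. New papers usually become searchable after the next announcement, so repeating the same query more than once a day rarely returns new results.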

Q4. What is the difference between published and updated?

  • published: original submission date (or first version date for revised papers).
  • updated: date of the most recent revision (v2, v3, etc.). For papers with one version, updated equals published.

Q5. Can I search by exact author name?

Yes. The API supports au:author_name. The actor encodes it automatically. You can enter partial names as well.

Q6. Are there any rate limits?

The actor respects arXiv’s recommendation of 3 seconds between requests and uses exponential backoff on 429 responses. For more than 1,000 papers, expect the run to take several minutes because of these delays.
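
One way to picture this retry behaviour is the sketch below. It is not the actor's code: `fetch_with_backoff` is a hypothetical helper, the doubling schedule and retry limit are assumptions, and `fetch` stands for any callable that returns an HTTP status and body. The `sleep` parameter is injectable so the logic can be exercised without actually waiting.

```python
import time
from typing import Callable, Tuple

REQUEST_DELAY = 3.0     # seconds, the delay recommended for arXiv requests
RETRYABLE = {429, 503}  # throttling / temporarily unavailable


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = REQUEST_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it succeeds, doubling the wait after each throttled reply."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < max_retries:
            # Waits grow as base_delay * 1, 2, 4, ... (assumed schedule).
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still throttled after {max_retries} retries")
```

With the default base delay, two consecutive throttled responses lead to waits of 3 and then 6 seconds before the third attempt.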

Q7. What categories are supported?

All arXiv categories are accepted, including cs.*, math.*, physics.*, stat.*, q-bio.*, q-fin.*, eess.*, and econ.*. You can also use friendly aliases: machine learning → cs.LG, ai → cs.AI, nlp → cs.CL, computer vision → cs.CV, etc.
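
Alias resolution of this kind can be sketched as a simple lookup. The table below contains only the four aliases named in this README; the names `CATEGORY_ALIASES` and `resolve_category` are hypothetical, and the case-insensitive matching is an assumption. Unknown values are passed through unchanged, as described in the Technical Notes.

```python
# Only the aliases named in this README; the actor's full table is not documented here.
CATEGORY_ALIASES = {
    "machine learning": "cs.LG",
    "ai": "cs.AI",
    "nlp": "cs.CL",
    "computer vision": "cs.CV",
}


def resolve_category(value: str) -> str:
    """Map a friendly alias to an arXiv category; pass unknown values through."""
    cleaned = value.strip()
    # Case-insensitive lookup is an assumption of this sketch.
    return CATEGORY_ALIASES.get(cleaned.lower(), cleaned)
```

For example, `resolve_category("nlp")` returns `"cs.CL"`, while `resolve_category("stat.ML")` is returned as given.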

Q8. Does it download the PDF file?

No. The actor returns only the PDF URL. You can use that URL to download the file separately. This keeps the actor fast and cost‑efficient.
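
If you do want the files, a separate download step along these lines would work. This is a sketch with hypothetical helper names (`pdf_path`, `download_pdf`) that assumes each record carries the `arxiv_id` and `pdf_url` fields from the output schema. When fetching many files, it is sensible to pause between downloads in the same spirit as the API rate limit.

```python
import urllib.request
from pathlib import Path


def pdf_path(record: dict, out_dir: str = "pdfs") -> Path:
    """Choose a local file name for a record's PDF."""
    # Old-style IDs such as hep-th/9901001 contain a slash, which is not file-safe.
    safe_id = record["arxiv_id"].replace("/", "_")
    return Path(out_dir) / f"{safe_id}.pdf"


def download_pdf(record: dict, out_dir: str = "pdfs") -> Path:
    """Fetch the PDF behind `pdf_url` and write it to disk."""
    target = pdf_path(record, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(record["pdf_url"]) as response:
        target.write_bytes(response.read())
    return target
```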

Q9. How accurate is the demo mode?

The demo mode shows real arXiv papers (the landmark ones that everyone cites). They are static and serve as an example of the output schema.

Q10. Can I get historical papers (e.g., from the 1990s)?

Yes, by setting date_from and date_to. arXiv’s API goes back to the very first paper (1991).


📝 Technical Notes

  • Dependencies: curl_cffi (for connection handling) – bundled in the actor.
  • Rate limiting: Automatic 3‑second delay between batches, plus exponential backoff on 429/503.
  • No proxy: The actor ignores proxyConfiguration, so enabling a proxy has no effect. Requests to arXiv always use a direct connection.
  • Date filtering: Uses the submittedDate field in the API, which corresponds to the original submission date.
  • Category aliases: The actor converts friendly names (e.g., machine learning) to official arXiv categories (e.g., cs.LG). Unknown aliases are passed as‑is.
  • XML parsing: Uses Python’s built‑in xml.etree.ElementTree with namespaces.
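
To illustrate the XML parsing note, the sketch below reads an arXiv Atom feed with `xml.etree.ElementTree` and a namespace map. The two namespace URIs are the ones used by the arXiv API; the function names (`parse_entry`, `parse_feed`), the version-stripping rule for the ID, and the subset of fields extracted are assumptions of this example rather than the actor's actual implementation.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _text(entry: ET.Element, path: str) -> Optional[str]:
    """Return whitespace-normalised text for the first match, or None."""
    node = entry.find(path, NS)
    if node is None or not node.text:
        return None
    return " ".join(node.text.split())


def parse_entry(entry: ET.Element) -> dict:
    """Convert one Atom <entry> into a flat record (subset of the output fields)."""
    abs_url = _text(entry, "atom:id") or ""
    # e.g. http://arxiv.org/abs/1706.03762v7 -> 1706.03762
    arxiv_id = re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1])
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "title": _text(entry, "atom:title"),
        "authors": [" ".join(n.text.split())
                    for n in entry.findall("atom:author/atom:name", NS) if n.text],
        "abstract": _text(entry, "atom:summary"),
        "arxiv_id": arxiv_id,
        # Timestamps look like 2017-06-12T17:57:34Z; keep the date part only.
        "published": (_text(entry, "atom:published") or "")[:10],
        "updated": (_text(entry, "atom:updated") or "")[:10],
        "primary_category": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "doi": _text(entry, "arxiv:doi"),
        "source": "arXiv",
    }


def parse_feed(xml_text: str) -> List[dict]:
    """Parse a full API response into a list of records."""
    root = ET.fromstring(xml_text)
    return [parse_entry(entry) for entry in root.findall("atom:entry", NS)]
```

Titles and abstracts in the feed often contain line breaks and runs of spaces, which is why the text helper collapses whitespace before returning.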

Start collecting arXiv metadata today – $0.60 per 1,000 papers. No subscription.