arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts

Pricing

$12.99/month + usage

Scrape arXiv research papers with metadata including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search, proxy integration, and structured dataset output for AI, ML, and academic research.

Developer

Scrape Pilot

Maintained by Community

arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts

License: MIT

An Apify Actor that scrapes research paper metadata from arXiv. It extracts titles, authors, abstracts, PDF links, DOIs, and categories from simple search queries.


🚀 Features

  • 🔍 Search arXiv using keywords (e.g., "Machine Learning", "AI")
  • 📄 Extract full metadata:
    • Title
    • Authors
    • Abstract
    • PDF URL
    • DOI
    • Published & Updated dates
    • Primary category
  • ⚡ Fast and optimized scraping using the official arXiv API
  • 🌐 Optional proxy support (Apify Proxy compatible)
  • 📦 Clean JSON dataset output
  • 🔄 Retry & delay handling for stable scraping

🧾 Input Schema

The actor accepts the following input in JSON format:

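| Field | Type | Required | Default | Description |
|----------------------|---------|----------|---------|-------------------------------------|
| `search_query` | string | Yes | | Search keyword or phrase for arXiv. |
| `max_results` | integer | No | 20 | Maximum number of papers to fetch. |
| `proxyConfiguration` | object | No | None | Apify proxy settings (optional). |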

Example Input

{
  "search_query": "Machine Learning",
  "max_results": 20,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
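
The input rules in the table above can be checked before any request is made. The following is a minimal sketch, not the Actor's actual code: the function name `normalize_input` and the constant `DEFAULT_MAX_RESULTS` are hypothetical, and the rejection of non-positive `max_results` values is an assumption.

```python
from __future__ import annotations

from typing import Any

DEFAULT_MAX_RESULTS = 20  # default taken from the input table


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Validate raw Actor input and fill in the documented default."""
    query = raw.get("search_query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("search_query is required and must be a non-empty string")

    max_results = raw.get("max_results", DEFAULT_MAX_RESULTS)
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int) or max_results < 1:
        raise ValueError("max_results must be a positive integer")

    return {
        "search_query": query.strip(),
        "max_results": max_results,
        # Optional; None means no proxy settings were supplied.
        "proxyConfiguration": raw.get("proxyConfiguration"),
    }
```

Calling `normalize_input({"search_query": "Machine Learning"})` returns the query with `max_results` set to 20 and `proxyConfiguration` set to `None`.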

📤 Output Format

Each dataset item contains the following fields:

| Field | Type | Description |
|--------------------|---------------|------------------------------------------------|
| `title` | string | Full title of the paper. |
| `authors` | array[string] | List of authors. |
| `abstract` | string | Short summary of the paper. |
| `pdf_url` | string | Direct link to the PDF. |
| `published` | string | Original publication date (YYYY-MM-DD). |
| `updated` | string | Last updated date (if applicable). |
| `primary_category` | string | Main arXiv category (e.g., `cs.AI`). |
| `doi` | string | Digital Object Identifier (if available). |
| `source` | string | Always `"arXiv"`. |

Example Output

```json
[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Łukasz Kaiser", "Illia Polosukhin"],
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
    "pdf_url": "https://arxiv.org/pdf/1706.03762.pdf",
    "published": "2017-06-12",
    "updated": "2023-08-01",
    "primary_category": "cs.CL",
    "doi": "10.48550/arXiv.1706.03762",
    "source": "arXiv"
  }
]
```

⚙️ How It Works

  1. Takes the user’s search query.
  2. Fetches results from the official arXiv API.
  3. Parses the XML response and extracts structured metadata.
  4. Pushes each paper as an item to the Apify Dataset.
  5. Applies delays and retries to avoid rate limiting.
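
Steps 1 to 3 can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the Actor's source: the function names are hypothetical, the `all:` field prefix is an assumed way of turning a keyword into an arXiv query, and the dates are truncated to `YYYY-MM-DD` to match the output table. Step 4 (pushing to the Apify Dataset) is left out because it depends on the Apify SDK.

```python
from __future__ import annotations

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
# XML namespaces used in the arXiv API's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def build_query_url(search_query: str, max_results: int = 20) -> str:
    """Build the API URL; searching all fields via `all:` is an assumption."""
    params = {"search_query": f"all:{search_query}", "start": 0, "max_results": max_results}
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_feed(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw Atom XML response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def _text(node: ET.Element, path: str) -> str | None:
    """Return the whitespace-normalized text at `path`, or None if absent."""
    found = node.find(path, NS)
    if found is None or found.text is None:
        return None
    return " ".join(found.text.split())


def parse_feed(xml_data: str | bytes) -> list[dict]:
    """Turn an Atom feed into items shaped like the output table."""
    root = ET.fromstring(xml_data)
    items = []
    for entry in root.findall("atom:entry", NS):
        # The PDF link is the <link> element whose title attribute is "pdf".
        pdf_url = None
        for link in entry.findall("atom:link", NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        category = entry.find("arxiv:primary_category", NS)
        published = _text(entry, "atom:published")
        updated = _text(entry, "atom:updated")
        items.append({
            "title": _text(entry, "atom:title"),
            "authors": [" ".join(n.text.split())
                        for n in entry.findall("atom:author/atom:name", NS) if n.text],
            "abstract": _text(entry, "atom:summary"),
            "pdf_url": pdf_url,
            # Keep only the date part of the ISO timestamp.
            "published": published[:10] if published else None,
            "updated": updated[:10] if updated else None,
            "primary_category": category.get("term") if category is not None else None,
            "doi": _text(entry, "arxiv:doi"),  # None when the feed has no DOI
            "source": "arXiv",
        })
    return items
```

A typical call would be `parse_feed(fetch_feed(build_query_url("Machine Learning", 20)))`. Keeping the parsing separate from the network call makes it possible to test the parser against a saved response.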

🛡️ Proxy Handling

  • Automatically uses Apify Proxy if configured in input.
  • Helps avoid IP‑based rate limits and "Access Denied" issues.
  • Smart fallback ensures reliability without SDK conflicts.
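
One possible reading of the proxy behaviour above is "use the proxy when it is both requested and available, otherwise connect directly". The sketch below illustrates that reading with the standard library. The function names are hypothetical, and `proxy_url` is assumed to be obtained elsewhere (for example from the Apify SDK); the Actor's real fallback logic is not documented here.

```python
from __future__ import annotations

import urllib.request


def select_proxies(proxy_configuration: dict | None, proxy_url: str | None) -> dict[str, str]:
    """Return a urllib-style proxy mapping, or {} for a direct connection."""
    wants_proxy = bool(proxy_configuration and proxy_configuration.get("useApifyProxy"))
    if wants_proxy and proxy_url:
        return {"http": proxy_url, "https": proxy_url}
    # Fallback: the proxy was not requested, or no proxy URL could be obtained.
    return {}


def make_opener(proxy_configuration: dict | None, proxy_url: str | None) -> urllib.request.OpenerDirector:
    """Build an opener that routes requests through the selected proxy, if any."""
    # An empty mapping tells urllib to skip proxies, including ones
    # that would otherwise be auto-detected from the environment.
    handler = urllib.request.ProxyHandler(select_proxies(proxy_configuration, proxy_url))
    return urllib.request.build_opener(handler)
```

Requests would then be made with `make_opener(config, url).open(...)` in place of `urllib.request.urlopen(...)`.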

💡 Use Cases

  • 📚 Academic research – quickly collect papers on a topic.
  • 🤖 AI & ML dataset collection – build training datasets from abstracts.
  • 🧠 Knowledge base building – curate a personal library of papers.
  • 📊 Research trend analysis – monitor publication trends over time.
  • 📰 Content aggregation – create newsletters or feeds for specific fields.

⚠️ Notes

  • Uses the official arXiv API and parses its XML responses instead of scraping HTML pages.
  • No login or cookies required; publicly available data.
  • Rate‑limited with built‑in delays to respect arXiv’s usage policies.
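
The delay and retry behaviour can be pictured as a small wrapper around each request. This is a sketch only: the helper name `with_retries`, the attempt count, the 3-second base delay, and the doubling backoff are all assumptions, since the Actor's actual values are not stated here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on network errors with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the wrapper can be tested without real waiting. In normal use it is left at its default.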

👨‍💻 Author

Built for developers, researchers, and data enthusiasts using Apify.

