arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts
Pricing
$12.99/month + usage
Scrape arXiv research papers with metadata including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search, proxy integration, and structured dataset output for AI, ML, and academic research.
Developer
Scrape Pilot
arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts
A fast and reliable Apify Actor to scrape research paper metadata from arXiv. Extract titles, authors, abstracts, PDF links, DOI, and categories using simple search queries.
🚀 Features
- 🔍 Search arXiv using keywords (e.g., "Machine Learning", "AI")
- 📄 Extract full metadata:
  - Title
  - Authors
  - Abstract
  - PDF URL
  - DOI
  - Published & Updated dates
  - Primary category
- ⚡ Fast and optimized scraping using the official arXiv API
- 🌐 Optional proxy support (Apify Proxy compatible)
- 📦 Clean JSON dataset output
- 🔄 Retry & delay handling for stable scraping
🧾 Input Schema
The actor accepts the following input in JSON format:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| search_query | string | Yes | – | Search keyword or phrase for arXiv. |
| max_results | integer | No | 20 | Maximum number of papers to fetch. |
| proxyConfiguration | object | No | None | Apify proxy settings (optional). |
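The defaults in the table above could be applied as in the following minimal sketch. The helper name `normalize_input` and its error messages are hypothetical and only illustrate the documented schema; the Actor's real validation code is not shown in this README.

```python
from typing import Any


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Validate Actor input and fill in the documented defaults (hypothetical helper)."""
    query = raw.get("search_query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("search_query is required and must be a non-empty string")

    max_results = raw.get("max_results", 20)  # documented default
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(max_results, bool) or not isinstance(max_results, int) or max_results < 1:
        raise ValueError("max_results must be a positive integer")

    return {
        "search_query": query.strip(),
        "max_results": max_results,
        # None means "no proxy", matching the documented default
        "proxyConfiguration": raw.get("proxyConfiguration"),
    }
```

Calling `normalize_input({"search_query": "Machine Learning"})` returns the query with `max_results` set to 20 and no proxy configuration.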
Example Input
```json
{
  "search_query": "Machine Learning",
  "max_results": 20,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
📤 Output Format
Each dataset item contains the following fields:
| Field | Type | Description |
|---|---|---|
| `title` | string | Full title of the paper. |
| `authors` | array[string] | List of authors. |
| `abstract` | string | Short summary of the paper. |
| `pdf_url` | string | Direct link to the PDF. |
| `published` | string | Original publication date (YYYY-MM-DD). |
| `updated` | string | Last updated date (if applicable). |
| `primary_category` | string | Main arXiv category (e.g., `cs.AI`). |
| `doi` | string | Digital Object Identifier (if available). |
| `source` | string | Always `"arXiv"`. |
Example Output
```json
[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Łukasz Kaiser", "Illia Polosukhin"],
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
    "pdf_url": "https://arxiv.org/pdf/1706.03762.pdf",
    "published": "2017-06-12",
    "updated": "2023-08-01",
    "primary_category": "cs.CL",
    "doi": "10.48550/arXiv.1706.03762",
    "source": "arXiv"
  }
]
```
⚙️ How It Works
- Takes the user’s search query.
- Fetches results from the official arXiv API.
- Parses the XML response and extracts structured metadata.
- Pushes each paper as an item to the Apify Dataset.
- Applies delays and retries to avoid rate limiting.
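The query and parsing steps above can be sketched with the Python standard library. This is an illustration under assumptions, not the Actor's actual code: the function names are hypothetical, the `all:` field prefix is one of several search options the arXiv API offers, and the output keys simply mirror the dataset fields documented in this README.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def build_query_url(search_query: str, max_results: int = 20) -> str:
    """Build an arXiv API URL that searches all fields for the given phrase."""
    params = {"search_query": f"all:{search_query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def _text(element, path):
    """Return whitespace-normalized text of a child element, or None if absent."""
    value = element.findtext(path, namespaces=NS)
    return " ".join(value.split()) if value else None


def parse_feed(xml_text: str) -> list[dict]:
    """Turn an arXiv Atom feed into dicts shaped like the dataset items."""
    items = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        links = entry.findall("atom:link", NS)
        category = entry.find("arxiv:primary_category", NS)
        items.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "abstract": _text(entry, "atom:summary"),
            # arXiv marks the PDF link with title="pdf"
            "pdf_url": next((l.get("href") for l in links if l.get("title") == "pdf"), None),
            # Keep only the date part of timestamps such as 2017-06-12T17:57:34Z
            "published": (_text(entry, "atom:published") or "")[:10] or None,
            "updated": (_text(entry, "atom:updated") or "")[:10] or None,
            "primary_category": category.get("term") if category is not None else None,
            "doi": _text(entry, "arxiv:doi"),  # absent for many papers
            "source": "arXiv",
        })
    return items
```

The response body fetched from `build_query_url(...)` (for example with `urllib.request.urlopen`) can be passed to `parse_feed`, and each returned dict would then be pushed to the dataset.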
🛡️ Proxy Handling
- Automatically uses Apify Proxy if configured in input.
- Helps avoid IP‑based rate limits and "Access Denied" issues.
- Smart fallback ensures reliability without SDK conflicts.
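As a rough illustration of optional proxy routing, the sketch below builds a standard-library opener that either goes through a proxy or connects directly. The helper name `make_opener` and the example proxy URL are hypothetical; how the Actor turns `proxyConfiguration` into a proxy URL, and how its fallback is implemented, is not documented here.

```python
import urllib.request
from typing import Optional


def make_opener(proxy_url: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Return an opener that uses proxy_url when given, else a direct connection."""
    if proxy_url:
        # Route both plain and TLS traffic through the same proxy
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    else:
        # An empty mapping disables proxies, including ones picked up from
        # environment variables such as HTTP_PROXY
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)
```

A caller could try `make_opener(proxy_url).open(url)` first and retry with `make_opener(None)` if the proxied request fails, which is one possible reading of the fallback described above.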
💡 Use Cases
- 📚 Academic research – quickly collect papers on a topic.
- 🤖 AI & ML dataset collection – build training datasets from abstracts.
- 🧠 Knowledge base building – curate a personal library of papers.
- 📊 Research trend analysis – monitor publication trends over time.
- 📰 Content aggregation – create newsletters or feeds for specific fields.
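For the trend analysis use case, a small sketch of counting papers per publication year from exported dataset items might look like this. It assumes items follow the documented output format, with `published` as a `YYYY-MM-DD` string; the function name `papers_per_year` is hypothetical.

```python
from collections import Counter


def papers_per_year(items: list[dict]) -> dict[int, int]:
    """Count dataset items by the year of their `published` date (YYYY-MM-DD)."""
    counts = Counter(
        int(item["published"][:4])  # first four characters are the year
        for item in items
        if item.get("published")  # skip items without a publication date
    )
    return dict(sorted(counts.items()))
```

The same pattern works for grouping by `primary_category` to compare activity across arXiv fields.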
⚠️ Notes
- Uses the official arXiv API – fully compliant and reliable.
- No login or cookies required; publicly available data.
- Rate‑limited with built‑in delays to respect arXiv’s usage policies.
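The delay and retry behavior could be sketched as a small wrapper with exponential backoff. The name `with_retries`, the attempt count, and the 3-second base delay are assumptions for illustration (arXiv's API guidance asks clients to pause roughly three seconds between requests); the Actor's actual settings are not documented in this README.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on OSError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 3 s, 6 s, 12 s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the wrapper can be tested without actually waiting.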
👨‍💻 Author
Built for developers, researchers, and data enthusiasts using Apify.
🔍 SEO Keywords
arxiv scraper, research papers scraper, academic data extractor, machine learning dataset scraper, ai research scraper, pdf metadata extractor, scientific papers api, research automation tool, arxiv api scraper