Pricing

$12.99/month + usage

arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts

Scrape arXiv research papers with metadata including title, authors, abstract, PDF links, DOI, and categories. Supports keyword search, proxy integration, and structured dataset output for AI, ML, and academic research use.

Rating

0.0

(0)

Developer

Scrape Pilot

Last modified: 3 days ago

arXiv Papers Scraper — Metadata, PDF, Authors & Abstracts

A fast and reliable Apify Actor to scrape research paper metadata from arXiv. Extract titles, authors, abstracts, PDF links, DOI, and categories using simple search queries.

## 🚀 Features

- 🔍 Search arXiv using keywords (e.g., "Machine Learning", "AI")
- 📄 Extract full metadata:
  - Title
  - Authors
  - Abstract
  - PDF URL
  - DOI
  - Published & Updated dates
  - Primary category
- ⚡ Fast and optimized scraping using the official arXiv API
- 🌐 Optional proxy support (Apify Proxy compatible)
- 📦 Clean JSON dataset output
- 🔄 Retry & delay handling for stable scraping

## 🧾 Input Schema

The Actor accepts the following input in JSON format:

| Field                | Type    | Required | Default | Description                          |
|----------------------|---------|----------|---------|--------------------------------------|
| `search_query`       | string  | Yes      | –       | Search keyword or phrase for arXiv.  |
| `max_results`        | integer | No       | `20`    | Maximum number of papers to fetch.   |
| `proxyConfiguration` | object  | No       | None    | Apify proxy settings (optional).     |

### Example Input

```json
{
  "search_query": "Machine Learning",
  "max_results": 20,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
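As a rough illustration, an input like the one above could be assembled and checked in Python before starting a run. The helper name `build_run_input` and its `use_apify_proxy` parameter are hypothetical; only the three field names and the default of `20` come from the schema above.

```python
from typing import Any, Dict, Optional


def build_run_input(
    search_query: str,
    max_results: int = 20,
    use_apify_proxy: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build an input dict matching the schema above (hypothetical helper)."""
    query = search_query.strip()
    if not query:
        raise ValueError("search_query is required and must not be empty")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int) or max_results < 1:
        raise ValueError("max_results must be a positive integer")

    run_input: Dict[str, Any] = {"search_query": query, "max_results": max_results}
    # proxyConfiguration is optional, so it is only included when requested.
    if use_apify_proxy is not None:
        run_input["proxyConfiguration"] = {"useApifyProxy": use_apify_proxy}
    return run_input


example = build_run_input("Machine Learning", use_apify_proxy=True)
```

With the `apify-client` package, a dict like this is what would be passed as `run_input` when calling the Actor. The Actor ID is not stated in this README, so it is left out here.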

## 📤 Output Format

Each dataset item contains the following fields:

| Field              | Type          | Description                                    |
|--------------------|---------------|------------------------------------------------|
| `title`            | string        | Full title of the paper.                       |
| `authors`          | array[string] | List of authors.                               |
| `abstract`         | string        | Short summary of the paper.                    |
| `pdf_url`          | string        | Direct link to the PDF.                        |
| `published`        | string        | Original publication date (YYYY-MM-DD).        |
| `updated`          | string        | Last updated date (if applicable).             |
| `primary_category` | string        | Main arXiv category (e.g., `cs.AI`).           |
| `doi`              | string        | Digital Object Identifier (if available).      |
| `source`           | string        | Always `"arXiv"`.                              |
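
One way to hold these fields in typed Python code is a `TypedDict`. The class name `Paper` and the helper `missing_fields` are illustrative, not part of the Actor. The table does not say whether `doi` and `updated` are omitted, empty, or null when unavailable, so the helper only reports keys that are absent.

```python
from typing import Any, List, Mapping, TypedDict


class Paper(TypedDict):
    """Shape of one dataset item, following the field table above."""

    title: str
    authors: List[str]
    abstract: str
    pdf_url: str
    published: str
    updated: str
    primary_category: str
    doi: str
    source: str


def missing_fields(item: Mapping[str, Any]) -> List[str]:
    """Return the documented field names that are absent from a dataset item."""
    return [name for name in Paper.__annotations__ if name not in item]
```

A check like `missing_fields(item)` can be run over downloaded items before loading them into a stricter pipeline.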

### Example Output

```json
[
  {
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Łukasz Kaiser", "Illia Polosukhin"],
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
    "pdf_url": "https://arxiv.org/pdf/1706.03762.pdf",
    "published": "2017-06-12",
    "updated": "2023-08-01",
    "primary_category": "cs.CL",
    "doi": "10.48550/arXiv.1706.03762",
    "source": "arXiv"
  }
]
```

## ⚙️ How It Works

1. Takes the user’s search query.
2. Fetches results from the official arXiv API.
3. Parses the XML response and extracts structured metadata.
4. Pushes each paper as an item to the Apify Dataset.
5. Applies delays and retries to avoid rate limiting.
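
The steps above can be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the Actor's actual source: the `all:` search prefix, the function names, and the hand-written `SAMPLE_FEED` are made up for the example. The endpoint and the Atom namespaces are those of the public arXiv API.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

ARXIV_API = "http://export.arxiv.org/api/query"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def build_query_url(search_query: str, max_results: int = 20) -> str:
    """Turn the user's query into an arXiv API URL (assumes an `all:` search)."""
    params = {"search_query": f"all:{search_query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def fetch_feed(url: str) -> str:
    """Download the Atom feed as text (defined for completeness, not called here)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def _text(node: ET.Element, path: str) -> str:
    """Return the whitespace-normalised text of a child element, or an empty string."""
    found = node.find(path, NS)
    return " ".join(found.text.split()) if found is not None and found.text else ""


def parse_feed(xml_text: str) -> List[Dict[str, Any]]:
    """Map each Atom <entry> to the output fields documented in this README."""
    papers = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        pdf_links = [
            link.get("href", "")
            for link in entry.findall("atom:link", NS)
            if link.get("title") == "pdf"
        ]
        category = entry.find("arxiv:primary_category", NS)
        papers.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "abstract": _text(entry, "atom:summary"),
            "pdf_url": pdf_links[0] if pdf_links else "",
            "published": _text(entry, "atom:published")[:10],  # keep YYYY-MM-DD
            "updated": _text(entry, "atom:updated")[:10],
            "primary_category": category.get("term", "") if category is not None else "",
            "doi": _text(entry, "arxiv:doi"),  # empty when the entry has no DOI element
            "source": "arXiv",
        })
    return papers


# Hand-written sample shaped like an arXiv Atom entry; not a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <title>Attention Is All
      You Need</title>
    <summary>The dominant sequence transduction models ...</summary>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <published>2017-06-12T00:00:00Z</published>
    <updated>2023-08-01T00:00:00Z</updated>
    <link title="pdf" href="https://arxiv.org/pdf/1706.03762" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""

papers = parse_feed(SAMPLE_FEED)
```

Inside an Actor, each resulting dict would then be pushed to the default dataset, for example with the Apify SDK's `Actor.push_data`. Delays and retries are left out of this sketch.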

## 🛡️ Proxy Handling

- Uses Apify Proxy automatically when `proxyConfiguration` is set in the input.
- Helps avoid IP‑based rate limits and "Access Denied" errors.
- Smart fallback ensures reliability without SDK conflicts.
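
The README does not describe what the fallback does. One plausible reading, sketched here purely as an assumption, is "try the request through the proxy first, then retry directly if the proxy fails". The names `fetch_with_fallback` and `urllib_fetch` are hypothetical.

```python
import urllib.request
from typing import Callable, Optional


def urllib_fetch(url: str, proxy_url: Optional[str]) -> str:
    """Fetch a URL with urllib, optionally routing it through a proxy."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8")


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str, Optional[str]], str] = urllib_fetch,
    proxy_url: Optional[str] = None,
) -> str:
    """Try the proxy first when one is configured, then retry without it."""
    if proxy_url:
        try:
            return fetch(url, proxy_url)
        except OSError:
            # urllib's URLError and the built-in connection errors are OSError
            # subclasses, so a failing proxy falls through to the direct attempt.
            pass
    return fetch(url, None)
```

Passing the fetch function as a parameter keeps the fallback logic independent of any particular HTTP library or SDK.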

## 💡 Use Cases

- 📚 Academic research – quickly collect papers on a topic.
- 🤖 AI & ML dataset collection – build training datasets from abstracts.
- 🧠 Knowledge base building – curate a personal library of papers.
- 📊 Research trend analysis – monitor publication trends over time.
- 📰 Content aggregation – create newsletters or feeds for specific fields.
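
As a small example of the trend-analysis use case, dataset items can be grouped by publication year. This sketch assumes `published` is a `YYYY-MM-DD` string as described in the output format; the function name `papers_per_year` is illustrative.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def papers_per_year(items: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count dataset items per publication year, skipping items without a date."""
    counts = Counter(item["published"][:4] for item in items if item.get("published"))
    return dict(sorted(counts.items()))


# Made-up items that only carry the field this example needs.
sample_items = [
    {"published": "2017-06-12"},
    {"published": "2023-01-05"},
    {"published": "2017-11-30"},
    {"title": "no date on this one"},
]
yearly = papers_per_year(sample_items)
```

Here `yearly` maps each year to the number of papers published in it, in ascending year order.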

## ⚠️ Notes

- Uses the official arXiv API – fully compliant and reliable.
- No login or cookies required; publicly available data.
- Rate‑limited with built‑in delays to respect arXiv’s usage policies.
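
A minimal sketch of what "built-in delays and retries" could look like is shown below. The helper name, the retry count, and the linearly growing delay are assumptions for illustration. The 3-second base delay follows arXiv's published API guidance rather than any value documented for this Actor.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on network-style errors with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            # Timeouts and connection errors are OSError subclasses.
            if attempt == attempts:
                raise
            sleep(base_delay * attempt)  # 3 s, then 6 s, and so on
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the waiting can be replaced in tests. In normal use the default `time.sleep` applies.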

## 👨‍💻 Author

Built for developers, researchers, and data enthusiasts using Apify.

## 🔍 SEO Keywords

arxiv scraper, research papers scraper, academic data extractor, machine learning dataset scraper, ai research scraper, pdf metadata extractor, scientific papers api, research automation tool, arxiv api scraper
