Pricing

$0.03 / 1,000 results

Arxiv Citation Network Scraper

A professional Apify Actor that scrapes academic papers from arXiv and builds citation networks. Extract paper metadata, analyze author collaborations, track research trends, and discover emerging topics in science and technology.

Developer

CodePoetry

Last modified: 3 months ago

arXiv Citation Network Scraper – Apify Actor

A production-grade Apify Actor that turns arXiv into a structured, analysis‑ready dataset: papers, authors, collaboration networks, and research trends. It is designed to be reliable enough for paying customers building research tools, AI pipelines, and analytics products.

🧩 What You Get When You Pay

High‑quality academic data: Clean, structured paper metadata with authors, categories, dates, and direct PDF links.
Network & trend insights: Built‑in author collaboration networks and topic trends so you don’t have to code analytics yourself.
Stable, monitored actor: Production‑oriented implementation with error handling and tests (test_actor.py) to keep runs predictable.
Time savings: No need to learn the arXiv API or parse Atom feeds / HTML yourself.
Flexible exports: Use the Apify dataset UI to export to JSON, CSV, Excel, or integrate via API.

What Is an Apify Actor?

An Apify Actor is a serverless micro‑app that runs in the Apify cloud. You don’t manage servers or scaling – you just configure input, run the actor, and consume the dataset/API.

What This Actor Does

This actor provides a complete academic research data pipeline:

Discovers papers – Searches arXiv using their official API with flexible filters.
Extracts metadata – Titles, abstracts, authors, categories, publication dates, PDF links.
Builds networks – Co‑authorship and author collaboration structures.
Analyzes trends – Top categories, prolific authors, and monthly publication volumes.
Delivers insights – All data is pushed into an Apify dataset with a friendly output schema.

Key Features

✅ Search by keywords, categories, or date ranges
✅ Structured paper metadata (title, abstract, authors, categories, dates, links)
✅ Author collaboration network analysis
✅ Research trend detection (top categories, authors, monthly volumes)
✅ Direct PDF download links (optional)
✅ No authentication required, uses the official arXiv API
✅ Output schema optimized for Apify UI (nice tables & views)

Typical Use Cases

For Researchers & Academics

Literature Review - Quickly gather papers on specific topics
Author Discovery - Find key researchers and collaboration networks
Trend Analysis - Identify emerging research areas
Citation Tracking - Collect DOI and journal-reference metadata as a starting point for citation analysis

For AI & Tech Companies

Training Data - Collect academic papers for AI model training
Research Intelligence - Track competitors and emerging technologies
Talent Discovery - Identify leading researchers for recruitment
Dataset Creation - Build curated research datasets

For Developers & Analysts

Academic Databases - Power search engines and research platforms
Visualization Tools - Feed network graphs and trend dashboards
API Integration - Automated research monitoring systems
Data Analysis - Export to CSV/Excel for custom analysis

How to Run This Actor on Apify

Open the actor on Apify.
In the Input tab, fill in the parameters (or use a template below).
Click Start.
When the run finishes, open the Dataset tab to explore the results in a friendly table view or export them.

Minimal Input Example

{
  "searchQuery": "machine learning",
  "category": "cs.AI",
  "maxPapers": 100,
  "extractCitations": true,
  "includePdfLink": true,
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31"
}

Input Parameters

Parameter	Type	Required	Description	Example
`searchQuery`	String	No	Keywords to search for	"neural networks"
`category`	String	No	arXiv category filter	"cs.AI", "cs.LG", "physics.quant-ph"
`maxPapers`	Integer	No	Max papers to scrape (1-1000)	100
`extractCitations`	Boolean	No	Extract citation metadata	true
`includePdfLink`	Boolean	No	Include PDF download URLs	true
`dateFrom`	String	No	Filter papers after date (YYYY-MM-DD)	"2024-01-01"
`dateTo`	String	No	Filter papers before date (YYYY-MM-DD)	"2024-12-31"

For the full list of categories, see the arXiv category taxonomy: https://arxiv.org/category_taxonomy
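
The constraints in the table above can be checked before a run is started. The following is a minimal sketch, not the actor's own validation code: `normalize_input` is a hypothetical helper, and the values in `DEFAULTS` are illustrative assumptions because the actor's real defaults are not documented here.

```python
from datetime import date

# Illustrative defaults only; the actor's actual defaults are an assumption here.
DEFAULTS = {
    "searchQuery": "",
    "category": "",
    "maxPapers": 100,
    "extractCitations": False,
    "includePdfLink": True,
    "dateFrom": None,
    "dateTo": None,
}


def normalize_input(raw: dict) -> dict:
    """Merge user input with defaults and check the documented constraints."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    cfg = {**DEFAULTS, **raw}

    # maxPapers is documented as an integer between 1 and 1000.
    if isinstance(cfg["maxPapers"], bool) or not isinstance(cfg["maxPapers"], int):
        raise ValueError("maxPapers must be an integer")
    if not 1 <= cfg["maxPapers"] <= 1000:
        raise ValueError("maxPapers must be between 1 and 1000")

    # Dates are documented as YYYY-MM-DD strings; parse them into date objects.
    for key in ("dateFrom", "dateTo"):
        if cfg[key] is not None:
            cfg[key] = date.fromisoformat(cfg[key])
    if cfg["dateFrom"] and cfg["dateTo"] and cfg["dateFrom"] > cfg["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")

    return cfg
```

Catching a bad `maxPapers` or a reversed date range locally avoids spending a run on input that cannot succeed.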

Ready‑Made Input Examples

Find recent AI papers:

{
  "category": "cs.AI",
  "maxPapers": 50,
  "dateFrom": "2024-01-01"
}

Search for quantum computing papers:

{
  "searchQuery": "quantum computing",
  "maxPapers": 30,
  "extractCitations": true
}

Track a specific author's work:

{
  "searchQuery": "Yoshua Bengio",
  "maxPapers": 20
}

Build a machine learning dataset:

{
  "category": "cs.LG",
  "maxPapers": 500,
  "dateFrom": "2023-01-01",
  "includePdfLink": true
}

Output Format (What You See in the Dataset)

The actor uses a dedicated output_schema.json so that the Apify UI shows clean, labeled columns and views.

Individual Paper Records

Each paper is returned as a structured JSON object:

{
  "arxiv_id": "2401.12345",
  "title": "Advances in Neural Network Architectures",
  "summary": "This paper presents novel approaches to neural network design...",
  "authors": [
    "Jane Doe",
    "John Smith",
    "Alice Johnson"
  ],
  "primary_category": "cs.LG",
  "categories": ["cs.LG", "cs.AI", "stat.ML"],
  "published": "2024-01-15",
  "updated": "2024-01-20",
  "url": "https://arxiv.org/abs/2401.12345",
  "pdf_url": "https://arxiv.org/pdf/2401.12345.pdf",
  "comment": "10 pages, 5 figures, accepted to NeurIPS 2024",
  "citation_data": {
    "arxiv_id": "2401.12345",
    "references_extracted": true,
    "doi": "10.1234/example",
    "journal_reference": "NeurIPS 2024"
  }
}

Author Network Analysis

{
  "type": "author_network",
  "data": {
    "author_papers": {
      "Jane Doe": ["2401.12345", "2312.54321"],
      "John Smith": ["2401.12345"]
    },
    "collaborations": [
      {
        "authors": ["Jane Doe", "John Smith"],
        "count": 3
      }
    ],
    "total_authors": 156,
    "total_collaborations": 423
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
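
To make the shape of this record concrete, here is a rough sketch of how the same structure could be derived from paper records. It is not the actor's implementation: `build_author_network` is a hypothetical helper, and it assumes that `total_collaborations` counts distinct author pairs, which the sample above does not confirm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_author_network(papers: list) -> dict:
    """Derive per-author paper lists and co-authorship counts from paper records."""
    author_papers = defaultdict(list)
    pair_counts = Counter()

    for paper in papers:
        # Sort the de-duplicated names so each pair has one canonical ordering.
        authors = sorted(set(paper.get("authors", [])))
        for author in authors:
            author_papers[author].append(paper["arxiv_id"])
        # Every unordered pair of co-authors on a paper is one collaboration.
        for pair in combinations(authors, 2):
            pair_counts[pair] += 1

    return {
        "author_papers": dict(author_papers),
        "collaborations": [
            {"authors": list(pair), "count": count}
            for pair, count in pair_counts.most_common()
        ],
        "total_authors": len(author_papers),
        # Assumption: distinct pairs, not the sum of all pair counts.
        "total_collaborations": len(pair_counts),
    }
```

The `collaborations` list maps directly onto a weighted edge list, so it can be loaded into a graph library or visualization tool without further reshaping.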

Trend Analysis

{
  "type": "trend_analysis",
  "data": {
    "top_categories": [
      {"category": "cs.LG", "count": 45},
      {"category": "cs.AI", "count": 38}
    ],
    "top_authors": [
      {"author": "Jane Doe", "papers": 5},
      {"author": "John Smith", "papers": 3}
    ],
    "papers_per_month": {
      "2024-01": 12,
      "2024-02": 15,
      "2024-03": 18
    },
    "total_papers": 100,
    "total_categories": 8,
    "unique_authors": 245
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
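
The aggregates in this record can be reproduced from the paper records if you want to recompute them over a custom subset. The sketch below is an illustration under stated assumptions, not the actor's code: `analyze_trends` is a hypothetical helper, it counts every listed category (not only `primary_category`), and it assumes `published` starts with `YYYY-MM` as in the sample paper record.

```python
from collections import Counter


def analyze_trends(papers: list, top_n: int = 10) -> dict:
    """Aggregate paper records into category, author, and monthly counts."""
    category_counts = Counter()
    author_counts = Counter()
    month_counts = Counter()

    for paper in papers:
        # Assumption: every listed category counts, not just the primary one.
        category_counts.update(paper.get("categories", []))
        # De-duplicate so an author is counted once per paper.
        author_counts.update(set(paper.get("authors", [])))
        published = paper.get("published", "")
        if len(published) >= 7:
            month_counts[published[:7]] += 1  # "2024-01-15" -> "2024-01"

    return {
        "top_categories": [
            {"category": name, "count": n}
            for name, n in category_counts.most_common(top_n)
        ],
        "top_authors": [
            {"author": name, "papers": n}
            for name, n in author_counts.most_common(top_n)
        ],
        "papers_per_month": dict(sorted(month_counts.items())),
        "total_papers": len(papers),
        "total_categories": len(category_counts),
        "unique_authors": len(author_counts),
    }
```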

Running Locally (Optional for Developers)

You don’t need this for normal paid use on Apify, but if you want to test or extend the actor locally:

pip install -r requirements.txt
python test_actor.py   # run the test suite
python test_local.py   # simple manual test

Using the Data in Your Product

Build internal research dashboards: plug the dataset into BI tools (Tableau, Power BI, Metabase).
Feed AI & LLM pipelines: use abstracts and metadata as high‑quality training or retrieval corpora.
Power academic search or recommendation features: index papers by topic, author, and time.
Track research signals: monitor new papers in specific categories over time.
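
Before feeding any of these uses, a consumer usually has to separate paper records from the two summary records, since they share one dataset. A minimal sketch follows; `split_dataset_items` is a hypothetical helper, and it assumes that paper records carry no `type` field, as in the sample output above.

```python
SUMMARY_TYPES = ("author_network", "trend_analysis")


def split_dataset_items(items: list) -> tuple:
    """Separate paper records from the network and trend summary records."""
    papers = []
    summaries = {}
    for item in items:
        kind = item.get("type")
        if kind in SUMMARY_TYPES:
            # Summary records wrap their payload in a "data" object.
            summaries[kind] = item["data"]
        else:
            # Assumption: anything without a known "type" is a paper record.
            papers.append(item)
    return papers, summaries
```

With the items split this way, `papers` can go straight into a search index or a DataFrame, while `summaries["trend_analysis"]` can drive a dashboard.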

Project Structure (For Technical Users)

arxiv-citation-network-scraper/
├── .actor/
│   ├── actor.json           # Actor metadata and configuration
│   └── input_schema.json    # Input form schema for Apify UI
├── src/
│   └── main.py             # Main actor code with scraping logic
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
├── .gitignore             # Git ignore rules
└── README.md              # This file

How It Works (Under the Hood)

Technical Flow

1. API Query Construction
   - Builds arXiv API query from user parameters
   - Supports keyword search, category filters, date ranges
   - Uses proper URL encoding and parameter formatting
2. Data Fetching
   - Fetches data from arXiv API (Atom feed format)
   - Parses XML/Atom using feedparser library
   - Handles pagination and rate limiting
3. Metadata Extraction
   - Extracts paper title, abstract, authors
   - Captures categories, dates, arXiv ID
   - Generates PDF and abstract page URLs
4. Citation Analysis
   - Scrapes individual paper pages for citation metadata
   - Extracts DOI and journal references when available
   - Builds citation network data structure
5. Network Building
   - Analyzes author collaboration patterns
   - Identifies co-authorship relationships
   - Counts collaboration frequency
6. Trend Analysis
   - Aggregates papers by category
   - Tracks publication trends over time
   - Identifies most prolific authors
7. Data Output
   - Pushes individual paper records to Apify dataset
   - Adds network analysis summary
   - Includes trend analysis report
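
The query construction step can be sketched as follows. This is an illustration of how an arXiv API URL is typically assembled, not the code in src/main.py: `build_query_url` is a hypothetical helper, and the choice to search all fields with `all:` and to sort by submission date is an assumption. The `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters and the `all:` and `cat:` field prefixes come from the arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query: str = "", category: str = "",
                    start: int = 0, max_results: int = 100) -> str:
    """Build an arXiv API URL from a keyword phrase and/or a category."""
    terms = []
    if search_query:
        # Quote the keywords so they are matched as a phrase across all fields.
        terms.append(f'all:"{search_query}"')
    if category:
        terms.append(f"cat:{category}")
    if not terms:
        raise ValueError("Provide a search query, a category, or both")

    params = {
        "search_query": " AND ".join(terms),
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",      # assumption: newest papers first
        "sortOrder": "descending",
    }
    # urlencode percent-encodes quotes and colons and turns spaces into "+".
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

The resulting URL returns an Atom feed, which is the format the Data Fetching step parses.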

Technical Details

Dependencies:

apify (>=2.1.0) - Apify SDK for Python
beautifulsoup4 (4.12.3) - HTML parsing for citation extraction
requests (2.31.0) - HTTP requests
lxml (>=5.3.0) - Fast XML/HTML parser
feedparser (>=6.0.11) - Atom/RSS feed parsing

API Information:

Source: arXiv.org official API
Documentation: https://arxiv.org/help/api
Rate Limits: 3 seconds between requests (handled automatically)
Max Results: the arXiv API allows up to 30,000 results per query, returned in slices of at most 2,000 per request (practical limit ~1000 per run for performance)
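
To show how the rate limit and result limits interact, the sketch below plans the sequence of paged requests for a run. It is an assumption about one reasonable approach, not the actor's actual paging logic: `plan_requests` and the page size of 100 are hypothetical.

```python
def plan_requests(max_papers: int, page_size: int = 100,
                  delay_seconds: float = 3.0):
    """Yield (start, max_results, delay_before) for each paged API request."""
    start = 0
    while start < max_papers:
        # The final page may be smaller than page_size.
        batch = min(page_size, max_papers - start)
        # No wait before the first request, then the documented 3-second gap.
        delay = 0.0 if start == 0 else delay_seconds
        yield start, batch, delay
        start += batch
```

A caller would sleep for `delay_before` seconds and then request the page at `start`. For example, `list(plan_requests(250))` yields three requests with offsets 0, 100, and 200, the last one asking for 50 results.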

Performance:

~50 papers: 10-20 seconds
~100 papers: 20-40 seconds
~500 papers: 2-3 minutes
Citation extraction adds ~0.5s per paper

Error Handling:

Network errors are caught and logged
Failed paper scrapes don't stop the actor
Graceful degradation for missing metadata
Detailed logging for debugging
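
The pattern behind these points can be sketched as a loop that isolates failures per paper. This is a generic illustration, not the actor's code: `enrich_papers` is a hypothetical helper, and `extract_citation_data` stands in for whatever function scrapes a single paper page.

```python
import logging

logger = logging.getLogger("arxiv_scraper")


def enrich_papers(papers: list, extract_citation_data) -> list:
    """Attach citation data to each paper; one failure never stops the loop."""
    failed_ids = []
    for paper in papers:
        try:
            paper["citation_data"] = extract_citation_data(paper)
        except Exception as exc:  # broad on purpose: any failure is non-fatal
            logger.warning(
                "Citation extraction failed for %s: %s",
                paper.get("arxiv_id"), exc,
            )
            # Degrade gracefully: keep the paper, mark the field as missing.
            paper["citation_data"] = None
            failed_ids.append(paper.get("arxiv_id"))
    return failed_ids
```

Returning the failed IDs lets the caller log a summary or retry only those papers.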

Limitations & Notes

arXiv abstract pages don't include full reference lists (would require PDF parsing)
Citation extraction is limited to metadata available on abstract pages
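Date filtering is applied after results are fetched rather than in the API query, so a run with a narrow date range may return fewer papers than maxPapers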
Large result sets (>1000 papers) may take several minutes
Some papers may have incomplete metadata
PDF links are direct URLs, not downloaded files
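
Post-fetch date filtering, as described in the list above, amounts to comparing each record's `published` value against the requested range. A minimal sketch, assuming inclusive bounds and the `YYYY-MM-DD` format shown in the sample output; `filter_by_date` is a hypothetical helper, not the actor's function.

```python
from datetime import date


def filter_by_date(papers: list, date_from: str = None,
                   date_to: str = None) -> list:
    """Keep papers whose published date falls within [date_from, date_to]."""
    # Missing bounds fall back to the widest possible range.
    start = date.fromisoformat(date_from) if date_from else date.min
    end = date.fromisoformat(date_to) if date_to else date.max

    kept = []
    for paper in papers:
        # Slice to the date part so full timestamps are also accepted.
        published = date.fromisoformat(paper["published"][:10])
        if start <= published <= end:
            kept.append(paper)
    return kept
```

Because filtering happens after fetching, the number of records kept can be smaller than the number fetched.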

API Integration (For Automation)

Once the actor is in your Apify account, you can start runs and read datasets via the Apify API. This makes it easy to plug the actor into your pipelines, cron jobs, and backend services.
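
As a rough sketch of such an integration, the code below prepares a call to the Apify API endpoint that runs an Actor synchronously and returns its dataset items. Only the standard library is used. The helper names are hypothetical, and the actor ID passed in must be replaced with the real one from your Apify account.

```python
import json
import urllib.request

APIFY_API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str,
                      run_input: dict) -> urllib.request.Request:
    """Prepare a POST request that runs the actor and returns dataset items."""
    url = f"{APIFY_API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict,
              timeout: float = 300) -> list:
    """Send the request and decode the returned JSON array of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The synchronous endpoint suits short runs. For larger values of `maxPapers`, starting the run asynchronously and reading the dataset once it finishes is likely the safer choice.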

Troubleshooting

No papers found:

Verify your search query and category are correct
Try broadening your search (remove filters)
Check if arXiv API is accessible: https://arxiv.org/help/api

Slow performance:

Reduce maxPapers parameter
Disable extractCitations for faster runs
Consider running during off-peak hours

Missing metadata:

Some papers have incomplete information on arXiv
This is normal and handled gracefully
Check the logs for specific issues

Local execution issues:

Ensure all dependencies are installed: pip install -r requirements.txt
Check Python version: python --version (needs 3.8+)
Verify internet connectivity to arxiv.org

Resources

arXiv.org - The source platform
arXiv API Documentation - Official API docs
Apify Platform - Run and scale actors
Apify Python SDK - SDK documentation
arXiv Category Taxonomy - All categories

FAQ

Q: Is this allowed under arXiv’s terms?
Yes. The actor uses the official arXiv API and respects its documented usage guidelines.

Q: How many papers can I fetch?
Practically up to around 1,000 per run for good performance. You can run the actor multiple times with different filters.

Q: Can I run this on a schedule?
Yes. On Apify you can set a schedule (e.g., daily) to keep your dataset up to date.

Q: Does it download PDFs?
It returns direct pdf_url links. You can then download PDFs in your own pipeline if needed.

Q: Who is this for?
Research teams, data scientists, AI/LLM engineers, and product teams who need reliable, structured arXiv data without maintaining their own scraper.

Ready to turn arXiv into actionable research data? Run the actor on Apify and start exploring.
