Arxiv Citation Network Scraper

Pricing

Pay per event

A professional Apify Actor that scrapes academic papers from arXiv and builds citation networks. Extract paper metadata, analyze author collaborations, track research trends, and discover emerging topics in science and technology.

Developer

CodePoetry

Maintained by Community

arXiv Citation Network Scraper – Apify Actor

A production-grade Apify Actor that turns arXiv into a structured, analysis‑ready dataset: papers, authors, collaboration networks, and research trends. It is designed to be reliable enough for paying customers building research tools, AI pipelines, and analytics products.

🧩 What You Get When You Pay

  • High‑quality academic data: Clean, structured paper metadata with authors, categories, dates, and direct PDF links.
  • Network & trend insights: Built‑in author collaboration networks and topic trends so you don’t have to code analytics yourself.
  • Stable, monitored actor: Production‑oriented implementation with error handling and tests (test_actor.py) to keep runs predictable.
  • Time savings: No need to learn the arXiv API or parse Atom feeds / HTML yourself.
  • Flexible exports: Use the Apify dataset UI to export to JSON, CSV, Excel, or integrate via API.

What Is an Apify Actor?

An Apify Actor is a serverless micro‑app that runs in the Apify cloud. You don’t manage servers or scaling – you just configure input, run the actor, and consume the dataset/API.

What This Actor Does

This actor provides a complete academic research data pipeline:

  1. Discovers papers – Searches arXiv through its official API with flexible filters.
  2. Extracts metadata – Titles, abstracts, authors, categories, publication dates, PDF links.
  3. Builds networks – Co‑authorship and author collaboration structures.
  4. Analyzes trends – Top categories, prolific authors, and monthly publication volumes.
  5. Delivers insights – All data is pushed into an Apify dataset with a friendly output schema.

Key Features

  • ✅ Search by keywords, categories, or date ranges
  • ✅ Structured paper metadata (title, abstract, authors, categories, dates, links)
  • ✅ Author collaboration network analysis
  • ✅ Research trend detection (top categories, authors, monthly volumes)
  • ✅ Direct PDF download links (optional)
  • ✅ No authentication required, uses the official arXiv API
  • ✅ Output schema optimized for Apify UI (nice tables & views)

Typical Use Cases

For Researchers & Academics

  • Literature Review - Quickly gather papers on specific topics
  • Author Discovery - Find key researchers and collaboration networks
  • Trend Analysis - Identify emerging research areas
  • Citation Tracking - Collect DOI and journal-reference metadata as a starting point for citation networks and meta-analysis

For AI & Tech Companies

  • Training Data - Collect academic papers for AI model training
  • Research Intelligence - Track competitors and emerging technologies
  • Talent Discovery - Identify leading researchers for recruitment
  • Dataset Creation - Build curated research datasets

For Developers & Analysts

  • Academic Databases - Power search engines and research platforms
  • Visualization Tools - Feed network graphs and trend dashboards
  • API Integration - Automated research monitoring systems
  • Data Analysis - Export to CSV/Excel for custom analysis

How to Run This Actor on Apify

  1. Open the actor on Apify.
  2. In the Input tab, fill in the parameters (or use a template below).
  3. Click Start.
  4. When the run finishes, open the Dataset tab to explore the results in a friendly table view or export them.

Minimal Input Example

{
  "searchQuery": "machine learning",
  "category": "cs.AI",
  "maxPapers": 100,
  "extractCitations": true,
  "includePdfLink": true,
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31"
}

Input Parameters

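| Parameter        | Type    | Required | Description                           | Example                                |
|------------------|---------|----------|---------------------------------------|----------------------------------------|
| searchQuery      | String  | No       | Keywords to search for                | "neural networks"                      |
| category         | String  | No       | arXiv category filter                 | "cs.AI", "cs.LG", "physics.quant-ph"   |
| maxPapers        | Integer | No       | Max papers to scrape (1-1000)         | 100                                    |
| extractCitations | Boolean | No       | Extract citation metadata             | true                                   |
| includePdfLink   | Boolean | No       | Include PDF download URLs             | true                                   |
| dateFrom         | String  | No       | Filter papers after date (YYYY-MM-DD) | "2024-01-01"                           |
| dateTo           | String  | No       | Filter papers before date (YYYY-MM-DD)| "2024-12-31"                           |

Common arXiv categories:
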
  • cs.AI - Artificial Intelligence
  • cs.LG - Machine Learning
  • cs.CV - Computer Vision
  • cs.CL - Computation and Language (NLP)
  • cs.RO - Robotics
  • physics.quant-ph - Quantum Physics
  • math.CO - Combinatorics
  • stat.ML - Machine Learning (Statistics)

Full list of categories: https://arxiv.org/category_taxonomy
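
The parameter table above can be checked in code before a run is started. The following is a minimal sketch, not the actor's own validation: `normalize_input` is a hypothetical helper, and the `DEFAULTS` values are assumptions made for illustration (the real defaults live in the actor's input schema and may differ).

```python
from datetime import date

# Assumed defaults, for illustration only; the actor's input schema is authoritative.
DEFAULTS = {
    "searchQuery": "",
    "category": "",
    "maxPapers": 100,
    "extractCitations": False,
    "includePdfLink": True,
    "dateFrom": None,
    "dateTo": None,
}


def normalize_input(raw: dict) -> dict:
    """Validate an input dict against the documented parameter table."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    params = {**DEFAULTS, **raw}

    # maxPapers is documented as an integer in the range 1-1000.
    max_papers = params["maxPapers"]
    if isinstance(max_papers, bool) or not isinstance(max_papers, int):
        raise ValueError("maxPapers must be an integer")
    if not 1 <= max_papers <= 1000:
        raise ValueError("maxPapers must be between 1 and 1000")

    # Dates are documented as YYYY-MM-DD strings; fromisoformat rejects bad ones.
    for key in ("dateFrom", "dateTo"):
        if params[key] is not None:
            params[key] = date.fromisoformat(params[key])

    if params["dateFrom"] and params["dateTo"] and params["dateFrom"] > params["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")

    return params
```

Validating locally like this catches typos in parameter names and out-of-range values before a paid run is triggered.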

Ready‑Made Input Examples

Find recent AI papers:

{
  "category": "cs.AI",
  "maxPapers": 50,
  "dateFrom": "2024-01-01"
}

Search for quantum computing papers:

{
  "searchQuery": "quantum computing",
  "maxPapers": 30,
  "extractCitations": true
}

Track a specific author's work:

{
  "searchQuery": "Yoshua Bengio",
  "maxPapers": 20
}

Build a machine learning dataset:

{
  "category": "cs.LG",
  "maxPapers": 500,
  "dateFrom": "2023-01-01",
  "includePdfLink": true
}

Output Format (What You See in the Dataset)

The actor uses a dedicated output_schema.json so that the Apify UI shows clean, labeled columns and views.

Individual Paper Records

Each paper is returned as a structured JSON object:

{
  "arxiv_id": "2401.12345",
  "title": "Advances in Neural Network Architectures",
  "summary": "This paper presents novel approaches to neural network design...",
  "authors": [
    "Jane Doe",
    "John Smith",
    "Alice Johnson"
  ],
  "primary_category": "cs.LG",
  "categories": ["cs.LG", "cs.AI", "stat.ML"],
  "published": "2024-01-15",
  "updated": "2024-01-20",
  "url": "https://arxiv.org/abs/2401.12345",
  "pdf_url": "https://arxiv.org/pdf/2401.12345.pdf",
  "comment": "10 pages, 5 figures, accepted to NeurIPS 2024",
  "citation_data": {
    "arxiv_id": "2401.12345",
    "references_extracted": true,
    "doi": "10.1234/example",
    "journal_reference": "NeurIPS 2024"
  }
}

Author Network Analysis

{
  "type": "author_network",
  "data": {
    "author_papers": {
      "Jane Doe": ["2401.12345", "2312.54321"],
      "John Smith": ["2401.12345"]
    },
    "collaborations": [
      {
        "authors": ["Jane Doe", "John Smith"],
        "count": 3
      }
    ],
    "total_authors": 156,
    "total_collaborations": 423
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
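
A record of this shape can be derived from the paper records alone. The sketch below is one plausible way to do it, not the actor's actual implementation: `build_author_network` is a hypothetical name, and it assumes that `total_collaborations` counts distinct co-author pairs and that `count` is the number of papers a pair shares.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_author_network(papers):
    """Summarize co-authorship from paper records with 'arxiv_id' and 'authors'."""
    author_papers = defaultdict(list)
    pair_counts = Counter()

    for paper in papers:
        # Sort and de-duplicate so that (A, B) and (B, A) map to the same pair.
        authors = sorted(set(paper.get("authors", [])))
        for author in authors:
            author_papers[author].append(paper["arxiv_id"])
        # Every unordered pair of co-authors on a paper counts once for that paper.
        for pair in combinations(authors, 2):
            pair_counts[pair] += 1

    return {
        "author_papers": dict(author_papers),
        "collaborations": [
            {"authors": list(pair), "count": count}
            for pair, count in pair_counts.most_common()
        ],
        "total_authors": len(author_papers),
        # Assumption: distinct pairs, not the sum of all pair counts.
        "total_collaborations": len(pair_counts),
    }
```

Note that the number of pairs grows quadratically with the author count of a paper, so papers with very long author lists dominate the collaboration list.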

Trend Analysis

{
  "type": "trend_analysis",
  "data": {
    "top_categories": [
      {"category": "cs.LG", "count": 45},
      {"category": "cs.AI", "count": 38}
    ],
    "top_authors": [
      {"author": "Jane Doe", "papers": 5},
      {"author": "John Smith", "papers": 3}
    ],
    "papers_per_month": {
      "2024-01": 12,
      "2024-02": 15,
      "2024-03": 18
    },
    "total_papers": 100,
    "total_categories": 8,
    "unique_authors": 245
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
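
The same aggregates can be recomputed from exported paper records, which is useful when combining several runs. This is a sketch under stated assumptions rather than the actor's code: `analyze_trends` and `top_n` are hypothetical names, and it assumes that every entry in `categories` (not only `primary_category`) is counted.

```python
from collections import Counter


def analyze_trends(papers, top_n=10):
    """Aggregate a list of paper records into category, author, and monthly counts."""
    category_counts = Counter()
    author_counts = Counter()
    month_counts = Counter()

    for paper in papers:
        category_counts.update(paper.get("categories", []))
        author_counts.update(paper.get("authors", []))
        published = paper.get("published")
        if published:
            month_counts[published[:7]] += 1  # "2024-01-15" -> "2024-01"

    return {
        "top_categories": [
            {"category": c, "count": n} for c, n in category_counts.most_common(top_n)
        ],
        "top_authors": [
            {"author": a, "papers": n} for a, n in author_counts.most_common(top_n)
        ],
        # Sorted by month so the series reads chronologically.
        "papers_per_month": dict(sorted(month_counts.items())),
        "total_papers": len(papers),
        "total_categories": len(category_counts),
        "unique_authors": len(author_counts),
    }
```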

Running Locally (Optional for Developers)

You don’t need this for normal paid use on Apify, but if you want to test or extend the actor locally:

pip install -r requirements.txt
python test_actor.py # run the test suite
python test_local.py # simple manual test

Using the Data in Your Product

  • Build internal research dashboards: plug the dataset into BI tools (Tableau, Power BI, Metabase).
  • Feed AI & LLM pipelines: use abstracts and metadata as high‑quality training or retrieval corpora.
  • Power academic search or recommendation features: index papers by topic, author, and time.
  • Track research signals: monitor new papers in specific categories over time.

Project Structure (For Technical Users)

arxiv-citation-network-scraper/
├── .actor/
│   ├── actor.json          # Actor metadata and configuration
│   └── input_schema.json   # Input form schema for Apify UI
├── src/
│   └── main.py             # Main actor code with scraping logic
├── Dockerfile              # Container configuration
├── requirements.txt        # Python dependencies
├── .gitignore              # Git ignore rules
└── README.md               # This file

How It Works (Under the Hood)

Technical Flow

  1. API Query Construction

    • Builds arXiv API query from user parameters
    • Supports keyword search, category filters, date ranges
    • Uses proper URL encoding and parameter formatting
  2. Data Fetching

    • Fetches data from arXiv API (Atom feed format)
    • Parses XML/Atom using feedparser library
    • Handles pagination and rate limiting
  3. Metadata Extraction

    • Extracts paper title, abstract, authors
    • Captures categories, dates, arXiv ID
    • Generates PDF and abstract page URLs
  4. Citation Analysis

    • Scrapes individual paper pages for citation metadata
    • Extracts DOI and journal references when available
    • Builds citation network data structure
  5. Network Building

    • Analyzes author collaboration patterns
    • Identifies co-authorship relationships
    • Counts collaboration frequency
  6. Trend Analysis

    • Aggregates papers by category
    • Tracks publication trends over time
    • Identifies most prolific authors
  7. Data Output

    • Pushes individual paper records to Apify dataset
    • Adds network analysis summary
    • Includes trend analysis report
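
Steps 1 to 3 of the flow above can be sketched in a few lines. This is an illustration, not the actor's source: the actor is described as using feedparser, whereas this sketch swaps in the standard library's `xml.etree.ElementTree` to stay dependency-free. `build_query_url` and `parse_feed` are hypothetical names, and the choice to quote the keywords as a phrase and sort by submission date is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(search_query="", category="", start=0, max_results=100):
    """Combine keyword and category filters into an arXiv API query URL."""
    terms = []
    if search_query:
        terms.append(f'all:"{search_query}"')  # phrase search across all fields
    if category:
        terms.append(f"cat:{category}")
    if not terms:
        raise ValueError("Provide a search query, a category, or both")
    params = {
        "search_query": " AND ".join(terms),
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"  # urlencode handles escaping


def parse_feed(atom_xml):
    """Turn an Atom response body into a list of paper dicts."""
    records = []
    for entry in ET.fromstring(atom_xml).iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="").strip()
        records.append({
            # The feed's id usually carries a version suffix such as "v1".
            "arxiv_id": abs_url.rsplit("/abs/", 1)[-1],
            # Titles and abstracts are line-wrapped in the feed; collapse whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "summary": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
            "authors": [
                a.findtext(f"{ATOM}name", default="").strip()
                for a in entry.iter(f"{ATOM}author")
            ],
            "categories": [c.get("term") for c in entry.iter(f"{ATOM}category")],
            "published": entry.findtext(f"{ATOM}published", default="")[:10],
            "url": abs_url,
        })
    return records
```

Fetching the URL (for example with `requests.get(url, timeout=30).content`) and passing the body to `parse_feed` connects the two halves.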

Technical Details

Dependencies:

  • apify (>=2.1.0) - Apify SDK for Python
  • beautifulsoup4 (4.12.3) - HTML parsing for citation extraction
  • requests (2.31.0) - HTTP requests
  • lxml (>=5.3.0) - Fast XML/HTML parser
  • feedparser (>=6.0.11) - Atom/RSS feed parsing

API Information:

  • Source: arXiv.org official API
  • Documentation: https://arxiv.org/help/api
  • Rate Limits: 3 seconds between requests (handled automatically)
  • Max Results: the arXiv API allows up to 30,000 per query; this actor caps maxPapers at 1,000 for performance
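
To illustrate how paging and the 3-second delay fit together, here is a minimal sketch. It is not taken from the actor: `page_plan`, `fetch_all`, the page size of 100, and the `fetch_page(start, count)` callback are all assumptions made for the example.

```python
import time


def page_plan(max_papers, page_size=100):
    """Yield (start, max_results) pairs that together cover max_papers results."""
    for start in range(0, max_papers, page_size):
        yield start, min(page_size, max_papers - start)


def fetch_all(fetch_page, max_papers, page_size=100, delay_seconds=3.0, sleep=time.sleep):
    """Collect up to max_papers results, pausing between consecutive requests.

    fetch_page(start, count) is a caller-supplied function returning a list.
    """
    results = []
    for index, (start, count) in enumerate(page_plan(max_papers, page_size)):
        if index > 0:
            sleep(delay_seconds)  # stay within the documented request spacing
        page = fetch_page(start, count)
        results.extend(page)
        if len(page) < count:
            break  # a short page means the result set is exhausted
    return results[:max_papers]
```

Passing `sleep` as a parameter keeps the function testable without real waiting.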

Performance:

  • ~50 papers: 10-20 seconds
  • ~100 papers: 20-40 seconds
  • ~500 papers: 2-3 minutes
  • Citation extraction adds ~0.5s per paper

Error Handling:

  • Network errors are caught and logged
  • Failed paper scrapes don't stop the actor
  • Graceful degradation for missing metadata
  • Detailed logging for debugging
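
The pattern described above, where one failed paper does not abort the run, might look like the following. This is a sketch of the general technique only: `enrich_papers`, the logger name, and the `extract_citation_data` callback are hypothetical, and setting `citation_data` to `None` on failure is an assumed convention.

```python
import logging

logger = logging.getLogger("arxiv_scraper")


def enrich_papers(papers, extract_citation_data):
    """Attach citation data to each paper, tolerating per-paper failures."""
    enriched = []
    for paper in papers:
        record = dict(paper)  # copy so the input records stay untouched
        try:
            record["citation_data"] = extract_citation_data(paper["arxiv_id"])
        except Exception as exc:
            # Log and continue: one bad page should not stop the whole run.
            logger.warning(
                "Citation extraction failed for %s: %s", paper.get("arxiv_id"), exc
            )
            record["citation_data"] = None
        enriched.append(record)
    return enriched
```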

Limitations & Notes

  • arXiv abstract pages don't include full reference lists (would require PDF parsing)
  • Citation extraction is limited to metadata available on abstract pages
  • Date filtering is applied post-fetch (API limitations), so a run with a narrow date range may return fewer papers than maxPapers
  • Large result sets (approaching the 1,000-paper cap) may take several minutes
  • Some papers may have incomplete metadata
  • PDF links are direct URLs, not downloaded files
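
Post-fetch date filtering, as noted in the list above, amounts to a simple pass over the fetched records. The sketch below assumes inclusive bounds and drops records without a usable `published` value; both choices, and the name `filter_by_date`, are assumptions rather than documented behavior.

```python
from datetime import date


def filter_by_date(papers, date_from=None, date_to=None):
    """Keep papers whose 'published' date falls within [date_from, date_to]."""
    start = date.fromisoformat(date_from) if date_from else date.min
    end = date.fromisoformat(date_to) if date_to else date.max
    kept = []
    for paper in papers:
        try:
            published = date.fromisoformat(paper["published"][:10])
        except (KeyError, TypeError, ValueError):
            continue  # assumption: records without a usable date are dropped
        if start <= published <= end:
            kept.append(paper)
    return kept
```

Because the filter runs after fetching, the same function can be applied again to an exported dataset to narrow the range without a new run.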

API Integration (For Automation)

Once the actor is in your Apify account, you can start runs and read datasets via the Apify API. This makes it easy to plug the actor into your pipelines, cron jobs, and backend services.
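
A run can be started from Python with the `apify-client` package. The sketch below is an illustration under assumptions: the actor ID passed to `run_actor` is something you look up on the actor's Apify page, the token is read from an `APIFY_TOKEN` environment variable, and `build_run_input` and `split_items` are hypothetical helpers. Separating summaries by the presence of a `type` field follows the output examples shown earlier.

```python
import os


def build_run_input(category, max_papers=50, date_from=None):
    """Assemble an input dict using the documented parameter names."""
    run_input = {"category": category, "maxPapers": max_papers}
    if date_from:
        run_input["dateFrom"] = date_from
    return run_input


def split_items(items):
    """Separate paper records from summary records (those carrying a 'type')."""
    papers, summaries = [], {}
    for item in items:
        if "type" in item:
            summaries[item["type"]] = item.get("data")
        else:
            papers.append(item)
    return papers, summaries


def run_actor(actor_id, run_input):
    """Start a run, wait for it to finish, and return all dataset items."""
    # Imported lazily so the helpers above work without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example usage (requires a valid token and the actor's real ID):
#   items = run_actor("<username>/<actor-name>", build_run_input("cs.AI"))
#   papers, summaries = split_items(items)
```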

Troubleshooting

No papers found:

  • Verify your search query and category are correct
  • Try broadening your search (remove filters)
  • Check if arXiv API is accessible: https://arxiv.org/help/api

Slow performance:

  • Reduce maxPapers parameter
  • Disable extractCitations for faster runs
  • Consider running during off-peak hours

Missing metadata:

  • Some papers have incomplete information on arXiv
  • This is normal and handled gracefully
  • Check the logs for specific issues

Local execution issues:

  • Ensure all dependencies are installed: pip install -r requirements.txt
  • Check Python version: python --version (needs 3.8+)
  • Verify internet connectivity to arxiv.org

FAQ

Q: Is this allowed under arXiv’s terms?
Yes. The actor uses the official arXiv API and respects its documented usage guidelines.

Q: How many papers can I fetch?
Practically up to around 1,000 per run for good performance. You can run the actor multiple times with different filters.

Q: Can I run this on a schedule?
Yes. On Apify you can set a schedule (e.g., daily) to keep your dataset up to date.

Q: Does it download PDFs?
It returns direct pdf_url links. You can then download PDFs in your own pipeline if needed.

Q: Who is this for?
Research teams, data scientists, AI/LLM engineers, and product teams who need reliable, structured arXiv data without maintaining their own scraper.


Ready to turn arXiv into actionable research data? Run the actor on Apify and start exploring.