# Academic Papers Scraper - Citations & Metadata (`benthepythondev/crossref-papers-scraper`) Actor

Search 150M+ scholarly works by keyword (or look up by DOI) for structured metadata: title, authors with ORCID, journal, publication date, type, publisher, citation count, subjects, ISSN, volume/issue/pages and URL. Fast and reliable via the public Crossref API.

- **URL**: https://apify.com/benthepythondev/crossref-papers-scraper.md
- **Developed by:** [ben](https://apify.com/benthepythondev) (community)
- **Categories:** Developer tools, AI, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $1.50 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
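
As a rough illustration of the per-result pricing above, the sketch below estimates an upper bound for the cost of one run. It assumes the listed starting rate of $1.50 per 1,000 results applies uniformly to every returned row, that each DOI lookup yields one billable row, and that no other events are billed. `estimate_cost_usd` is a hypothetical helper, not part of any Apify library.

```python
from decimal import Decimal

# Assumption: the listed "from" rate applies uniformly to each returned result.
PRICE_PER_1000_RESULTS_USD = Decimal("1.50")


def estimate_cost_usd(search_terms: int, max_per_term: int, dois: int = 0) -> Decimal:
    """Upper-bound estimate: every term returns max_per_term rows, every DOI one row."""
    max_results = search_terms * max_per_term + dois
    cost = PRICE_PER_1000_RESULTS_USD * max_results / 1000
    return cost.quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # Two search terms at 50 results each, plus three DOI lookups: at most 103 rows.
    print(estimate_cost_usd(search_terms=2, max_per_term=50, dois=3))
```

The actual charge depends on how many rows a run really returns and on the Actor's current price list, so treat the figure as a ceiling rather than a quote.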

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🎓 Academic Papers Scraper — Crossref Citations & Metadata

Search **150M+ scholarly works** by keyword — or look up exact papers by DOI — and get clean, structured metadata for every result: title, authors, journal, publication date, work type, publisher, citation count, subjects, ISSN, volume, issue, pages, language, DOI and resolver URL. It turns a research question into a ready-to-analyze dataset in seconds, with no manual copy-pasting from publisher sites.

The Actor is powered by the public **Crossref REST API**, run by the registration agency that mints DOIs for most of the world's journals, so it is fast, reliable and needs no browser, no login and no API key. Export to JSON/CSV/Excel, run on a schedule, call via API, or connect to Make, Zapier or n8n.

### 🔎 What is the Academic Papers Scraper?

Crossref is a central index behind academic publishing — most journal articles, conference papers and book chapters with a DOI, along with many preprints and datasets, are registered there. This Actor lets you query that index programmatically. Give it one or more topics (e.g. `machine learning`, `crispr gene editing`) and it returns the top matching works as structured rows, ranked by Crossref relevance. Prefer exact papers? Pass a list of DOIs and it looks each one up directly. You can also restrict results to recent years to focus on the current state of a field.

Because it runs against a clean JSON API rather than scraping HTML, results are consistent and well-typed, and runs avoid the blocking that HTML scrapers often hit — ideal for reproducible literature reviews and automated research pipelines.
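
To make the keyword search, DOI lookup and year restriction more concrete, here is a minimal sketch of the equivalent requests against the public Crossref REST API. The Actor's internal requests are not documented here, so this is an assumption about what such queries could look like. `build_search_url` and `build_doi_url` are hypothetical helper names.

```python
from typing import Optional
from urllib.parse import quote, urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_search_url(term: str, max_per_term: int = 20, from_year: Optional[str] = None) -> str:
    """Build a Crossref /works search URL for one keyword query."""
    params = {"query": term, "rows": max_per_term}
    if from_year:
        # Crossref's from-pub-date filter accepts a bare year such as 2020.
        params["filter"] = f"from-pub-date:{from_year}"
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


def build_doi_url(doi: str) -> str:
    """Build the Crossref URL for a single work looked up by DOI."""
    return f"{CROSSREF_WORKS_URL}/{quote(doi, safe='/')}"


if __name__ == "__main__":
    print(build_search_url("machine learning", max_per_term=50, from_year="2020"))
    print(build_doi_url("10.1038/nphys1170"))
```

The functions only build URLs and send nothing, so they can be inspected or logged before any request is made.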

#### What data does it extract?

- **DOI** — the persistent identifier for the work
- **Title** of the paper
- **Authors** — full list of author names, plus an **author_count**
- **Journal** / container title
- **Published date** and **year**
- **Type** — journal-article, proceedings-article, book-chapter, dataset, posted-content, etc.
- **Publisher** — e.g. Springer, Elsevier, IEEE, Wiley
- **Citations** — Crossref `is-referenced-by-count` (how often the work is cited)
- **Reference count** — number of works it cites
- **Subjects** — subject/field classifications
- **ISSN** — journal serial number(s)
- **Volume**, **issue** and **pages**
- **Language** of the work
- **URL** — the DOI resolver link (`https://doi.org/…`)
- **Query** — the search term that surfaced the row
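
The field list above can be read as a mapping from a raw Crossref work record to one output row. The sketch below shows one plausible version of that mapping. The Crossref keys (`DOI`, `container-title`, `issued`, `is-referenced-by-count`, `references-count`, `page`) are real Crossref field names, but `to_row` itself is hypothetical and the Actor's actual implementation may differ.

```python
def to_row(work: dict, query: str) -> dict:
    """Flatten one raw Crossref work record into a row with the fields listed above."""
    authors = [
        " ".join(part for part in (person.get("given"), person.get("family")) if part)
        or person.get("name", "")  # organisational authors carry only "name"
        for person in work.get("author", [])
    ]
    # Crossref encodes dates as nested "date-parts", e.g. [[2020, 9, 16]];
    # month and day may be absent, and the first entry can even be [None].
    raw_parts = (work.get("issued", {}).get("date-parts") or [[]])[0]
    parts = [p for p in raw_parts if p is not None]
    doi = work.get("DOI")
    return {
        "doi": doi,
        "title": (work.get("title") or [None])[0],
        "authors": authors,
        "author_count": len(authors),
        "journal": (work.get("container-title") or [None])[0],
        "published_date": "-".join(f"{p:02d}" for p in parts) if parts else None,
        "year": parts[0] if parts else None,
        "type": work.get("type"),
        "publisher": work.get("publisher"),
        "citations": work.get("is-referenced-by-count"),
        "reference_count": work.get("references-count"),
        "subjects": work.get("subject", []),
        "issn": work.get("ISSN", []),
        "volume": work.get("volume"),
        "issue": work.get("issue"),
        "pages": work.get("page"),
        "language": work.get("language"),
        "url": f"https://doi.org/{doi}" if doi else None,
        "query": query,
    }
```

Missing keys fall back to `None` or an empty list, so sparse records still produce a complete row.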

### ⬇️ Input

Run it two ways — search by keyword, or look up exact papers by DOI. You can combine both in one run.

| Field | Type | Description |
|-------|------|-------------|
| `searchTerms` | array | Keywords/topics to search, e.g. `machine learning`. One or many. |
| `dois` | array | Optional: exact works to pull by DOI, e.g. `10.1038/nphys1170`. |
| `fromYear` | string | Optional: only return works published from this year onwards, e.g. `2020`. |
| `maxPerTerm` | integer | Max papers to return per search term. Default `20`, up to `100`. |

#### Example input

```json
{
  "searchTerms": ["machine learning", "protein folding"],
  "fromYear": "2020",
  "maxPerTerm": 50
}
````
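
Before starting a run, the input fields above can be checked locally. The following sketch builds an input object and enforces the documented `maxPerTerm` range of 1 to 100. The rules that at least one term or DOI must be present and that `fromYear` is a four-digit year are assumptions, not documented constraints, and `build_input` is a hypothetical helper.

```python
import re
from typing import Optional, Sequence, Union

MAX_PER_TERM_LIMIT = 100  # upper bound from the input table above


def build_input(
    search_terms: Optional[Sequence[str]] = None,
    dois: Optional[Sequence[str]] = None,
    from_year: Union[str, int, None] = None,
    max_per_term: int = 20,
) -> dict:
    """Validate arguments and return a dict shaped like the Actor's JSON input."""
    terms = [t.strip() for t in (search_terms or []) if t.strip()]
    doi_list = [d.strip() for d in (dois or []) if d.strip()]
    if not terms and not doi_list:
        # Assumption: a run without terms or DOIs has nothing to fetch.
        raise ValueError("Provide at least one search term or DOI.")
    if not 1 <= max_per_term <= MAX_PER_TERM_LIMIT:
        raise ValueError(f"maxPerTerm must be between 1 and {MAX_PER_TERM_LIMIT}.")

    run_input: dict = {"maxPerTerm": max_per_term}
    if terms:
        run_input["searchTerms"] = terms
    if doi_list:
        run_input["dois"] = doi_list
    if from_year is not None:
        year = str(from_year).strip()
        # Assumption: the Actor expects a plain four-digit year.
        if not re.fullmatch(r"\d{4}", year):
            raise ValueError("fromYear must be a four-digit year such as '2020'.")
        run_input["fromYear"] = year  # the schema types this field as a string
    return run_input
```

Calling `build_input(["machine learning", "protein folding"], from_year=2020, max_per_term=50)` produces the same object as the example input above.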

### ⬆️ Output

Each work is one clean row (view as a **table**, or export **JSON / CSV / Excel**):

```json
{
  "doi": "10.1038/s41586-020-2649-2",
  "title": "Array programming with NumPy",
  "authors": ["Charles R. Harris", "K. Jarrod Millman", "Stéfan J. van der Walt"],
  "author_count": 26,
  "journal": "Nature",
  "published_date": "2020-09-16",
  "year": 2020,
  "type": "journal-article",
  "publisher": "Springer Science and Business Media LLC",
  "citations": 9712,
  "reference_count": 47,
  "subjects": ["Multidisciplinary"],
  "issn": ["0028-0836", "1476-4687"],
  "volume": "585",
  "issue": "7825",
  "pages": "357-362",
  "language": "en",
  "url": "https://doi.org/10.1038/s41586-020-2649-2",
  "query": "machine learning"
}
```
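
For a custom export outside the Apify console, list-valued fields such as `authors`, `subjects` and `issn` need flattening before they fit a CSV cell. The sketch below joins them with a separator. `rows_to_csv` is a hypothetical helper and the `"; "` separator is an arbitrary choice.

```python
import csv
import io

LIST_FIELDS = ("authors", "subjects", "issn")


def rows_to_csv(rows: list, separator: str = "; ") -> str:
    """Serialise output rows to CSV text, joining list-valued fields into one cell."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the first row; keys missing from later rows become empty cells.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        for field in LIST_FIELDS:
            if isinstance(flat.get(field), list):
                flat[field] = separator.join(flat[field])
        writer.writerow(flat)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{
        "doi": "10.1038/s41586-020-2649-2",
        "title": "Array programming with NumPy",
        "authors": ["Charles R. Harris", "K. Jarrod Millman", "Stéfan J. van der Walt"],
        "issn": ["0028-0836", "1476-4687"],
        "citations": 9712,
    }]
    print(rows_to_csv(sample))
```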

### 💡 Use cases

- 📑 **Literature reviews** — gather every relevant paper on a topic with citations, authors and journals in one pass, then filter and sort in a spreadsheet.
- 📊 **Research dashboards & bibliometrics** — track publication output by journal, year, subject or publisher and visualize trends over time.
- 🔗 **Citation analysis** — rank works by citation count to find the seminal papers in a field and spot rising research.
- 🤖 **RAG / LLM & app pipelines** — feed structured paper metadata into retrieval systems, reference managers or your own research tools.
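
The citation-analysis and bibliometrics use cases above can be sketched in a few lines once the rows are loaded. The example ranks rows by `citations` and counts works per `year`. `top_cited` and `works_per_year` are hypothetical helpers, and the two `10.0000/placeholder-*` rows are made-up placeholders, not real works.

```python
from collections import Counter


def top_cited(rows: list, n: int = 10) -> list:
    """Return the n rows with the highest citation counts (missing counts rank last)."""
    return sorted(rows, key=lambda row: row.get("citations") or 0, reverse=True)[:n]


def works_per_year(rows: list) -> dict:
    """Count rows per publication year, ordered by year."""
    counts = Counter(row["year"] for row in rows if row.get("year"))
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    rows = [
        {"doi": "10.1038/s41586-020-2649-2", "year": 2020, "citations": 9712},
        {"doi": "10.0000/placeholder-a", "year": 2021, "citations": 12},    # placeholder row
        {"doi": "10.0000/placeholder-b", "year": 2021, "citations": None},  # placeholder row
    ]
    for row in top_cited(rows, n=2):
        print(row["doi"], row["citations"])
    print(works_per_year(rows))
```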

### ❓ FAQ

**How do I scrape academic papers?** Enter one or more `searchTerms` (or `dois`), set `maxPerTerm`, and Run. You get structured rows with title, authors, journal, date, citations, subjects and DOI.

**Do I need an API key or login?** No. It uses the public Crossref API — just provide keywords or DOIs.

**How many works are covered?** 150M+ records across virtually every scholarly publisher that registers DOIs with Crossref.

**Can I look up a specific paper?** Yes — pass one or more DOIs in `dois` and the Actor pulls each exact record.

**Can I restrict results to recent years?** Yes — set `fromYear` (e.g. `2020`) to only return works published from that year onwards.

**Does it include citation counts?** Yes — the Crossref `is-referenced-by-count` value is returned as `citations`, along with the outgoing `reference_count`.

**Which publishers and document types are included?** All of them — journal articles, conference proceedings, book chapters, datasets, preprints and more, from Springer, Elsevier, IEEE, Wiley, Nature, PLOS and thousands of others.

**Can I run it on a schedule or via API?** Yes — schedule recurring runs on Apify, call it via the API/SDK, or connect it to Make, Zapier or n8n.

**How does pricing work?** Pay per paper returned — no subscription, no fixed monthly fee.

**Is it legal?** It uses the public, openly documented Crossref REST API, which is designed for exactly this kind of metadata retrieval. Use the data responsibly and in line with Crossref's terms.

### 🔗 You might also like

- **[arXiv Scraper](https://apify.com/benthepythondev/arxiv-scraper)** — preprints in physics, CS, math & more.
- **[PubMed Scraper](https://apify.com/benthepythondev/pubmed-scraper)** — biomedical & life-sciences literature.
- **[OpenAlex Scraper](https://apify.com/benthepythondev/openalex-scraper)** — open scholarly graph: works, authors & institutions.
- **[Wikipedia Scraper](https://apify.com/benthepythondev/wikipedia-scraper)** — clean article summaries & metadata.

***

**Keywords:** academic papers scraper, crossref api, crossref scraper, scholarly metadata, citation data, doi lookup, research papers, journal articles, bibliographic data, literature review, citation analysis, bibliometrics, science data, paper metadata, academic search, publication data, scholarly works, research dataset, doi metadata, journal scraper

# Actor input Schema

## `searchTerms` (type: `array`):

Keywords/topics to search scholarly works for, e.g. 'machine learning'.

## `dois` (type: `array`):

Optional: look up specific works by DOI (e.g. '10.1038/nphys1170').

## `fromYear` (type: `string`):

Optional: only return works published from this year onwards, e.g. '2020'.

## `maxPerTerm` (type: `integer`):

Maximum papers to return per search term.

## Actor input object example

```json
{
  "searchTerms": [
    "machine learning"
  ],
  "maxPerTerm": 20
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchTerms": [
        "machine learning"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("benthepythondev/crossref-papers-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "searchTerms": ["machine learning"] }

# Run the Actor and wait for it to finish
run = client.actor("benthepythondev/crossref-papers-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "searchTerms": [
    "machine learning"
  ]
}' |
apify call benthepythondev/crossref-papers-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=benthepythondev/crossref-papers-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Academic Papers Scraper - Citations & Metadata",
        "description": "Search 150M+ scholarly works by keyword (or look up by DOI) for structured metadata: title, authors with ORCID, journal, publication date, type, publisher, citation count, subjects, ISSN, volume/issue/pages and URL. Fast and reliable via the public Crossref API.",
        "version": "1.0",
        "x-build-id": "miAUPxDI3mLqa9GE3"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/benthepythondev~crossref-papers-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-benthepythondev-crossref-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/benthepythondev~crossref-papers-scraper/runs": {
            "post": {
                "operationId": "runs-sync-benthepythondev-crossref-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/benthepythondev~crossref-papers-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-benthepythondev-crossref-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "searchTerms": {
                        "title": "Search terms",
                        "type": "array",
                        "description": "Keywords/topics to search scholarly works for, e.g. 'machine learning'.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "dois": {
                        "title": "DOIs (lookup)",
                        "type": "array",
                        "description": "Optional: look up specific works by DOI (e.g. '10.1038/nphys1170').",
                        "items": {
                            "type": "string"
                        }
                    },
                    "fromYear": {
                        "title": "From year",
                        "type": "string",
                        "description": "Optional: only return works published from this year onwards, e.g. '2020'."
                    },
                    "maxPerTerm": {
                        "title": "Max papers per term",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum papers to return per search term.",
                        "default": 20
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
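
For projects that do not use a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch: `build_run_request` and `run_and_fetch_items` are hypothetical helpers. It also assumes the response body is a JSON array of dataset items, which the specification does not state explicitly.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/benthepythondev~crossref-papers-scraper/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request; the token goes in the query string, as in the spec."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch_items(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the body (assumed to be a JSON array of items)."""
    request = build_run_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request target; no network call is made here.
    request = build_run_request("<YOUR_API_TOKEN>", {"searchTerms": ["machine learning"]})
    print(request.get_method(), request.full_url)
```

Because the endpoint waits for the run to finish, long runs may exceed the client timeout. The asynchronous `/runs` path in the same specification is the alternative for those cases.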
