Crossref Scraper
Search and extract academic papers, journals, and citations from Crossref's 130M+ scholarly works database.

Pricing: Pay per event
Developer: Stas Persiianenko
Search and extract academic papers, journal articles, and scholarly works from Crossref's database of over 130 million records. Get titles, authors, DOIs, citation counts, abstracts, and publication metadata.
What does Crossref Scraper do?
Crossref Scraper searches the Crossref REST API — the world's largest database of scholarly metadata — and extracts structured data from matching works. It returns paper titles, author lists, DOIs, citation counts, abstracts, journal names, publication dates, and licensing information.
You can search by keyword, filter by work type (journal article, book chapter, conference paper, etc.), and sort by relevance, publication date, or citation count.
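These keyword, type-filter, and sort options correspond directly to query parameters on Crossref's public `/works` endpoint. As a rough sketch of what one search query amounts to (independent of this Actor — the function name and defaults here are illustrative, not part of the Actor's code), a request URL can be assembled like this; `mailto` is Crossref's convention for opting into the faster "polite pool":

```python
from urllib.parse import urlencode

def build_crossref_query(keywords, work_type=None, sort=None, rows=20, mailto=None):
    """Build a Crossref /works URL roughly equivalent to one search query."""
    params = {"query": keywords, "rows": rows}
    if sort:
        params["sort"] = sort      # e.g. "published" or "is-referenced-by-count"
        params["order"] = "desc"   # newest / most-cited first
    if work_type:
        params["filter"] = f"type:{work_type}"
    if mailto:
        params["mailto"] = mailto  # identifies you to Crossref's polite pool
    return "https://api.crossref.org/works?" + urlencode(params)

url = build_crossref_query("deep learning",
                           work_type="journal-article",
                           sort="is-referenced-by-count",
                           rows=20)
```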
Why scrape Crossref?
Crossref indexes over 130 million scholarly works from 18,000+ publishers. It's the authoritative source for:
- Literature reviews — find relevant papers by keyword and sort by citation impact
- Bibliometric analysis — study citation patterns, publication trends, and research output
- Dataset construction — build training data for academic AI models or recommendation systems
- Research monitoring — track new publications in specific fields or by specific authors
- Citation analysis — identify the most influential papers in any research area
- Publisher intelligence — analyze publication volumes and patterns across journals
How much does it cost?
Crossref Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Paper extracted | $0.001 per paper |
Example costs:
- 20 most-cited papers on "deep learning": ~$0.021
- 100 recent papers on "CRISPR": ~$0.101
- 500 papers across 5 search queries (one run): ~$0.501
Platform costs are minimal — a typical run uses under $0.002 in compute.
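The event pricing above reduces to simple arithmetic: one run-started charge per run plus a per-paper charge. A minimal estimator (event costs only; platform compute is extra):

```python
RUN_STARTED = 0.001  # charged once per run
PER_PAPER = 0.001    # charged per extracted paper

def estimate_cost(papers, runs=1):
    """Estimated event cost in USD for `papers` papers across `runs` runs."""
    return round(runs * RUN_STARTED + papers * PER_PAPER, 3)

estimate_cost(20)   # 20 most-cited papers -> 0.021
estimate_cost(100)  # 100 recent papers   -> 0.101
```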
Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| searchQueries | string[] | Keywords to search for in academic papers | Required |
| type | string | Filter by work type: journal-article, book-chapter, proceedings-article, book, dataset, report, dissertation, preprint | All types |
| sortBy | string | Sort results: relevance, published (newest first), is-referenced-by-count (most cited) | relevance |
| maxResults | integer | Maximum papers per keyword (1–1000) | 50 |
Input example
```json
{
  "searchQueries": ["machine learning", "deep learning"],
  "sortBy": "is-referenced-by-count",
  "maxResults": 20
}
```
Output example
Each paper is returned as a JSON object:
```json
{
  "doi": "10.1038/nature14539",
  "title": "Deep learning",
  "authors": ["Yann LeCun", "Yoshua Bengio", "Geoffrey Hinton"],
  "abstract": "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction...",
  "type": "journal-article",
  "publisher": "Springer Science and Business Media LLC",
  "journal": "Nature",
  "publishedDate": "2015-05-27",
  "citationCount": 69717,
  "referenceCount": 73,
  "url": "https://doi.org/10.1038/nature14539",
  "subjects": ["Multidisciplinary"],
  "license": "https://www.springer.com/tdm",
  "language": "en",
  "page": "436-444",
  "volume": "521",
  "issue": "7553",
  "isbn": [],
  "issn": ["0028-0836", "1476-4687"],
  "scrapedAt": "2026-03-03T04:20:00.000Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| doi | string | Digital Object Identifier — unique paper identifier |
| title | string | Paper title |
| authors | string[] | List of author names |
| abstract | string | Paper abstract (when available) |
| type | string | Work type (journal-article, book-chapter, etc.) |
| publisher | string | Publisher name |
| journal | string | Journal or container title |
| publishedDate | string | Publication date (YYYY-MM-DD format) |
| citationCount | number | Number of times this work has been cited |
| referenceCount | number | Number of references in this work |
| url | string | DOI URL linking to the paper |
| subjects | string[] | Subject categories |
| license | string | License URL |
| language | string | Language code |
| page | string | Page range |
| volume | string | Journal volume |
| issue | string | Journal issue |
| isbn | string[] | ISBN identifiers (for books) |
| issn | string[] | ISSN identifiers (for journals) |
| scrapedAt | string | ISO timestamp when data was extracted |
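Because every record follows this flat schema, post-processing the dataset is straightforward. A small sketch using made-up records shaped like the fields above (the DOIs and counts here are illustrative, not real data):

```python
from collections import Counter

# Abridged records shaped like the documented output.
papers = [
    {"doi": "10.1000/a", "title": "Paper A", "citationCount": 120, "journal": "Nature"},
    {"doi": "10.1000/b", "title": "Paper B", "citationCount": 950, "journal": "Science"},
    {"doi": "10.1000/c", "title": "Paper C", "citationCount": 430, "journal": "Nature"},
]

# Rank by citation impact.
top = sorted(papers, key=lambda p: p["citationCount"], reverse=True)

# Count papers per journal.
per_journal = Counter(p["journal"] for p in papers)
```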
How to use the Crossref Scraper API
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/crossref-scraper").call(run_input={
    "searchQueries": ["transformer neural network"],
    "sortBy": "is-referenced-by-count",
    "maxResults": 50,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    authors = ", ".join(item["authors"][:3])
    print(item["title"])
    print(f"  {authors} | {item['publishedDate']} | {item['citationCount']} citations")
    print(f"  DOI: {item['doi']}")
```
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/crossref-scraper').call({
    searchQueries: ['transformer neural network'],
    sortBy: 'is-referenced-by-count',
    maxResults: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.title} (${item.citationCount} citations)`);
    console.log(`  DOI: ${item.doi}`);
});
```
REST API
```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab/crossref-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchQueries": ["machine learning"],
    "sortBy": "is-referenced-by-count",
    "maxResults": 20
  }'
```
Integrations
Connect Crossref Scraper to hundreds of apps using built-in integrations:
- Google Sheets — export citation data to spreadsheets for analysis
- Slack / Microsoft Teams — get notifications when scraping completes
- Zapier / Make — trigger workflows with new paper data
- Amazon S3 / Google Cloud Storage — store large research datasets
- Webhook — send results to your own API endpoint
Tips and best practices
- Sort by citations — use `is-referenced-by-count` to find the most influential papers in any field.
- Filter by type — narrow results to journal articles, conference papers, or books to focus your search.
- Combine keywords — use multiple search terms like `["CRISPR", "gene therapy", "genome editing"]` to cover a topic broadly.
- Abstract availability — not all papers have abstracts in Crossref. About 30–40% include them.
- Citation counts — these reflect citations tracked by Crossref, which may differ from Google Scholar or Scopus counts.
- Rate limits — the scraper uses Crossref's polite pool (with mailto) for faster responses. No API key needed.
- Up to 1000 per keyword — for larger datasets, use multiple related keywords.
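When you search several related keywords in one run, the same paper may match more than one of them, so larger multi-keyword datasets are worth deduplicating by DOI before analysis. A minimal sketch (the helper name is ours, not part of the Actor):

```python
def dedupe_by_doi(items):
    """Keep the first record seen for each DOI (case-insensitive)."""
    seen, unique = set(), []
    for item in items:
        key = item["doi"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

unique = dedupe_by_doi([
    {"doi": "10.1038/nature14539", "title": "Deep learning"},
    {"doi": "10.1038/NATURE14539", "title": "Deep learning"},  # same DOI, different case
    {"doi": "10.1162/neco.1997.9.8.1735", "title": "Long Short-Term Memory"},
])
```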
FAQ
Q: How current is the data?
A: Crossref data is updated continuously as publishers register new DOIs. Most papers appear within days of publication.
Q: Does it include full paper text?
A: No. Crossref stores metadata only — titles, authors, abstracts, DOIs, and citations. For full text, follow the DOI link to the publisher's site.
Q: Do I need an API key?
A: No. Crossref's API is completely open. The scraper uses the polite pool for better performance.
Q: What types of works are covered?
A: Journal articles, book chapters, conference papers, books, datasets, reports, dissertations, preprints, and more — anything with a DOI registered through Crossref.