Crossref Scholarly Works Scraper

Pricing

from $3.00 / 1,000 results

Extract scholarly works metadata from Crossref — DOIs, titles, authors, journals, publication dates, and citation counts. Filter by query, date range, and work type. No API key required.

Developer

Compute Edge

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

Extract scholarly works metadata from Crossref — DOIs, titles, authors, journals, publication dates, and citation counts. Query 135+ million scholarly articles, books, proceedings, and datasets via the Crossref REST API. Perfect for academic research, bibliometric analysis, literature reviews, and citation network studies.

What This Actor Does

This Actor provides an interface to the Crossref REST API, one of the largest open databases of scholarly works. It supports four search and extraction options:

  1. Free-Text Search — Search by keyword across titles, abstracts, and metadata (e.g., "machine learning", "COVID-19", "renewable energy")
  2. Publication Date Filtering — Restrict results to works published within a date range
  3. Work Type Filtering — Target specific work types (e.g., journal articles, books, proceedings, datasets)
  4. Pagination & Bulk Extraction — Automatically fetch up to 5,000 records per run using cursor-based pagination

Key Features

  • 135+ million works — Access the complete Crossref dataset
  • Rich metadata — DOI, title, first author and author count, journal/container, publication date, citation and reference counts
  • Flexible filtering — Combine free-text search with date range and work type filters
  • High-speed pagination — Cursor-based API ensures fast, stable bulk extracts
  • No authentication required — Public API, free to use
  • Error handling — Graceful fallback for missing or incomplete metadata
  • Batch processing — Efficient extraction for large datasets

| Use Case | Query Example | Work Type | Output |
| --- | --- | --- | --- |
| Literature Review | "climate change mitigation" | journal-article | Top 500 recent articles on climate solutions |
| Citation Network Analysis | "neural networks" | journal-article + proceedings | Papers by citation count for network mapping |
| Trend Tracking | "AI safety" | all types | New works published in last 30 days |
| Researcher Database | None (recent works) | all types | Latest 1,000 scholarly works across all fields |
| Book Discovery | "sustainable development" | book | Recent books on sustainability |
| Conference Proceedings | "machine learning" | proceedings | Peer-reviewed conference papers |

Getting Started

Step 1: Run the Actor

  1. Choose your input parameters (see below)
  2. Click Start
  3. Results appear in the Dataset tab
  4. Export as JSON or CSV via Apify UI

Step 2: Simple Example — Search Recent Works

To fetch 50 recent works (no search query):

  • Query: (leave blank)
  • Filter From Date: (leave blank)
  • Work Type: (leave blank)
  • Max Results: 50

Results include title, authors, journal, publication date, and DOI for each work.

How to scrape Crossref scholarly works

Tutorial 1: Search for Papers on Machine Learning

Goal: Find the top 100 recent journal articles on machine learning.

Input configuration:

  • Query: machine learning
  • Work Type: journal-article
  • Filter From Date: (leave blank for all time)
  • Max Results: 100

Expected output:

[
  {
    "doi": "10.1038/nature12373",
    "title": "Deep Neural Networks Capture Context-Dependent Neural Activity in the Primate Visual System",
    "type": "journal-article",
    "publisher": "Nature Publishing Group",
    "journal": "Nature",
    "publishedDate": "2024-03-15",
    "authorsCount": 5,
    "firstAuthor": "Antolik Mark",
    "citationCount": 1240,
    "referenceCount": 45,
    "issn": "0028-0836",
    "url": "https://doi.org/10.1038/nature12373"
  },
  ...
]

Use case: Build a curated bibliography of the most-cited machine learning papers for a literature review or research project.


Tutorial 2: Track Recent Works in a Specific Domain

Goal: Monitor all scholarly works on renewable energy published in the last 90 days.

Input configuration:

  • Query: renewable energy
  • Filter From Date: 2026-03-21 (90 days before the run date in this example; recompute it for each run)
  • Work Type: (leave blank for all types)
  • Max Results: 500
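
Because the Filter From Date value depends on the day the Actor runs, it can help to compute it rather than type it by hand. The sketch below is one way to do that; the helper name `from_date_days_ago` is hypothetical and not part of the Actor, and the input keys are taken from the Input Parameters section.

```python
from datetime import date, timedelta


def from_date_days_ago(days, today=None):
    """Return the date `days` days before `today` as a YYYY-MM-DD string."""
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()


# Run input for the renewable energy tutorial, built for the current day.
run_input = {
    "query": "renewable energy",
    "filterFromDate": from_date_days_ago(90),
    "maxResults": 500,
}
```

For a run on 2026-06-19, `from_date_days_ago(90, date(2026, 6, 19))` returns `"2026-03-21"`, which matches the value shown in the input configuration. The same helper with `days=7` covers a weekly schedule.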

Expected output:

[
  {
    "doi": "10.1016/j.renene.2026.03.001",
    "title": "Advances in Perovskite Solar Cell Efficiency and Stability",
    "type": "journal-article",
    "publisher": "Elsevier",
    "journal": "Renewable Energy",
    "publishedDate": "2026-03-20",
    "authorsCount": 8,
    "firstAuthor": "Liu Chen",
    "citationCount": 0,
    "referenceCount": 67,
    "issn": "0960-1481",
    "url": "https://doi.org/10.1016/j.renene.2026.03.001"
  },
  ...
]

Use case: Stay current with emerging research in your domain. Track high-impact journals and new author collaborations. Feed into a data pipeline for weekly research digest emails.


Tutorial 3: Citation Network Analysis

Goal: Extract 200 highly-cited papers on artificial intelligence to map research influence.

Input configuration:

  • Query: artificial intelligence
  • Work Type: journal-article
  • Filter From Date: (leave blank)
  • Max Results: 200

Expected output (sorted by citation count):

[
  {
    "doi": "10.1145/3495243.3560528",
    "title": "Attention Is All You Need",
    "type": "journal-article",
    "publisher": "ACM",
    "journal": "Transactions on Machine Learning Research",
    "publishedDate": "2017-12-06",
    "authorsCount": 8,
    "firstAuthor": "Vaswani Ashish",
    "citationCount": 88450,
    "referenceCount": 72,
    "issn": "",
    "url": "https://doi.org/10.1145/3495243.3560528"
  },
  ...
]

Use case: Build a citation network graph showing how papers reference each other. Identify foundational works and research clusters. Track influence trajectories of key researchers.


Input Parameters

All Modes

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| query | string | (blank) | No | Free-text search query (e.g., "machine learning", "COVID-19"). Leave blank to fetch recent works. Case-insensitive. |
| filterFromDate | string (YYYY-MM-DD) | (blank) | No | Only include works published on or after this date (e.g., "2024-01-01"). Leave blank for all dates. |
| workType | string | (blank) | No | Filter by work type. Common values: journal-article, book, proceedings, report, dataset. Leave blank for all types. |
| maxResults | integer | 50 | No | Maximum works to fetch (1–5,000). Default is 50. |
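
The constraints in the parameter table can be checked before a run is started. The following is a minimal sketch, not the Actor's own validation: `build_input` and `MAX_RESULTS_LIMIT` are hypothetical names, and only the key names, the date format, and the 1 to 5,000 range come from the table above.

```python
import re
from datetime import date

MAX_RESULTS_LIMIT = 5000  # upper bound documented for maxResults


def build_input(query="", filter_from_date="", work_type="", max_results=50):
    """Build a run-input dict, rejecting values the table above rules out."""
    if not 1 <= max_results <= MAX_RESULTS_LIMIT:
        raise ValueError(f"maxResults must be between 1 and {MAX_RESULTS_LIMIT}")
    if filter_from_date:
        # Require the exact YYYY-MM-DD shape, then check it is a real date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", filter_from_date):
            raise ValueError("filterFromDate must use the YYYY-MM-DD format")
        date.fromisoformat(filter_from_date)  # raises ValueError if invalid

    run_input = {"maxResults": max_results}
    # Blank parameters are left out so the Actor falls back to its defaults.
    if query:
        run_input["query"] = query
    if filter_from_date:
        run_input["filterFromDate"] = filter_from_date
    if work_type:
        run_input["workType"] = work_type
    return run_input
```

The resulting dict can be pasted into the Actor's JSON input or passed as the run input when the Actor is started programmatically.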

Common Work Types

  • journal-article — Peer-reviewed journal articles
  • proceedings-article — Individual conference papers (proceedings is the type for a whole proceedings volume)
  • book — Complete books
  • book-chapter — Chapters within books
  • report — Technical reports, white papers
  • dataset — Data publications
  • dissertation — Theses and dissertations
  • component — Article components (figures, tables, appendices)

Full list: see https://github.com/CrossRef/rest-api-doc#work-types, or query https://api.crossref.org/types for the current type identifiers.


Output Schema

Each record contains:

| Field | Type | Example | Description |
| --- | --- | --- | --- |
| doi | string | 10.1038/nature12373 | Digital Object Identifier, the unique identifier for the work |
| title | string | Deep Neural Networks Capture... | Title of the work |
| type | string | journal-article | Work type (journal-article, book, proceedings, etc.) |
| publisher | string | Nature Publishing Group | Publisher name |
| journal | string | Nature | Journal or container name (empty for books) |
| publishedDate | string | 2024-03-15 | Publication date (YYYY-MM-DD, YYYY-MM, or YYYY format) |
| authorsCount | integer | 5 | Number of authors |
| firstAuthor | string | Antolik Mark | First author's full name (family name first, as in the examples) |
| citationCount | integer | 1240 | Number of works that cite this work (from Crossref's is-referenced-by-count) |
| referenceCount | integer | 45 | Number of works referenced by this work |
| issn | string | 0028-0836 | International Standard Serial Number (for journals) |
| url | string | https://doi.org/10.1038/nature12373 | Persistent URL to the work via DOI |
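
To make the schema concrete, here is a sketch of how a raw item from the Crossref `/works` response could be flattened into a record of this shape. It is an illustration under assumptions, not the Actor's source: `to_record` and `first_or_empty` are hypothetical names, the family-name-first order of `firstAuthor` is inferred from the examples in this document, and the raw field names (`DOI`, `container-title`, `is-referenced-by-count`, `references-count`, `date-parts`, and so on) are those used by the Crossref REST API.

```python
def first_or_empty(values):
    """Crossref returns several fields as lists; take the first entry, if any."""
    return values[0] if values else ""


def to_record(item):
    """Flatten one raw Crossref work item into the output schema above."""
    authors = item.get("author") or []
    lead = authors[0] if authors else {}
    # Family name first, matching the examples; either part may be missing.
    first_author = " ".join(p for p in (lead.get("family"), lead.get("given")) if p)

    # date-parts looks like [[2024, 3, 15]], [[2024, 3]] or [[2024]].
    date_info = item.get("published") or item.get("issued") or {}
    date_parts = (date_info.get("date-parts") or [[]])[0]
    parts = [p for p in date_parts if p is not None]
    formats = ("{:04d}", "{:02d}", "{:02d}")
    published = "-".join(fmt.format(p) for fmt, p in zip(formats, parts))

    doi = item.get("DOI", "")
    return {
        "doi": doi,
        "title": first_or_empty(item.get("title")),
        "type": item.get("type", ""),
        "publisher": item.get("publisher", ""),
        "journal": first_or_empty(item.get("container-title")),
        "publishedDate": published,
        "authorsCount": len(authors),
        "firstAuthor": first_author,
        "citationCount": item.get("is-referenced-by-count", 0),
        "referenceCount": item.get("references-count", 0),
        "issn": first_or_empty(item.get("ISSN")),
        "url": item.get("URL") or (f"https://doi.org/{doi}" if doi else ""),
    }
```

Missing metadata falls through to empty strings or zero counts, which is the behaviour the FAQ describes for incomplete author, journal, and ISSN data.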

Pricing

This Actor uses the free Crossref REST API, which requires no authentication. You pay only for Apify compute time.

  • Compute cost: ~$0.0001–0.001 per run (depends on result volume and API latency)
  • Typical cost per batch: $0.01–0.10 for 50–500 works
  • Bulk runs (1000–5000 works): ~$0.10–0.50 per run

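The Crossref API itself is free, with no subscriptions and no per-request charges. Crossref does apply rate limits; requests that identify themselves with a contact email are served from the polite pool, which has higher limits.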


Example Workflows

Workflow 1: Weekly Research Digest Pipeline

  1. Run the Actor every Monday with filterFromDate set to the date 7 days earlier
  2. Extract results to cloud storage (CSV/JSON export)
  3. Feed into email template to send digest to stakeholders
  4. Cost: ~$0.02/week

Workflow 2: Citation Network Analysis (Research Project)

  1. Run Actor with query = your domain (e.g., "quantum computing")
  2. Extract top 500 results (maxResults = 500)
  3. Load into network analysis tool (Gephi, Cytoscape)
  4. Visualize author collaborations and citation influence
  5. Cost: ~$0.05 per analysis run

Workflow 3: Automated Literature Review

  1. Run Actor monthly with your research keywords
  2. Filter by workType = "journal-article"
  3. Combine with external citation tools (Semantic Scholar, OpenAlex)
  4. Build automated bibliography in BibTeX or RIS format
  5. Cost: ~$0.01/month per search term
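
The BibTeX step in Workflow 3 can be sketched directly from the Actor's output records. This is a rough illustration under assumptions: `to_bibtex` is a hypothetical helper, the output only carries the first author (so longer author lists become "and others"), and the first word of `firstAuthor` is treated as the family name, which will be wrong for multi-word family names. Special characters in titles are not escaped.

```python
import re


def to_bibtex(record):
    """Render one output record as a minimal BibTeX entry."""
    first_author = record.get("firstAuthor") or ""
    parts = first_author.split(" ", 1)  # assumed order: family name, then given name
    year = (record.get("publishedDate") or "")[:4]

    # Citation key such as "liu2026"; falls back to "anon" without an author.
    key = re.sub(r"[^A-Za-z0-9]", "", parts[0] or "anon").lower() + year

    author = ", ".join(parts) if first_author else ""
    if author and record.get("authorsCount", 0) > 1:
        author += " and others"

    entry_type = "article" if record.get("type") == "journal-article" else "misc"
    fields = {
        "title": record.get("title", ""),
        "author": author,
        "journal": record.get("journal", ""),
        "year": year,
        "doi": record.get("doi", ""),
        "url": record.get("url", ""),
    }
    lines = [f"@{entry_type}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items() if value]
    lines.append("}")
    return "\n".join(lines)
```

For complete author lists, the DOI in each record can be resolved through a citation tool such as those named in step 3.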

FAQ

"No works found" when searching

  • Verify the query: Try a simpler term (e.g., "cancer" instead of "advanced oncology research methodologies")
  • Check Crossref directly: use https://search.crossref.org to validate the query
  • Try with blank query: Leave search blank to fetch recent works and verify the actor is working
  • Expand date range: Remove filterFromDate to include older works

Empty or incomplete author names

  • Some works have missing or incomplete author metadata in Crossref's database
  • The firstAuthor field will be empty if author data is unavailable
  • Crossref's data quality depends on publisher submission quality
  • Check the URL (DOI link) for author details if needed

Missing ISSN or journal name

  • Not all works have journal information (e.g., books, datasets, preprints)
  • ISSN is only present for journal articles; other types may have empty issn
  • The journal field corresponds to container-title in Crossref (may be empty for non-journal works)

Result limits (maxResults > 5000)

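  • The Actor caps each run at 5,000 results (the maxResults limit); the cap is an Actor setting, not a limit of Crossref's cursor-based pagination
  • For larger datasets, run the Actor multiple times with different filterFromDate values, then merge the results and deduplicate by DOI
  • Example: run once with filterFromDate=2024-01-01 and once with filterFromDate=2023-01-01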

API timeout or slow responses

  • Crossref API is generally fast but can have occasional latency spikes
  • Actor has a 60-second timeout per API request; retries are automatic
  • If timeouts occur frequently, reduce maxResults and run multiple smaller batches

Advanced Usage

Combining Filters

You can combine query, filterFromDate, and workType in a single run:

Example: Find all conference proceedings on "quantum computing" published since 2024:

  • Query: quantum computing
  • Work Type: proceedings
  • Filter From Date: 2024-01-01
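
For readers curious how such a combination might look at the API level, the sketch below builds a Crossref `/works` URL from the three inputs. It is an assumption about the mapping, not the Actor's code: `build_works_url` is a hypothetical helper, while `query`, `filter`, `rows`, `cursor`, `from-pub-date`, and `type` are parameter and filter names from the Crossref REST API.

```python
from urllib.parse import urlencode

API_URL = "https://api.crossref.org/works"


def build_works_url(query="", from_date="", work_type="", rows=100, cursor="*"):
    """Combine the Actor-style inputs into one Crossref /works request URL."""
    filters = []
    if from_date:
        filters.append(f"from-pub-date:{from_date}")
    if work_type:
        filters.append(f"type:{work_type}")

    params = {"rows": rows, "cursor": cursor}
    if query:
        params["query"] = query
    if filters:
        # Crossref expects multiple filters as one comma-separated value.
        params["filter"] = ",".join(filters)
    return f"{API_URL}?{urlencode(params)}"


# The quantum computing example above, expressed as a single request URL.
url = build_works_url("quantum computing", "2024-01-01", "proceedings")
```

Blank inputs simply drop out of the URL, so the same function covers the "recent works" case with no query or filters.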

Pagination & Large Extracts

The actor uses Crossref's cursor-based pagination internally. Each API request fetches up to 100 results; the actor automatically loops to fetch up to your maxResults limit.

  • Requesting 5,000 results requires ~50 API calls
  • Cost scales linearly: 5x results ≈ 5x cost (but still under $0.50)
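
The paging loop described above can be sketched as follows. This is an illustrative reimplementation, not the Actor's source: `fetch_page` and `fetch_works` are hypothetical names and the User-Agent string is a placeholder. The `cursor=*` starting value and the `next-cursor` and `items` fields of the response are part of the Crossref REST API. With a page size of 100, a 5,000-record run needs 5,000 / 100 = 50 requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_page(params, user_agent="example-client/0.1 (mailto:you@example.org)"):
    """Fetch one page from Crossref and return the `message` object."""
    url = "https://api.crossref.org/works?" + urlencode(params)
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=60) as response:
        return json.load(response)["message"]


def fetch_works(params, max_results, fetch=fetch_page, page_size=100):
    """Follow Crossref cursors until max_results items are collected."""
    items, cursor = [], "*"  # "*" asks Crossref to start a new cursor
    while len(items) < max_results:
        message = fetch({**params, "rows": page_size, "cursor": cursor})
        page = message.get("items", [])
        if not page:  # an empty page means the result set is exhausted
            break
        items.extend(page)
        cursor = message.get("next-cursor")
        if not cursor:
            break
    return items[:max_results]


# Example call (performs network requests):
# works = fetch_works({"query": "quantum computing"}, max_results=250)
```

Passing the fetch function as an argument keeps the loop testable without network access, since a stub that returns canned pages can stand in for `fetch_page`.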

Filtering Tips

By date range: filterFromDate sets only a lower bound; there is no "to date" parameter

  • To get works from 2024 only, run with filterFromDate=2024-01-01 and discard records whose publishedDate is 2025 or later (alternatively, run again with filterFromDate=2025-01-01 and exclude those results by DOI)
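
Since there is no upper date bound on the input side, one option is to narrow the results after the run. The sketch below keeps only records from a single year by matching the start of `publishedDate`, which works for all three documented formats (YYYY-MM-DD, YYYY-MM, YYYY). The helper name `within_year` is hypothetical.

```python
def within_year(records, year):
    """Keep records whose publishedDate falls in the given calendar year."""
    prefix = f"{year:04d}"
    return [r for r in records if (r.get("publishedDate") or "").startswith(prefix)]


# Sample records in the Actor's output shape (DOIs are made up for illustration).
records = [
    {"doi": "10.1234/a", "publishedDate": "2024-05-08"},
    {"doi": "10.1234/b", "publishedDate": "2025-01-02"},
    {"doi": "10.1234/c", "publishedDate": "2024"},
    {"doi": "10.1234/d", "publishedDate": ""},
]
only_2024 = within_year(records, 2024)
```

Here `only_2024` holds the records with DOIs `10.1234/a` and `10.1234/c`; records with an empty date are dropped because their year cannot be confirmed.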

By work type: Common types are listed above; others exist but are rare

By publisher: Not a direct input, but you can add publisher names to your query text (e.g., "machine learning IEEE" to bias toward IEEE publications)


Output Examples

Example 1: Journal Article

{
  "doi": "10.1038/s41586-024-07301-x",
  "title": "AlphaFold 3: Structure Prediction for Biology",
  "type": "journal-article",
  "publisher": "Nature Publishing Group",
  "journal": "Nature",
  "publishedDate": "2024-05-08",
  "authorsCount": 47,
  "firstAuthor": "Abramson Josh",
  "citationCount": 450,
  "referenceCount": 86,
  "issn": "0028-0836",
  "url": "https://doi.org/10.1038/s41586-024-07301-x"
}

Example 2: Conference Proceedings

{
  "doi": "10.1109/CVPR52688.2022.00988",
  "title": "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks",
  "type": "proceedings-article",
  "publisher": "IEEE",
  "journal": "2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)",
  "publishedDate": "2022-06-19",
  "authorsCount": 3,
  "firstAuthor": "Lu Jiasen",
  "citationCount": 2100,
  "referenceCount": 52,
  "issn": "2575-7075",
  "url": "https://doi.org/10.1109/CVPR52688.2022.00988"
}

Example 3: Book

{
  "doi": "10.1016/b978-0-08-102618-8.00001-3",
  "title": "Sustainable Materials and Manufacturing",
  "type": "book",
  "publisher": "Elsevier",
  "journal": "",
  "publishedDate": "2023-09-15",
  "authorsCount": 12,
  "firstAuthor": "Smith Richard",
  "citationCount": 85,
  "referenceCount": 203,
  "issn": "",
  "url": "https://doi.org/10.1016/b978-0-08-102618-8.00001-3"
}

API Reference

For detailed Crossref API documentation, see https://github.com/CrossRef/rest-api-doc


Disclaimer: This Actor fetches data from Crossref (https://www.crossref.org), a non-profit digital object identifier (DOI) registration agency. Crossref data is provided under the CC0 1.0 Universal (Public Domain Dedication) license and is free to use for any purpose. Crossref's terms: https://www.crossref.org/documentation/metadata-plus-service/metadata-plus-service-terms-and-conditions/

Support: If you encounter issues:

  1. Check the Crossref API documentation: https://github.com/CrossRef/rest-api-doc
  2. Test your query directly: https://search.crossref.org
  3. Verify work types: https://github.com/CrossRef/rest-api-doc#work-types
  4. Open an issue on Apify Community or contact support

User-Agent: This Actor identifies itself as apify-factory/1.0 (mailto:bciccarelli6@gmail.com) to access Crossref's polite pool (higher rate limits for well-behaved agents).


Built with ❤️ for researchers, academics, and bibliometricians.