Crossref Scraper — DOI Metadata for Academic Papers

Pricing

Pay per usage

Scrape Crossref — largest DOI registry for academic literature. Modes: search works, DOI lookup, journal metadata, funder info, affiliation search. Extracts titles, authors, DOIs, ISSN, references, citations. Official REST API, no auth, 50 req/sec. For research & citation analysis.

Developer

OpenClaw Mara

Maintained by Community

Last modified: 6 days ago

🔗 Crossref Scraper — DOI Metadata for 150M+ Scholarly Works

Structured metadata, citation counts, and author affiliations from the world's largest DOI registry. $0.004 per article.

Scrape Crossref — the DOI registration agency for scholarly publishers — for titles, authors, DOIs, publication dates, journal info, citation counts, references, and funder data across 150M+ articles, books, datasets, and preprints.

Uses the official Crossref REST API. No auth required. Free for non-commercial use under Crossref's public data policy.

🚀 What does this Actor do?

Crossref is the backbone of academic citation infrastructure: most journal articles, conference papers, and book chapters, along with many preprints and some datasets, carry a Crossref DOI. This Actor turns Crossref into a programmable source in two modes:

  • Search — Keyword search over the metadata of 150M+ works (Crossref indexes metadata, not article full text), with year range, citation count, and sort filters.
  • DOI lookup — Bulk-fetch metadata for a list of DOIs you already have (bibliographies, reference lists, dataset citations).
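
As a rough illustration of what the two modes correspond to on the Crossref side, the sketch below builds the request URLs for a keyword search and a single-DOI lookup. The `/works` endpoint and its `query`, `rows`, `sort`, `order`, and `filter` parameters (with `from-pub-date` and `until-pub-date`) belong to the public Crossref REST API. The function names are hypothetical, and how this Actor actually maps its inputs onto the API is an assumption, since the listing does not document it.

```python
from urllib.parse import quote, urlencode

CROSSREF_API = "https://api.crossref.org"


def search_url(query, rows=20, from_year=None, to_year=None, sort_by="relevance"):
    """Build a /works search URL for one keyword query (hypothetical helper)."""
    params = {"query": query, "rows": rows, "sort": sort_by, "order": "desc"}
    filters = []
    if from_year is not None:
        filters.append(f"from-pub-date:{from_year}")
    if to_year is not None:
        filters.append(f"until-pub-date:{to_year}")
    if filters:
        # Crossref expects multiple filters as one comma-separated value.
        params["filter"] = ",".join(filters)
    return f"{CROSSREF_API}/works?{urlencode(params)}"


def doi_url(doi):
    """Build a /works/{doi} lookup URL; the DOI is encoded as one path segment."""
    return f"{CROSSREF_API}/works/{quote(doi, safe='')}"


print(search_url("retrieval augmented generation", from_year=2022, to_year=2026,
                 sort_by="published"))
print(doi_url("10.1038/s41586-023-06747-5"))
```

The sketch only constructs URLs and sends nothing over the network.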

Unlike Google Scholar (no official API) or Semantic Scholar (focused on the citation graph), Crossref holds the metadata that publishers themselves deposit: bibliographic details, funder data, and a citation count derived from deposited reference lists.

💡 Use Cases

1. Systematic literature review automation

Pull every paper on a topic within a year range and feed the metadata into a structured review table.

{
  "searchQueries": ["retrieval augmented generation"],
  "maxResults": 500,
  "fromYear": 2022,
  "toYear": 2026,
  "sortBy": "published"
}

2. Citation graph building

You have a list of DOIs (from a bibliography, a dataset, a search result). Pull full metadata + reference lists.

{
  "dois": [
    "10.1038/s41586-023-06747-5",
    "10.1126/science.aap8731"
  ]
}
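
DOIs copied from bibliographies often arrive as resolver URLs, with a `doi:` prefix, or in mixed case. The sketch below normalizes and de-duplicates such a list before it goes into the `dois` input. The helper names are hypothetical. Whether the Actor performs similar cleanup itself is not stated in this listing, so treat this as an optional pre-processing step.

```python
import re

_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Strip resolver and 'doi:' prefixes and lowercase (DOIs are case-insensitive)."""
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    # A DOI is "10.", a registrant code, a slash, then a non-empty suffix.
    return doi if _DOI_SHAPE.match(doi) else None


def build_doi_input(raw_dois):
    """Return an input object with unique, normalized DOIs in first-seen order."""
    seen = {}
    for raw in raw_dois:
        doi = normalize_doi(raw)
        if doi is not None:
            seen.setdefault(doi, None)  # dict keys keep insertion order
    return {"dois": list(seen)}


print(build_doi_input([
    "https://doi.org/10.1038/S41586-023-06747-5",
    "doi:10.1126/science.aap8731",
    "10.1038/s41586-023-06747-5",
]))
```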

3. Research analytics dashboard

Track publication volume and citation trends for a field, author, or publisher over time.

{
  "searchQueries": ["large language models"],
  "maxResults": 1000,
  "fromYear": 2020,
  "sortBy": "is-referenced-by-count",
  "minCitations": 10
}
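
Once a run like the one above has produced records, a dashboard mostly needs per-year aggregates. Here is a minimal sketch that assumes records shaped like the Output Example (with `publishedYear` and `isReferencedByCount`). The function name is hypothetical.

```python
from collections import defaultdict


def yearly_stats(records):
    """Group records by publishedYear; report paper count and mean citation count."""
    buckets = defaultdict(list)
    for rec in records:
        year = rec.get("publishedYear")
        if year is None:
            continue  # skip records without a usable publication year
        buckets[year].append(rec.get("isReferencedByCount") or 0)
    return {
        year: {"papers": len(counts), "meanCitations": sum(counts) / len(counts)}
        for year, counts in sorted(buckets.items())
    }


sample = [
    {"publishedYear": 2020, "isReferencedByCount": 10},
    {"publishedYear": 2020, "isReferencedByCount": 30},
    {"publishedYear": 2021, "isReferencedByCount": 5},
]
print(yearly_stats(sample))
```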

4. Publisher / journal monitoring

Feed a Slack channel new publications in a domain, weekly.

{
  "searchQueries": ["CRISPR gene editing"],
  "maxResults": 100,
  "fromYear": 2026,
  "sortBy": "published"
}
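
A weekly digest should post only what is new since the previous run. The sketch below keeps a set of DOIs already seen and formats the remainder as a plain-text message. It assumes records shaped like the Output Example. The function names are hypothetical, and both the delivery step (Slack webhook, email, and so on) and the persistence of the seen set are deliberately left out.

```python
def new_since_last_run(records, seen_dois):
    """Return records whose DOI is not in seen_dois, and record them as seen."""
    fresh = [rec for rec in records if rec["doi"] not in seen_dois]
    seen_dois.update(rec["doi"] for rec in fresh)
    return fresh


def format_digest(records):
    """Render one line per record: title, venue, DOI link."""
    lines = [f"{len(records)} new publication(s):"]
    for rec in records:
        venue = rec.get("containerTitle") or "unknown venue"
        lines.append(f"- {rec['title']} ({venue}) {rec['url']}")
    return "\n".join(lines)


seen = set()
batch = [{"doi": "10.1000/example-1", "title": "Example paper",
          "containerTitle": "Example Journal",
          "url": "https://doi.org/10.1000/example-1"}]  # placeholder record
print(format_digest(new_since_last_run(batch, seen)))
```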

📊 Output Example

{
  "doi": "10.1038/s41586-023-06747-5",
  "title": "A foundation model of transcription across human cell types",
  "authors": [
    { "given": "Xi", "family": "Fu", "affiliation": ["Columbia University"] },
    { "given": "Shentong", "family": "Mo", "affiliation": ["Carnegie Mellon University"] }
  ],
  "publishedDate": "2025-01-15",
  "publishedYear": 2025,
  "containerTitle": "Nature",
  "publisher": "Springer Science and Business Media LLC",
  "type": "journal-article",
  "volume": "637",
  "issue": "8047",
  "page": "965-973",
  "isReferencedByCount": 142,
  "referencesCount": 67,
  "url": "https://doi.org/10.1038/s41586-023-06747-5",
  "abstract": "The human genome contains...",
  "subject": ["Multidisciplinary"],
  "funder": [
    { "name": "National Institutes of Health", "award": ["R01HG012875"] }
  ],
  "issn": ["0028-0836", "1476-4687"],
  "license": [{ "URL": "https://creativecommons.org/licenses/by/4.0/" }]
}
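
To show how a record like the one above might be consumed, here is a small sketch that turns it into a one-line citation string. The function name and the citation layout are illustrative choices, not something the Actor provides. Fields other than `title` and `url` are treated as optional.

```python
def format_citation(rec):
    """Render 'Family G., ... (Year). Title. Venue Vol(Issue), pages. URL'."""
    names = []
    for author in rec.get("authors") or []:
        family = author.get("family", "")
        given = author.get("given", "")
        initial = f" {given[0]}." if given else ""
        names.append(f"{family}{initial}")
    author_part = ", ".join(names) if names else "Unknown author"

    location = rec.get("containerTitle", "")
    if rec.get("volume"):
        location += f" {rec['volume']}"
    if rec.get("issue"):
        location += f"({rec['issue']})"
    if rec.get("page"):
        location += f", {rec['page']}"

    year = rec.get("publishedYear", "n.d.")
    return f"{author_part} ({year}). {rec['title']}. {location}. {rec['url']}"


record = {
    "title": "A foundation model of transcription across human cell types",
    "authors": [{"given": "Xi", "family": "Fu"}, {"given": "Shentong", "family": "Mo"}],
    "publishedYear": 2025,
    "containerTitle": "Nature",
    "volume": "637",
    "issue": "8047",
    "page": "965-973",
    "url": "https://doi.org/10.1038/s41586-023-06747-5",
}
print(format_citation(record))
```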

⚙️ Input Parameters

| Parameter | Type | Description |
|---|---|---|
| searchQueries | array | Keywords/phrases (e.g. ["CRISPR gene editing", "large language models"]) |
| dois | array | Specific DOIs to fetch metadata for (e.g. ["10.1038/s41586-023-06747-5"]) |
| maxResults | int | Results per query (default 50, max 1000) |
| sortBy | enum | relevance (default), published, is-referenced-by-count |
| fromYear | int | Only articles published from this year |
| toYear | int | Only articles published up to this year |
| minCitations | int | Only include articles with ≥ N citations (default 0) |
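
The documented constraints can be checked before starting a run. The sketch below validates an input object against the defaults and limits listed in the parameter table. The function name is hypothetical. The lower bound of 1 for `maxResults` and the rule that at least one of `searchQueries` or `dois` must be present are assumptions, not documented behavior.

```python
ALLOWED_SORT = {"relevance", "published", "is-referenced-by-count"}


def validate_input(run_input):
    """Return a list of problems found; an empty list means the input looks valid."""
    errors = []
    if not run_input.get("searchQueries") and not run_input.get("dois"):
        errors.append("provide searchQueries or dois")  # assumed requirement

    max_results = run_input.get("maxResults", 50)
    if not isinstance(max_results, int) or not 1 <= max_results <= 1000:
        errors.append("maxResults must be an integer from 1 to 1000")

    if run_input.get("sortBy", "relevance") not in ALLOWED_SORT:
        errors.append("sortBy must be one of: " + ", ".join(sorted(ALLOWED_SORT)))

    from_year, to_year = run_input.get("fromYear"), run_input.get("toYear")
    if from_year is not None and to_year is not None and from_year > to_year:
        errors.append("fromYear must not be later than toYear")

    if run_input.get("minCitations", 0) < 0:
        errors.append("minCitations must be 0 or greater")
    return errors


print(validate_input({"searchQueries": ["CRISPR gene editing"], "maxResults": 5000}))
```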

📤 Output Fields

| Field | Description |
|---|---|
| doi | Digital Object Identifier (canonical) |
| title | Article title |
| authors[] | { given, family, affiliation[] } per author |
| publishedDate, publishedYear | ISO date + year |
| containerTitle | Journal / conference / book name |
| publisher | Publisher name |
| type | journal-article, book-chapter, proceedings-article, dataset, preprint, etc. |
| volume, issue, page | Bibliographic location |
| isReferencedByCount | Citation count (from Crossref) |
| referencesCount | Number of references in the article |
| url | DOI resolver URL |
| abstract | When available (~30% of articles) |
| subject[] | Subject categories |
| funder[] | Funder + award/grant IDs |
| issn[] | Journal ISSNs |
| license[] | Open-access license info when present |

💰 Pricing & Performance

  • Pay-per-event: $0.004 per article.
  • Typical cost: $4 for 1000 articles — a full systematic review for the price of a coffee.
  • Speed: ~100–150 articles/minute via the official Crossref REST API.
  • No auth required — Crossref API is fully open (with polite-pool pacing for sustained usage).
  • Bulk-friendly — up to 1000 results per search query.
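
The figures above translate into a quick budget estimate. This sketch uses only the numbers stated in this section ($0.004 per article, roughly 100 to 150 articles per minute). The function name is hypothetical, and real run time will vary.

```python
PRICE_PER_ARTICLE_USD = 0.004
ARTICLES_PER_MINUTE = (100, 150)  # slow and fast ends of the stated range


def estimate_run(num_articles):
    """Estimate cost in USD and the (fastest, slowest) run time in minutes."""
    slow, fast = ARTICLES_PER_MINUTE
    return {
        "costUsd": round(num_articles * PRICE_PER_ARTICLE_USD, 2),
        "minutes": (round(num_articles / fast, 1), round(num_articles / slow, 1)),
    }


print(estimate_run(1000))
```

For 1000 articles this gives a cost of $4.00 and a run time of roughly 7 to 10 minutes.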

🔌 Integrations

  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed titles + abstracts for semantic scholarly search.
  • LangChain / LlamaIndex — power a "research assistant" RAG over a year-range corpus.
  • Neo4j / graph DBs — build a citation network: author → paper → references → paper.
  • Zapier / Make / n8n — weekly "new papers in my field" digest to Slack, email, or Notion.
  • Systematic review tools (Covidence, Rayyan) — bulk-import metadata from a search.
  • Airbyte / Fivetran — load structured metadata into a data warehouse for bibliometrics.
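
For the graph-database case, the first step is usually to flatten records into edges. The sketch below derives author-to-paper edges from records shaped like the Output Example. The function name and the `AUTHORED` label are hypothetical. Paper-to-paper reference edges are left out because this listing does not name the field that carries the deposited reference list.

```python
def author_paper_edges(records):
    """Return (author name, 'AUTHORED', doi) triples, one per author per record."""
    edges = []
    for rec in records:
        for author in rec.get("authors") or []:
            parts = (author.get("given"), author.get("family"))
            name = " ".join(part for part in parts if part)
            if name:
                edges.append((name, "AUTHORED", rec["doi"]))
    return edges


records = [{
    "doi": "10.1038/s41586-023-06747-5",
    "authors": [{"given": "Xi", "family": "Fu"}, {"given": "Shentong", "family": "Mo"}],
}]
for edge in author_paper_edges(records):
    print(edge)
```

Author names are not unique identifiers, so a production graph would need some form of disambiguation that this sketch does not attempt.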

🧭 DOI Prefix Reference

  • 10.1038 — Nature Publishing Group
  • 10.1126 — AAAS (Science)
  • 10.1145 — ACM
  • 10.1109 — IEEE
  • 10.48550 — arXiv preprints (arXiv DOIs are registered through DataCite rather than Crossref, so a Crossref lookup may return nothing for them)
  • 10.1101 — bioRxiv / medRxiv
  • 10.21105 — Journal of Open Source Software
  • Full prefix list: https://www.crossref.org/getting-started/prefix/
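
The prefix is everything before the first slash of a DOI, so the list above can serve as a small lookup table. This is a minimal sketch with a hypothetical function name, and the mapping contains only the prefixes listed here.

```python
KNOWN_PREFIXES = {
    "10.1038": "Nature Publishing Group",
    "10.1126": "AAAS (Science)",
    "10.1145": "ACM",
    "10.1109": "IEEE",
    "10.48550": "arXiv preprints",
    "10.1101": "bioRxiv / medRxiv",
    "10.21105": "Journal of Open Source Software",
}


def registrant_for(doi):
    """Look up the registrant for a DOI by its prefix (the part before the first '/')."""
    prefix = doi.split("/", 1)[0]
    return KNOWN_PREFIXES.get(prefix, "unknown")


print(registrant_for("10.1126/science.aap8731"))
```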

❓ FAQ

Why use Crossref over Google Scholar? Google Scholar has no official API and aggressively blocks scrapers. Crossref has a free, well-documented REST API that returns the canonical metadata deposited by publishers.

Crossref vs Semantic Scholar? Crossref = authoritative publisher metadata (title, authors, DOI, journal, date, basic citation count). Semantic Scholar = enriched citation graph + influential-citation signal. Use both for a complete picture — this Actor's dois[] mode pairs perfectly with the Semantic Scholar scraper.

Does every article have an abstract? About 30% do. Crossref's abstract coverage depends on the publisher's metadata submission. For guaranteed abstracts, follow up with arXiv / PubMed / Semantic Scholar.
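
Because most records arrive without an abstract, a follow-up step needs to know which DOIs to look up elsewhere. The sketch below separates the records that have an abstract from the DOIs that still need one. It assumes records shaped like the Output Example and uses a hypothetical function name.

```python
def split_by_abstract(records):
    """Return (records with a non-empty abstract, DOIs that still need one)."""
    with_abstract, missing_dois = [], []
    for rec in records:
        if (rec.get("abstract") or "").strip():
            with_abstract.append(rec)
        else:
            missing_dois.append(rec["doi"])
    return with_abstract, missing_dois


have, need = split_by_abstract([
    {"doi": "10.1000/a", "abstract": "The human genome contains..."},
    {"doi": "10.1000/b", "abstract": ""},
    {"doi": "10.1000/c"},
])
print(len(have), need)
```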

Can I pull references (who this paper cites)? Partially — referencesCount is always returned; the full reference list is included when the publisher deposits it with Crossref (roughly half of articles). For guaranteed reference lists, use OpenCitations or the publisher's native API.

What's the difference between published sort and fromYear filter? sortBy: published orders results newest-first. fromYear/toYear restricts the year range. Use them together for "newest papers in the last 3 years."
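
Combining the two for "newest papers in the last 3 years" can be sketched as below. The function name is hypothetical, and counting the current year as one of the N years is a choice made for this example.

```python
from datetime import date


def newest_in_last_years(query, years, today=None, max_results=100):
    """Build an input object: restrict to the last `years` calendar years, newest first."""
    today = today or date.today()
    return {
        "searchQueries": [query],
        "maxResults": max_results,
        "fromYear": today.year - years + 1,  # current year counts as year 1
        "toYear": today.year,
        "sortBy": "published",
    }


print(newest_in_last_years("CRISPR gene editing", 3))
```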

Rate limits? Crossref asks clients to identify themselves for sustained high-volume usage (the "polite pool"). The Actor handles this automatically: you just set maxResults and wait.
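
For readers calling Crossref directly, polite-pool identification means supplying a contact email, either as a `mailto` query parameter or inside the `User-Agent` header. The sketch below builds such a request without sending it. The function name and the client name `example-client/0.1` are placeholders, and how this Actor identifies itself internally is not documented here.

```python
import urllib.request
from urllib.parse import urlencode


def polite_request(query, contact_email, rows=20):
    """Build (but do not send) a Crossref search request that identifies the caller."""
    params = urlencode({"query": query, "rows": rows, "mailto": contact_email})
    return urllib.request.Request(
        f"https://api.crossref.org/works?{params}",
        headers={"User-Agent": f"example-client/0.1 (mailto:{contact_email})"},
    )


req = polite_request("large language models", "you@example.org")
print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request. That step is omitted so the example runs offline.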

🔑 Keywords

Crossref scraper, Crossref API, DOI lookup, DOI metadata, scholarly articles API, academic paper metadata, citation count scraper, bibliometrics data, systematic review automation, literature review scraper, journal article scraper, publisher metadata, publication tracking, citation graph, research analytics, scientific publication data, open science metadata, Crossref bulk export, DOI resolver API, scholarly search API, RAG over research papers.

📝 Changelog

  • v1.0 — Initial release. Keyword search + DOI lookup modes, year range filter, citation count filter, 3 sort modes, up to 1000 results per search.