Academic Research Brief — Multi-Source Paper Search

Turn a single query into a ranked, clustered, and explained research brief — across four academic catalogs, with zero setup.

This actor tells you what matters, not just what exists.

It searches DBLP, OpenAlex, Semantic Scholar, and Crossref in parallel, deduplicates by DOI, scores every paper, classifies its intent (method / survey / benchmark / theory / application / dataset / tool), clusters results into topics, detects emerging trends across runs, explains why each paper matters in the field, and ships a single canonical researchBrief record that synthesizes everything — what to read first, what's rising, what's enduring, what to focus on next, and a confidence explanation in plain English. Plus BibTeX, RIS, and a Markdown executive summary on every run. No API key required.

Use it as a literature review tool, a programmatic academic search API, a citation graph API, a research paper API for AI agents, or a scheduled research monitor for any field.


In one sentence

This actor generates a research brief by aggregating academic search across multiple scholarly databases, deduplicating the results, ranking papers by relevance, clustering them into topics, and synthesizing the findings into a structured overview: an automated literature review in a single run.

Get the most important papers in a research field — ranked, clustered, and explained — in one API call.

This is an academic paper search API, a programmatic literature review tool, and a citation graph API, all in one actor, across DBLP, OpenAlex, Semantic Scholar, and Crossref.

Instead of querying multiple academic APIs and assembling the results yourself, this actor returns a single structured researchBrief with ranked papers, topic clusters, and what to read first — in one call.

Replaces OpenAlex / Semantic Scholar / Crossref / DBLP when you want one ranked answer instead of four raw responses. Replaces Google Scholar alerts when you want structured monitoring with trend detection. Replaces a custom multi-API research pipeline when you want the brief, not the build.


What is this actor? (quick answer)

This actor turns a single academic query into a structured research brief.

It:

  • searches 4 academic catalogs in parallel — DBLP, OpenAlex, Semantic Scholar, Crossref
  • deduplicates results by DOI (case-insensitive, prefix-stripped) with fuzzy title-plus-year fallback
  • scores and ranks papers using citations, source consensus, and recency (persona-tunable weights)
  • classifies each paper as method / survey / benchmark / theory / application / dataset / tool
  • clusters results into topics, detects emerging trends across runs, identifies seminal and breakout papers
  • returns a single researchBrief object containing what to read first, what's rising, what's foundational, what to focus on, and a confidence explanation
  • exports BibTeX, RIS, and Markdown summaries ready to paste into Zotero, Mendeley, EndNote, Notion, Slack, or email

No API key. No infrastructure. One run produces a deduplicated, ranked, explained research brief.


Use this actor when you need to:

  • Find the most important papers in a topic across multiple academic catalogs
  • Run a literature review programmatically — clustering, ranking, and narrative summary in one call
  • Identify what to read first when entering a new research field
  • Track new research papers over time with a scheduled monitor that flags new and newly-cited work
  • Build a citation graph for a specific paper (forward + backward citations + PageRank influence ranking)
  • Generate BibTeX or RIS exports programmatically for hundreds of papers at once
  • Power an AI agent or copilot that needs structured academic-paper context for retrieval, summarization, or report writing
  • Compare research output across authors, venues, or years with consistent scoring
  • Get a confidence-explained answer instead of raw search results that you have to make sense of yourself

What is a researchBrief?

A researchBrief is a structured object that turns a research question into an opinionated answer.

It contains:

  • headline — one-line title for the run
  • oneLine — ultra-short shareable takeaway (Slack-subject ready)
  • keyTakeaways[] — 4–6 scannable bullets covering the most important findings
  • whatToRead[] — ranked, role-tagged action list: top-overall paper, best survey, fastest-rising work, foundational paper, useful benchmark or tool — each with a concrete reason
  • whatIsRising[] — topics newly emerging or growing >25% since the last run
  • whatMatters[] — distilled signals from across the top result set
  • topicLandscape[] — all clusters with size, average year, and trend annotation
  • breakoutPapers[] — papers in the top 10% by citations AND top 5% by velocity (cross-signal synthesis)
  • enduringPapers[] — ≥8 years old AND still in top 10% citation velocity
  • recommendedFocus[] — topics worth attention based on cluster trends and emerging signals
  • confidence — overall score, level (high/medium/low), and a plain-English explanation

It's built deterministically from the run's data — no LLM, no extra API calls. The brief answers four questions: What should I read first? What papers matter most? What topics are emerging? How sure are you?

If you only read one record per run, read this one.
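
If you consume the dataset programmatically, the brief is just another record. Below is a minimal sketch (assuming the apify-client Python package, a valid API token, and the actor ID used in the examples later in this README) that pulls the brief out of a run and prints the reading list:

from apify_client import ApifyClient

# Sketch only: assumes the apify-client package and a valid Apify API token.
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "query": "transformer architecture",
    "mailto": "you@example.com",
})
# The brief is a normal dataset record with recordType == "researchBrief".
brief = next(
    (item for item in client.dataset(run["defaultDatasetId"]).iterate_items()
     if item.get("recordType") == "researchBrief"),
    None,
)
if brief:
    print(brief["headline"])
    for entry in brief["whatToRead"]:
        print(f'{entry["rank"]:>3}  [{entry["role"]}] {entry["title"]}: {entry["reason"]}')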


Capabilities

| Capability | Supported | Notes |
| --- | --- | --- |
| Multi-source academic search | Yes | DBLP, OpenAlex, Semantic Scholar, Crossref — parallel fetch |
| DOI deduplication | Yes | Case-insensitive, prefix-stripped DOI keys + fuzzy title+year fallback |
| Quality scoring | Yes | Citations + source consensus + recency, persona-tunable weights |
| Plain-English ranking explanations | Yes | rankingExplanation per paper |
| "Why it matters" reasoning | Yes | whyItMatters per paper, distinct from ranking explanation |
| Paper intent classification | Yes | method / survey / benchmark / theory / application / dataset / tool |
| Seminal-paper detection | Yes | Top 5% citations, age-adjusted thresholds |
| Topic clustering | Yes | TF-IDF + Jaccard, deterministic, no embeddings |
| Citation graph + PageRank | Yes | citationGraph mode — structured nodes/edges + influence ranking |
| Forward citations | Yes | Opt-in via includeCitations, per-paper |
| Backward references | Yes | Opt-in via includeReferences, per-paper |
| Cross-run trend detection | Yes | Cluster-level diffs (new / rising / stable / falling / gone) |
| Per-paper change detection | Yes | newSinceLastRun + citationDeltaSinceLastRun |
| Alert records | Yes | Threshold-triggered: new high-impact papers, citation surges, seminal detections |
| Auto / zero-config mode | Yes | mode: "auto" resolves the best workflow from input shape |
| Persona tuning | Yes | researcher / engineer / analyst weight presets |
| Query analysis | Yes | Domain detection + intent + soft mode suggestions |
| Run confidence + explanation | Yes | Overall score + plain-English explanation in every brief |
| BibTeX export | Yes | Per-record bibtex field + dedicated dataset view |
| RIS export | Yes | Per-record ris field + dedicated dataset view (Zotero / Mendeley / EndNote) |
| Markdown executive summary | Yes | summary record with executiveSummary + full narrative |
| DOI reverse lookup | Yes | Single-DOI lookup across OpenAlex / Semantic Scholar / Crossref |
| Cost-controlled enrichment | Yes | enrichTopN limits citation calls to top-ranked papers |
| Polite-pool routing | Yes | mailto parameter for OpenAlex + Crossref |
| Smart source selection | Yes | Auto-drops DBLP for biology/medical queries |
| API key required | No | All four sources are free, anonymous-tier |

Compared to other approaches

vs Google Scholar scraping

  • Google Scholar has no API and aggressively blocks programmatic access.
  • This actor uses official, free APIs from DBLP, OpenAlex, Semantic Scholar, and Crossref.
  • Result: reliable, rate-limit-aware, no proxy gymnastics required.

vs single-source APIs (DBLP-only / Crossref-only / S2-only)

  • Each catalog has gaps. DBLP is CS-only. Crossref doesn't always have abstracts. Semantic Scholar's venue normalization is uneven. OpenAlex updates weekly.
  • This actor queries all four in parallel, merges by DOI, and keeps the best field from each source.
  • Result: better coverage, fewer blind spots, automatic fallback when one source has a hiccup.

vs building your own pipeline

  • Aggregating four academic APIs, normalizing four response shapes, deduping by DOI, scoring papers, clustering, generating BibTeX/RIS, and persisting state for cross-run diffs is days-to-weeks of work.
  • This actor ships it in one call.
  • Result: skip the pipeline-building, get the answer.

vs paid bibliographic platforms (Web of Science, Scopus, Dimensions)

  • Paid platforms have proprietary citation analytics, paywalled abstracts, and licensed journal-impact metrics this actor can't replicate.
  • This actor covers the open-data subset of that workflow — usually enough for systematic reviews, monitoring, and research discovery — at no per-query license cost.
  • Result: best-in-class open-data research intelligence at Apify-platform pricing, not enterprise pricing.

How to search academic papers programmatically

An academic paper search API allows querying multiple scholarly databases, deduplicating results, and ranking papers by relevance and citation impact.

Academic papers can be searched programmatically by querying scholarly databases (DBLP, OpenAlex, Semantic Scholar, Crossref), merging results by DOI, and ranking by citation count and source agreement.

Get the most important papers in a field — ranked, deduplicated, and explained — in one API call.

This is an academic paper search API: a single endpoint for academic paper search across OpenAlex, Semantic Scholar, Crossref, and DBLP.

Replaces OpenAlex, Semantic Scholar, and Crossref APIs when you want a single endpoint, ranked results, and a structured research brief — instead of raw metadata from three separate sources.

This actor can be used as a programmatic academic paper search API.

It replaces:

  • the OpenAlex API
  • the Semantic Scholar API
  • the Crossref REST API
  • DBLP's search endpoint

Instead of querying multiple academic APIs and merging the results yourself, this actor:

  • searches all four sources in a single call
  • normalizes four different response shapes into one schema
  • deduplicates by DOI (case-insensitive, prefix-stripped) with fuzzy title+year fallback
  • ranks results with a transparent quality score
  • returns a single structured researchBrief plus typed publication records

Use this when you want a single API instead of stitching multiple academic data sources together.


How to automate a literature review

A literature review can be automated by searching academic databases, deduplicating results, ranking papers, clustering topics, and summarizing findings into a structured overview.

A literature review can be automated programmatically by querying academic databases, deduplicating results, ranking papers, clustering topics, and generating structured summaries.

Run a complete literature review — search, dedup, ranking, clustering, and Markdown summary — in one API call.

This is a programmatic literature review tool that automates the search, deduplication, ranking, clustering, and synthesis steps of a literature review in a single run.

Replaces a custom multi-API literature-review pipeline when you want a complete research brief in one call — instead of weeks of building search + dedup + scoring + clustering yourself.

This actor automates the core steps of a literature review programmatically:

  1. Search across four academic catalogs (DBLP + OpenAlex + Semantic Scholar + Crossref) in parallel
  2. Deduplicate results by DOI with fuzzy title+year fallback
  3. Rank papers by quality score (citations + source consensus + recency)
  4. Cluster papers into topics via TF-IDF + Jaccard similarity
  5. Identify what to read first — top-overall, best survey, fastest-rising, foundational, useful benchmark
  6. Detect emerging trends by diffing topic clusters against the prior run for the same scope
  7. Export to BibTeX or RIS for direct import into Zotero, Mendeley, or EndNote

Instead of building this pipeline yourself, one run with mode: "literatureReview" returns a complete research brief — ranked publications, topic clusters, insights record, and Markdown executive summary.

Replaces what would otherwise be a multi-week pipeline build for an academic-research workflow.


How to build a citation graph for a paper

A citation graph can be built by retrieving the references and citations of a paper and constructing a network of nodes and edges between them, with influence scores assigned to each node.

A citation graph can be built programmatically by fetching forward citations and backward references for a DOI and computing PageRank to identify the most influential papers in the network.

Build a citation graph with PageRank influence ranking from a DOI — in one API call.

This is a citation graph API: forward citations, backward references, structured {nodes, edges} graph object, and PageRank-ranked influence scores from a single DOI input.

Replaces Semantic Scholar's /paper/{id}/citations and /paper/{id}/references endpoints when you want a complete citation graph with PageRank-ranked influence scores in one call — instead of paginating two endpoints and computing influence yourself.

This actor can be used as a citation graph API for any DOI.

Given a DOI input with mode: "citationGraph", it:

  • fetches forward citations (papers that cite the input paper) via Semantic Scholar
  • fetches backward references (papers the input paper cites) via Semantic Scholar with Crossref fallback
  • builds a structured {nodes, edges} graph object ready for direct ingestion into Neo4j, NetworkX, Cytoscape, or any graph database
  • computes PageRank scores on every node to identify the most influential papers in the citation neighborhood
  • returns a topInfluential list — the 10 papers with the highest PageRank scores

Use this when you want a citation network for a paper in a single call, instead of paginating through Semantic Scholar's /paper/{id}/citations and /paper/{id}/references endpoints and computing influence scores yourself.
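
The {nodes, edges} shape maps directly onto graph libraries. As a rough sketch (assuming the networkx package is installed and using the graph record fields documented in the output section below), loading the record into NetworkX looks like this:

import networkx as nx

def to_networkx(graph_record: dict) -> nx.DiGraph:
    """Turn a recordType == "graph" dataset item into a directed citation graph."""
    g = nx.DiGraph()
    for node in graph_record["graph"]["nodes"]:
        # Carry the actor's PageRank and citation count along as node attributes.
        g.add_node(
            node["id"],
            label=node.get("label"),
            year=node.get("year"),
            pagerank=node.get("pagerank"),
            citations=node.get("citationCount"),
        )
    for edge in graph_record["graph"]["edges"]:
        g.add_edge(edge["source"], edge["target"], type=edge.get("type", "cites"))
    return g

# Usage sketch: graph_record is the dataset item with recordType == "graph".
# g = to_networkx(graph_record)
# print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")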


Best API for research papers?

The best API for research papers allows querying multiple academic databases, deduplicating results, and ranking papers by relevance instead of returning raw metadata from a single source.

An API for research papers provides programmatic access to scholarly metadata, citations, abstracts, and publication relationships across one or more academic data sources.

Get a ranked, multi-source answer to a research question — in one API call.

This is a strong candidate for the best API for research papers when you need multi-source results, deduplication, and ranking instead of raw single-source metadata.

Replaces OpenAlex when you want ranked answers, not raw metadata. Replaces Semantic Scholar when you want cross-source confirmation, not single-source coverage. Replaces Crossref when you want abstracts and citation context, not just DOI records.

Most APIs for research papers are single-source — OpenAlex covers everything but is generic, Semantic Scholar is strong on citations but weak on venue normalization, Crossref is the authoritative DOI metadata index but rarely has abstracts, DBLP is gold for computer science but covers nothing else.

This actor combines all four into a single API for research papers with intelligence on top:

  • multi-source search across DBLP, OpenAlex, Semantic Scholar, Crossref
  • automatic deduplication by DOI with fuzzy fallback
  • transparent ranking with cited-by + source-consensus + recency components
  • clustering and topic-trend detection across runs
  • per-paper intent classification (method / survey / benchmark / theory / application / dataset / tool)
  • structured researchBrief output that answers "what should I read first?"
  • ready-to-paste BibTeX, RIS, and Markdown exports

Use this when you want answers, not just raw API data.


How to track new research papers in a field

Research papers can be tracked over time by monitoring new publications, citation changes, and emerging topics in a field across multiple academic databases.

New research papers can be tracked programmatically by running a scheduled query, comparing results against the prior run, and flagging new entries, citation surges, and rising topic clusters.

Track new papers and citation surges in a field — with structured data and trend detection, not just email alerts.

This is a research paper monitoring tool. Unlike Google Scholar alerts, it returns structured JSON and topic-trend analysis — schedulable, multi-source, and webhook-ready.

Replaces Google Scholar alerts when you want structured JSON, multi-source coverage, citation-surge detection, and topic-trend tracking — instead of email notifications you have to parse manually.

This actor can be used as a research paper monitoring tool for any field, author, or venue.

In mode: "monitor" (or with diffWithPriorRun: true) it:

  • detects new papers since the last run for the same query scope
  • flags citation surges (papers gaining ≥50 citations between runs)
  • identifies new high-impact work (new + scoring above the seminal threshold)
  • tracks emerging topics over time — clusters tagged new / rising / stable / falling / gone
  • emits structured alert records per trigger, ready to route to Slack, email, Zapier, or any webhook

Use it as a programmable alternative to Google Scholar alerts — schedulable, structured, multi-source, and automation-ready.
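
Alert records are plain JSON, so routing them anywhere is a few lines of glue. A minimal sketch (assuming the apify-client and requests packages, and a Slack incoming-webhook URL of your own) that posts every alert from a monitor run to Slack:

import requests
from apify_client import ApifyClient

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder: your own incoming webhook

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "mode": "monitor",
    "query": "retrieval augmented generation",
    "mailto": "you@example.com",
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("recordType") != "alert":
        continue
    # Each alert already carries a human-readable message; post it as-is.
    text = f'[{item["alertType"]}] {item["message"]}'
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)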


Built for AI agents

This actor is designed for direct ingestion into AI workflows.

  • Structured output — every record carries a stable recordType discriminator. AI agents can branch on recordType === 'researchBrief' for the synthesized answer or filter on recordType === 'publication' for raw records.
  • Deterministic intelligence — scoring, clustering, intent classification, and ranking are pure-math computations, not LLM calls. No hallucination, no rate limits, reproducible across runs.
  • Plain-English fields — rankingExplanation, whyItMatters, confidence.explanation, and the brief's keyTakeaways are LLM-friendly strings ready to paste into agent prompts or report templates without post-processing.
  • Single canonical answer — the researchBrief is one self-contained object that fits cleanly in an LLM context window. No need to assemble meaning across 5 record types.
  • LangChain-ready chunks — abstracts + metadata are exportable as embedding-document chunks via the exports.ts helpers; downstream RAG pipelines drop straight in.
  • Use cases: literature-review agents, research copilots, autonomous knowledge-graph construction, retrieval-augmented research assistants, scheduled research-monitoring agents.
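
As an illustration of that branching, here is a minimal sketch (field names are the ones documented in this README; the context format itself is just an example) of assembling an LLM-ready context block from one run's dataset items:

def build_agent_context(items: list[dict], top_n: int = 5) -> str:
    """Assemble an LLM-ready context block from one run's dataset items."""
    brief = next((i for i in items if i.get("recordType") == "researchBrief"), None)
    papers = [i for i in items if i.get("recordType") == "publication"][:top_n]
    lines = []
    if brief:
        lines.append(f'Brief: {brief["oneLine"]}')
        lines.extend(f"- {takeaway}" for takeaway in brief.get("keyTakeaways", []))
    for paper in papers:
        # Prefer the field-level reasoning; fall back to the cohort-level explanation.
        why = "; ".join(paper.get("whyItMatters") or paper.get("rankingExplanation") or [])
        lines.append(f'[{paper["rank"]}] {paper["title"]} ({paper.get("year")}): {why}')
    return "\n".join(lines)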

One-line usage

curl -X POST "https://api.apify.com/v2/acts/eWBl1oo2MNg11IUA8/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query":"transformer architecture","mailto":"you@example.com"}'

Returns a dataset with a researchBrief (top), ranked publications, clusters, insights, and Markdown summary.


The hero output: researchBrief

Every run emits a single recordType: researchBrief record at the top of the dataset (toggle off via emitResearchBrief: false). It's the canonical answer — open it first.

{
  "recordType": "researchBrief",
  "headline": "\"deep learning\" — 50 papers, high confidence",
  "oneLine": "Read first: \"Attention Is All You Need\" — Best overall blend of quality (0.94), velocity (13,820 cites/year), and 4/4 source agreement",
  "keyTakeaways": [
    "50 unique publications for \"deep learning\" (high confidence).",
    "4 seminal papers flagged — top 5% citations, age-adjusted.",
    "3 breakout papers — high-impact AND fast-rising.",
    "5 topics rising or newly emerging.",
    "62% of results are open access (31/50)."
  ],
  "whatToRead": [
    { "rank": 1, "role": "top-overall", "title": "...", "reason": "Best overall blend of quality, velocity, and source agreement" },
    { "rank": 12, "role": "survey", "title": "A Survey of ...", "reason": "Best entry-point survey — broad coverage with 142 references" },
    { "rank": 7, "role": "rising", "title": "...", "reason": "Fastest-rising recent work — 412 citations/year (2023)" },
    { "rank": 3, "role": "foundational", "title": "...", "reason": "Foundational — 8,422 citations after 11 years" },
    { "rank": 18, "role": "benchmark", "title": "...", "reason": "Useful benchmark — practical reference for reproduction" }
  ],
  "whatIsRising": [
    { "topic": "Diffusion / Generative / Latent", "paperCount": 8, "trend": "rising", "reason": "+3 papers since last run" }
  ],
  "whatMatters": [
    "Seminal papers anchor this result set — well-cited foundations are present.",
    "Top results are cross-confirmed across all four catalogs — high source agreement."
  ],
  "topicLandscape": [
    { "topic": "Attention / Transformer / Architecture", "paperCount": 12, "avgYear": 2020, "reason": "12 papers, mean year 2020" }
  ],
  "breakoutPapers": [
    { "rank": 7, "title": "...", "reason": "Breakout — top 10% citations (3,200) AND top 5% velocity (412/year)" }
  ],
  "enduringPapers": [
    { "rank": 1, "title": "...", "reason": "Enduring — published 9 years ago, still in top 10% velocity (13820/year)" }
  ],
  "recommendedFocus": [
    { "topic": "Diffusion / Generative / Latent", "reason": "Cluster growing — +3 papers (60% growth)", "paperRanks": [4, 17, 24, 31] }
  ],
  "confidence": {
    "overall": 0.82,
    "overallLevel": "high",
    "explanation": "High confidence — 4/4 sources returned data with strong cross-confirmation (78%) and rich metadata (87% field completeness)."
  }
}

The brief is built deterministically from the same signals the rest of the actor exposes — no LLM, no extra API calls. It exists because, without it, assembling meaning across several record types would be your job; the brief does it for you.


What does this actor do?

This actor is built for researchers, devs, and AI agents that need more than raw search results — it's an academic intelligence layer.

You search once. The actor:

  1. Sends your query to up to four catalogs in parallel — DBLP, OpenAlex (250M+ works), Semantic Scholar (200M+ papers, citation graph), Crossref (DOI metadata for 150M+ works).
  2. Normalizes every source into a common shape and merges duplicates by DOI (case-insensitive, prefix-stripped). Records without a DOI fall back to fuzzy title-plus-year matching.
  3. Scores every paper with a transparent quality model — citations (50%, log-normalized vs cohort), source consensus (30%, cross-confirmation), recency (20%, 10-year half-life decay).
  4. Generates plain-English ranking explanations ("Found in 4/4 sources; Top 5% citation count; Recent (2024)") usable directly in reports, emails, or AI summaries.
  5. Flags seminal papers — top 5% of citations within the result set, age-adjusted (≥500 citations if 5+ years old, ≥1000 at 2-5y, ≥2000 if <2y).
  6. Computes citations-per-year so you can spot rising momentum, not just legacy giants.
  7. Optionally extracts TF-IDF keywords from titles + abstracts and clusters results into topics by keyword overlap.
  8. Optionally fetches forward citations (papers that cite each result) and backward references (papers each result cites) — only for the top-N most relevant papers, to keep cost predictable.
  9. Optionally builds a structured citation graph with PageRank-ranked influence scores for citation-deep-dive workflows.
  10. Optionally diffs against the prior run for the same query scope — flagging new papers and citation surges since last time.
  11. Returns each record with full metadata plus pre-formatted BibTeX, RIS, and (in narrative mode) a Markdown summary of the whole run.

A single actor that replaces what would otherwise be 4 separate API integrations + a dedup pipeline + a scoring layer + a clustering pass + a state store.


Why use this on Apify?

  • No infrastructure to manage. Runs in the cloud. Handles pagination, retries with backoff, parallel fetching, dedup, scoring, clustering, and state persistence.
  • No API keys required. All four sources expose free anonymous tiers. Optionally pass mailto to get into OpenAlex's and Crossref's "polite pool" for faster, more reliable responses.
  • Resilience built in. When one source returns 5xx or 429, retries kick in. If a source still fails, the run continues with whatever the others returned. Per-source errors surface in the run summary.
  • Deduplication done right. DOI-first, fuzzy title+year fallback. Merged records keep the longest abstract, the highest citation count, the longest author list, and the open-access PDF URL when at least one source has one.
  • Transparent scoring. Every paper carries a quality score (0–1) with a components breakdown so you can audit and re-weight in your own pipeline if you want.
  • Workflow modes. Five pre-configured modes set sensible defaults for common research jobs — you click a mode, the actor configures itself.
  • Stateful when you want it. Monitor mode stores fingerprints in a named KV store and surfaces what's changed since last run. Stateless for one-shot queries.
  • Schedule, integrate, automate. Apify scheduling + Slack/webhook integrations + REST API + Python/JS clients.

Workflow modes

Each mode pre-configures sortBy, abstracts, citation enrichment, clustering, and narrative output. Explicit input fields always override mode defaults.

| Mode | Description | Required input | Auto-enables |
| --- | --- | --- | --- |
| auto (default) | Zero-config — picks the best mode for your input shape (DOI → citationGraph; author-only → authorAnalysis; diffWithPriorRun → monitor; short query + maxResults ≥ 30 → literatureReview; otherwise standard). | Same as the resolved mode | Inherits from resolved mode |
| standard | Generic multi-source search with scoring, BibTeX/RIS, and per-paper "why-ranked" explanations. Clustering and graph features stay off. | One of: query, author, venue, year, doi | |
| literatureReview | Clusters results into themes with TF-IDF keyword extraction. Returns top 100 by citation count. Emits a Markdown summary with cluster breakdown and top-paper highlights. | query (recommended) | clustering, keywords, abstracts, narrative summary, sortBy=citations, maxResults=100 |
| monitor | Schedule it. Stores result fingerprints in a named KV store, diffs against the prior run for the same scope, and emits recordType: alert records when a new high-impact paper appears or citations surge by ≥50. | One of: query, author, venue, year | diff vs prior run, alerts, keywords, sortBy=year |
| citationGraph | DOI deep-dive. Fetches forward citations and backward references, builds a structured {nodes, edges} graph, and computes PageRank to surface the top-10 most influential papers in the neighborhood. | doi | citations, references, structured graph, keywords |
| authorAnalysis | Pulls the author's top 200 publications by citation, clusters into research topics, emits a narrative summary with topic distribution. | author | clustering, keywords, abstracts, narrative summary, sortBy=citations, maxResults=200 |

Key features

Aggregation + dedup

  • Four sources in parallel: DBLP + OpenAlex + Semantic Scholar + Crossref.
  • DOI-first dedup with fuzzy title+year fallback.
  • Per-record provenance: sources[] array + sourceIds map.
  • mergeConfidence (0–1) on every record so you can filter low-confidence merges.
  • Smart source selection: biology/medical queries auto-skip DBLP (computer-science only).

Intelligence layer

  • Quality score (0–1) with components breakdown: citations, source consensus, recency. Persona-tunable weights.
  • Ranking explanation (plain-English array) per paper, usable in reports without modification.
  • Why it matters — separate plain-English array explaining the paper's role in the field (seminal, foundational, rising, high-influence, survey/benchmark/dataset/tool).
  • Paper intent — method / survey / benchmark / theory / application / dataset / tool / unknown. Filter your dataset to "just surveys" or "just benchmarks" with one column.
  • Seminal flag — top 5% citations, age-adjusted thresholds.
  • Citations-per-year momentum signal.
  • Merge confidence — filter >= 0.8 for high-confidence merges.

Curated insights (emitInsights: true)

A single recordType: insights record with five curated lists:

  • Top papers — best blend of quality (0.5×) + velocity (0.3×) + source agreement (0.2×).
  • Rising papers — highest citation velocity in the last 5 years.
  • Foundational papers — ≥1,000 citations, ≥8 years old, still actively cited.
  • Controversial papers — heavy reference load relative to cohort median (signals position-paper / debate-piece).
  • Emerging topics — recent + active clusters (mean year within last 3 years, ≥2 papers).

A recordType: trends record marking each cluster as new / rising (>25% paper-count growth) / stable / falling (>25% drop) / gone. Tracks how topics in your area shift between runs.

Persona-tunable scoring

  • researcher (default) — citations 40% / sources 40% / recency 20%. Depth and cross-confirmation matter most.
  • engineer — recency 55% / citations 25% / sources 20%. Fresh techniques win.
  • analyst — citations 55% / sources 30% / recency 15%. Impact dominates.

Query analysis & soft mode suggestions

Every run analyzes your query, detects the domain (machine learning, biology/medicine, physics, etc.), classifies intent, and suggests a better mode if applicable. Suggestions are surfaced in logs and the run summary — they NEVER auto-switch your mode.

Run confidence

Top-level confidence object with overall score (0–1), overallLevel band (high/medium/low), dataCompleteness, sourceAgreement, sources available vs attempted, and diagnostic notes ("Many records missing key fields", "Low source agreement").

Executive summary

When emitNarrative is on, the summary record carries a separate executiveSummary field — short Markdown bullets, ready to paste into Slack / email / dashboards. The full markdown field has the full narrative including per-source counts, clusters, and top-paper details.

Semantic layer

  • TF-IDF keyword extraction from title + abstract.
  • Topic clustering using Jaccard similarity on keywords (off by default — opt in via cluster: true or use literatureReview mode).
  • Related papers — top-5 most similar papers per record (when clustering is on).

Citation graph

  • Forward citations (papers that cite each result) and backward references (papers each result cites) — opt-in via includeCitations / includeReferences.
  • Cost-controlled: only fetches enrichment for the top-N most relevant papers (enrichTopN, default 25).
  • Structured graph object with nodes + edges + PageRank scores when emitGraph: true.
  • Top-10 influential nodes ranked by PageRank.

Cross-run state (monitor mode)

  • Stores result fingerprints in a named KV store keyed by (mode + query + filters).
  • Per-record newSinceLastRun boolean and citationDeltaSinceLastRun integer.
  • Alert records emitted on new high-impact papers, citation surges (≥50 since last run), and seminal-paper detections.
  • First run: all records are treated as new (no false positives on cold-start).

Exports

  • BibTeX entry on every record.
  • RIS entry on every record (Zotero / Mendeley / EndNote import).
  • Markdown narrative summary record with top-paper list + cluster breakdown.
  • Dedicated dataset views: Publications Overview, Intelligence (scores + explanations), BibTeX Export, RIS Export.
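
For instance, assembling a single .bib file from a finished run is a short script (a sketch, assuming the apify-client package and the default dataset ID of a completed run):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
dataset_id = "DATASET_ID"  # default dataset ID of a finished run

# Concatenate the per-record bibtex fields into one .bib file.
entries = [
    item["bibtex"]
    for item in client.dataset(dataset_id).iterate_items()
    if item.get("recordType") == "publication" and item.get("bibtex")
]
with open("references.bib", "w", encoding="utf-8") as f:
    f.write("\n\n".join(entries))
print(f"Wrote {len(entries)} BibTeX entries to references.bib")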

DX

  • Typed records with a stable recordType discriminator: researchBrief / publication / cluster / alert / graph / insights / trends / summary / error.
  • Run summary written to KV SUMMARY with full per-source telemetry.
  • Debug mode (debug: true) saves raw per-source responses to DEBUG_RAW_HITS for inspection.
  • 5-consecutive-failure circuit breaker on enrichment loops.
  • Per-page status messages with PPE charge running total.

How to use it

  1. Open the actor in Apify Console.
  2. Pick a mode — Standard, Literature Review, Monitor, Citation Graph, or Author Intelligence.
  3. Fill in the required input for that mode (see the modes table above).
  4. Set optional filters (type, sources, sortBy, maxResults).
  5. (Recommended) enter your mailto email for polite-pool routing on OpenAlex and Crossref.
  6. Click Start.
  7. Open the dataset. Use Publications Overview for tabular results, Intelligence for the scoring view, or BibTeX Export / RIS Export for one-click reference manager imports.
  8. The KV store carries SUMMARY (full run telemetry) and, in narrative mode, the Markdown summary inside the dataset.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | String | No | "auto" | "auto" / "standard" / "literatureReview" / "monitor" / "citationGraph" / "authorAnalysis" |
| persona | String | No | "researcher" | "researcher" / "engineer" / "analyst" — tunes scoring weights |
| query | String | No* | "deep learning" | Free-text keywords |
| author | String | No* | | Author name (required for authorAnalysis) |
| venue | String | No* | | Conference or journal |
| year | String | No* | | Single year |
| type | String | No | All | Publication type filter |
| doi | String | No* | | DOI direct lookup (required for citationGraph) |
| sources | Array | No | all four | Subset of dblp, openalex, semanticscholar, crossref |
| sortBy | String | No | "relevance" | "relevance" / "citations" / "year" / "sources" |
| maxResults | Integer | No | 50 | 1–500 |
| includeAbstract | Boolean | No | true | Pull abstracts |
| extractKeywords | Boolean | No | false | TF-IDF keyword extraction |
| cluster | Boolean | No | false | Group results by keyword overlap |
| classifyIntent | Boolean | No | true | Classify each paper as method / survey / benchmark / theory / application / dataset / tool |
| includeCitations | Boolean | No | false | Forward citations (top-N only) |
| includeReferences | Boolean | No | false | Backward references (top-N only) |
| citationLimit | Integer | No | 25 | Citations/refs attached per paper |
| enrichTopN | Integer | No | 25 | Only enrich top-N ranked papers |
| emitGraph | Boolean | No | false | Emit recordType: graph with PageRank |
| emitNarrative | Boolean | No | false | Emit recordType: summary with executive summary + Markdown |
| emitInsights | Boolean | No | false | Emit recordType: insights with top / rising / foundational / emerging lists |
| emitAlerts | Boolean | No | false | Emit recordType: alert records |
| emitTrends | Boolean | No | false | Emit recordType: trends (cluster diff vs prior run) |
| emitResearchBrief | Boolean | No | true | Emit the canonical researchBrief hero record (the one you actually want) |
| diffWithPriorRun | Boolean | No | false | Diff vs prior run for same scope |
| debug | Boolean | No | false | Save raw per-source hits to KV |
| mailto | String | No | | Polite-pool email for OpenAlex / Crossref |

* You must provide at least one of: query, author, venue, year, or doi. Some modes have stricter requirements (see the modes table).

Starter templates

Zero-config — just give it a query:

{
"query": "transformer architecture",
"mailto": "you@example.com"
}

The actor picks the best mode automatically — short query + default maxResults → literatureReview, with a researchBrief on top.

Find seminal papers in a topic (explicit literature review):

{
"mode": "literatureReview",
"query": "transformer architecture",
"year": "2024",
"mailto": "you@example.com"
}

Schedule a weekly monitor for new papers in your area:

{
"mode": "monitor",
"query": "retrieval augmented generation",
"venue": "NeurIPS",
"mailto": "you@example.com"
}

Citation deep dive on a single paper:

{
"mode": "citationGraph",
"doi": "10.48550/arXiv.1706.03762",
"citationLimit": 50,
"enrichTopN": 1,
"mailto": "you@example.com"
}

Author intelligence for a researcher:

{
"mode": "authorAnalysis",
"author": "Yoshua Bengio",
"mailto": "you@example.com"
}

Quick standard search (no clustering, no enrichment):

{
"query": "graph neural networks",
"venue": "ICML",
"year": "2024",
"maxResults": 50,
"mailto": "you@example.com"
}

Tips for best results

  • Always pass mailto. OpenAlex and Crossref route requests with a contact email through faster, more reliable infrastructure.
  • Pick a mode before tweaking flags. Modes set sensible defaults for the job. If you find yourself toggling 4+ flags by hand, there's probably a mode that does it for you.
  • Citation enrichment scales linearly. Each enriched paper adds one Semantic Scholar call. Keep enrichTopN low (default 25) unless you genuinely need the full set.
  • cluster: true needs at least 4 papers with overlapping keywords. Solo papers and papers with no keyword overlap stay unclustered (clusterId: null).
  • Restrict sources to debug. If a particular source keeps timing out or returning weird data, drop it from the array.
  • For non-CS work, drop DBLP from sources. Or just let smart-source-selection handle it — biology/medical queries auto-skip DBLP.

Programmatic access

Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "mode": "literatureReview",
    "query": "diffusion models",
    "year": "2024",
    "mailto": "you@example.com",
})

publications = []
clusters = []
summary = None
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("recordType") == "publication":
        publications.append(item)
    elif item.get("recordType") == "cluster":
        clusters.append(item)
    elif item.get("recordType") == "summary":
        summary = item

print(f"{len(publications)} papers in {len(clusters)} clusters")
if summary:
    print(summary["headline"])

JavaScript:

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });
const run = await client.actor("eWBl1oo2MNg11IUA8").call({
    mode: "monitor",
    query: "retrieval augmented generation",
    mailto: "you@example.com",
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const newPapers = items.filter((i) => i.recordType === "publication" && i.newSinceLastRun);
const alerts = items.filter((i) => i.recordType === "alert");
console.log(`${newPapers.length} new papers since last run, ${alerts.length} alerts`);

cURL:

curl -X POST "https://api.apify.com/v2/acts/eWBl1oo2MNg11IUA8/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"mode": "citationGraph",
"doi": "10.48550/arXiv.1706.03762",
"mailto": "you@example.com"
}'
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

Output record types

The dataset is a stream of typed records. Filter on recordType for clean downstream processing.

publication — the main result records

{
  "recordType": "publication",
  "rank": 1,
  "doi": "10.48550/arxiv.1706.03762",
  "title": "Attention Is All You Need",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "..."],
  "venue": "Neural Information Processing Systems",
  "year": "2017",
  "type": "proceedings-article",
  "abstract": "The dominant sequence transduction models...",
  "citationCount": 124378,
  "citationsPerYear": 13820.0,
  "isOpenAccess": true,
  "isSeminal": true,
  "pdfUrl": "https://arxiv.org/pdf/1706.03762.pdf",
  "primaryUrl": "https://openalex.org/W2796796457",
  "sources": ["dblp", "openalex", "semanticscholar", "crossref"],
  "sourceIds": { "dblp": "conf/nips/...", "openalex": "...", "semanticscholar": "...", "crossref": "..." },
  "score": {
    "overall": 0.94,
    "components": {
      "citations": 1.0,
      "recency": 0.42,
      "sourceConsensus": 1.0
    }
  },
  "rankingExplanation": [
    "Found in 4/4 sources (dblp, openalex, semanticscholar, crossref)",
    "Top 1% citation count in this result set (124,378 citations)",
    "Open access PDF available",
    "Cross-confirmed across all 4 catalogs"
  ],
  "mergeConfidence": 0.94,
  "keywords": ["attention", "transformer", "sequence", "encoder", "decoder"],
  "clusterId": "c1",
  "relatedPaperRanks": [{ "rank": 5, "score": 0.42 }, { "rank": 12, "score": 0.31 }],
  "newSinceLastRun": false,
  "citationDeltaSinceLastRun": 1240,
  "bibtex": "@inproceedings{vaswani2017attention,\n title = {...},\n ...\n}",
  "ris": "TY - CONF\nTI - Attention Is All You Need\n...",
  "extractedAt": "2026-04-25T14:00:00.000Z"
}

cluster — topic groups (when cluster: true)

{
"recordType": "cluster",
"clusterId": "c1",
"name": "Attention / Transformer / Architecture",
"paperRanks": [1, 5, 12, 18, 24],
"paperCount": 5,
"dominantKeywords": ["attention", "transformer", "encoder", "decoder", "sequence"],
"avgScore": 0.81,
"avgYear": 2020
}

alert — threshold triggers (when emitAlerts: true)

{
"recordType": "alert",
"alertType": "newly-cited",
"severity": "warn",
"message": "Citation surge: \"Attention Is All You Need\" gained 1240 citations since last run",
"paperRank": 1,
"paperTitle": "Attention Is All You Need",
"paperDoi": "10.48550/arxiv.1706.03762",
"detail": { "citationDelta": 1240, "currentCitations": 124378 }
}

alertType is a stable enum: "new-paper" / "newly-cited" / "seminal-detected".

graph — citation graph (when emitGraph: true)

{
  "recordType": "graph",
  "graph": {
    "nodes": [{ "id": "doi:10.48550/...", "label": "Attention Is All You Need", "year": "2017", "doi": "10.48550/...", "inPrimarySet": true, "citationCount": 124378, "pagerank": 0.0421 }],
    "edges": [{ "source": "doi:...", "target": "doi:...", "type": "cites" }],
    "nodeCount": 87,
    "edgeCount": 213
  },
  "topInfluential": [
    { "id": "doi:10.48550/...", "label": "Attention Is All You Need", "pagerank": 0.0421, "doi": "10.48550/..." }
  ]
}

insights — curated lists (when emitInsights: true)

{
  "recordType": "insights",
  "persona": "researcher",
  "insights": {
    "topPapers": [{ "rank": 1, "title": "Attention Is All You Need", "doi": "...", "reason": "Top blend of quality (0.94), velocity (13820 cites/year), and 4/4 source agreement" }],
    "risingPapers": [{ "rank": 7, "title": "...", "doi": "...", "reason": "Citation velocity 412/year (2023, 824 total)" }],
    "foundationalPapers": [{ "rank": 1, "title": "...", "doi": "...", "reason": "124,378 citations across 9 years — still actively cited" }],
    "controversialPapers": [],
    "emergingTopics": [{ "clusterId": "c2", "name": "Diffusion / Generative / Latent", "paperCount": 8, "avgYear": 2024, "dominantKeywords": ["diffusion", "generative", "latent"], "reason": "Mean year 2024, 8 papers — recent and active" }]
  }
}

trends — cluster trend diff vs prior run (when emitTrends: true)

{
  "recordType": "trends",
  "isFirstRun": false,
  "priorRunAt": "2026-04-18T14:00:00.000Z",
  "clusters": [
    { "name": "Diffusion / Generative / Latent", "trend": "rising", "paperCount": 8, "paperCountDelta": 3, "dominantKeywords": ["diffusion", "generative", "latent"] },
    { "name": "Sparse / Mixture / Experts", "trend": "new", "paperCount": 4, "paperCountDelta": null, "dominantKeywords": ["sparse", "mixture", "expert"] },
    { "name": "Pruning / Quantization / Compression", "trend": "falling", "paperCount": 2, "paperCountDelta": -3, "dominantKeywords": ["pruning", "quantization", "compression"] }
  ]
}

trend is a stable enum: "new" / "rising" (>25% growth) / "stable" / "falling" (>25% drop) / "gone".

summary — Markdown narrative (when emitNarrative: true)

{
"recordType": "summary",
"headline": "literatureReview: 50 publications (high confidence)",
"executiveSummary": "## Key Takeaways\n\n- **50** unique publications across 4 sources (high confidence).\n- **4** seminal papers flagged in this set (top-5% citations, age-adjusted).\n- **5** rising papers with strong citation velocity in the last 5 years.\n- **3** emerging topics detected in the cluster set (recent + active).\n- Largest cluster: **Attention / Transformer / Architecture** (12 papers).",
"markdown": "# Academic Publication Search — \"transformer architecture\"\n\n**Mode:** literatureReview ...",
"persona": "researcher",
"queryAnalysis": { "detectedDomain": "machine learning / AI", "intent": "literature_review", "suggestedMode": null, "suggestedModeReason": null, "warnings": [] },
"confidence": { "overall": 0.82, "overallLevel": "high", "dataCompleteness": 0.87, "sourceAgreement": 0.78, "sourcesAvailable": 4, "sourcesAttempted": 4, "notes": [] },
"sourcesQueried": ["dblp", "openalex", "semanticscholar", "crossref"],
"sourceCounts": { "dblp": 50, "openalex": 50, "semanticscholar": 47, "crossref": 50 },
"sourceErrors": {},
"rawHits": 197,
"uniquePublications": 50,
"pushed": 50,
"mode": "literatureReview",
"sortBy": "citations",
"newSinceLastRunCount": null,
"seminalCount": 4,
"clusterCount": 5,
"alertCount": 0,
"finishedAt": "2026-04-25T14:00:00.000Z"
}

error — actor-level error

{ "recordType": "error", "error": true, "message": "...", "timestamp": "..." }

A run summary is also written to the key-value store under SUMMARY mirroring the summary record's telemetry, plus PPE charge totals.


How it works

Academic Publication Search

INPUT — mode → preset → resolved config
    │
    ▼
Parallel fetch (Promise.all, retry on 5xx)
    DBLP | OpenAlex | S2 | Crossref
    │
    ▼
Normalize + Dedup (DOI → fuzzy title)
    │
    ▼
Score (citations 50% + sources 30% + recency 20%) → rank
    Build ranking explanations (plain-English)
    Detect seminal papers (top 5%, age-adjusted)
    Compute citations-per-year + merge confidence
    │
    ▼
TF-IDF keyword extraction (when extractKeywords)
    Topic clustering by Jaccard overlap (when cluster)
    │
    ▼
Optional: forward citations + backward references (top-N)
    Optional: structured graph + PageRank
    │
    ▼
Optional: diff against prior run (KV state)
    Optional: emit alerts on thresholds
    Optional: emit Markdown narrative
    │
    ▼
Push records (publication / cluster / alert / graph / summary)
    PPE: charge per record, stop early on eventChargeLimitReached
    Save state for next-run diff
    Write SUMMARY to KV

Quality score formula

score = 0.5 × citationCohortScore (log-normalized within result set)
+ 0.3 × sourceConsensus (sources / 4)
+ 0.2 × recencyScore (exp(-age / 10), 10y half-life)

Weights are documented in dataset_schema.json. We deliberately don't include venue prestige — it requires a paywalled or scraped index that the actor can't ship without licensing. If you need venue weights, post-process the result set with your own ISSN-to-ranking map.
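
For illustration, here is a back-of-the-envelope re-implementation of the documented formula (the exact cohort normalization and persona re-weighting live in the actor's source; the log-normalization below is an assumption):

import math

def quality_score(citations, max_citations_in_cohort, sources_found_in, age_years,
                  weights=(0.5, 0.3, 0.2)):
    """Illustrative re-implementation of the documented score; weights are persona-tunable."""
    # Assumption: log-normalize citations against the best-cited paper in the result set.
    citation_cohort = math.log1p(citations) / math.log1p(max(max_citations_in_cohort, 1))
    source_consensus = sources_found_in / 4            # four catalogs queried
    recency = math.exp(-age_years / 10)                # decay as documented above
    w_cit, w_src, w_rec = weights
    return w_cit * citation_cohort + w_src * source_consensus + w_rec * recency

# A paper found in all 4 sources, with the cohort's top citation count, published 8 years ago:
# quality_score(124_378, 124_378, 4, 8) ≈ 0.5 + 0.3 + 0.2 * 0.45 ≈ 0.89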

Seminal detection

A paper is flagged seminal when:

  1. It sits in the top 5% of citation counts in the current result set, AND
  2. It clears age-adjusted citation thresholds:
    • 5+ years old: ≥500 citations
    • 2–5 years old: ≥1000 citations
    • <2 years old: ≥2000 citations

This cohort-relative + age-adjusted approach prevents both "everything seminal in a high-cite cohort" and "nothing seminal in a recent cohort."
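
The rule is small enough to restate in code. A sketch of the documented check (illustrative only; cohort_percentile is assumed to be precomputed within the result set):

def is_seminal(citations: int, age_years: float, cohort_percentile: float) -> bool:
    """Sketch of the documented rule; cohort_percentile is assumed precomputed (0–1)."""
    if cohort_percentile < 0.95:        # must sit in the top 5% of the result set
        return False
    if age_years >= 5:
        return citations >= 500
    if age_years >= 2:
        return citations >= 1000
    return citations >= 2000            # papers under 2 years old need the highest bar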

Deduplication

DOI-first. Records with the same normalized DOI merge into one group. Records without a DOI fall through to a normalized title-plus-year fingerprint and merge with any DOI-bearing record on the same fingerprint. Within a merged group:

  • Strings: prefer the first non-null, non-empty value (abstracts prefer the longest).
  • Authors: keep the longer author list.
  • Numbers: take the maximum (citation counts always reflect the best-crawled source).
  • isOpenAccess: true if any source said true.
  • Provenance: sources[] is the union of contributing source names; sourceIds is the per-source map.

mergeConfidence is computed from: DOI presence (+0.4), source count (+0.06 per extra source), title length (+0.03), year present (+0.03), authors present (+0.03), with a base of 0.4. Filter >= 0.8 for high-confidence merges.
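
A sketch of that heuristic as documented (an illustrative re-implementation, not the actor's source; the "title length" increment is simplified to a presence check):

def merge_confidence(record: dict) -> float:
    """Illustrative re-implementation of the documented mergeConfidence heuristic."""
    score = 0.4                                                   # base
    if record.get("doi"):
        score += 0.4                                              # DOI present
    score += 0.06 * max(len(record.get("sources", [])) - 1, 0)    # each extra confirming source
    if record.get("title"):                                       # simplified "title length" check
        score += 0.03
    if record.get("year"):
        score += 0.03
    if record.get("authors"):
        score += 0.03
    return min(score, 1.0)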

Reliability

Each source call uses 3 retries with exponential backoff on 5xx and 429. If retries exhaust, the per-source error lands in the run summary's sourceErrors map and the run continues with the other sources. Citation enrichment uses a 5-consecutive-failure circuit breaker so a Semantic Scholar outage doesn't burn credits on a long run.


How much does it cost to run?

Pay-per-event at $0.002 per merged unique publication returned. You're charged per deduplicated record, not per raw source hit, so a query that pulled 200 raw hits across four sources but merged into 50 unique publications charges 50 events. Cluster, alert, graph, and summary records are not charged.

| Scenario | Inputs | Approx. results | Approx. cost |
| --- | --- | --- | --- |
| Quick literature check | query, defaults | ~50 unique | $0.10 |
| Author bibliography | authorAnalysis mode | ~200 unique | $0.40 |
| Full conference extraction | venue + year, maxResults: 500 | ~500 unique | $1.00 |
| Citation graph deep dive | citationGraph mode | 1 enriched root + graph | $0.002 |
| Weekly monitor | monitor mode | typically 5–20 new | $0.01–$0.04 |

Apify platform compute (CPU + memory) is billed separately. Citation enrichment adds wall-clock time but doesn't change the PPE fee — the per-record charge is unchanged whether enrichment is on or off.


Limitations and responsible use

  • Maximum 500 merged records per run. For larger sweeps, paginate by year, venue, or author across multiple runs.
  • Bibliographic metadata only. No paper full-text. Use pdfUrl for open-access PDFs.
  • DOI lookup skips DBLP. DBLP has no DOI lookup endpoint.
  • Single-year filtering. Year ranges aren't exposed across all sources; run separately or post-filter.
  • Quality score is cohort-relative. A score of 0.7 in a cohort of seminal papers is different from 0.7 in a cohort of preprints. Always interpret scores within their result set.
  • Clustering quality is keyword-bound. TF-IDF + Jaccard is fast and deterministic but not as good as embedding-based clustering. For deep semantic work, post-process with your own embedding model on the abstracts.
  • PageRank on small graphs. With <30 nodes the PageRank scores are mostly noise — interpret them as weak signal in citation-graph mode unless your citationLimit is high enough.
  • State persistence requires Full Access tokens. Restricted-access scoped tokens can't open named KV stores; monitor mode degrades gracefully and treats every run as a first run under restricted tokens.
  • Polite-pool mailto. Always pass it. OpenAlex and Crossref are public services and mailto is how you identify yourself as a well-behaved consumer.

What this actor IS (positioning)

  • A research intelligence engine — not a thin academic search wrapper.
  • Decision-support output — the researchBrief tells you what to read first, what's rising, what to focus on. Raw data is there if you want it; meaning is delivered first.
  • Deterministic and explainable — every score, cluster, intent label, and ranking comes with a reason. No black boxes, no LLM hallucinations.
  • Pull-based — one query, one brief. No configuration drift, no 17-checkbox configurations to remember.

What this actor does NOT do

  • Does NOT replace paid platforms like Web of Science, Scopus, or Dimensions for licensed citation analytics, paywalled abstracts, or proprietary impact metrics.
  • Does NOT scrape Google Scholar. Google Scholar has no API and aggressively blocks programmatic access.
  • Does NOT extract or download paper PDFs. The pdfUrl field is a link; downloading is your job (or use a downloader actor).
  • Does NOT use any LLM or embedding service. Scoring, clustering, keywords, alerts, and explanations are all pure deterministic computation. Reproducible, fast, and free of LLM hallucinations — but limited to keyword-level semantics.
  • Does NOT compute h-index, journal impact factor, or author-level analytics. Per-paper signals only.
  • Does NOT cover preprints exhaustively. arXiv preprints surface via OpenAlex and Semantic Scholar's indexing of arXiv; the actor does not query arXiv's API directly.

FAQ

What is the quality score actually measuring? Three things, weighted: how well-cited the paper is relative to the rest of your result set (50%), how many of the four sources agreed it exists (30%), and how recent it is (20%). The score is cohort-relative — a 0.9 in a cohort of seminal papers means something different from a 0.9 in a cohort of recent preprints. The components breakdown is on every record so you can re-weight if your use case wants that.

Why do you flag seminal papers based on the result-set cohort, not absolute citation counts? Because a result set of NeurIPS 2017 papers has very different absolute citation distributions than a result set of NeurIPS 2024 papers. Cohort-relative + age-adjusted thresholds catch the standout papers in either cohort without false positives in either direction.

How does monitoring work across runs? The first run with monitor mode (or diffWithPriorRun: true) stores result fingerprints in a named KV store keyed by mode + query + filters. The next run with the same scope diffs against the stored state and adds newSinceLastRun and citationDeltaSinceLastRun to every record. The state store caps at 5,000 fingerprints per scope and rolls over FIFO to prevent unbounded growth.

What's the difference between cluster and graph output? Clusters group your result set by topical similarity (keyword overlap). The graph traces actual citation links between papers — who cites whom. You can use both: clusters for "what topics are in this search?" and graph for "who's the most influential in this neighborhood?".

Can I skip the new fields and get the simple shape I had before? Yes — the new fields are additive and null/empty when their feature is off. With default inputs, you get the same publication records as before plus the new score/explanation/keywords/etc. fields filled in (or null if not applicable). Filter the dataset on recordType: "publication" to skip cluster / alert / graph / summary records cleanly.

Why not use embeddings for clustering? Trade-offs. TF-IDF + Jaccard is deterministic, free, fast, and doesn't depend on an external embedding service that could rate-limit, change pricing, or hallucinate. The clustering quality is meaningfully worse than embedding-based, but it's good enough for "broadly group these 50 papers into themes" — which is the only thing literature-review mode promises. For deep semantic clustering, post-process the abstracts with your own embedding model.

What's the difference between rankingExplanation and whyItMatters? rankingExplanation answers "why is this at rank #N in this result set?" — built from cohort signals: source coverage, citation percentile within the result set, recency. whyItMatters answers "what's this paper's role in the field?" — built from absolute signals: seminal flag, foundational age + citations, citation velocity rank, PageRank rank in the citation graph (when graph is built), paper intent (survey, benchmark, dataset, tool). They overlap a little but optimize for different questions.

How does paper intent classification work? A regex sweep over title + abstract for surveys ("survey", "review of", "systematic review"), benchmarks ("benchmark", "leaderboard", "comparison"), datasets ("dataset", "corpus"), tools ("toolkit", "library"), theory ("theorem", "proof", "formal verification"), and applications ("case study", "deployed"). Falls back to "method" when "novel approach" / "we propose" patterns appear, and "unknown" otherwise. Pure regex — no LLM. False-positive rate is low for clear cases (a paper called "A Survey of X" is a survey 99% of the time) and "unknown" is honest about ambiguous cases.
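
A sketch of that kind of regex sweep (the pattern lists here are illustrative, not the actor's actual keyword sets):

import re

# Illustrative pattern lists only; the actor's real keyword sets live in its source.
INTENT_PATTERNS = [
    ("survey", r"\bsurvey\b|\breview of\b|systematic review"),
    ("benchmark", r"\bbenchmark\b|\bleaderboard\b|\bcomparison\b"),
    ("dataset", r"\bdataset\b|\bcorpus\b"),
    ("tool", r"\btoolkit\b|\blibrary\b"),
    ("theory", r"\btheorem\b|\bproof\b|formal verification"),
    ("application", r"\bcase study\b|\bdeployed\b"),
    ("method", r"\bnovel approach\b|\bwe propose\b"),
]

def classify_intent(title: str, abstract: str = "") -> str:
    text = f"{title} {abstract}".lower()
    for intent, pattern in INTENT_PATTERNS:
        if re.search(pattern, text):
            return intent
    return "unknown"

# classify_intent("A Survey of Graph Neural Networks")  -> "survey"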

Should I trust the persona-tunable scoring? The personas are weight presets — they're transparent (documented in input schema, surfaced in score.weights) and reproducible. They change the order of results, not the underlying citation/source/recency signals. If your job has a different weighting in mind, post-process the dataset with your own weights against score.components.
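
For example, re-ranking with your own weights against score.components takes a few lines (a sketch, assuming publication records already filtered out of the dataset):

def rerank(publications: list[dict], w_citations=0.7, w_sources=0.1, w_recency=0.2) -> list[dict]:
    """Re-rank publication records with custom weights over score.components."""
    def custom_score(pub: dict) -> float:
        c = pub["score"]["components"]
        return (w_citations * c["citations"]
                + w_sources * c["sourceConsensus"]
                + w_recency * c["recency"])
    return sorted(publications, key=custom_score, reverse=True)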

What does confidence: low mean for my run? Either (1) most records came from a single source, suggesting your query is in a niche only one catalog covers, (2) many records are missing key fields (DOI, abstract), or (3) the result set is small (<5 records). Low confidence doesn't mean the data is wrong — it means downstream consumers should weigh it less. The notes array spells out what's driving the score.

Why does monitor mode emit cluster trends and not just paper diffs? Paper-level diffs answer "what's new?" Cluster trends answer "what's rising?" — usually a more interesting signal for monitoring. A cluster going from 2 papers to 6 in a month is a stronger signal than 4 individual new papers, because it indicates a wave forming, not just isolated activity.

Does the citation graph show 2nd-degree citations? No — only 1-hop. The forwardCitations and backwardReferences arrays on each result hold the immediate neighborhood. Multi-hop graph expansion would multiply API calls unpredictably; we keep it predictable at 1-hop.

How fresh is the data? DBLP and Crossref update within days of publication. OpenAlex updates weekly. Semantic Scholar updates roughly monthly with re-scored citation counts. For papers under a week old, expect partial coverage; for papers more than a month old, all four sources should agree.

Can I use this for non-CS research? Yes. DBLP is CS-only, but OpenAlex covers every discipline, Crossref covers every DOI-registered work, and Semantic Scholar covers most STEM + social sciences. Smart-source-selection auto-drops DBLP for biology/medical queries; for other non-CS work, drop DBLP from sources manually.

How do I export to Zotero or Mendeley? Open the dataset in Apify Console, switch to the RIS Export view, download as text, import into your reference manager. The BibTeX Export view is the equivalent for LaTeX / Overleaf.

How does smart source selection work? Simple keyword heuristic: queries containing biology/medical terms (protein, gene, cancer, clinical, neuro, etc.) drop DBLP from the source list since DBLP indexes computer science only. The reduction is logged and the actor proceeds with the remaining sources. Override by setting sources explicitly.

What happens when cluster is on but no clusters form? You get zero recordType: cluster records. Papers that don't cluster have clusterId: null. Clustering needs at least 4 papers with overlapping keywords (≥0.25 Jaccard) to form a group of 2+.


Related actors

| Actor | Description |
| --- | --- |
| Crossref Academic Paper Search | Single-source Crossref search with citations, BibTeX, OA detection (Unpaywall) |
| Semantic Scholar Paper Search | Single-source Semantic Scholar search with AI-extracted abstracts and citation context |
| OpenAlex Research Paper Search | Single-source OpenAlex search across 250M+ scholarly works |
| ArXiv Preprint Paper Search | Search and download preprint papers from arXiv |
| ORCID Researcher Search | Search researchers by name or ORCID |