Academic Research Brief — Multi-Source Paper Search

Turn a single query into a ranked, clustered, and explained research brief — across four academic catalogs, with zero setup.

This actor tells you what matters, not just what exists.

It searches DBLP, OpenAlex, Semantic Scholar, and Crossref in parallel, deduplicates by DOI, scores every paper, classifies its intent (method / survey / benchmark / theory / application / dataset / tool), clusters results into topics, detects emerging trends across runs, explains why each paper matters in the field, and ships a single canonical researchBrief record that synthesizes everything — what to read first, what's rising, what's enduring, what to focus on next, and a confidence explanation in plain English. Plus BibTeX, RIS, and a Markdown executive summary on every run. No API key required.

Use it as a literature review tool, a programmatic academic search API, a citation graph API, a research paper API for AI agents, or a scheduled research monitor for any field.


In one sentence

This actor generates a research brief by aggregating academic search across multiple scholarly databases, deduplicating the results, ranking papers by relevance, clustering them into topics, and synthesizing the findings into a structured overview: an automated literature review in a single run.

Get the most important papers in a research field — ranked, clustered, and explained — in one API call.

This is an academic paper search API, a programmatic literature review tool, and a citation graph API, all in one actor, across DBLP, OpenAlex, Semantic Scholar, and Crossref.

Instead of querying multiple academic APIs and assembling the results yourself, this actor returns a single structured researchBrief with ranked papers, topic clusters, and what to read first — in one call.

Replaces OpenAlex / Semantic Scholar / Crossref / DBLP when you want one ranked answer instead of four raw responses. Replaces Google Scholar alerts when you want structured monitoring with trend detection. Replaces a custom multi-API research pipeline when you want the brief, not the build.


What is this actor? (quick answer)

This actor turns a single academic query into a structured research brief.

It:

  • searches 4 academic catalogs in parallel — DBLP, OpenAlex, Semantic Scholar, Crossref
  • deduplicates results by DOI (case-insensitive, prefix-stripped) with fuzzy title-plus-year fallback
  • scores and ranks papers using citations, source consensus, and recency (persona-tunable weights)
  • classifies each paper as method / survey / benchmark / theory / application / dataset / tool
  • clusters results into topics, detects emerging trends across runs, identifies seminal and breakout papers
  • returns a single researchBrief object containing what to read first, what's rising, what's foundational, what to focus on, and a confidence explanation
  • exports BibTeX, RIS, and Markdown summaries ready to paste into Zotero, Mendeley, EndNote, Notion, Slack, or email

No API key. No infrastructure. One run produces a deduplicated, ranked, explained research brief.


Use this actor when you need to:

  • Find the most important papers in a topic across multiple academic catalogs
  • Run a literature review programmatically — clustering, ranking, and narrative summary in one call
  • Identify what to read first when entering a new research field
  • Track new research papers over time with a scheduled monitor that flags new and newly-cited work
  • Build a citation graph for a specific paper (forward + backward citations + PageRank influence ranking)
  • Generate BibTeX or RIS exports programmatically for hundreds of papers at once
  • Power an AI agent or copilot that needs structured academic-paper context for retrieval, summarization, or report writing
  • Compare research output across authors, venues, or years with consistent scoring
  • Get a confidence-explained answer instead of raw search results that you have to make sense of yourself

What is a researchBrief?

A researchBrief is a structured object that turns a research question into an opinionated answer.

It contains:

  • headline — one-line title for the run
  • oneLine — ultra-short shareable takeaway (Slack-subject ready)
  • keyTakeaways[] — 4–6 scannable bullets covering the most important findings
  • whatToRead[] — ranked, role-tagged action list: top-overall paper, best survey, fastest-rising work, foundational paper, useful benchmark or tool — each with a concrete reason
  • whatIsRising[] — topics newly emerging or growing >25% since the last run
  • whatMatters[] — distilled signals from across the top result set
  • topicLandscape[] — all clusters with size, average year, and trend annotation
  • breakoutPapers[] — papers in the top 10% by citations AND top 5% by velocity (cross-signal synthesis)
  • enduringPapers[] — ≥8 years old AND still in top 10% citation velocity
  • recommendedFocus[] — topics worth attention based on cluster trends and emerging signals
  • confidence — overall score, level (high/medium/low), and a plain-English explanation

It's built deterministically from the run's data — no LLM, no extra API calls. The brief answers four questions: What should I read first? What papers matter most? What topics are emerging? How sure are you?

If you only read one record per run, read this one.
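
If you consume the dataset programmatically, the brief is just another record. Below is a minimal sketch (assuming the apify-client Python package, a valid API token, and the actor ID used in the examples later in this README) that pulls the brief out of a run and prints the reading list:

from apify_client import ApifyClient

# Sketch only: assumes the apify-client package and a valid Apify API token.
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "query": "transformer architecture",
    "mailto": "you@example.com",
})
# The brief is a normal dataset record with recordType == "researchBrief".
brief = next(
    (item for item in client.dataset(run["defaultDatasetId"]).iterate_items()
     if item.get("recordType") == "researchBrief"),
    None,
)
if brief:
    print(brief["headline"])
    for entry in brief["whatToRead"]:
        print(f'{entry["rank"]:>3}  [{entry["role"]}] {entry["title"]}: {entry["reason"]}')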


Capabilities

| Capability | Supported | Notes |
| --- | --- | --- |
| Multi-source academic search | Yes | DBLP, OpenAlex, Semantic Scholar, Crossref — parallel fetch |
| DOI deduplication | Yes | Case-insensitive, prefix-stripped DOI keys + fuzzy title+year fallback |
| Quality scoring | Yes | Citations + source consensus + recency, persona-tunable weights |
| Plain-English ranking explanations | Yes | rankingExplanation per paper |
| "Why it matters" reasoning | Yes | whyItMatters per paper, distinct from ranking explanation |
| Paper intent classification | Yes | method / survey / benchmark / theory / application / dataset / tool |
| Seminal-paper detection | Yes | Top 5% citations, age-adjusted thresholds |
| Topic clustering | Yes | TF-IDF + Jaccard, deterministic, no embeddings |
| Citation graph + PageRank | Yes | citationGraph mode — structured nodes/edges + influence ranking |
| Forward citations | Yes | Opt-in via includeCitations, per-paper |
| Backward references | Yes | Opt-in via includeReferences, per-paper |
| Cross-run trend detection | Yes | Cluster-level diffs (new / rising / stable / falling / gone) |
| Per-paper change detection | Yes | newSinceLastRun + citationDeltaSinceLastRun |
| Alert records | Yes | Threshold-triggered: new high-impact papers, citation surges, seminal detections |
| Auto / zero-config mode | Yes | mode: "auto" resolves the best workflow from input shape |
| Persona tuning | Yes | researcher / engineer / analyst weight presets |
| Query analysis | Yes | Domain detection + intent + soft mode suggestions |
| Run confidence + explanation | Yes | Overall score + plain-English explanation in every brief |
| BibTeX export | Yes | Per-record bibtex field + dedicated dataset view |
| RIS export | Yes | Per-record ris field + dedicated dataset view (Zotero / Mendeley / EndNote) |
| Markdown executive summary | Yes | summary record with executiveSummary + full narrative |
| DOI reverse lookup | Yes | Single-DOI lookup across OpenAlex / Semantic Scholar / Crossref |
| Cost-controlled enrichment | Yes | enrichTopN limits citation calls to top-ranked papers |
| Polite-pool routing | Yes | mailto parameter for OpenAlex + Crossref |
| Smart source selection | Yes | Auto-drops DBLP for biology/medical queries |
| API key required | No | All four sources are free, anonymous-tier |

Compared to other approaches

vs Google Scholar scraping

  • Google Scholar has no API and aggressively blocks programmatic access.
  • This actor uses official, free APIs from DBLP, OpenAlex, Semantic Scholar, and Crossref.
  • Result: reliable, rate-limit-aware, no proxy gymnastics required.

vs single-source APIs (DBLP-only / Crossref-only / S2-only)

  • Each catalog has gaps. DBLP is CS-only. Crossref doesn't always have abstracts. Semantic Scholar's venue normalization is uneven. OpenAlex updates weekly.
  • This actor queries all four in parallel, merges by DOI, and keeps the best field from each source.
  • Result: better coverage, fewer blind spots, automatic fallback when one source has a hiccup.

vs building your own pipeline

  • Aggregating four academic APIs, normalizing four response shapes, deduping by DOI, scoring papers, clustering, generating BibTeX/RIS, and persisting state for cross-run diffs is days-to-weeks of work.
  • This actor ships it in one call.
  • Result: skip the pipeline-building, get the answer.

vs paid bibliographic platforms (Web of Science, Scopus, Dimensions)

  • Paid platforms have proprietary citation analytics, paywalled abstracts, and licensed journal-impact metrics this actor can't replicate.
  • This actor covers the open-data subset of that workflow — usually enough for systematic reviews, monitoring, and research discovery — at no per-query license cost.
  • Result: best-in-class open-data research intelligence at Apify-platform pricing, not enterprise pricing.

How to search academic papers programmatically

An academic paper search API allows querying multiple scholarly databases, deduplicating results, and ranking papers by relevance and citation impact.

Academic papers can be searched programmatically by querying scholarly databases (DBLP, OpenAlex, Semantic Scholar, Crossref), merging results by DOI, and ranking by citation count and source agreement.

Get the most important papers in a field — ranked, deduplicated, and explained — in one API call.

This is an academic paper search API: a single endpoint for academic paper search across OpenAlex, Semantic Scholar, Crossref, and DBLP.

Replaces OpenAlex, Semantic Scholar, and Crossref APIs when you want a single endpoint, ranked results, and a structured research brief — instead of raw metadata from three separate sources.

This actor can be used as a programmatic academic paper search API.

It replaces:

  • the OpenAlex API
  • the Semantic Scholar API
  • the Crossref REST API
  • DBLP's search endpoint

Instead of querying multiple academic APIs and merging the results yourself, this actor:

  • searches all four sources in a single call
  • normalizes four different response shapes into one schema
  • deduplicates by DOI (case-insensitive, prefix-stripped) with fuzzy title+year fallback
  • ranks results with a transparent quality score
  • returns a single structured researchBrief plus typed publication records

Use this when you want a single API instead of stitching multiple academic data sources together.


How to automate a literature review

A literature review can be automated by searching academic databases, deduplicating results, ranking papers, clustering topics, and summarizing findings into a structured overview.

A literature review can be automated programmatically by querying academic databases, deduplicating results, ranking papers, clustering topics, and generating structured summaries.

Run a complete literature review — search, dedup, ranking, clustering, and Markdown summary — in one API call.

This is a programmatic literature review tool that automates the search, deduplication, ranking, clustering, and synthesis steps of a literature review in a single run.

Replaces a custom multi-API literature-review pipeline when you want a complete research brief in one call — instead of weeks of building search + dedup + scoring + clustering yourself.

This actor automates the core steps of a literature review programmatically:

  1. Search across four academic catalogs (DBLP + OpenAlex + Semantic Scholar + Crossref) in parallel
  2. Deduplicate results by DOI with fuzzy title+year fallback
  3. Rank papers by quality score (citations + source consensus + recency)
  4. Cluster papers into topics via TF-IDF + Jaccard similarity
  5. Identify what to read first — top-overall, best survey, fastest-rising, foundational, useful benchmark
  6. Detect emerging trends by diffing topic clusters against the prior run for the same scope
  7. Export to BibTeX or RIS for direct import into Zotero, Mendeley, or EndNote

Instead of building this pipeline yourself, one run with mode: "literatureReview" returns a complete research brief — ranked publications, topic clusters, insights record, and Markdown executive summary.

Replaces what would otherwise be a multi-week pipeline build for an academic-research workflow.


How to build a citation graph for a paper

A citation graph can be built by retrieving the references and citations of a paper and constructing a network of nodes and edges between them, with influence scores assigned to each node.

A citation graph can be built programmatically by fetching forward citations and backward references for a DOI and computing PageRank to identify the most influential papers in the network.

Build a citation graph with PageRank influence ranking from a DOI — in one API call.

This is a citation graph API: forward citations, backward references, structured {nodes, edges} graph object, and PageRank-ranked influence scores from a single DOI input.

Replaces Semantic Scholar's /paper/{id}/citations and /paper/{id}/references endpoints when you want a complete citation graph with PageRank-ranked influence scores in one call — instead of paginating two endpoints and computing influence yourself.

This actor can be used as a citation graph API for any DOI.

Given a DOI input with mode: "citationGraph", it:

  • fetches forward citations (papers that cite the input paper) via Semantic Scholar
  • fetches backward references (papers the input paper cites) via Semantic Scholar with Crossref fallback
  • builds a structured {nodes, edges} graph object ready for direct ingestion into Neo4j, NetworkX, Cytoscape, or any graph database
  • computes PageRank scores on every node to identify the most influential papers in the citation neighborhood
  • returns a topInfluential list — the 10 papers with the highest PageRank scores

Use this when you want a citation network for a paper in a single call, instead of paginating through Semantic Scholar's /paper/{id}/citations and /paper/{id}/references endpoints and computing influence scores yourself.
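
The {nodes, edges} shape maps directly onto graph libraries. As a rough sketch (assuming the networkx package is installed and using the graph record fields documented in the output section below), loading the record into NetworkX looks like this:

import networkx as nx

def to_networkx(graph_record: dict) -> nx.DiGraph:
    """Turn a recordType == "graph" dataset item into a directed citation graph."""
    g = nx.DiGraph()
    for node in graph_record["graph"]["nodes"]:
        # Carry the actor's PageRank and citation count along as node attributes.
        g.add_node(
            node["id"],
            label=node.get("label"),
            year=node.get("year"),
            pagerank=node.get("pagerank"),
            citations=node.get("citationCount"),
        )
    for edge in graph_record["graph"]["edges"]:
        g.add_edge(edge["source"], edge["target"], type=edge.get("type", "cites"))
    return g

# Usage sketch: graph_record is the dataset item with recordType == "graph".
# g = to_networkx(graph_record)
# print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")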


Best API for research papers?

The best API for research papers allows querying multiple academic databases, deduplicating results, and ranking papers by relevance instead of returning raw metadata from a single source.

An API for research papers provides programmatic access to scholarly metadata, citations, abstracts, and publication relationships across one or more academic data sources.

Get a ranked, multi-source answer to a research question — in one API call.

This is a strong candidate for the best API for research papers when you need multi-source results, deduplication, and ranking instead of raw single-source metadata.

Replaces OpenAlex when you want ranked answers, not raw metadata. Replaces Semantic Scholar when you want cross-source confirmation, not single-source coverage. Replaces Crossref when you want abstracts and citation context, not just DOI records.

Most APIs for research papers are single-source — OpenAlex covers everything but is generic, Semantic Scholar is strong on citations but weak on venue normalization, Crossref is the authoritative DOI metadata index but rarely has abstracts, DBLP is gold for computer science but covers nothing else.

This actor combines all four into a single API for research papers with intelligence on top:

  • multi-source search across DBLP, OpenAlex, Semantic Scholar, Crossref
  • automatic deduplication by DOI with fuzzy fallback
  • transparent ranking with cited-by + source-consensus + recency components
  • clustering and topic-trend detection across runs
  • per-paper intent classification (method / survey / benchmark / theory / application / dataset / tool)
  • structured researchBrief output that answers "what should I read first?"
  • ready-to-paste BibTeX, RIS, and Markdown exports

Use this when you want answers, not just raw API data.


How to track new research papers in a field

Research papers can be tracked over time by monitoring new publications, citation changes, and emerging topics in a field across multiple academic databases.

New research papers can be tracked programmatically by running a scheduled query, comparing results against the prior run, and flagging new entries, citation surges, and rising topic clusters.

Track new papers and citation surges in a field — with structured data and trend detection, not just email alerts.

This is a research paper monitoring tool. Unlike Google Scholar alerts, it returns structured JSON and topic-trend analysis — schedulable, multi-source, and webhook-ready.

Replaces Google Scholar alerts when you want structured JSON, multi-source coverage, citation-surge detection, and topic-trend tracking — instead of email notifications you have to parse manually.

This actor can be used as a research paper monitoring tool for any field, author, or venue.

In mode: "monitor" (or with diffWithPriorRun: true) it:

  • detects new papers since the last run for the same query scope
  • flags citation surges (papers gaining ≥50 citations between runs)
  • identifies new high-impact work (new + scoring above the seminal threshold)
  • tracks emerging topics over time — clusters tagged new / rising / stable / falling / gone
  • emits structured alert records per trigger, ready to route to Slack, email, Zapier, or any webhook

Use it as a programmable alternative to Google Scholar alerts — schedulable, structured, multi-source, and automation-ready.
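
Alert records are plain JSON, so routing them anywhere is a few lines of glue. A minimal sketch (assuming the apify-client and requests packages, and a Slack incoming-webhook URL of your own) that posts every alert from a monitor run to Slack:

import requests
from apify_client import ApifyClient

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder: your own incoming webhook

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "mode": "monitor",
    "query": "retrieval augmented generation",
    "mailto": "you@example.com",
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("recordType") != "alert":
        continue
    # Each alert already carries a human-readable message; post it as-is.
    text = f'[{item["alertType"]}] {item["message"]}'
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)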


Built for AI agents

This actor is designed for direct ingestion into AI workflows.

  • Structured output — every record carries a stable recordType discriminator. AI agents can branch on recordType === 'researchBrief' for the synthesized answer or filter on recordType === 'publication' for raw records.
  • Deterministic intelligence — scoring, clustering, intent classification, and ranking are pure-math computations, not LLM calls. No hallucination, no rate limits, reproducible across runs.
  • Plain-English fields — rankingExplanation, whyItMatters, confidence.explanation, and the brief's keyTakeaways are LLM-friendly strings ready to paste into agent prompts or report templates without post-processing.
  • Single canonical answer — the researchBrief is one self-contained object that fits cleanly in an LLM context window. No need to assemble meaning across 5 record types.
  • LangChain-ready chunks — abstracts + metadata are exportable as embedding-document chunks via the exports.ts helpers; downstream RAG pipelines drop straight in.
  • Use cases: literature-review agents, research copilots, autonomous knowledge-graph construction, retrieval-augmented research assistants, scheduled research-monitoring agents.
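
As an illustration of that branching, here is a minimal sketch (field names are the ones documented in this README; the context format itself is just an example) of assembling an LLM-ready context block from one run's dataset items:

def build_agent_context(items: list[dict], top_n: int = 5) -> str:
    """Assemble an LLM-ready context block from one run's dataset items."""
    brief = next((i for i in items if i.get("recordType") == "researchBrief"), None)
    papers = [i for i in items if i.get("recordType") == "publication"][:top_n]
    lines = []
    if brief:
        lines.append(f'Brief: {brief["oneLine"]}')
        lines.extend(f"- {takeaway}" for takeaway in brief.get("keyTakeaways", []))
    for paper in papers:
        # Prefer the field-level reasoning; fall back to the cohort-level explanation.
        why = "; ".join(paper.get("whyItMatters") or paper.get("rankingExplanation") or [])
        lines.append(f'[{paper["rank"]}] {paper["title"]} ({paper.get("year")}): {why}')
    return "\n".join(lines)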

One-line usage

curl -X POST "https://api.apify.com/v2/acts/eWBl1oo2MNg11IUA8/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"query":"transformer architecture","mailto":"you@example.com"}'

Returns a dataset with a researchBrief (top), ranked publications, clusters, insights, and Markdown summary.


The hero output: researchBrief

Every run emits a single recordType: researchBrief record at the top of the dataset (toggle off via emitResearchBrief: false). It's the canonical answer — open it first.

{
  "recordType": "researchBrief",
  "headline": "\"deep learning\" — 50 papers, high confidence",
  "oneLine": "Read first: \"Attention Is All You Need\" — Best overall blend of quality (0.94), velocity (13,820 cites/year), and 4/4 source agreement",
  "keyTakeaways": [
    "50 unique publications for \"deep learning\" (high confidence).",
    "4 seminal papers flagged — top 5% citations, age-adjusted.",
    "3 breakout papers — high-impact AND fast-rising.",
    "5 topics rising or newly emerging.",
    "62% of results are open access (31/50)."
  ],
  "whatToRead": [
    { "rank": 1, "role": "top-overall", "title": "...", "reason": "Best overall blend of quality, velocity, and source agreement" },
    { "rank": 12, "role": "survey", "title": "A Survey of ...", "reason": "Best entry-point survey — broad coverage with 142 references" },
    { "rank": 7, "role": "rising", "title": "...", "reason": "Fastest-rising recent work — 412 citations/year (2023)" },
    { "rank": 3, "role": "foundational", "title": "...", "reason": "Foundational — 8,422 citations after 11 years" },
    { "rank": 18, "role": "benchmark", "title": "...", "reason": "Useful benchmark — practical reference for reproduction" }
  ],
  "whatIsRising": [
    { "topic": "Diffusion / Generative / Latent", "paperCount": 8, "trend": "rising", "reason": "+3 papers since last run" }
  ],
  "whatMatters": [
    "Seminal papers anchor this result set — well-cited foundations are present.",
    "Top results are cross-confirmed across all four catalogs — high source agreement."
  ],
  "topicLandscape": [
    { "topic": "Attention / Transformer / Architecture", "paperCount": 12, "avgYear": 2020, "reason": "12 papers, mean year 2020" }
  ],
  "breakoutPapers": [
    { "rank": 7, "title": "...", "reason": "Breakout — top 10% citations (3,200) AND top 5% velocity (412/year)" }
  ],
  "enduringPapers": [
    { "rank": 1, "title": "...", "reason": "Enduring — published 9 years ago, still in top 10% velocity (13820/year)" }
  ],
  "recommendedFocus": [
    { "topic": "Diffusion / Generative / Latent", "reason": "Cluster growing — +3 papers (60% growth)", "paperRanks": [4, 17, 24, 31] }
  ],
  "confidence": {
    "overall": 0.82,
    "overallLevel": "high",
    "explanation": "High confidence — 4/4 sources returned data with strong cross-confirmation (78%) and rich metadata (87% field completeness)."
  }
}

The brief is built deterministically from the same signals the rest of the actor exposes — no LLM, no extra API calls. It exists because, without it, assembling meaning across several record types would be your job; the brief does it for you.


What does this actor do?

This actor is built for researchers, devs, and AI agents that need more than raw search results — it's an academic intelligence layer.

You search once. The actor:

  1. Sends your query to up to four catalogs in parallel — DBLP, OpenAlex (250M+ works), Semantic Scholar (200M+ papers, citation graph), Crossref (DOI metadata for 150M+ works).
  2. Normalizes every source into a common shape and merges duplicates by DOI (case-insensitive, prefix-stripped). Records without a DOI fall back to fuzzy title-plus-year matching.
  3. Scores every paper with a transparent quality model — citations (50%, log-normalized vs cohort), source consensus (30%, cross-confirmation), recency (20%, 10-year half-life decay).
  4. Generates plain-English ranking explanations ("Found in 4/4 sources; Top 5% citation count; Recent (2024)") usable directly in reports, emails, or AI summaries.
  5. Flags seminal papers — top 5% of citations within the result set, age-adjusted (≥500 citations if 5+ years old, ≥1000 at 2-5y, ≥2000 if <2y).
  6. Computes citations-per-year so you can spot rising momentum, not just legacy giants.
  7. Optionally extracts TF-IDF keywords from titles + abstracts and clusters results into topics by keyword overlap.
  8. Optionally fetches forward citations (papers that cite each result) and backward references (papers each result cites) — only for the top-N most relevant papers, to keep cost predictable.
  9. Optionally builds a structured citation graph with PageRank-ranked influence scores for citation-deep-dive workflows.
  10. Optionally diffs against the prior run for the same query scope — flagging new papers and citation surges since last time.
  11. Returns each record with full metadata plus pre-formatted BibTeX, RIS, and (in narrative mode) a Markdown summary of the whole run.

A single actor that replaces what would otherwise be 4 separate API integrations + a dedup pipeline + a scoring layer + a clustering pass + a state store.


Why use this on Apify?

  • No infrastructure to manage. Runs in the cloud. Handles pagination, retries with backoff, parallel fetching, dedup, scoring, clustering, and state persistence.
  • No API keys required. All four sources expose free anonymous tiers. Optionally pass mailto to get into OpenAlex's and Crossref's "polite pool" for faster, more reliable responses.
  • Resilience built in. When one source returns 5xx or 429, retries kick in. If a source still fails, the run continues with whatever the others returned. Per-source errors surface in the run summary.
  • Deduplication done right. DOI-first, fuzzy title+year fallback. Merged records keep the longest abstract, the highest citation count, the longest author list, and the open-access PDF URL when at least one source has one.
  • Transparent scoring. Every paper carries a quality score (0–1) with a components breakdown so you can audit and re-weight in your own pipeline if you want.
  • Workflow modes. Five pre-configured modes set sensible defaults for common research jobs — you click a mode, the actor configures itself.
  • Stateful when you want it. Monitor mode stores fingerprints in a named KV store and surfaces what's changed since last run. Stateless for one-shot queries.
  • Schedule, integrate, automate. Apify scheduling + Slack/webhook integrations + REST API + Python/JS clients.

Workflow modes

Each mode pre-configures sortBy, abstracts, citation enrichment, clustering, and narrative output. Explicit input fields always override mode defaults.

| Mode | Description | Required input | Auto-enables |
| --- | --- | --- | --- |
| auto (default) | Zero-config — picks the best mode for your input shape (DOI → citationGraph; author-only → authorAnalysis; diffWithPriorRun → monitor; short query + maxResults ≥ 30 → literatureReview; otherwise standard). | Same as the resolved mode | Inherits from resolved mode |
| standard | Generic multi-source search with scoring, BibTeX/RIS, and per-paper "why-ranked" explanations. Clustering and graph features stay off. | One of: query, author, venue, year, doi | |
| literatureReview | Clusters results into themes with TF-IDF keyword extraction. Returns top 100 by citation count. Emits a Markdown summary with cluster breakdown and top-paper highlights. | query (recommended) | clustering, keywords, abstracts, narrative summary, sortBy=citations, maxResults=100 |
| monitor | Schedule it. Stores result fingerprints in a named KV store, diffs against the prior run for the same scope, and emits recordType: alert records when a new high-impact paper appears or citations surge by ≥50. | One of: query, author, venue, year | diff vs prior run, alerts, keywords, sortBy=year |
| citationGraph | DOI deep-dive. Fetches forward citations and backward references, builds a structured {nodes, edges} graph, and computes PageRank to surface the top-10 most influential papers in the neighborhood. | doi | citations, references, structured graph, keywords |
| authorAnalysis | Pulls the author's top 200 publications by citation, clusters into research topics, emits a narrative summary with topic distribution. | author | clustering, keywords, abstracts, narrative summary, sortBy=citations, maxResults=200 |

Key features

Aggregation + dedup

  • Four sources in parallel: DBLP + OpenAlex + Semantic Scholar + Crossref.
  • DOI-first dedup with fuzzy title+year fallback.
  • Per-record provenance: sources[] array + sourceIds map.
  • mergeConfidence (0–1) on every record so you can filter low-confidence merges.
  • Smart source selection: biology/medical queries auto-skip DBLP (computer-science only).

Intelligence layer

  • Quality score (0–1) with components breakdown: citations, source consensus, recency. Persona-tunable weights.
  • Ranking explanation (plain-English array) per paper, usable in reports without modification.
  • Why it matters — separate plain-English array explaining the paper's role in the field (seminal, foundational, rising, high-influence, survey/benchmark/dataset/tool).
  • Paper intent — method / survey / benchmark / theory / application / dataset / tool / unknown. Filter your dataset to "just surveys" or "just benchmarks" with one column.
  • Seminal flag — top 5% citations, age-adjusted thresholds.
  • Citations-per-year momentum signal.
  • Merge confidence — filter >= 0.8 for high-confidence merges.

Curated insights (emitInsights: true)

A single recordType: insights record with five curated lists:

  • Top papers — best blend of quality (0.5×) + velocity (0.3×) + source agreement (0.2×).
  • Rising papers — highest citation velocity in the last 5 years.
  • Foundational papers — ≥1,000 citations, ≥8 years old, still actively cited.
  • Controversial papers — heavy reference load relative to cohort median (signals position-paper / debate-piece).
  • Emerging topics — recent + active clusters (mean year within last 3 years, ≥2 papers).

A recordType: trends record marking each cluster as new / rising (>25% paper-count growth) / stable / falling (>25% drop) / gone. Tracks how topics in your area shift between runs.

Persona-tunable scoring

  • researcher (default) — citations 40% / sources 40% / recency 20%. Depth and cross-confirmation matter most.
  • engineer — recency 55% / citations 25% / sources 20%. Fresh techniques win.
  • analyst — citations 55% / sources 30% / recency 15%. Impact dominates.

Query analysis & soft mode suggestions

Every run analyzes your query, detects the domain (machine learning, biology/medicine, physics, etc.), classifies intent, and suggests a better mode if applicable. Suggestions are surfaced in logs and the run summary — they NEVER auto-switch your mode.

Run confidence

Top-level confidence object with overall score (0–1), overallLevel band (high/medium/low), dataCompleteness, sourceAgreement, sources available vs attempted, and diagnostic notes ("Many records missing key fields", "Low source agreement").

Executive summary

When emitNarrative is on, the summary record carries a separate executiveSummary field — short Markdown bullets, ready to paste into Slack / email / dashboards. The full markdown field has the full narrative including per-source counts, clusters, and top-paper details.

Semantic layer

  • TF-IDF keyword extraction from title + abstract.
  • Topic clustering using Jaccard similarity on keywords (off by default — opt in via cluster: true or use literatureReview mode).
  • Related papers — top-5 most similar papers per record (when clustering is on).

Citation graph

  • Forward citations (papers that cite each result) and backward references (papers each result cites) — opt-in via includeCitations / includeReferences.
  • Cost-controlled: only fetches enrichment for the top-N most relevant papers (enrichTopN, default 25).
  • Structured graph object with nodes + edges + PageRank scores when emitGraph: true.
  • Top-10 influential nodes ranked by PageRank.

Cross-run state (monitor mode)

  • Stores result fingerprints in a named KV store keyed by (mode + query + filters).
  • Per-record newSinceLastRun boolean and citationDeltaSinceLastRun integer.
  • Alert records emitted on new high-impact papers, citation surges (≥50 since last run), and seminal-paper detections.
  • First run: all records are treated as new (no false positives on cold-start).

Exports

  • BibTeX entry on every record.
  • RIS entry on every record (Zotero / Mendeley / EndNote import).
  • Markdown narrative summary record with top-paper list + cluster breakdown.
  • Dedicated dataset views: Publications Overview, Intelligence (scores + explanations), BibTeX Export, RIS Export.
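
For instance, assembling a single .bib file from a finished run is a short script (a sketch, assuming the apify-client package and the default dataset ID of a completed run):

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
dataset_id = "DATASET_ID"  # default dataset ID of a finished run

# Concatenate the per-record bibtex fields into one .bib file.
entries = [
    item["bibtex"]
    for item in client.dataset(dataset_id).iterate_items()
    if item.get("recordType") == "publication" and item.get("bibtex")
]
with open("references.bib", "w", encoding="utf-8") as f:
    f.write("\n\n".join(entries))
print(f"Wrote {len(entries)} BibTeX entries to references.bib")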

DX

  • Typed records with a stable recordType discriminator: researchBrief / publication / cluster / alert / graph / insights / trends / summary / error.
  • Run summary written to KV SUMMARY with full per-source telemetry.
  • Debug mode (debug: true) saves raw per-source responses to DEBUG_RAW_HITS for inspection.
  • 5-consecutive-failure circuit breaker on enrichment loops.
  • Per-page status messages with PPE charge running total.

How to use it

  1. Open the actor in Apify Console.
  2. Pick a mode — Standard, Literature Review, Monitor, Citation Graph, or Author Intelligence.
  3. Fill in the required input for that mode (see the modes table above).
  4. Set optional filters (type, sources, sortBy, maxResults).
  5. (Recommended) enter your mailto email for polite-pool routing on OpenAlex and Crossref.
  6. Click Start.
  7. Open the dataset. Use Publications Overview for tabular results, Intelligence for the scoring view, or BibTeX Export / RIS Export for one-click reference manager imports.
  8. The KV store carries SUMMARY (full run telemetry) and, in narrative mode, the Markdown summary inside the dataset.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | String | No | "auto" | "auto" / "standard" / "literatureReview" / "monitor" / "citationGraph" / "authorAnalysis" |
| persona | String | No | "researcher" | "researcher" / "engineer" / "analyst" — tunes scoring weights |
| query | String | No* | "deep learning" | Free-text keywords |
| author | String | No* | | Author name (required for authorAnalysis) |
| venue | String | No* | | Conference or journal |
| year | String | No* | | Single year |
| type | String | No | All | Publication type filter |
| doi | String | No* | | DOI direct lookup (required for citationGraph) |
| sources | Array | No | all four | Subset of dblp, openalex, semanticscholar, crossref |
| sortBy | String | No | "relevance" | "relevance" / "citations" / "year" / "sources" |
| maxResults | Integer | No | 50 | 1–500 |
| includeAbstract | Boolean | No | true | Pull abstracts |
| extractKeywords | Boolean | No | false | TF-IDF keyword extraction |
| cluster | Boolean | No | false | Group results by keyword overlap |
| classifyIntent | Boolean | No | true | Classify each paper as method / survey / benchmark / theory / application / dataset / tool |
| includeCitations | Boolean | No | false | Forward citations (top-N only) |
| includeReferences | Boolean | No | false | Backward references (top-N only) |
| citationLimit | Integer | No | 25 | Citations/refs attached per paper |
| enrichTopN | Integer | No | 25 | Only enrich top-N ranked papers |
| emitGraph | Boolean | No | false | Emit recordType: graph with PageRank |
| emitNarrative | Boolean | No | false | Emit recordType: summary with executive summary + Markdown |
| emitInsights | Boolean | No | false | Emit recordType: insights with top / rising / foundational / emerging lists |
| emitAlerts | Boolean | No | false | Emit recordType: alert records |
| emitTrends | Boolean | No | false | Emit recordType: trends (cluster diff vs prior run) |
| emitResearchBrief | Boolean | No | true | Emit the canonical researchBrief hero record (the one you actually want) |
| diffWithPriorRun | Boolean | No | false | Diff vs prior run for same scope |
| debug | Boolean | No | false | Save raw per-source hits to KV |
| mailto | String | No | | Polite-pool email for OpenAlex / Crossref |

* You must provide at least one of: query, author, venue, year, or doi. Some modes have stricter requirements (see the modes table).

Starter templates

Zero-config — just give it a query:

{
"query": "transformer architecture",
"mailto": "you@example.com"
}

The actor picks the best mode automatically — short query + default maxResults → literatureReview, with a researchBrief on top.

Find seminal papers in a topic (explicit literature review):

{
"mode": "literatureReview",
"query": "transformer architecture",
"year": "2024",
"mailto": "you@example.com"
}

Schedule a weekly monitor for new papers in your area:

{
"mode": "monitor",
"query": "retrieval augmented generation",
"venue": "NeurIPS",
"mailto": "you@example.com"
}

Citation deep dive on a single paper:

{
"mode": "citationGraph",
"doi": "10.48550/arXiv.1706.03762",
"citationLimit": 50,
"enrichTopN": 1,
"mailto": "you@example.com"
}

Author intelligence for a researcher:

{
"mode": "authorAnalysis",
"author": "Yoshua Bengio",
"mailto": "you@example.com"
}

Quick standard search (no clustering, no enrichment):

{
"query": "graph neural networks",
"venue": "ICML",
"year": "2024",
"maxResults": 50,
"mailto": "you@example.com"
}

Tips for best results

  • Always pass mailto. OpenAlex and Crossref route requests with a contact email through faster, more reliable infrastructure.
  • Pick a mode before tweaking flags. Modes set sensible defaults for the job. If you find yourself toggling 4+ flags by hand, there's probably a mode that does it for you.
  • Citation enrichment scales linearly. Each enriched paper adds one Semantic Scholar call. Keep enrichTopN low (default 25) unless you genuinely need the full set.
  • cluster: true needs at least 4 papers with overlapping keywords. Solo papers and papers with no keyword overlap stay unclustered (clusterId: null).
  • Restrict sources to debug. If a particular source keeps timing out or returning weird data, drop it from the array.
  • For non-CS work, drop DBLP from sources. Or just let smart-source-selection handle it — biology/medical queries auto-skip DBLP.

Programmatic access

Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("eWBl1oo2MNg11IUA8").call(run_input={
    "mode": "literatureReview",
    "query": "diffusion models",
    "year": "2024",
    "mailto": "you@example.com",
})

publications = []
clusters = []
summary = None
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("recordType") == "publication":
        publications.append(item)
    elif item.get("recordType") == "cluster":
        clusters.append(item)
    elif item.get("recordType") == "summary":
        summary = item

print(f"{len(publications)} papers in {len(clusters)} clusters")
if summary:
    print(summary["headline"])

JavaScript:

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });
const run = await client.actor("eWBl1oo2MNg11IUA8").call({
    mode: "monitor",
    query: "retrieval augmented generation",
    mailto: "you@example.com",
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const newPapers = items.filter((i) => i.recordType === "publication" && i.newSinceLastRun);
const alerts = items.filter((i) => i.recordType === "alert");
console.log(`${newPapers.length} new papers since last run, ${alerts.length} alerts`);

cURL:

curl -X POST "https://api.apify.com/v2/acts/eWBl1oo2MNg11IUA8/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"mode": "citationGraph",
"doi": "10.48550/arXiv.1706.03762",
"mailto": "you@example.com"
}'
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

Output record types

The dataset is a stream of typed records. Filter on recordType for clean downstream processing.

publication — the main result records

{
  "recordType": "publication",
  "rank": 1,
  "doi": "10.48550/arxiv.1706.03762",
  "title": "Attention Is All You Need",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "..."],
  "venue": "Neural Information Processing Systems",
  "year": "2017",
  "type": "proceedings-article",
  "abstract": "The dominant sequence transduction models...",
  "citationCount": 124378,
  "citationsPerYear": 13820.0,
  "isOpenAccess": true,
  "isSeminal": true,
  "pdfUrl": "https://arxiv.org/pdf/1706.03762.pdf",
  "primaryUrl": "https://openalex.org/W2796796457",
  "sources": ["dblp", "openalex", "semanticscholar", "crossref"],
  "sourceIds": { "dblp": "conf/nips/...", "openalex": "...", "semanticscholar": "...", "crossref": "..." },
  "score": {
    "overall": 0.94,
    "components": {
      "citations": 1.0,
      "recency": 0.42,
      "sourceConsensus": 1.0
    }
  },
  "rankingExplanation": [
    "Found in 4/4 sources (dblp, openalex, semanticscholar, crossref)",
    "Top 1% citation count in this result set (124,378 citations)",
    "Open access PDF available",
    "Cross-confirmed across all 4 catalogs"
  ],
  "mergeConfidence": 0.94,
  "keywords": ["attention", "transformer", "sequence", "encoder", "decoder"],
  "clusterId": "c1",
  "relatedPaperRanks": [{ "rank": 5, "score": 0.42 }, { "rank": 12, "score": 0.31 }],
  "newSinceLastRun": false,
  "citationDeltaSinceLastRun": 1240,
  "bibtex": "@inproceedings{vaswani2017attention,\n title = {...},\n ...\n}",
  "ris": "TY - CONF\nTI - Attention Is All You Need\n...",
  "extractedAt": "2026-04-25T14:00:00.000Z"
}

cluster — topic groups (when cluster: true)

{
"recordType": "cluster",
"clusterId": "c1",
"name": "Attention / Transformer / Architecture",
"paperRanks": [1, 5, 12, 18, 24],
"paperCount": 5,
"dominantKeywords": ["attention", "transformer", "encoder", "decoder", "sequence"],
"avgScore": 0.81,
"avgYear": 2020
}

alert — threshold triggers (when emitAlerts: true)

{
"recordType": "alert",
"alertType": "newly-cited",
"severity": "warn",
"message": "Citation surge: \"Attention Is All You Need\" gained 1240 citations since last run",
"paperRank": 1,
"paperTitle": "Attention Is All You Need",
"paperDoi": "10.48550/arxiv.1706.03762",
"detail": { "citationDelta": 1240, "currentCitations": 124378 }
}

alertType is a stable enum: "new-paper" / "newly-cited" / "seminal-detected".

graph — citation graph (when emitGraph: true)

{
  "recordType": "graph",
  "graph": {
    "nodes": [{ "id": "doi:10.48550/...", "label": "Attention Is All You Need", "year": "2017", "doi": "10.48550/...", "inPrimarySet": true, "citationCount": 124378, "pagerank": 0.0421 }],
    "edges": [{ "source": "doi:...", "target": "doi:...", "type": "cites" }],
    "nodeCount": 87,
    "edgeCount": 213
  },
  "topInfluential": [
    { "id": "doi:10.48550/...", "label": "Attention Is All You Need", "pagerank": 0.0421, "doi": "10.48550/..." }
  ]
}

insights — curated lists (when emitInsights: true)

{
  "recordType": "insights",
  "persona": "researcher",
  "insights": {
    "topPapers": [{ "rank": 1, "title": "Attention Is All You Need", "doi": "...", "reason": "Top blend of quality (0.94), velocity (13820 cites/year), and 4/4 source agreement" }],
    "risingPapers": [{ "rank": 7, "title": "...", "doi": "...", "reason": "Citation velocity 412/year (2023, 824 total)" }],
    "foundationalPapers": [{ "rank": 1, "title": "...", "doi": "...", "reason": "124,378 citations across 9 years — still actively cited" }],
    "controversialPapers": [],
    "emergingTopics": [{ "clusterId": "c2", "name": "Diffusion / Generative / Latent", "paperCount": 8, "avgYear": 2024, "dominantKeywords": ["diffusion", "generative", "latent"], "reason": "Mean year 2024, 8 papers — recent and active" }]
  }
}

trends — cluster trend diff vs prior run (when emitTrends: true)

{
  "recordType": "trends",
  "isFirstRun": false,
  "priorRunAt": "2026-04-18T14:00:00.000Z",
  "clusters": [
    { "name": "Diffusion / Generative / Latent", "trend": "rising", "paperCount": 8, "paperCountDelta": 3, "dominantKeywords": ["diffusion", "generative", "latent"] },
    { "name": "Sparse / Mixture / Experts", "trend": "new", "paperCount": 4, "paperCountDelta": null, "dominantKeywords": ["sparse", "mixture", "expert"] },
    { "name": "Pruning / Quantization / Compression", "trend": "falling", "paperCount": 2, "paperCountDelta": -3, "dominantKeywords": ["pruning", "quantization", "compression"] }
  ]
}

trend is a stable enum: "new" / "rising" (>25% growth) / "stable" / "falling" (>25% drop) / "gone".

summary — Markdown narrative (when emitNarrative: true)

{
"recordType": "summary",
"headline": "literatureReview: 50 publications (high confidence)",
"executiveSummary": "## Key Takeaways\n\n- **50** unique publications across 4 sources (high confidence).\n- **4** seminal papers flagged in this set (top-5% citations, age-adjusted).\n- **5** rising papers with strong citation velocity in the last 5 years.\n- **3** emerging topics detected in the cluster set (recent + active).\n- Largest cluster: **Attention / Transformer / Architecture** (12 papers).",
"markdown": "# Academic Publication Search — \"transformer architecture\"\n\n**Mode:** literatureReview ...",
"persona": "researcher",
"queryAnalysis": { "detectedDomain": "machine learning / AI", "intent": "literature_review", "suggestedMode": null, "suggestedModeReason": null, "warnings": [] },
"confidence": { "overall": 0.82, "overallLevel": "high", "dataCompleteness": 0.87, "sourceAgreement": 0.78, "sourcesAvailable": 4, "sourcesAttempted": 4, "notes": [] },
"sourcesQueried": ["dblp", "openalex", "semanticscholar", "crossref"],
"sourceCounts": { "dblp": 50, "openalex": 50, "semanticscholar": 47, "crossref": 50 },
"sourceErrors": {},
"rawHits": 197,
"uniquePublications": 50,
"pushed": 50,
"mode": "literatureReview",
"sortBy": "citations",
"newSinceLastRunCount": null,
"seminalCount": 4,
"clusterCount": 5,
"alertCount": 0,
"finishedAt": "2026-04-25T14:00:00.000Z"
}

error — actor-level error

{ "recordType": "error", "error": true, "message": "...", "timestamp": "..." }

A run summary is also written to the key-value store under SUMMARY mirroring the summary record's telemetry, plus PPE charge totals.


How it works

Academic Publication Search

INPUT — mode → preset → resolved config
    │
    ▼
Parallel fetch (Promise.all, retry on 5xx)
    DBLP | OpenAlex | S2 | Crossref
    │
    ▼
Normalize + Dedup (DOI → fuzzy title)
    │
    ▼
Score (citations 50% + sources 30% + recency 20%) → rank
    Build ranking explanations (plain-English)
    Detect seminal papers (top 5%, age-adjusted)
    Compute citations-per-year + merge confidence
    │
    ▼
TF-IDF keyword extraction (when extractKeywords)
    Topic clustering by Jaccard overlap (when cluster)
    │
    ▼
Optional: forward citations + backward references (top-N)
    Optional: structured graph + PageRank
    │
    ▼
Optional: diff against prior run (KV state)
    Optional: emit alerts on thresholds
    Optional: emit Markdown narrative
    │
    ▼
Push records (publication / cluster / alert / graph / summary)
    PPE: charge per record, stop early on eventChargeLimitReached
    Save state for next-run diff
    Write SUMMARY to KV

Quality score formula

score = 0.5 × citationCohortScore (log-normalized within result set)
+ 0.3 × sourceConsensus (sources / 4)
+ 0.2 × recencyScore (exp(-age / 10), 10y half-life)

Weights are documented in dataset_schema.json. We deliberately don't include venue prestige — it requires a paywalled or scraped index that the actor can't ship without licensing. If you need venue weights, post-process the result set with your own ISSN-to-ranking map.
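
For illustration, here is a back-of-the-envelope re-implementation of the documented formula (the exact cohort normalization and persona re-weighting live in the actor's source; the log-normalization below is an assumption):

import math

def quality_score(citations, max_citations_in_cohort, sources_found_in, age_years,
                  weights=(0.5, 0.3, 0.2)):
    """Illustrative re-implementation of the documented score; weights are persona-tunable."""
    # Assumption: log-normalize citations against the best-cited paper in the result set.
    citation_cohort = math.log1p(citations) / math.log1p(max(max_citations_in_cohort, 1))
    source_consensus = sources_found_in / 4            # four catalogs queried
    recency = math.exp(-age_years / 10)                # decay as documented above
    w_cit, w_src, w_rec = weights
    return w_cit * citation_cohort + w_src * source_consensus + w_rec * recency

# A paper found in all 4 sources, with the cohort's top citation count, published 8 years ago:
# quality_score(124_378, 124_378, 4, 8) ≈ 0.5 + 0.3 + 0.2 * 0.45 ≈ 0.89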

Seminal detection

A paper is flagged seminal when:

  1. It sits in the top 5% of citation counts in the current result set, AND
  2. It clears age-adjusted citation thresholds:
    • 5+ years old: ≥500 citations
    • 2–5 years old: ≥1000 citations
    • <2 years old: ≥2000 citations

This cohort-relative + age-adjusted approach prevents both "everything seminal in a high-cite cohort" and "nothing seminal in a recent cohort."
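
The rule is small enough to restate in code. A sketch of the documented check (illustrative only; cohort_percentile is assumed to be precomputed within the result set):

def is_seminal(citations: int, age_years: float, cohort_percentile: float) -> bool:
    """Sketch of the documented rule; cohort_percentile is assumed precomputed (0–1)."""
    if cohort_percentile < 0.95:        # must sit in the top 5% of the result set
        return False
    if age_years >= 5:
        return citations >= 500
    if age_years >= 2:
        return citations >= 1000
    return citations >= 2000            # papers under 2 years old need the highest bar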

Deduplication

DOI-first. Records with the same normalized DOI merge into one group. Records without a DOI fall through to a normalized title-plus-year fingerprint and merge with any DOI-bearing record on the same fingerprint. Within a merged group:

  • Strings: prefer the first non-null, non-empty value (abstracts prefer the longest).
  • Authors: keep the longer author list.
  • Numbers: take the maximum (citation counts always reflect the best-crawled source).
  • isOpenAccess: true if any source said true.
  • Provenance: sources[] is the union of contributing source names; sourceIds is the per-source map.

mergeConfidence is computed from: DOI presence (+0.4), source count (+0.06 per extra source), title length (+0.03), year present (+0.03), authors present (+0.03), with a base of 0.4. Filter >= 0.8 for high-confidence merges.
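
A sketch of that heuristic as documented (an illustrative re-implementation, not the actor's source; the "title length" increment is simplified to a presence check):

def merge_confidence(record: dict) -> float:
    """Illustrative re-implementation of the documented mergeConfidence heuristic."""
    score = 0.4                                                   # base
    if record.get("doi"):
        score += 0.4                                              # DOI present
    score += 0.06 * max(len(record.get("sources", [])) - 1, 0)    # each extra confirming source
    if record.get("title"):                                       # simplified "title length" check
        score += 0.03
    if record.get("year"):
        score += 0.03
    if record.get("authors"):
        score += 0.03
    return min(score, 1.0)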

Reliability

Each source call uses 3 retries with exponential backoff on 5xx and 429. If retries exhaust, the per-source error lands in the run summary's sourceErrors map and the run continues with the other sources. Citation enrichment uses a 5-consecutive-failure circuit breaker so a Semantic Scholar outage doesn't burn credits on a long run.


How much does it cost to run?

Pay-per-event at $0.002 per merged unique publication returned. You're charged per deduplicated record, not per raw source hit, so a query that pulled 200 raw hits across four sources but merged into 50 unique publications charges 50 events. Cluster, alert, graph, and summary records are not charged.

| Scenario | Inputs | Approx. results | Approx. cost |
| --- | --- | --- | --- |
| Quick literature check | query, defaults | ~50 unique | $0.10 |
| Author bibliography | authorAnalysis mode | ~200 unique | $0.40 |
| Full conference extraction | venue + year, maxResults: 500 | ~500 unique | $1.00 |
| Citation graph deep dive | citationGraph mode | 1 enriched root + graph | $0.002 |
| Weekly monitor | monitor mode | typically 5–20 new | $0.01–$0.04 |

Apify platform compute (CPU + memory) is billed separately. Citation enrichment adds wall-clock time but doesn't change the PPE fee — the per-record charge is unchanged whether enrichment is on or off.


Limitations and responsible use

  • Maximum 500 merged records per run. For larger sweeps, paginate by year, venue, or author across multiple runs.
  • Bibliographic metadata only. No paper full-text. Use pdfUrl for open-access PDFs.
  • DOI lookup skips DBLP. DBLP has no DOI lookup endpoint.
  • Single-year filtering. Year ranges aren't exposed across all sources; run separately or post-filter.
  • Quality score is cohort-relative. A score of 0.7 in a cohort of seminal papers is different from 0.7 in a cohort of preprints. Always interpret scores within their result set.
  • Clustering quality is keyword-bound. TF-IDF + Jaccard is fast and deterministic but not as good as embedding-based clustering. For deep semantic work, post-process with your own embedding model on the abstracts.
  • PageRank on small graphs. With <30 nodes the PageRank scores are mostly noise — interpret them as weak signal in citation-graph mode unless your citationLimit is high enough.
  • State persistence requires Full Access tokens. Restricted-access scoped tokens can't open named KV stores; monitor mode degrades gracefully and treats every run as a first run under restricted tokens.
  • Polite-pool mailto. Always pass it. OpenAlex and Crossref are public services and mailto is how you identify yourself as a well-behaved consumer.

What this actor IS (positioning)

  • A research intelligence engine — not a thin academic search wrapper.
  • Decision-support output — the researchBrief tells you what to read first, what's rising, what to focus on. Raw data is there if you want it; meaning is delivered first.
  • Deterministic and explainable — every score, cluster, intent label, and ranking comes with a reason. No black boxes, no LLM hallucinations.
  • Pull-based — one query, one brief. No configuration drift, no 17-checkbox configurations to remember.

What this actor does NOT do

  • Does NOT replace paid platforms like Web of Science, Scopus, or Dimensions for licensed citation analytics, paywalled abstracts, or proprietary impact metrics.
  • Does NOT scrape Google Scholar. Google Scholar has no API and aggressively blocks programmatic access.
  • Does NOT extract or download paper PDFs. The pdfUrl field is a link; downloading is your job (or use a downloader actor).
  • Does NOT use any LLM or embedding service. Scoring, clustering, keywords, alerts, and explanations are all pure deterministic computation. Reproducible, fast, and free of LLM hallucinations — but limited to keyword-level semantics.
  • Does NOT compute h-index, journal impact factor, or author-level analytics. Per-paper signals only.
  • Does NOT cover preprints exhaustively. arXiv preprints surface via OpenAlex and Semantic Scholar's indexing of arXiv; the actor does not query arXiv's API directly.

FAQ

What is the quality score actually measuring? Three things, weighted: how well-cited the paper is relative to the rest of your result set (50%), how many of the four sources agreed it exists (30%), and how recent it is (20%). The score is cohort-relative — a 0.9 in a cohort of seminal papers means something different from a 0.9 in a cohort of recent preprints. The components breakdown is on every record so you can re-weight if your use case wants that.

Why do you flag seminal papers based on the result-set cohort, not absolute citation counts? Because a result set of NeurIPS 2017 papers has very different absolute citation distributions than a result set of NeurIPS 2024 papers. Cohort-relative + age-adjusted thresholds catch the standout papers in either cohort without false positives in either direction.

How does monitoring work across runs? The first run with monitor mode (or diffWithPriorRun: true) stores result fingerprints in a named KV store keyed by mode + query + filters. The next run with the same scope diffs against the stored state and adds newSinceLastRun and citationDeltaSinceLastRun to every record. The state store caps at 5,000 fingerprints per scope and rolls over FIFO to prevent unbounded growth.

What's the difference between cluster and graph output? Clusters group your result set by topical similarity (keyword overlap). The graph traces actual citation links between papers — who cites whom. You can use both: clusters for "what topics are in this search?" and graph for "who's the most influential in this neighborhood?".

Can I skip the new fields and get the simple shape I had before? Yes — the new fields are additive and null/empty when their feature is off. With default inputs, you get the same publication records as before plus the new score/explanation/keywords/etc. fields filled in (or null if not applicable). Filter the dataset on recordType: "publication" to skip cluster / alert / graph / summary records cleanly.

Why not use embeddings for clustering? Trade-offs. TF-IDF + Jaccard is deterministic, free, fast, and doesn't depend on an external embedding service that could rate-limit, change pricing, or hallucinate. The clustering quality is meaningfully worse than embedding-based, but it's good enough for "broadly group these 50 papers into themes" — which is the only thing literature-review mode promises. For deep semantic clustering, post-process the abstracts with your own embedding model.

What's the difference between rankingExplanation and whyItMatters? rankingExplanation answers "why is this at rank #N in this result set?" — built from cohort signals: source coverage, citation percentile within the result set, recency. whyItMatters answers "what's this paper's role in the field?" — built from absolute signals: seminal flag, foundational age + citations, citation velocity rank, PageRank rank in the citation graph (when graph is built), paper intent (survey, benchmark, dataset, tool). They overlap a little but optimize for different questions.

How does paper intent classification work? A regex sweep over title + abstract for surveys ("survey", "review of", "systematic review"), benchmarks ("benchmark", "leaderboard", "comparison"), datasets ("dataset", "corpus"), tools ("toolkit", "library"), theory ("theorem", "proof", "formal verification"), and applications ("case study", "deployed"). Falls back to "method" when "novel approach" / "we propose" patterns appear, and "unknown" otherwise. Pure regex — no LLM. False-positive rate is low for clear cases (a paper called "A Survey of X" is a survey 99% of the time) and "unknown" is honest about ambiguous cases.
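
A sketch of that kind of regex sweep (the pattern lists here are illustrative, not the actor's actual keyword sets):

import re

# Illustrative pattern lists only; the actor's real keyword sets live in its source.
INTENT_PATTERNS = [
    ("survey", r"\bsurvey\b|\breview of\b|systematic review"),
    ("benchmark", r"\bbenchmark\b|\bleaderboard\b|\bcomparison\b"),
    ("dataset", r"\bdataset\b|\bcorpus\b"),
    ("tool", r"\btoolkit\b|\blibrary\b"),
    ("theory", r"\btheorem\b|\bproof\b|formal verification"),
    ("application", r"\bcase study\b|\bdeployed\b"),
    ("method", r"\bnovel approach\b|\bwe propose\b"),
]

def classify_intent(title: str, abstract: str = "") -> str:
    text = f"{title} {abstract}".lower()
    for intent, pattern in INTENT_PATTERNS:
        if re.search(pattern, text):
            return intent
    return "unknown"

# classify_intent("A Survey of Graph Neural Networks")  -> "survey"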

Should I trust the persona-tunable scoring? The personas are weight presets — they're transparent (documented in input schema, surfaced in score.weights) and reproducible. They change the order of results, not the underlying citation/source/recency signals. If your job has a different weighting in mind, post-process the dataset with your own weights against score.components.
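
For example, re-ranking with your own weights against score.components takes a few lines (a sketch, assuming publication records already filtered out of the dataset):

def rerank(publications: list[dict], w_citations=0.7, w_sources=0.1, w_recency=0.2) -> list[dict]:
    """Re-rank publication records with custom weights over score.components."""
    def custom_score(pub: dict) -> float:
        c = pub["score"]["components"]
        return (w_citations * c["citations"]
                + w_sources * c["sourceConsensus"]
                + w_recency * c["recency"])
    return sorted(publications, key=custom_score, reverse=True)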

What does confidence: low mean for my run? Either (1) most records came from a single source, suggesting your query is in a niche only one catalog covers, (2) many records are missing key fields (DOI, abstract), or (3) the result set is small (<5 records). Low confidence doesn't mean the data is wrong — it means downstream consumers should weigh it less. The notes array spells out what's driving the score.

Why does monitor mode emit cluster trends and not just paper diffs? Paper-level diffs answer "what's new?" Cluster trends answer "what's rising?" — usually a more interesting signal for monitoring. A cluster going from 2 papers to 6 in a month is a stronger signal than 4 individual new papers, because it indicates a wave forming, not just isolated activity.

Does the citation graph show 2nd-degree citations? No — only 1-hop. The forwardCitations and backwardReferences arrays on each result hold the immediate neighborhood. Multi-hop graph expansion would multiply API calls unpredictably; we keep it predictable at 1-hop.

How fresh is the data? DBLP and Crossref update within days of publication. OpenAlex updates weekly. Semantic Scholar updates roughly monthly with re-scored citation counts. For papers under a week old, expect partial coverage; for papers more than a month old, all four sources should agree.

Can I use this for non-CS research? Yes. DBLP is CS-only, but OpenAlex covers every discipline, Crossref covers every DOI-registered work, and Semantic Scholar covers most STEM + social sciences. Smart-source-selection auto-drops DBLP for biology/medical queries; for other non-CS work, drop DBLP from sources manually.

How do I export to Zotero or Mendeley? Open the dataset in Apify Console, switch to the RIS Export view, download as text, import into your reference manager. The BibTeX Export view is the equivalent for LaTeX / Overleaf.

How does smart source selection work? Simple keyword heuristic: queries containing biology/medical terms (protein, gene, cancer, clinical, neuro, etc.) drop DBLP from the source list since DBLP indexes computer science only. The reduction is logged and the actor proceeds with the remaining sources. Override by setting sources explicitly.

What happens when cluster is on but no clusters form? You get zero recordType: cluster records. Papers that don't cluster have clusterId: null. Clustering needs at least 4 papers with overlapping keywords (≥0.25 Jaccard) to form a group of 2+.


Related actors

| Actor | Description |
| --- | --- |
| Crossref Academic Paper Search | Single-source Crossref search with citations, BibTeX, OA detection (Unpaywall) |
| Semantic Scholar Paper Search | Single-source Semantic Scholar search with AI-extracted abstracts and citation context |
| OpenAlex Research Paper Search | Single-source OpenAlex search across 250M+ scholarly works |
| ArXiv Preprint Paper Search | Search and download preprint papers from arXiv |
| ORCID Researcher Search | Search researchers by name or ORCID |