Europe PMC — Biomedical Knowledge Graph & Literature Mining

Pricing

from $2.00 / 1,000 papers fetched

Turn a biomedical topic into a knowledge graph and evidence corpus from Europe PMC. Mines genes, diseases, chemicals, organisms and deposited datasets (GEO, ENA, PDB) from full text, builds entity co-occurrence networks, tracks emerging entities, and exports Neo4j/Gephi CSV. No API key.

Developer

Ryan Clinton

Maintained by Community

Last modified: 10 days ago

Turn a biomedical topic into an analysis-ready evidence corpus from Europe PMC -- over 40 million articles aggregated from PubMed, PubMed Central (PMC), and preprint servers including bioRxiv and medRxiv. Beyond clean metadata, the actor mines the biology out of the full text: genes, proteins, diseases, chemicals, and organisms (ontology-grounded), plus the deposited dataset accessions (ENA, PDB, UniProt, GEO) behind each paper. Add preprint awareness, full-text access level, and run-level corpus analytics, and you get a structured biomedical dataset -- and a biomedical knowledge graph -- in one run. No API key required.

Unlike PubMed wrappers that return papers, this actor returns the biological entities, datasets, trends, and graph relationships behind a biomedical topic.

Common workflows

| Job | Input |
| --- | --- |
| Build a biomedical knowledge graph | { "query": "CRISPR", "includeNetworks": true } |
| Discover the datasets behind a field | { "query": "single cell sequencing", "entityRollup": true } |
| Detect emerging genes / diseases / targets | { "query": "CAR-T therapy", "entityRollup": true } |
| Full-text entity mining for a RAG corpus | { "query": "Alzheimer's disease", "includeMinedEntities": true, "outputProfile": "compact" } |
| Find preprints before peer review | { "query": "base editing", "queryPreset": "preprints_only" } |
| Pull only open-access full text | { "query": "tumor microenvironment", "queryPreset": "open_full_text" } |

Ready-to-run examples

One-click, pre-configured versions of the common jobs are listed on the examples page: open one, change the query, and run.

What each job returns

| Job | Key input | What you get back |
| --- | --- | --- |
| Build a knowledge graph | includeNetworks | edge records (entity co-occurrence) + nodes.csv / edges.csv |
| Map the research landscape | entityRollup + includeNetworks | entity records + summary with top entities, emerging entities, graph size |
| Dataset discovery | queryPreset: datasets_available | accession entities (GEO/ENA/PDB/UniProt) + the papers that use them |
| Emerging genes & targets | entityRollup | entity records with a recent-vs-prior trend + emergingEntities |
| Entities for RAG | includeMinedEntities + outputProfile: compact | compact per-paper minedEntities (genes, diseases, chemicals, organisms) |
| Clinical trials | queryPreset: clinical_trials | clinical-trial publications with abstracts, MeSH, full-text links |
| Preprints | queryPreset: preprints_only | bioRxiv / medRxiv preprints with isPreprint + access level |
| Systematic reviews | queryPreset: reviews_only | review and systematic-review articles |
| Open-access full text | queryPreset: open_full_text | papers with accessLevel: open-fulltext + fullTextUrl |
| Recent high-impact | queryPreset: recent_high_impact | last-5-years papers sorted by citation count |

What you get from one query

Run { "query": "glioblastoma", "entityRollup": true, "includeNetworks": true } and a single run returns, with no manual extraction, spreadsheet work, or graph-building in between:

  • The relevant papers -- with abstracts, MeSH terms, full-text links, preprint status, and access level.
  • Every gene, disease, chemical, and organism mined from the full text, each as an aggregate record with paper counts, citations, and a recent-vs-prior trend.
  • The public datasets (GEO, ENA, PDB, UniProt accessions) referenced across the corpus -- which dominate the field, what biology co-occurs with each, and when they were active.
  • The fastest-rising entities -- e.g. for glioblastoma, genes like EGFRvIII or TP53 climbing in recent literature -- computed deterministically from publication dates.
  • A knowledge graph -- typed entity nodes and co-occurrence relationships as dataset records plus nodes.csv / edges.csv, ready to import into Neo4j, Gephi, Cytoscape, or a GraphRAG / agent-memory pipeline.

The run-level summary record consolidates this into one object -- corpus composition, top entities, top datasets, emerging entities, publication timeline, and knowledge-graph size (networkStats) -- the whole landscape behind your query in a single record.

Example records

{ "recordType": "entity", "entityType": "gene", "name": "TP53", "paperCount": 124, "trend": "rising" }
{ "recordType": "edge", "source": "TP53", "target": "glioblastoma", "relationship": "CO_OCCURS", "weight": 27 }
{ "recordType": "summary", "totalPapers": 500, "topEntities": {}, "emergingEntities": {}, "networkStats": { "nodeCount": 2134, "edgeCount": 11800 } }

That is the structural synthesis a literature review, a drug-discovery target scan, or a bioinformatics knowledge-base build normally does by hand after exporting -- delivered as one structured dataset.
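Because one dataset mixes several record types, a first processing step is usually to split it on recordType. The snippet below is a minimal sketch of that step, assuming the items have already been downloaded as a list of dictionaries; the helper name split_by_record_type is hypothetical and not part of the actor.

```python
from collections import defaultdict


def split_by_record_type(items):
    """Group dataset items by their recordType field (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


# Sample items shaped like the example records above.
sample = [
    {"recordType": "entity", "entityType": "gene", "name": "TP53",
     "paperCount": 124, "trend": "rising"},
    {"recordType": "edge", "source": "TP53", "target": "glioblastoma",
     "relationship": "CO_OCCURS", "weight": 27},
    {"recordType": "summary", "totalPapers": 500},
]

groups = split_by_record_type(sample)

# Pick out the genes flagged as rising.
rising_genes = [
    e["name"]
    for e in groups.get("entity", [])
    if e.get("entityType") == "gene" and e.get("trend") == "rising"
]
print(rising_genes)
```

Grouping once up front keeps later steps (graph import, trend reports, per-paper RAG ingestion) from having to re-check recordType on every item.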

From 500 papers to the knowledge graph behind them

The whole research landscape from one query: top genes, diseases, datasets, emerging entities


Why Europe PMC?

Most biomedical search tools wrap PubMed, which gives you citations and abstracts. Europe PMC carries the same coverage plus the assets that make structural synthesis possible in one place:

  • Text-mined biological entities -- genes, proteins, diseases, chemicals, organisms, ontology-grounded (RXNORM, UniProt, OBO), extracted from full text.
  • Deposited data accessions -- the GEO, ENA, PDB, and UniProt datasets behind each paper, for dataset discovery and reproducibility.
  • Preprints -- bioRxiv, medRxiv, and Research Square, which PubMed indexes only to a limited extent.
  • Full-text links and open-access status, grants/funder linkage, and MeSH terms -- all in a single keyless API.

That combination is why this actor can return a knowledge graph and an entity landscape, not just a list of papers -- the raw material for drug discovery, life-sciences competitive scans, bioinformatics pipelines, and scientific dataset discovery.

Questions this actor answers: which genes dominate, which datasets matter, which entities are emerging


  • Get the biology, not just the papers -- one run returns the genes, diseases, chemicals, and datasets behind a topic, how they connect (co-occurrence network), and which are rising -- the structural synthesis researchers normally do by hand in Excel after exporting.
  • Access 40M+ publications in one search -- Europe PMC unifies PubMed (MED), PMC full-text (PMC), and preprints (PPR) into a single searchable index, eliminating the need to query multiple databases separately.
  • No API key or authentication needed -- the Europe PMC REST API is completely free and open, so you can start extracting data immediately without registration or credentials.
  • Rich structured metadata -- every result includes PMID, PMCID, DOI, full author lists with affiliations, abstract text, MeSH subject headings, citation counts, publication types, and direct full-text URLs.
  • Automated pagination and data transformation -- the actor handles cursor-based pagination, nested API response parsing, and output normalization so you get clean, flat JSON records ready for analysis.
  • Schedule recurring literature monitoring -- run the actor daily or weekly with date range filters to automatically track new publications on any biomedical topic.
  • Export anywhere -- results are stored in standard Apify datasets that export to JSON, CSV, Excel, Google Sheets, or feed directly into downstream workflows via webhooks and the Apify API.

This actor vs the alternatives

| Capability | This actor | Raw Europe PMC API | Generic scraper |
| --- | --- | --- | --- |
| Clean, paginated Apify dataset | Yes | Manual | Maybe |
| Preprint classification (isPreprint) | Yes | Manual | No |
| Full-text access level (accessLevel) | Yes | Manual | No |
| Text-mined biological entities | Yes | Separate API | No |
| Entity-centric rollup records | Yes | No | No |
| Emerging-entity trends (recent vs prior) | Yes | No | No |
| Entity co-occurrence network + graph CSV | Yes | No | No |
| Deposited-data accessions | Yes | Separate API | No |
| Funder linkage | Yes | Manual | No |
| Corpus-composition analytics | Yes | No | No |
| CSV-flat / agent-compact output | Yes | No | No |

Metadata, or insight: capability comparison vs a basic paper-search actor


Key features

  • Advanced query support -- free-text search plus one-click presets (clinical trials, reviews, preprints, open full text, datasets available) so you don't need to know the query syntax; power users can still use field operators like TITLE:"term", AUTH:"name", and Boolean AND/OR/NOT
  • Author filtering -- narrow results to a specific researcher by name using the dedicated author filter field
  • Journal filtering -- restrict searches to publications from a specific journal title
  • Date range filtering -- specify start and end dates in YYYY-MM-DD format to target a publication window
  • Open access filtering -- toggle a single checkbox to return only freely available open access publications
  • Source database selection -- choose between All sources, PubMed (MED) for MEDLINE citations, PMC for full-text articles, or Preprints (PPR) for bioRxiv/medRxiv content
  • Flexible sort options -- sort results by relevance, citation count (most cited first), or publication date (most recent first)
  • Full-text URL extraction -- automatically finds the best available full-text link for each article, preferring HTML over PDF over any other format
  • MeSH term extraction -- returns Medical Subject Headings for each article, enabling standardized topic classification and filtering
  • Up to 500 results per run -- cursor-based pagination collects large result sets efficiently with page sizes up to 1,000 per API call
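
The full-text URL preference (HTML over PDF over anything else) can be sketched as a small ranking function. This is an illustration only, not the actor's code: it assumes each candidate link is a dictionary with documentStyle and url keys, which is how the Europe PMC core response appears to describe full-text links, and the function name pick_full_text_url is hypothetical.

```python
def pick_full_text_url(links):
    """Return the preferred URL: HTML first, then PDF, then any other style."""
    rank = {"html": 0, "pdf": 1}  # every other style ranks last (2)
    candidates = [link for link in links if link.get("url")]
    if not candidates:
        return None
    # min() keeps the first candidate among equally ranked ones.
    best = min(
        candidates,
        key=lambda link: rank.get(str(link.get("documentStyle", "")).lower(), 2),
    )
    return best["url"]


links = [
    {"documentStyle": "doi", "url": "https://doi.org/10.1038/s41586-023-06468-x"},
    {"documentStyle": "pdf", "url": "https://europepmc.org/articles/PMC10564893?pdf=render"},
    {"documentStyle": "html", "url": "https://europepmc.org/articles/PMC10564893"},
]
print(pick_full_text_url(links))
```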

What you also get: full-text-mined entities, dataset discovery, knowledge graph, emerging entities


Using the Apify Console

  1. Go to the Europe PMC Literature Search actor page on Apify.
  2. Click Start to open the input configuration form.
  3. Enter your search query in the Search Query field (e.g., CRISPR gene editing).
  4. Optionally fill in Author Name, Journal Name, Date From, Date To, Open Access Only, and Source Database filters.
  5. Select your preferred Sort By option -- Relevance, Most Cited, or Most Recent.
  6. Set the Max Results value (1 to 500, default is 50).
  7. Click Start to run the actor.
  8. When the run finishes, open the Dataset tab to view, download, or export results in JSON, CSV, or Excel format.

Using the Apify API or CLI

apify call ryanclinton/europe-pmc-search \
--input='{"query":"CRISPR gene editing","openAccessOnly":true,"sortBy":"CITED desc","maxResults":100}'

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | String | Yes | -- | Search query. Supports free text and field syntax like TITLE:"term", AUTH:"name", DOI:10.xxx |
| queryPreset | String | No | general | Expert query construction without the syntax: clinical_trials, reviews_only, preprints_only, open_full_text, datasets_available, recent_high_impact. Adds the right Europe PMC filters on top of your query |
| author | String | No | -- | Filter by author name (e.g., "Smith J") |
| journal | String | No | -- | Filter by journal name (e.g., "Nature") |
| dateFrom | String | No | -- | Start date in YYYY-MM-DD format |
| dateTo | String | No | -- | End date in YYYY-MM-DD format |
| openAccessOnly | Boolean | No | false | Only return open access publications |
| source | String | No | All | Source database: All, PubMed (MED), PMC Full Text (PMC), or Preprints (PPR) |
| sortBy | String | No | RELEVANCE | Sort order: RELEVANCE, CITED desc (most cited), or P_PDATE_D desc (most recent) |
| maxResults | Integer | No | 50 | Maximum number of results to return (1--500) |
| includeMinedEntities | Boolean | No | false | Fetch text-mined biological entities (genes, proteins, diseases, chemicals, organisms, data accessions) extracted from each paper's full text via the Europe PMC Annotations API. Adds the minedEntities and accessions fields. Makes extra API calls (one batch per ~8 papers) |
| entityRollup | Boolean | No | false | Emit per-entity aggregate records (recordType: "entity") across the result set: genes, diseases, chemicals, organisms, and datasets with paper counts, summed citations, and example papers. Enables entity mining automatically |
| includeNetworks | Boolean | No | false | Emit entity co-occurrence edges (recordType: "edge") plus nodes.csv / edges.csv in the key-value store, ready for Neo4j / Gephi / Cytoscape / GraphRAG. Enables entity mining automatically |
| outputProfile | String | No | standard | standard returns the full record; compact drops the authors list, abstract, and per-paper entity arrays (keeping counts) for lean agent/LLM use |
| flattenForCsv | Boolean | No | false | Flatten nested arrays/objects to delimited strings (authors, MeSH, funders, mined entities become single columns) for clean CSV / spreadsheet export |
| emitSummary | Boolean | No | true | Append a run-level summary record with corpus composition (preprint vs peer-reviewed share, open-access share, top MeSH topics, top funders, top mined entities, most-referenced datasets), also mirrored to the SUMMARY key-value store key |
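
To make the flattenForCsv behaviour concrete, here is a rough sketch of how nested fields can be collapsed into delimited strings. It is not the actor's implementation: the "; " delimiter, the dotted column names for nested objects, and the choice to keep only each object's name (or fullName) are all assumptions made for illustration.

```python
import json


def _cell(value):
    """Render one list element as text; objects fall back to JSON."""
    if isinstance(value, dict):
        label = value.get("name") or value.get("fullName")
        return str(label) if label else json.dumps(value, sort_keys=True)
    return str(value)


def flatten_for_csv(record, sep="; "):
    """Collapse lists and nested objects into single string columns."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            flat[key] = sep.join(_cell(v) for v in value)
        elif isinstance(value, dict):
            # Assumption: one column per nested key, e.g. minedEntities.diseases
            for sub_key, sub_value in value.items():
                items = sub_value if isinstance(sub_value, list) else [sub_value]
                flat[f"{key}.{sub_key}"] = sep.join(_cell(v) for v in items)
        else:
            flat[key] = value
    return flat


record = {
    "title": "Base editing of haematopoietic stem cells rescues sickle cell disease in mice",
    "meshTerms": ["CRISPR-Cas Systems", "Sickle Cell Disease"],
    "authors": [{"fullName": "Newby GA", "lastName": "Newby"}],
    "minedEntities": {"genesProteins": [{"name": "HBB", "count": 6}], "chemicals": []},
}
flat = flatten_for_csv(record)
print(flat["meshTerms"])
```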

Example input

{
  "query": "machine learning drug discovery",
  "author": "Zhang",
  "journal": "Nature",
  "dateFrom": "2023-01-01",
  "dateTo": "2025-12-31",
  "openAccessOnly": true,
  "source": "MED",
  "sortBy": "CITED desc",
  "maxResults": 100
}

Tips for effective queries

  • Combine free text with field operators for precision: TITLE:"deep learning" AND AUTH:"Chen".
  • Use the dedicated author and journal filter fields instead of embedding them in the query string -- the actor builds the correct Lucene syntax for you.
  • Set dateFrom to a recent date and schedule recurring runs to build an automated new-publication alert pipeline.
  • Filter by source PMC when you need articles with guaranteed full-text availability.
  • Filter by source PPR to find preprints from bioRxiv and medRxiv before they are formally published.

Output

The dataset contains paper records (one per publication) and, by default, a single run-level summary record. Each paper record carries full publication metadata plus Europe-PMC-native intelligence: preprint status, full-text access level, funder linkage, and (when entity mining is enabled) the biological entities and data accessions mined from the full text.

Entity records aggregated across the corpus: gene, disease, dataset with paper counts and trend

Example output (paper record)

{
  "recordType": "paper",
  "pmid": "37648796",
  "pmcid": "PMC10564893",
  "doi": "10.1038/s41586-023-06468-x",
  "title": "Base editing of haematopoietic stem cells rescues sickle cell disease in mice",
  "authorString": "Newby GA, Yen JS, Woodard KJ, Mayuranathan T, Lazzarotto CR, Li Y...",
  "authors": [
    {
      "fullName": "Newby GA",
      "firstName": "Gregory A",
      "lastName": "Newby",
      "affiliation": "Merkin Institute, Broad Institute of Harvard and MIT, Cambridge, MA"
    }
  ],
  "journalTitle": "Nature",
  "journalVolume": "623",
  "journalIssue": "7985",
  "pageInfo": "295-302",
  "pubYear": "2023",
  "firstPublicationDate": "2023-08-30",
  "abstractText": "Sickle cell disease (SCD) is caused by a point mutation in the beta-globin gene...",
  "citedByCount": 147,
  "isOpenAccess": true,
  "inPMC": true,
  "inEPMC": true,
  "source": "MED",
  "isPreprint": false,
  "publicationStatus": "peer-reviewed",
  "accessLevel": "open-fulltext",
  "pubType": ["research-article", "Journal Article"],
  "meshTerms": ["CRISPR-Cas Systems", "Sickle Cell Disease", "Hematopoietic Stem Cells"],
  "grants": [{ "agency": "National Institutes of Health", "grantId": "R01HL156647" }],
  "funders": ["National Institutes of Health"],
  "accessions": [{ "name": "GSE181897", "count": 2, "uri": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181897" }],
  "accessionCount": 1,
  "minedEntities": {
    "genesProteins": [{ "name": "HBB", "count": 6 }],
    "diseases": [{ "name": "sickle cell disease", "count": 4 }],
    "chemicals": [],
    "organisms": [{ "name": "Mus musculus", "count": 3 }],
    "cellTypes": [{ "name": "hematopoietic stem cell", "count": 2 }],
    "goTerms": [],
    "other": []
  },
  "minedEntityCount": 5,
  "fullTextUrl": "https://europepmc.org/articles/PMC10564893",
  "europePmcUrl": "https://europepmc.org/article/MED/37648796",
  "summary": "Peer-reviewed: \"Base editing of haematopoietic stem cells rescues sickle cell disease in mice\" (2023, Nature). 147 citations; open-access full text; 5 mined entities; 1 data accessions",
  "extractedAt": "2026-02-19T14:30:00.000Z"
}

The summary record (emitted once at the end, recordType: "summary") carries corpus composition: totalPapers, preprintCount / peerReviewedCount, openAccessCount, preprintSharePct, openAccessSharePct, yearRange, publicationTimeline (per-year counts), topJournals, topAuthors, topMeshTerms, topFunders, topEntities (genes / diseases / chemicals / organisms), topReferencedDatasets, emergingEntities (rising genes/diseases/chemicals/datasets, recent vs prior window), avgCompletenessScore, and runStats (API request / retry / failure counts + duration).

Entity-centric output (entityRollup)

With entityRollup: true, the dataset also carries recordType: "entity" records — one per gene, disease, chemical, organism, and dataset (accession) found across the result set, with paperCount, totalCitations, firstYear/lastYear, a recent-vs-prior trend, and (when includeNetworks is on) topCoOccurring. Filter entityType == "accession" for the dataset-discovery view ("which datasets dominate this topic, what biology co-occurs with them, and when were they active").

{ "recordType": "entity", "entityType": "gene", "name": "TP53", "paperCount": 124, "totalCitations": 8932, "firstYear": 2014, "lastYear": 2026, "recentPaperCount": 41, "priorPaperCount": 23, "trend": "rising", "topCoOccurring": [{ "name": "glioblastoma", "entityType": "disease", "weight": 27 }], "exampleCanonicalIds": ["doi:10.1038/s41586-023-06468-x"] }

When entityRollup is on and the corpus spans 2+ years, the summary record also carries emergingEntities — genes, diseases, chemicals, and datasets rising in the recent window vs the prior window, computed deterministically from publication dates (no LLM, no cross-run state).
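
The exact windowing rule behind trend is not documented here, so the following is only one plausible way to compute a recent-vs-prior trend from publication years. It assumes the corpus year range is split at its midpoint and that an entity is "rising" when the recent half holds more papers than the prior half; the "declining" and "stable" labels and the function name classify_trend are hypothetical.

```python
def classify_trend(paper_years, first_year, last_year):
    """Compare paper counts in the recent vs the prior half of the year range.

    Assumption: the window boundary is the midpoint of [first_year, last_year].
    """
    midpoint = (first_year + last_year) / 2
    recent = sum(1 for year in paper_years if year > midpoint)
    prior = sum(1 for year in paper_years if year <= midpoint)
    if recent > prior:
        trend = "rising"
    elif recent < prior:
        trend = "declining"  # hypothetical label
    else:
        trend = "stable"     # hypothetical label
    return {"recentPaperCount": recent, "priorPaperCount": prior, "trend": trend}


# Publication years of the papers that mention one entity.
years = [2015, 2016, 2022, 2023, 2024]
result = classify_trend(years, first_year=2014, last_year=2026)
print(result)
```

Because the calculation depends only on publication dates within a single run, repeating the run on the same corpus gives the same answer, which matches the "no LLM, no cross-run state" property described above.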

Co-occurrence network (includeNetworks)

With includeNetworks: true, the dataset carries recordType: "edge" records — entity co-occurrence links with a typed relationship, a weight (papers mentioning both), and normalizedWeight (0-100) — and the key-value store gets nodes.csv + edges.csv whose headers follow the neo4j-admin import convention (id:ID / :LABEL / :START_ID / :END_ID / :TYPE), so the graph imports directly into Neo4j, Gephi, Cytoscape, or a GraphRAG pipeline.

{ "recordType": "edge", "source": "TP53", "sourceType": "gene", "target": "glioblastoma", "targetType": "disease", "relationship": "CO_OCCURS", "weight": 27, "normalizedWeight": 64 }

This is the build-a-biomedical-knowledge-graph job in one run: point it at a topic, get typed entity nodes (genes, diseases, chemicals, organisms, datasets) and co-occurrence relationships ready for a graph database or agent retrieval memory — no extraction or graph-building step in between.
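
For readers who want to see what co-occurrence counting involves, here is a minimal sketch under stated assumptions. Each paper is reduced to a list of (entityType, name) pairs; weight is the number of papers mentioning both entities, as described above. Scaling normalizedWeight against the heaviest edge and adding a weight:int column to the CSV are assumptions, not the actor's documented behaviour, and build_edges / edges_to_csv are hypothetical names.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def build_edges(papers):
    """Count, for every entity pair, how many papers mention both."""
    counts = Counter()
    for entities in papers:
        # De-duplicate within a paper so each paper counts once per pair.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    max_weight = max(counts.values(), default=0)
    return [
        {
            "recordType": "edge",
            "source": a[1], "sourceType": a[0],
            "target": b[1], "targetType": b[0],
            "relationship": "CO_OCCURS",
            "weight": weight,
            # Assumption: 0-100 scale relative to the heaviest edge.
            "normalizedWeight": round(100 * weight / max_weight),
        }
        for (a, b), weight in counts.most_common()
    ]


def edges_to_csv(edges):
    """Serialize edges with neo4j-admin import style headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([":START_ID", ":END_ID", ":TYPE", "weight:int"])
    for edge in edges:
        writer.writerow([edge["source"], edge["target"], edge["relationship"], edge["weight"]])
    return buffer.getvalue()


papers = [
    [("gene", "TP53"), ("disease", "glioblastoma")],
    [("gene", "TP53"), ("disease", "glioblastoma"), ("gene", "EGFR")],
    [("gene", "EGFR")],
]
edges = build_edges(papers)
csv_text = edges_to_csv(edges)
print(csv_text)
```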

Output fields reference (paper record)

| Field | Type | Description |
| --- | --- | --- |
| recordType | String | paper for these records; the dataset also carries summary, entity, edge, and error records |
| pmid | String | PubMed identifier |
| pmcid | String | PubMed Central identifier |
| doi | String | Digital Object Identifier |
| title | String | Publication title |
| authorString | String | Comma-separated author names |
| authors | Array | Structured author objects with fullName, firstName, lastName, affiliation |
| journalTitle | String | Name of the journal |
| journalVolume | String | Journal volume number |
| journalIssue | String | Journal issue number |
| pageInfo | String | Page range (e.g., "295-302") |
| pubYear | String | Publication year |
| firstPublicationDate | String | Date of first publication (YYYY-MM-DD) |
| abstractText | String | Full abstract text |
| citedByCount | Number | Number of citations in Europe PMC |
| isOpenAccess | Boolean | Whether the article is open access |
| inPMC | Boolean | Whether the article is in PubMed Central |
| inEPMC | Boolean | Whether the article is in Europe PMC |
| source | String | Source database (MED, PMC, PPR, etc.) |
| canonicalId | String | Single best cross-system join key (doi: > pmid: > pmcid: > source:id) |
| isPreprint | Boolean | True for preprints (source PPR: bioRxiv / medRxiv / Research Square) |
| publicationStatus | String | preprint or peer-reviewed |
| isReview | Boolean | Publication type includes "review" (convenience filter, not an evidence grade) |
| isClinicalTrial | Boolean | Publication type includes "clinical trial" (convenience filter, not an evidence grade) |
| accessLevel | String | open-fulltext, restricted-fulltext, or abstract-only |
| pubType | Array | Publication types (e.g., "research-article", "Review") |
| meshTerms | Array | Medical Subject Heading terms |
| grants | Array | Funder / grant linkage: { agency, grantId } |
| funders | Array | Distinct funding agency names |
| accessions | Array | Deposited dataset / sequence accessions mined from the paper (ENA, PDB, UniProt, GEO): { name, count, uri }. Requires includeMinedEntities |
| accessionCount | Number | Count of distinct data accessions |
| minedEntities | Object | Text-mined biological entities bucketed by type (genesProteins, diseases, chemicals, organisms, cellTypes, goTerms, other). null unless includeMinedEntities is on |
| minedEntityCount | Number | Total distinct mined entities (including accessions) |
| completeness | Object | Metadata-completeness signal: score (0-1, fraction of 7 key fields present) plus hasDoi, hasAbstract, hasMeshTerms, hasAffiliations, hasFullText, hasFunding, hasDataAccessions. Filter reliable records by completeness.score |
| fullTextUrl | String | Best available full-text URL (HTML preferred over PDF) |
| europePmcUrl | String | Direct link to the article on Europe PMC |
| summary | String | Plain-English one-line summary of the record |
| extractedAt | String | ISO 8601 timestamp of when the data was extracted |
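
Two of the derived fields, canonicalId and completeness, follow rules simple enough to sketch. The code below is an approximation for illustration: the precedence order and the seven flag names come from the table above, but how each flag is derived from the record, the rounding to two decimals, and the id key used for the source:id fallback are assumptions.

```python
def canonical_id(record):
    """Pick the best join key: doi, then pmid, then pmcid, then source:id."""
    for key in ("doi", "pmid", "pmcid"):
        if record.get(key):
            return f"{key}:{record[key]}"
    # Assumption: the fallback uses a raw "id" field from the source record.
    return f"{record.get('source')}:{record.get('id')}"


def completeness(record):
    """Score a record as the fraction of 7 key fields that are present."""
    flags = {
        "hasDoi": bool(record.get("doi")),
        "hasAbstract": bool(record.get("abstractText")),
        "hasMeshTerms": bool(record.get("meshTerms")),
        "hasAffiliations": any(a.get("affiliation") for a in record.get("authors") or []),
        "hasFullText": bool(record.get("fullTextUrl")),
        "hasFunding": bool(record.get("grants") or record.get("funders")),
        "hasDataAccessions": bool(record.get("accessions")),
    }
    score = round(sum(flags.values()) / len(flags), 2)
    return {"score": score, **flags}


sparse_preprint = {
    "source": "PPR",
    "id": "PPR123456",  # made-up identifier for the example
    "abstractText": "A short abstract.",
    "fullTextUrl": "https://example.org/preprint",
    "authors": [{"fullName": "Doe J"}],
}
print(canonical_id(sparse_preprint))
print(completeness(sparse_preprint)["score"])
```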

Use cases

  • Systematic literature reviews -- collect all publications matching specific criteria for structured evidence synthesis in medical or scientific research
  • Research trend analysis -- track publication volume, citation patterns, and emerging topics across biomedical fields over time
  • Competitor intelligence for pharma -- monitor publications from competing research groups or pharmaceutical companies working on similar drug targets
  • Grant application preparation -- quickly survey existing literature on a topic to establish research gaps and justify funding proposals
  • Clinical evidence gathering -- find clinical trial publications and reviews relevant to specific treatments, diseases, or medical devices
  • Preprint monitoring -- filter by source PPR to track bioRxiv and medRxiv preprints before they appear in peer-reviewed journals
  • Author publication tracking -- follow a specific researcher's output by combining author name filters with date ranges
  • Knowledge graph construction -- extract structured metadata including MeSH terms, authors, and citations to build biomedical knowledge graphs and network analyses
  • Open access content mining -- filter for open access articles to build text mining datasets for NLP, machine learning, or AI training
  • Journal benchmarking -- compare publication volume and citation impact across journals in a specific research area

API & integrations

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("RMqjhlGfzi7ScjOGH").call(run_input={
    "query": "CRISPR gene editing",
    "openAccessOnly": True,
    "sortBy": "CITED desc",
    "maxResults": 100,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Skip the run-level summary record, which has no title
    if item.get("recordType") != "paper":
        continue
    print(f"{item['title']} -- {item['citedByCount']} citations")

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

// Start the actor and wait for the run to finish
const run = await client.actor("RMqjhlGfzi7ScjOGH").call({
  query: "CRISPR gene editing",
  openAccessOnly: true,
  sortBy: "CITED desc",
  maxResults: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
// Skip the run-level summary record, which has no title
items
  .filter((item) => item.recordType === "paper")
  .forEach((item) => {
    console.log(`${item.title} -- ${item.citedByCount} citations`);
  });

cURL

curl "https://api.apify.com/v2/acts/RMqjhlGfzi7ScjOGH/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "query": "CRISPR gene editing",
    "openAccessOnly": true,
    "sortBy": "CITED desc",
    "maxResults": 100
  }'

Platform integrations

  • Apify Schedules -- run daily or weekly to monitor new publications on a topic
  • Webhooks -- trigger downstream processing when a run completes (e.g., email alerts, Slack notifications)
  • Zapier / Make -- connect Apify to thousands of apps for automated literature monitoring workflows
  • Google Sheets -- export results directly to a spreadsheet for collaborative review
  • Amazon S3 / Google Cloud Storage -- push datasets to cloud storage for archival or further processing

Use in Dify

Drop this actor into Dify workflows via the Apify plugin's Run Actor node. Each paper returns classified, structured JSON — publicationStatus (preprint / peer-reviewed), accessLevel (open-fulltext / restricted-fulltext / abstract-only), and the text-mined biology — so a downstream if/else node branches on stable enums instead of parsing prose. A generic scraper pointed at a publisher page returns rendered HTML; this returns the classification and the entities.

  • Actor ID: ryanclinton/europe-pmc-search
  • Sample input (build a full-text-aware evidence pipeline with entity mining):
    {
      "query": "base editing sickle cell",
      "includeMinedEntities": true,
      "openAccessOnly": false,
      "sortBy": "CITED desc",
      "maxResults": 50
    }
  • Branching example — a Dify if/else node routes each paper record by the fields above:
    • accessLevel == "open-fulltext" → fetch the body via fullTextUrl and feed a RAG index.
    • isPreprint == true → flag as not-yet-peer-reviewed before citing.
    • accessionCount > 0 → route to a data-reuse / reproducibility step using the accessions list.
    • completeness.score >= 0.7 → accept; below → send to a manual-review branch.
    • recordType == "entity" → route gene / disease / dataset aggregates to a knowledge-graph or trend node; recordType == "edge" → a network node.
    • recordType == "summary" → branch the run-level corpus record (preprint share, top entities, top funders) to a reporting node.
  • Presets and modes Dify can leverage: queryPreset (e.g. datasets_available, clinical_trials) constructs the right query without Lucene syntax; includeMinedEntities turns on the per-paper minedEntities block (genes, diseases, chemicals, organisms) and the accessions list; entityRollup adds entity-aggregate records and includeNetworks adds co-occurrence edges + graph CSVs; outputProfile: "compact" returns a lean record for token-bounded agent steps; canonicalId is the stable key for joining or deduping across nodes. The mined-entity arrays are usable verbatim by a downstream node — no LLM rewriting needed.
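
Outside Dify, the same branching can be expressed as a plain routing function. The sketch below mirrors the conditions listed above; the route names ("rag-index", "manual-review", and so on) are made up for illustration, and the function itself is not part of the actor or the Dify plugin.

```python
def route(record):
    """Return the list of downstream branches one dataset record should go to."""
    record_type = record.get("recordType")
    if record_type == "summary":
        return ["reporting"]
    if record_type == "entity":
        return ["knowledge-graph"]
    if record_type == "edge":
        return ["network"]

    # Paper records can match several branches at once.
    routes = []
    if record.get("isPreprint"):
        routes.append("flag-not-peer-reviewed")
    if record.get("accessLevel") == "open-fulltext":
        routes.append("rag-index")
    if (record.get("accessionCount") or 0) > 0:
        routes.append("data-reuse")
    score = (record.get("completeness") or {}).get("score", 0)
    routes.append("accept" if score >= 0.7 else "manual-review")
    return routes


paper = {
    "recordType": "paper",
    "isPreprint": False,
    "accessLevel": "open-fulltext",
    "accessionCount": 1,
    "completeness": {"score": 0.86},
}
print(route(paper))
```

Branching on enums and counts like these keeps the workflow deterministic: no step has to interpret free text to decide where a record goes.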

How it works

  1. Parse input -- reads the search query and optional filters (author, journal, dates, open access, source database)
  2. Build Lucene query -- constructs a Europe PMC query string combining free text with field-specific operators like AUTH:"name", JOURNAL:"title", FIRST_PDATE:[from TO to], OPEN_ACCESS:y, and SRC:source
  3. Query the API -- sends the request to the Europe PMC REST API at https://www.ebi.ac.uk/europepmc/webservices/rest/search with resultType=core for full metadata
  4. Paginate with cursors -- uses cursor-based pagination (cursorMark) with page sizes up to 1,000 to efficiently collect large result sets
  5. Transform results -- normalizes nested API responses into flat, consistent output records (see the output fields reference)
  6. Extract full-text URLs -- finds the best available full-text link for each article, preferring HTML over PDF over any other format
  7. Push to dataset -- stores each batch of transformed records in the Apify dataset as they are collected
Input Query + Filters
        |
        v
[Build Lucene Query]
        |
        v
[Europe PMC REST API] <--cursorMark-- [Next Page?]
        |                                  ^
        v                                  |
[Transform Results] --> [Push to Dataset] -+
        |
        v
Clean JSON Output (up to 500 records)
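
Steps 2 and 4 can be illustrated with a short sketch. It is not the actor's source: the field operators are the ones listed in step 2, while the wide placeholder bounds for open-ended date ranges, the response keys (resultList.result and nextCursorMark, as the Europe PMC search response appears to use), and the helper names build_query, collect, and fake_fetch are assumptions. The HTTP call is replaced by an offline stand-in so the example runs without network access.

```python
def build_query(query, author=None, journal=None, date_from=None,
                date_to=None, open_access_only=False, source=None):
    """Combine free text with the field operators listed in step 2."""
    parts = [f"({query})"]
    if author:
        parts.append(f'AUTH:"{author}"')
    if journal:
        parts.append(f'JOURNAL:"{journal}"')
    if date_from or date_to:
        # Assumption: a missing bound falls back to a wide placeholder date.
        parts.append(f"FIRST_PDATE:[{date_from or '1900-01-01'} TO {date_to or '2100-12-31'}]")
    if open_access_only:
        parts.append("OPEN_ACCESS:y")
    if source and source != "All":
        parts.append(f"SRC:{source}")
    return " AND ".join(parts)


def collect(fetch_page, query, max_results=50, page_size=100):
    """Follow cursor marks until max_results is reached or hits run out."""
    results, cursor = [], "*"  # "*" requests the first page
    while len(results) < max_results:
        page = fetch_page(query=query, cursor_mark=cursor, page_size=page_size)
        batch = page.get("resultList", {}).get("result", [])
        results.extend(batch)
        next_cursor = page.get("nextCursorMark")
        if not batch or not next_cursor or next_cursor == cursor:
            break  # no more results
        cursor = next_cursor
    return results[:max_results]


def fake_fetch(query, cursor_mark, page_size):
    """Offline stand-in for the HTTP request: serves 250 fake hits."""
    start = 0 if cursor_mark == "*" else int(cursor_mark)
    batch = [{"id": str(i)} for i in range(start, min(start + page_size, 250))]
    return {"resultList": {"result": batch}, "nextCursorMark": str(start + len(batch))}


lucene = build_query("CRISPR gene editing", author="Zhang",
                     open_access_only=True, source="MED")
records = collect(fake_fetch, lucene, max_results=120)
print(lucene)
print(len(records))
```

In a real client, fetch_page would send the query, cursor mark, and page size to the search endpoint named in step 3 and return the parsed JSON response.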

Intelligence stack: query, Europe PMC search, entity mining, accession extraction, co-occurrence network, knowledge graph


Performance & cost

| Scenario | Results | Approx. Duration | Apify Platform Cost |
| --- | --- | --- | --- |
| Quick search | 50 | 5--10 seconds | < $0.01 |
| Medium batch | 200 | 15--30 seconds | < $0.01 |
| Maximum batch | 500 | 30--60 seconds | ~$0.01 |
| Scheduled daily run | 50/day | 5--10 seconds/run | < $0.30/month |
  • Memory requirement: 256 MB (minimum Apify tier)
  • The Europe PMC API is completely free with no usage limits for reasonable query volumes
  • The platform cost figures above cover Apify compute time only, which is minimal for this API-only actor; the per-paper price listed under Pricing is not included
  • No browser rendering or proxy infrastructure required

Limitations

  • Maximum 500 results per run -- the actor caps output at 500 records to keep runs fast and manageable. For larger datasets, run multiple queries with narrower filters.
  • Abstract only for paywalled articles -- the actor provides full metadata and abstracts for all articles, but full-text content behind paywalls requires separate institutional access.
  • Citation counts may lag -- the citedByCount field reflects Europe PMC's citation index, which may not be as current as Google Scholar or other citation databases.
  • Preprint metadata may be sparse -- preprints from bioRxiv and medRxiv may lack MeSH terms, full author affiliations, or other metadata that is added during peer review and indexing.
  • API rate limits -- while the Europe PMC API has no formal authentication requirement, extremely high-frequency requests may be throttled. The actor uses reasonable page sizes and sequential requests to avoid this.
  • Date filtering uses first publication date -- the FIRST_PDATE field may differ from the journal publication date for articles that appeared as early releases or preprints first.
  • No full-text download -- the actor extracts metadata, links, and (when entity mining is enabled) the annotations Europe PMC has already text-mined, but it does not itself download or parse the full text of articles.

Responsible use

  • Respect publisher terms -- while Europe PMC metadata is freely available, full-text articles may be subject to publisher copyright. Always check the license before redistributing or text mining full-text content.
  • Cite your sources -- if you use data from this actor in research publications or reports, cite the original articles and acknowledge Europe PMC as the data source.
  • Use reasonable query volumes -- avoid scheduling unnecessarily frequent runs or requesting maximum results when fewer would suffice. The Europe PMC API is a shared public resource.
  • Comply with institutional policies -- if you are accessing this actor through an institutional Apify account, ensure your usage complies with your organization's data handling and research ethics policies.
  • Do not use for spam or harassment -- do not use extracted author contact information (affiliations) for unsolicited bulk communications.

FAQ

Q: What databases does Europe PMC cover? A: Europe PMC indexes over 40 million records from three primary sources: PubMed (MED) for MEDLINE biomedical citations, PubMed Central (PMC) for full-text open access articles, and preprint servers (PPR) including bioRxiv and medRxiv. It also includes content from patents, agricultural research, and European life science repositories.

Q: Do I need an API key to use this actor? A: No. The Europe PMC REST API is completely free and open. This actor requires no API keys, tokens, or registration to run.

Q: How is this different from the PubMed Research Search actor? A: Europe PMC includes everything in PubMed plus additional content from PubMed Central full-text articles, preprints from bioRxiv/medRxiv, and European life science sources. It also provides MeSH terms, richer author metadata with affiliations, and direct full-text URLs in a single query.

Q: Can I get the full text of articles? A: The actor provides a fullTextUrl field with the best available link to the full text (HTML or PDF) for open access articles. For paywalled articles, you receive the abstract and all metadata but need institutional access for full text.

Q: What query syntax is supported? A: The actor supports Europe PMC's Lucene-based query syntax. You can use free text, field-specific operators (TITLE:"term", AUTH:"name", DOI:10.xxx, ABSTRACT:"keyword"), Boolean operators (AND, OR, NOT), and wildcards (*). The dedicated filter fields for author, journal, date, and source are combined automatically.

Q: Can I search for preprints specifically? A: Yes. Set the Source Database parameter to PPR (Preprints) to restrict results to bioRxiv, medRxiv, and other preprint servers indexed by Europe PMC.

Q: How does pagination work? A: The actor uses cursor-based pagination with the Europe PMC API's cursorMark parameter. Each API call retrieves up to 1,000 results, and the actor continues fetching pages until it reaches your maxResults limit or exhausts available results.

Q: Can I schedule automatic searches for new publications? A: Yes. Set up an Apify schedule to run the actor daily or weekly. Use the dateFrom parameter set to a recent date to capture only newly published articles. Combine with webhooks to send email or Slack alerts when new papers match your criteria.

Q: What are MeSH terms and why are they useful? A: MeSH (Medical Subject Headings) is a standardized vocabulary maintained by the National Library of Medicine. MeSH terms enable consistent topic classification across articles, making them valuable for systematic reviews, meta-analyses, and building structured topic taxonomies.

Q: How current is the data? A: Europe PMC updates its index daily. PubMed records typically appear within 1--2 days of being indexed by the National Library of Medicine. Preprints are indexed shortly after they are posted to bioRxiv or medRxiv.

Q: Can I sort results by citation count? A: Yes. Set the Sort By parameter to CITED desc to return the most highly cited articles first. This is useful for identifying seminal papers and high-impact research on any topic.

Q: What happens if my query returns more than 500 results? A: The actor returns up to 500 results (or your configured maxResults limit, whichever is lower). If the total hit count exceeds this, you can narrow your search with additional filters or run multiple queries with non-overlapping date ranges to cover the full result set.
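
One way to prepare the non-overlapping date ranges mentioned above is to generate the dateFrom / dateTo pairs up front and launch one run per window. The helper below is a sketch under that assumption; the name date_windows and the 90-day default are illustrative choices, not actor settings.

```python
from datetime import date, timedelta


def date_windows(start, end, days=90):
    """Split [start, end] into consecutive, non-overlapping date windows."""
    windows = []
    current = start
    while current <= end:
        stop = min(current + timedelta(days=days - 1), end)
        windows.append({"dateFrom": current.isoformat(), "dateTo": stop.isoformat()})
        current = stop + timedelta(days=1)  # next window starts the day after
    return windows


windows = date_windows(date(2024, 1, 1), date(2024, 1, 10), days=4)
for window in windows:
    # Each dict can be merged into the actor input for one run.
    print({"query": "CRISPR gene editing", "maxResults": 500, **window})
```

If a window still returns 500 records, it is probably too wide for the topic and can be split again; canonicalId is a convenient key for de-duplicating records when the runs are merged.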


| Actor | Description |
| --- | --- |
| PubMed Biomedical Literature Search | Search PubMed for MEDLINE-indexed biomedical citations with abstracts and metadata |
| Semantic Scholar Paper Search | Search Semantic Scholar for academic papers with AI-generated TLDRs and citation data |
| OpenAlex Research Paper Search | Search OpenAlex for open scholarly metadata across all academic disciplines |
| Crossref Academic Paper Search | Search Crossref for DOI-registered publications with reference metadata |
| ORCID Researcher Search | Look up researchers by ORCID ID to find their publication history and affiliations |
| ArXiv Preprint Paper Search | Search ArXiv for preprints in physics, mathematics, computer science, and related fields |