UniProt Protein Sequence & Annotation Scraper

Pricing

from $28.12 / 1,000 results

Export UniProt Knowledgebase entries — search Swiss-Prot by organism, keyword, gene, or any UniProt query, or fetch a single accession. Returns names, genes, organism, sequence length & molecular weight, keywords, comments, features, and PDB/RefSeq/Ensembl/KEGG cross-refs.

Rating: 0.0 (0)

Developer: ParseForge (maintained by community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago

🧬 UniProt Protein Sequence & Annotation Scraper

🚀 Export UniProt Knowledgebase entries in seconds. Query Swiss-Prot and TrEMBL by organism, gene, keyword, subcellular location, length range, or any UniProt field, or fetch a single accession with full annotations. No API key, no SPARQL, no XML parsing.

🕒 Last updated: 2026-05-13 · 📊 25 fields per entry · 🧬 250M+ UniProt entries · 🌍 every kingdom of life

The UniProt Protein Scraper queries the official UniProt REST API and returns standardized protein records from the world's largest protein-sequence knowledgebase. Each entry carries the primary accession, UniProtKB ID, entry type (reviewed Swiss-Prot vs unreviewed TrEMBL), protein name, alternative names, gene names, organism (scientific + common + taxon ID + lineage), evidence level, annotation score, sequence length, molecular weight, CRC64 / MD5 sequence hashes, keywords (with categories), curated comments (function, subunit, subcellular location, etc.), structural features, reference counts, last-update date, entry version, and the canonical UniProt URL.

UniProt is maintained jointly by EMBL-EBI, SIB, and PIR and is the de facto reference for protein biology in research, pharma, and bioinformatics. Coverage spans 250 million+ entries across 2.7 million+ species in TrEMBL, with ~570,000 manually curated entries in Swiss-Prot. This Actor flattens UniProt's nested JSON into rows that drop into pandas, R, or any warehouse.

🎯 Target Audience | 💡 Primary Use Cases
Bioinformatics teams, computational biologists, pharma research, structural biologists, drug-discovery startups, science journalists | Proteome exports, gene-to-protein mapping, target dossier builds, organism-level annotation, sequence + feature retrieval, cross-database joining

📋 What the UniProt Scraper does

Two lookup modes in one Actor:

  • 🔍 Query mode. Pass any UniProt query (reviewed:true AND organism_id:9606, keyword:KW-0181, gene:BRCA1, cc_subcellular_location:nucleus, existence:1, taxonomy_id:10090 AND length:[100 TO 500]).
  • 🆔 Accession mode. Set accession (e.g. P00533) for a single full-entry pull. Skips the search query entirely.
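
Query strings like the ones above can be assembled programmatically. The following is a minimal sketch, not part of the Actor: `build_query` and its parameters are hypothetical names, and it only covers a handful of the fields listed above, joined with AND.

```python
def build_query(reviewed=None, organism_id=None, gene=None, keyword=None, length_range=None):
    """Join the supplied filters with AND using UniProt's field:value syntax.

    Hypothetical helper: it only knows the fields named in this README.
    """
    clauses = []
    if reviewed is not None:
        # UniProt expects lowercase true/false
        clauses.append(f"reviewed:{str(reviewed).lower()}")
    if organism_id is not None:
        clauses.append(f"organism_id:{organism_id}")
    if gene:
        clauses.append(f"gene:{gene}")
    if keyword:
        clauses.append(f"keyword:{keyword}")
    if length_range:
        low, high = length_range
        clauses.append(f"length:[{low} TO {high}]")
    if not clauses:
        raise ValueError("at least one filter is required")
    return " AND ".join(clauses)


print(build_query(reviewed=True, organism_id=9606))
print(build_query(gene="BRCA1", length_range=(100, 500)))
```

The returned string goes into the query input field unchanged.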

Each record carries identifiers (primary accession, UniProtKB ID, entry type), names (protein name, alternative names, gene names), taxonomy (scientific + common organism, taxon ID, lineage), evidence (protein existence, annotation score), sequence facts (length, molecular weight, CRC64, MD5, plus optional full sequence string), curated annotations (keywords, comments, features), reference + feature counts, last-updated date, version, and the canonical UniProt URL.

💡 Why it matters: UniProt's REST API is rich but verbose. Writing and maintaining parsers for its nested keywords, comments, and features can take researchers and engineering teams days. This Actor flattens the response into 25 spreadsheet-ready fields, so target dossiers, comparative proteomics, and dataset prep can start from a single query.
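
To illustrate what "flattening" means here, the sketch below turns a nested entry into a flat row. It is not the Actor's code: `flatten_entry` is a hypothetical helper, and the nested layout (`proteinDescription.recommendedName.fullName.value`, `genes[].geneName.value`, `organism`, `sequence`) is an assumption about the UniProt JSON shape that should be checked against a real response.

```python
def flatten_entry(entry):
    """Flatten a nested UniProt-style entry dict into a flat row.

    The nested key names are assumptions about the UniProt JSON layout.
    Missing branches yield None (or an empty list) instead of raising.
    """
    description = entry.get("proteinDescription", {})
    recommended = description.get("recommendedName", {})
    organism = entry.get("organism", {})
    sequence = entry.get("sequence", {})
    return {
        "primaryAccession": entry.get("primaryAccession"),
        "proteinName": recommended.get("fullName", {}).get("value"),
        "geneNames": [
            gene["geneName"]["value"]
            for gene in entry.get("genes", [])
            if "geneName" in gene
        ],
        "organismScientific": organism.get("scientificName"),
        "taxonId": organism.get("taxonId"),
        "sequenceLength": sequence.get("length"),
        "sequenceMolWeight": sequence.get("molWeight"),
    }


# Hand-built nested input using values from the schema examples in this README
nested = {
    "primaryAccession": "A0A0C5B5G6",
    "proteinDescription": {
        "recommendedName": {"fullName": {"value": "Mitochondrial-derived peptide MOTS-c"}}
    },
    "genes": [{"geneName": {"value": "MT-RNR1"}}],
    "organism": {"scientificName": "Homo sapiens", "taxonId": 9606},
    "sequence": {"length": 16, "molWeight": 2175},
}
row = flatten_entry(nested)
print(row)
```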


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing a human proteome pull, gene lookup, and accession fetch.


⚙️ Input

Input | Type | Default | Behavior
query | string | "reviewed:true AND organism_id:9606" | UniProt query syntax. Supports reviewed:, organism_id:, taxonomy_id:, gene:, keyword:, cc_subcellular_location:, existence:, length:[X TO Y], and more. Ignored when accession is set.
accession | string | "" | Single UniProt accession (e.g. P00533). Bypasses the search query when set.
maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000.
fetchSequence | boolean | false | When true, embeds the full amino-acid sequence string in every record. Sequence length and molecular weight are always returned.
pageSize | integer | 500 | Entries per API request. UniProt hard max is 500.

Example: every reviewed human Swiss-Prot entry.

{
"query": "reviewed:true AND organism_id:9606",
"maxItems": 1000,
"pageSize": 500
}

Example: single accession, full sequence included.

{
"accession": "P00533",
"fetchSequence": true
}
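
When inputs are generated by a script rather than typed by hand, a small builder can enforce the rules from the input table. This is a sketch under assumptions: `build_input` and `estimated_requests` are hypothetical helpers, and the request estimate assumes the Actor issues one API request per page of pageSize entries.

```python
import math

UNIPROT_MAX_PAGE_SIZE = 500  # "UniProt hard max is 500" per the input table


def build_input(query="reviewed:true AND organism_id:9606", accession="",
                max_items=10, fetch_sequence=False, page_size=500):
    """Return an input dict for the Actor, validating the documented limits."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    if not 1 <= page_size <= UNIPROT_MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {UNIPROT_MAX_PAGE_SIZE}")
    run_input = {
        "maxItems": max_items,
        "fetchSequence": fetch_sequence,
        "pageSize": page_size,
    }
    if accession:
        # The query is ignored when accession is set, so leave it out
        run_input["accession"] = accession
    else:
        run_input["query"] = query
    return run_input


def estimated_requests(max_items, page_size=500):
    """Rough page count, assuming one API request per page."""
    return math.ceil(max_items / page_size)


print(build_input(accession="P00533", fetch_sequence=True))
print(estimated_requests(1000, 500))
```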

⚠️ Good to Know: the accession field is for a single entry. To resolve a list of accessions, use the query syntax: accession:P00533 OR accession:P04637. Use fetchSequence: false (default) when you do not need the raw amino-acid string. Sequence length and molecular weight are always returned regardless.
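
For longer accession lists, the OR query can be generated instead of typed. A minimal sketch follows; `accessions_to_query` is a hypothetical helper that trims whitespace, uppercases, and drops duplicates before joining.

```python
def accessions_to_query(accessions):
    """Turn a list of accessions into 'accession:A OR accession:B ...'."""
    cleaned = []
    for accession in accessions:
        accession = accession.strip().upper()
        if accession and accession not in cleaned:
            cleaned.append(accession)  # keep first occurrence, preserve order
    if not cleaned:
        raise ValueError("no accessions supplied")
    return " OR ".join(f"accession:{accession}" for accession in cleaned)


print(accessions_to_query(["P00533", "p04637 ", "P00533"]))
```

Pass the result as the query input and leave accession empty.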


📊 Output

Each entry carries 25 fields. Download as CSV, Excel, JSON, or XML.

🧾 Schema

Field | Type | Example
🆔 primaryAccession | string | "A0A0C5B5G6"
🏷️ uniProtkbId | string | "MOTSC_HUMAN"
📚 entryType | string | "UniProtKB reviewed (Swiss-Prot)"
🧬 proteinName | string | "Mitochondrial-derived peptide MOTS-c"
📝 alternativeNames | string[] | ["Mitochondrial open reading frame of the 12S rRNA-c"]
🧫 geneNames | string[] | ["MT-RNR1"]
🦠 organismScientific | string | "Homo sapiens"
👤 organismCommon | string | "Human"
🆔 taxonId | number | 9606
🌳 organismLineage | string[] | ["Eukaryota","Metazoa","Chordata",...]
🧪 proteinExistence | string | "1: Evidence at protein level"
annotationScore | number | 5
📏 sequenceLength | number | 16
⚖️ sequenceMolWeight | number | 2175
🔐 sequenceCrc64 | string | "361DE748426DD505"
🔐 sequenceMd5 | string | "AE72B6C4E87692429C0D558B92BD7B3D"
🏷️ keywords | object[] | [{ "id": "KW-0238", "category": "Molecular function", "name": "DNA-binding" }]
💬 comments | object[] | [{ "type": "FUNCTION", "text": "Regulates insulin sensitivity ..." }]
🧩 features | object[] | [{ "type": "Chain", "description": "MOTS-c", "start": 1, "end": 16 }]
📖 referenceCount | number | 17
🧱 featureCount | number | 6
📅 lastUpdated | date | "2026-01-28"
🔢 entryVersion | number | 30
🔗 url | string | "https://www.uniprot.org/uniprotkb/A0A0C5B5G6/entry"
🕒 scrapedAt | ISO 8601 | "2026-05-13T22:25:18.386Z"
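
Several schema fields are arrays or arrays of objects, which matters if you download JSON and build your own table. The sketch below shows one way to handle that: `records_to_csv` is a hypothetical helper that JSON-encodes nested values so each record stays on one CSV row. How the Actor's own CSV export represents these fields is not covered here.

```python
import csv
import io
import json


def records_to_csv(records, columns):
    """Write records to CSV text, JSON-encoding list and dict values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for column in columns:
            value = record.get(column)
            if isinstance(value, (list, dict)):
                # Keep nested data in a single cell as a JSON string
                value = json.dumps(value, ensure_ascii=False)
            row[column] = value
        writer.writerow(row)
    return buffer.getvalue()


# Values taken from the schema examples above
sample = {
    "primaryAccession": "A0A0C5B5G6",
    "geneNames": ["MT-RNR1"],
    "sequenceLength": 16,
    "features": [{"type": "Chain", "description": "MOTS-c", "start": 1, "end": 16}],
}
csv_text = records_to_csv(
    [sample], ["primaryAccession", "geneNames", "sequenceLength", "features"]
)
print(csv_text)
```

Reading the file back, `json.loads` on the geneNames or features cell restores the original structure.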

📦 Sample record


✨ Why choose this Actor

Capability
🧬 Authoritative knowledgebase. Pulls directly from the official UniProt REST API.
🔍 Full query syntax. Every UniProt search field works: organism, gene, keyword, location, length range, evidence, taxonomy.
🆔 Accession fast-path. Set accession: to pull one entry without writing a query.
📏 Sequence facts built in. Length and molecular weight always returned. Full sequence string available on demand.
🏷️ Curated annotations exposed. Keywords, comments, and features come through as structured arrays.
🚫 No API key. UniProt is a free public service.
🔁 Always fresh. Reflects the current UniProt release.

📊 UniProt accessions are a standard way to identify proteins in papers on protein biology, drug discovery, and structural biology.


📈 How it compares to alternatives

Approach | Cost | Coverage | Refresh | Format | Setup
⭐ UniProt Scraper (this Actor) | $5 free credit, then pay-per-use | UniProtKB (Swiss-Prot + TrEMBL) | Live per run | Flat JSON / CSV | ⚡ 2 min
Direct REST API calls | Free | Same | Live | Nested JSON | 🐢 Hours
Full release FASTA + XML download | Free | Full UniProt | 8-week release | Massive flatfiles | 🐢 Days
Commercial bioinformatics platform | $$$ | Curated subset | Real-time | Web UI / API | ⏳ Vendor onboarding

Pick this Actor when you want UniProt records in a flat table without writing a client or downloading the release.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the UniProt Protein Scraper page on the Apify Store.
  3. 🎯 Set input. Pick a query (reviewed:true AND organism_id:9606 is a great starter) or an accession.
  4. 🚀 Run it. Click Start and let the Actor walk the UniProt API.
  5. 📥 Download. Grab results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to a downloaded proteome slice: 3-5 minutes. No coding required.


💼 Business use cases

🧪 Drug Discovery & Pharma

  • Target dossier builds for new programs
  • Cross-organism homolog comparisons
  • Subcellular location filters for druggability
  • Evidence-level scoring for prioritization

🧬 Bioinformatics & Genomics

  • Gene-to-protein lookups across organisms
  • Proteome exports for comparative analysis
  • Annotation enrichment for variant calling
  • Keyword and feature-based cohort building

🔬 Structural Biology

  • Length and molecular-weight filters for crystallography candidates
  • Feature-table mining for domain boundaries
  • Sequence hash joins to PDB or AlphaFold IDs
  • Reference-count signals for popular targets

🤖 LLM & Bio AI

  • Ground LLM responses in UniProt-authoritative data
  • Build RAG indexes for protein chatbots
  • Training data for sequence-attribute models
  • Validation layers for bio AI agents

🔌 Automating UniProt Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. UniProt has an eight-week release cycle. Schedule a refresh on the same cadence to stay current.
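
A programmatic run from Python might look like the sketch below. It assumes the `apify-client` package, whose `ApifyClient` exposes `actor(...).call(run_input=...)` and `dataset(...).iterate_items()`. The helper name `fetch_uniprot_records` is hypothetical, and the Actor ID and API token are placeholders to replace with your own values. The client is passed in as an argument so the function can be exercised without network access.

```python
def fetch_uniprot_records(client, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return dataset items as a list.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage sketch (requires `pip install apify-client` and a valid token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   records = fetch_uniprot_records(
#       client,
#       "<ACTOR_ID>",  # placeholder: copy the ID from the Actor's Store page
#       {"query": "reviewed:true AND organism_id:9606", "maxItems": 10},
#   )
#   print(len(records))
```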


🌟 Beyond business use cases

UniProt data feeds far more than commercial pharma. The same structured records support research, education, and open-science work.

🎓 Research and academia

  • Reproducible proteome datasets for papers
  • Coursework on protein annotation and biocuration
  • Comparative-genomics theses with structured features
  • Open-data benchmarks for sequence-based ML

🎨 Personal and creative

  • Hobbyist bioinformatics portfolio projects
  • Sci-comm visualizations of protein families
  • Personal target tracker for citizen scientists
  • Indie tools for amateur synthetic biology

🤝 Non-profit and civic

  • Pandemic preparedness datasets keyed to UniProt
  • Public-health reports on pathogen proteomes
  • Open-source vaccine candidate research
  • Civic transparency on bio-research outputs

🧪 Experimentation

  • Train sequence-attribute ML classifiers
  • Prototype agents that build target dossiers
  • Test bio chatbot grounding against real records
  • Benchmark protein-NER models


❓ Frequently Asked Questions

🧩 How does it work?

Either supply a UniProt query (reviewed:true AND organism_id:9606) or an accession (P00533), then click Start. The Actor pages through the UniProt REST API, flattens nested fields, and emits a row per entry with 25 columns including keywords, comments, and features.

🔍 What query syntax can I use?

Everything UniProt supports in its own search bar. Common fields: reviewed:, organism_id:, taxonomy_id:, gene:, keyword:, cc_subcellular_location:, existence:, length:[X TO Y], accession:, plus boolean AND/OR/NOT. See the UniProt query fields docs for the full list.

🆔 How do I look up a single accession?

Set the accession field (e.g. P00533). It bypasses the query and pulls the full entry directly.

🧬 How do I look up many accessions at once?

Use the query syntax with OR: accession:P00533 OR accession:P04637 OR accession:Q9Y6K8.

📏 Does it include the full sequence string?

Only when fetchSequence: true. Sequence length and molecular weight are always returned. Skip the full string for big proteomes to keep dataset sizes manageable.
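
With fetchSequence enabled, records can be turned into FASTA for downstream tools. The sketch below is an illustration only: `to_fasta` is a hypothetical helper, and the output field holding the sequence string is not named in the schema above, so the `sequence_field` default of "sequence" is an assumption to adjust to the actual field name.

```python
def to_fasta(records, sequence_field="sequence", width=60):
    """Render records as FASTA text, skipping records without a sequence.

    `sequence_field` is an assumed field name; check a real record first.
    """
    lines = []
    for record in records:
        sequence = record.get(sequence_field)
        if not sequence:
            continue
        header = f">{record['primaryAccession']} {record.get('proteinName', '')}".rstrip()
        lines.append(header)
        # Wrap the sequence at a fixed line width
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + ("\n" if lines else "")


# Placeholder record: the accession and sequence here are made up for illustration
demo = [{"primaryAccession": "X00001", "proteinName": "Demo protein", "sequence": "A" * 70}]
print(to_fasta(demo), end="")
```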

🔁 How fresh is the data?

UniProt releases every eight weeks. Every run hits the live API, so output reflects the current release.

📚 What is the difference between Swiss-Prot and TrEMBL?

Swiss-Prot is manually curated (reviewed:true, ~570K entries). TrEMBL is automatically annotated (reviewed:false, hundreds of millions of entries). Pick the slice your work needs.
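
If a query mixes both sets, the entryType field can separate them afterwards. This is a minimal sketch: `split_by_review_status` is a hypothetical helper, the Swiss-Prot string comes from the schema example, and the TrEMBL string in the demo data is an assumption about how unreviewed entries are labeled.

```python
def split_by_review_status(records):
    """Split records into (reviewed, unreviewed) lists based on entryType."""
    reviewed, unreviewed = [], []
    for record in records:
        entry_type = record.get("entryType", "")
        if "Swiss-Prot" in entry_type:
            reviewed.append(record)
        else:
            # Anything not labeled Swiss-Prot is treated as unreviewed
            unreviewed.append(record)
    return reviewed, unreviewed


demo_records = [
    {"primaryAccession": "A0A0C5B5G6", "entryType": "UniProtKB reviewed (Swiss-Prot)"},
    # Assumed label for unreviewed entries
    {"primaryAccession": "X00001", "entryType": "UniProtKB unreviewed (TrEMBL)"},
]
curated, automatic = split_by_review_status(demo_records)
print(len(curated), len(automatic))
```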

🚫 Do I need an API key?

No. The UniProt REST API is free and public.

⏰ Can I schedule recurring runs?

Yes. Use Apify Schedules to refresh on the UniProt release cadence and pipe results into your pipeline.

⚖️ Can I reuse the data in publications or products?

Yes. UniProt is released under CC BY 4.0. Attribute UniProt in any downstream publication or product, as the license requires.

💳 Do I need a paid Apify plan?

No. The free plan covers small runs (10 records). A paid plan unlocks higher limits and scheduling.

🆘 What if I need help?

Reach out via the contact form below to request a custom protein workflow.


🔌 Integrate with any app

UniProt Protein Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step research workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get release notifications in your channels
  • Airbyte - Pipe protein records into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh UniProt entries into your bio pipeline or alert your team in Slack.


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by EMBL-EBI, the SIB Swiss Institute of Bioinformatics, the Protein Information Resource (PIR), the UniProt Consortium, or any of their funding agencies. All trademarks mentioned are the property of their respective owners. Only publicly available UniProtKB data is collected. Please cite UniProt as required by their CC BY 4.0 license.