BAAI / Zhiyuan AI Research Papers Scraper
Pricing
Pay per event
Scrapes curated AI research papers from BAAI (Beijing Academy of AI, hub.baai.ac.cn). Extracts paper titles, authors, abstracts, arXiv IDs, venues, Chinese-language curator notes, and links.
Developer
BowTiedRaccoon
Extract the current curated AI research paper feed from BAAI (Beijing Academy of Artificial Intelligence, 智源研究院) at hub.baai.ac.cn. Each run fetches the hotness-sorted daily paper feed and enriches every paper with full editorial curator notes written in Chinese by BAAI staff.
What You Get
Each record includes:
| Field | Description |
|---|---|
| `paper_title_en` | Paper title in English |
| `arxiv_id` | ArXiv paper ID (e.g. 2606.06624) |
| `authors` | List of author names |
| `publication_date` | Release date (ISO 8601) |
| `abstract_zh` | Full Chinese-language abstract |
| `keywords_zh` | Chinese subject tags (e.g. 机器学习, 生成模型) |
| `keywords_en` | ArXiv category codes (e.g. cs.LG, cs.RL) |
| `pdf_url` | Direct PDF download link (BAAI-hosted mirror) |
| `baai_curator_note` | Structured editorial notes: [简介] abstract, [问题] problem addressed, [思路] key approach, [亮点] highlights, [相关] related work |
| `baai_url` | Canonical BAAI paper page URL |
| `cited_by_count` | BAAI hotness score |
| `source` | Always hub.baai.ac.cn |
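The `baai_curator_note` field packs several tagged sections into one string. As a minimal sketch, assuming each section begins with one of the bracketed tags listed in the table, the note can be split into a dictionary. `split_curator_note` is a hypothetical helper for downstream use, not part of the actor.

```python
import re

# Section tags from the field table: 简介 (abstract), 问题 (problem),
# 思路 (approach), 亮点 (highlights), 相关 (related work).
SECTION_TAGS = ("简介", "问题", "思路", "亮点", "相关")
_TAG_PATTERN = re.compile(r"\[(" + "|".join(SECTION_TAGS) + r")\]")


def split_curator_note(note: str) -> dict[str, str]:
    """Split a curator note into {tag: text}. Hypothetical helper."""
    sections: dict[str, str] = {}
    matches = list(_TAG_PATTERN.finditer(note))
    for i, m in enumerate(matches):
        # A section runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1)] = note[m.end():end].strip()
    return sections


# Shortened placeholder note in the same shape as the sample output.
note = "[简介] 强化学习...\n\n[问题] 如何设计...\n\n[思路] 提出..."
sections = split_curator_note(note)
```

Sections missing from a note are simply absent from the result, so callers may prefer `sections.get("亮点")` over direct indexing.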
Why BAAI?
BAAI (智源研究院) is China's premier government-backed AI research institute, behind the WuDao foundation model series, the BGE embedding family, and the Aquila LLM. Their curated daily paper feed covers ~10–30 papers per day with Chinese-language editorial summaries that are not available on arXiv. Those summaries are the main value this feed adds over arXiv itself.
Use cases:
- Track Chinese AI research output for competitive intelligence
- Build a joinable dataset with an ArXiv scraper (shared `arxiv_id` key)
- Monitor BAAI's curated AI research highlights in Chinese for sino-watchers
- Feed into downstream LLM pipelines with Chinese-language summaries
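The join use case can be sketched in plain Python. This assumes the other dataset also exposes an `arxiv_id` field. The `abstract_en` field and the function name `join_on_arxiv_id` are hypothetical and only serve the illustration.

```python
def join_on_arxiv_id(baai_records: list[dict], arxiv_records: list[dict]) -> list[dict]:
    """Inner-join two lists of records on the shared arxiv_id key."""
    arxiv_by_id = {r["arxiv_id"]: r for r in arxiv_records if r.get("arxiv_id")}
    joined = []
    for rec in baai_records:
        match = arxiv_by_id.get(rec.get("arxiv_id"))
        if match is not None:
            # BAAI fields win on key collisions; arXiv-only fields are added.
            joined.append({**match, **rec})
    return joined


baai = [{"arxiv_id": "2602.04879", "abstract_zh": "强化学习..."}]
arxiv = [
    {"arxiv_id": "2602.04879", "abstract_en": "placeholder"},  # hypothetical field
    {"arxiv_id": "0000.00000", "abstract_en": "unmatched placeholder"},
]
joined = join_on_arxiv_id(baai, arxiv)
```

An inner join drops BAAI records that have no arXiv counterpart. If those should be kept, append `rec` unchanged when `match` is `None`.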
Input
| Parameter | Required | Default | Description |
|---|---|---|---|
| `maxItems` | Yes | 5 | Maximum number of papers to return (current feed has ~9 per run) |
How It Works
- Fetches `hub.baai.ac.cn/papers`, a Nuxt SSR page that embeds the current hotness feed in `window.__NUXT__` state (no JavaScript execution required)
- Extracts up to 9 paper UUIDs from the SSR data
- Fetches each paper's detail page (`hub.baai.ac.cn/paper/<uuid>`), also fully SSR-rendered
- Merges listing data (basic fields) with detail data (curator notes, extended keywords)
- Emits one record per paper
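The extraction and merge steps above can be sketched as follows. This is a simplified illustration, not the actor's code: instead of parsing the `window.__NUXT__` state, it assumes the paper UUIDs also appear as `/paper/<uuid>` paths in the served HTML and picks them out with a regular expression. The function names and the merge rule (non-empty detail values override listing values) are assumptions.

```python
import re

# Assumption: UUIDs appear as /paper/<uuid> paths somewhere in the HTML.
UUID_RE = re.compile(
    r"/paper/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def extract_paper_uuids(html: str, max_items: int) -> list[str]:
    """Return unique paper UUIDs in page order, capped at max_items."""
    seen: dict[str, None] = {}
    for uuid in UUID_RE.findall(html):
        seen.setdefault(uuid, None)  # dicts keep first-seen order
    return list(seen)[:max_items]


def merge_record(listing: dict, detail: dict) -> dict:
    """Overlay non-empty detail fields on the listing fields."""
    merged = dict(listing)
    for key, value in detail.items():
        if value not in (None, "", []):
            merged[key] = value
    merged["source"] = "hub.baai.ac.cn"
    return merged


# The second UUID is a made-up placeholder.
html = (
    '<a href="/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7">a</a>'
    '<a href="/paper/00000000-0000-0000-0000-000000000001">b</a>'
    '<a href="/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7">a again</a>'
)
uuids = extract_paper_uuids(html, max_items=5)
record = merge_record(
    {"paper_title_en": "T", "keywords_en": ["cs.LG"], "baai_curator_note": ""},
    {"baai_curator_note": "[简介] ...", "keywords_en": []},
)
```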
Note on scope: The BAAI listing page renders the current editorial feed (~9 papers) via server-side rendering. Further pagination is client-side only (infinite scroll). Each run captures the current curated snapshot — run daily to build a historical archive.
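Because consecutive snapshots are likely to overlap, a daily archive needs deduplication. One possible approach, sketched here under the assumption that `baai_url` uniquely identifies a paper, appends only unseen records to a JSON Lines file. The helper name, file name, and demo URLs are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def append_new_records(archive_path, records) -> int:
    """Append records whose baai_url is not yet archived; return the count added."""
    path = Path(archive_path)
    seen = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["baai_url"])
    added = 0
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            if rec["baai_url"] in seen:
                continue
            # ensure_ascii=False keeps the Chinese text readable in the file.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            seen.add(rec["baai_url"])
            added += 1
    return added


# Demo with placeholder URLs: the second run overlaps the first by one paper.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "baai_archive.jsonl"
    first = append_new_records(archive, [{"baai_url": "https://hub.baai.ac.cn/paper/a"}])
    second = append_new_records(
        archive,
        [
            {"baai_url": "https://hub.baai.ac.cn/paper/a"},
            {"baai_url": "https://hub.baai.ac.cn/paper/b"},
        ],
    )
```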
Sample Output
    {
      "paper_title_en": "Rethinking the Trust Region in LLM Reinforcement Learning",
      "arxiv_id": "2602.04879",
      "authors": ["Penghui Qi", "Xiangxin Zhou", "Zichen Liu"],
      "publication_date": "2026-02-04",
      "abstract_zh": "强化学习(RL)已成为大语言模型(LLM)微调的基石...",
      "keywords_zh": ["机器学习", "强化学习", "大语言模型"],
      "keywords_en": ["cs.LG", "cs.CL", "cs.AI"],
      "pdf_url": "https://simg.baai.ac.cn/paperfile/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7.pdf",
      "baai_curator_note": "[简介] 强化学习(RL)已成为大语言模型...\n\n[问题] 如何设计更合理的信任域约束...\n\n[思路] 提出散度近端策略优化(DPPO)...",
      "baai_url": "https://hub.baai.ac.cn/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7",
      "cited_by_count": 120,
      "source": "hub.baai.ac.cn"
    }
Notes
- China-hosted: The site is hosted in China. Cross-border latency is factored into timeouts (45 seconds per request). Runs from US/EU Apify datacenters may experience occasional delays.
- No authentication required: The papers feed is publicly accessible without login.
- Daily curation: BAAI curates ~10–30 papers per day. Running this actor daily gives you a rolling archive of their editorial picks.
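For callers who fetch BAAI pages themselves, the 45-second timeout from the first note can be paired with retries to absorb occasional cross-border delays. The sketch below uses only the Python standard library. The retry count and exponential backoff are assumptions for illustration, not documented behavior of this actor.

```python
import time
import urllib.error
import urllib.request

REQUEST_TIMEOUT_S = 45  # per-request timeout stated in the notes


def http_get(url: str, timeout: float = REQUEST_TIMEOUT_S) -> str:
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, fetch=http_get, retries=3, backoff_s=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff_s * (2 ** attempt))  # 2s, 4s, ...
    raise last_error
```

The `fetch` and `sleep` parameters are injectable so the retry logic can be exercised without network access or real waiting.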