BAAI / Zhiyuan AI Research Papers Scraper

Pricing

Pay per event

Scrapes curated AI research papers from BAAI (Beijing Academy of Artificial Intelligence, hub.baai.ac.cn). Extracts paper titles, authors, abstracts, arXiv IDs, arXiv categories, Chinese-language curator notes, and links.

Rating

0.0

(0)

Developer

BowTiedRaccoon

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 24 days ago

Extract the current curated AI research paper feed from BAAI (Beijing Academy of Artificial Intelligence, 智源研究院) at hub.baai.ac.cn. Each run fetches the hotness-sorted daily paper feed and enriches every paper with full editorial curator notes written in Chinese by BAAI staff.

What You Get

Each record includes:

| Field | Description |
| --- | --- |
| paper_title_en | Paper title in English |
| arxiv_id | arXiv paper ID (e.g. 2606.06624) |
| authors | List of author names |
| publication_date | Release date (ISO 8601) |
| abstract_zh | Full Chinese-language abstract |
| keywords_zh | Chinese subject tags (e.g. 机器学习, 生成模型) |
| keywords_en | arXiv category codes (e.g. cs.LG, cs.RL) |
| pdf_url | Direct PDF download link (BAAI-hosted mirror) |
| baai_curator_note | Structured editorial notes: [简介] abstract, [问题] problem addressed, [思路] key approach, [亮点] highlights, [相关] related work |
| baai_url | Canonical BAAI paper page URL |
| cited_by_count | BAAI hotness score |
| source | Always hub.baai.ac.cn |
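
For downstream code it can help to pin the record shape down as a type. The sketch below is one possible way to do that in Python. The field names come from the table above; the Python types are assumptions inferred from the sample output, and `BaaiPaper` and `missing_fields` are hypothetical names, not part of the actor.

```python
from typing import List, TypedDict


class BaaiPaper(TypedDict):
    """Assumed shape of one dataset record (types inferred, not guaranteed)."""
    paper_title_en: str
    arxiv_id: str
    authors: List[str]
    publication_date: str   # ISO 8601 date string
    abstract_zh: str
    keywords_zh: List[str]
    keywords_en: List[str]
    pdf_url: str
    baai_curator_note: str
    baai_url: str
    cited_by_count: int     # BAAI hotness score
    source: str


def missing_fields(record: dict) -> list:
    """Return the documented field names that are absent from a record."""
    return [name for name in BaaiPaper.__annotations__ if name not in record]
```

Calling `missing_fields(record)` before further processing gives a quick check that a record carries every documented field.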

Why BAAI?

BAAI (智源研究院) is a leading government-backed AI research institute in China, behind the WuDao foundation model series, the BGE embedding family, and the Aquila LLM. Its curated daily paper feed covers ~10–30 papers per day, each with a Chinese-language editorial summary that is not available on arXiv. Those editorial notes are the main value this dataset adds over arXiv metadata alone.

Use cases:

  • Track Chinese AI research output for competitive intelligence
  • Build a joinable dataset with an arXiv scraper (shared arxiv_id key)
  • Monitor BAAI's curated AI research highlights in Chinese for China watchers
  • Feed into downstream LLM pipelines with Chinese-language summaries
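
As an illustration of the join use case, here is a minimal sketch of an inner join on `arxiv_id` in plain Python. It assumes both datasets are lists of dicts and that the arXiv-side records use the same ID format (no version suffix such as `v2`). The function name and the `abstract_en` field in the usage example are hypothetical.

```python
def join_on_arxiv_id(baai_records, arxiv_records):
    """Inner-join two lists of dicts on the shared arxiv_id key."""
    arxiv_by_id = {r["arxiv_id"]: r for r in arxiv_records if r.get("arxiv_id")}
    joined = []
    for rec in baai_records:
        match = arxiv_by_id.get(rec.get("arxiv_id"))
        if match is not None:
            # BAAI fields win on name collisions; arXiv-only fields are added.
            joined.append({**match, **rec})
    return joined


baai = [{"arxiv_id": "2602.04879", "abstract_zh": "..."}]
arxiv = [{"arxiv_id": "2602.04879", "abstract_en": "..."}]
rows = join_on_arxiv_id(baai, arxiv)
```

Each joined row then carries both the Chinese-language BAAI fields and whatever the arXiv-side scraper provides.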

Input

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| maxItems | Yes | 5 | Maximum number of papers to return (the current feed has ~9 per run) |
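
A run could be started from Python roughly as follows. This is a sketch that assumes the `apify-client` package is installed; `actor_id` and `token` are placeholders you must supply, since the actor's store ID is not stated on this page. `build_run_input` and `run_actor` are hypothetical helper names.

```python
def build_run_input(max_items: int = 5) -> dict:
    """Build the actor input; maxItems is the only documented parameter."""
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    return {"maxItems": max_items}


def run_actor(token: str, actor_id: str, max_items: int = 5) -> list:
    """Run the actor and return its dataset items (requires apify-client)."""
    # Imported lazily so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(max_items))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Because the feed currently holds about 9 papers per run, values of `maxItems` above that are unlikely to return more records.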

How It Works

  1. Fetches hub.baai.ac.cn/papers, a Nuxt server-side-rendered (SSR) page that embeds the current hotness feed in its window.__NUXT__ state, so no JavaScript execution is required
  2. Extracts up to 9 paper UUIDs from the SSR data
  3. Fetches each paper's detail page (hub.baai.ac.cn/paper/<uuid>), which is also fully server-side rendered
  4. Merges listing data (basic fields) with detail data (curator notes, extended keywords)
  5. Emits one record per paper
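
Steps 2 and 4 can be sketched as below. This is not the actor's code: instead of parsing the `window.__NUXT__` state, the sketch swaps in a plain regex scan for UUID-shaped strings in the raw HTML, which may also pick up UUIDs that are not papers. The function names and the rule that non-empty detail fields override listing fields are assumptions.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def extract_paper_uuids(html: str, limit: int = 9) -> list:
    """Return up to `limit` distinct UUIDs in order of first appearance."""
    seen = []
    for uuid in UUID_RE.findall(html):
        if uuid not in seen:
            seen.append(uuid)
        if len(seen) == limit:
            break
    return seen


def merge_paper(listing: dict, detail: dict) -> dict:
    """Merge listing fields with detail fields; non-empty detail values win."""
    merged = dict(listing)
    merged.update({k: v for k, v in detail.items() if v not in (None, "", [])})
    return merged
```

Each UUID maps to a detail URL of the form hub.baai.ac.cn/paper/<uuid>, and `merge_paper` produces the single record emitted per paper.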

Note on scope: The BAAI listing page renders the current editorial feed (~9 papers) via server-side rendering. Further pagination is client-side only (infinite scroll), so the actor does not follow it. Each run captures the current curated snapshot; run it daily to build a historical archive.
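
One way to build that archive is to append each day's snapshot to a JSON Lines file and skip papers that are already stored. The sketch below assumes `baai_url` is a stable unique key per paper; the file layout and the `append_snapshot` name are hypothetical.

```python
import json
from pathlib import Path


def append_snapshot(archive_path, records) -> int:
    """Append records not yet in the JSONL archive; return how many were added."""
    path = Path(archive_path)
    known = set()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            known = {json.loads(line)["baai_url"] for line in fh if line.strip()}
    added = 0
    with path.open("a", encoding="utf-8") as fh:
        for rec in records:
            if rec["baai_url"] not in known:
                # ensure_ascii=False keeps the Chinese text readable in the file.
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
                known.add(rec["baai_url"])
                added += 1
    return added
```

Papers that stay in the feed across several days are written only once, so the archive grows by the number of newly curated papers per run.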

Sample Output

{
  "paper_title_en": "Rethinking the Trust Region in LLM Reinforcement Learning",
  "arxiv_id": "2602.04879",
  "authors": ["Penghui Qi", "Xiangxin Zhou", "Zichen Liu"],
  "publication_date": "2026-02-04",
  "abstract_zh": "强化学习(RL)已成为大语言模型(LLM)微调的基石...",
  "keywords_zh": ["机器学习", "强化学习", "大语言模型"],
  "keywords_en": ["cs.LG", "cs.CL", "cs.AI"],
  "pdf_url": "https://simg.baai.ac.cn/paperfile/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7.pdf",
  "baai_curator_note": "[简介] 强化学习(RL)已成为大语言模型...\n\n[问题] 如何设计更合理的信任域约束...\n\n[思路] 提出散度近端策略优化(DPPO)...",
  "baai_url": "https://hub.baai.ac.cn/paper/572bbeac-4516-4c34-8bc2-15ee9ef5bbb7",
  "cited_by_count": 120,
  "source": "hub.baai.ac.cn"
}
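
The `baai_curator_note` string packs several bracket-tagged sections into one field. A minimal sketch for splitting it into a dict follows, assuming the five tags listed under "What You Get" are the only ones used. The English key names and the `parse_curator_note` function are hypothetical choices for illustration.

```python
import re

# Tag -> English key, following the glosses in the field description.
SECTION_NAMES = {
    "简介": "abstract",
    "问题": "problem",
    "思路": "approach",
    "亮点": "highlights",
    "相关": "related_work",
}
SECTION_RE = re.compile(r"\[(" + "|".join(SECTION_NAMES) + r")\]\s*")


def parse_curator_note(note: str) -> dict:
    """Split a curator note into {english_key: section_text}."""
    # re.split with one capture group yields
    # [text before first tag, tag1, body1, tag2, body2, ...].
    parts = SECTION_RE.split(note)
    return {
        SECTION_NAMES[tag]: body.strip()
        for tag, body in zip(parts[1::2], parts[2::2])
    }
```

Sections missing from a note are simply absent from the result, so callers should use `dict.get` rather than indexing.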

Notes

  • China-hosted: The site is hosted in China. Request timeouts are set to 45 seconds to allow for cross-border latency. Runs from US/EU Apify datacenters may still experience occasional delays.
  • No authentication required: The papers feed is publicly accessible without login.
  • Daily curation: BAAI curates ~10–30 papers per day, but each run captures only the papers in the server-rendered feed (~9). Running this actor daily builds a rolling archive of those editorial picks.
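
If you fetch BAAI pages yourself, a per-request timeout like the 45 seconds mentioned above can be paired with a few retries. The sketch below uses only the standard library; the retry count, the linear backoff, and the `fetch_with_retry` name are assumptions and do not describe the actor's internal behavior.

```python
import time
import urllib.request


def fetch_with_retry(url, timeout=45, attempts=3, backoff=2.0,
                     opener=urllib.request.urlopen):
    """Fetch a URL, retrying on network errors with a growing pause."""
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(backoff * (attempt + 1))
    raise last_error
```

The `opener` parameter exists so the function can be exercised with a stub instead of a live request.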