All notable changes to Zhihu Scraper — Q&A, Answers, Articles & Columns will be documented in this file.
[2026-05-15]
🎉 Zhihu Scraper — Launch!
Keyword Search Across Zhihu — search answers, questions, articles, and people in one call; each row of the mixed-type results carries the ID and URL format that matches its type
Full Question Answer Threads — pull every answer for a Zhihu question with complete HTML body (not snippets), vote counts, comment counts, and author profile
Article Detail with Full HTML — single-call extraction of any Zhihu article or column post: full HTML content, topic tags, parent column reference, and author credentials
Column (Zhuanlan) Article Lists — paginate the complete catalog of any Zhihu column by slug, ~10 articles per page
Author + Badge Data on Every Row — Zhihu blue/gold badge verification, follower counts, lifetime upvote tallies, and headline bios are included in every dataset row, ready for KOL discovery workflows
18–19-Digit ID Precision — IDs are preserved as strings, so 64-bit Zhihu identifiers are never silently rounded by JavaScript's Number type, which is exact only up to 2^53 − 1
HTTPS URL Normalization — all Zhihu CDN URLs (*.zhimg.com) upgraded to HTTPS automatically
Resilient Pagination — built-in retry on transient upstream errors, with Mandarin-aware error translation: messages such as 问题不存在 ("question does not exist") and 专栏不存在 ("column does not exist") are returned in plain English
Gold-standard LLM training data — full HTML answer bodies plus author credentials and vote signals make this a clean source of curated Mandarin Q&A for SFT datasets and RAG corpora
Long-form depth competitors don't ship — most Zhihu scrapers return excerpts only; we return the entire answer and article body with formatting and images intact
One operation, one clean dataset — no chaining, no manual cursor management, no proxy setup
No account, no API key, no Zhihu login — paste a keyword or ID, click Run, get clean JSON
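Two of the normalizations listed above can be illustrated from the consumer side. The sketch below is not the scraper's implementation: the sample ID, the sample image URL, and the `survives_double` and `normalize_cdn_url` helpers are hypothetical. It shows why a 19-digit ID does not survive a round trip through an IEEE 754 double (the type behind JavaScript's `Number` and Python's `float`), and one way an `http://` link on a `*.zhimg.com` host could be upgraded to HTTPS.

```python
from urllib.parse import urlsplit, urlunsplit

# Largest integer a double can hold with no gaps between neighbours.
MAX_SAFE_INTEGER = 2**53 - 1


def survives_double(id_string: str) -> bool:
    """Return True if the ID is unchanged after a round trip through a double."""
    return str(int(float(id_string))) == id_string


def normalize_cdn_url(url: str) -> str:
    """Upgrade http:// to https:// for zhimg.com hosts; leave other URLs alone."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_zhimg = host == "zhimg.com" or host.endswith(".zhimg.com")
    if parts.scheme == "http" and is_zhimg:
        # Swap only the scheme; netloc, path, query and fragment are kept.
        return urlunsplit(("https",) + tuple(parts[1:]))
    return url


# Hypothetical 19-digit ID, chosen only because it exceeds MAX_SAFE_INTEGER.
sample_id = "1234567890123456789"
print(int(sample_id) > MAX_SAFE_INTEGER)
print(survives_double(sample_id))
print(normalize_cdn_url("http://pic1.zhimg.com/v2-example.jpg"))
```

When loading the dataset into a dataframe or spreadsheet, it is worth keeping ID columns typed as strings for the same reason; numeric type inference can reintroduce the rounding that string IDs avoid.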
🎯 Use Cases
Wei (ML Engineer, Beijing) pulls 100K+ Mandarin answer threads per month to build domain-balanced Chinese instruction-tuning datasets
Lin (Insights Lead, Shanghai agency) keyword-tracks branded questions weekly to surface unfiltered sentiment in Chinese consumers' own words
Anya (PM, B2B SaaS) monitors competitor mentions in Q&A threads to catch "X vs. Y" comparison content before it goes viral
Marcus (B2B Marketing Lead) shortlists Zhihu KOLs by topic + badge verification + follower count for sponsored long-form content campaigns
Chen (Data Scientist, hedge fund) runs daily keyword searches across industry watchlists to spot emerging questions for alpha-generation pipelines
Dr. Park (Stanford computational linguist) builds Mandarin discourse corpora and trains discourse-level classifiers on Chinese internet argumentation patterns
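As an illustration of the KOL shortlisting use case above, here is a minimal sketch that filters exported rows by badge and follower count. The field names (`author_id`, `author_name`, `author_badge`, `author_followers`) and the sample rows are assumptions made for this example, not the dataset's documented schema; adjust them to the columns in your actual export.

```python
def shortlist_kols(rows, min_followers=10_000, require_badge=True):
    """Collapse rows to one entry per author, filter, and rank by followers."""
    authors = {}
    for row in rows:
        followers = row.get("author_followers", 0)
        if require_badge and not row.get("author_badge"):
            continue
        if followers < min_followers:
            continue
        # Author fields repeat on every row, so keep a single entry per author.
        authors[row["author_id"]] = {
            "author_id": row["author_id"],
            "name": row["author_name"],
            "badge": row.get("author_badge"),
            "followers": followers,
        }
    return sorted(authors.values(), key=lambda a: a["followers"], reverse=True)


# Made-up rows for illustration only (hypothetical field names and values).
rows = [
    {"author_id": "a1", "author_name": "Author A", "author_badge": "gold", "author_followers": 250_000},
    {"author_id": "a2", "author_name": "Author B", "author_badge": None, "author_followers": 900_000},
    {"author_id": "a3", "author_name": "Author C", "author_badge": "blue", "author_followers": 4_000},
    {"author_id": "a4", "author_name": "Author D", "author_badge": "blue", "author_followers": 60_000},
    {"author_id": "a1", "author_name": "Author A", "author_badge": "gold", "author_followers": 250_000},
]

for author in shortlist_kols(rows):
    print(author["name"], author["badge"], author["followers"])
```

Author IDs stay as strings throughout, which keeps the deduplication key exact for long identifiers.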