Zhihu Scraper — Q&A, Answers, Articles, Columns
Pricing
from $2.00 / 1,000 search results
Zhihu scraper — extract long-form Mandarin Q&A, expert answers, articles & column posts. Keyword search, question answer threads, article detail, column article list. China market research, LLM training data, competitive intel. Four operations, one clean dataset per run. No API key.
Developer
SIÁN OÜ
Maintained by Community
Actor stats
- Bookmarked: 1
- Total users: 2
- Monthly active users: 1
- Last modified: 20 hours ago
Zhihu Scraper — Q&A, Answers, Articles & Columns 🚀
🎉 The richest Mandarin Q&A corpus on the web — full HTML answer bodies, expert credentials, vote signals
Built for AI/LLM training teams, China market researchers, and B2B KOL outreach
📋 Overview
Zhihu (知乎) is China's expert-driven Q&A platform — the closest thing to a Mandarin Stack Overflow + Quora + Medium rolled into one. This scraper pulls complete answer threads, full-HTML articles, keyword search across the platform, and column (zhuanlan) post lists — clean, structured, ready for analysis or model training.
Why AI teams, market researchers, and agencies choose us:
- 🧠 Best-in-class LLM training data — full HTML answer bodies (not snippets) with author credentials, vote/comment signals, and badge verification — gold-standard SFT/RAG corpus material for Chinese-language models
- 📚 Long-form depth, not shallow snippets — competitors return excerpts; we return the entire answer + article body including embedded images, headings, and inline references
- 🔀 Mixed-type search in one call: keyword searches return answers, questions, articles, AND people in a single dataset, each row dispatched to the correct ID schema (answerId, questionId, articleId, peopleId)
- 🎖️ KOL discovery built in: every row carries authorId, authorName, authorHeadline, authorFollowerCount, authorVoteupCount, authorBadges[], authorIsOrg, ready to dedupe and shortlist Zhihu blue/gold-badge experts
- 💰 Pay-per-result pricing: $0.004/search row, $0.040/article detail. Generous FREE tier. No subscription, no minimums, no surprise bills
- ✨ No account, no API key, no proxy setup — paste an ID or keyword, click run, get clean JSON
✨ Features
- 🔍 Keyword Search — search across all Zhihu content types in one call, ~20 mixed results per page
- 💬 Question Answer Threads — pull every answer to a Zhihu question with full HTML body, vote counts, and reply counts
- 📰 Article Detail Extraction — full article HTML body, author profile, topic tags, and parent column reference in a single row
- 📚 Column (Zhuanlan) Article Lists — paginate the complete catalog of any Zhihu column, ~10 articles per page
- 🏷️ Author + Badge Data on Every Row — Zhihu blue/gold badges, follower counts, vote tallies, headline bios baked in
- 🆔 18–19-digit ID Precision: IDs preserved as strings, avoiding the silent precision loss of JavaScript's Number type above 2^53
- 🖼️ Image URL Normalization — all Zhihu CDN URLs upgraded to HTTPS automatically
- 📊 Clean Structured JSON — flat camelCase aliases on every entity, ready for BigQuery, Pinecone, pandas, or Airtable
- 🌐 Mandarin-Aware Error Translation: upstream Chinese error strings such as 问题不存在 ("question does not exist") and 专栏不存在 ("column does not exist") are translated to plain English in the dataset
- ⚡ Resilient Pagination: built-in retry on transient upstream errors, no manual cursor management
🎬 Quick Start
Pick one of four operations, drop in a keyword or ID, and run. One operation per run, one clean dataset out.
curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"operation":"search","keyword":"人工智能","maxPages":3}'
🚀 Getting Started (3 Simple Steps)
Step 1: Pick your operation
Choose one: search (keyword), answerList (question thread), articleDetail (single article), or columnArticleList (column posts).
Step 2: Provide the input
A keyword for search, or a Zhihu ID (questionId, articleId, columnId) for the targeted operations.
Step 3: Click Run
The actor handles pagination, retries, and ID precision automatically. Results land in the Apify dataset as flat JSON.
That's it! In under a minute, you'll have:
- Clean, flat JSON rows with the right ID/URL schema per type
- Full HTML content bodies (not snippets) for answers and articles
- Author + badge metadata on every row for KOL workflows
📥 Input Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| operation | enum | Yes | One of: search, answerList, articleDetail, columnArticleList |
| keyword | string | If search | Search term (Chinese or English) |
| questionId | string | If answerList | Zhihu question ID (numeric string) |
| articleId | string | If articleDetail | Zhihu article ID (numeric string) |
| columnId | string | If columnArticleList | Zhihu column slug (e.g. xuehy) |
| maxPages | number | No | Pagination cap (default 1; ignored for articleDetail) |
Example — Keyword Search:
{"operation": "search","keyword": "人工智能","maxPages": 5}
Example — Question Answer Thread:
{"operation": "answerList","questionId": "660962845","maxPages": 10}
Example — Article Detail:
{"operation": "articleDetail","articleId": "2032860336215307118"}
Example — Column Article List:
{"operation": "columnArticleList","columnId": "xuehy","maxPages": 5}
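The required-field rules in the input table can be checked before a run is started. The sketch below is a hypothetical helper (`build_input` is not part of the actor or the Apify client). It maps each operation to its required field and leaves `maxPages` out for `articleDetail`, assuming the input schema is exactly as documented above.

```python
# Hypothetical helper: maps each documented operation to its required input field.
REQUIRED_FIELD = {
    "search": "keyword",
    "answerList": "questionId",
    "articleDetail": "articleId",
    "columnArticleList": "columnId",
}


def build_input(operation, value, max_pages=1):
    """Return an input dict for one run, or raise ValueError on bad input."""
    if operation not in REQUIRED_FIELD:
        raise ValueError(f"unknown operation: {operation!r}")
    # IDs must stay strings so long numeric IDs are never coerced to numbers.
    if not isinstance(value, str) or not value.strip():
        raise ValueError("keywords and IDs must be non-empty strings")
    run_input = {"operation": operation, REQUIRED_FIELD[operation]: value.strip()}
    # maxPages is documented as ignored for articleDetail, so it is omitted there.
    if operation != "articleDetail":
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        run_input["maxPages"] = int(max_pages)
    return run_input
```

The returned dict can be passed as `run_input` to the Apify client call shown in the integration examples.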
📤 Output
Results are saved to the Apify dataset with 40+ fields including full HTML bodies, author profiles, and engagement metrics.
| Field | Type | Description |
|---|---|---|
| operation | string | Which operation produced the row |
| entityType | string | answer / question / article / people / column-article |
| answerId / questionId / articleId | string | Type-appropriate Zhihu entity ID (up to 18–19 digits, preserved as string) |
| title | string | Question / article title |
| excerpt | string | Short summary text |
| content | string | Full HTML body for answers and articles |
| voteupCount | number | Upvote count |
| commentCount | number | Comment count |
| authorId | string | Author's Zhihu ID |
| authorName | string | Display name |
| authorHeadline | string | One-line bio |
| authorFollowerCount | number | Author follower count |
| authorVoteupCount | number | Lifetime upvotes received by author |
| authorBadges | array | Verified-expert badges (blue/gold) |
| authorIsOrg | boolean | Whether the author is a verified organization |
| itemPageUrl | string | Canonical Zhihu URL for the entity |
| createdTime / updatedTime | number | Unix timestamps |
| topics | array | Topic tags (article ops) |
| column | object | Parent column reference (article ops) |
Example row (search result, entityType: "answer"):
{
  "operation": "search",
  "entityType": "answer",
  "answerId": "3654812345678901234",
  "questionId": "660962845",
  "title": "未来 10 年人工智能会让哪些行业彻底消失?",
  "excerpt": "从我的实际经验来看,AI 替代的不是行业,而是行业里...",
  "content": "<p>从我的实际经验来看...</p><img src=\"https://pic1.zhimg.com/...\">",
  "voteupCount": 1842,
  "commentCount": 327,
  "authorId": "abc-123-def",
  "authorName": "张三",
  "authorHeadline": "AI Researcher | Tsinghua University",
  "authorFollowerCount": 124300,
  "authorVoteupCount": 982401,
  "authorBadges": ["identity_blue"],
  "itemPageUrl": "https://www.zhihu.com/question/660962845/answer/3654812345678901234"
}
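For text-only pipelines such as SFT or RAG corpora, the HTML in `content` usually needs flattening first. Here is a minimal sketch using only the Python standard library. `html_to_text` is a hypothetical helper, and the set of tags treated as block-level is an assumption about typical answer markup, not something the actor guarantees.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and starts a new line at block-level tags."""

    # Assumed block-level tags; extend this set if your rows use others.
    BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "h4", "blockquote"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = _TextExtractor()
    parser.feed(html or "")
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Images and other non-text tags are dropped. Keep the original `content` field alongside the flattened text if image URLs matter downstream.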
💼 Use Cases & Examples
1. AI / LLM Training Corpus Building
Wei, ML Engineer at a Beijing AI lab, pulls 100K+ Mandarin answer threads per month for SFT and RAG fine-tuning datasets.
Input: A list of question IDs covering broad topics (technology, finance, medicine, philosophy).
Output: Full HTML answer bodies with author credentials and vote signals for quality filtering.
Use: Bootstrap a domain-balanced Chinese-language instruction-tuning dataset. Filter by voteupCount > 100 and authorBadges to keep high-signal answers only.
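The quality filter described above could look roughly like this. It is a sketch under the assumption that rows carry `voteupCount` and `authorBadges` as documented in the output table. The function names and the choice to require at least one badge are illustrative, not part of the actor.

```python
def is_high_signal(row, min_votes=100, require_badge=True):
    """True if the row has more than min_votes upvotes and, optionally, a badged author."""
    votes = row.get("voteupCount") or 0
    badges = row.get("authorBadges") or []
    if votes <= min_votes:
        return False
    return bool(badges) or not require_badge


def filter_high_signal(rows, min_votes=100, require_badge=True):
    """Keep only the rows that pass is_high_signal."""
    return [row for row in rows if is_high_signal(row, min_votes, require_badge)]
```

Missing or null fields are treated as zero votes and no badges, so incomplete rows are dropped rather than raising.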
2. China Market & Consumer Research
Lin, Insights Lead at a Shanghai research agency, tracks branded questions by keyword each week to surface unfiltered consumer sentiment.
Input: Brand or product keyword ("特斯拉", "iPhone 17", "小米汽车").
Output: Top-voted questions and answers mentioning the brand, with vote/comment counts.
Use: Build a weekly brand-perception report grounded in real Chinese consumer language — not survey-mediated.
3. Competitive Intelligence & Brand Monitoring
Anya, PM at a B2B SaaS company, monitors competitor mentions in Q&A threads to catch comparison content early.
Input: Competitor names + product category keywords.
Output: Questions, answers, and articles mentioning competitors, sorted by recency and engagement.
Use: Surface "X vs. Y" threads before they go viral; respond proactively where buyers are asking real questions.
4. B2B Influencer / KOL Outreach
Marcus, Marketing Lead at a B2B firm targeting China, shortlists Zhihu KOLs for sponsored long-form content.
Input: Topic keyword ("AI 创业", "SaaS 出海").
Output: Top-voted answers with author follower counts, badge verification, and headline bios.
Use: Dedupe authors across thousands of answers, sort by authorFollowerCount and badge level, hand off to outreach.
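The dedupe-and-rank step can be sketched as follows. `shortlist_authors` is a hypothetical helper. It keys authors on `authorId`, keeps the highest follower count seen for each, and ranks badged authors first. Treating "badge level" as simply "has any badge" is an assumption, since the badge values themselves are not specified here.

```python
def shortlist_authors(rows, top_n=10):
    """Collapse rows to one entry per author, ranked by badge presence then followers."""
    best = {}
    for row in rows:
        author_id = row.get("authorId")
        if not author_id:
            continue  # skip rows with no author attached
        followers = row.get("authorFollowerCount") or 0
        current = best.get(author_id)
        if current is None or followers > current["authorFollowerCount"]:
            best[author_id] = {
                "authorId": author_id,
                "authorName": row.get("authorName"),
                "authorHeadline": row.get("authorHeadline"),
                "authorFollowerCount": followers,
                "hasBadge": bool(row.get("authorBadges")),
            }
    ranked = sorted(
        best.values(),
        key=lambda a: (a["hasBadge"], a["authorFollowerCount"]),
        reverse=True,
    )
    return ranked[:top_n]
```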
5. Trend & Topic Early-Signal Detection
Chen, Data Scientist at a hedge fund, runs daily keyword searches to spot emerging questions before mainstream pickup.
Input: Industry watchlist (semiconductors, energy, biotech) refreshed daily.
Output: New questions and rising answers, time-stamped with engagement velocity signals.
Use: Feed into an alpha-generation pipeline that flags breakout topics for analyst review.
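One rough way to derive a velocity signal from the documented fields is upvotes per day since creation. The sketch below assumes `createdTime` is a Unix timestamp in seconds (the output table does not say seconds or milliseconds). `votes_per_day` is a hypothetical helper, not a field the actor emits.

```python
import time

SECONDS_PER_DAY = 86400


def votes_per_day(row, now=None):
    """Average upvotes per day since creation, or None if createdTime is missing."""
    created = row.get("createdTime")
    if not created:
        return None
    now = time.time() if now is None else now
    # Floor the age at one hour so brand-new rows do not produce huge ratios.
    age_days = max((now - created) / SECONDS_PER_DAY, 1 / 24)
    return (row.get("voteupCount") or 0) / age_days
```

Comparing this value across consecutive daily runs gives a simple breakout indicator. A windowed delta of `voteupCount` between runs would be more precise if earlier snapshots are stored.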
6. Academic & Sociolinguistic Research
Dr. Park, a computational linguist at Stanford, builds Mandarin discourse corpora for academic NLP research.
Input: Topic clusters via keyword search and column article lists.
Output: Full HTML article bodies and answer threads with author demographics where available.
Use: Train discourse-level classifiers, study Chinese internet argumentation patterns, publish reproducible datasets.
🔗 Integration Examples
JavaScript/Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('sian.agency/zhihu-scraper').call({
    operation: 'search',
    keyword: '人工智能',
    maxPages: 5,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);
Python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')
run = client.actor('sian.agency/zhihu-scraper').call(run_input={
    'operation': 'answerList',
    'questionId': '660962845',
    'maxPages': 10,
})
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['authorName'], item['voteupCount'])
cURL
curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"operation":"articleDetail","articleId":"2032860336215307118"}'
Automation Workflows (N8N / Zapier / Make)
- Trigger: Daily schedule or webhook from your watchlist tool
- HTTP Request: Call the actor with a keyword or column ID
- Process: Filter rows by voteupCount / authorBadges / authorFollowerCount
- Action: Push to BigQuery, Pinecone, Airtable, or trigger a Slack alert
📊 Performance & Pricing
FREE Tier (Try It Now)
- Full feature access on all four operations — same data quality as PAID
- Generous evaluation allowance under the Apify FREE plan
- No credit card required
PAID Tier (Production Ready)
- Pay-per-result: only charged for successful rows
- Volume discounts auto-applied at SILVER, GOLD, PLATINUM, DIAMOND tiers
- No subscription, no minimums, no commitments
Live BRONZE per-result pricing:
| Event | Price | Triggered by |
|---|---|---|
| Actor Start | $0.014 | Once per run |
| Search Result | $0.004 (PRIMARY) | Per row from keyword search |
| Question Answer | $0.005 | Per answer in a question thread |
| Article Detail | $0.040 | Per article (full HTML body) |
| Column Article | $0.004 | Per article in a column listing |
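A run's BRONZE-tier cost can be estimated from the table as the start fee plus rows times the per-row price. The sketch below assumes each pricing event maps one-to-one onto the operation named in its "Triggered by" column, and it ignores volume discounts and any free-tier allowance. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# BRONZE per-result prices in USD, copied from the pricing table.
ACTOR_START = Decimal("0.014")
PRICE_PER_ROW = {
    "search": Decimal("0.004"),
    "answerList": Decimal("0.005"),
    "articleDetail": Decimal("0.040"),
    "columnArticleList": Decimal("0.004"),
}


def estimate_cost(operation, rows, runs=1):
    """Estimated USD cost for `rows` successful rows spread over `runs` runs."""
    return ACTOR_START * runs + PRICE_PER_ROW[operation] * rows
```

For example, 1,000 search rows in a single run come to $0.014 + 1,000 × $0.004 = $4.014 at BRONZE pricing.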
💰 Best price on the market for full-HTML Zhihu extraction — competitors charge 3–5× more for snippet-only output.
❓ Frequently Asked Questions
Q: How many results can I pull per run?
A: There's no hard cap — set maxPages to whatever you need. The actor handles pagination and retries automatically.
Q: Do I need a Zhihu account or API key?
A: No. Just an Apify account. We handle everything upstream.
Q: Does it support private answers or paid-content articles?
A: No, only publicly accessible content. Paywalled "盐选" articles return excerpt-only content per Zhihu's public surface.
Q: What output formats are available?
A: JSON, CSV, Excel, XML, JSONL. Export directly from the Apify dataset UI or API.
Q: How accurate are the 18–19-digit IDs?
A: IDs are preserved as strings end-to-end. JavaScript's default JSON.parse silently rounds integers above 2^53 − 1 (Number.MAX_SAFE_INTEGER) to the nearest representable double; we intercept the parse and keep full precision.
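The precision problem is easy to reproduce outside JavaScript by forcing an ID through a 64-bit float, which is what a JavaScript Number is. The sketch below is illustrative only and is not the actor's implementation. `survives_double` and `loads_preserving_ids` are hypothetical helpers, and treating every key ending in `Id` as an ID is an assumption.

```python
import json


def survives_double(n):
    """True if integer n round-trips through a 64-bit float unchanged."""
    return int(float(n)) == n


def _ids_to_str(obj):
    # Stringify integer values under keys ending in "Id"; leave counts as numbers.
    return {
        key: str(value)
        if key.endswith("Id") and isinstance(value, int) and not isinstance(value, bool)
        else value
        for key, value in obj.items()
    }


def loads_preserving_ids(payload):
    """Parse JSON and return ID fields as strings so downstream tools cannot round them."""
    return json.loads(payload, object_hook=_ids_to_str)
```

Python's own integers are arbitrary precision, so the risk on the Python side is mainly in later hops, such as a pandas float column or a JavaScript consumer. Keeping IDs as strings avoids both.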
Q: Can I get full HTML article bodies, not just summaries?
A: Yes — articleDetail and answerList return the full HTML content field with embedded images and formatting intact.
Q: Does the search return answers, questions, and articles together?
A: Yes — one search call returns mixed types in a single dataset. Each row carries an entityType field so you can split downstream.
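Splitting a mixed search dataset downstream takes a few lines. This is a sketch assuming every row carries `entityType` as documented; rows without it are grouped under a made-up `"unknown"` key.

```python
from collections import defaultdict


def split_by_entity_type(rows):
    """Group dataset rows into lists keyed by their entityType value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get("entityType", "unknown")].append(row)
    return dict(groups)
```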
Q: Is this legal?
A: Yes, only publicly available data is collected. See the legal section below.
🐛 Troubleshooting
code:301 — FAILED, RETRY errors on a specific question ID
- A small number of historical Zhihu IDs are permanently flagged by upstream anti-bot protection. Try a different question; most modern IDs work fine. The actor already retries with backoff before surfacing the error.
Empty results on a column ID
- Double-check the column slug (the part after zhuanlan.zhihu.com/). Example: for https://zhuanlan.zhihu.com/xuehy, use columnId: "xuehy".
Search returns fewer results than expected
- Increase maxPages. Zhihu paginates ~20 mixed results per page; deep pagination beyond 10 pages may return diminishing fresh content.
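To pick a `maxPages` value for a target row count, divide by the approximate page size and round up. This sketch takes the ~20 rows per page figure at face value. Real pages may return fewer rows, so treat the result as a lower bound. `pages_needed` is a hypothetical helper.

```python
import math


def pages_needed(target_rows, rows_per_page=20):
    """Smallest page count that could yield target_rows at the given page size."""
    if target_rows <= 0:
        return 0
    return math.ceil(target_rows / rows_per_page)
```

For column listings, which paginate at roughly 10 articles per page, pass `rows_per_page=10`.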
Article body looks truncated
- "盐选" (paywalled) Zhihu Plus articles return only excerpts on the public surface. The actor surfaces what Zhihu exposes — there is no premium-content backdoor.
Author follower / voteup counts show 0
- Some authors disable public stats. The fields are present but Zhihu returns 0 for these users.
⚠️ Trademark Disclaimer
This is an independent scraping tool. It is not affiliated with, endorsed by, or sponsored by Zhihu Inc. (知乎). The Zhihu® and 知乎® names appear under nominative fair use solely to describe the platform this tool reads from. All trademarks are the property of their respective owners.
⚖️ Is it legal to scrape data?
Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.
However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.
You can also read Apify's blog post on the legality of web scraping.
⭐ Love this actor?
Leave a 5-star review — it helps us build more features for you and keeps the SIÁN portfolio growing.
🤝 Support
Join our active support community
- For issues or questions, open an issue in the actor's repository
- Check the SIÁN Agency Store for more China-market automation tools
- 📧 apify@sian-agency.online
Built by SIÁN Agency | More Tools