Zhihu Scraper — Q&A, Answers, Articles, Columns

Pricing

from $2.00 / 1,000 search results

Zhihu scraper — extract long-form Mandarin Q&A, expert answers, articles & column posts. Keyword search, question answer threads, article detail, column article list. China market research, LLM training data, competitive intel. Four operations, one clean dataset per run. No API key.

Developer

SIÁN OÜ

Maintained by Community

Zhihu Scraper — Q&A, Answers, Articles & Columns 🚀

🎉 The richest Mandarin Q&A corpus on the web — full HTML answer bodies, expert credentials, vote signals

Built for AI/LLM training teams, China market researchers, and B2B KOL outreach


📋 Overview

Zhihu (知乎) is China's expert-driven Q&A platform — the closest thing to a Mandarin Stack Overflow + Quora + Medium rolled into one. This scraper pulls complete answer threads, full-HTML articles, keyword search across the platform, and column (zhuanlan) post lists — clean, structured, ready for analysis or model training.

Why AI teams, market researchers, and agencies choose us:

  • 🧠 Best-in-class LLM training data — full HTML answer bodies (not snippets) with author credentials, vote/comment signals, and badge verification — gold-standard SFT/RAG corpus material for Chinese-language models
  • 📚 Long-form depth, not shallow snippets — competitors return excerpts; we return the entire answer + article body including embedded images, headings, and inline references
  • 🔀 Mixed-type search in one call — keyword searches return answers, questions, articles, AND people in a single dataset, each row dispatched to the correct ID schema (answerId, questionId, articleId, peopleId)
  • 🎖️ KOL discovery built in — every row carries authorId, authorName, authorHeadline, authorFollowerCount, authorVoteupCount, authorBadges[], authorIsOrg — ready to dedupe and shortlist Zhihu blue/gold-badge experts
  • 💰 Pay-per-result pricing — $0.004/search row, $0.040/article detail. Generous FREE tier. No subscription, no minimums, no surprise bills
  • No account, no API key, no proxy setup — paste an ID or keyword, click run, get clean JSON

✨ Features

  • 🔍 Keyword Search — search across all Zhihu content types in one call, ~20 mixed results per page
  • 💬 Question Answer Threads — pull every answer to a Zhihu question with full HTML body, vote counts, and reply counts
  • 📰 Article Detail Extraction — full article HTML body, author profile, topic tags, and parent column reference in a single row
  • 📚 Column (Zhuanlan) Article Lists — paginate the complete catalog of any Zhihu column, ~10 articles per page
  • 🏷️ Author + Badge Data on Every Row — Zhihu blue/gold badges, follower counts, vote tallies, headline bios baked in
  • 🆔 18–19-digit ID Precision — IDs preserved as strings (no silent rounding of values above 2^53 by JavaScript's Number type)
  • 🖼️ Image URL Normalization — all Zhihu CDN URLs upgraded to HTTPS automatically
  • 📊 Clean Structured JSON — flat camelCase aliases on every entity, ready for BigQuery, Pinecone, pandas, or Airtable
  • 🌐 Mandarin-Aware Error Translation — upstream Chinese error strings (问题不存在, 专栏不存在) translated to plain English in the dataset
  • Resilient Pagination — built-in retry on transient upstream errors, no manual cursor management

🎬 Quick Start

Pick one of four operations, drop in a keyword or ID, and run. One operation per run, one clean dataset out.

curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"operation":"search","keyword":"人工智能","maxPages":3}'

🚀 Getting Started (3 Simple Steps)

Step 1: Pick your operation

Choose one: search (keyword), answerList (question thread), articleDetail (single article), or columnArticleList (column posts).

Step 2: Provide the input

A keyword for search, or a Zhihu ID (questionId, articleId, columnId) for the targeted operations.

Step 3: Click Run

The actor handles pagination, retries, and ID precision automatically. Results land in the Apify dataset as flat JSON.

That's it! In under a minute, you'll have:

  • Clean, flat JSON rows with the right ID/URL schema per type
  • Full HTML content bodies (not snippets) for answers and articles
  • Author + badge metadata on every row for KOL workflows

📥 Input Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| operation | enum | Yes | One of: search, answerList, articleDetail, columnArticleList |
| keyword | string | If search | Search term (Chinese or English) |
| questionId | string | If answerList | Zhihu question ID (numeric string) |
| articleId | string | If articleDetail | Zhihu article ID (numeric string) |
| columnId | string | If columnArticleList | Zhihu column slug (e.g. xuehy) |
| maxPages | number | No | Pagination cap (default 1; ignored for articleDetail) |

Example — Keyword Search:

{
    "operation": "search",
    "keyword": "人工智能",
    "maxPages": 5
}

Example — Question Answer Thread:

{
    "operation": "answerList",
    "questionId": "660962845",
    "maxPages": 10
}

Example — Article Detail:

{
    "operation": "articleDetail",
    "articleId": "2032860336215307118"
}

Example — Column Article List:

{
    "operation": "columnArticleList",
    "columnId": "xuehy",
    "maxPages": 5
}
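The required-field rules in the table above can be checked on the client side before a run is started. The following is a minimal sketch, assuming a hypothetical helper named build_input (it is not part of the actor or of apify-client); the field names and the maxPages default are taken from the table.

```python
# Hypothetical helper: checks input against the Input Configuration table.
REQUIRED_FIELD = {
    "search": "keyword",
    "answerList": "questionId",
    "articleDetail": "articleId",
    "columnArticleList": "columnId",
}


def build_input(operation, value, max_pages=1):
    """Return an actor input dict, or raise ValueError on invalid input."""
    if operation not in REQUIRED_FIELD:
        raise ValueError(f"unknown operation: {operation!r}")
    field = REQUIRED_FIELD[operation]
    # IDs must stay strings; an int here usually means precision was already lost.
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"{field} must be a non-empty string")
    if not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")
    run_input = {"operation": operation, field: value.strip()}
    # The table says maxPages is ignored for articleDetail, so it is left out.
    if operation != "articleDetail":
        run_input["maxPages"] = max_pages
    return run_input


print(build_input("columnArticleList", "xuehy", 5))
```

The returned dict can be passed as the run input in any of the integration examples further down.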

📤 Output

Results are saved to the Apify dataset with 40+ fields including full HTML bodies, author profiles, and engagement metrics.

| Field | Type | Description |
| --- | --- | --- |
| operation | string | Which operation produced the row |
| entityType | string | answer / question / article / people / column-article |
| answerId / questionId / articleId | string | Type-appropriate Zhihu entity ID (up to 19 digits, preserved as string) |
| title | string | Question / article title |
| excerpt | string | Short summary text |
| content | string | Full HTML body for answers and articles |
| voteupCount | number | Upvote count |
| commentCount | number | Comment count |
| authorId | string | Author's Zhihu ID |
| authorName | string | Display name |
| authorHeadline | string | One-line bio |
| authorFollowerCount | number | Author follower count |
| authorVoteupCount | number | Lifetime upvotes received by author |
| authorBadges | array | Verified-expert badges (blue/gold) |
| authorIsOrg | boolean | Whether the author is a verified organization |
| itemPageUrl | string | Canonical Zhihu URL for the entity |
| createdTime / updatedTime | number | Unix timestamps |
| topics | array | Topic tags (article ops) |
| column | object | Parent column reference (article ops) |

Example row (search result, entityType: "answer"):

{
    "operation": "search",
    "entityType": "answer",
    "answerId": "3654812345678901234",
    "questionId": "660962845",
    "title": "未来 10 年人工智能会让哪些行业彻底消失?",
    "excerpt": "从我的实际经验来看,AI 替代的不是行业,而是行业里...",
    "content": "<p>从我的实际经验来看...</p><img src=\"https://pic1.zhimg.com/...\">",
    "voteupCount": 1842,
    "commentCount": 327,
    "authorId": "abc-123-def",
    "authorName": "张三",
    "authorHeadline": "AI Researcher | Tsinghua University",
    "authorFollowerCount": 124300,
    "authorVoteupCount": 982401,
    "authorBadges": ["identity_blue"],
    "itemPageUrl": "https://www.zhihu.com/question/660962845/answer/3654812345678901234"
}
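Because a search run mixes several entity types in one dataset, a common first step is to split rows on entityType. A minimal sketch, assuming rows have already been downloaded as a list of dicts; the helper name and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def split_by_entity_type(rows):
    """Group dataset rows by their entityType field."""
    groups = defaultdict(list)
    for row in rows:
        # Rows without entityType are kept under "unknown" rather than dropped.
        groups[row.get("entityType", "unknown")].append(row)
    return dict(groups)


# Illustrative rows only; real rows carry the full field set shown above.
rows = [
    {"entityType": "answer", "answerId": "3654812345678901234"},
    {"entityType": "question", "questionId": "660962845"},
    {"entityType": "answer", "answerId": "3654812345678901235"},
]
groups = split_by_entity_type(rows)
print({kind: len(items) for kind, items in groups.items()})
# {'answer': 2, 'question': 1}
```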

💼 Use Cases & Examples

1. AI / LLM Training Corpus Building

Wei, ML Engineer at a Beijing AI lab, pulls 100K+ Mandarin answer threads per month for SFT datasets and RAG corpora.

  • Input: A list of question IDs covering broad topics (technology, finance, medicine, philosophy).
  • Output: Full HTML answer bodies with author credentials and vote signals for quality filtering.
  • Use: Bootstrap a domain-balanced Chinese-language instruction-tuning dataset. Filter by voteupCount > 100 and authorBadges to keep high-signal answers only.
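The quality filter described in this use case can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the threshold of 100 comes from the text above, and "has at least one badge" is one possible reading of filtering by authorBadges.

```python
def is_high_signal(row, min_votes=100):
    """True if the row has more than min_votes upvotes and a badged author."""
    votes = row.get("voteupCount") or 0
    badges = row.get("authorBadges") or []
    return votes > min_votes and len(badges) > 0


def filter_high_signal(rows, min_votes=100):
    """Keep only rows that pass is_high_signal."""
    return [row for row in rows if is_high_signal(row, min_votes)]
```

Raising min_votes trades corpus size for answer quality; the right cutoff depends on the topic, since niche questions attract fewer votes overall.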

2. China Market & Consumer Research

Lin, Insights Lead at a Shanghai research agency, keyword-tracks branded questions weekly to surface unfiltered consumer sentiment.

  • Input: Brand or product keyword ("特斯拉", "iPhone 17", "小米汽车").
  • Output: Top-voted questions and answers mentioning the brand, with vote/comment counts.
  • Use: Build a weekly brand-perception report grounded in real Chinese consumer language rather than survey-mediated responses.

3. Competitive Intelligence & Brand Monitoring

Anya, PM at a B2B SaaS company, monitors competitor mentions in Q&A threads to catch comparison content early.

  • Input: Competitor names + product category keywords.
  • Output: Questions, answers, and articles mentioning competitors, sorted by recency and engagement.
  • Use: Surface "X vs. Y" threads before they go viral; respond proactively where buyers are asking real questions.

4. B2B Influencer / KOL Outreach

Marcus, Marketing Lead at a B2B firm targeting China, shortlists Zhihu KOLs for sponsored long-form content.

  • Input: Topic keyword ("AI 创业", "SaaS 出海").
  • Output: Top-voted answers with author follower counts, badge verification, and headline bios.
  • Use: Dedupe authors across thousands of answers, sort by authorFollowerCount and badge level, hand off to outreach.
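The dedupe-and-rank step in this use case might look like the sketch below. The helper name shortlist_authors is hypothetical, and because the listing does not define an ordering between badge levels, the sketch simply ranks badged authors ahead of unbadged ones and then sorts by follower count.

```python
def shortlist_authors(rows, top_n=10):
    """Collapse rows to one entry per authorId, ranked by badge then followers."""
    authors = {}
    for row in rows:
        author_id = row.get("authorId")
        if not author_id:
            continue  # assumption: some rows may lack author data
        entry = authors.setdefault(author_id, {
            "authorId": author_id,
            "authorName": row.get("authorName"),
            "authorHeadline": row.get("authorHeadline"),
            "authorFollowerCount": 0,
            "hasBadge": False,
            "answerCount": 0,
        })
        # Keep the largest follower count seen across this author's rows.
        entry["authorFollowerCount"] = max(
            entry["authorFollowerCount"], row.get("authorFollowerCount") or 0
        )
        entry["hasBadge"] = entry["hasBadge"] or bool(row.get("authorBadges"))
        entry["answerCount"] += 1
    ranked = sorted(
        authors.values(),
        key=lambda a: (a["hasBadge"], a["authorFollowerCount"]),
        reverse=True,
    )
    return ranked[:top_n]
```

The answerCount field is a by-product of deduplication and can serve as a rough measure of how active an author is on the searched topic.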

5. Trend & Topic Early-Signal Detection

Chen, Data Scientist at a hedge fund, runs daily keyword searches to spot emerging questions before mainstream pickup.

  • Input: Industry watchlist (semiconductors, energy, biotech) refreshed daily.
  • Output: New questions and rising answers, with timestamps and engagement counts from which velocity can be derived.
  • Use: Feed into an alpha-generation pipeline that flags breakout topics for analyst review.
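One simple way to turn timestamps and vote counts into a velocity signal is upvotes per day since creation. This is a sketch, not the actor's own metric: the function name is hypothetical, and it assumes createdTime is a Unix timestamp in seconds, which the listing does not state explicitly.

```python
import time


def votes_per_day(row, now=None):
    """Upvotes per day since creation; None if createdTime is missing."""
    created = row.get("createdTime")
    if not created:
        return None
    now = time.time() if now is None else now
    # Clamp age to one hour so brand-new rows do not divide by nearly zero.
    age_days = max(now - created, 3600) / 86400
    return (row.get("voteupCount") or 0) / age_days
```

Comparing this value across daily runs, rather than looking at raw vote totals, helps separate fast-rising new answers from old answers that accumulated votes slowly.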

6. Academic & Sociolinguistic Research

Dr. Park, Stanford computational linguist, builds Mandarin discourse corpora for academic NLP research.

  • Input: Topic clusters via keyword search and column article lists.
  • Output: Full HTML article bodies and answer threads with public author profile fields (headline, badges) where available.
  • Use: Train discourse-level classifiers, study Chinese internet argumentation patterns, publish reproducible datasets.


🔗 Integration Examples

JavaScript/Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('sian.agency/zhihu-scraper').call({
    operation: 'search',
    keyword: '人工智能',
    maxPages: 5,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);

Python

from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('sian.agency/zhihu-scraper').call(
    run_input={
        'operation': 'answerList',
        'questionId': '660962845',
        'maxPages': 10,
    }
)

# Iterate over the rows in the run's default dataset.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['authorName'], item['voteupCount'])

cURL

curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"operation":"articleDetail","articleId":"2032860336215307118"}'

Automation Workflows (N8N / Zapier / Make)

  1. Trigger: Daily schedule or webhook from your watchlist tool
  2. HTTP Request: Call the actor with a keyword or column ID
  3. Process: Filter rows by voteupCount / authorBadges / authorFollowerCount
  4. Action: Push to BigQuery, Pinecone, Airtable, or trigger a Slack alert

📊 Performance & Pricing

FREE Tier (Try It Now)

  • Full feature access on all four operations — same data quality as PAID
  • Generous evaluation allowance under the Apify FREE plan
  • No credit card required
  • Pay-per-result: only charged for successful rows
  • Volume discounts auto-applied at SILVER, GOLD, PLATINUM, DIAMOND tiers
  • No subscription, no minimums, no commitments

Live BRONZE per-result pricing:

| Event | Price | Triggered by |
| --- | --- | --- |
| Actor Start | $0.014 | Once per run |
| Search Result | $0.004 (PRIMARY) | Per row from keyword search |
| Question Answer | $0.005 | Per answer in a question thread |
| Article Detail | $0.040 | Per article (full HTML body) |
| Column Article | $0.004 | Per article in a column listing |
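A run's cost at these rates is the start fee plus the per-row price times the number of rows. The sketch below hard-codes the BRONZE prices from the table; the mapping from pricing events to operation names is an assumption based on the "Triggered by" column, and the helper name is hypothetical. Prices may change, so check the live pricing page before relying on the numbers.

```python
# BRONZE prices in USD, copied from the table above.
START_FEE = 0.014
PRICE_PER_ROW = {
    "search": 0.004,
    "answerList": 0.005,
    "articleDetail": 0.040,
    "columnArticleList": 0.004,
}


def estimate_cost(operation, rows, runs=1):
    """Estimated USD cost: one start fee per run plus a per-row charge."""
    return round(runs * START_FEE + rows * PRICE_PER_ROW[operation], 3)


# Five search pages at roughly 20 rows each is about 100 rows.
print(estimate_cost("search", 100))  # 0.414
```

Since articleDetail returns one article per run, fetching many articles means paying the start fee each time: estimate_cost("articleDetail", 50, runs=50) covers that case.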

💰 Best price on the market for full-HTML Zhihu extraction — competitors charge 3–5× more for snippet-only output.

🔗 View current pricing


❓ Frequently Asked Questions

Q: How many results can I pull per run? A: There's no hard cap — set maxPages to whatever you need. The actor handles pagination and retries automatically.

Q: Do I need a Zhihu account or API key? A: No. Just an Apify account. We handle everything upstream.

Q: Does it support private answers or paid-content articles? A: No — only publicly accessible content. Paywalled "盐选" articles return excerpt-only content per Zhihu's public surface.

Q: What output formats are available? A: JSON, CSV, Excel, XML, JSONL — export directly from the Apify dataset UI or API.

Q: How accurate are the 18–19-digit IDs? A: IDs are preserved as strings end-to-end. JavaScript's default JSON.parse can silently round integers above 2^53 − 1 (Number.MAX_SAFE_INTEGER) to the nearest representable double; we intercept the parse and keep full precision.
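The rounding effect is easy to reproduce. The sketch below is a Python illustration of the same problem, not the actor's implementation: it pushes the example article ID through a 64-bit float (the representation behind JavaScript's Number) and then shows one way to keep every digit, using the standard json module's parse_int hook.

```python
import json

raw = '{"articleId": 2032860336215307118}'

# Python ints are arbitrary precision, so this parse is exact.
exact = json.loads(raw)["articleId"]

# Round-tripping through a 64-bit float mimics a double-based JSON parser.
as_float = int(float(exact))
print(exact == as_float)  # False

# parse_int=str turns every integer literal into a string, keeping all digits.
safe = json.loads(raw, parse_int=str)["articleId"]
print(safe)  # 2032860336215307118
```

The practical consequence for consumers of the dataset: treat every ID field as a string and never cast it to a float or a JavaScript Number.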

Q: Can I get full HTML article bodies, not just summaries? A: Yes — articleDetail and answerList return the full HTML content field with embedded images and formatting intact.
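For pipelines that need plain text rather than markup, the content field can be reduced with Python's standard html.parser. A minimal sketch, with a hypothetical class and function name; it collects visible text and img src URLs and ignores everything else, so it is a starting point rather than a full HTML-to-text converter.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text and <img src> URLs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (plain_text, image_urls) for an HTML content string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    # Fragments are joined with spaces; for Chinese text, "" may read better.
    return " ".join(parser.text_parts), parser.image_urls
```

For example, extract_content('<p>Hello <b>world</b></p><img src="https://example.com/a.jpg">') returns the text "Hello world" and a one-element list with the image URL.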

Q: Does the search return answers, questions, and articles together? A: Yes — one search call returns mixed types in a single dataset. Each row carries an entityType field so you can split downstream.

Q: Is this legal? A: Yes — only publicly available data. See the legal section below.


🐛 Troubleshooting

code:301 — FAILED, RETRY errors on a specific question ID

  • A small number of historical Zhihu IDs are permanently flagged by upstream anti-bot protection. Try a different question; most modern IDs work fine. The actor already retries with backoff before surfacing the error.

Empty results on a column ID

  • Double-check the column slug (the part after zhuanlan.zhihu.com/). Example: for https://zhuanlan.zhihu.com/xuehy, use columnId: "xuehy".
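Extracting the slug programmatically avoids copy-paste mistakes. A minimal sketch, assuming the zhuanlan.zhihu.com/<slug> URL shape from the example above; the helper name is hypothetical, and the rejection of /p/ paths assumes those are single-article URLs rather than columns.

```python
from urllib.parse import urlparse


def column_slug(url_or_slug):
    """Return the column slug from a zhuanlan.zhihu.com URL or a bare slug."""
    if "://" not in url_or_slug:
        return url_or_slug.strip("/ ")
    parsed = urlparse(url_or_slug)
    if parsed.netloc != "zhuanlan.zhihu.com":
        raise ValueError(f"not a zhuanlan.zhihu.com URL: {url_or_slug}")
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        raise ValueError("URL has no column slug")
    if segments[0] == "p":
        # Assumption: /p/<id> addresses a single article, not a column.
        raise ValueError("this looks like an article URL, not a column URL")
    return segments[0]


print(column_slug("https://zhuanlan.zhihu.com/xuehy"))  # xuehy
```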

Search returns fewer results than expected

  • Increase maxPages. Zhihu paginates ~20 mixed results per page; deep pagination beyond 10 pages may return diminishing fresh content.
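To pick a maxPages value for a target row count, divide by the approximate page size and round up. The sketch below uses the figures quoted above (about 20 results per page, diminishing returns past 10 pages); both are approximations, and the helper name is hypothetical.

```python
import math

RESULTS_PER_PAGE = 20  # approximate search page size quoted above
SOFT_PAGE_LIMIT = 10   # depth past which fresh results reportedly thin out


def pages_for(target_rows):
    """Return (maxPages, exceeds_soft_limit) for a desired number of rows."""
    pages = max(1, math.ceil(target_rows / RESULTS_PER_PAGE))
    return pages, pages > SOFT_PAGE_LIMIT


print(pages_for(150))  # (8, False)
print(pages_for(500))  # (25, True)
```

When the second value is True, splitting the query into several narrower keywords is likely to yield more unique rows than paginating one keyword deeper.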

Article body looks truncated

  • "盐选" (paywalled) Zhihu Plus articles return only excerpts on the public surface. The actor surfaces what Zhihu exposes — there is no premium-content backdoor.

Author follower / voteup counts show 0

  • Some authors disable public stats. The fields are present but Zhihu returns 0 for these users.

⚠️ Trademark Disclaimer

This is an independent scraping tool. It is not affiliated with, endorsed by, or sponsored by Zhihu Inc. (知乎). The Zhihu® and 知乎® names appear under nominative fair use solely to describe the platform this tool reads from. All trademarks are the property of their respective owners.


⚖️ Legal

Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.

However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.

You can also read Apify's blog post on the legality of web scraping.


⭐ Love this actor?

Leave a 5-star review — it helps us build more features for you and keeps the SIÁN portfolio growing.


🤝 Support

Telegram Support

Join our active support community


Built by SIÁN Agency | More Tools