Under maintenance

Pricing

from $2.00 / 1,000 search results

Zhihu Scraper — Q&A, Answers, Articles, Columns

Zhihu scraper — extract long-form Mandarin Q&A, expert answers, articles & column posts. Keyword search, question answer threads, article detail, column article list. China market research, LLM training data, competitive intel. Four operations, one clean dataset per run. No API key.

Rating: 0.0 (0)

Developer: SIÁN OÜ

Actor stats

Total users: 106
Issues response: 8.3 hours
Last modified: 3 days ago

Zhihu Scraper — Q&A, Answers, Articles & Columns 🚀

🎉 The richest Mandarin Q&A corpus on the web — full HTML answer bodies, expert credentials, vote signals

Built for AI/LLM training teams, China market researchers, and B2B KOL outreach

📋 Overview

Zhihu (知乎) is China's expert-driven Q&A platform — the closest thing to a Mandarin Stack Overflow + Quora + Medium rolled into one. This scraper pulls complete answer threads, full-HTML articles, keyword search across the platform, and column (zhuanlan) post lists — clean, structured, ready for analysis or model training.

Why AI teams, market researchers, and agencies choose us:

🧠 Best-in-class LLM training data — full HTML answer bodies (not snippets) with author credentials, vote/comment signals, and badge verification — gold-standard SFT/RAG corpus material for Chinese-language models
📚 Long-form depth, not shallow snippets — competitors return excerpts; we return the entire answer + article body including embedded images, headings, and inline references
🔀 Mixed-type search in one call — keyword searches return answers, questions, articles, AND people in a single dataset, each row dispatched to the correct ID schema (answerId, questionId, articleId, peopleId)
🎖️ KOL discovery built in — every row carries authorId, authorName, authorHeadline, authorFollowerCount, authorVoteupCount, authorBadges[], authorIsOrg — ready to dedupe and shortlist Zhihu blue/gold-badge experts
💰 Pay-per-result pricing — $0.004/search row, $0.040/article detail. Generous FREE tier. No subscription, no minimums, no surprise bills
✨ No account, no API key, no proxy setup — paste an ID or keyword, click run, get clean JSON

✨ Features

🔍 Keyword Search — search across all Zhihu content types in one call, ~20 mixed results per page
💬 Question Answer Threads — pull every answer to a Zhihu question with full HTML body, vote counts, and reply counts
📰 Article Detail Extraction — full article HTML body, author profile, topic tags, and parent column reference in a single row
📚 Column (Zhuanlan) Article Lists — paginate the complete catalog of any Zhihu column, ~10 articles per page
🏷️ Author + Badge Data on Every Row — Zhihu blue/gold badges, follower counts, vote tallies, headline bios baked in
🆔 18–19-digit ID Precision: IDs are preserved as strings, avoiding the silent precision loss that JavaScript's Number type causes for integers above 2^53
🖼️ Image URL Normalization — all Zhihu CDN URLs upgraded to HTTPS automatically
📊 Clean Structured JSON — flat camelCase aliases on every entity, ready for BigQuery, Pinecone, pandas, or Airtable
🌐 Mandarin-Aware Error Translation: upstream Chinese error strings such as 问题不存在 ("question does not exist") and 专栏不存在 ("column does not exist") are translated to plain English in the dataset
⚡ Resilient Pagination — built-in retry on transient upstream errors, no manual cursor management

🎬 Quick Start

Pick one of four operations, drop in a keyword or ID, and run. One operation per run, one clean dataset out.

curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"operation":"search","keyword":"人工智能","maxPages":3}'

🚀 Getting Started (3 Simple Steps)

Step 1: Pick your operation

Choose one: search (keyword), answerList (question thread), articleDetail (single article), or columnArticleList (column posts).

Step 2: Provide the input

A keyword for search, or a Zhihu ID (questionId, articleId, columnId) for the targeted operations.

Step 3: Click Run

The actor handles pagination, retries, and ID precision automatically. Results land in the Apify dataset as flat JSON.

That's it! In under a minute, you'll have:

Clean, flat JSON rows with the right ID/URL schema per type
Full HTML content bodies (not snippets) for answers and articles
Author + badge metadata on every row for KOL workflows

📥 Input Configuration

Field	Type	Required	Description
`operation`	enum	Yes	One of: `search`, `answerList`, `articleDetail`, `columnArticleList`
`keyword`	string	If `search`	Search term (Chinese or English)
`questionId`	string	If `answerList`	Zhihu question ID (numeric string)
`articleId`	string	If `articleDetail`	Zhihu article ID (numeric string)
`columnId`	string	If `columnArticleList`	Zhihu column slug (e.g. `xuehy`)
`maxPages`	number	No	Pagination cap (default 1; ignored for `articleDetail`)

Example — Keyword Search:

{
  "operation": "search",
  "keyword": "人工智能",
  "maxPages": 5
}

Example — Question Answer Thread:

{
  "operation": "answerList",
  "questionId": "660962845",
  "maxPages": 10
}

Example — Article Detail:

{
  "operation": "articleDetail",
  "articleId": "2032860336215307118"
}

Example — Column Article List:

{
  "operation": "columnArticleList",
  "columnId": "xuehy",
  "maxPages": 5
}
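The conditional requirements in the input table can be checked locally before a run is started. The following is a minimal sketch, not part of the actor: `validate_input` and `REQUIRED_FIELD` are hypothetical names, and the rule that `maxPages` must be a positive integer is an assumption (the table only says "number").

```python
# Hypothetical pre-flight check; the rules mirror the input table above.
REQUIRED_FIELD = {
    "search": "keyword",
    "answerList": "questionId",
    "articleDetail": "articleId",
    "columnArticleList": "columnId",
}


def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    operation = run_input.get("operation")
    if operation not in REQUIRED_FIELD:
        return [f"operation must be one of {sorted(REQUIRED_FIELD)}"]

    problems = []
    field = REQUIRED_FIELD[operation]
    value = run_input.get(field)
    # IDs are passed as strings so that long numeric IDs keep every digit.
    if not isinstance(value, str) or not value.strip():
        problems.append(f"{field} is required for {operation} and must be a non-empty string")

    max_pages = run_input.get("maxPages", 1)  # the table gives 1 as the default
    # bool is a subclass of int in Python, so it is excluded explicitly.
    # Assumption: maxPages must be a positive integer.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        problems.append("maxPages must be a positive integer")
    return problems


print(validate_input({"operation": "search", "keyword": "人工智能", "maxPages": 5}))
print(validate_input({"operation": "answerList", "questionId": 660962845}))
```

The second call reports a problem because the question ID is passed as a number rather than a string.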

📤 Output

Results are saved to the Apify dataset with 40+ fields including full HTML bodies, author profiles, and engagement metrics.

Field	Type	Description
`operation`	string	Which operation produced the row
`entityType`	string	`answer` / `question` / `article` / `people` / `column-article`
`answerId` / `questionId` / `articleId` / `peopleId`	string	Type-appropriate Zhihu entity ID (up to 19 digits, preserved as a string)
`title`	string	Question / article title
`excerpt`	string	Short summary text
`content`	string	Full HTML body for answers and articles
`voteupCount`	number	Upvote count
`commentCount`	number	Comment count
`authorId`	string	Author's Zhihu ID
`authorName`	string	Display name
`authorHeadline`	string	One-line bio
`authorFollowerCount`	number	Author follower count
`authorVoteupCount`	number	Lifetime upvotes received by author
`authorBadges`	array	Verified-expert badges (blue/gold)
`authorIsOrg`	boolean	Whether the author is a verified organization
`itemPageUrl`	string	Canonical Zhihu URL for the entity
`createdTime` / `updatedTime`	number	Unix timestamps
`topics`	array	Topic tags (article ops)
`column`	object	Parent column reference (article ops)

Example row (search result, entityType: "answer"):

{
  "operation": "search",
  "entityType": "answer",
  "answerId": "3654812345678901234",
  "questionId": "660962845",
  "title": "未来 10 年人工智能会让哪些行业彻底消失？",
  "excerpt": "从我的实际经验来看，AI 替代的不是行业，而是行业里...",
  "content": "<p>从我的实际经验来看...</p><img src=\"https://pic1.zhimg.com/...\">",
  "voteupCount": 1842,
  "commentCount": 327,
  "authorId": "abc-123-def",
  "authorName": "张三",
  "authorHeadline": "AI Researcher | Tsinghua University",
  "authorFollowerCount": 124300,
  "authorVoteupCount": 982401,
  "authorBadges": ["identity_blue"],
  "itemPageUrl": "https://www.zhihu.com/question/660962845/answer/3654812345678901234"
}
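Because a search run mixes entity types, downstream code usually needs to pick the right ID field per row. The sketch below shows one way to do that, assuming the `entityType` values and ID field names listed above. `primary_id`, `group_by_type`, and the mapping itself are hypothetical, and pairing `column-article` with `articleId` is a guess.

```python
from collections import defaultdict
from typing import Optional

# Assumed mapping from entityType to the field holding that row's own ID.
ID_FIELD_BY_TYPE = {
    "answer": "answerId",
    "question": "questionId",
    "article": "articleId",
    "people": "peopleId",
    "column-article": "articleId",  # assumption, not confirmed by the docs
}


def primary_id(row: dict) -> Optional[str]:
    """Return the row's own ID as a string, or None if it cannot be determined."""
    field = ID_FIELD_BY_TYPE.get(row.get("entityType"))
    if field is None:
        return None
    value = row.get(field)
    return None if value is None else str(value)


def group_by_type(rows: list) -> dict:
    """Split a mixed dataset into one list of rows per entityType."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get("entityType", "unknown")].append(row)
    return dict(groups)


rows = [
    {"entityType": "answer", "answerId": "3654812345678901234", "questionId": "660962845"},
    {"entityType": "question", "questionId": "660962845"},
]
for row in rows:
    print(row["entityType"], primary_id(row))
```

An answer row carries both `answerId` and `questionId`; the helper returns the answer's own ID and leaves the parent question ID on the row.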

💼 Use Cases & Examples

1. AI / LLM Training Corpus Building

Wei, ML Engineer at a Beijing AI lab, pulls 100K+ Mandarin answer threads per month for SFT and RAG fine-tuning datasets.

Input: A list of question IDs covering broad topics (technology, finance, medicine, philosophy).
Output: Full HTML answer bodies with author credentials and vote signals for quality filtering.
Use: Bootstrap a domain-balanced Chinese-language instruction-tuning dataset. Filter by voteupCount > 100 and authorBadges to keep only high-signal answers.
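The quality filter mentioned in this use case could look roughly like the sketch below. `is_high_signal` is a hypothetical helper, the sample rows are made up for illustration, and treating "has at least one badge" as the badge criterion is an assumption.

```python
def is_high_signal(row: dict, min_votes: int = 100) -> bool:
    """Keep rows with more than `min_votes` upvotes and at least one author badge."""
    votes = row.get("voteupCount") or 0
    badges = row.get("authorBadges") or []
    return votes > min_votes and len(badges) > 0


# Made-up rows that use the field names from the output table.
rows = [
    {"answerId": "1", "voteupCount": 1842, "authorBadges": ["identity_blue"]},
    {"answerId": "2", "voteupCount": 1842, "authorBadges": []},
    {"answerId": "3", "voteupCount": 12, "authorBadges": ["identity_blue"]},
]

kept = [row for row in rows if is_high_signal(row)]
print([row["answerId"] for row in kept])
```

Only the first row passes: the second has no badge and the third falls below the vote threshold.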

2. China Market & Consumer Research

Lin, Insights Lead at a Shanghai research agency, keyword-tracks branded questions weekly to surface unfiltered consumer sentiment.

Input: Brand or product keyword, e.g. "特斯拉" (Tesla), "iPhone 17", or "小米汽车" (Xiaomi Auto).
Output: Top-voted questions and answers mentioning the brand, with vote and comment counts.
Use: Build a weekly brand-perception report grounded in real Chinese consumer language rather than survey responses.

3. Competitive Intelligence & Brand Monitoring

Anya, PM at a B2B SaaS company, monitors competitor mentions in Q&A threads to catch comparison content early.

Input: Competitor names + product category keywords.
Output: Questions, answers, and articles mentioning competitors, sorted by recency and engagement.
Use: Surface "X vs. Y" threads before they go viral; respond proactively where buyers are asking real questions.

4. B2B Influencer / KOL Outreach

Marcus, Marketing Lead at a B2B firm targeting China, shortlists Zhihu KOLs for sponsored long-form content.

Input: Topic keyword, e.g. "AI 创业" (AI startups) or "SaaS 出海" (SaaS going global).
Output: Top-voted answers with author follower counts, badge verification, and headline bios.
Use: Dedupe authors across thousands of answers, sort by authorFollowerCount and badge level, hand off to outreach.
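The dedupe-and-rank step in this use case can be sketched as follows. `shortlist_authors` is a hypothetical helper and the sample rows are made up. Since the docs do not define "badge level", the sketch simply ranks authors with any badge ahead of those without, which is an assumption.

```python
def shortlist_authors(rows: list, top_n: int = 10) -> list:
    """Collapse rows to one entry per authorId, then rank by badge and followers."""
    best = {}
    for row in rows:
        author_id = row.get("authorId")
        if not author_id:
            continue
        followers = row.get("authorFollowerCount") or 0
        current = best.get(author_id)
        # Keep the row with the highest follower count seen for this author.
        if current is None or followers > current["authorFollowerCount"]:
            best[author_id] = {
                "authorId": author_id,
                "authorName": row.get("authorName"),
                "authorHeadline": row.get("authorHeadline"),
                "authorFollowerCount": followers,
                "authorBadges": row.get("authorBadges") or [],
            }
    # Assumption: any badge outranks no badge; ties are broken by follower count.
    ranked = sorted(
        best.values(),
        key=lambda a: (len(a["authorBadges"]) > 0, a["authorFollowerCount"]),
        reverse=True,
    )
    return ranked[:top_n]


rows = [
    {"authorId": "a1", "authorName": "A", "authorFollowerCount": 500, "authorBadges": []},
    {"authorId": "a1", "authorName": "A", "authorFollowerCount": 650, "authorBadges": []},
    {"authorId": "b2", "authorName": "B", "authorFollowerCount": 120, "authorBadges": ["identity_blue"]},
    {"authorId": "c3", "authorName": "C", "authorFollowerCount": 9000, "authorBadges": []},
]
for author in shortlist_authors(rows):
    print(author["authorId"], author["authorFollowerCount"], author["authorBadges"])
```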

5. Trend & Topic Early-Signal Detection

Chen, Data Scientist at a hedge fund, runs daily keyword searches to spot emerging questions before mainstream pickup.

Input: Industry watchlist (semiconductors, energy, biotech) refreshed daily.
Output: New questions and rising answers, time-stamped, with vote and comment counts from which engagement velocity can be derived.
Use: Feed into an alpha-generation pipeline that flags breakout topics for analyst review.
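One simple way to derive a velocity signal from a row is upvotes per hour since creation. The sketch below assumes `createdTime` is a Unix timestamp in seconds (the output table says "Unix timestamps" without naming the unit); `votes_per_hour` is a hypothetical helper, not a field the actor returns.

```python
from datetime import datetime, timezone
from typing import Optional


def votes_per_hour(row: dict, now: Optional[datetime] = None) -> Optional[float]:
    """Upvotes divided by the row's age in hours; None if the age is unknown."""
    created = row.get("createdTime")  # assumed to be Unix seconds
    if not created:
        return None
    now = now or datetime.now(timezone.utc)
    age_hours = (now.timestamp() - created) / 3600
    if age_hours <= 0:
        return None
    return (row.get("voteupCount") or 0) / age_hours


# A made-up row created exactly ten hours before the reference time.
reference = datetime.fromtimestamp(1_700_036_000, tz=timezone.utc)
row = {"createdTime": 1_700_000_000, "voteupCount": 500}
print(votes_per_hour(row, now=reference))
```

Passing `now` explicitly keeps the calculation reproducible when the same dataset is re-scored later.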

6. Academic & Sociolinguistic Research

Dr. Park, a Stanford computational linguist, builds Mandarin discourse corpora for academic NLP research.

Input: Topic clusters via keyword search and column article lists.
Output: Full HTML article bodies and answer threads with public author profile fields where available.
Use: Train discourse-level classifiers, study Chinese internet argumentation patterns, publish reproducible datasets.

🔗 Integration Examples

JavaScript/Node.js

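// Top-level await requires an ES module (.mjs or "type": "module").
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('sian.agency/zhihu-scraper').call({
  operation: 'search',
  keyword: '人工智能',
  maxPages: 5,
});

// Read the rows from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);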

Python

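from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_TOKEN')

# Start the actor and block until the run finishes.
run = client.actor('sian.agency/zhihu-scraper').call(
    run_input={
        'operation': 'answerList',
        'questionId': '660962845',
        'maxPages': 10,
    }
)

# Stream rows from the run's default dataset; .get() tolerates rows that lack a field.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item.get('authorName'), item.get('voteupCount'))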

cURL

curl -X POST "https://api.apify.com/v2/acts/sian.agency~zhihu-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"operation":"articleDetail","articleId":"2032860336215307118"}'

Automation Workflows (N8N / Zapier / Make)

Trigger: Daily schedule or webhook from your watchlist tool
HTTP Request: Call the actor with a keyword or column ID
Process: Filter rows by voteupCount / authorBadges / authorFollowerCount
Action: Push to BigQuery, Pinecone, Airtable, or trigger a Slack alert

📊 Performance & Pricing

FREE Tier (Try It Now)

Full feature access on all four operations — same data quality as PAID
Generous evaluation allowance under the Apify FREE plan
No credit card required

PAID Tier (Production Ready)

Pay-per-result: row charges apply only to successful rows, plus the one-time Actor Start fee per run
Volume discounts auto-applied at SILVER, GOLD, PLATINUM, DIAMOND tiers
No subscription, no minimums, no commitments

Live BRONZE per-result pricing:

Event	Price	Triggered by
Actor Start	$0.014	Once per run
Search Result	$0.004 (PRIMARY)	Per row from keyword search
Question Answer	$0.005	Per answer in a question thread
Article Detail	$0.040	Per article (full HTML body)
Column Article	$0.004	Per article in a column listing
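As a rough budgeting aid, the BRONZE prices in the table can be turned into a per-run estimate: one Actor Start fee plus the per-row price times the number of rows. The sketch below is illustrative only. `estimate_cost` is a hypothetical helper, the mapping from operation to pricing event is inferred from the "Triggered by" column, and higher tiers would be cheaper.

```python
from decimal import Decimal

# BRONZE per-result prices copied from the table above (USD).
ACTOR_START = Decimal("0.014")
PRICE_PER_ROW = {
    "search": Decimal("0.004"),
    "answerList": Decimal("0.005"),
    "articleDetail": Decimal("0.040"),
    "columnArticleList": Decimal("0.004"),
}


def estimate_cost(operation: str, rows: int) -> Decimal:
    """Estimated USD cost of one run that returns `rows` successful rows."""
    return ACTOR_START + PRICE_PER_ROW[operation] * rows


# Three search pages at roughly 20 rows per page.
print(estimate_cost("search", 60))        # 0.254
# A single article detail lookup.
print(estimate_cost("articleDetail", 1))  # 0.054
```

`Decimal` is used instead of `float` so that sums of small prices do not pick up binary rounding noise.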

💰 Best price on the market for full-HTML Zhihu extraction — competitors charge 3–5× more for snippet-only output.

🔗 View current pricing

❓ Frequently Asked Questions

Q: How many results can I pull per run? A: There's no hard cap — set maxPages to whatever you need. The actor handles pagination and retries automatically.

Q: Do I need a Zhihu account or API key? A: No. Just an Apify account. We handle everything upstream.

Q: Does it support private answers or paid-content articles? A: No, only publicly accessible content. Paywalled "盐选" (Zhihu Plus) articles return only the excerpt that Zhihu shows publicly.

Q: What output formats are available? A: JSON, CSV, Excel, XML, JSONL — export directly from the Apify dataset UI or API.

Q: How accurate are the 18–19-digit IDs? A: IDs are preserved as strings end-to-end. JavaScript's default JSON.parse stores every number as a 64-bit float, so integers above 2^53 silently lose precision; we intercept the parse and keep the full value.
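The precision problem is easy to reproduce. The sketch below uses Python's `json` module with `parse_int=float` to emulate a parser that stores every number as a 64-bit float, as JavaScript does. The payload is made up for illustration and uses the answer ID from the example row; it is not actual upstream output.

```python
import json

# A made-up payload in which the ID arrives as a bare JSON number.
raw = '{"answerId": 3654812345678901234}'

# Emulate a parser that stores every number as a 64-bit float.
as_double = json.loads(raw, parse_int=float)["answerId"]
round_tripped = int(as_double)

# Python's default parser keeps arbitrary-precision integers.
exact = json.loads(raw)["answerId"]

print(round_tripped == exact)               # False: the float cannot hold every digit
print(str(exact) == "3654812345678901234")  # True: a string keeps the ID intact
print(float(2**53 + 1) == float(2**53))     # True: the first integer a double cannot represent
```

Keeping IDs as strings sidesteps the issue in any language, which is why the dataset rows carry them that way.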

Q: Can I get full HTML article bodies, not just summaries? A: Yes — articleDetail and answerList return the full HTML content field with embedded images and formatting intact.

Q: Does the search return answers, questions, and articles together? A: Yes — one search call returns mixed types in a single dataset. Each row carries an entityType field so you can split downstream.

Q: Is this legal? A: The actor collects only publicly available data. See the legal section below.

🐛 Troubleshooting

code:301 — FAILED, RETRY errors on a specific question ID

A small number of historical Zhihu IDs are permanently flagged by upstream anti-bot protection. Try a different question; most recent IDs work fine. The actor already retries with backoff before surfacing the error.

Empty results on a column ID

Double-check the column slug (the part after zhuanlan.zhihu.com/). Example: for https://zhuanlan.zhihu.com/xuehy, use columnId: "xuehy".
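Copying IDs out of URLs by hand is a common source of empty results, so a small parser can help. The sketch below is a hypothetical helper, not part of the actor. The column and question URL shapes come from the examples in this document, while the `/p/<id>` article pattern is an assumption about Zhihu's URL layout.

```python
import re
from urllib.parse import urlparse


def input_from_url(url: str) -> dict:
    """Build an actor input dict from a Zhihu URL, or raise ValueError."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.strip("/")

    if host == "zhuanlan.zhihu.com":
        # Assumption: articles live under /p/<numeric id>.
        article = re.fullmatch(r"p/(\d+)", path)
        if article:
            return {"operation": "articleDetail", "articleId": article.group(1)}
        # A single path segment is treated as the column slug.
        if path and "/" not in path:
            return {"operation": "columnArticleList", "columnId": path}

    if host in ("www.zhihu.com", "zhihu.com"):
        question = re.match(r"question/(\d+)", path)
        if question:
            return {"operation": "answerList", "questionId": question.group(1)}

    raise ValueError(f"unrecognized Zhihu URL: {url}")


print(input_from_url("https://zhuanlan.zhihu.com/xuehy"))
print(input_from_url("https://www.zhihu.com/question/660962845/answer/3654812345678901234"))
```

IDs stay strings throughout, so long numeric IDs are passed to the actor unchanged.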

Search returns fewer results than expected

Increase maxPages. Zhihu paginates ~20 mixed results per page; deep pagination beyond 10 pages may return diminishing fresh content.

Article body looks truncated

"盐选" (paywalled) Zhihu Plus articles return only excerpts on the public surface. The actor surfaces what Zhihu exposes — there is no premium-content backdoor.

Author follower / voteup counts show 0

Some authors disable public stats. The fields are present but Zhihu returns 0 for these users.

⚠️ Trademark Disclaimer

This is an independent scraping tool. It is not affiliated with, endorsed by, or sponsored by Zhihu Inc. (知乎). The Zhihu® and 知乎® names appear under nominative fair use solely to describe the platform this tool reads from. All trademarks are the property of their respective owners.

⚖️ Is it legal to scrape data?

Our actors are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our actors, when used for ethical purposes by Apify users, are safe.

However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers.

You can also read Apify's blog post on the legality of web scraping.

⭐ Love this actor?

Leave a 5-star review — it helps us build more features for you and keeps the SIÁN portfolio growing.

🤝 Support

Join our active support community

For issues or questions, open an issue in the actor's repository
Check the SIÁN Agency Store for more China-market automation tools
📧 apify@sian-agency.online

Built by SIÁN Agency | More Tools
