Pricing

from $25.00 / 1,000 AI-ready documents (HTTP source)

Chinese AI Training Corpus Engine

Turn China's public web into AI-training-ready text. Pulls Weibo, Bilibili, Xueqiu, Douban & RedNote, then deduplicates, quality-scores, PII-scrubs and provenance-stamps every document. From $0.025/doc, pay-as-you-go. For LLM training-data teams, data vendors & academic NLP researchers.

Developer

Sami

Last modified: a month ago

Chinese AI Training Corpus Engine — Weibo + Bilibili + Xueqiu + Douban + RedNote

Turn China's public web into AI-training-ready text — deduplicated, quality-scored, PII-scrubbed, provenance-stamped. One run pulls topical documents from Weibo, Bilibili, Xueqiu, Douban, and (optionally) RedNote, runs every document through a cleaning + dedup + quality + provenance pipeline, and bills only the documents that survive the gates. Built for AI/LLM training-data teams, data vendors, and academic NLP researchers. No login, no API key, no VPN.

🏢 Sourcing a Chinese-language LLM training corpus at production scale?

This Actor assembles Chinese-language corpora at production scale: tens of thousands to hundreds of thousands of clean, deduplicated, provenance-stamped documents per run, on a schedule that grows one corpus without ever paying twice for the same document. Drop-in for SFT/RLHF dataset builds, foundation-corpus slices, and data-vendor catalogs. Pay-per-document, no contract.

For high-volume or enterprise use, I offer bulk and volume pricing, custom output schemas matched to your training pipeline, dedicated proxy throughput for sustained bulk pulls, scheduled managed corpus feeds, and a schema-stability SLA (no breaking changes without 30-day notice).

→ DM me on Apify, open an Issue titled "Enterprise inquiry", or email samimassis2002@gmail.com (subject "Corpus Engine enterprise").

💡 Brand/marketing intelligence rather than training data? See Chinese Brand Monitor (cross-platform mention tracking) and AI Brand Visibility Monitor (how the AI engines describe your brand).

Part of the Chinese Digital Intelligence Suite — where this Actor fits among the suite's three lanes
Who buys this Actor — buyer profiles and typical spend
What you get per document — the full annotated output record
EU AI Act & provenance — per-document documentation fields
Legal positioning & FAQ — what this tool is (and is not)
Modes — corpus_pull, dedup_merge, provenance_audit with copy-paste inputs
Pricing — $0.025/doc, billed only on documents that pass the gates
Scheduled corpus refresh — grow one corpus on a cron, never re-billed
Integrations — Sheets, Zapier, Make, n8n, REST API
What this Actor is NOT

Part of the Chinese Digital Intelligence Suite by Zhorex

The only Apify developer specializing in Chinese-platform intelligence — built specifically for AI training data buyers, equity research analysts covering Chinese consumer brands, and brand monitoring teams:

Platform	Users	Use Case	Link
🆕 Chinese AI Training Corpus Engine	All 5	Bulk AI-corpus assembly — dedup + quality scoring + PII scrub + per-document provenance ($0.025/doc)	You are here
Chinese Brand Monitor	All 5	Cross-platform brand aggregator — sentiment + dedup + reach-weighted brand-health rollup ($0.045/mention)	Chinese Brand Monitor
Weibo	580M+	Public opinion, hot search, trending topics	Weibo Scraper
RedNote (Xiaohongshu)	300M+	Consumer reviews, lifestyle signal, brand sentiment	RedNote Scraper
Bilibili	300M+	Video content, danmaku, Gen-Z creator sentiment	Bilibili Scraper
Douban	200M+	Long-form reviews (movies/books/music), group discussions	Douban Scraper
Xueqiu	20M+	Stock-discussion sentiment, cashtag indexing	Xueqiu Scraper
RedNote Shop	300M+	RedShop e-commerce: products, vendors, prices	RedNote Shop Scraper

Why use the suite — and which lane is yours? The suite has three lanes. The Chinese Brand Monitor is for recurring cross-platform brand monitoring — sentiment-tagged mentions on a schedule (it saves 4-6 hours vs. orchestrating individual scrapers). The single-platform scrapers are for deep extraction inside one platform — full comment trees, profiles, danmaku, long-form reviews, at the cheapest raw per-record price. The Corpus Engine is for bulk AI-corpus assembly: it pulls topical text across platforms in one run and ships every document already cleaned, deduplicated, quality-scored, PII-scrubbed, and provenance-stamped — the document your training pipeline ingests directly, not a raw record you still have to process.

Who buys this Actor

Buyer profile	Use case	Typical spend
AI / LLM training data teams	Chinese-language SFT/RLHF datasets and pretraining-corpus slices, deduplicated and quality-gated before they touch the pipeline	$200–$3,000 per corpus build
Data vendors / brokers	Resellable Chinese-text corpus slices with per-document provenance and content hashes for catalog documentation	$1,000–$10,000/mo
Academic NLP researchers	Reproducible Chinese social-text corpora (stable doc IDs, content hashes, pipeline version) for papers and classifier training	$30–$200/mo
AI compliance / governance teams	Per-document provenance audits — robots state, opt-out signals, PII counts — over corpora they already hold	$100–$500/mo

What you get per document

Every billed document is a single self-contained JSON record — text plus everything your data pipeline, your dedup ledger, and your compliance documentation need:

{
  "doc_id": "weibo_post_5123456789012345",      // stable: {platform}_{record_type}_{native_id}
  "record_type": "post",                         // post | comment | group_topic | note
  "topic": "新能源汽车",                          // which of your input topics matched this doc
  "text_clean": "……",                            // boilerplate-stripped, PII-scrubbed — feed this to training
  "text_raw": "……",                              // original text (also PII-scrubbed when piiScrub is on)
  "char_count": 412,                             // length of text_clean
  "language": "zh-CN",                           // zh-CN | en | mixed | und
  "quality": {                                   // 0-1 heuristic: length, charset, diversity,
    "score": 0.71,                               //   punctuation sanity, sentence structure
    "flags": []                                  // e.g. too_short, low_diversity, punct_spam
  },
  "pii": {                                       // what was found (and scrubbed when piiScrub: true)
    "emails": 0, "phones": 1, "national_ids": 0, "passports": 0,
    "wechat_handles": 0, "qq_numbers": 0, "total": 1
  },
  "dedup": {
    "cluster_id": "dup_3fa8c1d290bb",            // near-duplicate cluster this doc canonicalizes
    "is_canonical": true,                        // only canonical docs are returned and billed
    "near_dup_count": 2,                         // how many near-dups were collapsed (free)
    "duplicate_doc_ids": ["weibo_post_…"]        // up to 20 collapsed doc IDs
  },
  "engagement": { "likes": 230, "comments": 18, "shares": 4, "views": null },
  "provenance": {                                // the documentation layer — see EU AI Act section
    "platform": "weibo",
    "source_url": "https://weibo.com/…",
    "author_handle": "…",
    "published_at": "2026-06-08T11:32:00+08:00",
    "retrieved_at": "2026-06-10T09:14:55Z",
    "collection_method": "http_api",             // http_api | browser_render
    "robots_state": "allowed",                   // allowed | disallowed | unavailable
    "opt_out_signals": [],                       // e.g. ["robots_disallow"] — filter on this
    "license_hint": "User-generated content; rights remain with original authors; platform ToS restricts redistribution. This record conveys no copyright license or AI-training rights — obtain rights independently.",
    "pipeline_version": "1.0.0",
    "content_sha256": "…"                        // hash of final text_clean — your dedup ledger key
  },
  "billing_event": "corpus-doc",                 // corpus-doc | corpus-doc-browser
  "scrapedAt": "2026-06-10T09:14:55Z"
}
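Because record types such as group_topic contain an underscore, splitting a doc_id on every underscore would mis-parse it. The following is a minimal sketch, assuming the four record types listed in the record above are exhaustive and that platform names never contain an underscore (the source confirms neither); parse_doc_id is a hypothetical helper, not part of the Actor.

```python
# Record types taken from the annotated record above; longest first so
# "group_topic" is matched before any shorter candidate.
RECORD_TYPES = ("group_topic", "comment", "post", "note")

def parse_doc_id(doc_id: str) -> tuple:
    """Split '{platform}_{record_type}_{native_id}' into its three parts."""
    platform, _, rest = doc_id.partition("_")
    for record_type in RECORD_TYPES:
        prefix = record_type + "_"
        if rest.startswith(prefix):
            return platform, record_type, rest[len(prefix):]
    raise ValueError(f"unrecognized doc_id: {doc_id!r}")
```

This can be handy when partitioning a corpus by platform or record type without reading the nested provenance field.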

The pipeline behind it (every document, fixed order): normalize → language ID → boilerplate strip (URLs, repost chains, platform emoji codes, UI residue, zero-width chars) → PII scrub → quality score → near-duplicate detection with cluster IDs → billable gate → provenance stamp. Documents that fail the language filter, character floor, quality floor, or arrive as duplicates are dropped — not returned, not billed.
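The source does not say which near-duplicate algorithm the pipeline uses. To illustrate what "near-duplicate detection with cluster IDs" can look like, here is a sketch that swaps in a simple, well-known technique: character-trigram shingles compared by Jaccard similarity, with greedy first-seen clustering. The 0.8 threshold and all function names are assumptions made for illustration only.

```python
def shingles(text: str, n: int = 3) -> set:
    """Character n-grams; usable on Chinese text without word segmentation."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_near_duplicates(docs: dict, threshold: float = 0.8) -> dict:
    """Map each doc ID to the ID of its cluster's canonical (first-seen) doc."""
    canonical = {}    # canonical doc ID -> its shingle set
    assignment = {}   # every doc ID -> canonical doc ID
    for doc_id, text in docs.items():
        sig = shingles(text)
        match = next((cid for cid, csig in canonical.items()
                      if jaccard(sig, csig) >= threshold), None)
        if match is None:          # nothing similar enough: start a new cluster
            canonical[doc_id] = sig
            match = doc_id
        assignment[doc_id] = match
    return assignment
```

The pairwise comparison here is quadratic in the number of clusters, so it only suits small samples; it is meant to show the shape of the output (one canonical document per cluster), not how to run this at corpus scale.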

Every run also writes a free SUMMARY record to the run's key-value store: per-source document counts, drop reasons, dedup ratio, and a quality-score histogram — so you can judge a pull's yield at a glance before scaling it up.

Built for the EU AI Act provenance era

Under the EU AI Act, providers of general-purpose AI models must publish a sufficiently detailed summary of the content used for training, and EU text-and-data-mining rules require respecting machine-readable opt-out reservations. Most scraped datasets make that documentation work painful after the fact. This Actor does it per document, at collection time:

source_url + retrieved_at — where each document came from and exactly when it was collected
robots_state — whether the document's public URL path was allowed, disallowed, or could not be evaluated (unavailable) under the source domain's robots.txt at retrieval time
opt_out_signals — machine-readable opt-out indications detected for the document (e.g. robots_disallow); always present, even when empty
license_hint — a per-platform plain-language rights note attached to every record
content_sha256 + pipeline_version — reproducible content identity and processing lineage for your dataset documentation

Because these are first-class fields, downstream filtering is one line — e.g. drop everything with robots_state: "disallowed" or a non-empty opt_out_signals before a training run, and keep the audit trail showing you did.
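As a sketch of that filter, the snippet below keeps a document only when robots_state is not "disallowed" and opt_out_signals is empty, and returns the dropped documents separately so they can be stored as the audit trail. Field names come from the record shown earlier; the function names are hypothetical, and treating "unavailable" as acceptable is a policy assumption you may want to tighten.

```python
def is_trainable(doc: dict) -> bool:
    """True if robots.txt did not disallow the doc and no opt-out signal was found."""
    prov = doc["provenance"]
    return prov["robots_state"] != "disallowed" and not prov["opt_out_signals"]

def split_by_policy(docs: list) -> tuple:
    """Return (kept, dropped); persist `dropped` as evidence of the filtering step."""
    kept = [d for d in docs if is_trainable(d)]
    dropped = [d for d in docs if not is_trainable(d)]
    return kept, dropped
```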

Framing matters: this is documentation tooling, not legal clearance. These fields make a corpus documentable and filterable. They do not determine whether any given use of any given document is lawful in your jurisdiction — that judgment, and the rights to make it on, remain yours.

Legal positioning

⚖️ Read this before buying

This Actor is a collection, structuring, and provenance-documentation TOOL. It accesses publicly available content only — the same content any anonymous browser visitor can see. It does not grant, transfer, or imply any copyright license or AI-training rights to the collected content. Rights remain with the original authors and platforms; the license_hint field on every record says exactly that. Obtain any rights you need independently, and consult legal counsel for your specific use case and jurisdiction. The provenance and opt-out fields help you document and filter a corpus — they are not, and cannot be, legal clearance.

FAQ

Q: Is scraping Weibo / Bilibili / Xueqiu / Douban / RedNote legal? A: This Actor accesses only publicly visible content on each platform — no login bypass, no private accounts, no DMs, no follower lists. Optional cookieStrings are user-supplied and used only to improve recall and rate limits, never to bypass authentication. Laws on collecting and using public web data vary by jurisdiction and purpose. Always consult legal counsel for your specific use case and jurisdiction.

Q: Does buying documents from this Actor give me the right to train AI models on them? A: No. No scraper can grant that — and this one explicitly does not claim to. The Actor collects, structures, and documents public content; copyright and related rights stay with the original authors and platforms. The per-document license_hint, robots_state, and opt_out_signals fields exist precisely so you (and your counsel) can make and document that determination yourselves.

Q: How does the PII scrub work, and what about PIPL / GDPR? A: With piiScrub: true (the default), emails, phone numbers, checksum-validated Chinese resident IDs, and passport numbers found in document text are replaced with [EMAIL] / [PHONE] / [CN_ID] / [PASSPORT] tokens, and per-document counts are reported in the pii field (counts are reported even if you turn scrubbing off). This materially reduces incidental personal data in the corpus, but user-generated text can carry personal information in forms no scrubber catches — under PIPL, GDPR, and similar regimes you remain responsible for how you process and retain the data. The scrub is a hygiene layer, not a compliance guarantee.
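To make "checksum-validated" concrete, here is a rough sketch of token replacement for two of the PII types. It is not the Actor's implementation: the regular expressions are illustrative assumptions, and only the check-character rule for 18-character resident IDs (weighted sum of the first 17 digits, modulo 11) is the standard one.

```python
import re

ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"   # indexed by (weighted sum) mod 11

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CN_ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")   # 18 chars, not inside a longer number

def is_valid_cn_id(candidate: str) -> bool:
    """Compare the 18th character with the check character derived from the first 17."""
    total = sum(int(d) * w for d, w in zip(candidate[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == candidate[17].upper()

def scrub(text: str) -> tuple:
    """Replace emails and checksum-valid resident IDs; return (text, counts)."""
    counts = {"emails": 0, "national_ids": 0}

    def replace_id(match):
        if not is_valid_cn_id(match.group()):
            return match.group()        # fails the checksum: leave untouched
        counts["national_ids"] += 1
        return "[CN_ID]"

    text, counts["emails"] = EMAIL_RE.subn("[EMAIL]", text)
    text = CN_ID_RE.sub(replace_id, text)
    return text, counts
```

Validating the checksum before replacing keeps ordinary 18-digit strings such as order numbers out of the national_ids count, which is presumably why the Actor validates rather than pattern-matches alone.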

Q: What does robots_state: "disallowed" mean on a record I received? A: It means that, at retrieval time, the source domain's robots.txt disallowed the document's public URL path for generic crawlers. The record is delivered anyway with that flag so you can apply your own policy — most training-data teams filter these out, and the field gives you a documented basis for doing so.

Q: Why is sentiment tagging not included? A: Deliberately out of scope for v1 — corpus buyers run their own labeling stacks, and bundling a sentiment model would add cost to every document whether you want it or not. If you need sentiment-tagged Chinese mentions, that's the Chinese Brand Monitor's lane ($0.045/mention, sentiment included).

Q: Can I run this from Python? A: Yes — see the Python example below. pip install apify-client and call zhorex/chinese-corpus-engine like any other Actor.

Modes

Mode	What it does	Billed as
`corpus_pull`	Scrape fresh documents for your topics from the selected sources and run the full pipeline	`corpus-doc` $0.025 / `corpus-doc-browser` $0.055
`dedup_merge`	No scraping — re-process datasets you already own (from this Actor or any suite Actor) into one canonical, deduplicated corpus	`audit-record` $0.003 per record processed
`provenance_audit`	No scraping — emit a compact audit record (robots state, opt-out signals, license hint, PII counts, content hash) for every document in datasets you already own	`audit-record` $0.003 per audit record

How to build a Chinese AI training corpus in 3 easy steps

1. Go to the Chinese AI Training Corpus Engine on Apify Store and click "Try for free"
2. Enter your topics (mix brands, categories, and themes for corpus diversity; Chinese keywords return far more native content than Latin spellings) and pick your sources
3. Click Run and download AI-ready documents in JSON, CSV, or Excel

No coding required. No login needed. Works with Apify's free plan.

Example: evaluate corpus quality (start here)

A small pull to judge yield and quality before committing to volume:

{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "智能驾驶"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 1000,
  "minQuality": "0.35",
  "minCharCount": 40,
  "languages": ["zh-CN", "mixed"]
}

Check the SUMMARY record (drop reasons, dedup ratio, quality histogram), then raise maxDocs to 10,000–50,000 for the real build.
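To reason about how the three thresholds in this input interact, the sketch below re-applies them to already-scored documents and tallies the outcomes. The gate order follows the pipeline description (language filter, character floor, quality floor); the drop-reason labels are hypothetical, since the source does not list the names used in the SUMMARY record.

```python
from collections import Counter
from typing import Optional

def drop_reason(doc: dict, min_quality: str, min_char_count: int,
                languages: list) -> Optional[str]:
    """Return a (hypothetical) drop-reason label, or None if the doc passes every gate."""
    if doc["language"] not in languages:
        return "language_filter"
    if doc["char_count"] < min_char_count:
        return "char_floor"
    if doc["quality"]["score"] < float(min_quality):   # the examples pass minQuality as a string
        return "quality_floor"
    return None

def yield_report(docs: list, **gates) -> Counter:
    """Tally passes and drop reasons, similar in spirit to the SUMMARY record."""
    return Counter(drop_reason(d, **gates) or "passed" for d in docs)
```

Running it over a sample with stricter values (for instance min_quality="0.5", min_char_count=80) gives a rough preview of how much a tighter build would discard before you pay for it.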

Example: SFT dataset build with strict quality

{
  "mode": "corpus_pull",
  "topics": ["护肤", "化妆品成分", "敏感肌"],
  "sources": ["weibo", "bilibili", "douban"],
  "maxDocs": 25000,
  "minQuality": "0.5",
  "minCharCount": 80,
  "languages": ["zh-CN"]
}

Example: add RedNote first-person reviews (browser source)

RedNote's full post bodies require JS rendering, so RedNote only runs when includeBrowserSources is enabled and bills at the higher corpus-doc-browser rate:

{
  "mode": "corpus_pull",
  "topics": ["国货美妆"],
  "sources": ["weibo", "rednote"],
  "includeBrowserSources": true,
  "maxDocs": 5000,
  "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}

Recommended run memory with browser sources on: 4096 MB.

Example: merge & dedup datasets you already own

Point it at datasets from previous runs, or from any Chinese Digital Intelligence Suite Actor, and get back one canonical corpus with cluster IDs at roughly an eighth of the scrape price ($0.003 per record processed vs. $0.025 per scraped document):

{
  "mode": "dedup_merge",
  "inputDatasetIds": ["DATASET_ID_1", "DATASET_ID_2", "DATASET_ID_3"],
  "minQuality": "0.35"
}

Example: provenance audit of an existing corpus

{
  "mode": "provenance_audit",
  "inputDatasetIds": ["DATASET_ID_1"]
}

Each audit record carries doc_id, source_url, robots_state, opt_out_signals, license_hint, PII counts, content_sha256, and the dedup cluster ID — no text fields, just the documentation layer.

💾 Memory guidance for bulk runs. The default 2048 MB comfortably handles runs up to a practical ceiling of ~250,000 documents; for runs approaching 1M documents, set run memory to 8192 MB (dedup state is held in memory). With includeBrowserSources: true, use 4096 MB or more.

🍪 Optional cookies. The cookieStrings input (a secret field, encrypted at rest) accepts per-platform logged-in cookies, e.g. {"xueqiu": "xq_a_token=..."} — they improve recall and rate limits on the gated platforms. The Actor degrades gracefully without them. Use throwaway accounts and refresh roughly every 10 days for scheduled runs.

Pricing

This Actor uses Pay-Per-Event pricing — and the gates work in your favor: only documents that pass the quality floor, character floor, and deduplication are billed. Rejects, duplicates, and already-collected documents are free — not returned, not billed.

Event	Price	Charged when
`corpus-doc`	$0.025	An AI-ready document from an HTTP source (Weibo / Bilibili / Xueqiu / Douban) passes all gates
`corpus-doc-browser`	$0.055	An AI-ready document from a browser source (RedNote) passes all gates — only with `includeBrowserSources: true`
`audit-record`	$0.003	Per record processed in `dedup_merge`; per audit record emitted in `provenance_audit`

Pick the right lane (and the right price)

This engine deliberately does not replace the rest of the suite — each lane is priced for its job:

Your job	Right tool	Price lane
Raw records from one platform, cheapest per record — comments, danmaku, profiles, full review bodies	The single-platform scrapers (Weibo, Bilibili, Douban, Xueqiu, RedNote)	$0.005–$0.020 per raw record
Recurring brand monitoring — sentiment-tagged, reach-weighted mention feed	Chinese Brand Monitor	$0.045 per mention
AI-ready corpus — cleaned, deduplicated, quality-gated, PII-scrubbed, provenance-stamped documents	This Actor	$0.025 per document ($0.055 browser-sourced)

At $0.025, a corpus document costs barely more than a single raw Weibo record — meaning the boilerplate stripping, near-duplicate collapse, quality scoring, PII scrub, and provenance stamping are effectively included free. If you just need raw records, the single-platform scrapers stay cheaper — use them. If you need documents your training pipeline can ingest as-is, this is the lane.

Realistic costs

Workflow	Volume	Cost
Corpus quality evaluation pull	1,000 docs	~$25
SFT dataset build (HTTP sources)	10,000 docs	~$250
Foundation-corpus slice (HTTP sources)	50,000 docs	~$1,250
RedNote first-person review corpus (browser source)	5,000 docs	~$275
Dedup & merge of datasets you already own	100,000 records	~$300
Provenance audit of an existing corpus	50,000 records	~$150
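The figures in this table are plain multiplication of event counts by the per-event prices. A small estimator, using the three prices from the pricing table and Decimal to avoid floating-point drift, might look like this (estimate_cost is a hypothetical helper, and the result excludes platform compute and proxy bandwidth):

```python
from decimal import Decimal

PRICES = {                                   # per-event prices from the pricing table
    "corpus-doc": Decimal("0.025"),
    "corpus-doc-browser": Decimal("0.055"),
    "audit-record": Decimal("0.003"),
}

def estimate_cost(event_counts: dict) -> Decimal:
    """Sum price x count over billed events, in USD."""
    return sum((PRICES[event] * n for event, n in event_counts.items()), Decimal("0"))
```

For a mixed run, pass several events at once, e.g. estimate_cost({"corpus-doc": 4000, "corpus-doc-browser": 1000}).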

Compare: licensing a comparable academic Chinese-text corpus runs $15K–$50K with a single-use license and months of delivery time — and arrives without per-document provenance.

Volume pricing available above 50K documents/month (see the Enterprise section at the top). Apify platform compute costs (RAM-seconds) are charged separately; browser-source runs also consume residential proxy bandwidth billed by GB on your own account.

⏰ Set up a scheduled corpus refresh in 2 minutes

A corpus is not a one-off pull — it's an asset that compounds. A weekly or daily schedule turns a single topical pull into a continuously growing corpus: each run adds only the documents that didn't exist last time, and with deltaStateKey set, already-collected documents are skipped — not returned, not billed. Over weeks you own a deduplicated, provenance-stamped Chinese-text corpus nobody else has, built at marginal cost.

1. Run it once with your input. Use corpus_pull with your topics (see the examples above), click Run, and check the SUMMARY record to confirm yield and quality look right.
2. Apify Console → Schedules → Create. Pick this Actor and your saved input. (Shortcut: open any finished run and click Schedule to pre-fill the input for you.)
3. Set a cron expression and save. For example 0 8 * * * = daily at 8am, or 0 * * * * = hourly for fast-moving topics. While you're there, enable the email notification on failed runs option so you know if a run ever needs attention.

Each scheduled run appends fresh documents to the same dataset, so your corpus grows continuously with zero manual work — no babysitting, no re-running by hand, no duplicate billing.

💸 Grow one corpus, pay once per document — `deltaStateKey`

Set a stable deltaStateKey and the Actor keeps a cross-run ledger of every document it has already delivered under that key (exact hashes and near-duplicate signatures). On every subsequent run, already-collected documents are skipped: not returned, not billed. Use a distinct key per corpus so independent projects don't collide — e.g. "ev-corpus-weekly" vs "beauty-corpus-weekly".

Example — a weekly EV-sector corpus that only ever bills for new documents:

{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "电动车", "充电桩"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 10000,
  "deltaStateKey": "ev-corpus-weekly"
}

Pair this with the cron above and the dataset becomes a living corpus: every run appends only genuinely new, gate-passing documents — no duplicate pulls, no duplicate cost.
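Conceptually, the ledger behaves like a persistent set of content hashes that each run consults before delivering a document. The sketch below models only the exact-hash half of that idea (the near-duplicate signatures are left out), uses an in-memory set in place of persisted state, and assumes SHA-256 over UTF-8 text; DeltaLedger is a hypothetical name, not the Actor's code.

```python
import hashlib

class DeltaLedger:
    """In-memory stand-in for the cross-run ledger kept under one deltaStateKey."""

    def __init__(self) -> None:
        self.seen = set()   # hex digests of every text already delivered

    def filter_new(self, texts: list) -> list:
        """Return only texts not delivered before, and record them as delivered."""
        fresh = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in self.seen:
                self.seen.add(digest)
                fresh.append(text)
        return fresh
```

Two separate ledger instances never share state, which mirrors the advice to use a distinct key per corpus.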

Integrations & data export

Export your corpus in JSON, CSV, Excel, or XML. Integrate directly with:

Google Sheets — sync document metadata for corpus QA dashboards
Zapier / Make / n8n — trigger downstream processing when a refresh run finishes
REST API — programmatic access from Python, JavaScript, or any language
Webhooks — real-time notifications when corpus pulls complete

See all integrations →

Scrape a Chinese corpus with Python, JavaScript, or no code

Use this Actor directly from the Apify Console (no coding required), or call it via the Apify API from any language:

Python example:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("zhorex/chinese-corpus-engine").call(run_input={
    "mode": "corpus_pull",
    "topics": ["新能源汽车"],
    "sources": ["weibo", "bilibili", "xueqiu", "douban"],
    "maxDocs": 1000,
})
for doc in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(doc["doc_id"], doc["quality"]["score"], doc["char_count"])
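Since text_clean is the field meant for training, a common next step is to write the items out as JSONL. This is a minimal sketch under the assumption that your pipeline wants one JSON object per line with the text plus a little lineage; the output keys (id, text, sha256) and the function name are illustrative choices, not a format the Actor prescribes.

```python
import json
from typing import Iterable, Iterator

def to_jsonl_lines(docs: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON line per document: training text plus minimal lineage fields."""
    for doc in docs:
        row = {
            "id": doc["doc_id"],
            "text": doc["text_clean"],
            "sha256": doc["provenance"]["content_sha256"],
        }
        # ensure_ascii=False keeps Chinese characters readable instead of \u-escaped
        yield json.dumps(row, ensure_ascii=False)
```

To persist the result, open a file with encoding="utf-8" and write each yielded line followed by a newline; the iterator from iterate_items() can be passed in directly.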

JavaScript example:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('zhorex/chinese-corpus-engine').call({
    mode: 'corpus_pull',
    topics: ['新能源汽车'],
    sources: ['weibo', 'bilibili', 'xueqiu', 'douban'],
    maxDocs: 1000,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((doc) => console.log(doc.doc_id, doc.quality.score));

Using the raw REST API (Postman / curl)

⚠️ The run endpoint is asynchronous — its response is the run object (IDs + status), NOT your corpus documents. If you POST to /acts/.../runs you get back something like { "data": { "status": "READY", "defaultDatasetId": "…" } } with no documents in it — that's expected, the run hasn't finished yet. The documents land in the run's dataset, not in that response. (Likewise, the containerUrl link is the live container; once a run finishes it just shows "run has already finished with status SUCCEEDED" — that message means success, it is not where the data lives.)

Easiest — one call that waits for the run and returns the documents directly:

curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"sources":["weibo","bilibili"],"maxDocs":500}'

The response body is the JSON array of corpus documents — no second call needed. (Best for small pulls; bulk corpus runs outlive the sync endpoint's timeout — use the async pattern below.)

Or async — start the run, then fetch the dataset once it finishes:

# 1) start the run — note the "defaultDatasetId" in the response
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"maxDocs":10000}'

# 2) when the run status is SUCCEEDED, fetch the documents from its dataset
curl "https://api.apify.com/v2/datasets/DEFAULT_DATASET_ID/items?token=YOUR_API_TOKEN"
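The "when the run status is SUCCEEDED" step implies polling. Here is a standard-library sketch of that loop, assuming the run ID is taken from the start-run response and that polling every 10 seconds is acceptable; wait_for_documents and its fetch parameter are hypothetical names, and fetch is injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def get_json(url: str):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def wait_for_documents(run_id, dataset_id, token, fetch=get_json, poll_secs=10.0):
    """Poll the run until it reaches a terminal status, then return its dataset items."""
    while True:
        status = fetch(f"{API}/actor-runs/{run_id}?token={token}")["data"]["status"]
        if status in TERMINAL:
            break
        time.sleep(poll_secs)
    if status != "SUCCEEDED":
        raise RuntimeError(f"run ended with status {status}")
    return fetch(f"{API}/datasets/{dataset_id}/items?token={token}")
```

If you already use apify-client, its call() method performs the waiting for you, so this loop is mainly useful from environments where only raw HTTP is available.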

💡 In the Apify Console you can also open any run and click the Output / Storage → Dataset tab to view and download the same data as JSON / CSV / Excel.

What this Actor is NOT

Not a Zhihu scraper. Zhihu is deliberately excluded from this Actor's source list.
Not WeChat or Douyin coverage. WeChat has no public scraping interface; Douyin is out of scope in the current release.
Not a rights-clearance service. It documents provenance; it does not — and cannot — grant copyright licenses or AI-training rights. See Legal positioning.
Not a monitoring dashboard. If you want recurring brand mentions with sentiment and reach weighting, that's the Chinese Brand Monitor's job — this engine builds corpora, not alerts.

Other scrapers by Zhorex

Chinese Digital Intelligence Suite:

Chinese Brand Monitor — Cross-platform brand mention aggregator across all 5 platforms ($0.045/mention)
Weibo Scraper — Posts, hot search, trending topics (580M+ users)
Bilibili Scraper — Video, danmaku, Gen-Z creator analytics (300M+ users)
RedNote (Xiaohongshu) Scraper — Search, posts, profiles, comments, videos (300M+ users)
Douban Scraper — Long-form reviews, ratings, group discussions (movies/books/music)
Xueqiu Scraper — Chinese stock-discussion sentiment, cashtag indexing
RedNote Shop Scraper — Xiaohongshu e-commerce products, vendors, prices
JD.com Scraper — JD product detail extraction

Reviews & alt-data:

Letterboxd Scraper — Western film reviews and ratings
G2 Reviews Scraper — B2B software reviews via public API
Capterra Reviews Scraper — Software reviews with sub-ratings
Booking.com Reviews Scraper — Hotel reviews and ratings
Review Intelligence Aggregator — Multi-source review aggregation

Markets & alt-data:

TradingView Scraper — Stocks, forex, crypto data
Hyperliquid Pro Scraper — DeFi top traders, vaults, perpetual markets

Streaming Analytics:

Twitch Scraper — Streamer profiles, live streams, clips
Kick.com Scraper — Kick streamer analytics
YouTube Shorts Scraper Pro — YouTube Shorts data

Other Tools:

Perplexity AI Search Scraper — AI search results
Telegram Channel Scraper — Telegram messages
Tech Stack Detector — Detect technologies used by websites
LinkedIn Company Enrichment — Enrich company records
Domain Authority Checker — Bulk SEO domain analysis
Phone Number Validator — Phone validation
Sneaker Price Tracker — Track sneaker prices across platforms

Your Review Matters ⭐

This is the only AI-corpus assembly engine in the Chinese Digital Intelligence Suite — and the only Apify Actor that ships Chinese social text already deduplicated, quality-gated, PII-scrubbed, and provenance-stamped. If it delivered the corpus you needed, a 30-second review helps a lot:

1. Go to the Chinese AI Training Corpus Engine page
2. Click the star rating (top of the page)
3. Optionally leave a one-line note (e.g. "pulled a 10K-doc deduplicated EV corpus in one run")

Why it matters: reviews are the #1 signal Apify users check before trying an Actor. A high rating means more teams find this engine instead of stitching together raw scrapers and a cleaning pipeline by hand — which means faster updates, more sources, and better support for everyone.

Found a bug or missing feature? Open an issue on the Actor page and it'll typically be fixed within 48 hours.

Last updated: June 2026 · Actively maintained · Trusted by AI training data teams, data vendors, and academic NLP researchers.
