Chinese AI Training Corpus Engine

Pricing

from $25.00 / 1,000 AI-ready documents (HTTP source)

Turn China's public web into AI-training-ready text. Pulls Weibo, Bilibili, Xueqiu, Douban & RedNote, then deduplicates, quality-scores, PII-scrubs and provenance-stamps every document. From $0.025/doc, pay-as-you-go. For LLM training-data teams, data vendors & academic NLP researchers.

Developer

Sami

Maintained by Community

Chinese AI Training Corpus Engine — Weibo + Bilibili + Xueqiu + Douban + RedNote

Turn China's public web into AI-training-ready text — deduplicated, quality-scored, PII-scrubbed, provenance-stamped. One run pulls topical documents from Weibo, Bilibili, Xueqiu, Douban, and (optionally) RedNote, runs every document through a cleaning + dedup + quality + provenance pipeline, and bills only the documents that survive the gates. Built for AI/LLM training-data teams, data vendors, and academic NLP researchers. No login, no API key, no VPN.

🏢 Sourcing a Chinese-language LLM training corpus at production scale?

This Actor assembles Chinese-language corpora at production scale: tens of thousands to hundreds of thousands of clean, deduplicated, provenance-stamped documents per run. Put it on a schedule and a single corpus keeps growing without you ever paying twice for the same document. It is a drop-in source for SFT/RLHF dataset builds, foundation-corpus slices, and data-vendor catalogs. Pay per document, no contract.

For high-volume / enterprise I offer bulk & volume pricing, custom output schemas matched to your training pipeline, dedicated proxy throughput for sustained bulk pulls, scheduled managed corpus feeds, and a schema-stability SLA (no breaking changes without 30-day notice).

→ DM me on Apify, open an Issue titled "Enterprise inquiry", or email samimassis2002@gmail.com (subject "Corpus Engine enterprise").

Part of the Chinese Digital Intelligence Suite by Zhorex

The only Apify developer specializing in Chinese-platform intelligence — built specifically for AI training data buyers, equity research analysts covering Chinese consumer brands, and brand monitoring teams:

| Platform | Users | Use Case | Link |
| --- | --- | --- | --- |
| 🆕 Chinese AI Training Corpus Engine | All 5 | Bulk AI-corpus assembly: dedup + quality scoring + PII scrub + per-document provenance ($0.025/doc) | You are here |
| Chinese Brand Monitor | All 5 | Cross-platform brand aggregator: sentiment + dedup + reach-weighted brand-health rollup ($0.045/mention) | Chinese Brand Monitor |
| Weibo | 580M+ | Public opinion, hot search, trending topics | Weibo Scraper |
| RedNote (Xiaohongshu) | 300M+ | Consumer reviews, lifestyle signal, brand sentiment | RedNote Scraper |
| Bilibili | 300M+ | Video content, danmaku, Gen-Z creator sentiment | Bilibili Scraper |
| Douban | 200M+ | Long-form reviews (movies/books/music), group discussions | Douban Scraper |
| Xueqiu | 20M+ | Stock-discussion sentiment, cashtag indexing | Xueqiu Scraper |
| RedNote Shop | 300M+ | RedShop e-commerce: products, vendors, prices | RedNote Shop Scraper |

Why use the suite — and which lane is yours? The suite has three lanes. The Chinese Brand Monitor is for recurring cross-platform brand monitoring — sentiment-tagged mentions on a schedule (it saves 4-6 hours vs. orchestrating individual scrapers). The single-platform scrapers are for deep extraction inside one platform — full comment trees, profiles, danmaku, long-form reviews, at the cheapest raw per-record price. The Corpus Engine is for bulk AI-corpus assembly: it pulls topical text across platforms in one run and ships every document already cleaned, deduplicated, quality-scored, PII-scrubbed, and provenance-stamped — the document your training pipeline ingests directly, not a raw record you still have to process.

Who buys this Actor

| Buyer profile | Use case | Typical spend |
| --- | --- | --- |
| AI / LLM training data teams | Chinese-language SFT/RLHF datasets and pretraining-corpus slices, deduplicated and quality-gated before they touch the pipeline | $200–$3,000 per corpus build |
| Data vendors / brokers | Resellable Chinese-text corpus slices with per-document provenance and content hashes for catalog documentation | $1,000–$10,000/mo |
| Academic NLP researchers | Reproducible Chinese social-text corpora (stable doc IDs, content hashes, pipeline version) for papers and classifier training | $30–$200/mo |
| AI compliance / governance teams | Per-document provenance audits (robots state, opt-out signals, PII counts) over corpora they already hold | $100–$500/mo |

What you get per document

Every billed document is a single self-contained JSON record — text plus everything your data pipeline, your dedup ledger, and your compliance documentation need:

{
  "doc_id": "weibo_post_5123456789012345",  // stable: {platform}_{record_type}_{native_id}
  "record_type": "post",                    // post | comment | group_topic | note
  "topic": "新能源汽车",                     // which of your input topics matched this doc
  "text_clean": "……",                       // boilerplate-stripped, PII-scrubbed; feed this to training
  "text_raw": "……",                         // original text (also PII-scrubbed when piiScrub is on)
  "char_count": 412,                        // length of text_clean
  "language": "zh-CN",                      // zh-CN | en | mixed | und
  "quality": {                              // 0-1 heuristic: length, charset, diversity,
    "score": 0.71,                          // punctuation sanity, sentence structure
    "flags": []                             // e.g. too_short, low_diversity, punct_spam
  },
  "pii": {                                  // what was found (and scrubbed when piiScrub: true)
    "emails": 0, "phones": 1,
    "national_ids": 0, "passports": 0, "total": 1
  },
  "dedup": {
    "cluster_id": "dup_3fa8c1d290bb",       // near-duplicate cluster this doc canonicalizes
    "is_canonical": true,                   // only canonical docs are returned and billed
    "near_dup_count": 2,                    // how many near-dups were collapsed (free)
    "duplicate_doc_ids": ["weibo_post_…"]   // up to 20 collapsed doc IDs
  },
  "engagement": { "likes": 230, "comments": 18, "shares": 4, "views": null },
  "provenance": {                           // the documentation layer; see EU AI Act section
    "platform": "weibo",
    "source_url": "https://weibo.com/…",
    "author_handle": "…",
    "published_at": "2026-06-08T11:32:00+08:00",
    "retrieved_at": "2026-06-10T09:14:55Z",
    "collection_method": "http_api",        // http_api | browser_render
    "robots_state": "allowed",              // allowed | disallowed | unavailable
    "opt_out_signals": [],                  // e.g. ["robots_disallow"]; filter on this
    "license_hint": "User-generated content; rights remain with original authors; platform ToS restricts redistribution. This record conveys no copyright license or AI-training rights — obtain rights independently.",
    "pipeline_version": "1.0.0",
    "content_sha256": "…"                   // hash of final text_clean; your dedup ledger key
  },
  "billing_event": "corpus-doc",            // corpus-doc | corpus-doc-browser
  "scrapedAt": "2026-06-10T09:14:55Z"
}
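
Since content_sha256 is described as the hash of the final text_clean, you can recompute it on your side to confirm a record was not altered in transit or storage. The sketch below is an assumption about the details: it hashes the UTF-8 bytes of text_clean with no extra normalization, which the record description does not confirm. If the digests disagree on real data, the Actor may be normalizing or encoding differently.

```python
import hashlib


def content_hash(text_clean: str) -> str:
    """SHA-256 hex digest of the text encoded as UTF-8 (the encoding is an assumption)."""
    return hashlib.sha256(text_clean.encode("utf-8")).hexdigest()


def hash_matches(doc: dict) -> bool:
    """True when the recomputed digest equals provenance.content_sha256."""
    return content_hash(doc["text_clean"]) == doc["provenance"]["content_sha256"].lower()


# Synthetic record for illustration only; real records come from the run's dataset.
sample = {"text_clean": "abc", "provenance": {"content_sha256": content_hash("abc")}}
print(hash_matches(sample))
```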

The pipeline behind it (every document, fixed order): normalize → language ID → boilerplate strip (URLs, repost chains, platform emoji codes, UI residue, zero-width chars) → PII scrub → quality score → near-duplicate detection with cluster IDs → billable gate → provenance stamp. Documents that fail the language filter, character floor, quality floor, or arrive as duplicates are dropped — not returned, not billed.
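
To make the near-duplicate step concrete, here is a minimal sketch of one common approach: character 3-gram shingles compared by Jaccard similarity, with the first document seen in a cluster kept as canonical. This is not the Actor's implementation, which is not documented here. The function names, the 0.8 threshold, and the greedy pairwise comparison are all illustrative assumptions, and the pairwise loop would be far too slow at corpus scale.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams; needs no word segmentation for Chinese."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def collapse_near_duplicates(docs, threshold=0.8):
    """Greedy pass: the first document seen in a cluster stays canonical."""
    canonicals = []  # list of (shingle_set, output_record)
    for doc in docs:
        sig = shingles(doc["text_clean"])
        for kept_sig, kept in canonicals:
            if jaccard(sig, kept_sig) >= threshold:
                kept["duplicate_doc_ids"].append(doc["doc_id"])
                break
        else:  # no existing cluster was close enough
            canonicals.append((sig, {"doc_id": doc["doc_id"], "duplicate_doc_ids": []}))
    return [record for _, record in canonicals]


docs = [
    {"doc_id": "a", "text_clean": "新能源汽车销量持续增长，充电桩建设加快"},
    {"doc_id": "b", "text_clean": "新能源汽车销量持续增长，充电桩建设加快！"},
    {"doc_id": "c", "text_clean": "今天天气很好，适合出门散步"},
]
clusters = collapse_near_duplicates(docs)
print(clusters)
```

Here "b" differs from "a" by one trailing character, so it collapses into the cluster led by "a", while "c" shares no 3-grams with either and starts its own cluster.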

Every run also writes a free SUMMARY record to the run's key-value store: per-source document counts, drop reasons, dedup ratio, and a quality-score histogram — so you can judge a pull's yield at a glance before scaling it up.
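
The SUMMARY record's exact schema is not spelled out here, so the sketch below does not read it. Instead it computes comparable yield numbers locally from the delivered documents, using only fields shown in the record above. The function name and the output keys are hypothetical. Drop reasons cannot be recomputed this way, because dropped documents never reach the dataset.

```python
from collections import Counter


def yield_stats(docs):
    """Per-platform counts and a 10-bucket quality histogram for delivered docs."""
    per_platform = Counter(d["provenance"]["platform"] for d in docs)
    histogram = [0] * 10  # bucket i covers scores in [i/10, (i+1)/10)
    for d in docs:
        bucket = min(int(d["quality"]["score"] * 10), 9)  # a score of 1.0 goes in the top bucket
        histogram[bucket] += 1
    return {"per_platform": dict(per_platform), "quality_histogram": histogram}


# Synthetic records for illustration only.
docs = [
    {"provenance": {"platform": "weibo"}, "quality": {"score": 0.71}},
    {"provenance": {"platform": "weibo"}, "quality": {"score": 0.35}},
    {"provenance": {"platform": "douban"}, "quality": {"score": 0.38}},
    {"provenance": {"platform": "douban"}, "quality": {"score": 1.0}},
]
stats = yield_stats(docs)
print(stats)
```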

Built for the EU AI Act provenance era

Under the EU AI Act, providers of general-purpose AI models must publish a sufficiently detailed summary of the content used for training, and EU text-and-data-mining rules require respecting machine-readable opt-out reservations. Most scraped datasets make that documentation work painful after the fact. This Actor does it per document, at collection time:

  • source_url + retrieved_at — where each document came from and exactly when it was collected
  • robots_state — whether the document's public URL path was allowed, disallowed, or unevaluable under the source domain's robots.txt at retrieval time
  • opt_out_signals — machine-readable opt-out indications detected for the document (e.g. robots_disallow); always present, even when empty
  • license_hint — a per-platform plain-language rights note attached to every record
  • content_sha256 + pipeline_version — reproducible content identity and processing lineage for your dataset documentation

Because these are first-class fields, downstream filtering is one line — e.g. drop everything with robots_state: "disallowed" or a non-empty opt_out_signals before a training run, and keep the audit trail showing you did.
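
As a sketch of that filter, the helper below keeps a document only when robots_state is not "disallowed" and opt_out_signals is empty, and returns the dropped documents separately so they can go into your audit trail. The function names are hypothetical. Treating "unavailable" as acceptable is a policy assumption made for this example; tighten it if your policy differs.

```python
def is_trainable(doc: dict) -> bool:
    """Keep a doc unless robots.txt disallowed it or an opt-out signal was detected."""
    prov = doc["provenance"]
    return prov["robots_state"] != "disallowed" and not prov["opt_out_signals"]


def split_by_policy(docs):
    """Return (kept, dropped) so the dropped set can be logged for audit."""
    kept, dropped = [], []
    for doc in docs:
        (kept if is_trainable(doc) else dropped).append(doc)
    return kept, dropped


# Synthetic records for illustration only.
docs = [
    {"doc_id": "d1", "provenance": {"robots_state": "allowed", "opt_out_signals": []}},
    {"doc_id": "d2", "provenance": {"robots_state": "disallowed", "opt_out_signals": ["robots_disallow"]}},
    {"doc_id": "d3", "provenance": {"robots_state": "unavailable", "opt_out_signals": []}},
]
kept, dropped = split_by_policy(docs)
print([d["doc_id"] for d in kept], [d["doc_id"] for d in dropped])
```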

Framing matters: this is documentation tooling, not legal clearance. These fields make a corpus documentable and filterable. They do not determine whether any given use of any given document is lawful in your jurisdiction — that judgment, and the rights to make it on, remain yours.

⚖️ Read this before buying

This Actor is a collection, structuring, and provenance-documentation TOOL. It accesses publicly available content only — the same content any anonymous browser visitor can see. It does not grant, transfer, or imply any copyright license or AI-training rights to the collected content. Rights remain with the original authors and platforms; the license_hint field on every record says exactly that. Obtain any rights you need independently, and consult legal counsel for your specific use case and jurisdiction. The provenance and opt-out fields help you document and filter a corpus — they are not, and cannot be, legal clearance.

FAQ

Q: Is scraping Weibo / Bilibili / Xueqiu / Douban / RedNote legal? A: This Actor accesses only publicly visible content on each platform — no login bypass, no private accounts, no DMs, no follower lists. Optional cookieStrings are user-supplied and used only to improve recall and rate limits, never to bypass authentication. Laws on collecting and using public web data vary by jurisdiction and purpose. Always consult legal counsel for your specific use case and jurisdiction.

Q: Does buying documents from this Actor give me the right to train AI models on them? A: No. No scraper can grant that — and this one explicitly does not claim to. The Actor collects, structures, and documents public content; copyright and related rights stay with the original authors and platforms. The per-document license_hint, robots_state, and opt_out_signals fields exist precisely so you (and your counsel) can make and document that determination yourselves.

Q: How does the PII scrub work, and what about PIPL / GDPR? A: With piiScrub: true (the default), emails, phone numbers, checksum-validated Chinese resident IDs, and passport numbers found in document text are replaced with [EMAIL] / [PHONE] / [CN_ID] / [PASSPORT] tokens, and per-document counts are reported in the pii field (counts are reported even if you turn scrubbing off). This materially reduces incidental personal data in the corpus, but user-generated text can carry personal information in forms no scrubber catches — under PIPL, GDPR, and similar regimes you remain responsible for how you process and retain the data. The scrub is a hygiene layer, not a compliance guarantee.
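
To illustrate what "checksum-validated" means, here is a minimal sketch of a token-replacement scrub. It is not the Actor's code. The regular expressions are simplified assumptions (mainland mobile numbers only, a loose email pattern, no passport handling), and the function names are hypothetical. The resident-ID check is the standard 18-character scheme: a weighted sum of the first 17 digits modulo 11 selects the expected check character.

```python
import re

ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"  # indexed by (weighted sum % 11)

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CN_ID_RE = re.compile(r"(?<![0-9A-Za-z])[0-9]{17}[0-9Xx](?![0-9A-Za-z])")
CN_MOBILE_RE = re.compile(r"(?<![0-9])1[3-9][0-9]{9}(?![0-9])")  # mainland mobiles only


def is_valid_cn_id(candidate: str) -> bool:
    """Validate the check character of an 18-character resident ID."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", candidate):
        return False
    total = sum(int(ch) * w for ch, w in zip(candidate[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == candidate[17].upper()


def scrub(text: str):
    """Replace PII with tokens and return (scrubbed_text, counts)."""
    counts = {"emails": 0, "phones": 0, "national_ids": 0}

    def replace_id(match):
        if is_valid_cn_id(match.group(0)):
            counts["national_ids"] += 1
            return "[CN_ID]"
        return match.group(0)  # failed the checksum, so leave it untouched

    text = CN_ID_RE.sub(replace_id, text)  # IDs first, so their digits are never read as phones
    text, counts["emails"] = EMAIL_RE.subn("[EMAIL]", text)
    text, counts["phones"] = CN_MOBILE_RE.subn("[PHONE]", text)
    return text, counts


scrubbed, found = scrub("电话13812345678，邮箱test@example.com，身份证11010519491231002X")
print(scrubbed, found)
```

Validating the checksum before replacing keeps arbitrary 18-digit strings (order numbers, tracking codes) from being counted as national IDs.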

Q: What does robots_state: "disallowed" mean on a record I received? A: It means that, at retrieval time, the source domain's robots.txt disallowed the document's public URL path for generic crawlers. The record is delivered anyway with that flag so you can apply your own policy — most training-data teams filter these out, and the field gives you a documented basis for doing so.

Q: Why is sentiment tagging not included? A: Deliberately out of scope for v1 — corpus buyers run their own labeling stacks, and bundling a sentiment model would add cost to every document whether you want it or not. If you need sentiment-tagged Chinese mentions, that's the Chinese Brand Monitor's lane ($0.045/mention, sentiment included).

Q: Can I run this from Python? A: Yes — see the Python example below. pip install apify-client and call zhorex/chinese-corpus-engine like any other Actor.

Modes

| Mode | What it does | Billed as |
| --- | --- | --- |
| corpus_pull | Scrape fresh documents for your topics from the selected sources and run the full pipeline | corpus-doc $0.025 / corpus-doc-browser $0.055 |
| dedup_merge | No scraping. Re-process datasets you already own (from this Actor or any suite Actor) into one canonical, deduplicated corpus | audit-record $0.003 per record processed |
| provenance_audit | No scraping. Emit a compact audit record (robots state, opt-out signals, license hint, PII counts, content hash) for every document in datasets you already own | audit-record $0.003 per audit record |

How to build a Chinese AI training corpus in 3 easy steps

  1. Go to the Chinese AI Training Corpus Engine on Apify Store and click "Try for free"
  2. Enter your topics (mix brands, categories, and themes for corpus diversity — Chinese keywords return far more native content than Latin spellings) and pick your sources
  3. Click Run and download AI-ready documents in JSON, CSV, or Excel

No coding required. No login needed. Works with Apify's free plan.

Example: evaluate corpus quality (start here)

A small pull to judge yield and quality before committing to volume:

{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "智能驾驶"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 1000,
  "minQuality": "0.35",
  "minCharCount": 40,
  "languages": ["zh-CN", "mixed"]
}

Check the SUMMARY record (drop reasons, dedup ratio, quality histogram), then raise maxDocs to 10,000–50,000 for the real build.

Example: SFT dataset build with strict quality

{
  "mode": "corpus_pull",
  "topics": ["护肤", "化妆品成分", "敏感肌"],
  "sources": ["weibo", "bilibili", "douban"],
  "maxDocs": 25000,
  "minQuality": "0.5",
  "minCharCount": 80,
  "languages": ["zh-CN"]
}

Example: add RedNote first-person reviews (browser source)

RedNote's full post bodies require JS rendering, so RedNote only runs when includeBrowserSources is enabled and bills at the higher corpus-doc-browser rate:

{
  "mode": "corpus_pull",
  "topics": ["国货美妆"],
  "sources": ["weibo", "rednote"],
  "includeBrowserSources": true,
  "maxDocs": 5000,
  "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}

Recommended run memory with browser sources on: 4096 MB.

Example: merge & dedup datasets you already own

Point it at datasets from previous runs, or from any Chinese Digital Intelligence Suite Actor, and get back one canonical corpus with cluster IDs at roughly an eighth of the scrape price ($0.003 vs. $0.025 per record):

{
  "mode": "dedup_merge",
  "inputDatasetIds": ["DATASET_ID_1", "DATASET_ID_2", "DATASET_ID_3"],
  "minQuality": "0.35"
}

Example: provenance audit of an existing corpus

{
  "mode": "provenance_audit",
  "inputDatasetIds": ["DATASET_ID_1"]
}

Each audit record carries doc_id, source_url, robots_state, opt_out_signals, license_hint, PII counts, content_sha256, and the dedup cluster ID — no text fields, just the documentation layer.

💾 Memory guidance for bulk runs. The default 2048 MB comfortably handles runs up to a practical ceiling of ~250,000 documents; for runs approaching 1M documents, set run memory to 8192 MB (dedup state is held in memory). With includeBrowserSources: true, use 4096 MB or more.

🍪 Optional cookies. The cookieStrings input (a secret field, encrypted at rest) accepts per-platform logged-in cookies, e.g. {"xueqiu": "xq_a_token=..."} — they improve recall and rate limits on the gated platforms. The Actor degrades gracefully without them. Use throwaway accounts and refresh roughly every 10 days for scheduled runs.

Pricing

This Actor uses Pay-Per-Event pricing — and the gates work in your favor: only documents that pass the quality floor, character floor, and deduplication are billed. Rejects, duplicates, and already-collected documents are free — not returned, not billed.

| Event | Price | Charged when |
| --- | --- | --- |
| corpus-doc | $0.025 | An AI-ready document from an HTTP source (Weibo / Bilibili / Xueqiu / Douban) passes all gates |
| corpus-doc-browser | $0.055 | An AI-ready document from a browser source (RedNote) passes all gates; only with includeBrowserSources: true |
| audit-record | $0.003 | Per record processed in dedup_merge; per audit record emitted in provenance_audit |

Pick the right lane (and the right price)

This engine deliberately does not replace the rest of the suite — each lane is priced for its job:

| Your job | Right tool | Price lane |
| --- | --- | --- |
| Raw records from one platform, cheapest per record: comments, danmaku, profiles, full review bodies | The single-platform scrapers (Weibo, Bilibili, Douban, Xueqiu, RedNote) | $0.005–$0.020 per raw record |
| Recurring brand monitoring: sentiment-tagged, reach-weighted mention feed | Chinese Brand Monitor | $0.045 per mention |
| AI-ready corpus: cleaned, deduplicated, quality-gated, PII-scrubbed, provenance-stamped documents | This Actor | $0.025 per document ($0.055 browser-sourced) |

At $0.025, a corpus document costs barely more than a single raw Weibo record — meaning the boilerplate stripping, near-duplicate collapse, quality scoring, PII scrub, and provenance stamping are effectively included free. If you just need raw records, the single-platform scrapers stay cheaper — use them. If you need documents your training pipeline can ingest as-is, this is the lane.

Realistic costs

| Workflow | Volume | Cost |
| --- | --- | --- |
| Corpus quality evaluation pull | 1,000 docs | ~$25 |
| SFT dataset build (HTTP sources) | 10,000 docs | ~$250 |
| Foundation-corpus slice (HTTP sources) | 50,000 docs | ~$1,250 |
| RedNote first-person review corpus (browser source) | 5,000 docs | ~$275 |
| Dedup & merge of datasets you already own | 100,000 records | ~$300 |
| Provenance audit of an existing corpus | 50,000 records | ~$150 |

Compare: licensing a comparable academic Chinese-text corpus runs $15K–$50K with a single-use license and months of delivery time — and arrives without per-document provenance.

Volume pricing available above 50K documents/month (see the Enterprise section at the top). Apify platform compute costs (RAM-seconds) are charged separately; browser-source runs also consume residential proxy bandwidth billed by GB on your own account.
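
The event prices above reduce to simple arithmetic, sketched below for budgeting a run before you launch it. The function name is hypothetical. The estimate covers billed events only, so it excludes platform compute and residential proxy bandwidth, and it assumes list prices rather than any volume discount. Prices are kept as integer thousandths of a dollar to avoid floating-point drift.

```python
# List prices in thousandths of a USD per event, taken from the pricing table.
PRICE_MILLS = {"corpus-doc": 25, "corpus-doc-browser": 55, "audit-record": 3}


def estimate_cost_usd(events: dict) -> float:
    """Sum event counts times list price; raises KeyError on an unknown event name."""
    total_mills = sum(PRICE_MILLS[name] * count for name, count in events.items())
    return total_mills / 1000


# A mixed pull: 4,000 HTTP-source docs plus 1,000 RedNote docs.
print(estimate_cost_usd({"corpus-doc": 4000, "corpus-doc-browser": 1000}))
```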

⏰ Set up a scheduled corpus refresh in 2 minutes

A corpus is not a one-off pull — it's an asset that compounds. A weekly or daily schedule turns a single topical pull into a continuously growing corpus: each run adds only the documents that didn't exist last time, and with deltaStateKey set, already-collected documents are skipped — not returned, not billed. Over weeks you own a deduplicated, provenance-stamped Chinese-text corpus nobody else has, built at marginal cost.

  1. Run it once with your input. Use corpus_pull with your topics (see the examples above), click Run, and check the SUMMARY record to confirm yield and quality look right.
  2. Apify Console → Schedules → Create. Pick this Actor and your saved input. (Shortcut: open any finished run and click Schedule to pre-fill the input for you.)
  3. Set a cron expression and save. For example 0 8 * * * = daily at 8am, or 0 * * * * = hourly for fast-moving topics. While you're there, enable the email notification on failed runs option so you know if a run ever needs attention.

Each scheduled run appends fresh documents to the same dataset, so your corpus grows continuously with zero manual work — no babysitting, no re-running by hand, no duplicate billing.

💸 Grow one corpus, pay once per document — deltaStateKey

Set a stable deltaStateKey and the Actor keeps a cross-run ledger of every document it has already delivered under that key (exact hashes and near-duplicate signatures). On every subsequent run, already-collected documents are skipped: not returned, not billed. Use a distinct key per corpus so independent projects don't collide — e.g. "ev-corpus-weekly" vs "beauty-corpus-weekly".

Example — a weekly EV-sector corpus that only ever bills for new documents:

{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "电动车", "充电桩"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 10000,
  "deltaStateKey": "ev-corpus-weekly"
}

Pair this with the cron above and the dataset becomes a living corpus: every run appends only genuinely new, gate-passing documents — no duplicate pulls, no duplicate cost.
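
The deltaStateKey ledger lives inside the Actor. If you also merge documents from other sources into the same corpus, a local ledger keyed on content_sha256 can catch exact repeats on your side. The sketch below is a hypothetical client-side helper, not part of the Actor. It stores hashes in a JSON file and does not reproduce the near-duplicate signatures the Actor keeps.

```python
import json
from pathlib import Path


def load_ledger(path) -> set:
    """Read previously seen content hashes; an absent file means an empty ledger."""
    p = Path(path)
    return set(json.loads(p.read_text(encoding="utf-8"))) if p.exists() else set()


def append_new(docs, ledger: set):
    """Return only docs whose content hash is not yet in the ledger, and record them."""
    fresh = []
    for doc in docs:
        key = doc["provenance"]["content_sha256"]
        if key not in ledger:
            ledger.add(key)
            fresh.append(doc)
    return fresh


def save_ledger(ledger: set, path) -> None:
    """Persist the ledger as a sorted JSON array so diffs stay readable."""
    Path(path).write_text(json.dumps(sorted(ledger)), encoding="utf-8")
```

A typical loop is load_ledger, then append_new on each new batch, then save_ledger once the batch has been written to your corpus store.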

Integrations & data export

Export your corpus in JSON, CSV, Excel, or XML. Integrate directly with:

  • Google Sheets — sync document metadata for corpus QA dashboards
  • Zapier / Make / n8n — trigger downstream processing when a refresh run finishes
  • REST API — programmatic access from Python, JavaScript, or any language
  • Webhooks — real-time notifications when corpus pulls complete

See all integrations →

Scrape a Chinese corpus with Python, JavaScript, or no code

Use this Actor directly from the Apify Console (no coding required), or call it via the Apify API from any language:

Python example:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("zhorex/chinese-corpus-engine").call(run_input={
    "mode": "corpus_pull",
    "topics": ["新能源汽车"],
    "sources": ["weibo", "bilibili", "xueqiu", "douban"],
    "maxDocs": 1000,
})

# Stream the billed documents from the run's default dataset.
for doc in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(doc["doc_id"], doc["quality"]["score"], doc["char_count"])

JavaScript example:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('zhorex/chinese-corpus-engine').call({
    mode: 'corpus_pull',
    topics: ['新能源汽车'],
    sources: ['weibo', 'bilibili', 'xueqiu', 'douban'],
    maxDocs: 1000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((doc) => console.log(doc.doc_id, doc.quality.score));
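
Once the documents are in hand, most training pipelines want them as JSON Lines. The sketch below writes one row per document from the fields shown in the record format. The function name, the output keys ("id", "text", "source_url"), and the optional score cutoff are hypothetical choices for this example; adapt them to whatever your loader expects.

```python
import json


def write_training_jsonl(docs, path, min_score=0.0):
    """Write one JSON object per line and return the number of rows written."""
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            if doc["quality"]["score"] < min_score:
                continue  # below the caller's own quality cutoff
            row = {
                "id": doc["doc_id"],
                "text": doc["text_clean"],
                "source_url": doc["provenance"]["source_url"],
            }
            # ensure_ascii=False keeps Chinese text readable instead of \u escapes
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Because it accepts any iterable, it can consume the iterate_items() generator from the Python example directly, without loading the whole corpus into memory.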

Using the raw REST API (Postman / curl)

⚠️ The run endpoint is asynchronous — its response is the run object (IDs + status), NOT your corpus documents. If you POST to /acts/.../runs you get back something like { "data": { "status": "READY", "defaultDatasetId": "…" } } with no documents in it — that's expected, the run hasn't finished yet. The documents land in the run's dataset, not in that response. (Likewise, the containerUrl link is the live container; once a run finishes it just shows "run has already finished with status SUCCEEDED" — that message means success, it is not where the data lives.)

Easiest — one call that waits for the run and returns the documents directly:

curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"mode":"corpus_pull","topics":["新能源汽车"],"sources":["weibo","bilibili"],"maxDocs":500}'

The response body is the JSON array of corpus documents — no second call needed. (Best for small pulls; bulk corpus runs outlive the sync endpoint's timeout — use the async pattern below.)

Or async — start the run, then fetch the dataset once it finishes:

# 1) start the run — note the "defaultDatasetId" in the response
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"mode":"corpus_pull","topics":["新能源汽车"],"maxDocs":10000}'
# 2) when the run status is SUCCEEDED, fetch the documents from its dataset
curl "https://api.apify.com/v2/datasets/DEFAULT_DATASET_ID/items?token=YOUR_API_TOKEN"
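
The "when the run status is SUCCEEDED" step can be automated by polling the run object. The sketch below uses only the standard library. It assumes you took the run's id from the response to step 1 and that the run-status endpoint is /v2/actor-runs/{runId}; neither appears in the text above, so check both against the Apify API reference. The helper names, the 10-second interval, and the poll limit are illustrative.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def http_get_json(url: str):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_run(run_id, token, get_json=http_get_json, poll_secs=10.0, max_polls=360):
    """Poll the run object until it reaches a terminal status, then return it."""
    url = f"{API}/actor-runs/{run_id}?token={token}"
    for _ in range(max_polls):
        run = get_json(url)["data"]
        if run["status"] in TERMINAL:
            return run
        time.sleep(poll_secs)
    raise TimeoutError(f"run {run_id} still not finished after {max_polls} polls")


def fetch_documents(run, token, get_json=http_get_json):
    """Fetch the dataset items of a finished run; refuses runs that did not succeed."""
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run ended with status {run['status']}")
    return get_json(f"{API}/datasets/{run['defaultDatasetId']}/items?token={token}&format=json")
```

Usage would look like run = wait_for_run("RUN_ID", "YOUR_API_TOKEN") followed by docs = fetch_documents(run, "YOUR_API_TOKEN"). For very large datasets, page through the items endpoint rather than fetching everything in one response.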

💡 In the Apify Console you can also open any run and click the Output / Storage → Dataset tab to view and download the same data as JSON / CSV / Excel.

What this Actor is NOT

  • Not a Zhihu scraper. Zhihu is deliberately excluded from this Actor's source list.
  • Not WeChat or Douyin coverage. WeChat has no public scraping interface; Douyin is out of scope in the current release.
  • Not a rights-clearance service. It documents provenance; it does not, and cannot, grant copyright licenses or AI-training rights. See "Read this before buying" above.
  • Not a monitoring dashboard. If you want recurring brand mentions with sentiment and reach weighting, that's the Chinese Brand Monitor's job — this engine builds corpora, not alerts.

Your Review Matters ⭐

This is the only AI-corpus assembly engine in the Chinese Digital Intelligence Suite — and the only Apify Actor that ships Chinese social text already deduplicated, quality-gated, PII-scrubbed, and provenance-stamped. If it delivered the corpus you needed, a 30-second review helps a lot:

  1. Go to the Chinese AI Training Corpus Engine page
  2. Click the star rating (top of the page)
  3. Optionally leave a one-line note (e.g. "pulled a 10K-doc deduplicated EV corpus in one run")

Why it matters: reviews are the #1 signal Apify users check before trying an Actor. A high rating means more teams find this engine instead of stitching together raw scrapers and a cleaning pipeline by hand — which means faster updates, more sources, and better support for everyone.

Found a bug or missing feature? Open an issue on the Actor page and it'll typically be fixed within 48 hours.


Last updated: June 2026 · Actively maintained · Trusted by AI training data teams, data vendors, and academic NLP researchers.