Chinese AI Training Corpus Engine
Pricing
from $25.00 / 1,000 AI-ready documents (HTTP source)
Turn China's public web into AI-training-ready text. Pulls Weibo, Bilibili, Xueqiu, Douban & RedNote, then deduplicates, quality-scores, PII-scrubs and provenance-stamps every document. From $0.025/doc, pay-as-you-go. For LLM training-data teams, data vendors & academic NLP researchers.
Developer
Sami
Chinese AI Training Corpus Engine — Weibo + Bilibili + Xueqiu + Douban + RedNote
Turn China's public web into AI-training-ready text — deduplicated, quality-scored, PII-scrubbed, provenance-stamped. One run pulls topical documents from Weibo, Bilibili, Xueqiu, Douban, and (optionally) RedNote, runs every document through a cleaning + dedup + quality + provenance pipeline, and bills only the documents that survive the gates. Built for AI/LLM training-data teams, data vendors, and academic NLP researchers. No login, no API key, no VPN.
🏢 Sourcing a Chinese-language LLM training corpus at production scale?
This Actor assembles Chinese-language corpora at corpus scale: tens of thousands to hundreds of thousands of clean, deduplicated, provenance-stamped documents per run — on a schedule that grows one corpus without ever paying twice for the same document. Drop-in for SFT/RLHF dataset builds, foundation-corpus slices, and data-vendor catalogs. Pay-per-document, no contract.
For high-volume / enterprise I offer bulk & volume pricing, custom output schemas matched to your training pipeline, dedicated proxy throughput for sustained bulk pulls, scheduled managed corpus feeds, and a schema-stability SLA (no breaking changes without 30-day notice).
→ DM me on Apify, open an Issue titled "Enterprise inquiry", or email samimassis2002@gmail.com (subject "Corpus Engine enterprise").
Table of contents
- Part of the Chinese Digital Intelligence Suite — where this Actor fits among the suite's three lanes
- Who buys this Actor — buyer profiles and typical spend
- What you get per document — the full annotated output record
- EU AI Act & provenance — per-document documentation fields
- Legal positioning & FAQ — what this tool is (and is not)
- Modes — corpus_pull, dedup_merge, provenance_audit with copy-paste inputs
- Pricing — $0.025/doc, billed only on documents that pass the gates
- Scheduled corpus refresh — grow one corpus on a cron, never re-billed
- Integrations — Sheets, Zapier, Make, n8n, REST API
- What this Actor is NOT
Part of the Chinese Digital Intelligence Suite by Zhorex
The only Apify developer specializing in Chinese-platform intelligence — built specifically for AI training data buyers, equity research analysts covering Chinese consumer brands, and brand monitoring teams:
| Platform | Users | Use Case | Link |
|---|---|---|---|
| 🆕 Chinese AI Training Corpus Engine | All 5 | Bulk AI-corpus assembly — dedup + quality scoring + PII scrub + per-document provenance ($0.025/doc) | You are here |
| Chinese Brand Monitor | All 5 | Cross-platform brand aggregator — sentiment + dedup + reach-weighted brand-health rollup ($0.045/mention) | Chinese Brand Monitor |
| Weibo | 580M+ | Public opinion, hot search, trending topics | Weibo Scraper |
| RedNote (Xiaohongshu) | 300M+ | Consumer reviews, lifestyle signal, brand sentiment | RedNote Scraper |
| Bilibili | 300M+ | Video content, danmaku, Gen-Z creator sentiment | Bilibili Scraper |
| Douban | 200M+ | Long-form reviews (movies/books/music), group discussions | Douban Scraper |
| Xueqiu | 20M+ | Stock-discussion sentiment, cashtag indexing | Xueqiu Scraper |
| RedNote Shop | 300M+ | RedShop e-commerce: products, vendors, prices | RedNote Shop Scraper |
Why use the suite — and which lane is yours? The suite has three lanes. The Chinese Brand Monitor is for recurring cross-platform brand monitoring — sentiment-tagged mentions on a schedule (it saves 4-6 hours vs. orchestrating individual scrapers). The single-platform scrapers are for deep extraction inside one platform — full comment trees, profiles, danmaku, long-form reviews, at the cheapest raw per-record price. The Corpus Engine is for bulk AI-corpus assembly: it pulls topical text across platforms in one run and ships every document already cleaned, deduplicated, quality-scored, PII-scrubbed, and provenance-stamped — the document your training pipeline ingests directly, not a raw record you still have to process.
Who buys this Actor
| Buyer profile | Use case | Typical spend |
|---|---|---|
| AI / LLM training data teams | Chinese-language SFT/RLHF datasets and pretraining-corpus slices, deduplicated and quality-gated before they touch the pipeline | $200–$3,000 per corpus build |
| Data vendors / brokers | Resellable Chinese-text corpus slices with per-document provenance and content hashes for catalog documentation | $1,000–$10,000/mo |
| Academic NLP researchers | Reproducible Chinese social-text corpora (stable doc IDs, content hashes, pipeline version) for papers and classifier training | $30–$200/mo |
| AI compliance / governance teams | Per-document provenance audits — robots state, opt-out signals, PII counts — over corpora they already hold | $100–$500/mo |
What you get per document
Every billed document is a single self-contained JSON record — text plus everything your data pipeline, your dedup ledger, and your compliance documentation need:
```jsonc
{
  "doc_id": "weibo_post_5123456789012345",  // stable: {platform}_{record_type}_{native_id}
  "record_type": "post",                    // post | comment | group_topic | note
  "topic": "新能源汽车",                     // which of your input topics matched this doc
  "text_clean": "……",                       // boilerplate-stripped, PII-scrubbed; feed this to training
  "text_raw": "……",                         // original text (also PII-scrubbed when piiScrub is on)
  "char_count": 412,                        // length of text_clean
  "language": "zh-CN",                      // zh-CN | en | mixed | und
  "quality": {                              // 0-1 heuristic: length, charset, diversity,
    "score": 0.71,                          // punctuation sanity, sentence structure
    "flags": []                             // e.g. too_short, low_diversity, punct_spam
  },
  "pii": {                                  // what was found (and scrubbed when piiScrub: true)
    "emails": 0, "phones": 1,
    "national_ids": 0, "passports": 0, "total": 1
  },
  "dedup": {
    "cluster_id": "dup_3fa8c1d290bb",       // near-duplicate cluster this doc canonicalizes
    "is_canonical": true,                   // only canonical docs are returned and billed
    "near_dup_count": 2,                    // how many near-dups were collapsed (free)
    "duplicate_doc_ids": ["weibo_post_…"]   // up to 20 collapsed doc IDs
  },
  "engagement": { "likes": 230, "comments": 18, "shares": 4, "views": null },
  "provenance": {                           // the documentation layer (see EU AI Act section)
    "platform": "weibo",
    "source_url": "https://weibo.com/…",
    "author_handle": "…",
    "published_at": "2026-06-08T11:32:00+08:00",
    "retrieved_at": "2026-06-10T09:14:55Z",
    "collection_method": "http_api",        // http_api | browser_render
    "robots_state": "allowed",              // allowed | disallowed | unavailable
    "opt_out_signals": [],                  // e.g. ["robots_disallow"]; filter on this
    "license_hint": "User-generated content; rights remain with original authors; platform ToS restricts redistribution. This record conveys no copyright license or AI-training rights — obtain rights independently.",
    "pipeline_version": "1.0.0",
    "content_sha256": "…"                   // hash of final text_clean; your dedup ledger key
  },
  "billing_event": "corpus-doc",            // corpus-doc | corpus-doc-browser
  "scrapedAt": "2026-06-10T09:14:55Z"
}
```
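The `doc_id` format in the record above can be split back into its parts. The following is a minimal sketch, not part of the Actor: `parse_doc_id` and `DocId` are hypothetical helper names, and it assumes platform names never contain an underscore while record types (such as `group_topic`) and native IDs may.

```python
from typing import NamedTuple

# Record types listed in the sample record's comment. "group_topic" contains an
# underscore, so a naive split("_") would break it apart.
RECORD_TYPES = ("group_topic", "comment", "post", "note")


class DocId(NamedTuple):
    platform: str
    record_type: str
    native_id: str


def parse_doc_id(doc_id: str) -> DocId:
    """Split a doc_id of the form {platform}_{record_type}_{native_id}."""
    # Assumption: the platform part never contains an underscore.
    platform, sep, rest = doc_id.partition("_")
    if not sep or not platform:
        raise ValueError(f"not a doc_id: {doc_id!r}")
    for record_type in RECORD_TYPES:
        prefix = record_type + "_"
        if rest.startswith(prefix) and len(rest) > len(prefix):
            return DocId(platform, record_type, rest[len(prefix):])
    raise ValueError(f"unknown record type in doc_id: {doc_id!r}")
```

For example, `parse_doc_id("weibo_post_5123456789012345")` yields the platform `weibo`, the record type `post`, and the native ID `5123456789012345`.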
The pipeline behind it (every document, fixed order): normalize → language ID → boilerplate strip (URLs, repost chains, platform emoji codes, UI residue, zero-width chars) → PII scrub → quality score → near-duplicate detection with cluster IDs → billable gate → provenance stamp. Documents that fail the language filter, character floor, quality floor, or arrive as duplicates are dropped — not returned, not billed.
Every run also writes a free SUMMARY record to the run's key-value store: per-source document counts, drop reasons, dedup ratio, and a quality-score histogram — so you can judge a pull's yield at a glance before scaling it up.
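Reading that record programmatically might look like the sketch below. It assumes the record is stored under the literal key `SUMMARY` in the run's default key-value store and that `client` behaves like `apify_client.ApifyClient`; `fetch_summary` is a hypothetical helper name. The field names inside the summary are not specified here, so the function returns the value untouched.

```python
from typing import Any, Optional


def fetch_summary(client: Any, run: dict) -> Optional[dict]:
    """Return the SUMMARY record's value for a finished run, or None if absent.

    `client` is expected to behave like apify_client.ApifyClient and `run` is
    the run object returned by `.call()`.
    """
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("SUMMARY")  # assumed key name
    # get_record returns None when the key does not exist.
    return record["value"] if record else None
```

With the official client this would be called as `fetch_summary(ApifyClient("YOUR_API_TOKEN"), run)` after a run completes.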
Built for the EU AI Act provenance era
Under the EU AI Act, providers of general-purpose AI models must publish a sufficiently detailed summary of the content used for training, and EU text-and-data-mining rules require respecting machine-readable opt-out reservations. Most scraped datasets make that documentation work painful after the fact. This Actor does it per document, at collection time:
- `source_url` + `retrieved_at`: where each document came from and exactly when it was collected
- `robots_state`: whether the document's public URL path was allowed, disallowed, or unevaluable under the source domain's robots.txt at retrieval time
- `opt_out_signals`: machine-readable opt-out indications detected for the document (e.g. `robots_disallow`); always present, even when empty
- `license_hint`: a per-platform plain-language rights note attached to every record
- `content_sha256` + `pipeline_version`: reproducible content identity and processing lineage for your dataset documentation
Because these are first-class fields, downstream filtering is one line — e.g. drop everything with robots_state: "disallowed" or a non-empty opt_out_signals before a training run, and keep the audit trail showing you did.
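A minimal sketch of that filter, assuming the fields sit under `provenance` as in the sample record; `drop_opted_out` is a hypothetical helper name, and keeping `unavailable` records is a policy choice made only for this illustration.

```python
from typing import Iterable, Iterator


def drop_opted_out(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield only documents with no robots disallow and no opt-out signals."""
    for doc in docs:
        prov = doc.get("provenance", {})
        if prov.get("robots_state") == "disallowed":
            continue
        if prov.get("opt_out_signals"):  # any non-empty list means an opt-out was detected
            continue
        yield doc
```

Writing the dropped `doc_id` values to a log alongside the kept set is one way to preserve the audit trail.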
Framing matters: this is documentation tooling, not legal clearance. These fields make a corpus documentable and filterable. They do not determine whether any given use of any given document is lawful in your jurisdiction — that judgment, and the rights to make it on, remain yours.
Legal positioning
⚖️ Read this before buying
This Actor is a collection, structuring, and provenance-documentation TOOL. It accesses publicly available content only: the same content any anonymous browser visitor can see. It does not grant, transfer, or imply any copyright license or AI-training rights to the collected content. Rights remain with the original authors and platforms; the `license_hint` field on every record says exactly that. Obtain any rights you need independently, and consult legal counsel for your specific use case and jurisdiction. The provenance and opt-out fields help you document and filter a corpus; they are not, and cannot be, legal clearance.
FAQ
Q: Is scraping Weibo / Bilibili / Xueqiu / Douban / RedNote legal?
A: This Actor accesses only publicly visible content on each platform — no login bypass, no private accounts, no DMs, no follower lists. Optional cookieStrings are user-supplied and used only to improve recall and rate limits, never to bypass authentication. Laws on collecting and using public web data vary by jurisdiction and purpose. Always consult legal counsel for your specific use case and jurisdiction.
Q: Does buying documents from this Actor give me the right to train AI models on them?
A: No. No scraper can grant that — and this one explicitly does not claim to. The Actor collects, structures, and documents public content; copyright and related rights stay with the original authors and platforms. The per-document license_hint, robots_state, and opt_out_signals fields exist precisely so you (and your counsel) can make and document that determination yourselves.
Q: How does the PII scrub work, and what about PIPL / GDPR?
A: With piiScrub: true (the default), emails, phone numbers, checksum-validated Chinese resident IDs, and passport numbers found in document text are replaced with [EMAIL] / [PHONE] / [CN_ID] / [PASSPORT] tokens, and per-document counts are reported in the pii field (counts are reported even if you turn scrubbing off). This materially reduces incidental personal data in the corpus, but user-generated text can carry personal information in forms no scrubber catches — under PIPL, GDPR, and similar regimes you remain responsible for how you process and retain the data. The scrub is a hygiene layer, not a compliance guarantee.
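To illustrate what "checksum-validated" means for Chinese resident IDs, here is a minimal sketch using the public GB 11643-1999 check-character rule. It is not the Actor's implementation: the function names and the candidate regex are assumptions, and only the `[CN_ID]` token is taken from the answer above.

```python
import re
from typing import Tuple

# GB 11643-1999: weights for the first 17 digits, and the check character
# selected by (weighted sum mod 11).
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"
# An 18-character candidate that is not embedded in a longer digit run.
_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])")


def is_valid_cn_id(candidate: str) -> bool:
    """Return True if the 18-character string passes the checksum."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", candidate):
        return False
    total = sum(int(d) * w for d, w in zip(candidate[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == candidate[17].upper()


def scrub_cn_ids(text: str) -> Tuple[str, int]:
    """Replace checksum-valid IDs with [CN_ID]; return (scrubbed text, count)."""
    count = 0

    def _replace(match: "re.Match[str]") -> str:
        nonlocal count
        if is_valid_cn_id(match.group(0)):
            count += 1
            return "[CN_ID]"
        return match.group(0)  # e.g. an 18-digit order number that fails the checksum

    return _CANDIDATE.sub(_replace, text), count
```

The checksum step is what keeps ordinary 18-digit numbers (order IDs, tracking numbers) from being replaced by mistake.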
Q: What does robots_state: "disallowed" mean on a record I received?
A: It means that, at retrieval time, the source domain's robots.txt disallowed the document's public URL path for generic crawlers. The record is delivered anyway with that flag so you can apply your own policy — most training-data teams filter these out, and the field gives you a documented basis for doing so.
Q: Why is sentiment tagging not included?
A: Deliberately out of scope for v1: corpus buyers run their own labeling stacks, and bundling a sentiment model would add cost to every document whether you want it or not. If you need sentiment-tagged Chinese mentions, that's the Chinese Brand Monitor's lane ($0.045/mention, sentiment included).
Q: Can I run this from Python?
A: Yes — see the Python example below. pip install apify-client and call zhorex/chinese-corpus-engine like any other Actor.
Modes
| Mode | What it does | Billed as |
|---|---|---|
| `corpus_pull` | Scrape fresh documents for your topics from the selected sources and run the full pipeline | `corpus-doc` $0.025 / `corpus-doc-browser` $0.055 |
| `dedup_merge` | No scraping: re-process datasets you already own (from this Actor or any suite Actor) into one canonical, deduplicated corpus | `audit-record` $0.003 per record processed |
| `provenance_audit` | No scraping: emit a compact audit record (robots state, opt-out signals, license hint, PII counts, content hash) for every document in datasets you already own | `audit-record` $0.003 per audit record |
How to build a Chinese AI training corpus in 3 easy steps
- Go to the Chinese AI Training Corpus Engine on Apify Store and click "Try for free"
- Enter your topics (mix brands, categories, and themes for corpus diversity — Chinese keywords return far more native content than Latin spellings) and pick your sources
- Click Run and download AI-ready documents in JSON, CSV, or Excel
No coding required. No login needed. Works with Apify's free plan.
Example: evaluate corpus quality (start here)
A small pull to judge yield and quality before committing to volume:
```json
{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "智能驾驶"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 1000,
  "minQuality": "0.35",
  "minCharCount": 40,
  "languages": ["zh-CN", "mixed"]
}
```
Check the SUMMARY record (drop reasons, dedup ratio, quality histogram), then raise maxDocs to 10,000–50,000 for the real build.
Example: SFT dataset build with strict quality
```json
{
  "mode": "corpus_pull",
  "topics": ["护肤", "化妆品成分", "敏感肌"],
  "sources": ["weibo", "bilibili", "douban"],
  "maxDocs": 25000,
  "minQuality": "0.5",
  "minCharCount": 80,
  "languages": ["zh-CN"]
}
```
Example: add RedNote first-person reviews (browser source)
RedNote's full post bodies require JS rendering, so RedNote only runs when includeBrowserSources is enabled and bills at the higher corpus-doc-browser rate:
```json
{
  "mode": "corpus_pull",
  "topics": ["国货美妆"],
  "sources": ["weibo", "rednote"],
  "includeBrowserSources": true,
  "maxDocs": 5000,
  "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}
```
Recommended run memory with browser sources on: 4096 MB.
Example: merge & dedup datasets you already own
Point it at datasets from previous runs, or from any Chinese Digital Intelligence Suite Actor, and get back one canonical corpus with cluster IDs at $0.003 per record processed, roughly an eighth of the $0.025 scrape price:
```json
{
  "mode": "dedup_merge",
  "inputDatasetIds": ["DATASET_ID_1", "DATASET_ID_2", "DATASET_ID_3"],
  "minQuality": "0.35"
}
```
Example: provenance audit of an existing corpus
```json
{
  "mode": "provenance_audit",
  "inputDatasetIds": ["DATASET_ID_1"]
}
```
Each audit record carries doc_id, source_url, robots_state, opt_out_signals, license_hint, PII counts, content_sha256, and the dedup cluster ID — no text fields, just the documentation layer.
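A quick tally over audit records could be sketched as follows. The exact layout of an audit record is not shown here, so this assumes `robots_state` and `opt_out_signals` are top-level keys; `tally_audit` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def tally_audit(records: Iterable[dict]) -> dict:
    """Count robots states and records carrying at least one opt-out signal."""
    robots = Counter()
    with_opt_out = 0
    total = 0
    for rec in records:
        total += 1
        # Assumption: audit records expose these fields at the top level.
        robots[rec.get("robots_state", "missing")] += 1
        if rec.get("opt_out_signals"):
            with_opt_out += 1
    return {
        "total": total,
        "robots_state": dict(robots),
        "with_opt_out_signals": with_opt_out,
    }
```

The resulting counts give a one-glance view of how much of an existing corpus a strict opt-out policy would remove.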
💾 Memory guidance for bulk runs. The default 2048 MB comfortably handles runs up to a practical ceiling of ~250,000 documents; for runs approaching 1M documents, set run memory to 8192 MB (dedup state is held in memory). With `includeBrowserSources: true`, use 4096 MB or more.
🍪 Optional cookies. The `cookieStrings` input (a secret field, encrypted at rest) accepts per-platform logged-in cookies, e.g. `{"xueqiu": "xq_a_token=..."}`; they improve recall and rate limits on the gated platforms. The Actor degrades gracefully without them. Use throwaway accounts and refresh roughly every 10 days for scheduled runs.
Pricing
This Actor uses Pay-Per-Event pricing — and the gates work in your favor: only documents that pass the quality floor, character floor, and deduplication are billed. Rejects, duplicates, and already-collected documents are free — not returned, not billed.
| Event | Price | Charged when |
|---|---|---|
| `corpus-doc` | $0.025 | An AI-ready document from an HTTP source (Weibo / Bilibili / Xueqiu / Douban) passes all gates |
| `corpus-doc-browser` | $0.055 | An AI-ready document from a browser source (RedNote) passes all gates; only with `includeBrowserSources: true` |
| `audit-record` | $0.003 | Per record processed in `dedup_merge`; per audit record emitted in `provenance_audit` |
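The event prices in the table translate into a simple budget estimate. The sketch below covers Actor event charges only; `estimate_cost` is a hypothetical helper, and platform compute and proxy bandwidth are not included.

```python
from decimal import Decimal

# Per-event prices in USD, copied from the pricing table.
PRICES = {
    "corpus-doc": Decimal("0.025"),
    "corpus-doc-browser": Decimal("0.055"),
    "audit-record": Decimal("0.003"),
}


def estimate_cost(http_docs: int = 0, browser_docs: int = 0, audit_records: int = 0) -> Decimal:
    """Estimate event charges for a run, counting billed documents only."""
    return (
        PRICES["corpus-doc"] * http_docs
        + PRICES["corpus-doc-browser"] * browser_docs
        + PRICES["audit-record"] * audit_records
    )


if __name__ == "__main__":
    print(estimate_cost(http_docs=10_000))      # SFT build from HTTP sources
    print(estimate_cost(browser_docs=5_000))    # RedNote review corpus
    print(estimate_cost(audit_records=50_000))  # provenance audit
```

`Decimal` is used instead of floats so that totals stay exact to the cent. Because rejects and duplicates are never billed, the document counts passed in should be the expected surviving documents, not raw records scraped.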
Pick the right lane (and the right price)
This engine deliberately does not replace the rest of the suite — each lane is priced for its job:
| Your job | Right tool | Price lane |
|---|---|---|
| Raw records from one platform, cheapest per record — comments, danmaku, profiles, full review bodies | The single-platform scrapers (Weibo, Bilibili, Douban, Xueqiu, RedNote) | $0.005–$0.020 per raw record |
| Recurring brand monitoring — sentiment-tagged, reach-weighted mention feed | Chinese Brand Monitor | $0.045 per mention |
| AI-ready corpus — cleaned, deduplicated, quality-gated, PII-scrubbed, provenance-stamped documents | This Actor | $0.025 per document ($0.055 browser-sourced) |
At $0.025, a corpus document costs barely more than a single raw Weibo record — meaning the boilerplate stripping, near-duplicate collapse, quality scoring, PII scrub, and provenance stamping are effectively included free. If you just need raw records, the single-platform scrapers stay cheaper — use them. If you need documents your training pipeline can ingest as-is, this is the lane.
Realistic costs
| Workflow | Volume | Cost |
|---|---|---|
| Corpus quality evaluation pull | 1,000 docs | ~$25 |
| SFT dataset build (HTTP sources) | 10,000 docs | ~$250 |
| Foundation-corpus slice (HTTP sources) | 50,000 docs | ~$1,250 |
| RedNote first-person review corpus (browser source) | 5,000 docs | ~$275 |
| Dedup & merge of datasets you already own | 100,000 records | ~$300 |
| Provenance audit of an existing corpus | 50,000 records | ~$150 |
Compare: licensing a comparable academic Chinese-text corpus runs $15K–$50K with a single-use license and months of delivery time — and arrives without per-document provenance.
Volume pricing available above 50K documents/month (see the Enterprise section at the top). Apify platform compute costs (RAM-seconds) are charged separately; browser-source runs also consume residential proxy bandwidth billed by GB on your own account.
⏰ Set up a scheduled corpus refresh in 2 minutes
A corpus is not a one-off pull — it's an asset that compounds. A weekly or daily schedule turns a single topical pull into a continuously growing corpus: each run adds only the documents that didn't exist last time, and with deltaStateKey set, already-collected documents are skipped — not returned, not billed. Over weeks you own a deduplicated, provenance-stamped Chinese-text corpus nobody else has, built at marginal cost.
- Run it once with your input. Use `corpus_pull` with your topics (see the examples above), click Run, and check the `SUMMARY` record to confirm yield and quality look right.
- Apify Console → Schedules → Create. Pick this Actor and your saved input. (Shortcut: open any finished run and click Schedule to pre-fill the input for you.)
- Set a cron expression and save. For example `0 8 * * *` = daily at 8am, or `0 * * * *` = hourly for fast-moving topics. While you're there, enable the email notification on failed runs option so you know if a run ever needs attention.
Each scheduled run appends fresh documents to the same dataset, so your corpus grows continuously with zero manual work — no babysitting, no re-running by hand, no duplicate billing.
💸 Grow one corpus, pay once per document — deltaStateKey
Set a stable deltaStateKey and the Actor keeps a cross-run ledger of every document it has already delivered under that key (exact hashes and near-duplicate signatures). On every subsequent run, already-collected documents are skipped: not returned, not billed. Use a distinct key per corpus so independent projects don't collide — e.g. "ev-corpus-weekly" vs "beauty-corpus-weekly".
Example — a weekly EV-sector corpus that only ever bills for new documents:
```json
{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "电动车", "充电桩"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 10000,
  "deltaStateKey": "ev-corpus-weekly"
}
```
Pair this with the cron above and the dataset becomes a living corpus: every run appends only genuinely new, gate-passing documents — no duplicate pulls, no duplicate cost.
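If you also merge documents on your own side (for example across corpora built under different keys), the same idea can be mirrored locally with `content_sha256` as the ledger key. This is a minimal sketch, not the Actor's ledger: `append_new_docs` and the JSON ledger file are assumptions, and it catches exact hash matches only, not near-duplicates.

```python
import json
from pathlib import Path
from typing import Iterable, List, Union


def append_new_docs(docs: Iterable[dict], ledger_path: Union[str, Path]) -> List[dict]:
    """Return docs whose content_sha256 is not yet in the ledger, then update it."""
    path = Path(ledger_path)
    # The ledger is a plain JSON list of hashes already accepted.
    seen = set(json.loads(path.read_text(encoding="utf-8"))) if path.exists() else set()
    fresh = []
    for doc in docs:
        digest = doc["provenance"]["content_sha256"]
        if digest in seen:
            continue  # already collected in an earlier batch (or earlier in this one)
        seen.add(digest)
        fresh.append(doc)
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")
    return fresh
```

Calling it twice with the same batch returns the documents the first time and an empty list the second time.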
Integrations & data export
Export your corpus in JSON, CSV, Excel, or XML. Integrate directly with:
- Google Sheets — sync document metadata for corpus QA dashboards
- Zapier / Make / n8n — trigger downstream processing when a refresh run finishes
- REST API — programmatic access from Python, JavaScript, or any language
- Webhooks — real-time notifications when corpus pulls complete
Scrape a Chinese corpus with Python, JavaScript, or no code
Use this Actor directly from the Apify Console (no coding required), or call it via the Apify API from any language:
Python example:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("zhorex/chinese-corpus-engine").call(run_input={
    "mode": "corpus_pull",
    "topics": ["新能源汽车"],
    "sources": ["weibo", "bilibili", "xueqiu", "douban"],
    "maxDocs": 1000,
})

# Stream the billed documents from the run's default dataset.
for doc in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(doc["doc_id"], doc["quality"]["score"], doc["char_count"])
```
JavaScript example:
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor('zhorex/chinese-corpus-engine').call({
    mode: 'corpus_pull',
    topics: ['新能源汽车'],
    sources: ['weibo', 'bilibili', 'xueqiu', 'douban'],
    maxDocs: 1000,
});

// Fetch the billed documents from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((doc) => console.log(doc.doc_id, doc.quality.score));
```
Using the raw REST API (Postman / curl)
⚠️ The run endpoint is asynchronous: its response is the run object (IDs + status), NOT your corpus documents. If you `POST` to `/acts/.../runs` you get back something like `{ "data": { "status": "READY", "defaultDatasetId": "…" } }` with no documents in it. That's expected: the run hasn't finished yet. The documents land in the run's dataset, not in that response. (Likewise, the `containerUrl` link is the live container; once a run finishes it just shows "run has already finished with status SUCCEEDED". That message means success; it is not where the data lives.)
Easiest — one call that waits for the run and returns the documents directly:
```bash
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"sources":["weibo","bilibili"],"maxDocs":500}'
```
The response body is the JSON array of corpus documents — no second call needed. (Best for small pulls; bulk corpus runs outlive the sync endpoint's timeout — use the async pattern below.)
Or async — start the run, then fetch the dataset once it finishes:
```bash
# 1) start the run; note the "defaultDatasetId" in the response
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"maxDocs":10000}'

# 2) when the run status is SUCCEEDED, fetch the documents from its dataset
curl "https://api.apify.com/v2/datasets/DEFAULT_DATASET_ID/items?token=YOUR_API_TOKEN"
```
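The "when the run status is SUCCEEDED" step implies polling between the two calls. A minimal sketch of that loop follows, assuming the run `id` from the start-run response is checked through Apify's `GET /v2/actor-runs/{runId}` endpoint. `get_run_status` and `wait_for_run` are hypothetical helper names, and the polling interval and timeout are arbitrary defaults.

```python
import json
import time
import urllib.request
from typing import Callable

API = "https://api.apify.com/v2"
# Statuses after which a run will not change any more.
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def get_run_status(run_id: str, token: str) -> str:
    """Fetch the current status string of a run from the Apify API."""
    url = f"{API}/actor-runs/{run_id}?token={token}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["status"]


def wait_for_run(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the run reaches a terminal status, then return it."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Typical use would be `wait_for_run(lambda: get_run_status(run_id, "YOUR_API_TOKEN"))`, fetching the dataset items only if the returned status is `SUCCEEDED`.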
💡 In the Apify Console you can also open any run and click the Output / Storage → Dataset tab to view and download the same data as JSON / CSV / Excel.
What this Actor is NOT
- Not a Zhihu scraper. Zhihu is deliberately excluded from this Actor's source list.
- Not WeChat or Douyin coverage. WeChat has no public scraping interface; Douyin is out of scope in the current release.
- Not a rights-clearance service. It documents provenance; it does not — and cannot — grant copyright licenses or AI-training rights. See Legal positioning.
- Not a monitoring dashboard. If you want recurring brand mentions with sentiment and reach weighting, that's the Chinese Brand Monitor's job — this engine builds corpora, not alerts.
Other scrapers by Zhorex
Chinese Digital Intelligence Suite:
- Chinese Brand Monitor — Cross-platform brand mention aggregator across all 5 platforms ($0.045/mention)
- Weibo Scraper — Posts, hot search, trending topics (580M+ users)
- Bilibili Scraper — Video, danmaku, Gen-Z creator analytics (300M+ users)
- RedNote (Xiaohongshu) Scraper — Search, posts, profiles, comments, videos (300M+ users)
- Douban Scraper — Long-form reviews, ratings, group discussions (movies/books/music)
- Xueqiu Scraper — Chinese stock-discussion sentiment, cashtag indexing
- RedNote Shop Scraper — Xiaohongshu e-commerce products, vendors, prices
- JD.com Scraper — JD product detail extraction
Reviews & alt-data:
- Letterboxd Scraper — Western film reviews and ratings
- G2 Reviews Scraper — B2B software reviews via public API
- Capterra Reviews Scraper — Software reviews with sub-ratings
- Booking.com Reviews Scraper — Hotel reviews and ratings
- Review Intelligence Aggregator — Multi-source review aggregation
Markets & alt-data:
- TradingView Scraper — Stocks, forex, crypto data
- Hyperliquid Pro Scraper — DeFi top traders, vaults, perpetual markets
Streaming Analytics:
- Twitch Scraper — Streamer profiles, live streams, clips
- Kick.com Scraper — Kick streamer analytics
- YouTube Shorts Scraper Pro — YouTube Shorts data
Other Tools:
- Perplexity AI Search Scraper — AI search results
- Telegram Channel Scraper — Telegram messages
- Tech Stack Detector — Detect technologies used by websites
- LinkedIn Company Enrichment — Enrich company records
- Domain Authority Checker — Bulk SEO domain analysis
- Phone Number Validator — Phone validation
- Sneaker Price Tracker — Track sneaker prices across platforms
Your Review Matters ⭐
This is the only AI-corpus assembly engine in the Chinese Digital Intelligence Suite — and the only Apify Actor that ships Chinese social text already deduplicated, quality-gated, PII-scrubbed, and provenance-stamped. If it delivered the corpus you needed, a 30-second review helps a lot:
- Go to the Chinese AI Training Corpus Engine page
- Click the star rating (top of the page)
- Optionally leave a one-line note (e.g. "pulled a 10K-doc deduplicated EV corpus in one run")
Why it matters: reviews are the #1 signal Apify users check before trying an Actor. A high rating means more teams find this engine instead of stitching together raw scrapers and a cleaning pipeline by hand — which means faster updates, more sources, and better support for everyone.
Found a bug or missing feature? Open an issue on the Actor page and it'll typically be fixed within 48 hours.
Last updated: June 2026 · Actively maintained · Trusted by AI training data teams, data vendors, and academic NLP researchers.