# Chinese AI Training Corpus Engine (`zhorex/chinese-corpus-engine`) Actor

Turn China's public web into AI-training-ready text. Pulls Weibo, Bilibili, Xueqiu, Douban & RedNote, then deduplicates, quality-scores, PII-scrubs and provenance-stamps every document. From $0.025/doc, pay-as-you-go. For LLM training-data teams, data vendors & academic NLP researchers.

- **URL**: https://apify.com/zhorex/chinese-corpus-engine.md
- **Developed by:** [Sami](https://apify.com/zhorex) (community)
- **Categories:** AI, Social media, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $25.00 / 1,000 AI-ready documents (HTTP source)

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [Python, JavaScript, and REST API examples](#scrape-a-chinese-corpus-with-python-javascript-or-no-code) below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Chinese AI Training Corpus Engine — Weibo + Bilibili + Xueqiu + Douban + RedNote

Turn China's public web into **AI-training-ready text** — deduplicated, quality-scored, PII-scrubbed, provenance-stamped. One run pulls topical documents from **Weibo, Bilibili, Xueqiu, Douban, and (optionally) RedNote**, runs every document through a cleaning + dedup + quality + provenance pipeline, and bills **only the documents that survive the gates**. Built for **AI/LLM training-data teams, data vendors, and academic NLP researchers**. No login, no API key, no VPN.

> 🏢 **Sourcing a Chinese-language LLM training corpus at production scale?**
>
> This Actor assembles Chinese-language corpora at **corpus scale: tens of thousands to hundreds of thousands of clean, deduplicated, provenance-stamped documents per run — on a schedule that grows one corpus without ever paying twice for the same document.** Drop-in for SFT/RLHF dataset builds, foundation-corpus slices, and data-vendor catalogs. Pay-per-document, no contract.
>
> For high-volume / enterprise I offer **bulk & volume pricing, custom output schemas matched to your training pipeline, dedicated proxy throughput for sustained bulk pulls, scheduled managed corpus feeds, and a schema-stability SLA** (no breaking changes without 30-day notice).
>
> → DM me on Apify, open an Issue titled **"Enterprise inquiry"**, or email **samimassis2002@gmail.com** (subject **"Corpus Engine enterprise"**).

### Table of contents

- [Part of the Chinese Digital Intelligence Suite](#part-of-the-chinese-digital-intelligence-suite-by-zhorex) — where this Actor fits among the suite's three lanes
- [Who buys this Actor](#who-buys-this-actor) — buyer profiles and typical spend
- [What you get per document](#what-you-get-per-document) — the full annotated output record
- [EU AI Act & provenance](#built-for-the-eu-ai-act-provenance-era) — per-document documentation fields
- [Legal positioning & FAQ](#legal-positioning) — what this tool is (and is not)
- [Modes](#modes) — corpus_pull, dedup_merge, provenance_audit with copy-paste inputs
- [Pricing](#pricing) — $0.025/doc, billed only on documents that pass the gates
- [Scheduled corpus refresh](#-set-up-a-scheduled-corpus-refresh-in-2-minutes) — grow one corpus on a cron, never re-billed
- [Integrations](#integrations--data-export) — Sheets, Zapier, Make, n8n, REST API
- [What this Actor is NOT](#what-this-actor-is-not)

### Part of the Chinese Digital Intelligence Suite by Zhorex

The only Apify developer specializing in Chinese-platform intelligence — built specifically for AI training data buyers, equity research analysts covering Chinese consumer brands, and brand monitoring teams:

| Platform | Users | Use Case | Link |
|----------|-------|----------|------|
| 🆕 **Chinese AI Training Corpus Engine** | All 5 | **Bulk AI-corpus assembly** — dedup + quality scoring + PII scrub + per-document provenance ($0.025/doc) | *You are here* |
| **Chinese Brand Monitor** | All 5 | **Cross-platform brand aggregator** — sentiment + dedup + **reach-weighted brand-health rollup** ($0.045/mention) | [Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor) |
| **Weibo** | 580M+ | Public opinion, hot search, trending topics | [Weibo Scraper](https://apify.com/zhorex/weibo-scraper) |
| **RedNote (Xiaohongshu)** | 300M+ | Consumer reviews, lifestyle signal, brand sentiment | [RedNote Scraper](https://apify.com/zhorex/rednote-xiaohongshu-scraper) |
| **Bilibili** | 300M+ | Video content, danmaku, Gen-Z creator sentiment | [Bilibili Scraper](https://apify.com/zhorex/bilibili-scraper) |
| **Douban** | 200M+ | Long-form reviews (movies/books/music), group discussions | [Douban Scraper](https://apify.com/zhorex/douban-scraper) |
| **Xueqiu** | 20M+ | Stock-discussion sentiment, cashtag indexing | [Xueqiu Scraper](https://apify.com/zhorex/xueqiu-scraper) |
| **RedNote Shop** | 300M+ | RedShop e-commerce: products, vendors, prices | [RedNote Shop Scraper](https://apify.com/zhorex/rednote-shop-scraper) |

**Why use the suite — and which lane is yours?** The suite has three lanes. The **[Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor)** is for *recurring cross-platform brand monitoring* — sentiment-tagged mentions on a schedule (it saves 4-6 hours vs. orchestrating individual scrapers). The **single-platform scrapers** are for *deep extraction inside one platform* — full comment trees, profiles, danmaku, long-form reviews, at the cheapest raw per-record price. The **Corpus Engine** is for *bulk AI-corpus assembly*: it pulls topical text across platforms in one run and ships every document already cleaned, deduplicated, quality-scored, PII-scrubbed, and provenance-stamped — the document your training pipeline ingests directly, not a raw record you still have to process.

### Who buys this Actor

| Buyer profile | Use case | Typical spend |
|---|---|---|
| **AI / LLM training data teams** | Chinese-language SFT/RLHF datasets and pretraining-corpus slices, deduplicated and quality-gated before they touch the pipeline | $200–$3,000 per corpus build |
| **Data vendors / brokers** | Resellable Chinese-text corpus slices with per-document provenance and content hashes for catalog documentation | $1,000–$10,000/mo |
| **Academic NLP researchers** | Reproducible Chinese social-text corpora (stable doc IDs, content hashes, pipeline version) for papers and classifier training | $30–$200/mo |
| **AI compliance / governance teams** | Per-document provenance audits — robots state, opt-out signals, PII counts — over corpora they already hold | $100–$500/mo |

### What you get per document

Every billed document is a single self-contained JSON record — text plus everything your data pipeline, your dedup ledger, and your compliance documentation need:

```json
{
  "doc_id": "weibo_post_5123456789012345",      // stable: {platform}_{record_type}_{native_id}
  "record_type": "post",                         // post | comment | group_topic | note
  "topic": "新能源汽车",                          // which of your input topics matched this doc
  "text_clean": "……",                            // boilerplate-stripped, PII-scrubbed — feed this to training
  "text_raw": "……",                              // original text (also PII-scrubbed when piiScrub is on)
  "char_count": 412,                             // length of text_clean
  "language": "zh-CN",                           // zh-CN | en | mixed | und
  "quality": {                                   // 0-1 heuristic: length, charset, diversity,
    "score": 0.71,                               //   punctuation sanity, sentence structure
    "flags": []                                  // e.g. too_short, low_diversity, punct_spam
  },
  "pii": {                                       // what was found (and scrubbed when piiScrub: true)
    "emails": 0, "phones": 1,
    "national_ids": 0, "passports": 0, "total": 1
  },
  "dedup": {
    "cluster_id": "dup_3fa8c1d290bb",            // near-duplicate cluster this doc canonicalizes
    "is_canonical": true,                        // only canonical docs are returned and billed
    "near_dup_count": 2,                         // how many near-dups were collapsed (free)
    "duplicate_doc_ids": ["weibo_post_…"]        // up to 20 collapsed doc IDs
  },
  "engagement": { "likes": 230, "comments": 18, "shares": 4, "views": null },
  "provenance": {                                // the documentation layer — see EU AI Act section
    "platform": "weibo",
    "source_url": "https://weibo.com/…",
    "author_handle": "…",
    "published_at": "2026-06-08T11:32:00+08:00",
    "retrieved_at": "2026-06-10T09:14:55Z",
    "collection_method": "http_api",             // http_api | browser_render
    "robots_state": "allowed",                   // allowed | disallowed | unavailable
    "opt_out_signals": [],                       // e.g. ["robots_disallow"] — filter on this
    "license_hint": "User-generated content; rights remain with original authors; platform ToS restricts redistribution. This record conveys no copyright license or AI-training rights — obtain rights independently.",
    "pipeline_version": "1.0.0",
    "content_sha256": "…"                        // hash of final text_clean — your dedup ledger key
  },
  "billing_event": "corpus-doc",                 // corpus-doc | corpus-doc-browser
  "scrapedAt": "2026-06-10T09:14:55Z"
}
```

**The pipeline behind it** (every document, fixed order): normalize → language ID → boilerplate strip (URLs, repost chains, platform emoji codes, UI residue, zero-width chars) → PII scrub → quality score → near-duplicate detection with cluster IDs → billable gate → provenance stamp. Documents that fail the language filter, character floor, quality floor, or arrive as duplicates are **dropped — not returned, not billed**.

Every run also writes a free **`SUMMARY`** record to the run's key-value store: per-source document counts, drop reasons, dedup ratio, and a quality-score histogram — so you can judge a pull's yield at a glance before scaling it up.
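The `doc_id` format shown in the record above, `{platform}_{record_type}_{native_id}`, cannot be split naively on underscores because the `group_topic` record type contains one. A minimal sketch of a parser follows, assuming the record types are exactly the four listed in the annotated record and that platform names contain no underscore; `parse_doc_id` and `DocId` are hypothetical helpers, not part of the Actor.

```python
from typing import NamedTuple

# Record types listed in the annotated record above. "group_topic" is the
# only one containing an underscore, which is why a plain split is unsafe.
RECORD_TYPES = ("group_topic", "comment", "post", "note")


class DocId(NamedTuple):
    platform: str
    record_type: str
    native_id: str


def parse_doc_id(doc_id: str) -> DocId:
    """Split '{platform}_{record_type}_{native_id}' into its three parts."""
    platform, sep, rest = doc_id.partition("_")
    if not sep or not platform:
        raise ValueError(f"not a doc_id: {doc_id!r}")
    for record_type in RECORD_TYPES:
        prefix = record_type + "_"
        if rest.startswith(prefix) and len(rest) > len(prefix):
            return DocId(platform, record_type, rest[len(prefix):])
    raise ValueError(f"unknown record type in doc_id: {doc_id!r}")
```

Grouping a corpus by `parse_doc_id(doc["doc_id"]).platform` gives a quick per-source breakdown without touching the `provenance` block.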

### Built for the EU AI Act provenance era

Under the **EU AI Act**, providers of general-purpose AI models must publish a *sufficiently detailed summary* of the content used for training, and EU text-and-data-mining rules require respecting machine-readable opt-out reservations. Most scraped datasets make that documentation work painful after the fact. This Actor does it **per document, at collection time**:

- **`source_url` + `retrieved_at`** — where each document came from and exactly when it was collected
- **`robots_state`** — whether the document's public URL path was allowed, disallowed, or unevaluable under the source domain's robots.txt at retrieval time
- **`opt_out_signals`** — machine-readable opt-out indications detected for the document (e.g. `robots_disallow`); always present, even when empty
- **`license_hint`** — a per-platform plain-language rights note attached to every record
- **`content_sha256` + `pipeline_version`** — reproducible content identity and processing lineage for your dataset documentation

Because these are first-class fields, downstream filtering is one line — e.g. drop everything with `robots_state: "disallowed"` or a non-empty `opt_out_signals` before a training run, and keep the audit trail showing you did.
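As an illustration of that filter, here is a minimal Python sketch that works on records shaped like the example above. The function name `training_eligible` is hypothetical, and the policy it encodes (drop `disallowed` and any non-empty `opt_out_signals`, keep `unavailable`) is only one possible choice; adjust it to your own policy.

```python
from typing import Iterable, Iterator


def training_eligible(docs: Iterable[dict]) -> Iterator[dict]:
    """Yield only documents that carry no recorded opt-out indication."""
    for doc in docs:
        provenance = doc.get("provenance", {})
        # Drop documents whose URL path was disallowed by robots.txt.
        if provenance.get("robots_state") == "disallowed":
            continue
        # Drop documents with any machine-readable opt-out signal.
        if provenance.get("opt_out_signals"):
            continue
        yield doc
```

Keeping the dropped `doc_id` values in a separate log gives you the audit trail mentioned above.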

> **Framing matters: this is documentation tooling, not legal clearance.** These fields make a corpus *documentable and filterable*. They do not determine whether any given use of any given document is lawful in your jurisdiction — that judgment, and the rights to make it on, remain yours.

### Legal positioning

> ⚖️ **Read this before buying**
>
> This Actor is a **collection, structuring, and provenance-documentation TOOL**. It accesses **publicly available content only** — the same content any anonymous browser visitor can see. It does **not** grant, transfer, or imply **any copyright license or AI-training rights** to the collected content. Rights remain with the original authors and platforms; the `license_hint` field on every record says exactly that. **Obtain any rights you need independently, and consult legal counsel** for your specific use case and jurisdiction. The provenance and opt-out fields help you document and filter a corpus — they are not, and cannot be, legal clearance.

#### FAQ

**Q: Is scraping Weibo / Bilibili / Xueqiu / Douban / RedNote legal?**
A: This Actor accesses only publicly visible content on each platform — no login bypass, no private accounts, no DMs, no follower lists. Optional `cookieStrings` are user-supplied and used only to improve recall and rate limits, never to bypass authentication. Laws on collecting and *using* public web data vary by jurisdiction and purpose. Always consult legal counsel for your specific use case and jurisdiction.

**Q: Does buying documents from this Actor give me the right to train AI models on them?**
A: **No.** No scraper can grant that — and this one explicitly does not claim to. The Actor collects, structures, and documents public content; copyright and related rights stay with the original authors and platforms. The per-document `license_hint`, `robots_state`, and `opt_out_signals` fields exist precisely so you (and your counsel) can make and document that determination yourselves.

**Q: How does the PII scrub work, and what about PIPL / GDPR?**
A: With `piiScrub: true` (the default), emails, phone numbers, **checksum-validated** Chinese resident IDs, and passport numbers found in document text are replaced with `[EMAIL]` / `[PHONE]` / `[CN_ID]` / `[PASSPORT]` tokens, and per-document counts are reported in the `pii` field (counts are reported even if you turn scrubbing off). This materially reduces incidental personal data in the corpus, but user-generated text can carry personal information in forms no scrubber catches — under PIPL, GDPR, and similar regimes **you remain responsible** for how you process and retain the data. The scrub is a hygiene layer, not a compliance guarantee.
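To make "checksum-validated" concrete, the sketch below shows how an 18-character Chinese resident ID can be validated with the GB 11643-1999 check character before it is replaced. This is an illustration of the general technique, not the Actor's actual implementation; the regular expression and the helper names `is_valid_cn_id` and `scrub_cn_ids` are assumptions.

```python
import re

# GB 11643-1999: weights for the first 17 digits and the check-character table.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"

# 17 digits plus a digit or X, not embedded in a longer digit run.
_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9Xx])")


def is_valid_cn_id(candidate: str) -> bool:
    """Return True if the 18th character matches the weighted checksum."""
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == candidate[17].upper()


def scrub_cn_ids(text: str) -> tuple[str, int]:
    """Replace checksum-valid IDs with [CN_ID]; return the text and the count."""
    count = 0

    def replace(match: re.Match) -> str:
        nonlocal count
        if is_valid_cn_id(match.group()):
            count += 1
            return "[CN_ID]"
        return match.group()  # looks like an ID but fails the checksum

    return _CANDIDATE.sub(replace, text), count
```

Validating the checksum keeps order numbers and other 18-digit strings from being scrubbed by mistake, which is why the count in the `pii` field can be lower than a naive pattern match would suggest.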

**Q: What does `robots_state: "disallowed"` mean on a record I received?**
A: It means that, at retrieval time, the source domain's robots.txt disallowed the document's public URL path for generic crawlers. The record is delivered anyway *with that flag* so you can apply your own policy — most training-data teams filter these out, and the field gives you a documented basis for doing so.

**Q: Why is sentiment tagging not included?**
A: Deliberately out of scope for v1 — corpus buyers run their own labeling stacks, and bundling a sentiment model would add cost to every document whether you want it or not. If you need sentiment-tagged Chinese mentions, that's the [Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor)'s lane ($0.045/mention, sentiment included).

**Q: Can I run this from Python?**
A: Yes — see the [Python example](#scrape-a-chinese-corpus-with-python-javascript-or-no-code) below. `pip install apify-client` and call `zhorex/chinese-corpus-engine` like any other Actor.

### Modes

| Mode | What it does | Billed as |
|---|---|---|
| **`corpus_pull`** | Scrape fresh documents for your topics from the selected sources and run the full pipeline | `corpus-doc` $0.025 / `corpus-doc-browser` $0.055 |
| **`dedup_merge`** | No scraping — re-process datasets you already own (from this Actor or any suite Actor) into one canonical, deduplicated corpus | `audit-record` $0.003 per record processed |
| **`provenance_audit`** | No scraping — emit a compact audit record (robots state, opt-out signals, license hint, PII counts, content hash) for every document in datasets you already own | `audit-record` $0.003 per audit record |

#### How to build a Chinese AI training corpus in 3 easy steps

1. **Go to the [Chinese AI Training Corpus Engine](https://apify.com/zhorex/chinese-corpus-engine)** on Apify Store and click **"Try for free"**
2. **Enter your topics** (mix brands, categories, and themes for corpus diversity — Chinese keywords return far more native content than Latin spellings) and **pick your sources**
3. **Click Run** and download AI-ready documents in JSON, CSV, or Excel

No coding required. No login needed. Works with Apify's free plan.

#### Example: evaluate corpus quality (start here)

A small pull to judge yield and quality before committing to volume:

```json
{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "智能驾驶"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 1000,
  "minQuality": "0.35",
  "minCharCount": 40,
  "languages": ["zh-CN", "mixed"]
}
```

Check the `SUMMARY` record (drop reasons, dedup ratio, quality histogram), then raise `maxDocs` to 10,000–50,000 for the real build.
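Before scaling up, it can also help to check how much of the evaluation pull would survive stricter gates. The sketch below re-applies the three gate inputs (`languages`, `minCharCount`, `minQuality`) locally to records you have already downloaded. It assumes the thresholds are inclusive, which this README does not state, and the helper names `passes_gates` and `survival_rate` are hypothetical.

```python
from typing import Iterable, Sequence


def passes_gates(
    doc: dict,
    min_quality: str = "0.35",
    min_char_count: int = 40,
    languages: Sequence[str] = ("zh-CN", "mixed"),
) -> bool:
    """Re-apply the language, character-floor, and quality-floor gates locally."""
    return (
        doc["language"] in languages
        and doc["char_count"] >= min_char_count
        # minQuality is passed as a string in the Actor input, so convert it.
        and doc["quality"]["score"] >= float(min_quality)
    )


def survival_rate(docs: Iterable[dict], **gates) -> float:
    """Fraction of already-delivered documents that would pass the given gates."""
    docs = list(docs)
    if not docs:
        return 0.0
    return sum(passes_gates(doc, **gates) for doc in docs) / len(docs)
```

For example, `survival_rate(docs, min_quality="0.5", min_char_count=80, languages=("zh-CN",))` estimates how much of the evaluation pull a strict SFT configuration would keep.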

#### Example: SFT dataset build with strict quality

```json
{
  "mode": "corpus_pull",
  "topics": ["护肤", "化妆品成分", "敏感肌"],
  "sources": ["weibo", "bilibili", "douban"],
  "maxDocs": 25000,
  "minQuality": "0.5",
  "minCharCount": 80,
  "languages": ["zh-CN"]
}
```

#### Example: add RedNote first-person reviews (browser source)

RedNote's full post bodies require JS rendering, so RedNote only runs when `includeBrowserSources` is enabled and bills at the higher `corpus-doc-browser` rate:

```json
{
  "mode": "corpus_pull",
  "topics": ["国货美妆"],
  "sources": ["weibo", "rednote"],
  "includeBrowserSources": true,
  "maxDocs": 5000,
  "proxyConfiguration": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}
```

Recommended run memory with browser sources on: **4096 MB**.

#### Example: merge & dedup datasets you already own

Point it at datasets from previous runs, or from any Chinese Digital Intelligence Suite Actor, and get back one canonical corpus with cluster IDs at $0.003 per record processed, roughly an eighth of the $0.025 scrape price:

```json
{
  "mode": "dedup_merge",
  "inputDatasetIds": ["DATASET_ID_1", "DATASET_ID_2", "DATASET_ID_3"],
  "minQuality": "0.35"
}
```

#### Example: provenance audit of an existing corpus

```json
{
  "mode": "provenance_audit",
  "inputDatasetIds": ["DATASET_ID_1"]
}
```

Each audit record carries `doc_id`, `source_url`, `robots_state`, `opt_out_signals`, `license_hint`, PII counts, `content_sha256`, and the dedup cluster ID — no text fields, just the documentation layer.

> 💾 **Memory guidance for bulk runs.** The default 2048 MB comfortably handles runs up to a practical ceiling of **~250,000 documents**; for runs approaching **1M documents**, set run memory to **8192 MB** (dedup state is held in memory). With `includeBrowserSources: true`, use **4096 MB** or more.

> 🍪 **Optional cookies.** The `cookieStrings` input (a secret field, encrypted at rest) accepts per-platform logged-in cookies, e.g. `{"xueqiu": "xq_a_token=..."}` — they improve recall and rate limits on the gated platforms. The Actor degrades gracefully without them. Use throwaway accounts and refresh roughly every 10 days for scheduled runs.

### Pricing

This Actor uses **Pay-Per-Event** pricing — and the gates work in your favor: **only documents that pass the quality floor, character floor, and deduplication are billed. Rejects, duplicates, and already-collected documents are free** — not returned, not billed.

| Event | Price | Charged when |
|---|---|---|
| `corpus-doc` | **$0.025** | An AI-ready document from an HTTP source (Weibo / Bilibili / Xueqiu / Douban) passes all gates |
| `corpus-doc-browser` | **$0.055** | An AI-ready document from a browser source (RedNote) passes all gates — only with `includeBrowserSources: true` |
| `audit-record` | **$0.003** | Per record processed in `dedup_merge`; per audit record emitted in `provenance_audit` |

#### Pick the right lane (and the right price)

This engine deliberately does **not** replace the rest of the suite — each lane is priced for its job:

| Your job | Right tool | Price lane |
|---|---|---|
| **Raw records from one platform**, cheapest per record — comments, danmaku, profiles, full review bodies | The single-platform scrapers ([Weibo](https://apify.com/zhorex/weibo-scraper), [Bilibili](https://apify.com/zhorex/bilibili-scraper), [Douban](https://apify.com/zhorex/douban-scraper), [Xueqiu](https://apify.com/zhorex/xueqiu-scraper), [RedNote](https://apify.com/zhorex/rednote-xiaohongshu-scraper)) | **$0.005–$0.020 per raw record** |
| **Recurring brand monitoring** — sentiment-tagged, reach-weighted mention feed | [Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor) | **$0.045 per mention** |
| **AI-ready corpus** — cleaned, deduplicated, quality-gated, PII-scrubbed, provenance-stamped documents | **This Actor** | **$0.025 per document** ($0.055 browser-sourced) |

At $0.025, a corpus document costs barely more than a single raw Weibo record — meaning the boilerplate stripping, near-duplicate collapse, quality scoring, PII scrub, and provenance stamping are **effectively included free**. If you just need raw records, the single-platform scrapers stay cheaper — use them. If you need documents your training pipeline can ingest as-is, this is the lane.

#### Realistic costs

| Workflow | Volume | Cost |
|---|---|---|
| Corpus quality evaluation pull | 1,000 docs | **~$25** |
| SFT dataset build (HTTP sources) | 10,000 docs | **~$250** |
| Foundation-corpus slice (HTTP sources) | 50,000 docs | **~$1,250** |
| RedNote first-person review corpus (browser source) | 5,000 docs | **~$275** |
| Dedup & merge of datasets you already own | 100,000 records | **~$300** |
| Provenance audit of an existing corpus | 50,000 records | **~$150** |

Compare: licensing a comparable academic Chinese-text corpus runs **$15K–$50K** with a single-use license and months of delivery time — and arrives without per-document provenance.

Volume pricing available above 50K documents/month (see the Enterprise section at the top). Apify platform compute costs (RAM-seconds) are charged separately; browser-source runs also consume residential proxy bandwidth billed by GB on your own account.
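The figures in the table above follow directly from the per-event prices. A small sketch for budgeting your own mix of events is shown below; it uses `Decimal` to avoid floating-point rounding and covers only the pay-per-event charges, not platform compute or proxy bandwidth. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Per-event prices from the pricing table in this README (USD).
PRICES = {
    "corpus-doc": Decimal("0.025"),
    "corpus-doc-browser": Decimal("0.055"),
    "audit-record": Decimal("0.003"),
}


def estimate_cost(event_counts: dict[str, int]) -> Decimal:
    """Sum count * price over billed events; unknown event names raise KeyError."""
    return sum(
        (PRICES[event] * count for event, count in event_counts.items()),
        Decimal("0"),
    )
```

For instance, `estimate_cost({"corpus-doc": 10_000})` reproduces the $250 SFT build, and `estimate_cost({"corpus-doc": 4_000, "corpus-doc-browser": 1_000})` prices a mixed Weibo and RedNote pull.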

### ⏰ Set up a scheduled corpus refresh in 2 minutes

**A corpus is not a one-off pull — it's an asset that compounds.** A weekly or daily schedule turns a single topical pull into a **continuously growing corpus**: each run adds only the documents that didn't exist last time, and with `deltaStateKey` set, **already-collected documents are skipped — not returned, not billed.** Over weeks you own a deduplicated, provenance-stamped Chinese-text corpus nobody else has, built at marginal cost.

1. **Run it once with your input.** Use `corpus_pull` with your topics (see [the examples above](#modes)), click **Run**, and check the `SUMMARY` record to confirm yield and quality look right.
2. **Apify Console → Schedules → Create.** Pick this Actor and your saved input. (Shortcut: open any finished run and click **Schedule** to pre-fill the input for you.)
3. **Set a cron expression and save.** For example `0 8 * * *` = daily at 8am, or `0 * * * *` = hourly for fast-moving topics. While you're there, enable the **email notification on failed runs** option so you know if a run ever needs attention.

Each scheduled run **appends fresh documents to the same dataset**, so your corpus grows continuously with zero manual work — no babysitting, no re-running by hand, no duplicate billing.

#### 💸 Grow one corpus, pay once per document — `deltaStateKey`

Set a stable `deltaStateKey` and the Actor keeps a cross-run ledger of every document it has already delivered under that key (exact hashes **and** near-duplicate signatures). On every subsequent run, already-collected documents are skipped: **not returned, not billed.** Use a distinct key per corpus so independent projects don't collide — e.g. `"ev-corpus-weekly"` vs `"beauty-corpus-weekly"`.

**Example — a weekly EV-sector corpus that only ever bills for new documents:**

```json
{
  "mode": "corpus_pull",
  "topics": ["新能源汽车", "电动车", "充电桩"],
  "sources": ["weibo", "bilibili", "xueqiu", "douban"],
  "maxDocs": 10000,
  "deltaStateKey": "ev-corpus-weekly"
}
```

Pair this with the cron above and the dataset becomes a living corpus: every run appends only genuinely new, gate-passing documents — no duplicate pulls, no duplicate cost.
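Conceptually, the exact-hash half of such a ledger behaves like a set of content hashes per key. The sketch below is an in-memory illustration only: it does not model near-duplicate signatures, and it says nothing about how the Actor actually stores its ledger. The class `DeliveredLedger` is hypothetical, and it reuses the `content_sha256` value already present on each record. It can double as a client-side check that scheduled runs are not delivering the same document twice.

```python
class DeliveredLedger:
    """Tracks which content hashes have been seen under each delta state key."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = {}

    def new_docs(self, delta_state_key: str, docs: list[dict]) -> list[dict]:
        """Return only documents not yet seen under this key, then record them."""
        seen = self._seen.setdefault(delta_state_key, set())
        fresh = []
        for doc in docs:
            content_hash = doc["provenance"]["content_sha256"]
            if content_hash in seen:
                continue  # already delivered under this key
            seen.add(content_hash)
            fresh.append(doc)
        return fresh
```

Because each key has its own set, documents recorded under `"ev-corpus-weekly"` do not affect what counts as new under `"beauty-corpus-weekly"`.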

### Integrations & data export

Export your corpus in JSON, CSV, Excel, or XML. Integrate directly with:

- **Google Sheets** — sync document metadata for corpus QA dashboards
- **Zapier / Make / n8n** — trigger downstream processing when a refresh run finishes
- **REST API** — programmatic access from Python, JavaScript, or any language
- **Webhooks** — real-time notifications when corpus pulls complete

[See all integrations →](https://docs.apify.com/platform/integrations)

### Scrape a Chinese corpus with Python, JavaScript, or no code

Use this Actor directly from the Apify Console (no coding required), or call it via the [Apify API](https://docs.apify.com/api) from any language:

**Python example:**

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("zhorex/chinese-corpus-engine").call(run_input={
    "mode": "corpus_pull",
    "topics": ["新能源汽车"],
    "sources": ["weibo", "bilibili", "xueqiu", "douban"],
    "maxDocs": 1000,
})
for doc in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(doc["doc_id"], doc["quality"]["score"], doc["char_count"])
```
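Training pipelines often expect JSON Lines rather than full records. A minimal sketch of that conversion follows, keeping the text alongside the fields needed to trace each line back to its source. The output keys (`id`, `text`, `source_url`, `sha256`) and the helper name `to_jsonl_lines` are illustrative assumptions, not a format the Actor prescribes.

```python
import json
from typing import Iterable, Iterator


def to_jsonl_lines(docs: Iterable[dict]) -> Iterator[str]:
    """Yield one compact JSON line per document, keeping Chinese text readable."""
    for doc in docs:
        row = {
            "id": doc["doc_id"],
            "text": doc["text_clean"],
            "source_url": doc["provenance"]["source_url"],
            "sha256": doc["provenance"]["content_sha256"],
        }
        # ensure_ascii=False writes Chinese characters as-is, not \uXXXX escapes.
        yield json.dumps(row, ensure_ascii=False)
```

To write a file, open it with `encoding="utf-8"` and write each yielded line followed by a newline; the iterator returned by `iterate_items()` in the example above can be passed in directly.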

**JavaScript example:**

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('zhorex/chinese-corpus-engine').call({
    mode: 'corpus_pull',
    topics: ['新能源汽车'],
    sources: ['weibo', 'bilibili', 'xueqiu', 'douban'],
    maxDocs: 1000,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((doc) => console.log(doc.doc_id, doc.quality.score));
```

#### Using the raw REST API (Postman / curl)

> ⚠️ **The run endpoint is asynchronous — its response is the *run object* (IDs + status), NOT your corpus documents.** If you `POST` to `/acts/.../runs` you get back something like `{ "data": { "status": "READY", "defaultDatasetId": "…" } }` with **no documents in it** — that's expected, the run hasn't finished yet. The documents land in the run's **dataset**, not in that response. (Likewise, the `containerUrl` link is the live container; once a run finishes it just shows *"run has already finished with status SUCCEEDED"* — that message means success, it is not where the data lives.)

**Easiest — one call that waits for the run and returns the documents directly:**

```bash
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"sources":["weibo","bilibili"],"maxDocs":500}'
```

The response body **is** the JSON array of corpus documents — no second call needed. (Best for small pulls; bulk corpus runs outlive the sync endpoint's timeout — use the async pattern below.)

**Or async** — start the run, then fetch the dataset once it finishes:

```bash
# 1) start the run; note the "defaultDatasetId" in the response
curl -X POST "https://api.apify.com/v2/acts/zhorex~chinese-corpus-engine/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode":"corpus_pull","topics":["新能源汽车"],"maxDocs":10000}'

# 2) when the run status is SUCCEEDED, fetch the documents from its dataset
curl "https://api.apify.com/v2/datasets/DEFAULT_DATASET_ID/items?token=YOUR_API_TOKEN"
```
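Between steps 1 and 2 you need to wait until the run reaches a terminal status. A hedged Python sketch of that polling loop follows. It assumes the run ID is read from `data.id` in the step 1 response and that the run is fetched from the Apify API's `GET /v2/actor-runs/{runId}` endpoint; the function names, poll interval, and timeout are illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable

# Statuses after which a run no longer changes.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def fetch_run_status(run_id: str, token: str) -> str:
    """Read the current status of a run from the Apify API."""
    url = f"https://api.apify.com/v2/actor-runs/{run_id}?token={token}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["data"]["status"]


def wait_for_run(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the run is terminal; return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

A call such as `wait_for_run(lambda: fetch_run_status(run_id, token))` blocks until the run finishes; fetch the dataset only when the returned status is `SUCCEEDED`.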

> 💡 In the **Apify Console** you can also open any run and click the **Output / Storage → Dataset** tab to view and download the same data as JSON / CSV / Excel.

### What this Actor is NOT

- **Not a Zhihu scraper.** Zhihu is deliberately excluded from this Actor's source list.
- **Not WeChat or Douyin coverage.** WeChat has no public scraping interface; Douyin is out of scope in the current release.
- **Not a rights-clearance service.** It documents provenance; it does not — and cannot — grant copyright licenses or AI-training rights. See [Legal positioning](#legal-positioning).
- **Not a monitoring dashboard.** If you want recurring brand mentions with sentiment and reach weighting, that's the [Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor)'s job — this engine builds corpora, not alerts.

### Other scrapers by Zhorex

**Chinese Digital Intelligence Suite:**

- [Chinese Brand Monitor](https://apify.com/zhorex/chinese-brand-monitor) — Cross-platform brand mention aggregator across all 5 platforms ($0.045/mention)
- [Weibo Scraper](https://apify.com/zhorex/weibo-scraper) — Posts, hot search, trending topics (580M+ users)
- [Bilibili Scraper](https://apify.com/zhorex/bilibili-scraper) — Video, danmaku, Gen-Z creator analytics (300M+ users)
- [RedNote (Xiaohongshu) Scraper](https://apify.com/zhorex/rednote-xiaohongshu-scraper) — Search, posts, profiles, comments, videos (300M+ users)
- [Douban Scraper](https://apify.com/zhorex/douban-scraper) — Long-form reviews, ratings, group discussions (movies/books/music)
- [Xueqiu Scraper](https://apify.com/zhorex/xueqiu-scraper) — Chinese stock-discussion sentiment, cashtag indexing
- [RedNote Shop Scraper](https://apify.com/zhorex/rednote-shop-scraper) — Xiaohongshu e-commerce products, vendors, prices
- [JD.com Scraper](https://apify.com/zhorex/jd-scraper) — JD product detail extraction

**Reviews & alt-data:**

- [Letterboxd Scraper](https://apify.com/zhorex/letterboxd-scraper) — Western film reviews and ratings
- [G2 Reviews Scraper](https://apify.com/zhorex/g2-reviews-scraper) — B2B software reviews via public API
- [Capterra Reviews Scraper](https://apify.com/zhorex/capterra-reviews-scraper) — Software reviews with sub-ratings
- [Booking.com Reviews Scraper](https://apify.com/zhorex/booking-reviews-scraper) — Hotel reviews and ratings
- [Review Intelligence Aggregator](https://apify.com/zhorex/review-intelligence-aggregator) — Multi-source review aggregation

**Markets & alt-data:**

- [TradingView Scraper](https://apify.com/zhorex/tradingview-scraper) — Stocks, forex, crypto data
- [Hyperliquid Pro Scraper](https://apify.com/zhorex/hyperliquid-scraper) — DeFi top traders, vaults, perpetual markets

**Streaming Analytics:**

- [Twitch Scraper](https://apify.com/zhorex/twitch-scraper) — Streamer profiles, live streams, clips
- [Kick.com Scraper](https://apify.com/zhorex/kick-scraper) — Kick streamer analytics
- [YouTube Shorts Scraper Pro](https://apify.com/zhorex/youtube-shorts-scraper-pro) — YouTube Shorts data

**Other Tools:**

- [Perplexity AI Search Scraper](https://apify.com/zhorex/perplexity-ai-scraper) — AI search results
- [Telegram Channel Scraper](https://apify.com/zhorex/telegram-channel-scraper) — Telegram messages
- [Tech Stack Detector](https://apify.com/zhorex/tech-stack-detector) — Detect technologies used by websites
- [LinkedIn Company Enrichment](https://apify.com/zhorex/linkedin-company-enrichment) — Enrich company records
- [Domain Authority Checker](https://apify.com/zhorex/domain-authority-checker) — Bulk SEO domain analysis
- [Phone Number Validator](https://apify.com/zhorex/phone-number-validator) — Phone validation
- [Sneaker Price Tracker](https://apify.com/zhorex/sneaker-price-tracker) — Track sneaker prices across platforms

***

### Your Review Matters ⭐

This is the only AI-corpus assembly engine in the Chinese Digital Intelligence Suite — and the only Apify Actor that ships Chinese social text already deduplicated, quality-gated, PII-scrubbed, and provenance-stamped. If it delivered the corpus you needed, **a 30-second review helps a lot**:

1. Go to the [Chinese AI Training Corpus Engine page](https://apify.com/zhorex/chinese-corpus-engine)
2. Click the star rating (top of the page)
3. Optionally leave a one-line note (e.g. "pulled a 10K-doc deduplicated EV corpus in one run")

**Why it matters:** reviews are one of the first things Apify users check before trying an Actor. A high rating helps more teams find this engine instead of stitching together raw scrapers and a cleaning pipeline by hand, which in turn supports faster updates, more sources, and better support for everyone.

**Found a bug or missing feature?** [Open an issue](https://apify.com/zhorex/chinese-corpus-engine/issues) on the Actor page and it'll typically be fixed within 48 hours.

***

*Last updated: June 2026 · Actively maintained · Trusted by AI training data teams, data vendors, and academic NLP researchers.*

# Actor input schema

## `mode` (type: `string`):

`corpus_pull` scrapes fresh documents. `dedup_merge` and `provenance_audit` run the pipeline over datasets you already own (set `inputDatasetIds`); they do no scraping and cost less per record. Default: `corpus_pull`.

## `topics` (type: `array`):

Search keywords fanned out to every selected source. Mix brands, categories, and themes for corpus diversity. Required in `corpus_pull` mode.

## `sources` (type: `array`):

`weibo`, `bilibili`, `xueqiu`, and `douban` are HTTP sources billed at the `corpus-doc` rate. `rednote` is a browser source: it runs only when `includeBrowserSources` is enabled, and it bills at the higher `corpus-doc-browser` rate. Default: all four HTTP sources.

## `maxDocs` (type: `integer`):

Global cap on BILLABLE documents per run. Typical patterns: 500-1,000 to evaluate corpus quality, 10,000-50,000 for an SFT dataset build, 100,000+ for a foundation-corpus pull. Only documents passing the quality floor, char floor, and dedup are counted or charged — rejects are free. See the Pricing tab for per-document cost.
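
Because `maxDocs` caps billable documents, it also bounds what a run can cost. The sketch below assumes the HTTP-source headline rate of $0.025 per document quoted at the top of this page; the browser-source rate is not stated here, so it has to be passed in by the caller. The function name `max_run_cost` is hypothetical, and the Pricing tab remains the authoritative source for the actual rate.

```python
from decimal import Decimal

# Assumption: headline HTTP-source rate ("from $25.00 / 1,000" documents).
HTTP_RATE_PER_DOC = Decimal("0.025")


def max_run_cost(max_docs: int, rate_per_doc: Decimal = HTTP_RATE_PER_DOC) -> Decimal:
    """Upper bound on per-document charges for one run at the given rate.

    Rejected documents are free, so the real charge can only be lower.
    """
    if max_docs < 1:
        raise ValueError("maxDocs must be at least 1")
    return max_docs * rate_per_doc


for docs in (1_000, 10_000, 100_000):
    print(f"maxDocs={docs:>7,}: at most ${max_run_cost(docs):,.2f}")
```

`Decimal` is used instead of `float` so that per-document prices multiply without rounding drift.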

## `minQuality` (type: `string`):

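Documents scoring below this threshold are dropped: they are neither returned nor billed. The score is a 0-1 heuristic based on length, charset, diversity, punctuation sanity, and sentence structure. Pass the threshold as one of the strings `"0.0"`, `"0.2"`, `"0.35"`, `"0.5"`, or `"0.65"`; the default is `"0.35"`.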
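
The OpenAPI specification on this page declares `minQuality` as a string enum, not a number, so a pipeline that keeps its threshold as a float needs to map it onto an allowed value first. A minimal sketch of one way to do that, picking the strictest allowed value that does not exceed the requested threshold; the helper name `to_min_quality` and the rounding-down policy are assumptions, not part of the Actor.

```python
# Allowed values copied from the Actor's input schema.
ALLOWED_MIN_QUALITY = ("0.0", "0.2", "0.35", "0.5", "0.65")


def to_min_quality(threshold: float) -> str:
    """Map a 0-1 float onto the highest allowed enum string not above it."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    eligible = [value for value in ALLOWED_MIN_QUALITY if float(value) <= threshold]
    # "0.0" is always eligible, so the list is never empty.
    return eligible[-1]


actor_input = {"mode": "corpus_pull", "topics": ["新能源汽车"], "minQuality": to_min_quality(0.4)}
print(actor_input["minQuality"])
```

Rounding down keeps the filter no stricter than what was asked for; rounding up would be the choice if over-filtering is preferable to letting borderline documents through.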

## `minCharCount` (type: `integer`):

Documents whose cleaned text is shorter than this are dropped — not returned, not billed.

## `languages` (type: `array`):

Keep only documents detected as these languages. Allowed values: `zh-CN`, `mixed`, `en`. Default: `zh-CN` and `mixed`.

## `includeBrowserSources` (type: `boolean`):

Enables JS-rendered sources (currently RedNote full post bodies). These need a headless browser + residential proxy and bill at the corpus-doc-browser rate instead of corpus-doc. Recommended run memory with this on: 4096 MB.
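
Since `rednote` only runs when `includeBrowserSources` is on, a client-side pre-check can catch a mismatched input before a run is started. The following is a sketch under that reading of the schema; `effective_sources` is a hypothetical helper, and how the Actor itself reacts to a mismatch (skipping silently or reporting an error) is not documented here.

```python
HTTP_SOURCES = {"weibo", "bilibili", "xueqiu", "douban"}
BROWSER_SOURCES = {"rednote"}


def effective_sources(sources, include_browser_sources=False):
    """Return the sources expected to run, raising on names outside the schema enum."""
    unknown = set(sources) - HTTP_SOURCES - BROWSER_SOURCES
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")
    # Browser sources are kept only when the flag is enabled.
    return [s for s in sources if s in HTTP_SOURCES or include_browser_sources]


requested = ["weibo", "douban", "rednote"]
print(effective_sources(requested))                                # rednote is left out
print(effective_sources(requested, include_browser_sources=True))  # rednote is kept
```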

## `piiScrub` (type: `boolean`):

Replace emails, phone numbers, Chinese resident IDs (checksum-validated), and passport numbers in document text with `[EMAIL]`, `[PHONE]`, `[CN_ID]`, and `[PASSPORT]` tokens. Counts are always reported per document in the `pii` field.
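
To illustrate what "checksum-validated" means for resident IDs, here is a minimal sketch of token replacement for two of the four PII types. It is not the Actor's implementation: the regular expressions, the function names, and the shape of the returned counts are assumptions for illustration. The check-digit rule is the standard weighted mod-11 scheme used for 18-character resident identity numbers.

```python
import re

# Position weights and check characters of the 18-character resident ID scheme.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"

# Illustrative patterns only; a production scrubber would be stricter.
_CN_ID_RE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_cn_id(candidate: str) -> bool:
    """Check the final character against the weighted sum of the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == candidate[17].upper()


def scrub(text: str):
    """Return (scrubbed_text, counts); the counts layout is hypothetical."""
    counts = {"email": 0, "cn_id": 0}

    def replace_id(match):
        if not is_valid_cn_id(match.group()):
            return match.group()  # 18 digits that fail the checksum are left alone
        counts["cn_id"] += 1
        return "[CN_ID]"

    def replace_email(match):
        counts["email"] += 1
        return "[EMAIL]"

    text = _CN_ID_RE.sub(replace_id, text)
    text = _EMAIL_RE.sub(replace_email, text)
    return text, counts


print(scrub("联系 test@example.com,身份证 11010519491231002X"))
```

Validating the checksum before replacing keeps order numbers and other 18-digit strings from being masked as IDs, which matters for corpus quality as much as for privacy.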

## `deltaStateKey` (type: `string`):

Set a stable key (e.g. 'ev-corpus-weekly') to make scheduled runs grow one corpus: documents already collected under this key are skipped — not returned, not billed.

## `inputDatasetIds` (type: `array`):

Dataset IDs from this Actor or any Chinese Digital Intelligence Suite Actor to re-process. Ignored in `corpus_pull` mode.
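
The re-processing modes take dataset IDs instead of topics. A minimal sketch of building such an input follows; `build_reprocess_input` is a hypothetical helper, the dataset IDs are placeholders, and only the two fields this page ties to these modes are set.

```python
import json

REPROCESS_MODES = ("dedup_merge", "provenance_audit")


def build_reprocess_input(mode, dataset_ids):
    """Build an input object for a run that re-processes existing datasets."""
    if mode not in REPROCESS_MODES:
        raise ValueError(f"mode must be one of {REPROCESS_MODES}")
    if not dataset_ids:
        raise ValueError("inputDatasetIds needs at least one dataset ID")
    return {"mode": mode, "inputDatasetIds": list(dataset_ids)}


# Placeholder IDs: substitute the IDs of datasets you own.
run_input = build_reprocess_input("dedup_merge", ["<DATASET_ID_1>", "<DATASET_ID_2>"])
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the API section of this page.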

## `cookieStrings` (type: `object`):

Optional logged-in cookies keyed by platform name to improve recall and rate limits, e.g. `{"xueqiu": "xq_a_token=..."}`. Never stored or logged.

## `proxyConfiguration` (type: `object`):

Datacenter works for Weibo/Bilibili/Xueqiu. Douban and RedNote work best with RESIDENTIAL (country CN).
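
The guidance above can be turned into a small rule for picking a proxy object from the selected sources. This is a sketch, not a documented recipe: `proxy_for` is a hypothetical helper, and it assumes the Actor honours the standard Apify proxy input fields `apifyProxyGroups` and `apifyProxyCountry`.

```python
# Sources that this page says work best behind a residential proxy in China.
RESIDENTIAL_PREFERRED = {"douban", "rednote"}


def proxy_for(sources):
    """Pick a proxyConfiguration object based on the selected sources."""
    if RESIDENTIAL_PREFERRED & set(sources):
        return {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
            "apifyProxyCountry": "CN",
        }
    # Datacenter proxy (the schema default) for the remaining sources.
    return {"useApifyProxy": True}


print(proxy_for(["weibo", "bilibili"]))
print(proxy_for(["weibo", "douban"]))
```

Residential traffic is usually billed differently from datacenter traffic on a proxy plan, so switching only when a run includes Douban or RedNote keeps the cheaper option for everything else.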

## Actor input object example

```json
{
  "mode": "corpus_pull",
  "topics": [
    "新能源汽车"
  ],
  "sources": [
    "weibo",
    "bilibili",
    "xueqiu",
    "douban"
  ],
  "maxDocs": 1000,
  "minQuality": "0.35",
  "minCharCount": 40,
  "languages": [
    "zh-CN",
    "mixed"
  ],
  "includeBrowserSources": false,
  "piiScrub": true,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input ("mode" is marked as required in the input schema)
const input = {
    "mode": "corpus_pull",
    "topics": [
        "新能源汽车"
    ],
    "maxDocs": 1000
};

// Run the Actor and wait for it to finish
const run = await client.actor("zhorex/chinese-corpus-engine").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input ("mode" is marked as required in the input schema)
run_input = {
    "mode": "corpus_pull",
    "topics": ["新能源汽车"],
    "maxDocs": 1000,
}

# Run the Actor and wait for it to finish
run = client.actor("zhorex/chinese-corpus-engine").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "mode": "corpus_pull",
  "topics": [
    "新能源汽车"
  ],
  "maxDocs": 1000
}' |
apify call zhorex/chinese-corpus-engine --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=zhorex/chinese-corpus-engine",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Chinese AI Training Corpus Engine",
        "description": "Turn China's public web into AI-training-ready text. Pulls Weibo, Bilibili, Xueqiu, Douban & RedNote, then deduplicates, quality-scores, PII-scrubs and provenance-stamps every document. From $0.025/doc, pay-as-you-go. For LLM training-data teams, data vendors & academic NLP researchers.",
        "version": "0.1",
        "x-build-id": "4v1oT5isGRsbSvakO"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/zhorex~chinese-corpus-engine/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-zhorex-chinese-corpus-engine",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/zhorex~chinese-corpus-engine/runs": {
            "post": {
                "operationId": "runs-sync-zhorex-chinese-corpus-engine",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/zhorex~chinese-corpus-engine/run-sync": {
            "post": {
                "operationId": "run-sync-zhorex-chinese-corpus-engine",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "mode"
                ],
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "corpus_pull",
                            "dedup_merge",
                            "provenance_audit"
                        ],
                        "type": "string",
                        "description": "corpus_pull scrapes fresh documents. dedup_merge and provenance_audit run the pipeline over datasets you already own (set inputDatasetIds) — no scraping, cheaper per record.",
                        "default": "corpus_pull"
                    },
                    "topics": {
                        "title": "Topics / keywords",
                        "type": "array",
                        "description": "Search keywords fanned out to every selected source. Mix brands, categories, and themes for corpus diversity. Required in corpus_pull mode.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "sources": {
                        "title": "Sources",
                        "type": "array",
                        "description": "weibo / bilibili / xueqiu / douban are HTTP sources billed at the corpus-doc rate. rednote is a browser source — it only runs when 'Include browser sources' is enabled and bills at the higher corpus-doc-browser rate.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "weibo",
                                "bilibili",
                                "xueqiu",
                                "douban",
                                "rednote"
                            ]
                        },
                        "default": [
                            "weibo",
                            "bilibili",
                            "xueqiu",
                            "douban"
                        ]
                    },
                    "maxDocs": {
                        "title": "Max documents (raise for bulk corpus pulls)",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Global cap on BILLABLE documents per run. Typical patterns: 500-1,000 to evaluate corpus quality, 10,000-50,000 for an SFT dataset build, 100,000+ for a foundation-corpus pull. Only documents passing the quality floor, char floor, and dedup are counted or charged — rejects are free. See the Pricing tab for per-document cost.",
                        "default": 1000
                    },
                    "minQuality": {
                        "title": "Minimum quality score",
                        "enum": [
                            "0.0",
                            "0.2",
                            "0.35",
                            "0.5",
                            "0.65"
                        ],
                        "type": "string",
                        "description": "Documents scoring below this (0-1 heuristic: length, charset, diversity, punctuation sanity, sentence structure) are dropped — not returned, not billed.",
                        "default": "0.35"
                    },
                    "minCharCount": {
                        "title": "Minimum characters per document",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Documents whose cleaned text is shorter than this are dropped — not returned, not billed.",
                        "default": 40
                    },
                    "languages": {
                        "title": "Language filter",
                        "type": "array",
                        "description": "Keep only documents detected as these languages.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "zh-CN",
                                "mixed",
                                "en"
                            ]
                        },
                        "default": [
                            "zh-CN",
                            "mixed"
                        ]
                    },
                    "includeBrowserSources": {
                        "title": "Include browser sources (RedNote) — higher-priced event",
                        "type": "boolean",
                        "description": "Enables JS-rendered sources (currently RedNote full post bodies). These need a headless browser + residential proxy and bill at the corpus-doc-browser rate instead of corpus-doc. Recommended run memory with this on: 4096 MB.",
                        "default": false
                    },
                    "piiScrub": {
                        "title": "PII scrub",
                        "type": "boolean",
                        "description": "Replace emails, phone numbers, Chinese resident IDs (checksum-validated), and passport numbers in document text with [EMAIL]/[PHONE]/[CN_ID]/[PASSPORT] tokens. Counts are always reported per document in the pii field.",
                        "default": true
                    },
                    "deltaStateKey": {
                        "title": "Corpus refresh key (delta state)",
                        "type": "string",
                        "description": "Set a stable key (e.g. 'ev-corpus-weekly') to make scheduled runs grow one corpus: documents already collected under this key are skipped — not returned, not billed."
                    },
                    "inputDatasetIds": {
                        "title": "Input dataset IDs (dedup_merge / provenance_audit)",
                        "type": "array",
                        "description": "Dataset IDs from this actor or any Chinese Digital Intelligence Suite actor to re-process. Ignored in corpus_pull mode.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "cookieStrings": {
                        "title": "Cookie strings (optional, per platform)",
                        "type": "object",
                        "description": "Optional logged-in cookies keyed by platform name to improve recall and rate limits, e.g. {\"xueqiu\": \"xq_a_token=...\"}. Never stored or logged."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Datacenter works for Weibo/Bilibili/Xueqiu. Douban and RedNote work best with RESIDENTIAL (country CN).",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
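
For projects that cannot add the client library, the `run-sync-get-dataset-items` path in the specification above can be called with the standard library alone. The sketch below only builds the request so that it can be inspected without spending credits; `build_sync_request` is a hypothetical helper, and the token is passed as the `token` query parameter as the specification describes.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "zhorex~chinese-corpus-engine"  # "/" in the Actor name becomes "~" in API paths


def build_sync_request(token, actor_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"mode": "corpus_pull", "topics": ["新能源汽车"], "maxDocs": 100},
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=...).
```

A synchronous call holds the HTTP connection open for the whole run, so it suits small evaluation pulls; for large `maxDocs` values the asynchronous `/runs` path followed by a separate dataset download is likely the safer choice.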