Voice AI Data Pipeline
Turn voice and chat transcripts into training-ready datasets. Anonymize PII, normalize dialogue, detect language, and add lightweight intent, sentiment, and resolution labels. No scraping — processes only data you provide.
Pricing: from $0.10 / 1,000 results
Developer: Hayder Al-Khalissi
What does the Voice AI Data Pipeline do?
The Voice AI Data Pipeline Actor turns conversation transcripts into training-ready datasets for voice and conversational AI. You provide conversations (inline JSON, an Apify dataset ID, or a list of public URLs), and the Actor anonymizes PII, normalizes dialogue, runs language detection and lightweight analytics (sentiment, intent, resolution), then exports one structured item per conversation to an Apify dataset or JSONL. It does not scrape or collect data by itself—it only processes data you supply or explicitly point to.
What can the Voice AI Data Pipeline do?
- Ingest from multiple sources – Paste raw conversations (JSON), connect an Apify dataset by ID, or pass a list of public URLs that return JSON, CSV, or plain text.
- PII redaction – Detects and masks email, phone, credit card, IBAN, address patterns, and names (configurable). Optional hash linkage keeps same-entity tracking without storing originals.
- Structured, normalized output – Canonical customer / agent roles, optional timestamps, turn limits, and deduplication by conversation ID.
- Language detection – Built-in heuristic (no external calls) or optional Apify Language Detector for ML-based detection and confidence scores. Locale-aware intent, sentiment, and resolution heuristics (en, de, fr, es).
- Lightweight analytics – Sentiment score, intent (billing, support, cancel, booking, complaint), resolution heuristic, and conversation metrics (turns, length, response time).
- Flexible export – Default Apify dataset or JSONL in Key-Value Store (e.g. for downstream training pipelines).
- Platform advantages – Run the Actor on a schedule, call it via the Apify API, plug it into workflows (Zapier, Make, n8n), and monitor runs in the Apify Console. You only pay for the compute and storage you use.
What data can the Voice AI Data Pipeline produce?
| Data | Description |
|---|---|
| conversationId | Unique ID per conversation |
| language | Detected language code (e.g. en, de); optional confidence when using Apify Language Detector |
| messages | Normalized messages with role, text, optional timestamp |
| turns | Number of messages |
| metrics | Turns, average message length (characters), total characters, average response time (ms) |
| labels | Intent, resolved (boolean), sentiment score |
| privacy | Redaction flag and counts per PII type; optional hash linkage data |
How do I use the Voice AI Data Pipeline to prepare conversation data?
- Open the Actor on Apify and go to the Input tab.
- Choose Source: raw (paste JSON), dataset (Apify dataset ID), or urls (list of public URLs).
- Set Format (role/text/timestamp field names and role mapping) if your data uses different keys.
- Configure Privacy (enable PII redaction, mask or remove, which PII types).
- Set Processing (min/max turns, language detection: off, heuristic, or Apify).
- Choose Export (dataset or JSONL key).
- Run the Actor. Results appear in the run’s dataset or Key-Value Store.
For raw mode, provide an array of conversations; each has conversationId (or id) and messages (array of { role, text, ts? }). See the Input section below for an example.
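If it helps to see the expected shape as types, here is a minimal TypeScript sketch based on the default field names described above (the Format section lets you remap them):

```typescript
// Minimal sketch of the raw-mode input shape, assuming the default field names
// (role / text / ts). Remap them in the Format section if your data uses other keys.
interface RawMessage {
  role: string;   // mapped to "customer" or "agent" via the role mapping
  text: string;
  ts?: string;    // optional ISO 8601 timestamp
}

interface RawConversation {
  conversationId?: string; // "id" is also accepted
  id?: string;
  messages: RawMessage[];
}

// source.rawConversations is an array of these conversations
type RawConversations = RawConversation[];
```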
Test the Actor with a ready-made example – Use the sample input in the repo to run a quick test:
- On Apify: Open the Actor → Input tab → switch to JSON view → paste the contents of examples/sample-input.json → Start.
- Locally: Copy examples/sample-input.json to storage/key_value_stores/default/INPUT.json, then run node run-local.js.
The sample includes 4 conversations: English (billing + resolution), German (support + resolution), one with PII (email/phone) to see redaction, and a short support exchange. You’ll get language detection, intent/sentiment/resolution labels, and optional PII masking.
How much does it cost to run the Voice AI Data Pipeline?
Pricing follows Apify’s consumption model: you pay for the Compute Units (CUs) and storage your runs use. The Actor does not scrape the web; it only processes input you provide, so runs are typically short and costs stay low. Free-tier CUs can be enough for small batches. Check the Apify pricing page and the Actor’s Pricing tab for current rates. Setting Language detection to apify runs the Language Detector Actor once per conversation, which increases cost for large runs.
Input
Voice AI Data Pipeline has the following input options. Use the Input tab on the Actor page to configure your run.
| Section | What it does |
|---|---|
| Source | raw (inline JSON), dataset (Apify dataset ID), or urls (list of public URLs). |
| Format | Field names for role, text, timestamp; map role values to customer and agent. |
| Privacy | Enable PII redaction, mask or remove, select PII types, optional hash linkage. |
| Processing | Normalize text, language detection (off / heuristic / apify), min/max turns, deduplicate. |
| Analytics | Toggle sentiment, intent/resolution heuristics, conversation metrics. |
| Export | dataset (default) or jsonl (Key-Value Store key). |
Example input (raw mode)
A full runnable example you can paste into the Actor or use locally is in examples/sample-input.json.
Minimal structure:
{"source": {"mode": "raw","rawConversations": [{"conversationId": "conv-1","messages": [{ "role": "customer", "text": "I need help with my order.", "ts": "2026-02-01T10:00:00Z" },{ "role": "agent", "text": "Sure. What's your order ID?", "ts": "2026-02-01T10:00:05Z" }]}]},"format": { "roleField": "role", "textField": "text", "timestampField": "ts" },"privacy": { "redactPII": true, "redactionMode": "mask", "piiTypes": ["email", "phone", "names"] },"processing": { "minTurns": 2, "maxTurns": 60 },"export": { "outputFormat": "dataset" }}
Output
You can download the dataset produced by the Voice AI Data Pipeline in formats such as JSON, CSV, Excel, or HTML from the run’s dataset or Key-Value Store in the Apify Console.
Each output item (one per conversation) looks like:
{"conversationId": "conv-1","language": "en","messages": [{ "role": "customer", "text": "...", "ts": "..." },{ "role": "agent", "text": "...", "ts": "..." }],"turns": 4,"metrics": { "turns": 4, "avgMessageLength": 26, "totalChars": 104, "avgResponseTimeMs": 5000 },"labels": { "intent": "support", "resolved": true, "sentiment": 0.14 },"privacy": { "redacted": true, "counts": { "email": 1, "phone": 1 } }}
When Language detection is set to apify, items may also include languageConfidence and, if enabled, languagePerMessage (array of language codes per message). Use messages for training and labels / metrics as features or targets.
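As a rough sketch of downstream use, output items can be flattened into prompt/response training pairs. The field names follow the output example above; the language and resolution filters are arbitrary assumptions you would adjust:

```typescript
// Rough sketch: turn pipeline output items into prompt/response training pairs.
// Field names follow the output example above; the filters are arbitrary choices.
interface PipelineItem {
  conversationId: string;
  language: string;
  messages: { role: string; text: string; ts?: string }[];
  labels?: { intent?: string; resolved?: boolean; sentiment?: number };
}

function toTrainingPairs(items: PipelineItem[]): { prompt: string; response: string }[] {
  const pairs: { prompt: string; response: string }[] = [];
  for (const item of items) {
    // Example filter: keep English conversations that were marked resolved.
    if (item.language !== 'en' || item.labels?.resolved !== true) continue;
    // Pair each customer message with the agent message that follows it.
    for (let i = 0; i + 1 < item.messages.length; i++) {
      const current = item.messages[i];
      const next = item.messages[i + 1];
      if (current.role === 'customer' && next.role === 'agent') {
        pairs.push({ prompt: current.text, response: next.text });
      }
    }
  }
  return pairs;
}
```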
Why use the Voice AI Data Pipeline?
- Voice and conversational AI training – Get dialogue data with PII removed and labels for intent, sentiment, and resolution in one run.
- Support and contact-center analytics – Ingest exported chats or datasets, normalize and anonymize, then analyze or re-export.
- Compliance-friendly prep – Redact sensitive fields before sending data to models or downstream systems (use with your own legal/compliance review).
Tips and advanced options
- Language detection – Use heuristic (default) for no extra actor calls; use apify for higher accuracy and confidence scores. Language per message is only available with apify and increases usage.
- Parquet – If you select Parquet as the output format, the Actor currently writes JSONL and logs a warning; convert to Parquet in your own workflow (e.g. Pandas, DuckDB) if needed (see the sketch after this list).
- Local runs – Use node run-local.js with input in storage/key_value_stores/default/INPUT.json; set "debug": true for extra console stats.
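For the Parquet tip above, one possible conversion path (not something the Actor does for you) is DuckDB. This sketch assumes you have downloaded the JSONL export from the Key-Value Store to a local file and installed the duckdb npm package; the file names are placeholders:

```typescript
import * as duckdb from 'duckdb'; // assumes the "duckdb" npm package is installed

// Convert a downloaded JSONL export to Parquet; file names are placeholders.
const db = new duckdb.Database(':memory:');

db.all(
  `COPY (SELECT * FROM read_json_auto('voice-ai-export.jsonl'))
   TO 'voice-ai-export.parquet' (FORMAT PARQUET)`,
  (err) => {
    if (err) throw err;
    console.log('Wrote voice-ai-export.parquet');
  },
);
```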
Privacy and PII redaction
- Detected (best-effort, regex-based): email, phone, credit card, IBAN, address patterns, and names. Enable/disable per type under Privacy → PII types.
- Replacements: [EMAIL], [PHONE], [CARD], [IBAN], [ADDRESS], [NAME] (a simplified illustration follows below).
- Limitations: No NER/ML; not HIPAA-compliant; non-Latin scripts and unusual formats may be missed. Use for data prep with your own legal/compliance review.
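For a feel of what mask mode produces, here is a deliberately simplified regex illustration; these patterns are examples only, not the Actor's actual rules:

```typescript
// Simplified illustration of mask-style redaction; not the Actor's actual patterns.
const maskers: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[EMAIL]'],
  [/\+?\d[\d\s().-]{7,}\d/g, '[PHONE]'],
];

function maskPII(text: string): string {
  return maskers.reduce((masked, [pattern, token]) => masked.replace(pattern, token), text);
}

console.log(maskPII('Reach me at jane.doe@example.com or +1 555 123 4567.'));
// -> "Reach me at [EMAIL] or [PHONE]."
```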
Is it legal to use the Voice AI Data Pipeline?
This Actor does not scrape or collect data by itself. It only processes:
- Raw conversations you provide in the input, or
- Apify datasets you specify by ID, or
- Public URLs you explicitly list.
You must use only data you are allowed to use (e.g. with consent or another lawful basis). Do not feed private or confidential content unless you have the right to do so. If your results contain personal data, consider GDPR and other regulations and consult legal advice if needed.
FAQ and support
- API – Call the Actor via the Apify API to integrate it into your pipelines, scripts, or other tools.
- Integrations – Use Apify’s integrations (Zapier, Make, n8n, etc.) to trigger runs or use output in other apps.
- Issues and feedback – Use the Issues tab on the Actor page to report bugs or suggest improvements; we’re open to feedback.
- Custom solutions – For custom or enterprise needs, contact the Actor maintainer or Apify.