Voice AI Data Pipeline
Turn voice and chat transcripts into training-ready datasets. Anonymize PII, normalize dialogue, detect language, and add lightweight intent, sentiment, and resolution labels. No scraping — processes only data you provide.
Pricing: from $0.10 / 1,000 results
Developer: Hayder Al-Khalissi
What does the Voice AI Data Pipeline do?
The Voice AI Data Pipeline Actor turns conversation transcripts into training-ready datasets for voice and conversational AI. You provide conversations (inline JSON, an Apify dataset ID, or a list of public URLs), and the Actor anonymizes PII, normalizes dialogue, runs language detection and lightweight analytics (sentiment, intent, resolution), then exports one structured item per conversation to an Apify dataset or JSONL. It does not scrape or collect data by itself—it only processes data you supply or explicitly point to.
What can the Voice AI Data Pipeline do?
- Ingest from multiple sources – Paste raw conversations (JSON), connect an Apify dataset by ID, or pass a list of public URLs that return JSON, CSV, or plain text.
- PII redaction – Detects and masks email, phone, credit card, IBAN, address patterns, and names (configurable). Optional hash linkage keeps same-entity tracking without storing originals.
- Structured, normalized output – Canonical customer / agent roles, optional timestamps, turn limits, and deduplication by conversation ID.
- Language detection – Built-in heuristic (no external calls) or optional Apify Language Detector for ML-based detection and confidence scores. Locale-aware intent, sentiment, and resolution heuristics (en, de, fr, es).
- Lightweight analytics – Sentiment score, intent (billing, support, cancel, booking, complaint), resolution heuristic, and conversation metrics (turns, length, response time).
- Flexible export – Default Apify dataset or JSONL in Key-Value Store (e.g. for downstream training pipelines).
- Platform advantages – Run the Actor on a schedule, call it via the Apify API, plug it into workflows (Zapier, Make, n8n), and monitor runs in the Apify Console. You only pay for the compute and storage you use.
What data can the Voice AI Data Pipeline produce?
| Data | Description |
|---|---|
| conversationId | Unique ID per conversation |
| language | Detected language code (e.g. en, de); optional confidence when using Apify Language Detector |
| messages | Normalized messages with role, text, optional timestamp |
| turns | Number of messages |
| metrics | Turns, average message length (characters), total characters, average response time (ms) |
| labels | Intent, resolved (boolean), sentiment score |
| privacy | Redaction flag and counts per PII type; optional hash linkage data |
How do I use the Voice AI Data Pipeline to prepare conversation data?
- Open the Actor on Apify and go to the Input tab.
- Choose Source: raw (paste JSON), dataset (Apify dataset ID), or urls (list of public URLs).
- Set Format (role/text/timestamp field names and role mapping) if your data uses different keys.
- Configure Privacy (enable PII redaction, mask or remove, which PII types).
- Set Processing (min/max turns, language detection: off, heuristic, or Apify).
- Choose Export (dataset or JSONL key).
- Run the Actor. Results appear in the run’s dataset or Key-Value Store.
For raw mode, provide an array of conversations; each has conversationId (or id) and messages (array of { role, text, ts? }). See the Input section below for an example.
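If it helps to see the expected shape as types, here is a minimal TypeScript sketch based on the default field names described above (the Format section lets you remap them):

```typescript
// Minimal sketch of the raw-mode input shape, assuming the default field names
// (role / text / ts). Remap them in the Format section if your data uses other keys.
interface RawMessage {
  role: string;   // mapped to "customer" or "agent" via the role mapping
  text: string;
  ts?: string;    // optional ISO 8601 timestamp
}

interface RawConversation {
  conversationId?: string; // "id" is also accepted
  id?: string;
  messages: RawMessage[];
}

// source.rawConversations is an array of these conversations
type RawConversations = RawConversation[];
```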
Test the Actor with a ready-made example – Use the sample input in the repo to run a quick test:
- On Apify: Open the Actor → Input tab → switch to JSON view → paste the contents of examples/sample-input.json → Start.
- Locally: Copy examples/sample-input.json to storage/key_value_stores/default/INPUT.json, then run node run-local.js.
The sample includes 4 conversations: English (billing + resolution), German (support + resolution), one with PII (email/phone) to see redaction, and a short support exchange. You’ll get language detection, intent/sentiment/resolution labels, and optional PII masking.
How much does it cost to run the Voice AI Data Pipeline?
Pricing follows Apify’s consumption model: you pay for the Compute Units (CUs) and storage your runs use. The Actor does not scrape the web; it only processes input you provide, so runs are typically short and costs stay low. Free-tier CUs can be enough for small batches. Check the Apify pricing page and the Actor’s Pricing tab for current rates. Setting Language detection to apify runs the Language Detector Actor once per conversation, which increases cost for large runs.
Input
Voice AI Data Pipeline has the following input options. Use the Input tab on the Actor page to configure your run.
| Section | What it does |
|---|---|
| Source | raw (inline JSON), dataset (Apify dataset ID), or urls (list of public URLs). |
| Format | Field names for role, text, timestamp; map role values to customer and agent. |
| Privacy | Enable PII redaction, mask or remove, select PII types, optional hash linkage. |
| Processing | Normalize text, language detection (off / heuristic / apify), min/max turns, deduplicate. |
| Analytics | Toggle sentiment, intent/resolution heuristics, conversation metrics. |
| Export | dataset (default) or jsonl (Key-Value Store key). |
Example input (raw mode)
A full runnable example you can paste into the Actor or use locally is in examples/sample-input.json.
Minimal structure:
{"source": {"mode": "raw","rawConversations": [{"conversationId": "conv-1","messages": [{ "role": "customer", "text": "I need help with my order.", "ts": "2026-02-01T10:00:00Z" },{ "role": "agent", "text": "Sure. What's your order ID?", "ts": "2026-02-01T10:00:05Z" }]}]},"format": { "roleField": "role", "textField": "text", "timestampField": "ts" },"privacy": { "redactPII": true, "redactionMode": "mask", "piiTypes": ["email", "phone", "names"] },"processing": { "minTurns": 2, "maxTurns": 60 },"export": { "outputFormat": "dataset" }}
Output
You can download the dataset produced by the Voice AI Data Pipeline in formats such as JSON, CSV, Excel, or HTML from the run’s dataset or Key-Value Store in the Apify Console.
Each output item (one per conversation) looks like:
{"conversationId": "conv-1","language": "en","messages": [{ "role": "customer", "text": "...", "ts": "..." },{ "role": "agent", "text": "...", "ts": "..." }],"turns": 4,"metrics": { "turns": 4, "avgMessageLength": 26, "totalChars": 104, "avgResponseTimeMs": 5000 },"labels": { "intent": "support", "resolved": true, "sentiment": 0.14 },"privacy": { "redacted": true, "counts": { "email": 1, "phone": 1 } }}
When Language detection is set to apify, items may also include languageConfidence and, if enabled, languagePerMessage (array of language codes per message). Use messages for training and labels / metrics as features or targets.
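As a rough sketch of downstream use, output items can be flattened into prompt/response training pairs. The field names follow the output example above; the language and resolution filters are arbitrary assumptions you would adjust:

```typescript
// Rough sketch: turn pipeline output items into prompt/response training pairs.
// Field names follow the output example above; the filters are arbitrary choices.
interface PipelineItem {
  conversationId: string;
  language: string;
  messages: { role: string; text: string; ts?: string }[];
  labels?: { intent?: string; resolved?: boolean; sentiment?: number };
}

function toTrainingPairs(items: PipelineItem[]): { prompt: string; response: string }[] {
  const pairs: { prompt: string; response: string }[] = [];
  for (const item of items) {
    // Example filter: keep English conversations that were marked resolved.
    if (item.language !== 'en' || item.labels?.resolved !== true) continue;
    // Pair each customer message with the agent message that follows it.
    for (let i = 0; i + 1 < item.messages.length; i++) {
      const current = item.messages[i];
      const next = item.messages[i + 1];
      if (current.role === 'customer' && next.role === 'agent') {
        pairs.push({ prompt: current.text, response: next.text });
      }
    }
  }
  return pairs;
}
```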
Why use the Voice AI Data Pipeline?
- Voice and conversational AI training – Get dialogue data with PII removed and labels for intent, sentiment, and resolution in one run.
- Support and contact-center analytics – Ingest exported chats or datasets, normalize and anonymize, then analyze or re-export.
- Compliance-friendly prep – Redact sensitive fields before sending data to models or downstream systems (use with your own legal/compliance review).
Tips and advanced options
- Language detection – Use heuristic (default) for no extra actor calls; use apify for higher accuracy and confidence scores. Language per message is only available with apify and increases usage.
- Parquet – If you select Parquet as the output format, the Actor currently writes JSONL and logs a warning; convert to Parquet in your own workflow (e.g. Pandas, DuckDB) if needed (see the sketch after this list).
- Local runs – Use node run-local.js with input in storage/key_value_stores/default/INPUT.json; set "debug": true for extra console stats.
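For the Parquet tip above, one possible conversion path (not something the Actor does for you) is DuckDB. This sketch assumes you have downloaded the JSONL export from the Key-Value Store to a local file and installed the duckdb npm package; the file names are placeholders:

```typescript
import * as duckdb from 'duckdb'; // assumes the "duckdb" npm package is installed

// Convert a downloaded JSONL export to Parquet; file names are placeholders.
const db = new duckdb.Database(':memory:');

db.all(
  `COPY (SELECT * FROM read_json_auto('voice-ai-export.jsonl'))
   TO 'voice-ai-export.parquet' (FORMAT PARQUET)`,
  (err) => {
    if (err) throw err;
    console.log('Wrote voice-ai-export.parquet');
  },
);
```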
Privacy and PII redaction
- Detected (best-effort, regex-based): email, phone, credit card, IBAN, address patterns, and names. Enable/disable per type under Privacy → PII types.
- Replacements: [EMAIL], [PHONE], [CARD], [IBAN], [ADDRESS], [NAME] (a simplified illustration follows below).
- Limitations: No NER/ML; not HIPAA-compliant; non-Latin scripts and unusual formats may be missed. Use for data prep with your own legal/compliance review.
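For a feel of what mask mode produces, here is a deliberately simplified regex illustration; these patterns are examples only, not the Actor's actual rules:

```typescript
// Simplified illustration of mask-style redaction; not the Actor's actual patterns.
const maskers: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[EMAIL]'],
  [/\+?\d[\d\s().-]{7,}\d/g, '[PHONE]'],
];

function maskPII(text: string): string {
  return maskers.reduce((masked, [pattern, token]) => masked.replace(pattern, token), text);
}

console.log(maskPII('Reach me at jane.doe@example.com or +1 555 123 4567.'));
// -> "Reach me at [EMAIL] or [PHONE]."
```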
Is it legal to use the Voice AI Data Pipeline?
This Actor does not scrape or collect data by itself. It only processes:
- Raw conversations you provide in the input, or
- Apify datasets you specify by ID, or
- Public URLs you explicitly list.
You must use only data you are allowed to use (e.g. with consent or another lawful basis). Do not feed private or confidential content unless you have the right to do so. If your results contain personal data, consider GDPR and other regulations and consult legal advice if needed.
FAQ and support
- API – Call the Actor via the Apify API to integrate it into your pipelines, scripts, or other tools.
- Integrations – Use Apify’s integrations (Zapier, Make, n8n, etc.) to trigger runs or use output in other apps.
- Issues and feedback – Use the Issues tab on the Actor page to report bugs or suggest improvements; we’re open to feedback.
- Custom solutions – For custom or enterprise needs, contact the Actor maintainer or Apify.