LangExtract - Structured Data Extractor
Pricing
Pay per usage
Extract structured data from unstructured text using Google's LangExtract library & LLMs. Get entities, relationships & attributes in clean JSON with source grounding. Supports Gemini, OpenAI, Anthropic, Groq & custom endpoints. Perfect for parsing resumes, contracts, reports & documents.
Developer

Vivian Ferreira
LangExtract - Structured Data Extractor
Extract structured data from unstructured text using Google's LangExtract library powered by Large Language Models (LLMs). Transform messy documents into clean, organized JSON with precise source grounding.
What It Does
This Actor leverages Google's open-source LangExtract library to:
- Extract entities (people, organizations, products, locations, etc.) from any text
- Identify relationships and attributes between entities
- Ground extractions to exact positions in the source text
- Generate visual reports showing extractions in context
Use Cases
| Industry | Application |
|---|---|
| HR & Recruiting | Extract candidates, skills, and experience from resumes |
| Legal | Pull parties, dates, and clauses from contracts |
| Healthcare | Structure patient info from medical records |
| Finance | Extract entities from earnings reports and filings |
| Research | Parse academic papers for citations and findings |
| News & Media | Identify people, places, and events from articles |
| E-commerce | Extract product specs from descriptions |
Quick Start
Basic Example
Input:
{"provider": "Gemini","apiKey": "YOUR_GEMINI_API_KEY","model": "gemini-2.5-flash","schema": "{\"prompt_description\": \"Extract all people and their job titles.\"}","text": "Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO."}
Output:
{"extraction": {"extractions": [{"class": "person", "text": "Sarah Chen", "attributes": {"role": "CEO", "organization": "Acme Corp"}},{"class": "person", "text": "Dr. James Wilson", "attributes": {"role": "CTO"}}]}}
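The `schema` value in the input above is a JSON document encoded as a string, which is why its inner quotes are escaped. A minimal Python sketch for building that input without hand-escaping follows. `build_run_input` is a hypothetical helper for illustration, not part of the Actor or of LangExtract.

```python
import json


def build_run_input(provider, api_key, model, prompt_description, text):
    """Assemble an Actor input dict shaped like the Quick Start example."""
    # The schema is serialized to a string, mirroring the escaped JSON shown above.
    schema = json.dumps({"prompt_description": prompt_description})
    return {
        "provider": provider,
        "apiKey": api_key,
        "model": model,
        "schema": schema,
        "text": text,
    }


run_input = build_run_input(
    provider="Gemini",
    api_key="YOUR_GEMINI_API_KEY",
    model="gemini-2.5-flash",
    prompt_description="Extract all people and their job titles.",
    text="Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO.",
)
print(json.dumps(run_input, indent=2))
```

Letting `json.dumps` produce the string avoids quoting mistakes when the prompt itself contains quotes or newlines.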
Supported Providers & Models
Google Gemini (Recommended)
| Model | Best For |
|---|---|
| `gemini-2.5-flash` | Fast, affordable, great quality |
| `gemini-2.0-flash` | Balanced performance |
| `gemini-1.5-pro` | Complex extractions |
OpenAI
| Model | Best For |
|---|---|
| `gpt-4o` | Highest quality |
| `gpt-4o-mini` | Fast and affordable |
| `gpt-4-turbo` | Long documents |
Anthropic
| Model | Best For |
|---|---|
| `claude-3-5-sonnet-latest` | Best overall |
| `claude-3-haiku-latest` | Fast responses |
Groq (Fast Inference)
| Model | Best For |
|---|---|
| `llama-3.3-70b-versatile` | High quality |
| `llama-3.1-8b-instant` | Ultra-fast |
| `mixtral-8x7b-32768` | Long context |
| `gemma2-9b-it` | Efficient |
OpenAI-Compatible (Custom/Local)
Use any OpenAI-compatible API by providing a custom Base URL:
- Ollama: `http://localhost:11434/v1`
- Azure OpenAI: `https://YOUR_RESOURCE.openai.azure.com`
- Together AI, Anyscale, etc.
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `provider` | enum | Yes | LLM provider: Gemini, OpenAI, Anthropic, Groq, OpenAI-Compatible |
| `apiKey` | secret | Yes | Your API key for the selected provider |
| `model` | string | Yes | Model name (e.g., `gemini-2.5-flash`) |
| `schema` | json | Yes | Extraction instructions (see Schema section) |
| `text` | string | No | Raw text to process |
| `urls` | array | No | URLs to fetch and process |
| `baseUrl` | string | No | Custom API endpoint (for OpenAI-Compatible) |
| `systemPrompt` | string | No | Custom persona (e.g., "Act as a legal expert") |
| `batchMode` | boolean | No | Enable parallel processing (default: true) |
| `debug` | boolean | No | Save debug info to Key-Value Store |
| `trackTokens` | boolean | No | Report token usage (default: true) |
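A rough pre-flight check for the parameters in the table above can catch mistakes before a paid run. The required fields and provider names come from the table. Two rules are assumptions inferred from the descriptions rather than documented behavior: that at least one of `text` or `urls` must be present, and that `baseUrl` is needed for `OpenAI-Compatible`. `check_input` is a hypothetical helper.

```python
REQUIRED = ("provider", "apiKey", "model", "schema")
PROVIDERS = {"Gemini", "OpenAI", "Anthropic", "Groq", "OpenAI-Compatible"}


def check_input(run_input):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED
        if not run_input.get(name)
    ]
    provider = run_input.get("provider")
    if provider and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    # Assumption: the Actor needs at least one source to process.
    if not run_input.get("text") and not run_input.get("urls"):
        problems.append("provide text or urls")
    # Assumption: a custom endpoint is needed for OpenAI-Compatible.
    if provider == "OpenAI-Compatible" and not run_input.get("baseUrl"):
        problems.append("baseUrl is needed for OpenAI-Compatible")
    return problems


example = {
    "provider": "Groq",
    "apiKey": "YOUR_GROQ_API_KEY",
    "model": "llama-3.1-8b-instant",
    "schema": '{"prompt_description": "Extract all people."}',
    "urls": ["https://example.com/article"],
}
print(check_input(example))
```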
Schema Format
Simple Format
Just describe what you want to extract:
{"prompt_description": "Extract all people mentioned with their job titles and organizations."}
Advanced Format (with examples)
Provide examples for better accuracy:
{"prompt_description": "Extract people and their roles. Use exact text from the source.","examples": [{"text": "John Smith is the CEO of TechCorp.","extractions": [{"extraction_class": "person","extraction_text": "John Smith","attributes": {"role": "CEO", "organization": "TechCorp"}}]}]}
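Because the prompt above asks for exact text from the source, each `extraction_text` in an example should appear verbatim in that example's `text`. The sketch below checks this before a run. `find_ungrounded_examples` is a hypothetical helper, and the check is not something the Actor is documented to perform.

```python
def find_ungrounded_examples(schema):
    """Return (example_index, extraction_text) pairs not found verbatim in the example text."""
    misses = []
    for i, example in enumerate(schema.get("examples", [])):
        for extraction in example.get("extractions", []):
            snippet = extraction["extraction_text"]
            if snippet not in example["text"]:
                misses.append((i, snippet))
    return misses


schema = {
    "prompt_description": "Extract people and their roles. Use exact text from the source.",
    "examples": [
        {
            "text": "John Smith is the CEO of TechCorp.",
            "extractions": [
                {
                    "extraction_class": "person",
                    "extraction_text": "John Smith",
                    "attributes": {"role": "CEO", "organization": "TechCorp"},
                }
            ],
        }
    ],
}
print(find_ungrounded_examples(schema))
```

An empty result means every example extraction can be located in its source text. A paraphrased or misspelled `extraction_text` would be reported with the index of the example that contains it.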
Output
Dataset
Each extraction is pushed to the Apify Dataset:
{"source_index": 0,"extraction": {"extractions": [{"class": "person", "text": "...", "attributes": {...}, "position": {"start": 0, "end": 10}}],"source_text": "..."},"status": "success"}
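The `position` field can be used to check an extraction against `source_text`. This sketch assumes `start` and `end` are character offsets with an exclusive `end`. That matches the sample above (0 to 10 for a ten-character span) but is not stated explicitly. `grounded_span` is a hypothetical helper, and the item values are made up for illustration.

```python
def grounded_span(item, index):
    """Return (span_from_source, matches) for one extraction in a dataset item."""
    extraction = item["extraction"]["extractions"][index]
    source = item["extraction"]["source_text"]
    pos = extraction["position"]
    span = source[pos["start"]:pos["end"]]  # assumed: end is exclusive
    return span, span == extraction["text"]


# Illustrative item following the dataset shape above.
item = {
    "source_index": 0,
    "extraction": {
        "extractions": [
            {
                "class": "person",
                "text": "Sarah Chen",
                "attributes": {"role": "CEO"},
                "position": {"start": 0, "end": 10},
            }
        ],
        "source_text": "Sarah Chen was appointed CEO of Acme Corp.",
    },
    "status": "success",
}
print(grounded_span(item, 0))
```

If the second value is `False` for real output, the offsets may follow a different convention (for example an inclusive end), so it is worth checking one item by hand before relying on the slice.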
Key-Value Store
- `EXTRACTION_REPORT.html` - Interactive visual report
- `TOKEN_USAGE` - Token count and cost estimate (if enabled)
- `DEBUG_*` - Debug info (if enabled)
Security
- API keys are encrypted and never logged
- Data is processed transiently - not stored beyond your run
- Runs in sandboxed containers on Apify infrastructure
Tips
- Better results: Provide examples in your schema for complex extractions
- Faster runs: Use `gemini-2.5-flash` or `llama-3.1-8b-instant` for speed
- Long documents: The Actor automatically chunks and processes large texts
- Multiple sources: Add multiple URLs or texts for batch processing
- Custom personas: Use `systemPrompt` to guide extraction style
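The Actor's chunking strategy is not described here. Purely as an illustration of how long text can be split while keeping offsets that map back to the original, here is a simple fixed-size splitter with overlap. `chunk_text`, `max_chars`, and `overlap` are hypothetical names, and the Actor or LangExtract may chunk differently.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into (start_offset, chunk) pairs using overlapping windows."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append((start, text[start:end]))
        if end == len(text):
            break
        # Step back so text near a boundary appears in both neighboring chunks.
        start = end - overlap
    return chunks


sample = "abcdefghij" * 3  # 30 characters
for offset, chunk in chunk_text(sample, max_chars=12, overlap=2):
    print(offset, chunk)
```

A position found inside a chunk maps back to the full document as `offset + local_position`, which is what keeps source grounding intact across chunks.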
License
Apache 2.0 - Built with Google's LangExtract library.
