LangExtract - Structured Data Extractor
Pricing
Pay per usage
Extract structured data from unstructured text using Google's LangExtract library & LLMs. Get entities, relationships & attributes in clean JSON with source grounding. Supports Gemini, OpenAI, Anthropic, Groq & custom endpoints. Perfect for parsing resumes, contracts, reports & documents.
Developer
Vivian Ferreira
LangExtract - Structured Data Extractor
Extract structured data from unstructured text using Google's LangExtract library powered by Large Language Models (LLMs). Transform messy documents into clean, organized JSON with precise source grounding.
What It Does
This Actor leverages Google's open-source LangExtract library to:
- Extract entities (people, organizations, products, locations, etc.) from any text
- Identify relationships and attributes between entities
- Ground extractions to exact positions in the source text
- Generate visual reports showing extractions in context
Use Cases
| Industry | Application |
|---|---|
| HR & Recruiting | Extract candidates, skills, and experience from resumes |
| Legal | Pull parties, dates, and clauses from contracts |
| Healthcare | Structure patient info from medical records |
| Finance | Extract entities from earnings reports and filings |
| Research | Parse academic papers for citations and findings |
| News & Media | Identify people, places, and events from articles |
| E-commerce | Extract product specs from descriptions |
Quick Start
Basic Example
Input:
{
  "provider": "Gemini",
  "apiKey": "YOUR_GEMINI_API_KEY",
  "model": "gemini-2.5-flash",
  "schema": "{\"prompt_description\": \"Extract all people and their job titles.\"}",
  "text": "Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO."
}
Output:
{
  "extraction": {
    "extractions": [
      {"class": "person", "text": "Sarah Chen", "attributes": {"role": "CEO", "organization": "Acme Corp"}},
      {"class": "person", "text": "Dr. James Wilson", "attributes": {"role": "CTO"}}
    ]
  }
}
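In the Basic Example input, the `schema` value is itself a JSON document passed as a string, which is why its inner quotes are escaped. The sketch below builds the same input in Python and lets `json.dumps` handle the escaping. `build_run_input` is a hypothetical helper written for illustration; it is not part of the Actor or of LangExtract.

```python
import json


def build_run_input(api_key: str, text: str, prompt: str) -> dict:
    """Assemble a run input shaped like the Basic Example (hypothetical helper)."""
    schema = {"prompt_description": prompt}
    return {
        "provider": "Gemini",
        "apiKey": api_key,
        "model": "gemini-2.5-flash",
        # The Basic Example passes the schema as a JSON string, so serialize it.
        "schema": json.dumps(schema),
        "text": text,
    }


run_input = build_run_input(
    api_key="YOUR_GEMINI_API_KEY",
    text="Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO.",
    prompt="Extract all people and their job titles.",
)
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted into the Actor's JSON input or passed as the run input through an Apify client library. The Actor ID needed for a programmatic call is not given in this document.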
Supported Providers & Models
Google Gemini (Recommended)
| Model | Best For |
|---|---|
| `gemini-2.5-flash` | Fast, affordable, great quality |
| `gemini-2.0-flash` | Balanced performance |
| `gemini-1.5-pro` | Complex extractions |
OpenAI
| Model | Best For |
|---|---|
| `gpt-4o` | Highest quality |
| `gpt-4o-mini` | Fast and affordable |
| `gpt-4-turbo` | Long documents |
Anthropic
| Model | Best For |
|---|---|
| `claude-3-5-sonnet-latest` | Best overall |
| `claude-3-haiku-latest` | Fast responses |
Groq (Fast Inference)
| Model | Best For |
|---|---|
| `llama-3.3-70b-versatile` | High quality |
| `llama-3.1-8b-instant` | Ultra-fast |
| `mixtral-8x7b-32768` | Long context |
| `gemma2-9b-it` | Efficient |
OpenAI-Compatible (Custom/Local)
Use any OpenAI-compatible API by providing a custom Base URL:
- Ollama: `http://localhost:11434/v1`
- Azure OpenAI: `https://YOUR_RESOURCE.openai.azure.com`
- Together AI, Anyscale, etc.
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `provider` | enum | Yes | LLM provider: Gemini, OpenAI, Anthropic, Groq, OpenAI-Compatible |
| `apiKey` | secret | Yes | Your API key for the selected provider |
| `model` | string | Yes | Model name (e.g., `gemini-2.5-flash`) |
| `schema` | json | Yes | Extraction instructions (see Schema section) |
| `text` | string | No | Raw text to process |
| `urls` | array | No | URLs to fetch and process |
| `baseUrl` | string | No | Custom API endpoint (for OpenAI-Compatible) |
| `systemPrompt` | string | No | Custom persona (e.g., "Act as a legal expert") |
| `batchMode` | boolean | No | Enable parallel processing (default: true) |
| `debug` | boolean | No | Save debug info to Key-Value Store |
| `trackTokens` | boolean | No | Report token usage (default: true) |
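The table above can be turned into a quick pre-flight check before starting a run. This is a minimal sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and two of its rules are assumptions rather than documented behavior (that a run needs at least one of `text` or `urls`, and that `baseUrl` is needed when the provider is OpenAI-Compatible).

```python
REQUIRED = ("provider", "apiKey", "model", "schema")
PROVIDERS = {"Gemini", "OpenAI", "Anthropic", "Groq", "OpenAI-Compatible"}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED
        if not run_input.get(name)
    ]
    provider = run_input.get("provider")
    if provider and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    # Assumption: at least one source (text or urls) is needed for a useful run.
    if not run_input.get("text") and not run_input.get("urls"):
        problems.append("provide text or urls")
    # Assumption: OpenAI-Compatible runs need a custom endpoint.
    if provider == "OpenAI-Compatible" and not run_input.get("baseUrl"):
        problems.append("baseUrl is needed for OpenAI-Compatible")
    return problems


for problem in validate_input({"provider": "Groq", "model": "llama-3.1-8b-instant"}):
    print(problem)
```

Returning a list instead of raising on the first failure lets a caller report every problem in one pass.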
Schema Format
Simple Format
Just describe what you want to extract:
{"prompt_description": "Extract all people mentioned with their job titles and organizations."}
Advanced Format (with examples)
Provide examples for better accuracy:
{
  "prompt_description": "Extract people and their roles. Use exact text from the source.",
  "examples": [
    {
      "text": "John Smith is the CEO of TechCorp.",
      "extractions": [
        {
          "extraction_class": "person",
          "extraction_text": "John Smith",
          "attributes": {"role": "CEO", "organization": "TechCorp"}
        }
      ]
    }
  ]
}
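Because the prompt in the advanced schema asks for exact text from the source, it helps if the examples follow the same rule. The sketch below builds the advanced schema in Python and rejects any example whose `extraction_text` does not appear verbatim in its `text`. `make_example` is a hypothetical helper, and the verbatim check is a precaution assumed here, not a rule stated by the Actor.

```python
import json


def make_example(text: str, extractions: list[dict]) -> dict:
    """Build one few-shot example, rejecting spans that are not verbatim in the text."""
    for item in extractions:
        if item["extraction_text"] not in text:
            raise ValueError(
                f"{item['extraction_text']!r} does not appear in the example text"
            )
    return {"text": text, "extractions": extractions}


schema = {
    "prompt_description": "Extract people and their roles. Use exact text from the source.",
    "examples": [
        make_example(
            "John Smith is the CEO of TechCorp.",
            [
                {
                    "extraction_class": "person",
                    "extraction_text": "John Smith",
                    "attributes": {"role": "CEO", "organization": "TechCorp"},
                }
            ],
        )
    ],
}

# Serialize for the Actor's schema field.
schema_json = json.dumps(schema)
print(schema_json)
```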
Output
Dataset
Each extraction is pushed to the Apify Dataset:
{
  "source_index": 0,
  "extraction": {
    "extractions": [
      {"class": "person", "text": "...", "attributes": {...}, "position": {"start": 0, "end": 10}}
    ],
    "source_text": "..."
  },
  "status": "success"
}
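The `position` offsets make it possible to confirm that an extraction really is grounded in `source_text`. The following sketch assumes that `start` and `end` are character offsets with an exclusive end, which the dataset example suggests but does not state. `check_grounding` is a hypothetical helper, and the sample item is illustrative data modeled on the Quick Start text, not real Actor output.

```python
def check_grounding(item: dict) -> list[tuple[str, bool]]:
    """Pair each extracted text with whether its offsets reproduce it from source_text."""
    result = item["extraction"]
    source = result["source_text"]
    checks = []
    for ext in result["extractions"]:
        pos = ext.get("position")
        # Assumption: start/end are character offsets with an exclusive end.
        grounded = bool(pos) and source[pos["start"]:pos["end"]] == ext["text"]
        checks.append((ext["text"], grounded))
    return checks


# Illustrative dataset item, shaped like the example above.
sample = {
    "source_index": 0,
    "extraction": {
        "extractions": [
            {
                "class": "person",
                "text": "Sarah Chen",
                "attributes": {"role": "CEO"},
                "position": {"start": 0, "end": 10},
            }
        ],
        "source_text": "Sarah Chen was appointed CEO of Acme Corp.",
    },
    "status": "success",
}

print(check_grounding(sample))  # [('Sarah Chen', True)]
```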
Key-Value Store
- `EXTRACTION_REPORT.html` - Interactive visual report
- `TOKEN_USAGE` - Token count and cost estimate (if enabled)
- `DEBUG_*` - Debug info (if enabled)
Security
- API keys are encrypted and never logged
- Data is processed transiently - not stored beyond your run
- Runs in sandboxed containers on Apify infrastructure
Tips
- Better results: Provide examples in your schema for complex extractions
- Faster runs: Use `gemini-2.5-flash` or `llama-3.1-8b-instant` for speed
- Long documents: The Actor automatically chunks and processes large texts
- Multiple sources: Add multiple URLs or texts for batch processing
- Custom personas: Use `systemPrompt` to guide extraction style
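The tip on long documents mentions automatic chunking. How the Actor chunks text is not documented here, so the following is only a generic illustration of the idea: fixed-size windows with a small overlap, each tagged with its starting offset so positions found inside a chunk can be mapped back to the full text. The function name, window size, and overlap are all assumptions.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[tuple[int, str]]:
    """Split text into overlapping windows, returning (start_offset, chunk) pairs."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + max_chars]))
        # Stop once the current window reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


document = "word " * 500  # 2,500 characters of filler text
for offset, chunk in chunk_text(document):
    print(offset, len(chunk))
```

A position found at index `i` inside a chunk corresponds to index `offset + i` in the original document, which is what keeps source grounding intact across chunks in this scheme.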
License
Apache 2.0 - Built with Google's LangExtract library.
