LangExtract - Structured Data Extractor

Pricing

Pay per usage

Extract structured data from unstructured text using Google's LangExtract library and LLMs. Get entities, relationships, and attributes in clean JSON with source grounding. Supports Gemini, OpenAI, Anthropic, Groq, and custom endpoints. Suited to parsing resumes, contracts, reports, and other documents.

Developer

Vivian Ferreira

Maintained by Community

Last modified: 15 days ago

🔍 LangExtract - Structured Data Extractor

Extract structured data from unstructured text using Google's LangExtract library powered by Large Language Models (LLMs). Transform messy documents into clean, organized JSON with precise source grounding.

✨ What It Does

This Actor leverages Google's open-source LangExtract library to:

  • Extract entities (people, organizations, products, locations, etc.) from any text
  • Identify relationships between entities and the attributes of each
  • Ground extractions to exact positions in the source text
  • Generate visual reports showing extractions in context

🎯 Use Cases

| Industry | Application |
| --- | --- |
| HR & Recruiting | Extract candidates, skills, and experience from resumes |
| Legal | Pull parties, dates, and clauses from contracts |
| Healthcare | Structure patient info from medical records |
| Finance | Extract entities from earnings reports and filings |
| Research | Parse academic papers for citations and findings |
| News & Media | Identify people, places, and events from articles |
| E-commerce | Extract product specs from descriptions |

🚀 Quick Start

Basic Example

Input:

{
  "provider": "Gemini",
  "apiKey": "YOUR_GEMINI_API_KEY",
  "model": "gemini-2.5-flash",
  "schema": "{\"prompt_description\": \"Extract all people and their job titles.\"}",
  "text": "Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO."
}

Output:

{
  "extraction": {
    "extractions": [
      {"class": "person", "text": "Sarah Chen", "attributes": {"role": "CEO", "organization": "Acme Corp"}},
      {"class": "person", "text": "Dr. James Wilson", "attributes": {"role": "CTO"}}
    ]
  }
}
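
In the input above, the `schema` field is a JSON document serialized into a string, which is why its inner quotes are escaped. As a minimal sketch, assuming you assemble the input in Python, `json.dumps` can handle that escaping for you. The `build_input` helper is a hypothetical name used for illustration and is not part of the Actor.

```python
import json


def build_input(text: str, prompt_description: str) -> dict:
    """Assemble an Actor input dict (hypothetical helper, not part of the Actor)."""
    schema = {"prompt_description": prompt_description}
    return {
        "provider": "Gemini",
        "apiKey": "YOUR_GEMINI_API_KEY",  # placeholder, replace with a real key
        "model": "gemini-2.5-flash",
        # The Quick Start input passes the schema as a JSON-encoded string,
        # so serialize the dict instead of escaping quotes by hand.
        "schema": json.dumps(schema),
        "text": text,
    }


actor_input = build_input(
    "Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO.",
    "Extract all people and their job titles.",
)
print(json.dumps(actor_input, indent=2))
```

Serializing the schema this way avoids the hand-escaped quotes that are easy to get wrong when the prompt itself contains quotation marks.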

🤖 Supported Providers & Models

Gemini

| Model | Best For |
| --- | --- |
| gemini-2.5-flash | Fast, affordable, great quality |
| gemini-2.0-flash | Balanced performance |
| gemini-1.5-pro | Complex extractions |

OpenAI

| Model | Best For |
| --- | --- |
| gpt-4o | Highest quality |
| gpt-4o-mini | Fast and affordable |
| gpt-4-turbo | Long documents |

Anthropic

| Model | Best For |
| --- | --- |
| claude-3-5-sonnet-latest | Best overall |
| claude-3-haiku-latest | Fast responses |

Groq (Fast Inference)

| Model | Best For |
| --- | --- |
| llama-3.3-70b-versatile | High quality |
| llama-3.1-8b-instant | Ultra-fast |
| mixtral-8x7b-32768 | Long context |
| gemma2-9b-it | Efficient |

OpenAI-Compatible (Custom/Local)

Use any OpenAI-compatible API by providing a custom Base URL:

  • Ollama: http://localhost:11434/v1
  • Azure OpenAI: https://YOUR_RESOURCE.openai.azure.com
  • Together AI, Anyscale, etc.
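
A custom endpoint is selected by combining the `OpenAI-Compatible` provider with the `baseUrl` parameter. The sketch below shows one way such an input might be put together for the Ollama URL listed above. The helper name, the model name `llama3.1`, and the placeholder API key are all assumptions for illustration; use whatever your endpoint actually serves and accepts.

```python
import json


def build_compatible_input(base_url: str, model: str, api_key: str,
                           text: str, prompt: str) -> dict:
    """Build an input for an OpenAI-compatible endpoint (hypothetical helper)."""
    return {
        "provider": "OpenAI-Compatible",
        "apiKey": api_key,
        "model": model,
        "baseUrl": base_url,  # custom API endpoint
        "schema": json.dumps({"prompt_description": prompt}),
        "text": text,
    }


ollama_input = build_compatible_input(
    base_url="http://localhost:11434/v1",
    model="llama3.1",   # assumed model name, not taken from this README
    api_key="ollama",   # assumed placeholder, since apiKey is a required field
    text="Sarah Chen was appointed CEO of Acme Corp.",
    prompt="Extract all people and their job titles.",
)
print(ollama_input["baseUrl"])
```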

📥 Input Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| provider | enum | Yes | LLM provider: Gemini, OpenAI, Anthropic, Groq, OpenAI-Compatible |
| apiKey | secret | Yes | Your API key for the selected provider |
| model | string | Yes | Model name (e.g., gemini-2.5-flash) |
| schema | json | Yes | Extraction instructions (see Schema section) |
| text | string | No | Raw text to process |
| urls | array | No | URLs to fetch and process |
| baseUrl | string | No | Custom API endpoint (for OpenAI-Compatible) |
| systemPrompt | string | No | Custom persona (e.g., "Act as a legal expert") |
| batchMode | boolean | No | Enable parallel processing (default: true) |
| debug | boolean | No | Save debug info to Key-Value Store |
| trackTokens | boolean | No | Report token usage (default: true) |
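
The parameter list above can be turned into a quick local check before starting a run. This is a minimal sketch, not the Actor's own validation: the required fields and provider names come from the table, while the two rules marked as assumptions (at least one of `text` or `urls`, and `baseUrl` for `OpenAI-Compatible`) are guesses about sensible usage rather than documented behavior.

```python
PROVIDERS = {"Gemini", "OpenAI", "Anthropic", "Groq", "OpenAI-Compatible"}
REQUIRED = ("provider", "apiKey", "model", "schema")


def check_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks plausible."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED
        if not actor_input.get(name)
    ]
    provider = actor_input.get("provider")
    if provider and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    # Assumption: a run needs at least one source, either `text` or `urls`.
    if not actor_input.get("text") and not actor_input.get("urls"):
        problems.append("provide text or urls")
    # Assumption: the OpenAI-Compatible provider needs a custom endpoint.
    if provider == "OpenAI-Compatible" and not actor_input.get("baseUrl"):
        problems.append("baseUrl is expected for OpenAI-Compatible")
    return problems


print(check_input({"provider": "Gemini", "apiKey": "KEY", "model": "gemini-2.5-flash"}))
```

The example call omits `schema` and any source, so it reports two problems.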

📝 Schema Format

Simple Format

Just describe what you want to extract:

{
  "prompt_description": "Extract all people mentioned with their job titles and organizations."
}

Advanced Format (with examples)

Provide examples for better accuracy:

{
  "prompt_description": "Extract people and their roles. Use exact text from the source.",
  "examples": [
    {
      "text": "John Smith is the CEO of TechCorp.",
      "extractions": [
        {
          "extraction_class": "person",
          "extraction_text": "John Smith",
          "attributes": {"role": "CEO", "organization": "TechCorp"}
        }
      ]
    }
  ]
}
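
Because the prompt asks for exact text from the source, it helps if the examples follow the same rule. The sketch below builds the advanced schema in Python and rejects any example whose `extraction_text` does not appear verbatim in its `text`. The `make_example` helper is hypothetical, and the verbatim check is a convention adopted here, not something the Actor is documented to enforce.

```python
import json


def make_example(text: str, extractions: list[dict]) -> dict:
    """Build one schema example, checking each extraction is a verbatim substring."""
    for item in extractions:
        if item["extraction_text"] not in text:
            raise ValueError(
                f"{item['extraction_text']!r} does not appear verbatim in the example text"
            )
    return {"text": text, "extractions": extractions}


schema = {
    "prompt_description": "Extract people and their roles. Use exact text from the source.",
    "examples": [
        make_example(
            "John Smith is the CEO of TechCorp.",
            [
                {
                    "extraction_class": "person",
                    "extraction_text": "John Smith",
                    "attributes": {"role": "CEO", "organization": "TechCorp"},
                }
            ],
        )
    ],
}

# Serialized form, matching how the Quick Start input passes the schema.
schema_field = json.dumps(schema)
print(schema_field)
```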

📤 Output

Dataset

Each extraction is pushed to the Apify Dataset:

{
  "source_index": 0,
  "extraction": {
    "extractions": [
      {"class": "person", "text": "...", "attributes": {...}, "position": {"start": 0, "end": 10}}
    ],
    "source_text": "..."
  },
  "status": "success"
}
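
The `position` field is what makes an extraction traceable to the source. As a sketch of how a consumer might verify that grounding, the code below slices `source_text` with the offsets and compares the result to the extracted `text`. It assumes `start` and `end` are character offsets into `source_text` with `end` exclusive, which this README does not state explicitly. The sample record is hand-written to mirror the shape above and is not real Actor output.

```python
def grounded_span(item: dict, source_text: str) -> str:
    """Return the slice of the source that an extraction points at."""
    pos = item["position"]
    # Assumption: character offsets, with `end` exclusive.
    return source_text[pos["start"]:pos["end"]]


def check_grounding(record: dict) -> list[dict]:
    """Return the extractions whose offsets do not reproduce their text."""
    extraction = record["extraction"]
    source = extraction["source_text"]
    return [
        item
        for item in extraction["extractions"]
        if grounded_span(item, source) != item["text"]
    ]


record = {  # hand-written sample, not real Actor output
    "source_index": 0,
    "extraction": {
        "extractions": [
            {
                "class": "person",
                "text": "Sarah Chen",
                "attributes": {"role": "CEO"},
                "position": {"start": 0, "end": 10},
            }
        ],
        "source_text": "Sarah Chen was appointed CEO of Acme Corp.",
    },
    "status": "success",
}

print(check_grounding(record))  # prints [] because the span matches the text
```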

Key-Value Store

  • EXTRACTION_REPORT.html - Interactive visual report
  • TOKEN_USAGE - Token count and cost estimate (if enabled)
  • DEBUG_* - Debug info (if enabled)
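
Records in a Key-Value Store can be fetched over the Apify API once you know the store ID of a run. The sketch below only builds the record URL and sends nothing. The `record_url` helper and the `YOUR_STORE_ID` placeholder are illustrative, and depending on the store's access settings a request may also need an API token.

```python
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str) -> str:
    """Build the Apify API URL for one Key-Value Store record (hypothetical helper)."""
    # Percent-encode both parts so unusual characters cannot break the path.
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


report_url = record_url("YOUR_STORE_ID", "EXTRACTION_REPORT.html")
print(report_url)
```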

🔒 Security

  • API keys are encrypted and never logged
  • Data is processed transiently - not stored beyond your run
  • Runs in sandboxed containers on Apify infrastructure

💡 Tips

  1. Better results: Provide examples in your schema for complex extractions
  2. Faster runs: Use gemini-2.5-flash or llama-3.1-8b-instant
  3. Long documents: The Actor automatically chunks and processes large texts
  4. Multiple sources: Add multiple URLs or texts for batch processing
  5. Custom personas: Use systemPrompt to guide extraction style

📜 License

Apache 2.0 - Built with Google's LangExtract library.