Pricing

Pay per usage

LangExtract - Structured Data Extractor

Extract structured data from unstructured text using Google's LangExtract library & LLMs. Get entities, relationships & attributes in clean JSON with source grounding. Supports Gemini, OpenAI, Anthropic, Groq & custom endpoints. Perfect for parsing resumes, contracts, reports & documents.

Developer

Vivian Ferreira

Last modified: 25 days ago

🔍 LangExtract - Structured Data Extractor

Extract structured data from unstructured text using Google's LangExtract library, powered by large language models (LLMs). Transform messy documents into clean, organized JSON with precise source grounding.

✨ What It Does

This Actor leverages Google's open-source LangExtract library to:

Extract entities (people, organizations, products, locations, etc.) from any text
Identify relationships between entities and the attributes of each entity
Ground extractions to exact positions in the source text
Generate visual reports showing extractions in context

🎯 Use Cases

Industry	Application
HR & Recruiting	Extract candidates, skills, and experience from resumes
Legal	Pull parties, dates, and clauses from contracts
Healthcare	Structure patient info from medical records
Finance	Extract entities from earnings reports and filings
Research	Parse academic papers for citations and findings
News & Media	Identify people, places, and events from articles
E-commerce	Extract product specs from descriptions

🚀 Quick Start

Basic Example

Input:

{
  "provider": "Gemini",
  "apiKey": "YOUR_GEMINI_API_KEY",
  "model": "gemini-2.5-flash",
  "schema": "{\"prompt_description\": \"Extract all people and their job titles.\"}",
  "text": "Sarah Chen was appointed CEO of Acme Corp. Dr. James Wilson serves as CTO."
}

Output:

{
  "extraction": {
    "extractions": [
      {"class": "person", "text": "Sarah Chen", "attributes": {"role": "CEO", "organization": "Acme Corp"}},
      {"class": "person", "text": "Dr. James Wilson", "attributes": {"role": "CTO"}}
    ]
  }
}
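
The same run can be started from code. Below is a minimal sketch using the `apify-client` Python package; it is not taken from the Actor's own documentation. The Actor ID is a placeholder because this README does not state it, and the helper names `build_run_input` and `run_extraction` are hypothetical. Following the example above, `schema` is sent as a JSON-encoded string.

```python
import json

# Placeholder: the Actor's ID is not given in this README.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(api_key, text, prompt, provider="Gemini", model="gemini-2.5-flash"):
    """Assemble the Actor input from the Basic Example; `schema` is a JSON string."""
    return {
        "provider": provider,
        "apiKey": api_key,
        "model": model,
        "schema": json.dumps({"prompt_description": prompt}),
        "text": text,
    }


def run_extraction(apify_token, run_input, actor_id=ACTOR_ID):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported here so the input builder works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(apify_token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

To use it, build the input with your provider key and pass it to `run_extraction` along with your Apify API token and the real Actor ID.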

🤖 Supported Providers & Models

Google Gemini (Recommended)

Model	Best For
`gemini-2.5-flash`	Fast, affordable, great quality
`gemini-2.0-flash`	Balanced performance
`gemini-1.5-pro`	Complex extractions

OpenAI

Model	Best For
`gpt-4o`	Highest quality
`gpt-4o-mini`	Fast and affordable
`gpt-4-turbo`	Long documents

Anthropic

Model	Best For
`claude-3-5-sonnet-latest`	Best overall
`claude-3-haiku-latest`	Fast responses

Groq (Fast Inference)

Model	Best For
`llama-3.3-70b-versatile`	High quality
`llama-3.1-8b-instant`	Ultra-fast
`mixtral-8x7b-32768`	Long context
`gemma2-9b-it`	Efficient

OpenAI-Compatible (Custom/Local)

Use any OpenAI-compatible API by setting `baseUrl` to the endpoint's base URL:

Ollama: http://localhost:11434/v1
Azure OpenAI: https://YOUR_RESOURCE.openai.azure.com
Other OpenAI-compatible services, such as Together AI or Anyscale

📥 Input Parameters

Parameter	Type	Required	Description
`provider`	enum	✅	LLM provider: Gemini, OpenAI, Anthropic, Groq, OpenAI-Compatible
`apiKey`	secret	✅	Your API key for the selected provider
`model`	string	✅	Model name (e.g., `gemini-2.5-flash`)
`schema`	json	✅	Extraction instructions (see Schema Format)
`text`	string	⚪	Raw text to process
`urls`	array	⚪	URLs to fetch and process
`baseUrl`	string	⚪	Custom API endpoint (for OpenAI-Compatible)
`systemPrompt`	string	⚪	Custom persona (e.g., "Act as a legal expert")
`batchMode`	boolean	⚪	Enable parallel processing (default: true)
`debug`	boolean	⚪	Save debug info to Key-Value Store
`trackTokens`	boolean	⚪	Report token usage (default: true)
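
The table above can be turned into a quick pre-flight check before starting a run. This is a sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and two of its rules are assumptions that the table does not state explicitly (that `baseUrl` is needed for the OpenAI-Compatible provider, and that at least one of `text` or `urls` must be supplied).

```python
PROVIDERS = {"Gemini", "OpenAI", "Anthropic", "Groq", "OpenAI-Compatible"}
REQUIRED = ("provider", "apiKey", "model", "schema")


def validate_input(run_input):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED
        if not run_input.get(name)
    ]
    provider = run_input.get("provider")
    if provider and provider not in PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    # Assumption: a custom endpoint is needed for the OpenAI-Compatible provider.
    if provider == "OpenAI-Compatible" and not run_input.get("baseUrl"):
        problems.append("baseUrl is expected for OpenAI-Compatible")
    # Assumption: at least one source (text or urls) must be supplied.
    if not run_input.get("text") and not run_input.get("urls"):
        problems.append("provide text, urls, or both")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.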

📝 Schema Format

Simple Format

Just describe what you want to extract:

{
  "prompt_description": "Extract all people mentioned with their job titles and organizations."
}

Advanced Format (with examples)

Provide examples for better accuracy:

{
  "prompt_description": "Extract people and their roles. Use exact text from the source.",
  "examples": [
    {
      "text": "John Smith is the CEO of TechCorp.",
      "extractions": [
        {
          "extraction_class": "person",
          "extraction_text": "John Smith",
          "attributes": {"role": "CEO", "organization": "TechCorp"}
        }
      ]
    }
  ]
}
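
Writing examples by hand makes it easy to paraphrase the source by accident. The sketch below builds the advanced format programmatically and rejects any example whose `extraction_text` does not appear verbatim in its `text`, in line with the "use exact text" instruction in the prompt above. The helper names `make_example` and `make_schema` are hypothetical, and the verbatim check is a convention enforced by this sketch, not a documented requirement of the Actor.

```python
import json


def make_example(text, extractions):
    """Build one few-shot example; each extraction is (class, exact_text, attributes)."""
    items = []
    for extraction_class, extraction_text, attributes in extractions:
        # Reject examples that paraphrase instead of quoting the source text.
        if extraction_text not in text:
            raise ValueError(f"{extraction_text!r} does not appear verbatim in the example text")
        items.append({
            "extraction_class": extraction_class,
            "extraction_text": extraction_text,
            "attributes": attributes,
        })
    return {"text": text, "extractions": items}


def make_schema(prompt_description, examples=()):
    """Return the simple format, or the advanced format when examples are given."""
    schema = {"prompt_description": prompt_description}
    if examples:
        schema["examples"] = list(examples)
    return schema


schema = make_schema(
    "Extract people and their roles. Use exact text from the source.",
    [make_example(
        "John Smith is the CEO of TechCorp.",
        [("person", "John Smith", {"role": "CEO", "organization": "TechCorp"})],
    )],
)
print(json.dumps(schema, indent=2))
```

The printed JSON matches the advanced example above and can be passed as the `schema` input.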

📤 Output

Dataset

The extraction result for each source is pushed to the Apify Dataset as one item:

{
  "source_index": 0,
  "extraction": {
    "extractions": [
      {"class": "person", "text": "...", "attributes": {...}, "position": {"start": 0, "end": 10}}
    ],
    "source_text": "..."
  },
  "status": "success"
}
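
For downstream use it is often easier to work with one row per extraction than with nested items. The sketch below flattens dataset items of the shape shown above and uses `position` to slice the grounded span back out of `source_text`. The helper name `iter_extractions` is hypothetical, and treating `start`/`end` as character offsets with an exclusive end is an assumption that this README does not confirm.

```python
def iter_extractions(items):
    """Yield one flat row per extraction from successful dataset items."""
    for item in items:
        if item.get("status") != "success":
            continue
        result = item.get("extraction", {})
        source_text = result.get("source_text", "")
        for ext in result.get("extractions", []):
            pos = ext.get("position") or {}
            start, end = pos.get("start"), pos.get("end")
            # Assumption: start/end are character offsets with an exclusive end.
            grounded = None
            if start is not None and end is not None:
                grounded = source_text[start:end]
            yield {
                "source_index": item.get("source_index"),
                "class": ext.get("class"),
                "text": ext.get("text"),
                "attributes": ext.get("attributes", {}),
                "grounded_text": grounded,
            }
```

Comparing `grounded_text` with `text` for each row is a cheap way to check whether the offset assumption holds for your data.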

Key-Value Store

EXTRACTION_REPORT.html - Interactive visual report
TOKEN_USAGE - Token count and cost estimate (if `trackTokens` is enabled)
DEBUG_* - Debug info (if `debug` is enabled)

🔒 Security

API keys are encrypted and never logged
Data is processed transiently - not stored beyond your run
Runs in sandboxed containers on Apify infrastructure

💡 Tips

Better results: Provide examples in your schema for complex extractions
Faster runs: Use `gemini-2.5-flash` or `llama-3.1-8b-instant`
Long documents: The Actor automatically chunks and processes large texts
Multiple sources: Add multiple URLs or texts for batch processing
Custom personas: Use `systemPrompt` to guide extraction style

📜 License

Apache 2.0 - Built with Google's LangExtract library.
