Bulk Pdf To Json OCR
Pricing
from $300.00 / 1,000 results
Convert PDF invoices, menus, images with text and documents into structured JSON. Features hybrid Digital+OCR parsing and AI-powered data extraction.
Developer

Kumar Gagandeo
PDF to JSON OCR Actor with Gemini AI
This Apify Actor converts PDF files to structured JSON data using intelligent text extraction and Google Gemini AI-powered structuring.
Features
- 📄 Hybrid Text Extraction: Automatically detects digital text vs scanned images
- 🔍 OCR Support: Uses Tesseract OCR for scanned documents
- 🤖 AI Structuring: Powered by Google Gemini 2.0 Flash for intelligent data extraction
- 📋 Document Types: Optimized for invoices, receipts, menus, resumes, contracts, brochures, and general documents
- ⚡ Bulk Processing: Process multiple PDFs in a single run
Setup
1. Configure Environment Variables
Copy the example environment file and add your Gemini API key:
$ cp .env.example .env
Edit .env and add your API key:
GEMINI_API_KEY=AIzaSy...your-actual-key-here
GEMINI_MODEL=gemini-2.0-flash-exp
Get your Gemini API key from: https://aistudio.google.com/apikey
2. Install Dependencies
$ pip install -r requirements.txt
3. Run the Actor
$ apify run
Deploy to Apify
$ apify login
$ apify push
Input Configuration
Required Fields
- PDF URLs (startUrls): Array of direct PDF file URLs to process
Optional Fields
- Enable AI Structuring (structureData): Toggle AI-powered data extraction (default: false)
- Document Type (documentType): Context for AI extraction; one of general, invoice, receipt, menu, resume, contract, brochure, specification
- Max Pages (maxPages): Limit pages processed per PDF (default: 10)
Example Input
{
  "startUrls": [{ "url": "https://example.com/document.pdf" }],
  "structureData": true,
  "documentType": "invoice",
  "maxPages": 5
}
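As an illustration of how the fields above fit together, here is a minimal Python sketch that validates an input object and applies the documented defaults. The `ActorInput` class and `parse_input` function are hypothetical names, not part of the Actor's source, and the `general` default for `documentType` is an assumption (the README states defaults only for `structureData` and `maxPages`).

```python
from dataclasses import dataclass

# Document types listed under "Optional Fields".
DOCUMENT_TYPES = {
    "general", "invoice", "receipt", "menu",
    "resume", "contract", "brochure", "specification",
}


@dataclass
class ActorInput:
    """Hypothetical container for a normalized input object."""
    urls: list[str]
    structure_data: bool = False
    document_type: str = "general"  # assumed default, not stated in the README
    max_pages: int = 10


def parse_input(raw: dict) -> ActorInput:
    """Validate a raw input dict and apply the documented defaults."""
    # startUrls is the only required field; each entry looks like {"url": "..."}.
    urls = [item["url"] for item in raw.get("startUrls", []) if item.get("url")]
    if not urls:
        raise ValueError("startUrls must contain at least one PDF URL")

    document_type = raw.get("documentType", "general")
    if document_type not in DOCUMENT_TYPES:
        raise ValueError(f"Unsupported documentType: {document_type!r}")

    max_pages = int(raw.get("maxPages", 10))
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")

    return ActorInput(
        urls=urls,
        structure_data=bool(raw.get("structureData", False)),
        document_type=document_type,
        max_pages=max_pages,
    )
```

Calling `parse_input` on the example input yields one URL, `structure_data=True`, `document_type="invoice"`, and `max_pages=5`; omitting the optional fields falls back to the defaults.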
How It Works
- Download: Fetches PDF from provided URL
- Text Extraction:
  - First attempts digital text extraction (fast)
  - Falls back to OCR if document is scanned (character density < 50/page)
- AI Structuring (optional):
  - Sends extracted text to Google Gemini AI
  - Returns structured JSON based on document type
- Data Storage: Pushes results to Apify dataset
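The digital-versus-OCR decision in the Text Extraction step can be sketched as a small check on the text returned by the digital pass. This is only an illustration: `needs_ocr` is a hypothetical helper, and averaging non-whitespace characters across all pages is an assumption about how the "character density < 50/page" rule is computed.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Return True when digitally extracted text looks too sparse to trust.

    page_texts holds one string per page from the digital extraction pass.
    The 50-character threshold comes from the README; the averaging rule
    is an assumption.
    """
    if not page_texts:
        return True  # nothing was extracted, so treat the PDF as scanned

    # Count non-whitespace characters only, so blank padding does not count.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

A pipeline following the steps above would run the digital extractor first, call a check like this on the result, and only then pay the cost of rendering pages to images and running OCR.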
Output Format
{
  "url": "https://example.com/document.pdf",
  "status": "success",
  "document_type": "invoice",
  "ai_enabled": true,
  "ai_model": "gemini-2.0-flash-exp",
  "is_ocr_scanned": false,
  "page_count": 3,
  "raw_text_preview": "First 500 characters of extracted text...",
  "extracted_data": {
    "invoice_number": "INV-001",
    "date": "2025-12-17",
    "total": "$1,234.56"
  }
}
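For readers consuming the dataset, the following Python sketch shows one way to summarize items of this shape after a bulk run. The `summarize` function is hypothetical, and two details are assumptions: that any `status` other than `success` marks a failed document, and that `extracted_data` may be missing when AI structuring is off.

```python
import json

# One dataset item, copied from the Output Format example.
ITEM_JSON = """
{
  "url": "https://example.com/document.pdf",
  "status": "success",
  "document_type": "invoice",
  "ai_enabled": true,
  "ai_model": "gemini-2.0-flash-exp",
  "is_ocr_scanned": false,
  "page_count": 3,
  "raw_text_preview": "First 500 characters of extracted text...",
  "extracted_data": {
    "invoice_number": "INV-001",
    "date": "2025-12-17",
    "total": "$1,234.56"
  }
}
"""


def summarize(items: list[dict]) -> dict:
    """Count successes and OCR usage, and collect structured data by URL."""
    # Assumption: anything other than "success" is a failed document.
    ok = [item for item in items if item.get("status") == "success"]
    return {
        "total": len(items),
        "succeeded": len(ok),
        "ocr_scanned": sum(1 for item in ok if item.get("is_ocr_scanned")),
        # Assumption: extracted_data can be absent when AI structuring is off.
        "data_by_url": {item["url"]: item.get("extracted_data") for item in ok},
    }


summary = summarize([json.loads(ITEM_JSON)])
```

Here `summary["data_by_url"]` maps each PDF URL to its structured fields, which is usually the part downstream code needs.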
Project Structure
.actor/
├── actor.json            # Actor config: name, version, env vars, runtime settings
├── dataset_schema.json   # Structure and representation of data produced by an Actor
├── input_schema.json     # Input validation & Console form definition
└── output_schema.json    # Specifies where an Actor stores its output
src/
└── main.py               # Actor entry point with PDF processing logic
.env                      # Environment variables (API keys) - DO NOT COMMIT!
.env.example              # Template for environment variables
storage/                  # Local storage (mirrors Cloud during development)
├── datasets/             # Output items (JSON objects)
├── key_value_stores/     # Files, config, INPUT
└── request_queues/       # Pending crawl requests
Dockerfile                # Container image definition
requirements.txt          # Python dependencies
For more information, see the Actor definition documentation.
Dependencies
- Apify SDK - Actor runtime framework
- pdfplumber - Digital PDF text extraction
- pdf2image - Converts PDF pages to images
- pytesseract - OCR text recognition
- httpx - Async HTTP client for downloading PDFs
- google-generativeai - Google Gemini API client
- python-dotenv - Environment variable management
Environment Variables
The Actor uses environment variables for configuration. These can be set in the .env file for local development:
- GEMINI_API_KEY - Your Google Gemini API key (required for AI structuring)
- GEMINI_MODEL - Model to use (default: gemini-2.0-flash-exp)
For Apify Cloud deployment: Set these as environment variables in the Actor settings on the Apify Console.
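As a rough sketch of how these two variables might be read at startup, the Python snippet below uses only the standard library. `load_gemini_config` is a hypothetical helper, not the Actor's actual code, and skipping the key check when AI structuring is disabled is an assumption based on the key being described as required only for AI structuring. For local development, python-dotenv's `load_dotenv()` can populate the environment from .env before a function like this runs.

```python
import os

# Default model name documented for GEMINI_MODEL.
DEFAULT_MODEL = "gemini-2.0-flash-exp"


def load_gemini_config(env=None, ai_enabled=True):
    """Read Gemini settings from a mapping of environment variables.

    env defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()

    # Assumption: the key is only needed when AI structuring is turned on.
    if ai_enabled and not api_key:
        raise RuntimeError("GEMINI_API_KEY is required when AI structuring is enabled")

    # An empty or missing GEMINI_MODEL falls back to the documented default.
    model = env.get("GEMINI_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key or None, "model": model}
```

Failing early on a missing key keeps a bulk run from downloading and parsing every PDF only to fail at the structuring step.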
Getting Started
For complete information see this article.
- Copy .env.example to .env and add your Gemini API key
- Install dependencies: pip install -r requirements.txt
- Run the Actor: apify run
Deploy to Apify
Connect Git repository to Apify
If you've created a Git repository for the project, you can easily connect to Apify:
- Go to Actor creation page
- Click on Link Git Repository button
Push project on your local machine to Apify
You can also deploy the project from your local machine to Apify without a Git repository.
- Log in to Apify. You will need to provide your Apify API Token to complete this action.
  $ apify login
- Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under Actors -> My Actors.
  $ apify push
Documentation reference
To learn more about Apify and Actors, take a look at the following resources: