
OCR - Extract Text from Images
Under maintenance
Pricing
$5.00 / 1,000 results

Extract text from images using OCR. Plug in your Apify datasetId, indicate which field contains the image URL (the grey API name shown for that column), and get a new dataset with the text extracted from every image.
Image OCR Extractor
A powerful Apify Actor that extracts text from images in bulk using Optical Character Recognition (OCR). This actor processes images from an existing Apify dataset and adds the extracted text to each item, making image content searchable and analyzable.
🚀 Features
- Bulk Processing: Process hundreds or thousands of images automatically
- Multi-language Support: Extract text in multiple languages using Tesseract OCR
- Memory Efficient: Processes images in batches to handle large datasets
- Error Handling: Gracefully handles failed downloads and OCR errors
- Smart Filtering: Automatically skips very small images (thumbnails, icons)
- Concurrent Processing: Processes multiple images simultaneously for faster execution
📋 Use Cases
- E-commerce: Extract product names, prices, and descriptions from product images
- Document Processing: Convert scanned documents or screenshots to searchable text
- Social Media Analysis: Extract text from memes, infographics, and social media posts
- Content Moderation: Identify text content in user-uploaded images
- Data Enhancement: Add searchable text content to existing image datasets
- Research: Analyze text content across large image collections
🛠 Input Configuration
The actor requires the following input parameters:
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| datasetId | String | The ID or name of the Apify dataset containing items with image URLs to process |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| imageUrlFieldName | String | "displayUrl" | The name of the field in your dataset that contains the direct URL to the image |
| lang | String | "eng" | Language codes for Tesseract OCR. Use single codes like "eng", "spa", "fra" or combine multiple with + like "eng+deu" |
Example Input
{
  "datasetId": "your-dataset-id-here",
  "imageUrlFieldName": "imageUrl",
  "lang": "eng+spa"
}
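The input above can also be assembled programmatically. The following is a minimal Python sketch, not the actor's own code: the helper name `build_input` is hypothetical, and the defaults are copied from the parameter tables above.

```python
import json

# Defaults as documented in the optional-parameters table.
DEFAULTS = {"imageUrlFieldName": "displayUrl", "lang": "eng"}


def build_input(dataset_id, image_url_field_name=None, lang=None):
    """Return an actor input dict, filling in the documented defaults.

    Hypothetical helper for illustration; it only validates that the
    required datasetId is a non-empty string.
    """
    if not isinstance(dataset_id, str) or not dataset_id.strip():
        raise ValueError("datasetId is required and must be a non-empty string")
    actor_input = {"datasetId": dataset_id, **DEFAULTS}
    if image_url_field_name is not None:
        actor_input["imageUrlFieldName"] = image_url_field_name
    if lang is not None:
        actor_input["lang"] = lang
    return actor_input


if __name__ == "__main__":
    # Reproduces the example input shown above.
    print(json.dumps(build_input("your-dataset-id-here", "imageUrl", "eng+spa"), indent=2))
```

Omitting the optional arguments leaves `imageUrlFieldName` and `lang` at their documented defaults.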
📊 Input Dataset Format
Your source dataset should contain items with image URLs. Each item should have:
{
  "id": "item-1",
  "title": "Sample Product",
  "displayUrl": "https://example.com/image.jpg",
  "otherField": "other data"
}
The actor will look for the image URL in the field specified by imageUrlFieldName (default: displayUrl).
📤 Output Format
The actor creates a new dataset with all original data plus an additional ocrText field containing the extracted text:
{
  "id": "item-1",
  "title": "Sample Product",
  "displayUrl": "https://example.com/image.jpg",
  "otherField": "other data",
  "ocrText": "SALE 50% OFF\nBest Quality Product\nOrder Now!"
}
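The relationship between an input item and its output item can be sketched in a few lines of Python. This is an illustration of the documented output shape only; the function name `attach_ocr_text` is hypothetical and the actor itself may build its output differently.

```python
def attach_ocr_text(item, text):
    """Return a copy of a dataset item with an added ocrText field.

    The original item is left unmodified, so every original field is
    carried over unchanged, as in the output example above.
    """
    enriched = dict(item)
    enriched["ocrText"] = text
    return enriched


if __name__ == "__main__":
    source_item = {
        "id": "item-1",
        "title": "Sample Product",
        "displayUrl": "https://example.com/image.jpg",
        "otherField": "other data",
    }
    print(attach_ocr_text(source_item, "SALE 50% OFF\nBest Quality Product\nOrder Now!"))
```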
🌍 Supported Languages
The actor supports all languages available in Tesseract OCR, including:
- English: eng
- Spanish: spa
- French: fra
- German: deu
- Chinese Simplified: chi_sim
- Chinese Traditional: chi_tra
- Japanese: jpn
- Korean: kor
- Arabic: ara
- Russian: rus
- Portuguese: por
- Italian: ita
For multiple languages, combine with +: "eng+spa+fra"
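Building the combined lang value from a list of codes can be sketched as follows. The helper `combine_languages` is hypothetical, and its format check is an assumption: it only accepts codes shaped like the ones listed above (three lowercase letters with optional underscore suffixes) and does not verify that Tesseract actually ships the language.

```python
import re

# Assumed shape of a language code, e.g. "eng" or "chi_sim".
# This is a sanity check only, not an authoritative list of Tesseract languages.
_CODE_PATTERN = re.compile(r"^[a-z]{3}(_[a-z]+)*$")


def combine_languages(codes):
    """Join language codes with '+', dropping duplicates and keeping order."""
    cleaned = []
    for code in codes:
        code = code.strip()
        if not _CODE_PATTERN.match(code):
            raise ValueError(f"unexpected language code format: {code!r}")
        if code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


if __name__ == "__main__":
    print(combine_languages(["eng", "spa", "fra"]))
```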
⚙️ How It Works
- Initialization: The actor connects to your source dataset and prepares the OCR engine
- Batch Processing: Images are processed in batches of 500 to optimize memory usage
- Concurrent Processing: Up to 5 images are processed simultaneously for efficiency
- Image Fetching: Each image is downloaded from its URL
- Quality Filtering: Very small images (< 10KB) are automatically skipped
- OCR Processing: Tesseract extracts text from each image
- Data Storage: Results are saved to the default dataset with original data intact
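The flow described above can be sketched in Python. This is not the actor's implementation (which runs on Node.js); it is an illustration under stated assumptions: `download` and `ocr` are caller-supplied stand-ins for the real fetch and Tesseract steps, 10KB is interpreted as 10 × 1024 bytes, and skipped or failed images are assumed to be written with an empty ocrText.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 500             # batch size stated in the steps above
CONCURRENCY = 5              # simultaneous images stated in the steps above
MIN_IMAGE_BYTES = 10 * 1024  # "< 10KB" threshold; 1 KB = 1024 bytes is an assumption


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_item(item, field_name, download, ocr):
    """Process one item; any failure leaves ocrText empty instead of raising."""
    result = dict(item)      # keep the original data intact
    result["ocrText"] = ""   # assumed value for skipped or failed images
    try:
        data = download(item[field_name])
        if len(data) >= MIN_IMAGE_BYTES:  # skip very small images
            result["ocrText"] = ocr(data)
    except Exception:
        pass  # a real run would log the failure here
    return result


def run_pipeline(items, field_name, download, ocr):
    """Process items batch by batch, with up to CONCURRENCY items in flight."""
    output = []
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        for batch in batches(items, BATCH_SIZE):
            # pool.map preserves the input order of each batch.
            output.extend(
                pool.map(lambda it: process_item(it, field_name, download, ocr), batch)
            )
    return output
```

Passing the download and OCR steps in as functions keeps the sketch free of network and Tesseract dependencies, so the batching, filtering, and error-handling logic can be exercised with simple stubs.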
🔧 Technical Details
- Node.js Version: Requires Node.js 18.0.0 or higher
- Memory Management: Processes images in batches to prevent memory overflow
- Concurrency: 5 simultaneous image processing operations
- Error Handling: Individual image failures don't stop the entire process
- Performance: Automatically skips thumbnails and very small images to save processing time
📈 Performance Tips
- Image Quality: Higher resolution images generally produce better OCR results
- Image Format: JPEG and PNG formats work best
- Text Clarity: Clear, high-contrast text produces more accurate results
- Language Selection: Specify the correct language(s) for better accuracy
- Dataset Size: The actor can handle datasets with thousands of images
🚨 Limitations
- Only processes direct image URLs that are accessible without authentication
- Skips images smaller than 10KB automatically
- OCR accuracy depends on image quality and text clarity
- Processing speed varies based on image size and complexity
- Some image formats may not be supported by Tesseract
💡 Best Practices
- Test First: Run the actor on a small subset of your data to verify results
- Language Settings: Use the most specific language codes for your content
- URL Validation: Ensure your dataset contains valid, accessible image URLs
- Monitor Progress: Check the actor's logs to track processing status
- Error Review: Review items with empty ocrText fields for potential issues
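The error-review step can be sketched as a small filter over the output items. The helper name `items_needing_review` is hypothetical, and it assumes the output items have already been loaded into a list of dicts; treating a missing or whitespace-only ocrText as empty is also an assumption.

```python
def items_needing_review(items):
    """Return items whose ocrText is missing, empty, or only whitespace."""
    return [
        item for item in items
        if not str(item.get("ocrText") or "").strip()
    ]


if __name__ == "__main__":
    sample = [
        {"id": "item-1", "ocrText": "SALE 50% OFF"},
        {"id": "item-2", "ocrText": ""},
        {"id": "item-3"},
    ]
    for item in items_needing_review(sample):
        print(item["id"])
```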
🔍 Troubleshooting
Common Issues
No text extracted:
- Check if the image contains readable text
- Verify the correct language is specified
- Ensure the image URL is accessible
Actor fails to start:
- Verify the datasetId exists and is accessible
- Check that the specified imageUrlFieldName exists in your dataset
Some images skipped:
- Very small images (< 10KB) are automatically skipped
- Failed downloads are logged but don't stop processing
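Several of the checks above can be run on the source items before starting the actor. The following Python sketch uses a hypothetical helper, `find_items_without_url`, and assumes the source items are available as a list of dicts; it only checks that the configured field holds an http(s) URL string and does not test whether the URL is reachable.

```python
def find_items_without_url(items, field_name="displayUrl"):
    """Return the indexes of items whose URL field is missing or not an http(s) URL."""
    problem_indexes = []
    for index, item in enumerate(items):
        value = item.get(field_name)
        if not isinstance(value, str) or not value.startswith(("http://", "https://")):
            problem_indexes.append(index)
    return problem_indexes


if __name__ == "__main__":
    sample = [
        {"displayUrl": "https://example.com/image.jpg"},
        {"imageUrl": "https://example.com/other.jpg"},  # wrong field name
        {"displayUrl": ""},
    ]
    print(find_items_without_url(sample))
```

If every index is reported, the configured imageUrlFieldName most likely does not match the field name used in the dataset.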
📝 Example Workflow
- Prepare Data: Create an Apify dataset with items containing image URLs
- Configure Actor: Set the dataset ID and image URL field name
- Select Language: Choose appropriate language codes for your images
- Run Actor: Start the processing and monitor progress in logs
- Analyze Results: Access the output dataset with OCR text included
🤝 Support
If you encounter issues or have questions:
- Check the actor's execution logs for detailed error messages
- Verify your input configuration matches the expected format
- Ensure your image URLs are publicly accessible
Version: 1.0.0
Author: Jan Sytze Heegstra
License: ISC