S3 to Markdown

Pricing

Pay per event

Developed by

Lorenzo Dalmazzo

Maintained by Community

Transform S3 documents into clean AI training data. Converts PDF, Word, Excel, image, and audio files to Markdown suitable for LLM consumption, using Microsoft's markitdown engine. Ideal for RAG systems, AI agents, and machine learning pipelines.

Rating: 5.0 (2 reviews)


Last modified: 3 days ago

S3 File to Markdown Converter

This Apify Actor downloads multiple files from Amazon S3 and converts them to Markdown using markitdown.

Features

  • Bulk processing: convert multiple files in a single run for efficiency
  • S3 downloads: fetches files directly from your S3 bucket
  • Broad format support: converts PDF, Word, PowerPoint, Excel, images, audio, HTML, and more to Markdown
  • Secure credentials: AWS keys are supplied via encrypted input fields
  • Robust error handling: individual file failures don't stop the entire batch
  • Progress tracking: per-file status is logged in detail
  • Pay-per-conversion: you pay $0.01 only for each successfully converted file

Input Configuration

The actor requires the following input parameters:

  • aws_access_key_id (required, secret): Your AWS access key ID for S3 access
  • aws_secret_access_key (required, secret): Your AWS secret access key for S3 access
  • s3_bucket (required): The name of the S3 bucket containing the files
  • s3_keys (required): Array of S3 object keys (paths within the bucket) of the files to convert
  • aws_region (required): The AWS region where the S3 bucket is located

AWS Credentials

AWS credentials are provided directly in the actor input as encrypted secret fields. The credentials are automatically encrypted by Apify and only decrypted during actor execution for maximum security.
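Under the hood, these input fields correspond to the standard AWS client parameters. A minimal sketch of how they might be wired into a boto3 S3 client (the helper name and exact wiring are illustrative, not the actor's actual code):

```python
def s3_client_kwargs(actor_input: dict) -> dict:
    """Map the actor's input fields to boto3.client('s3', ...) keyword
    arguments. Illustrative helper; not part of the actor's public API."""
    return {
        "aws_access_key_id": actor_input["aws_access_key_id"],
        "aws_secret_access_key": actor_input["aws_secret_access_key"],
        "region_name": actor_input["aws_region"],
    }

kwargs = s3_client_kwargs({
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
    "s3_bucket": "my-documents-bucket",
    "s3_keys": ["documents/report.pdf"],
    "aws_region": "us-west-2",
})
# boto3.client("s3", **kwargs) would then be used to download each key
```

Note that the bucket and keys are not client parameters; they are passed per download call.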

Pricing

This actor uses pay-per-conversion pricing:

  • πŸ’° $0.01 per successfully converted file
  • ❌ No charge for failed conversions (missing files, conversion errors, etc.)
  • πŸš€ Cost-effective for batch processing - process many files efficiently
  • πŸ“Š Transparent billing - you can see exactly which files were charged in the logs (look for "charged $0.01" messages)

Example: If you process 100 files and 95 succeed, you pay $0.95 (only for the 95 successful conversions).
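The arithmetic above can be written as a one-line helper (illustrative only; `run_cost` is not part of the actor):

```python
PRICE_PER_CONVERSION = 0.01  # USD, charged only for successful conversions

def run_cost(successful_files: int) -> float:
    """Total charge for a run: failed files add nothing."""
    return successful_files * PRICE_PER_CONVERSION

print(f"${run_cost(95):.2f}")  # 100 files submitted, 95 succeeded
```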

Example Input

```json
{
  "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
  "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
  "s3_bucket": "my-documents-bucket",
  "s3_keys": [
    "documents/report.pdf",
    "documents/invoice.docx",
    "documents/presentation.pptx"
  ],
  "aws_region": "us-west-2"
}
```

Note: The AWS credentials will appear as password fields in the Apify Console and will be automatically encrypted.

Output

The actor processes multiple files and saves one record per converted file to the dataset. Each record has the following structure:

  • s3_bucket: The source S3 bucket name
  • s3_key: The specific S3 object key that was converted
  • markdown_content: The converted Markdown content from that file
  • file_size_chars: The size of the Markdown content in characters

The output is displayed in a user-friendly table format in the Apify Console's Output tab, with one row per converted file.

Example Output

For the input with multiple files above, you would get multiple records:

```json
{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/report.pdf",
  "markdown_content": "# Report Title\n\nThis is the converted markdown content...",
  "file_size_chars": 1234
}

{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/invoice.docx",
  "markdown_content": "# Invoice\n\nInvoice Number: 12345...",
  "file_size_chars": 856
}

{
  "s3_bucket": "my-documents-bucket",
  "s3_key": "documents/presentation.pptx",
  "markdown_content": "# Presentation Title\n\n## Slide 1...",
  "file_size_chars": 2048
}
```

Supported File Formats

Thanks to markitdown, this actor supports:

  • PDF documents
  • Microsoft Office files (Word, PowerPoint, Excel)
  • Images (with OCR)
  • Audio files (with transcription)
  • HTML files
  • Text-based formats (CSV, JSON, XML)
  • ZIP archives
  • EPUB files
  • And more!

Error Handling

The actor provides robust error handling for batch processing:

  • Batch Resilience: If one file fails, the actor continues processing other files
  • Detailed Logging: Each file's processing status is logged individually
  • No charges for failures: You're only charged for successfully converted files
  • Clear Error Messages: Specific error messages for common issues:
    • Missing AWS credentials
    • Invalid S3 bucket
    • Missing S3 objects (individual files are skipped, not charged)
    • Access denied errors (individual files are skipped, not charged)
    • File conversion failures (individual files are skipped, not charged)
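The per-file behavior described above can be sketched roughly as follows (`convert` and `process_batch` are illustrative stand-ins, not the actor's actual code):

```python
def convert(key: str) -> str:
    """Stand-in for the real S3 download + markitdown conversion."""
    if key.endswith(".missing"):
        raise FileNotFoundError(key)
    return f"# Markdown for {key}"

def process_batch(keys):
    """Convert each key; log and skip failures, 'charge' only on success."""
    results, charged = [], 0
    for key in keys:
        try:
            markdown = convert(key)
        except Exception as exc:
            print(f"Skipping {key}: {exc}")  # no charge for failed files
            continue
        results.append({"s3_key": key, "markdown_content": markdown})
        charged += 1  # the $0.01 charge happens only after a success

    return results, charged

results, charged = process_batch(["a.pdf", "b.missing", "c.docx"])
```

Here one of the three files fails, so the batch still produces two records and two charges.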

Usage Example

```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")

# Run the actor
run = client.actor("your-actor-id").call(run_input={
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
    "s3_bucket": "my-documents",
    "s3_keys": ["files/document.pdf", "files/report.docx"],
    "aws_region": "us-east-1"
})

# Get the Markdown content; each dataset item is one converted file
converted = 0
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    markdown_content = item["markdown_content"]
    s3_key = item["s3_key"]
    print(f"Converted {s3_key}: {len(markdown_content)} characters")
    converted += 1

# You are charged $0.01 for each successfully converted file
print(f"Total cost: ${converted * 0.01:.2f}")
```
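A common next step is persisting the converted Markdown locally. A small helper for that might look like this (`save_markdown` and its naming scheme are illustrative, not part of the actor or the Apify client):

```python
from pathlib import Path

def save_markdown(items, out_dir="markdown_out"):
    """Write each dataset record's markdown_content to a local .md file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for item in items:
        # Derive a flat filename from the S3 key,
        # e.g. "files/report.docx" -> "files_report.md"
        name = item["s3_key"].replace("/", "_").rsplit(".", 1)[0] + ".md"
        path = out / name
        path.write_text(item["markdown_content"], encoding="utf-8")
        paths.append(path)
    return paths

# Would work with the dataset items from the run above:
# save_markdown(client.dataset(run["defaultDatasetId"]).iterate_items())
```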