Convert To Markdown

Pricing

from $15.00 / 1,000 file conversions

Convert to Markdown converts documents, spreadsheets, images (OCR), audio (transcription), and web/data files into clean Markdown. It runs fully locally, requires no API keys, and is suited to LLM pipelines, documentation, and archiving.

Developer

Datavault

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Convert to Markdown - Versatile File Converter

The Convert to Markdown Actor is a high-performance, all-in-one utility designed to transform a wide variety of file formats into clean, structured Markdown. It is ideal for preparing data for LLMs (Large Language Models), documentation workflows, or archiving.

Features

  • Documents: Converts PDF (preserving layout and structure), Word (.docx), and PowerPoint (.pptx) into clean Markdown.
  • Spreadsheets: Transforms Excel (.xlsx) and CSV files into readable Markdown tables.
  • Images (OCR): Extracts text from images (JPG, PNG, WebP, etc.) using automated OCR.
  • Audio (Transcription): Transcribes speech from audio files (MP3, WAV, etc.) into text using local AI models.
  • Web & Data: Converts HTML, JSON, and XML into formatted Markdown blocks or tables.
  • Metadata Extraction: Automatically extracts technical metadata for images and audio files.
  • No External API Keys: Everything runs locally inside the container (including OCR and Transcription).

Supported Formats

| Category | Formats |
| --- | --- |
| Documents | PDF, DOCX, PPTX, TXT |
| Data | JSON, XML, CSV, HTML |
| Spreadsheets | XLSX |
| Images | PNG, JPG, JPEG, WEBP, BMP, TIFF |
| Audio | MP3, WAV, OGG, M4A, FLAC |
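
The format list can also be written as a small lookup, which may be useful for filtering URLs before submitting them. This is only a sketch: the `category_for` helper and the `SUPPORTED_FORMATS` dictionary are hypothetical names introduced here, not part of the Actor, and the Actor itself may identify files differently.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions copied from the Supported Formats list; the mapping itself
# is an illustration, not the Actor's internal logic.
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "docx", "pptx", "txt"},
    "Data": {"json", "xml", "csv", "html"},
    "Spreadsheets": {"xlsx"},
    "Images": {"png", "jpg", "jpeg", "webp", "bmp", "tiff"},
    "Audio": {"mp3", "wav", "ogg", "m4a", "flac"},
}


def category_for(url: str) -> Optional[str]:
    """Return the category for a URL's file extension, or None if unlisted."""
    # Use only the URL path so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None
```

A URL whose extension is not listed returns `None`, which a caller could use to skip the file or report it in the Issues tab.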

Input Parameters

  • urls: A list of URLs pointing to the files you want to convert.
  • performOcr: (Default: true) Enable/disable OCR for images and scanned PDFs.
  • extractMetadata: (Default: true) Enable/disable technical metadata extraction.
  • proxyConfiguration: Use Apify Proxy if your target files are protected or geo-blocked.
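
As a rough illustration, the parameters above could be assembled and sanity-checked on the client side before starting a run. The `build_input` function is a hypothetical helper, and its URL check is an assumption about what is sensible to send, not a rule stated by the Actor. `proxyConfiguration` is left out because its exact shape is not described here.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def build_input(
    urls: List[str],
    perform_ocr: bool = True,
    extract_metadata: bool = True,
) -> Dict[str, Any]:
    """Assemble an input object using the parameter names listed above."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    for url in urls:
        parsed = urlparse(url)
        # Client-side sanity check only; the Actor may accept other schemes.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute HTTP(S) URL: {url!r}")
    return {
        "urls": list(urls),
        "performOcr": perform_ocr,
        "extractMetadata": extract_metadata,
    }
```

Both boolean flags default to `true`, matching the documented defaults, so a caller only needs to pass them to switch a feature off.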

Output

The Actor outputs a dataset where each item represents a converted file:

  • url: The original source URL.
  • title: The filename or detected title.
  • markdown: The full converted content in Markdown format.
  • mimeType: The detected MIME type of the file.
  • metadata: A JSON object containing technical metadata (e.g., image dimensions, audio duration, GPS data).
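
One way to use these fields is to write each item's `markdown` to its own file. The sketch below assumes the items have already been fetched as plain dictionaries with the fields listed above. `save_markdown` and its file-naming scheme are hypothetical choices made for this example.

```python
import re
from pathlib import Path
from typing import Any, Dict, Iterable, List


def save_markdown(items: Iterable[Dict[str, Any]], out_dir: Path) -> List[Path]:
    """Write each item's `markdown` field to <out_dir>/<title>.md."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, item in enumerate(items):
        # Fall back to a numbered name when the title is missing or empty.
        fallback = f"file-{index}"
        title = item.get("title") or fallback
        # Replace runs of characters that are unsafe in filenames.
        safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_") or fallback
        path = out_dir / f"{safe_name}.md"
        path.write_text(item.get("markdown") or "", encoding="utf-8")
        written.append(path)
    return written
```

Items that share a title would overwrite each other in this sketch; adding the index to every filename avoids that if it matters.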

Sample Input

{
  "urls": [
    "https://example.com/document.pdf",
    "https://example.com/photo.jpg"
  ],
  "performOcr": true,
  "extractMetadata": true
}
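
An input like this could be submitted from Python roughly as follows. The sketch assumes the `apify-client` package with its 1.x-style interface, where `actor(...).call(...)` returns the run as a dictionary and `dataset(...).iterate_items()` yields the results. The Actor ID is a placeholder because it is not given here, and `convert_files` is a hypothetical wrapper.

```python
from typing import Any, Dict, List, Optional

# Placeholder: the Actor's real ID is not given in this document.
ACTOR_ID = "<username>/<actor-name>"


def convert_files(
    client: Any,
    run_input: Dict[str, Any],
    actor_id: str = ACTOR_ID,
    memory_mbytes: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Start a run, wait for it to finish, and return its dataset items.

    `client` is expected to behave like `apify_client.ApifyClient`
    (assumed 1.x interface).
    """
    run = client.actor(actor_id).call(
        run_input=run_input,
        memory_mbytes=memory_mbytes,  # None keeps the Actor's default memory
    )
    if run is None:
        raise RuntimeError("the Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

With the real package, `client` would be created as `ApifyClient("<YOUR_API_TOKEN>")` after `from apify_client import ApifyClient`. Passing the client in as an argument keeps the function easy to exercise with a stand-in object.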

How it works

  1. Download: The Actor downloads the file from the provided URL.
  2. Identification: It detects the file type based on headers and extensions.
  3. Conversion:
    • PDFs use specialized tools to preserve layout and then convert to Markdown.
    • Word/PowerPoint are transformed using robust document processors.
    • Images use advanced OCR for text and technical metadata extraction.
    • Audio uses local AI models for speech-to-text transcription.
    • Web/Data use specialized HTML and data parsers to build tables and lists.
  4. Formatting: All outputs are normalized into valid Markdown.
  5. Storage: Results are saved to the Apify Dataset and a conversion event is billed.
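
The identification step (step 2) can be sketched with the standard library. The text says only that headers and extensions are used, so the order of precedence below (header first, extension as a fallback) is an assumption, and `detect_mime_type` is a hypothetical name rather than the Actor's actual code.

```python
import mimetypes
from typing import Optional
from urllib.parse import urlparse

# Header values that say nothing useful about the actual file type.
GENERIC_TYPES = {"", "application/octet-stream", "binary/octet-stream"}


def detect_mime_type(url: str, content_type: Optional[str]) -> Optional[str]:
    """Prefer the Content-Type header; fall back to the URL's extension."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    header_type = (content_type or "").split(";", 1)[0].strip().lower()
    if header_type not in GENERIC_TYPES:
        return header_type
    # Guess from the path only, so query strings are ignored.
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    return guessed
```

Falling back to the extension matters for file hosts that serve every download as `application/octet-stream`. When neither source is informative the function returns `None`, and a caller would need to decide whether to skip the file.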

Performance Note

  • Transcription/OCR: Processing large audio files or complex images can be CPU-intensive. The Actor uses optimized models for a balance between speed and accuracy.
  • Memory: For very large Excel files or PDFs, allocate at least 2 GB of memory to the Actor.

Feedback & Improvements

If you encounter a file format that isn't supported or have ideas for improvements, please leave us a message in the Issues tab!