HTML Tables to Markdown (GFM) for RAG & LLMs

Pricing

from $1.00 / 1,000 tables extracted


Extract every HTML table from any URL into clean, deterministic GitHub-Flavored Markdown (GFM). Auto-detects headers (or synthesizes col1..N), escapes pipes, collapses whitespace, and stamps each table with an sha256 hash for dedup & idempotency. RAG / embeddings / LLM ready. Same HTML, same output.

Developer

Adam

Maintained by Community

Last modified

17 hours ago

TableForge: Docs -> Queryable GFM Tables

Turn the HTML tables buried in your pages into clean, deterministic, RAG-ready GitHub-Flavored Markdown.

What it does

TableForge fetches each URL you provide, parses the returned HTML with a real DOM (jsdom), and extracts every <table> on the page. Each table is converted into a clean GitHub-Flavored Markdown (GFM) table:

  • Cell text is whitespace-collapsed and trimmed, and | characters are escaped so the Markdown stays valid.
  • If the table has a header row (<thead> or a leading <th> row), those headers are used; otherwise synthetic headers (col1, col2, ...) are generated so every table is well-formed.
  • A standard GFM header separator (| --- | --- |) is emitted, making the output ready to drop into Markdown, paste into an LLM prompt, or feed an embeddings/RAG pipeline.
  • Every table gets a deterministic content_hash (the prefix sha256: followed by 64 hex characters) computed over its GFM text, so identical tables always produce identical hashes for dedup and idempotency.

The conversion is fully deterministic: the same HTML in always yields the same Markdown and the same hash out. Nothing is summarized, rewritten, or hallucinated; missing cells are emitted as empty, never invented.
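The steps above can be illustrated with a minimal Python sketch. It is not the Actor's implementation (which runs on jsdom); the function names, the backslash pipe escape, the padding of short rows, and hashing the UTF-8 bytes of the GFM text are all assumptions made for illustration.

```python
import hashlib
import re
from typing import Optional


def normalize_cell(text: str) -> str:
    """Collapse whitespace runs to one space, trim, and escape pipes."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.replace("|", "\\|")


def to_gfm(rows: list, headers: Optional[list] = None) -> str:
    """Render rows as a GFM table; synthesize col1..colN if headers are absent."""
    width = max([len(r) for r in rows] + [len(headers or [])])
    if headers is None:
        headers = [f"col{i}" for i in range(1, width + 1)]

    def line(cells: list) -> str:
        # Assumption: short rows are padded with empty cells, never invented data.
        padded = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(normalize_cell(c) for c in padded) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([line(headers), separator] + [line(r) for r in rows])


def content_hash(gfm: str) -> str:
    """Return 'sha256:' plus the 64-character hex digest of the GFM text."""
    return "sha256:" + hashlib.sha256(gfm.encode("utf-8")).hexdigest()


# A headerless table with a pipe in one cell and a missing cell in the last row.
table = to_gfm([["a |  b", "1"], ["c"]])
print(table)
print(content_hash(table))
```

Because every step is a pure function of its input, running the sketch twice on the same rows yields the same Markdown and the same hash, which is the property the Actor advertises.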

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| urls | array of strings | yes | Page URLs whose tables you are authorized to extract (your own, authorized, or public pages). |
| ownership_attestation | boolean | yes | Must be true to confirm you own or are authorized to extract from these pages. If false/omitted, the run is rejected before any work and bills $0. |
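As an illustration, a run input with both required fields might look like the dictionary below. The URL is a placeholder, and `check_input` is a hypothetical client-side pre-check that mirrors the documented gate; treating an empty `urls` array as invalid is an assumption, and this is not the Actor's own validation code.

```python
import json

# Hypothetical run input; the URL is a placeholder, not a real target.
run_input = {
    "urls": ["https://example.com/docs/pricing"],
    "ownership_attestation": True,
}


def check_input(data: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    urls = data.get("urls")
    # Assumption: an empty array is treated the same as a missing field.
    if not isinstance(urls, list) or not urls or not all(isinstance(u, str) for u in urls):
        problems.append("urls must be a non-empty array of strings")
    # Only a literal true passes; false or omitted means the run is rejected.
    if data.get("ownership_attestation") is not True:
        problems.append("ownership_attestation must be true")
    return problems


print(json.dumps(run_input, indent=2))
print(check_input(run_input))
```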

Output

Records are pushed to the dataset. There are two record_type values:

table — one record per extracted table:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | "table" |
| source_url | string | The page the table came from. |
| table_index | integer | Zero-based index of the table within the page. |
| gfm_table | string | The full GitHub-Flavored Markdown table. |
| column_headers | string[] | The header cells (real or synthesized col1..colN). |
| row_count | integer | Number of body rows (excluding the header). |
| content_hash | string | Deterministic sha256:<64 hex> over the GFM text. |

run_summary — one record per run:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | "run_summary" |
| pages_processed | integer | Pages successfully fetched and scanned. |
| tables_extracted | integer | Total tables converted across all pages. |

A page that fails to fetch is skipped entirely (it contributes no records and is not counted). A page that fetches successfully but contains no tables is counted in pages_processed but adds no table records. Either way, pages with zero tables are never billed.
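A downstream consumer will usually separate the two record types and drop duplicate tables before embedding. The sketch below assumes the dataset items have already been loaded as a list of dictionaries; the sample records, URLs, and hash values are invented placeholders (not real digests), and the helper names are hypothetical.

```python
def split_records(records: list) -> tuple:
    """Separate table records from run_summary records."""
    tables = [r for r in records if r.get("record_type") == "table"]
    summaries = [r for r in records if r.get("record_type") == "run_summary"]
    return tables, summaries


def dedup_by_hash(tables: list) -> list:
    """Keep the first table seen for each content_hash, preserving order."""
    seen = set()
    unique = []
    for t in tables:
        if t["content_hash"] not in seen:
            seen.add(t["content_hash"])
            unique.append(t)
    return unique


# Placeholder records shaped like the fields documented above.
HASH_A = "sha256:" + "a" * 64
HASH_B = "sha256:" + "b" * 64
GFM_A = "| col1 |\n| --- |\n| x |"
GFM_B = "| col1 |\n| --- |\n| y |"
records = [
    {"record_type": "table", "source_url": "https://example.com/a", "table_index": 0,
     "gfm_table": GFM_A, "column_headers": ["col1"], "row_count": 1, "content_hash": HASH_A},
    {"record_type": "table", "source_url": "https://example.com/a", "table_index": 1,
     "gfm_table": GFM_B, "column_headers": ["col1"], "row_count": 1, "content_hash": HASH_B},
    {"record_type": "table", "source_url": "https://example.com/b", "table_index": 0,
     "gfm_table": GFM_A, "column_headers": ["col1"], "row_count": 1, "content_hash": HASH_A},
    {"record_type": "run_summary", "pages_processed": 2, "tables_extracted": 3},
]

tables, summaries = split_records(records)
unique = dedup_by_hash(tables)
print(len(tables), len(unique), summaries[0]["tables_extracted"])
```

Comparing `tables_extracted` in the summary against the number of table records is a cheap sanity check that the dataset was read completely.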

Pricing

TableForge uses Apify Pay-Per-Event. You are billed only for:

| Event | Price (USD) | When it fires |
| --- | --- | --- |
| actor_run_start | $0.005 | Once per run, after the ownership and paid-plan gates pass. |
| table_extracted | $0.001 | Once per table successfully converted to GFM (the billed unit). |

Pages are scanned but not billed (no double-charging), and failed or empty pages cost $0.

Example run: scan 10 pages that together contain 80 tables -> $0.005 (run start) + 80 x $0.001 = $0.085 total.
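The same arithmetic can be written as a small estimator. This is a sketch using the two event prices listed above; the function name is hypothetical, and it assumes the run passed its gates (a rejected run bills $0 and is not modeled here).

```python
from decimal import Decimal

# Event prices from the pricing table above, in USD.
RUN_START = Decimal("0.005")
PER_TABLE = Decimal("0.001")


def estimate_cost(tables_extracted: int) -> Decimal:
    """Estimate the charge for one accepted run that converts this many tables."""
    if tables_extracted < 0:
        raise ValueError("tables_extracted must be non-negative")
    # Pages are not billed, so only the table count matters.
    return RUN_START + PER_TABLE * tables_extracted


print(estimate_cost(80))  # 0.085
```

Decimal is used instead of float so that sub-cent amounts add up exactly.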

Why this Actor

  • Deterministic + idempotent. Output is a pure function of the input HTML. Every table carries an sha256: content hash over its Markdown, so you can dedup, cache, and detect changes reliably across runs.
  • No hallucination. Tables are parsed structurally from the DOM and reproduced faithfully. Missing cells are emitted empty, never fabricated; nothing is paraphrased or summarized.
  • Ownership attestation gate. A run cannot proceed unless you attest you are authorized to extract from the pages; without it, the run is rejected before any work with zero billing.
  • Embeddings/RAG-ready by design. Clean GFM with preserved (or synthesized) headers, escaped pipes, and per-table hashes drops straight into LLM prompts, vector stores, and Markdown docs.
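One way to use the per-table hashes for change detection is to compare two runs keyed by page and table position. The sketch below assumes each run's table records are available as dictionaries with the documented fields; `diff_runs`, the URLs, and the hash values are hypothetical placeholders.

```python
def diff_runs(previous: list, current: list) -> dict:
    """Compare two runs and report tables that were added, removed, or changed."""

    def index(records: list) -> dict:
        # Key each table by (source_url, table_index); the value is its hash.
        return {
            (r["source_url"], r["table_index"]): r["content_hash"]
            for r in records
            if r.get("record_type") == "table"
        }

    old, new = index(previous), index(current)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def rec(url: str, idx: int, digest_char: str) -> dict:
    """Build a placeholder table record; the hash is not a real digest."""
    return {"record_type": "table", "source_url": url, "table_index": idx,
            "content_hash": "sha256:" + digest_char * 64}


run_1 = [rec("https://example.com/a", 0, "a"), rec("https://example.com/a", 1, "b")]
run_2 = [rec("https://example.com/a", 0, "a"), rec("https://example.com/a", 1, "c"),
         rec("https://example.com/b", 0, "d")]
print(diff_runs(run_1, run_2))
```

Only tables listed under "added" or "changed" need to be re-embedded; unchanged hashes can be served from cache. Note that keying by position is a design choice: if a page inserts a table at the top, later tables shift index and will show up as changed.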

This Actor is AI-authored and operated under the publisher's LLC. It uses Apify's Actor.charge() solely to bill the customer for the events above; the Actor contains no payout or money-out capability of any kind.