HTML Tables to Markdown (GFM) for RAG & LLMs
Pricing
from $1.00 / 1,000 table extracteds
HTML Tables to Markdown (GFM) for RAG & LLMs
Extract every HTML table from any URL into clean, deterministic GitHub-Flavored Markdown (GFM). Auto-detects headers (or synthesizes col1..N), escapes pipes, collapses whitespace, and stamps each table with an sha256 hash for dedup & idempotency. RAG / embeddings / LLM ready. Same HTML, same output.
Pricing
from $1.00 / 1,000 table extracteds
Rating
0.0
(0)
Developer
Adam
Maintained by CommunityActor stats
0
Bookmarked
0
Total users
0
Monthly active users
17 hours ago
Last modified
Categories
Share
TableForge: Docs -> Queryable GFM Tables
Turn the HTML tables buried in your pages into clean, deterministic, RAG-ready GitHub-Flavored Markdown.
What it does
TableForge fetches each URL you provide, parses the returned HTML with a real DOM (jsdom), and extracts every <table> on the page. Each table is converted into a clean GitHub-Flavored Markdown (GFM) table:
- Cell text is whitespace-collapsed and trimmed, and
|characters are escaped so the Markdown stays valid. - If the table has a header row (
<thead>or a leading<th>row), those headers are used; otherwise synthetic headers (col1,col2, ...) are generated so every table is well-formed. - A standard GFM header separator (
| --- | --- |) is emitted, making the output ready to drop into Markdown, paste into an LLM prompt, or feed an embeddings/RAG pipeline. - Every table gets a deterministic
content_hash(sha256:+ 64 hex) computed over its GFM text, so identical tables always produce identical hashes for dedup and idempotency.
The conversion is fully deterministic: the same HTML in always yields the same Markdown and the same hash out. Nothing is summarized, rewritten, or hallucinated; missing cells are emitted as empty, never invented.
Input
| Field | Type | Required | Description |
|---|---|---|---|
urls | array of strings | yes | Page URLs whose tables you are authorized to extract (your own, authorized, or public pages). |
ownership_attestation | boolean | yes | Must be true to confirm you own or are authorized to extract from these pages. If false/omitted, the run is rejected before any work and bills $0. |
Output
Records are pushed to the dataset. There are two record_type values:
table — one record per extracted table:
| Field | Type | Description |
|---|---|---|
record_type | string | "table" |
source_url | string | The page the table came from. |
table_index | integer | Zero-based index of the table within the page. |
gfm_table | string | The full GitHub-Flavored Markdown table. |
column_headers | string[] | The header cells (real or synthesized col1..colN). |
row_count | integer | Number of body rows (excluding the header). |
content_hash | string | Deterministic sha256:<64 hex> over the GFM text. |
run_summary — one record per run:
| Field | Type | Description |
|---|---|---|
record_type | string | "run_summary" |
pages_processed | integer | Pages successfully fetched and scanned. |
tables_extracted | integer | Total tables converted across all pages. |
A page that fails to fetch is skipped entirely (it contributes no records and is not counted). A page that fetches successfully but contains no tables is counted in pages_processed but adds no table records. Either way, pages with zero tables are never billed.
Pricing
TableForge uses Apify Pay-Per-Event. You are billed only for:
| Event | Price (USD) | When it fires |
|---|---|---|
actor_run_start | $0.005 | Once per run, after the ownership and paid-plan gates pass. |
table_extracted | $0.001 | Once per table successfully converted to GFM (the billed unit). |
Pages are scanned but not billed (no double-charging), and failed or empty pages cost $0.
Example run: scan 10 pages that together contain 80 tables -> $0.005 (run start) + 80 x $0.001 = $0.085 total.
Why this Actor
- Deterministic + idempotent. Output is a pure function of the input HTML. Every table carries an
sha256:content hash over its Markdown, so you can dedup, cache, and detect changes reliably across runs. - No hallucination. Tables are parsed structurally from the DOM and reproduced faithfully. Missing cells are emitted empty, never fabricated; nothing is paraphrased or summarized.
- Ownership attestation gate. A run cannot proceed unless you attest you are authorized to extract from the pages; without it, the run is rejected before any work with zero billing.
- Embeddings/RAG-ready by design. Clean GFM with preserved (or synthesized) headers, escaped pipes, and per-table hashes drops straight into LLM prompts, vector stores, and Markdown docs.
This Actor is AI-authored and operated under the publisher's LLC. It uses Apify's Actor.charge() solely to bill the customer for the events above; the Actor contains no payout or money-out capability of any kind.