HTML Table Extractor
Extract HTML tables from any webpage into structured JSON. Supports multiple URLs, filtering by CSS selector or table index, auto-header detection, and nested tables. Pure HTTP — no proxy needed.
Pricing: Pay per event
Developer: Stas Persiianenko
🗂️ HTML Table Extractor
Extract structured data from any HTML <table> on the web and download it as clean JSON or CSV — no coding required.
Whether you're a researcher pulling data from Wikipedia, a financial analyst scraping earnings tables, or a developer building a data pipeline, this actor turns messy HTML tables into structured, ready-to-use datasets in seconds.
What does it do?
HTML Table Extractor fetches one or more URLs, finds all <table> elements on each page, and converts them to structured JSON — one dataset item per table. It automatically detects column headers from <th> tags, handles rowspans and colspans gracefully, and lets you filter by CSS selector, table index, or minimum row count.
Example: Extract the GDP rankings table from Wikipedia in one click. The actor fetches the page, identifies the right table, and outputs each row as a clean JSON object with named fields.
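A minimal run input for that example could look like this (field names follow the input schema documented below):

```json
{
  "urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"],
  "tableSelector": ".wikitable"
}
```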
Who is it for?
- 📊 Data analysts who regularly copy-paste tables from financial sites, government portals, or Wikipedia into spreadsheets
- 🧑💻 Developers building ETL pipelines that need structured data from HTML pages without writing custom parsers
- 📰 Journalists and researchers who need to extract comparison tables, statistics, or rankings from web pages
- 🏢 Business intelligence teams automating weekly data collection from partner or competitor websites
- 🤖 AI/ML practitioners who need structured training data or reference tables from the web
- 💼 Operations teams extracting pricing tables, product specs, or availability grids from supplier sites
Why use HTML Table Extractor?
✅ No proxy needed — plain HTTP fetch works on most public web pages, keeping costs near zero
✅ Smarter header detection — auto-detects <th> headers OR lets you specify which row is the header
✅ Filter precisely — target tables by CSS selector, 0-based index, or minimum row count
✅ Nested table support — optionally extract tables inside other tables
✅ Multiple URLs in one run — batch many pages together
✅ PPE pricing — pay only for what you extract, with volume discounts for heavy users
✅ Dataset-ready output — every row is a flat JSON object, ready for Apify datasets, Google Sheets, or your own database
📋 What data can it extract?
| Field | Description | Example |
|---|---|---|
| `url` | Source page URL | `https://en.wikipedia.org/wiki/...` |
| `pageTitle` | Page `<title>` text | "List of countries by GDP (nominal)" |
| `tableIndex` | 0-based position on page | 2 |
| `hasHeaders` | Whether headers were detected | `true` |
| `headers` | Array of column header names | `["Country", "GDP (USD)", "Year"]` |
| `rows` | Array of row objects | `[{"Country": "USA", "GDP (USD)": "28.7T"}]` |
| `rowCount` | Number of data rows extracted | 195 |
| `columnCount` | Number of columns | 4 |
💰 How much does it cost to extract HTML tables?
HTML Table Extractor uses Pay-Per-Event (PPE) pricing — you pay only for what you actually extract:
| Event | Price |
|---|---|
| Actor start | $0.005 per run |
| Table extracted | $0.0008 per table |
Example costs:
- Extract 10 tables from 3 URLs → $0.005 + (10 × $0.0008) = $0.013
- Extract 100 tables from Wikipedia → $0.005 + (100 × $0.0008) = $0.085
- Daily job extracting 50 tables → $0.045/day = **$1.35/month**
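The arithmetic above is easy to script if you want to estimate a job before running it — a minimal sketch using the per-event prices from the table above (base prices only, before any volume discount):

```python
ACTOR_START = 0.005   # $ per run (from the pricing table above)
PER_TABLE = 0.0008    # $ per extracted table

def run_cost(tables: int, runs: int = 1) -> float:
    """Base cost in USD for extracting `tables` tables across `runs` runs."""
    return runs * ACTOR_START + tables * PER_TABLE

print(f"${run_cost(10):.3f}")             # 10 tables in one run
print(f"${run_cost(50) * 30:.2f}/month")  # daily 50-table job, 30 days
```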
Volume discounts apply automatically — heavy users (FREE → BRONZE → SILVER → GOLD → PLATINUM → DIAMOND tiers) get up to 60% off the base per-table price.
Free tier: Apify gives every account $5/month in free usage. That's enough to extract ~6,200 tables for free each month.
No proxy is used by default, so compute costs are minimal — just a lightweight HTTP fetch and cheerio HTML parsing.
🚀 How to use it — step by step
- Open the actor at apify.com/automation-lab/html-table-extractor
- Paste your URLs — one or more pages containing HTML tables
- Optionally filter — set a CSS selector (e.g. `#main-content`) or specify which table index to extract
- Click Run — the actor fetches pages and extracts tables immediately
- Download results — export as JSON, CSV, Excel, or send to Google Sheets via integration

Tip: Start with the pre-filled Wikipedia example to see the output format before running on your own URLs.
📥 Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | ✅ Yes | — | List of page URLs to extract tables from |
| `tableSelector` | string | No | `""` | CSS selector to scope table search (e.g. `.main-content table`) |
| `tableIndices` | array | No | `[]` | Extract only tables at these 0-based positions (e.g. `[0, 2]`) |
| `minRows` | integer | No | `1` | Skip tables with fewer rows than this |
| `maxTablesPerPage` | integer | No | `100` | Maximum tables to extract per URL (0 = unlimited) |
| `headerRowIndex` | integer | No | `-1` | Row to use as headers (-1 = auto-detect from `<th>` tags) |
| `includeNestedTables` | boolean | No | `false` | Also extract tables nested inside other tables |
| `proxyConfiguration` | object | No | none | Optional proxy (not needed for most public sites) |
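Combining several of the parameters above, a run input might look like this (URLs and values are illustrative):

```json
{
  "urls": ["https://example.com/stats", "https://example.com/pricing"],
  "tableSelector": ".main-content table",
  "minRows": 3,
  "headerRowIndex": -1,
  "includeNestedTables": false
}
```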
📤 Output example
Each extracted table becomes one dataset item:
{"url": "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)","pageTitle": "List of countries by GDP (nominal) - Wikipedia","tableIndex": 2,"hasHeaders": true,"headers": ["Country/Territory", "IMF (2026)", "World Bank (2024)", "UN (2024)"],"rows": [{"Country/Territory": "World","IMF (2026)": "123,584,494","World Bank (2024)": "111,326,370","UN (2024)": "100,834,796"},{"Country/Territory": "United States","IMF (2026)": "31,821,293","World Bank (2024)": "28,750,956","UN (2024)": "29,298,000"}],"rowCount": 195,"columnCount": 4}
💡 Tips & tricks
- Wikipedia tables: Most Wikipedia tables have `class="wikitable"`. Use `tableSelector: ".wikitable"` to skip navboxes and other small tables.
- Multiple tables on a page: Use `tableIndices: [1, 3]` to grab only the 2nd and 4th tables.
- No headers detected? If a table uses `<td>` for headers instead of `<th>`, set `headerRowIndex: 0` to treat the first row as headers.
- Skip tiny navigation tables: Set `minRows: 5` to ignore single-row tables that are really navigation elements.
- Nested tables: Enable `includeNestedTables` only if you specifically need data inside table cells that contain sub-tables — it can produce many more results.
- Sites with JS-rendered tables: This actor uses plain HTTP (no browser). If a table only appears after JavaScript runs, the actor won't see it. In that case, you'll need a browser-based scraper.
🔌 Integrations
HTML Table Extractor connects to your existing workflows:
📊 Google Sheets — Extract pricing tables or rankings directly into Sheets for weekly reporting. Use Apify's native Google Sheets integration to auto-append rows.
🗃️ Database pipelines — Send extracted JSON to PostgreSQL, MongoDB, or BigQuery via Apify's webhooks or the dataset API. Perfect for ETL jobs that run on a schedule.
🤖 AI/ML pipelines — Feed extracted tables to your LLM pipeline for summarization, Q&A, or training data generation. Combine with the AI Data Extractor for hybrid extraction.
📅 Scheduled monitoring — Run daily at 9am to check if a supplier's pricing table has changed. Pair with a webhook to Slack or email on data changes.
📁 Excel/CSV exports — All Apify datasets export to CSV and Excel natively — no extra steps needed.
🧑💻 API usage
Node.js
```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/html-table-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    tableSelector: '.wikitable',
    minRows: 5,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/html-table-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    'tableSelector': '.wikitable',
    'minRows': 5,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```shell
curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~html-table-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"], "tableSelector": ".wikitable", "minRows": 5}'
```
🤖 Use with AI agents via MCP
HTML Table Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:
Setup for Claude Code
```shell
claude mcp add --transport http apify "https://mcp.apify.com"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
{"mcpServers": {"apify": {"url": "https://mcp.apify.com"}}}
Your AI assistant will use OAuth to authenticate with your Apify account on first use.
Example prompts
Once connected, try asking your AI assistant:
- "Use automation-lab/html-table-extractor to extract all tables from https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) and show me the top 10 rows"
- "Extract the pricing table from https://example.com/pricing using the CSS selector .pricing-table"
- "Use the HTML table extractor to pull tables from these 5 Wikipedia pages and return only tables with more than 20 rows"
Learn more in the Apify MCP documentation.
⚖️ Legal & compliance
HTML Table Extractor fetches publicly accessible web pages using standard HTTP requests, similar to a browser visiting a page. It does not bypass authentication, CAPTCHA, or access controls.
Always check:
- The site's `robots.txt` before running at scale
- The site's Terms of Service regarding automated data collection
- Local laws governing web scraping and data usage in your jurisdiction
For academic research, journalism, and personal use on public data, web scraping is generally considered lawful in most jurisdictions. Commercial use may have additional requirements.
This actor does not store or share any data you extract — all data goes directly to your Apify dataset.
❓ FAQ
Q: Does this work on sites that require JavaScript to render tables?
A: No — this actor uses plain HTTP (no browser). If a table is rendered client-side by JavaScript (e.g., React, Angular), the actor won't see it. Check if the table exists in the page's HTML source (Ctrl+U in Chrome). If not, you need a browser-based actor.
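You can also run this check from a short script: count how many `<table>` tags appear in the raw HTML the server returns, which is exactly what a plain-HTTP actor sees. A minimal sketch using only the Python standard library (the function names here are illustrative, not part of the actor):

```python
import urllib.request

def count_tables_in_html(html: str) -> int:
    """Count <table> opening tags in raw HTML (what a plain-HTTP fetch sees)."""
    # Case-insensitive substring count is a rough but useful heuristic
    return html.lower().count("<table")

def fetch_html(url: str, timeout: int = 30) -> str:
    """Fetch a page over plain HTTP(S), without executing any JavaScript."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")

# Usage: if this prints 0 but your browser shows tables, they are JS-rendered.
# print(count_tables_in_html(fetch_html("https://example.com/page")))
```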
Q: What if a table has merged cells (colspan/rowspan)?
A: Cells with colspan or rowspan are currently extracted as-is. The text from merged cells will appear in the first position they occupy. For complex layouts with many merged cells, the output may not perfectly reconstruct the visual table — this is a known limitation of flat HTML-to-JSON conversion.
Q: I ran the actor but got 0 tables. What went wrong?
A: Common causes: (1) The tables are rendered by JavaScript — check the page HTML source. (2) Your CSS selector doesn't match any tables. (3) All tables were filtered out by minRows. Try running without a selector and with minRows: 0 to see all tables.
Q: Can I extract tables from PDFs?
A: No — this actor only handles HTML <table> elements on web pages. For PDF table extraction, see other actors in the Apify Store.
Q: How many URLs can I process in one run?
A: There's no hard limit. Each URL is fetched sequentially. For large batches (100+ URLs), consider the 300-second default timeout — increase timeoutSecs in your run options if needed.
Q: The actor found 12 tables but I only see data from one specific table. How do I target just that table?
A: Use tableIndices (e.g., [3] for the 4th table) or tableSelector with a specific CSS class or ID that wraps the table you want.
🔗 Related actors
- HTML to Markdown — Convert any webpage's HTML content to clean Markdown text
- CSV to JSON Converter — Convert CSV files or URLs to structured JSON datasets
- JSON Schema Generator — Infer JSON Schema from any JSON data or API response