HTML Table Extractor

Extract HTML tables from any webpage into structured JSON. Supports multiple URLs, filtering by CSS selector or table index, auto-header detection, and nested tables. Pure HTTP — no proxy needed.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: Stas Persiianenko (Maintained by Community)
Actor stats: 0 bookmarked · 3 total users · 3 monthly active users · last modified 6 days ago

🗂️ HTML Table Extractor

Extract structured data from any HTML <table> on the web and download it as clean JSON or CSV — no coding required.

Whether you're a researcher pulling data from Wikipedia, a financial analyst scraping earnings tables, or a developer building a data pipeline, this actor turns messy HTML tables into structured, ready-to-use datasets in seconds.


What does it do?

HTML Table Extractor fetches one or more URLs, finds all <table> elements on each page, and converts them to structured JSON — one dataset item per table. It automatically detects column headers from <th> tags, handles rowspans and colspans gracefully, and lets you filter by CSS selector, table index, or minimum row count.

Example: Extract the GDP rankings table from Wikipedia in one click. The actor fetches the page, identifies the right table, and outputs each row as a clean JSON object with named fields.
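The core idea is easy to picture in code. Below is a simplified, stdlib-only Python sketch of the same technique (the actor itself is built on cheerio, not this code): walk the HTML, collect `<th>` cells as headers, and zip each data row into a named object.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Minimal sketch: collect header and data cells from a <table>."""
    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows, self.row, self.cell = [], None, None
        self.in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []
            self.in_th = tag == "th"

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            text = "".join(self.cell).strip()
            (self.headers if self.in_th else self.row).append(text)
            self.cell = None
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = None

html = """<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>USA</td><td>28.7T</td></tr>
</table>"""
p = TableParser()
p.feed(html)
# Pair each data row with the detected headers, as the actor does
records = [dict(zip(p.headers, r)) for r in p.rows]
print(records)  # [{'Country': 'USA', 'GDP': '28.7T'}]
```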


Who is it for?

  • 📊 Data analysts who regularly copy-paste tables from financial sites, government portals, or Wikipedia into spreadsheets
  • 🧑‍💻 Developers building ETL pipelines that need structured data from HTML pages without writing custom parsers
  • 📰 Journalists and researchers who need to extract comparison tables, statistics, or rankings from web pages
  • 🏢 Business intelligence teams automating weekly data collection from partner or competitor websites
  • 🤖 AI/ML practitioners who need structured training data or reference tables from the web
  • 💼 Operations teams extracting pricing tables, product specs, or availability grids from supplier sites

Why use HTML Table Extractor?

  • No proxy needed — plain HTTP fetch works on most public web pages, keeping costs near zero
  • Smarter header detection — auto-detects <th> headers OR lets you specify which row is the header
  • Filter precisely — target tables by CSS selector, 0-based index, or minimum row count
  • Nested table support — optionally extract tables inside other tables
  • Multiple URLs in one run — batch many pages together
  • PPE pricing — pay only for what you extract, with volume discounts for heavy users
  • Dataset-ready output — every row is a flat JSON object, ready for Apify datasets, Google Sheets, or your own database


📋 What data can it extract?

| Field | Description | Example |
|---|---|---|
| `url` | Source page URL | `https://en.wikipedia.org/wiki/...` |
| `pageTitle` | Page `<title>` text | "List of countries by GDP (nominal)" |
| `tableIndex` | 0-based position on page | `2` |
| `hasHeaders` | Whether headers were detected | `true` |
| `headers` | Array of column header names | `["Country", "GDP (USD)", "Year"]` |
| `rows` | Array of row objects | `[{"Country": "USA", "GDP (USD)": "28.7T"}]` |
| `rowCount` | Number of data rows extracted | `195` |
| `columnCount` | Number of columns | `4` |

💰 How much does it cost to extract HTML tables?

HTML Table Extractor uses Pay-Per-Event (PPE) pricing — you pay only for what you actually extract:

| Event | Price |
|---|---|
| Actor start | $0.005 per run |
| Table extracted | $0.0008 per table |

Example costs:

  • Extract 10 tables from 3 URLs → $0.005 + (10 × $0.0008) = $0.013
  • Extract 100 tables from Wikipedia → $0.005 + (100 × $0.0008) = $0.085
  • Daily job extracting 50 tables → $0.045/day = **$1.35/month**
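The arithmetic behind these examples is simple enough to sanity-check yourself. A tiny Python helper, using the base prices from the table above and ignoring volume discounts:

```python
ACTOR_START = 0.005   # $ per run (base price)
PER_TABLE = 0.0008    # $ per extracted table (base price)

def run_cost(tables: int, runs: int = 1) -> float:
    """Estimated cost in USD for `runs` runs extracting `tables` tables total."""
    return round(runs * ACTOR_START + tables * PER_TABLE, 4)

print(run_cost(10))           # 0.013 -> 10 tables in one run
print(run_cost(100))          # 0.085
print(run_cost(50 * 30, 30))  # 1.35  -> 30 daily runs of 50 tables each
```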

Volume discounts apply automatically — heavy users (FREE → BRONZE → SILVER → GOLD → PLATINUM → DIAMOND tiers) get up to 60% off the base per-table price.

Free tier: Apify gives every account $5/month in free usage. That's enough to extract ~6,200 tables for free each month.

No proxy is used by default, so compute costs are minimal — just a lightweight HTTP fetch and cheerio HTML parsing.


🚀 How to use it — step by step

  1. Open the actor at apify.com/automation-lab/html-table-extractor
  2. Paste your URLs — one or more pages containing HTML tables
  3. Optionally filter — set a CSS selector (e.g. #main-content) or specify which table index to extract
  4. Click Run — the actor fetches pages and extracts tables immediately
  5. Download results — export as JSON, CSV, Excel, or send to Google Sheets via integration

Tip: Start with the pre-filled Wikipedia example to see output format before running on your own URLs.


📥 Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | ✅ Yes | (none) | List of page URLs to extract tables from |
| `tableSelector` | string | No | `""` | CSS selector to scope the table search (e.g. `.main-content table`) |
| `tableIndices` | array | No | `[]` | Extract only tables at these 0-based positions (e.g. `[0, 2]`) |
| `minRows` | integer | No | `1` | Skip tables with fewer rows than this |
| `maxTablesPerPage` | integer | No | `100` | Maximum tables to extract per URL (`0` = unlimited) |
| `headerRowIndex` | integer | No | `-1` | Row to use as headers (`-1` = auto-detect from `<th>` tags) |
| `includeNestedTables` | boolean | No | `false` | Also extract tables nested inside other tables |
| `proxyConfiguration` | object | No | none | Optional proxy (not needed for most public sites) |
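Putting the parameters together, a typical input for pulling a Wikipedia statistics table might look like this (values are illustrative):

```json
{
  "urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"],
  "tableSelector": ".wikitable",
  "tableIndices": [],
  "minRows": 5,
  "maxTablesPerPage": 100,
  "headerRowIndex": -1,
  "includeNestedTables": false
}
```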

📤 Output example

Each extracted table becomes one dataset item:

```json
{
  "url": "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)",
  "pageTitle": "List of countries by GDP (nominal) - Wikipedia",
  "tableIndex": 2,
  "hasHeaders": true,
  "headers": ["Country/Territory", "IMF (2026)", "World Bank (2024)", "UN (2024)"],
  "rows": [
    {
      "Country/Territory": "World",
      "IMF (2026)": "123,584,494",
      "World Bank (2024)": "111,326,370",
      "UN (2024)": "100,834,796"
    },
    {
      "Country/Territory": "United States",
      "IMF (2026)": "31,821,293",
      "World Bank (2024)": "28,750,956",
      "UN (2024)": "29,298,000"
    }
  ],
  "rowCount": 195,
  "columnCount": 4
}
```
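Because `rows` is a list of flat objects keyed by the entries in `headers`, converting a dataset item to CSV yourself takes only a few lines of standard-library Python (a sketch with a truncated sample item; Apify can also export CSV for you):

```python
import csv
import io

# A dataset item shaped like the output example above (truncated sample)
item = {
    "headers": ["Country/Territory", "IMF (2026)"],
    "rows": [
        {"Country/Territory": "World", "IMF (2026)": "123,584,494"},
        {"Country/Territory": "United States", "IMF (2026)": "31,821,293"},
    ],
}

# Write the headers row, then one CSV line per row object
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=item["headers"])
writer.writeheader()
writer.writerows(item["rows"])
csv_text = buf.getvalue()
print(csv_text)
```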

💡 Tips & tricks

  • Wikipedia tables: Most Wikipedia tables have class="wikitable". Use tableSelector: ".wikitable" to skip navboxes and other small tables.
  • Multiple tables on a page: Use tableIndices: [1, 3] to grab only the 2nd and 4th tables.
  • No headers detected? If a table uses <td> for headers instead of <th>, set headerRowIndex: 0 to treat the first row as headers.
  • Skip tiny navigation tables: Set minRows: 5 to ignore single-row tables that are really navigation elements.
  • Nested tables: Enable includeNestedTables only if you specifically need data inside table cells that contain sub-tables — it can produce many more results.
  • Sites with JS-rendered tables: This actor uses plain HTTP (no browser). If a table only appears after JavaScript runs, the actor won't see it. In that case, you'll need a browser-based scraper.
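To check up front whether a page's tables exist in the raw HTML (rather than being injected by JavaScript), you can count `<table>` tags in the page source. A stdlib sketch, run here on inline strings; for a real page you would fetch the URL first:

```python
import re

def count_static_tables(html: str) -> int:
    """Count <table> opening tags in raw HTML. A count of 0 on a page
    where you can see tables in the browser usually means JS rendering."""
    return len(re.findall(r"<table\b", html, flags=re.IGNORECASE))

static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
js_page = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(count_static_tables(static_page))  # 1 -> extractable by this actor
print(count_static_tables(js_page))      # 0 -> needs a browser-based scraper
```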

🔌 Integrations

HTML Table Extractor connects to your existing workflows:

📊 Google Sheets — Extract pricing tables or rankings directly into Sheets for weekly reporting. Use Apify's native Google Sheets integration to auto-append rows.

🗃️ Database pipelines — Send extracted JSON to PostgreSQL, MongoDB, or BigQuery via Apify's webhooks or the dataset API. Perfect for ETL jobs that run on a schedule.

🤖 AI/ML pipelines — Feed extracted tables to your LLM pipeline for summarization, Q&A, or training data generation. Combine with the AI Data Extractor for hybrid extraction.

📅 Scheduled monitoring — Run daily at 9am to check if a supplier's pricing table has changed. Pair with a webhook to Slack or email on data changes.

📁 Excel/CSV exports — All Apify datasets export to CSV and Excel natively — no extra steps needed.
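For database and spreadsheet pipelines, dataset contents can be pulled over plain HTTP from Apify's public dataset items endpoint. A small helper that builds the export URL (the dataset ID and token below are placeholders):

```python
from urllib.parse import urlencode

def dataset_export_url(dataset_id: str, token: str, fmt: str = "csv") -> str:
    """Build the Apify dataset items export URL for a given format."""
    query = urlencode({"format": fmt, "token": token})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"

url = dataset_export_url("DATASET_ID", "YOUR_APIFY_TOKEN")
print(url)
```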


🧑‍💻 API usage

Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/html-table-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    tableSelector: '.wikitable',
    minRows: 5,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

Python

```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/html-table-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    'tableSelector': '.wikitable',
    'minRows': 5,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

cURL

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~html-table-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"],
    "tableSelector": ".wikitable",
    "minRows": 5
  }'
```

🤖 Use with AI agents via MCP

HTML Table Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

Setup for Claude Code

```bash
claude mcp add --transport http apify "https://mcp.apify.com"
```

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}
```

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, you can ask your AI assistant in plain language to extract tables from a URL for you.

Learn more in the Apify MCP documentation.


⚖️ Legality & responsible use

HTML Table Extractor fetches publicly accessible web pages using standard HTTP requests, similar to a browser visiting a page. It does not bypass authentication, CAPTCHA, or access controls.

Always check:

  • The site's robots.txt before running at scale
  • The site's Terms of Service regarding automated data collection
  • Local laws governing web scraping and data usage in your jurisdiction

For academic research, journalism, and personal use on public data, web scraping is generally considered lawful in most jurisdictions. Commercial use may have additional requirements.

This actor does not store or share any data you extract — all data goes directly to your Apify dataset.


❓ FAQ

Q: Does this work on sites that require JavaScript to render tables?
A: No — this actor uses plain HTTP (no browser). If a table is rendered client-side by JavaScript (e.g., React, Angular), the actor won't see it. Check if the table exists in the page's HTML source (Ctrl+U in Chrome). If not, you need a browser-based actor.

Q: What if a table has merged cells (colspan/rowspan)?
A: Cells with colspan or rowspan are currently extracted as-is. The text from merged cells will appear in the first position they occupy. For complex layouts with many merged cells, the output may not perfectly reconstruct the visual table — this is a known limitation of flat HTML-to-JSON conversion.

Q: I ran the actor but got 0 tables. What went wrong?
A: Common causes: (1) The tables are rendered by JavaScript — check the page HTML source. (2) Your CSS selector doesn't match any tables. (3) All tables were filtered out by minRows. Try running without a selector and with minRows: 0 to see all tables.

Q: Can I extract tables from PDFs?
A: No — this actor only handles HTML <table> elements on web pages. For PDF table extraction, see other actors in the Apify Store.

Q: How many URLs can I process in one run?
A: There's no hard limit. Each URL is fetched sequentially. For large batches (100+ URLs), consider the 300-second default timeout — increase timeoutSecs in your run options if needed.

Q: The actor found 12 tables but I only see data from one specific table. How do I target just that table?
A: Use tableIndices (e.g., [3] for the 4th table) or tableSelector with a specific CSS class or ID that wraps the table you want.