HTML Table Extractor
Extract HTML tables from any webpage into structured JSON. Supports multiple URLs, filtering by CSS selector or table index, auto-header detection, and nested tables. Pure HTTP — no proxy needed.
Pricing: Pay per event
Developer: Stas Persiianenko
🗂️ HTML Table Extractor
Extract structured data from any HTML <table> on the web and download it as clean JSON or CSV — no coding required.
Whether you're a researcher pulling data from Wikipedia, a financial analyst scraping earnings tables, or a developer building a data pipeline, this actor turns messy HTML tables into structured, ready-to-use datasets in seconds.
What does it do?
HTML Table Extractor fetches one or more URLs, finds all <table> elements on each page, and converts them to structured JSON — one dataset item per table. It automatically detects column headers from <th> tags, handles rowspans and colspans gracefully, and lets you filter by CSS selector, table index, or minimum row count.
Example: Extract the GDP rankings table from Wikipedia in one click. The actor fetches the page, identifies the right table, and outputs each row as a clean JSON object with named fields.
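A minimal run input for that example could look like this (field names follow the input schema documented below):

```json
{
  "urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"],
  "tableSelector": ".wikitable"
}
```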
Who is it for?
- 📊 Data analysts who regularly copy-paste tables from financial sites, government portals, or Wikipedia into spreadsheets
- 🧑💻 Developers building ETL pipelines that need structured data from HTML pages without writing custom parsers
- 📰 Journalists and researchers who need to extract comparison tables, statistics, or rankings from web pages
- 🏢 Business intelligence teams automating weekly data collection from partner or competitor websites
- 🤖 AI/ML practitioners who need structured training data or reference tables from the web
- 💼 Operations teams extracting pricing tables, product specs, or availability grids from supplier sites
Why use HTML Table Extractor?
✅ No proxy needed — plain HTTP fetch works on most public web pages, keeping costs near zero
✅ Smarter header detection — auto-detects <th> headers OR lets you specify which row is the header
✅ Filter precisely — target tables by CSS selector, 0-based index, or minimum row count
✅ Nested table support — optionally extract tables inside other tables
✅ Multiple URLs in one run — batch many pages together
✅ PPE pricing — pay only for what you extract, with volume discounts for heavy users
✅ Dataset-ready output — every row is a flat JSON object, ready for Apify datasets, Google Sheets, or your own database
📋 What data can it extract?
| Field | Description | Example |
|---|---|---|
| `url` | Source page URL | `https://en.wikipedia.org/wiki/...` |
| `pageTitle` | Page `<title>` text | "List of countries by GDP (nominal)" |
| `tableIndex` | 0-based position on page | 2 |
| `hasHeaders` | Whether headers were detected | `true` |
| `headers` | Array of column header names | `["Country", "GDP (USD)", "Year"]` |
| `rows` | Array of row objects | `[{"Country": "USA", "GDP (USD)": "28.7T"}]` |
| `rowCount` | Number of data rows extracted | 195 |
| `columnCount` | Number of columns | 4 |
💰 How much does it cost to extract HTML tables?
HTML Table Extractor uses Pay-Per-Event (PPE) pricing — you pay only for what you actually extract:
| Event | Price |
|---|---|
| Actor start | $0.005 per run |
| Table extracted | $0.0008 per table |
Example costs:
- Extract 10 tables from 3 URLs → $0.005 + (10 × $0.0008) = $0.013
- Extract 100 tables from Wikipedia → $0.005 + (100 × $0.0008) = $0.085
- Daily job extracting 50 tables → $0.045/day = **$1.35/month**
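The arithmetic above is easy to script if you want to estimate a job before running it — a minimal sketch using the per-event prices from the table above (base prices only, before any volume discount):

```python
ACTOR_START = 0.005   # $ per run (from the pricing table above)
PER_TABLE = 0.0008    # $ per extracted table

def run_cost(tables: int, runs: int = 1) -> float:
    """Base cost in USD for extracting `tables` tables across `runs` runs."""
    return runs * ACTOR_START + tables * PER_TABLE

print(f"${run_cost(10):.3f}")             # 10 tables in one run
print(f"${run_cost(50) * 30:.2f}/month")  # daily 50-table job, 30 days
```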
Volume discounts apply automatically — heavy users (FREE → BRONZE → SILVER → GOLD → PLATINUM → DIAMOND tiers) get up to 60% off the base per-table price.
Free tier: Apify gives every account $5/month in free usage. That's enough to extract ~6,200 tables for free each month.
No proxy is used by default, so compute costs are minimal — just a lightweight HTTP fetch and cheerio HTML parsing.
🚀 How to use it — step by step
- Open the actor at apify.com/automation-lab/html-table-extractor
- Paste your URLs — one or more pages containing HTML tables
- Optionally filter — set a CSS selector (e.g. `#main-content`) or specify which table index to extract
- Click Run — the actor fetches pages and extracts tables immediately
- Download results — export as JSON, CSV, Excel, or send to Google Sheets via integration

Tip: Start with the pre-filled Wikipedia example to see the output format before running on your own URLs.
📥 Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | ✅ Yes | — | List of page URLs to extract tables from |
| `tableSelector` | string | No | `""` | CSS selector to scope table search (e.g. `.main-content table`) |
| `tableIndices` | array | No | `[]` | Extract only tables at these 0-based positions (e.g. `[0, 2]`) |
| `minRows` | integer | No | `1` | Skip tables with fewer rows than this |
| `maxTablesPerPage` | integer | No | `100` | Maximum tables to extract per URL (0 = unlimited) |
| `headerRowIndex` | integer | No | `-1` | Row to use as headers (-1 = auto-detect from `<th>` tags) |
| `includeNestedTables` | boolean | No | `false` | Also extract tables nested inside other tables |
| `proxyConfiguration` | object | No | none | Optional proxy (not needed for most public sites) |
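Combining several of the parameters above, a run input might look like this (URLs and values are illustrative):

```json
{
  "urls": ["https://example.com/stats", "https://example.com/pricing"],
  "tableSelector": ".main-content table",
  "minRows": 3,
  "headerRowIndex": -1,
  "includeNestedTables": false
}
```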
📤 Output example
Each extracted table becomes one dataset item:
{"url": "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)","pageTitle": "List of countries by GDP (nominal) - Wikipedia","tableIndex": 2,"hasHeaders": true,"headers": ["Country/Territory", "IMF (2026)", "World Bank (2024)", "UN (2024)"],"rows": [{"Country/Territory": "World","IMF (2026)": "123,584,494","World Bank (2024)": "111,326,370","UN (2024)": "100,834,796"},{"Country/Territory": "United States","IMF (2026)": "31,821,293","World Bank (2024)": "28,750,956","UN (2024)": "29,298,000"}],"rowCount": 195,"columnCount": 4}
💡 Tips & tricks
- Wikipedia tables: Most Wikipedia tables have `class="wikitable"`. Use `tableSelector: ".wikitable"` to skip navboxes and other small tables.
- Multiple tables on a page: Use `tableIndices: [1, 3]` to grab only the 2nd and 4th tables.
- No headers detected? If a table uses `<td>` for headers instead of `<th>`, set `headerRowIndex: 0` to treat the first row as headers.
- Skip tiny navigation tables: Set `minRows: 5` to ignore single-row tables that are really navigation elements.
- Nested tables: Enable `includeNestedTables` only if you specifically need data inside table cells that contain sub-tables — it can produce many more results.
- Sites with JS-rendered tables: This actor uses plain HTTP (no browser). If a table only appears after JavaScript runs, the actor won't see it. In that case, you'll need a browser-based scraper.
🔌 Integrations
HTML Table Extractor connects to your existing workflows:
📊 Google Sheets — Extract pricing tables or rankings directly into Sheets for weekly reporting. Use Apify's native Google Sheets integration to auto-append rows.
🗃️ Database pipelines — Send extracted JSON to PostgreSQL, MongoDB, or BigQuery via Apify's webhooks or the dataset API. Perfect for ETL jobs that run on a schedule.
🤖 AI/ML pipelines — Feed extracted tables to your LLM pipeline for summarization, Q&A, or training data generation. Combine with the AI Data Extractor for hybrid extraction.
📅 Scheduled monitoring — Run daily at 9am to check if a supplier's pricing table has changed. Pair with a webhook to Slack or email on data changes.
📁 Excel/CSV exports — All Apify datasets export to CSV and Excel natively — no extra steps needed.
🧑💻 API usage
Node.js
```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/html-table-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    tableSelector: '.wikitable',
    minRows: 5,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/html-table-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'],
    'tableSelector': '.wikitable',
    'minRows': 5,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```shell
curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~html-table-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"], "tableSelector": ".wikitable", "minRows": 5}'
```
🤖 Use with AI agents via MCP
HTML Table Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:
Setup for Claude Code
```shell
claude mcp add --transport http apify "https://mcp.apify.com"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
{"mcpServers": {"apify": {"url": "https://mcp.apify.com"}}}
Your AI assistant will use OAuth to authenticate with your Apify account on first use.
Example prompts
Once connected, try asking your AI assistant:
- "Use automation-lab/html-table-extractor to extract all tables from https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) and show me the top 10 rows"
- "Extract the pricing table from https://example.com/pricing using the CSS selector .pricing-table"
- "Use the HTML table extractor to pull tables from these 5 Wikipedia pages and return only tables with more than 20 rows"
Learn more in the Apify MCP documentation.
⚖️ Legal & compliance
HTML Table Extractor fetches publicly accessible web pages using standard HTTP requests, similar to a browser visiting a page. It does not bypass authentication, CAPTCHA, or access controls.
Always check:
- The site's `robots.txt` before running at scale
- The site's Terms of Service regarding automated data collection
- Local laws governing web scraping and data usage in your jurisdiction
For academic research, journalism, and personal use on public data, web scraping is generally considered lawful in most jurisdictions. Commercial use may have additional requirements.
This actor does not store or share any data you extract — all data goes directly to your Apify dataset.
❓ FAQ
Q: Does this work on sites that require JavaScript to render tables?
A: No — this actor uses plain HTTP (no browser). If a table is rendered client-side by JavaScript (e.g., React, Angular), the actor won't see it. Check if the table exists in the page's HTML source (Ctrl+U in Chrome). If not, you need a browser-based actor.
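You can also run this check from a short script: count how many `<table>` tags appear in the raw HTML the server returns, which is exactly what a plain-HTTP actor sees. A minimal sketch using only the Python standard library (the function names here are illustrative, not part of the actor):

```python
import urllib.request

def count_tables_in_html(html: str) -> int:
    """Count <table> opening tags in raw HTML (what a plain-HTTP fetch sees)."""
    # Case-insensitive substring count is a rough but useful heuristic
    return html.lower().count("<table")

def fetch_html(url: str, timeout: int = 30) -> str:
    """Fetch a page over plain HTTP(S), without executing any JavaScript."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")

# Usage: if this prints 0 but your browser shows tables, they are JS-rendered.
# print(count_tables_in_html(fetch_html("https://example.com/page")))
```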
Q: What if a table has merged cells (colspan/rowspan)?
A: Cells with colspan or rowspan are currently extracted as-is. The text from merged cells will appear in the first position they occupy. For complex layouts with many merged cells, the output may not perfectly reconstruct the visual table — this is a known limitation of flat HTML-to-JSON conversion.
Q: I ran the actor but got 0 tables. What went wrong?
A: Common causes: (1) The tables are rendered by JavaScript — check the page HTML source. (2) Your CSS selector doesn't match any tables. (3) All tables were filtered out by minRows. Try running without a selector and with minRows: 0 to see all tables.
Q: Can I extract tables from PDFs?
A: No — this actor only handles HTML <table> elements on web pages. For PDF table extraction, see other actors in the Apify Store.
Q: How many URLs can I process in one run?
A: There's no hard limit. Each URL is fetched sequentially. For large batches (100+ URLs), consider the 300-second default timeout — increase timeoutSecs in your run options if needed.
Q: The actor found 12 tables but I only see data from one specific table. How do I target just that table?
A: Use tableIndices (e.g., [3] for the 4th table) or tableSelector with a specific CSS class or ID that wraps the table you want.
🔗 Related actors
- HTML to Markdown — Convert any webpage's HTML content to clean Markdown text
- CSV to JSON Converter — Convert CSV files or URLs to structured JSON datasets
- JSON Schema Generator — Infer JSON Schema from any JSON data or API response