# Website Content Miner (`techionik9993/website-content-miner`) Actor

Extract clean website content at scale: page titles, meta descriptions, H1-H3 headings, readable main text, and URLs. Includes smart noise removal, Readability fallback, optional internal crawling, and structured output for SEO audits, AI datasets, research, and automation.

- **URL**: https://apify.com/techionik9993/website-content-miner.md
- **Developed by:** [Techionik](https://apify.com/techionik9993) (community)
- **Categories:** Automation, Developer tools, SEO tools
- **Stats:** 7 total users, 3 monthly users, 100.0% runs succeeded, 1 bookmark
- **User rating**: 5.00 out of 5 stars

## Pricing

$7.00/month + usage

To use this Actor, you pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for Apify platform usage, which gets cheaper the higher your Apify subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#rental-actors

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

Actors can be integrated into projects in any stack, and the integration should be safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Website Content Miner

Extract clean, structured, and human-readable content from websites without writing custom selectors.

Website Content Miner is built for SEO audits, AI preprocessing, research, content analysis, website archiving, and automation workflows. It crawls standard HTML websites and returns organized page-level data including page titles, meta descriptions, headings, clean main text, and source URLs.

### What This Actor Does

Website Content Miner helps you turn website pages into clean structured datasets.

It automatically:

- Extracts page titles
- Extracts meta descriptions
- Extracts H1, H2, and H3 headings
- Extracts readable main page text
- Removes common website noise such as navigation menus, footers, cookie banners, modals, newsletter blocks, and social/share sections
- Uses smart content detection with Mozilla Readability fallback
- Optionally follows internal links with crawl depth control
- Outputs clean dataset items ready for SEO, AI, research, or automation use

### Best For

- SEO content audits
- Website content extraction
- AI dataset preparation
- LLM / RAG preprocessing
- Competitor research
- Content inventory creation
- Website text archiving
- Marketing and content analysis
- Automation workflows using Apify, Make, n8n, Zapier, or custom APIs

### Data Extracted

Each scraped page returns the following fields:

| Field | Description |
|---|---|
| pageTitle | The page title, using Open Graph title or HTML title |
| metaDescription | The page meta description, using standard or Open Graph description |
| headings | Extracted H1, H2, and H3 headings |
| mainText | Clean readable page text with common noise removed |
| pageUrl | Final scraped page URL |

### Input Options

#### Start URLs

Add one or more website URLs to scrape.

Example:

https://example.com

#### Crawl Links

Enable this option if you want the Actor to follow links found on the provided pages.

Default: false

#### Max Enqueue Depth

Controls how deep the scraper should follow links.

Examples:

- 0 = scrape only the provided start URLs
- 1 = scrape start URLs and links found on those pages
- 2 = scrape links found on the next level as well

Default: 1
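
To make the depth values concrete, here is a small sketch of a depth-limited breadth-first walk over a hypothetical link graph. It illustrates the general idea only; the Actor's own crawling is handled by Crawlee, and the `crawl_order` function and the `site` graph are assumptions made up for this example.

```python
from collections import deque
from typing import Dict, List


def crawl_order(links: Dict[str, List[str]], start_urls: List[str], max_depth: int) -> List[str]:
    """Breadth-first walk that stops enqueueing links past max_depth."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # links on this page would exceed the depth limit
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph: page -> links found on that page
site = {
    "/": ["/about", "/blog"],
    "/blog": ["/blog/post-1"],
}

print(crawl_order(site, ["/"], 0))  # only the start URL
print(crawl_order(site, ["/"], 1))  # start URL plus its links
print(crawl_order(site, ["/"], 2))  # one more level
```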

#### Same Domain Only

When enabled, the Actor only follows links from the same domain as the first start URL.

This is useful for keeping the crawl focused on one website.

Default: true
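
A same-domain check can be sketched as a hostname comparison. This is an assumption about how such a filter might work, not the Actor's confirmed logic: the source does not say whether subdomains or a `www.` prefix count as the same domain, so the hypothetical `same_domain` helper below makes one possible choice.

```python
from urllib.parse import urlparse


def same_domain(start_url: str, link: str) -> bool:
    """Compare hostnames, ignoring case and a leading 'www.' (an assumed rule)."""

    def host(url: str) -> str:
        name = (urlparse(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name

    return host(link) != "" and host(start_url) == host(link)


print(same_domain("https://example.com", "https://www.example.com/about"))
print(same_domain("https://example.com", "https://other.org/"))
```

In this sketch a subdomain such as `blog.example.com` is treated as a different domain; a real filter may be more permissive.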

#### Max Requests per Crawl

Sets the maximum number of pages processed in one run.

Default: 100

### Output Example

```json
{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [
    {
      "level": "h1",
      "text": "Example Domain"
    }
  ],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
```

### How It Works

1. Website Content Miner starts from the URLs you provide.
2. It loads each page using Crawlee and Cheerio.
3. It detects the main content area using common content selectors such as main, article, #content, .content, and similar structures.
4. It removes common noise elements like headers, navigation menus, footers, forms, scripts, cookie banners, modals, newsletter blocks, and social sharing sections.
5. It extracts titles, descriptions, headings, and readable text.
6. It uses Mozilla Readability first, then applies a stronger fallback strategy for pages where content is not structured like a standard article.
7. It saves each result to the Apify dataset.
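
Steps 4 and 5 above can be illustrated with a small standard-library sketch. It is not the Actor's implementation (which uses Cheerio and Mozilla Readability in JavaScript): the `ContentExtractor` class, the `NOISE_TAGS` set, and the `extract` helper are assumptions for illustration, and the sketch skips main-content detection and the Readability step entirely.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"header", "nav", "footer", "form", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}


class ContentExtractor(HTMLParser):
    """Collects the title, H1-H3 headings, and text outside noise elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.in_title = False
        self.title = ""
        self.current_heading = None
        self.headings = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag in NOISE_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS and self.skip_depth == 0:
            self.current_heading = {"level": tag, "text": ""}

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS and self.current_heading is not None:
            self.current_heading["text"] = self.current_heading["text"].strip()
            self.headings.append(self.current_heading)
            self.current_heading = None

    def handle_data(self, data):
        if self.in_title:
            self.title += data
            return
        if self.skip_depth > 0 or not data.strip():
            return  # drop noise content and whitespace-only text
        if self.current_heading is not None:
            self.current_heading["text"] += data
        self.text_parts.append(data.strip())


def extract(html: str) -> dict:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {
        "pageTitle": parser.title.strip(),
        "headings": parser.headings,
        "mainText": " ".join(parser.text_parts),
    }


sample_html = """
<html><head><title>Example Website</title><script>var x = 1;</script></head>
<body>
<nav><a href="/">Home</a></nav>
<main><h1>Example Domain</h1><p>This domain is for use in examples.</p></main>
<footer>Copyright</footer>
</body></html>
"""

print(extract(sample_html))
```

The sketch assumes well-formed HTML with closed noise elements; a production extractor needs a real DOM and more robust selectors.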

### Key Features

- Clean structured output
- No custom selectors required
- Smart main content detection
- Noise removal for cleaner text
- Optional internal link crawling
- Same-domain crawling option
- Crawl depth control
- Request limit control
- SEO and AI-ready dataset format
- Simple input configuration
- Easy integration through Apify API

### Typical Use Cases

#### SEO Audits

Collect page titles, meta descriptions, headings, and page text from websites to review content structure and optimization quality.
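
As a rough sketch of what such a review might look like, the hypothetical `audit_page` function below flags a few common issues on one dataset item. The checks and the 60-character title threshold are assumptions chosen for illustration, not rules applied by the Actor.

```python
from typing import Dict, List


def audit_page(item: Dict) -> List[str]:
    """Return human-readable findings for one dataset item."""
    findings = []
    title = item.get("pageTitle") or ""
    if not title:
        findings.append("missing page title")
    elif len(title) > 60:  # assumed threshold, not an Actor setting
        findings.append("page title longer than 60 characters")
    if not item.get("metaDescription"):
        findings.append("missing meta description")
    h1_count = sum(1 for h in item.get("headings", []) if h["level"] == "h1")
    if h1_count != 1:
        findings.append(f"expected exactly one h1, found {h1_count}")
    return findings


page = {
    "pageTitle": "Example Website",
    "metaDescription": "",
    "headings": [
        {"level": "h1", "text": "Welcome"},
        {"level": "h1", "text": "Example Domain"},
    ],
    "mainText": "...",
    "pageUrl": "https://example.com",
}

print(audit_page(page))
```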

#### AI and LLM Preprocessing

Prepare clean website text for AI workflows, embeddings, semantic search, RAG systems, and knowledge base creation.
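
For embedding or RAG pipelines, `mainText` usually needs to be split into chunks first. Below is a minimal sketch using overlapping character windows; the `chunk_text` helper and its default sizes are assumptions, and many pipelines split on sentences or tokens instead.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window reached the end of the text
        start += max_chars - overlap
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```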

#### Website Research

Extract readable content from multiple pages for competitor research, market research, or content analysis.

#### Content Inventory

Create a structured inventory of website pages, including titles, URLs, headings, and body text.

#### Website Archiving

Save clean text versions of website pages for documentation, research, or long-term reference.

#### Automation Workflows

Use the output dataset in Apify integrations, Make, n8n, Zapier, Google Sheets, databases, or custom APIs.

### Recommended Settings

#### For a Single Page

- crawlLinks: false
- maxRequestsPerCrawl: 1

#### For a Small Website Audit

- crawlLinks: true
- maxEnqueueDepth: 1
- sameDomainOnly: true
- maxRequestsPerCrawl: 50

#### For a Larger Website Crawl

- crawlLinks: true
- maxEnqueueDepth: 2
- sameDomainOnly: true
- maxRequestsPerCrawl: 100 or higher
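
The three recommended configurations can be kept as reusable presets. This is a small sketch: the preset names and the `build_input` helper are hypothetical, while the field names and values come from the settings listed above.

```python
from typing import Dict, List

# Preset names are illustrative; values mirror the recommended settings.
PRESETS: Dict[str, Dict] = {
    "single_page": {"crawlLinks": False, "maxRequestsPerCrawl": 1},
    "small_audit": {
        "crawlLinks": True,
        "maxEnqueueDepth": 1,
        "sameDomainOnly": True,
        "maxRequestsPerCrawl": 50,
    },
    "larger_crawl": {
        "crawlLinks": True,
        "maxEnqueueDepth": 2,
        "sameDomainOnly": True,
        "maxRequestsPerCrawl": 100,  # raise this for bigger sites
    },
}


def build_input(preset: str, urls: List[str]) -> Dict:
    """Combine start URLs with one of the presets into an Actor input object."""
    return {"startUrls": [{"url": u} for u in urls], **PRESETS[preset]}


print(build_input("small_audit", ["https://example.com"]))
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the API section.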

### Notes and Limitations

- Best suited for static and semi-static HTML websites
- Not designed for websites that require login
- Not ideal for heavily JavaScript-rendered applications
- Results depend on the quality and structure of the target website
- For websites with strict anti-bot protection, proxy configuration may be required

### Output Access

After the run finishes, you can access the scraped data from:

- Apify Dataset
- Dataset API
- Overview table
- JSON, CSV, Excel, XML, or RSS exports
- Apify integrations and webhooks
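
For the Dataset API route, exports are selected with the `format` query parameter of the dataset items endpoint. The sketch below only builds the URL and makes no network call; the `dataset_items_url` helper is an assumption, and the dataset ID shown is a placeholder.

```python
from urllib.parse import urlencode

# Formats named in the list above, using the API's identifiers (Excel is "xlsx").
SUPPORTED_FORMATS = {"json", "csv", "xlsx", "xml", "rss"}


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the dataset items URL for the requested export format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    query = urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


print(dataset_items_url("<DATASET_ID>", "csv"))
```

The dataset ID is the `defaultDatasetId` of the run. Requests to this URL need your API token, for example in an `Authorization: Bearer` header.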

### Why Use Website Content Miner

Website Content Miner saves time by automatically extracting clean, structured website content without requiring custom scraping rules for every website.

It is useful for anyone who needs reliable page-level content data for SEO, AI, automation, research, reporting, or content intelligence workflows.

### Technology

Built with:

- Apify SDK
- Crawlee
- CheerioCrawler
- Cheerio
- Mozilla Readability

### Status

Production-ready for general website content extraction.

# Actor input Schema

## `startUrls` (type: `array`):

URLs to start with for scraping.

## `crawlLinks` (type: `boolean`):

If enabled, the scraper will follow links found on pages (up to the depth limit).

## `maxEnqueueDepth` (type: `integer`):

How deep to follow links. 0 = only start URLs, 1 = start URLs + their links, 2 = + links from those pages, etc.

## `sameDomainOnly` (type: `boolean`):

If enabled, only links from the same domain as the start URL(s) will be crawled.

## `maxRequestsPerCrawl` (type: `integer`):

Maximum number of pages to scrape in one run.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "crawlLinks": false,
  "maxEnqueueDepth": 1,
  "sameDomainOnly": true,
  "maxRequestsPerCrawl": 100
}
````
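
Before starting a run, the input can be checked locally against the limits declared in the OpenAPI specification in this document (`maxEnqueueDepth` from 0 to 10, `maxRequestsPerCrawl` at least 1). The `validate_input` helper below is a hypothetical sketch; the platform performs its own validation regardless.

```python
from typing import Any, Dict, List


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def validate_input(actor_input: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    for entry in actor_input.get("startUrls", []):
        if not isinstance(entry, dict) or not isinstance(entry.get("url"), str):
            errors.append("each startUrls entry needs a string 'url'")
            break
    for flag in ("crawlLinks", "sameDomainOnly"):
        if flag in actor_input and not isinstance(actor_input[flag], bool):
            errors.append(f"{flag} must be a boolean")
    depth = actor_input.get("maxEnqueueDepth", 1)
    if not _is_int(depth) or not 0 <= depth <= 10:
        errors.append("maxEnqueueDepth must be an integer from 0 to 10")
    max_requests = actor_input.get("maxRequestsPerCrawl", 100)
    if not _is_int(max_requests) or max_requests < 1:
        errors.append("maxRequestsPerCrawl must be an integer of at least 1")
    return errors


print(validate_input({"startUrls": [{"url": "https://apify.com"}], "maxEnqueueDepth": 11}))
```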

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://apify.com"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("techionik9993/website-content-miner").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://apify.com" }] }

# Run the Actor and wait for it to finish
run = client.actor("techionik9993/website-content-miner").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ]
}' |
apify call techionik9993/website-content-miner --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=techionik9993/website-content-miner",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Miner",
        "description": "Extract clean website content at scale: page titles, meta descriptions, H1-H3 headings, readable main text, and URLs. Includes smart noise removal, Readability fallback, optional internal crawling, and structured output for SEO audits, AI datasets, research, and automation.",
        "version": "0.0",
        "x-build-id": "Pi8qvDoPSEI5dMHOF"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/techionik9993~website-content-miner/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-techionik9993-website-content-miner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/techionik9993~website-content-miner/runs": {
            "post": {
                "operationId": "runs-sync-techionik9993-website-content-miner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/techionik9993~website-content-miner/run-sync": {
            "post": {
                "operationId": "run-sync-techionik9993-website-content-miner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs to start with for scraping.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "crawlLinks": {
                        "title": "Crawl Links",
                        "type": "boolean",
                        "description": "If enabled, the scraper will follow links found on pages (up to the depth limit).",
                        "default": false
                    },
                    "maxEnqueueDepth": {
                        "title": "Max Enqueue Depth",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How deep to follow links. 0 = only start URLs, 1 = start URLs + their links, 2 = + links from those pages, etc.",
                        "default": 1
                    },
                    "sameDomainOnly": {
                        "title": "Same Domain Only",
                        "type": "boolean",
                        "description": "If enabled, only links from the same domain as the start URL(s) will be crawled.",
                        "default": true
                    },
                    "maxRequestsPerCrawl": {
                        "title": "Max Requests per Crawl",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of pages to scrape in one run.",
                        "default": 100
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
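
Without a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The following sketch builds the request with the Python standard library, following the spec's use of a `token` query parameter; the `build_run_request` helper is an assumption, and the example stops short of sending the request.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/techionik9993~website-content-miner/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{BASE_URL}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_run_request("<YOUR_API_TOKEN>", {"startUrls": [{"url": "https://apify.com"}]})
print(request.get_method(), request.full_url)
```

To execute it, pass `request` to `urllib.request.urlopen` and parse the response body as JSON. Long crawls may be better served by the asynchronous `/runs` path, since a synchronous call keeps the connection open until the run finishes.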