Website Content Miner

Pricing

$7.00/month + usage

Extract clean website content at scale: page titles, meta descriptions, H1-H3 headings, readable main text, and URLs. Includes smart noise removal, Readability fallback, optional internal crawling, and structured output for SEO audits, AI datasets, research, and automation.

Rating

5.0

(1)

Developer

Techionik

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 7
  • Monthly active users: 3
  • Last modified: 2 days ago

Website Content Miner

Extract clean, structured, and human-readable content from websites without writing custom selectors.

Website Content Miner is built for SEO audits, AI preprocessing, research, content analysis, website archiving, and automation workflows. It crawls standard HTML websites and returns organized page-level data including page titles, meta descriptions, headings, clean main text, and source URLs.

What This Actor Does

Website Content Miner helps you turn website pages into clean structured datasets.

It automatically:

  • Extracts page titles
  • Extracts meta descriptions
  • Extracts H1, H2, and H3 headings
  • Extracts readable main page text
  • Removes common website noise such as navigation menus, footers, cookie banners, modals, newsletter blocks, and social/share sections
  • Uses smart content detection with Mozilla Readability fallback
  • Optionally follows internal links with crawl depth control
  • Outputs clean dataset items ready for SEO, AI, research, or automation use

Best For

  • SEO content audits
  • Website content extraction
  • AI dataset preparation
  • LLM / RAG preprocessing
  • Competitor research
  • Content inventory creation
  • Website text archiving
  • Marketing and content analysis
  • Automation workflows using Apify, Make, n8n, Zapier, or custom APIs

Data Extracted

Each scraped page returns the following fields:

Field             Description
pageTitle         The page title, using Open Graph title or HTML title
metaDescription   The page meta description, using standard or Open Graph description
headings          Extracted H1, H2, and H3 headings
mainText          Clean readable page text with common noise removed
pageUrl           Final scraped page URL

Input Options

Start URLs

Add one or more website URLs to scrape.

Example:

https://example.com

Crawl Links

Enable this option if you want the Actor to follow links found on the provided pages.

Default: false

Max Enqueue Depth

Controls how deep the scraper should follow links.

Examples:

  • 0 = scrape only the provided start URLs
  • 1 = scrape start URLs and links found on those pages
  • 2 = also scrape links found on the pages reached at depth 1

Default: 1
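
The depth values above can be illustrated with a small breadth-first traversal over an in-memory link graph. This is only a sketch of the general technique, not the Actor's code: the `LINKS` graph and the `crawl_order` function are hypothetical, and the request cap stands in for the Max Requests per Crawl option.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post-1"],
    "/team": [],
    "/blog/post-1": [],
}


def crawl_order(start_urls, max_depth, max_requests=100):
    """Return the pages visited by a breadth-first crawl limited by depth and request count."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue and len(visited) < max_requests:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # links found on this page are beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


depth_0 = crawl_order(["/"], max_depth=0)  # only the start URL
depth_1 = crawl_order(["/"], max_depth=1)  # start URL plus its direct links
depth_2 = crawl_order(["/"], max_depth=2)  # one level further
```

In this toy graph, depth 0 visits one page, depth 1 visits three, and depth 2 visits all five.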

Same Domain Only

When enabled, the Actor only follows links from the same domain as the first start URL.

This is useful for keeping the crawl focused on one website.

Default: true
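
A same-domain filter of this kind might look like the sketch below. It assumes an exact hostname match against the first start URL. How the Actor actually treats subdomains or `www.` prefixes is not stated here, and `same_domain_links` is a hypothetical helper.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(first_start_url, page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the first start URL's host."""
    allowed_host = urlparse(first_start_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # turn relative links into absolute URLs
        parsed = urlparse(absolute)
        # Assumption: only http(s) links with an identical hostname are followed.
        if parsed.scheme in ("http", "https") and parsed.hostname == allowed_host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://example.com",
    "https://example.com/blog/",
    ["/about", "post-1", "https://other.example.org/x", "mailto:hi@example.com"],
)
```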

Max Requests per Crawl

Sets the maximum number of pages processed in one run.

Default: 100
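
The options above can be assembled into a single input object. The sketch below uses the option keys that appear in this document (`crawlLinks`, `maxEnqueueDepth`, `sameDomainOnly`, `maxRequestsPerCrawl`) with their documented defaults. The `startUrls` key and its list-of-objects shape are assumptions based on common Apify conventions, and `build_input` is a hypothetical helper.

```python
import json


def build_input(start_urls, crawl_links=False, max_enqueue_depth=1,
                same_domain_only=True, max_requests_per_crawl=100):
    """Build an Actor input dict using the defaults listed in this document."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_enqueue_depth < 0:
        raise ValueError("max_enqueue_depth must be 0 or greater")
    if max_requests_per_crawl < 1:
        raise ValueError("max_requests_per_crawl must be at least 1")
    return {
        # Assumption: the key name and {"url": ...} shape are not confirmed by the source.
        "startUrls": [{"url": url} for url in start_urls],
        "crawlLinks": crawl_links,
        "maxEnqueueDepth": max_enqueue_depth,
        "sameDomainOnly": same_domain_only,
        "maxRequestsPerCrawl": max_requests_per_crawl,
    }


actor_input = build_input(["https://example.com"], crawl_links=True,
                          max_requests_per_crawl=50)
actor_input_json = json.dumps(actor_input, indent=2)
```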

Output Example

{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [
    { "level": "h1", "text": "Example Domain" }
  ],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
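
A dataset item shaped like the example above can be consumed with a few lines of code. The sketch below parses the sample item and turns its `headings` list into an indented outline. `heading_outline` is a hypothetical helper, and it assumes every `level` value has the form `h1`, `h2`, or `h3`, as in the example.

```python
import json

SAMPLE_ITEM = json.loads("""
{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [{"level": "h1", "text": "Example Domain"}],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
""")


def heading_outline(item):
    """Return one line per heading, indented by two spaces per heading level below h1."""
    lines = []
    for heading in item["headings"]:
        level = int(heading["level"][1])  # "h2" -> 2
        lines.append("  " * (level - 1) + heading["text"])
    return lines


outline = heading_outline(SAMPLE_ITEM)
```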

How It Works

  1. Website Content Miner starts from the URLs you provide.
  2. It loads each page using Crawlee and Cheerio.
  3. It detects the main content area using common content selectors such as main, article, #content, .content, and similar structures.
  4. It removes common noise elements like headers, navigation menus, footers, forms, scripts, cookie banners, modals, newsletter blocks, and social sharing sections.
  5. It extracts titles, descriptions, headings, and readable text.
  6. It uses Mozilla Readability first, then applies a stronger fallback strategy for pages where content is not structured like a standard article.
  7. It saves each result to the Apify dataset.
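
The extraction steps above can be approximated with a simplified stand-in. The Actor itself uses Crawlee, Cheerio, and Mozilla Readability. The sketch below uses only Python's standard `html.parser`, removes noise by tag name only (it does not detect cookie banners, modals, or newsletter blocks), and does not reproduce Readability. `ContentSketch` and `extract` are hypothetical names.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"header", "nav", "footer", "form", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}


class ContentSketch(HTMLParser):
    """Collects title, meta description, headings, and body text outside noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0          # > 0 while inside any noise element
        self.in_title = False
        self.current_heading = None
        self.title = ""
        self.meta_description = ""
        self.headings = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content") or ""
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag in HEADING_TAGS and self.noise_depth == 0:
            self.current_heading = {"level": tag, "text": ""}

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "title":
            self.in_title = False
        elif tag in HEADING_TAGS and self.current_heading is not None:
            self.current_heading["text"] = self.current_heading["text"].strip()
            self.headings.append(self.current_heading)
            self.current_heading = None

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.noise_depth > 0:
            return  # text inside nav, footer, script, and similar is dropped
        elif self.current_heading is not None:
            self.current_heading["text"] += data
        elif data.strip():
            # In this sketch, heading text is kept separate from body text.
            self.text_parts.append(data.strip())


def extract(html, url):
    """Return a dict shaped like the Actor's documented output fields."""
    parser = ContentSketch()
    parser.feed(html)
    parser.close()
    return {
        "pageTitle": parser.title.strip(),
        "metaDescription": parser.meta_description,
        "headings": parser.headings,
        "mainText": " ".join(parser.text_parts),
        "pageUrl": url,
    }


SAMPLE_HTML = """<html><head><title>Example Website</title>
<meta name="description" content="A sample website used for demonstration.">
</head><body>
<nav><a href="/">Home</a></nav>
<main><h1>Example Domain</h1><p>This domain is for use in examples.</p></main>
<footer>Copyright notice</footer>
<script>console.log("ignored");</script>
</body></html>"""

item = extract(SAMPLE_HTML, "https://example.com")
```

The navigation link, footer text, and script body do not appear in the resulting `mainText`.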

Key Features

  • Clean structured output
  • No custom selectors required
  • Smart main content detection
  • Noise removal for cleaner text
  • Optional internal link crawling
  • Same-domain crawling option
  • Crawl depth control
  • Request limit control
  • SEO and AI-ready dataset format
  • Simple input configuration
  • Easy integration through Apify API

Typical Use Cases

SEO Audits

Collect page titles, meta descriptions, headings, and page text from websites to review content structure and optimization quality.
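
As a rough illustration, a basic audit pass over dataset items could flag missing or overlong fields. The length limits below (60 characters for titles, 160 for descriptions) are common rules of thumb chosen for this sketch. They are assumptions, not values used by the Actor, and `audit_item` is a hypothetical helper.

```python
def audit_item(item, title_max=60, description_max=160):
    """Return a list of simple SEO issues found in one dataset item."""
    issues = []
    title = item.get("pageTitle", "")
    description = item.get("metaDescription", "")
    h1_count = sum(1 for h in item.get("headings", []) if h.get("level") == "h1")

    if not title:
        issues.append("missing title")
    elif len(title) > title_max:
        issues.append("title too long")

    if not description:
        issues.append("missing meta description")
    elif len(description) > description_max:
        issues.append("meta description too long")

    if h1_count == 0:
        issues.append("no h1")
    elif h1_count > 1:
        issues.append("multiple h1")
    return issues


clean_page = {
    "pageTitle": "Example Website",
    "metaDescription": "A sample website used for demonstration.",
    "headings": [{"level": "h1", "text": "Example Domain"}],
}
problem_page = {"pageTitle": "", "metaDescription": "", "headings": []}

clean_issues = audit_item(clean_page)
problem_issues = audit_item(problem_page)
```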

AI and LLM Preprocessing

Prepare clean website text for AI workflows, embeddings, semantic search, RAG systems, and knowledge base creation.
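
For embedding or RAG pipelines, `mainText` is usually split into overlapping chunks before indexing. The sketch below shows one simple word-based approach. The chunk size and overlap are arbitrary illustration values, and `chunk_text` is a hypothetical helper, not something the Actor provides.

```python
def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample_text, max_words=4, overlap=1)
```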

Website Research

Extract readable content from multiple pages for competitor research, market research, or content analysis.

Content Inventory

Create a structured inventory of website pages, including titles, URLs, headings, and body text.
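
One possible way to build such an inventory from dataset items is to flatten each item into a CSV row. The column selection and the `inventory_csv` helper below are illustrative choices, not part of the Actor's output.

```python
import csv
import io


def inventory_csv(items):
    """Flatten dataset items into CSV text with one row per page."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["pageUrl", "pageTitle", "h1", "headingCount", "wordCount"])
    for item in items:
        headings = item.get("headings", [])
        # Use the first H1 as the page's main heading, or an empty string if none exists.
        first_h1 = next((h["text"] for h in headings if h["level"] == "h1"), "")
        word_count = len(item.get("mainText", "").split())
        writer.writerow([item["pageUrl"], item["pageTitle"], first_h1,
                         len(headings), word_count])
    return buffer.getvalue()


inventory = inventory_csv([
    {
        "pageTitle": "Example Website",
        "headings": [{"level": "h1", "text": "Example Domain"}],
        "mainText": "one two three",
        "pageUrl": "https://example.com",
    }
])
```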

Website Archiving

Save clean text versions of website pages for documentation, research, or long-term reference.

Automation Workflows

Use the output dataset in Apify integrations, Make, n8n, Zapier, Google Sheets, databases, or custom APIs.

Recommended Settings

For a Single Page

  • crawlLinks: false
  • maxRequestsPerCrawl: 1

For a Small Website Audit

  • crawlLinks: true
  • maxEnqueueDepth: 1
  • sameDomainOnly: true
  • maxRequestsPerCrawl: 50

For a Larger Website Crawl

  • crawlLinks: true
  • maxEnqueueDepth: 2
  • sameDomainOnly: true
  • maxRequestsPerCrawl: 100 or higher

Notes and Limitations

  • Best suited for static and semi-static HTML websites
  • Not designed for websites that require login
  • Not ideal for heavily JavaScript-rendered applications
  • Results depend on the quality and structure of the target website
  • For websites with strict anti-bot protection, proxy configuration may be required

Output Access

After the run finishes, you can access the scraped data from:

  • Apify Dataset
  • Dataset API
  • Overview table
  • JSON, CSV, Excel, XML, or RSS exports
  • Apify integrations and webhooks
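
For the Dataset API route, items are typically fetched from the Apify API v2 dataset items endpoint with a `format` query parameter. The sketch below only builds the request URL. The dataset ID is a placeholder, `dataset_items_url` is a hypothetical helper, and private datasets would additionally need an API token, which is not shown.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Format values matching the exports listed above (Excel is requested as "xlsx").
EXPORT_FORMATS = {"json", "csv", "xlsx", "xml", "rss"}


def dataset_items_url(dataset_id, export_format="json", limit=None):
    """Build the URL for downloading dataset items in the given export format."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {export_format}")
    params = {"format": export_format}
    if limit is not None:
        params["limit"] = limit
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


# "DATASET_ID" is a placeholder for the ID shown on the run's storage tab.
csv_url = dataset_items_url("DATASET_ID", "csv", limit=10)
```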

Why Use Website Content Miner

Website Content Miner saves time by automatically extracting clean, structured website content without requiring custom scraping rules for every website.

It is useful for anyone who needs reliable page-level content data for SEO, AI, automation, research, reporting, or content intelligence workflows.

Technology

Built with:

  • Apify SDK
  • Crawlee
  • CheerioCrawler
  • Cheerio
  • Mozilla Readability

Status

Production-ready for general website content extraction.