Pricing

$7.00/month + usage

Website Content Miner

Extract clean website content at scale: page titles, meta descriptions, H1-H3 headings, readable main text, and URLs. Includes smart noise removal, Readability fallback, optional internal crawling, and structured output for SEO audits, AI datasets, research, and automation.

Rating: 5.0 (1)

Developer: Techionik

Last modified: 20 hours ago

Website Content Miner

Extract clean, structured, and human-readable content from websites without writing custom selectors.

Website Content Miner is built for SEO audits, AI preprocessing, research, content analysis, website archiving, and automation workflows. It crawls standard HTML websites and returns organized page-level data including page titles, meta descriptions, headings, clean main text, and source URLs.

What This Actor Does

Website Content Miner helps you turn website pages into clean structured datasets.

It automatically:

Extracts page titles
Extracts meta descriptions
Extracts H1, H2, and H3 headings
Extracts readable main page text
Removes common website noise such as navigation menus, footers, cookie banners, modals, newsletter blocks, and social/share sections
Uses smart content detection with Mozilla Readability fallback
Optionally follows internal links with crawl depth control
Outputs clean dataset items ready for SEO, AI, research, or automation use

Best For

SEO content audits
Website content extraction
AI dataset preparation
LLM / RAG preprocessing
Competitor research
Content inventory creation
Website text archiving
Marketing and content analysis
Automation workflows using Apify, Make, n8n, Zapier, or custom APIs

Data Extracted

Each scraped page returns the following fields:

Field	Description
pageTitle	The page title, using Open Graph title or HTML title
metaDescription	The page meta description, using standard or Open Graph description
headings	Extracted H1, H2, and H3 headings
mainText	Clean readable page text with common noise removed
pageUrl	Final scraped page URL

Input Options

Start URLs

Add one or more website URLs to scrape.

Example:

https://example.com

Crawl Links

Enable this option if you want the Actor to follow links found on the provided pages.

Default: false

Max Enqueue Depth

Controls how many link levels the Actor follows, counted from the start URLs.

Examples:

0 = scrape only the provided start URLs
1 = scrape start URLs and links found on those pages
2 = also scrape links found on those linked pages

Default: 1
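
The depth values above can be illustrated with a small breadth-first traversal over an in-memory link graph. This is only a sketch of the semantics, not the Actor's code: the LINKS graph and the crawl_order function are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep"],
    "https://example.com/b": [],
    "https://example.com/a/deep": [],
}


def crawl_order(start_urls, max_depth):
    """Return the URLs a breadth-first crawl visits when limited to max_depth link levels."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not enqueue links from this page
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With a depth of 0 only the start URL is visited; each extra level adds the pages linked from the previous level.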

Same Domain Only

When enabled, the Actor only follows links from the same domain as the first start URL.

This is useful for keeping the crawl focused on one website.

Default: true
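
A same-domain rule like the one described above could be sketched as follows. The comparison here is an exact hostname match, which is an assumption: the Actor may treat subdomains or a "www" prefix differently. The function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_domain(link: str, first_start_url: str) -> bool:
    """True when link has exactly the same hostname as the first start URL (assumed rule)."""
    return urlparse(link).hostname == urlparse(first_start_url).hostname


def filter_links(page_url: str, hrefs: list[str], first_start_url: str) -> list[str]:
    """Resolve relative hrefs against the page URL and keep only same-domain links."""
    absolute = [urljoin(page_url, href) for href in hrefs]
    return [link for link in absolute if same_domain(link, first_start_url)]
```

Under this strict rule, a link to blog.example.com would be dropped when the first start URL is on example.com.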

Max Requests per Crawl

Sets the maximum number of pages processed in one run.

Default: 100

Output Example

{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [
    { "level": "h1", "text": "Example Domain" }
  ],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
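
As a rough illustration of working with an item in this shape, the sketch below parses a record and prints its headings as an indented outline. The second heading in the sample is added purely for illustration, and heading_outline is a hypothetical helper, not part of the Actor.

```python
import json

# Sample record in the shape shown above; the h2 entry is added for illustration only.
ITEM = json.loads("""
{
  "pageTitle": "Example Website",
  "metaDescription": "A sample website used for demonstration.",
  "headings": [
    {"level": "h1", "text": "Example Domain"},
    {"level": "h2", "text": "More information"}
  ],
  "mainText": "This domain is for use in illustrative examples in documents...",
  "pageUrl": "https://example.com"
}
""")


def heading_outline(item: dict) -> str:
    """Render headings as an outline, indenting two spaces per level below h1."""
    lines = []
    for heading in item["headings"]:
        level = int(heading["level"][1])  # "h2" -> 2
        lines.append("  " * (level - 1) + heading["text"])
    return "\n".join(lines)


print(heading_outline(ITEM))
```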

How It Works

Website Content Miner starts from the URLs you provide.
It loads each page using Crawlee and Cheerio.
It detects the main content area using common content selectors such as main, article, #content, .content, and similar structures.
It removes common noise elements like headers, navigation menus, footers, forms, scripts, cookie banners, modals, newsletter blocks, and social sharing sections.
It extracts titles, descriptions, headings, and readable text.
It uses Mozilla Readability first, then applies a stronger fallback strategy for pages where content is not structured like a standard article.
It saves each result to the Apify dataset.
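
The noise-removal and extraction steps above can be approximated with a short standard-library sketch. It does not use Cheerio or Mozilla Readability and is not the Actor's implementation: it only skips a few structural tags, whereas cookie banners, modals, and similar blocks would need class or id matching that is not shown here. PageExtractor, NOISE_TAGS, and extract are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed subset of the noise elements listed above.
NOISE_TAGS = {"header", "nav", "footer", "form", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}


class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.headings = []
        self.text_parts = []
        self._noise_depth = 0      # > 0 while inside any noise element
        self._in_title = False
        self._heading = None       # tag name of the heading currently open
        self._heading_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag in HEADING_TAGS and self._noise_depth == 0:
            self._heading, self._heading_text = tag, []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == self._heading:
            self.headings.append({"level": tag, "text": " ".join(self._heading_text)})
            self._heading = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._noise_depth == 0:
            self.text_parts.append(text)
            if self._heading:
                self._heading_text.append(text)


def extract(html: str, url: str) -> dict:
    """Return a record shaped like the Actor's output fields."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "pageTitle": parser.title,
        "metaDescription": parser.meta_description,
        "headings": parser.headings,
        "mainText": " ".join(parser.text_parts),
        "pageUrl": url,
    }
```

In this sketch heading text is also kept in mainText; whether the Actor does the same is not stated in the description.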

Key Features

Clean structured output
No custom selectors required
Smart main content detection
Noise removal for cleaner text
Optional internal link crawling
Same-domain crawling option
Crawl depth control
Request limit control
SEO and AI-ready dataset format
Simple input configuration
Easy integration through Apify API

Typical Use Cases

SEO Audits

Collect page titles, meta descriptions, headings, and page text from websites to review content structure and optimization quality.
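
A basic audit pass over the output items might look like the sketch below. The length thresholds (60 and 160 characters) are common rules of thumb assumed here, not values defined by the Actor, and audit_page is a hypothetical helper.

```python
def audit_page(item: dict, max_title_len: int = 60, max_description_len: int = 160) -> list[str]:
    """Return a list of simple on-page issues for one output item."""
    issues = []

    title = item.get("pageTitle") or ""
    if not title:
        issues.append("missing title")
    elif len(title) > max_title_len:
        issues.append("title too long")

    description = item.get("metaDescription") or ""
    if not description:
        issues.append("missing meta description")
    elif len(description) > max_description_len:
        issues.append("meta description too long")

    h1_count = sum(1 for h in item.get("headings", []) if h["level"] == "h1")
    if h1_count == 0:
        issues.append("missing h1")
    elif h1_count > 1:
        issues.append("multiple h1 headings")

    return issues
```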

AI and LLM Preprocessing

Prepare clean website text for AI workflows, embeddings, semantic search, RAG systems, and knowledge base creation.
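
For embedding or RAG pipelines, the mainText field usually needs to be split into chunks first. The following is a minimal word-based chunker with overlap; the chunk sizes are arbitrary assumptions, and chunk_text is a hypothetical helper rather than something the Actor provides.

```python
def chunk_text(text: str, chunk_words: int = 200, overlap_words: int = 40) -> list[str]:
    """Split text into chunks of chunk_words words, each overlapping the previous by overlap_words."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk can then be stored alongside the item's pageUrl so that retrieved passages can be traced back to their source page.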

Website Research

Extract readable content from multiple pages for competitor research, market research, or content analysis.

Content Inventory

Create a structured inventory of website pages, including titles, URLs, headings, and body text.
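
One way to turn the output items into an inventory is to flatten each page into a CSV row. This sketch assumes items in the documented shape; the chosen columns and the inventory_csv helper are illustrative only.

```python
import csv
import io


def inventory_csv(items: list[dict]) -> str:
    """Build a CSV inventory with one row per page: URL, title, first H1, and word count."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["pageUrl", "pageTitle", "h1", "wordCount"])
    for item in items:
        first_h1 = next(
            (h["text"] for h in item.get("headings", []) if h["level"] == "h1"), ""
        )
        word_count = len(item.get("mainText", "").split())
        writer.writerow([item["pageUrl"], item["pageTitle"], first_h1, word_count])
    return buffer.getvalue()
```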

Website Archiving

Save clean text versions of website pages for documentation, research, or long-term reference.

Automation Workflows

Use the output dataset in Apify integrations, Make, n8n, Zapier, Google Sheets, databases, or custom APIs.

Recommended Settings

For a Single Page

crawlLinks: false
maxRequestsPerCrawl: 1

For a Small Website Audit

crawlLinks: true
maxEnqueueDepth: 1
sameDomainOnly: true
maxRequestsPerCrawl: 50

For a Larger Website Crawl

crawlLinks: true
maxEnqueueDepth: 2
sameDomainOnly: true
maxRequestsPerCrawl: 100 or higher
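
The settings above could be assembled into an input object as in the sketch below. The option names crawlLinks, maxEnqueueDepth, sameDomainOnly, and maxRequestsPerCrawl come from this page, and the defaults match the Input Options section. The startUrls key and its list-of-objects shape are assumptions and should be checked against the Actor's input schema.

```python
import json


def build_input(
    start_urls: list[str],
    crawl_links: bool = False,
    max_enqueue_depth: int = 1,
    same_domain_only: bool = True,
    max_requests_per_crawl: int = 100,
) -> dict:
    """Build an Actor input dict using the documented option names and defaults."""
    return {
        # Assumed key name and shape; not confirmed by the Actor's description.
        "startUrls": [{"url": url} for url in start_urls],
        "crawlLinks": crawl_links,
        "maxEnqueueDepth": max_enqueue_depth,
        "sameDomainOnly": same_domain_only,
        "maxRequestsPerCrawl": max_requests_per_crawl,
    }


# Preset matching "For a Small Website Audit".
small_audit = build_input(["https://example.com"], crawl_links=True, max_requests_per_crawl=50)
print(json.dumps(small_audit, indent=2))
```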

Notes and Limitations

Best suited for static and semi-static HTML websites
Not designed for websites that require login
Not ideal for heavily JavaScript-rendered applications
Results depend on the quality and structure of the target website
For websites with strict anti-bot protection, proxy configuration may be required

Output Access

After the run finishes, you can access the scraped data from:

Apify Dataset
Dataset API
Overview table
JSON, CSV, Excel, XML, or RSS exports
Apify integrations and webhooks

Why Use Website Content Miner

Website Content Miner saves time by automatically extracting clean, structured website content without requiring custom scraping rules for every website.

It is useful for anyone who needs reliable page-level content data for SEO, AI, automation, research, reporting, or content intelligence workflows.

Technology

Built with:

Apify SDK
Crawlee
CheerioCrawler
Cheerio
Mozilla Readability

Status

Production-ready for general website content extraction.
